Assignment: Context Engineering#
Assignment Metadata#
Field |
Description |
|---|---|
Assignment Name |
Context Engineering for Production RAG |
Course |
AI Integration |
Project Name |
|
Estimated Time |
120 minutes |
Framework |
Python 3.11+, LangChain 1.x, OpenAI API, tiktoken |
Learning Objectives#
By completing this assignment, you will be able to:
Design a token budget for a multi-component context window
Implement a context assembly pipeline with retrieval and tool outputs
Build a conversation history manager with summarization-based truncation
Optimize context layout for caching and grounding
Measure how context allocation affects output quality
Problem Description#
You are building a technical support agent that needs to assemble context from multiple sources — retrieved documentation, customer profiles, ticket history, and conversation history — within a fixed token budget. Your task is to implement the context assembly pipeline and measure how different budget allocations affect response quality.
Technical Requirements#
Environment Setup#
Python 3.11 or higher
Required packages:
langchain[openai]>= 1.0tiktoken>= 0.7.0
Dataset#
Prepare a test set with:
At least 10 support questions spanning different categories
A mock documentation corpus (at least 20 chunks)
Mock customer profiles and ticket histories
Tasks#
Task 1: Token Budget Design (20 points)#
Define a token budget for a 128k context window:
Allocate tokens across: system prompt, few-shot examples, retrieved docs, tool outputs, conversation history, and response reserve
Justify your allocation decisions
Implement a budget tracker that:
Counts tokens per component using tiktoken
Warns when a component exceeds its allocation
Reports total utilization as a percentage
Test your budget with at least 3 scenarios: minimal context, typical context, and maximum context
Task 2: Dynamic Context Assembly (25 points)#
Build an assembly pipeline that:
Retrieves relevant documentation chunks (top-5 by relevance)
Fetches customer profile data
Injects recent ticket history
Combines all components into a single prompt
Implement relevance filtering that:
Scores retrieved documents by relevance
Drops documents below a similarity threshold
Re-ranks remaining documents (most relevant first and last)
Add source attribution so the model can cite
[Source N]in its responses
Task 3: Conversation History Management (25 points)#
Implement a history manager that:
Tracks token usage for conversation history
Triggers summarization when history exceeds its budget
Always preserves the system message and last 3 exchanges
Compare two truncation strategies:
Simple drop: remove oldest messages
Summarize and compress: use an LLM to summarize old messages
Measure the impact on response coherence across a 10-turn conversation
Task 4: Caching and Grounding (30 points)#
Restructure your context for cache efficiency:
Identify which components are stable vs. dynamic
Place stable components first in the context
Measure cache hit rate across 10 sequential requests
Implement grounding checks:
Add instructions that require source citation
Verify that responses reference provided context
Measure faithfulness (% of claims supported by context)
Run an experiment varying retrieved context budget from 2,000 to 20,000 tokens:
Measure response quality (LLM-as-judge, 1–5 scale)
Measure faithfulness score
Plot the results and identify the optimal allocation
Submission Requirements#
Required Deliverables#
Source code (Python scripts or Jupyter notebook)
README.mdwith setup and usage instructionsToken budget design document with justifications
Experiment results with charts
Analysis report (findings, recommendations)
Submission Checklist#
All code runs without errors
Token budget tracking works correctly
Context assembly produces valid prompts
History management handles overflow gracefully
Experiment results are reproducible
Evaluation Criteria#
Criteria |
Points |
|---|---|
Token budget design & implementation |
20 |
Dynamic context assembly |
25 |
Conversation history management |
25 |
Caching and grounding experiments |
30 |
Total |
100 |
Hints#
Tip
Use
tiktoken.encoding_for_model("gpt-4o")for accurate token countingStart with a simple budget and iterate — don’t over-optimize on the first pass
For the grounding experiment, use a simple LLM-as-judge prompt to score faithfulness
The lost-in-the-middle effect is real — test with most-relevant-first vs. random ordering
Log the actual tokens per component for each request so you can debug budget violations