Curator: Single-Turn Context Injection for AI Agents
TLDR: I built a context retrieval system that reduces the cost of running Claude Code (and other AI agents) by as much as 65%, by injecting the right context before the agent starts working.
AI agents like Claude Code can do impressive things when given enough context, but they’re inefficient at finding it. Instead of reading the exact 100 lines it needs across 5 files, an agent reads thousands of lines searching for that information. Every unnecessary token costs money and dilutes output quality. The typical approach to this problem goes one of two ways: dump everything into the prompt and pay for tokens the model doesn’t need, or leave context out and let the agent fumble through file reads and tool calls trying to piece it together. Both are expensive, and both produce worse output than an agent that starts with exactly the right context.
I built Curator to solve this. It’s a context retrieval system that sits upstream of your AI agent, figures out exactly what knowledge the agent needs for a given task, and injects it before the agent starts working.
Why Claude Code with non-code content?
A reasonable question: if this is a Claude Code optimization, why test it on marketing and biotech content instead of code? A few reasons.
First, Claude Code is the only Claude interface that supports hooks, which is how Curator injects context before the agent starts. Claude UI and Cowork don’t have this capability yet. Future tests will cover Claude UI via MCP and static generated context index files, as well as other systems like OpenAI and Gemini.
Second, code context has been the market’s primary focus. Non-code context has different and arguably harder challenges: it doesn’t naturally belong in a codebase, it’s difficult to quantify, and it’s impossible to validate from the outside without deep domain knowledge. If context injection works on messy, ambiguous business content, it will work on code.
Third, this is how real startups actually use AI agents. They’re not just writing code. They’re generating marketing copy, analyzing clinical data, and synthesizing business documents. The non-code use case is where context retrieval matters most and where agents waste the most effort exploring.
The numbers
I benchmarked Curator against three baselines on a marketing content corpus (14 files, 69 segments). Each condition was run 10 times (N=10); the tables below report the mean across runs.
Competitive analysis email
| Condition | N | Cost | Reads | Turns | Duration |
|---|---|---|---|---|---|
| No retrieval (control) | 10 | $0.155 | 5.8 | 6.8 | 37s |
| Naive RAG | 10 | $0.171 | 5.3 | 6.3 | 41s |
| Full context dump | 10 | $0.121 | 2.1 | 2.9 | 33s |
| HyPE (Curator) | 10 | $0.065 | 0.0 | 1.0 | 28s |
Campaign plan (multi-document planning task)
| Condition | N | Cost | Reads | Turns | Duration |
|---|---|---|---|---|---|
| No retrieval (control) | 10 | $0.253 | 11.0 | 9.4 | 77s |
| Naive RAG | 10 | $0.243 | 8.3 | 9.3 | 67s |
| Full context dump | 10 | $0.081 | 0.0 | 1.0 | 43s |
| HyPE (Curator) | 10 | $0.077 | 0.0 | 1.0 | 40s |
The campaign task is the most demanding - it requires pulling from audience personas, content calendars, pricing, product roadmap, and campaign briefs. The control condition burns 11 file reads and 9.4 turns. Curator reduces this to a single turn with zero reads, cutting cost by 69.6% (from $0.253 to $0.077).
Across both tasks, the same patterns hold:
Naive RAG is no better than no retrieval. It costs more on the competitive task ($0.171 vs. $0.155) and barely helps on the campaign task (-4%). It injects the wrong context, and Claude falls back to reading files anyway.
Full context dump helps, but Curator matches or beats it. On the competitive task, full dump is inconsistent: per-run cost ranged from $0.06 to $0.20. On the campaign task, full dump and Curator are nearly tied - but full dump injects all 69 segments while Curator injects only what’s essential. The difference matters at scale.
Curator wins because it’s selective. Every run completes in 1 turn with 0 reads. Cost is consistent. The agent doesn’t explore because the context is actually complete for the task.
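The percentages quoted in this section can be rechecked directly from the two tables. A short sketch, using only the mean costs shown above (the dictionary keys are just labels chosen for this example):

```python
# Mean cost per run in USD, copied from the two marketing tables above.
COSTS = {
    "competitive_email": {
        "control": 0.155, "naive_rag": 0.171, "full_dump": 0.121, "curator": 0.065,
    },
    "campaign_plan": {
        "control": 0.253, "naive_rag": 0.243, "full_dump": 0.081, "curator": 0.077,
    },
}


def savings_vs_control(task: str, condition: str) -> float:
    """Percentage cost reduction relative to the no-retrieval control.

    A negative result means the condition cost more than the control.
    """
    control = COSTS[task]["control"]
    return (control - COSTS[task][condition]) / control * 100


for task in COSTS:
    for condition in ("naive_rag", "full_dump", "curator"):
        pct = savings_vs_control(task, condition)
        print(f"{task:18s} {condition:10s} {pct:+6.1f}%")
```

Because the table values are already rounded, results computed this way can differ from the underlying measurements in the last decimal place.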
An obvious question: did I just tell Claude to stop reading files, and it stopped? Not quite. The instruction to treat injected context as sufficient only works when the context actually is sufficient. In earlier iterations with worse retrieval, Claude ignored the framing and searched for files anyway. And when I tried softer framing - “read additional files only if needed” - Claude treated it as an invitation to explore regardless. The single-turn behavior only emerged when the retrieved context was good enough that Claude genuinely didn’t need anything else. Quality evaluations confirmed the outputs were actually better with Curator’s context injection, not worse - the agent wasn’t just doing less work, it was doing better work with better inputs.
Cross-domain validation
To confirm these results aren’t domain-specific, I built a second corpus in biotech clinical operations: 14 documents covering a fictional drug trial (MRD-401), including protocol versions, safety reports, interim analyses, and site feasibility assessments. 48 segments. The corpus was deliberately designed with contradictions - a safety report claiming “no new safety signals” contradicted by a later report with a liver signal under evaluation, protocol versions that partially supersede each other, budget figures that don’t match across documents.
| Condition | N | Cost | Reads | Turns | Duration |
|---|---|---|---|---|---|
| No retrieval (control) | 10 | $0.191 | 2.2 | 4.2 | 30s |
| Naive RAG | 10 | $0.159 | 2.7 | 4.7 | 32s |
| Full context dump | 10 | $0.135 | 2.2 | 4.2 | 28s |
| HyPE (Curator) | 10 | $0.068 | 0.0 | 1.0 | 23s |
Same pattern. Curator is actually stronger on clinical content (-64.5%) than on the marketing competitive analysis email (-58%), likely because the clinical corpus has more complex interdependencies between documents, meaning the control condition wastes more effort searching.
Task type still drives cost more than domain complexity. Focused synthesis tasks complete in 1 turn. Formal document generation (IND submissions, blog posts) triggers multi-turn exploration regardless of what you inject.
How it works
The core retrieval technique is called HyPE - Hypothetical Prompt Embeddings. It’s a twist on HyDE (Hypothetical Document Embeddings): where HyDE generates a hypothetical answer to the query and searches with that, HyPE generates hypothetical tasks for each piece of content and matches incoming prompts against those. I arrived at this approach independently while building Curator, then found existing research describing the same core idea. The technique itself isn’t new. What’s new is the applied system I built around it.
The problem with standard retrieval
Standard RAG embeds the user’s query, searches a vector database for similar content, and returns the top results. This works for document search, where the user is looking for content that resembles their question.
But agent context retrieval is a different problem. When someone asks an AI agent to “write a competitive analysis email,” the content they need - brand voice guidelines, product positioning docs, competitor pricing - looks nothing like the query. There’s a semantic gap between what the user is asking and what the agent needs.
With standard RAG, the next step would be adding a reranker to improve the relevance of retrieved results. I tried Jina reranker-v2 as a post-retrieval reranking step. It made things worse - degrading retrieval quality on 5 out of 6 test tasks. Generic rerankers optimize for query-document similarity, which is the wrong objective when the query is a task description and the documents are reference material.
HyPE: closing the semantic gap
HyPE closes the gap at index time. For each segment of content, I generate 10 hypothetical task descriptions - prompts a user might write where this content would be essential.
Take a segment containing brand voice guidelines. Standard embedding places it near other content about brand voice. HyPE generates prompts like “write a marketing email for our spring campaign,” “draft social copy for a product launch,” “create a competitive positioning one-pager” - the actual tasks where those guidelines matter. Each hypothetical prompt gets embedded alongside the segment.
At runtime, the user’s real prompt gets embedded and compared against all hypothetical prompt embeddings. The brand voice segment now matches “write a competitive analysis email” because one of its hypothetical prompts is semantically close. No LLM call at retrieval time - just vector search.
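A minimal sketch of that lookup, under loud assumptions: `embed` here is a toy bag-of-words stand-in for a real embedding model, the hypothetical prompts are hand-written rather than LLM-generated, and the segment names (`brand_voice`, `expense_policy`) and the second segment's prompts are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Index time: every segment is stored with prompts it would be useful for.
INDEX = {
    "brand_voice": [
        "write a marketing email for our spring campaign",
        "draft social copy for a product launch",
        "create a competitive positioning one-pager",
    ],
    "expense_policy": [  # hypothetical unrelated segment
        "submit a travel expense report",
        "book a flight for the sales offsite",
    ],
}
PROMPT_VECTORS = [
    (segment, prompt, embed(prompt))
    for segment, prompts in INDEX.items()
    for prompt in prompts
]


def retrieve(query: str, top_k: int = 3):
    """Runtime: embed the real prompt and rank all hypothetical prompts."""
    q = embed(query)
    scored = [(cosine(q, vec), seg, prompt) for seg, prompt, vec in PROMPT_VECTORS]
    return sorted(scored, reverse=True)[:top_k]


for score, segment, prompt in retrieve("write a competitive analysis email"):
    print(f"{score:.3f}  {segment:15s} {prompt}")
```

The query never mentions brand voice, yet the `brand_voice` segment ranks first because the comparison is prompt-to-prompt rather than prompt-to-content. A real embedding model would capture far more than word overlap; the toy version only shows the shape of the lookup.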
Scoring
Each segment can have up to 10 hypothetical prompt embeddings. The composite score is:
score = ln(match_count + 1) × avg_similarity
The scoring formula rewards segments that match multiple tasks, but with diminishing returns - matching two tasks is better than one, but matching ten isn’t ten times better. Segments scoring above 0.75 are classified as essential and injected into the agent’s context.
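The formula is easy to try out. The sketch below makes two assumptions the text does not spell out: a hypothetical prompt counts as a "match" when its similarity clears a cutoff (`MATCH_CUTOFF`, an invented value), and `avg_similarity` is averaged over matched prompts only.

```python
import math

ESSENTIAL_THRESHOLD = 0.75  # from the text above
MATCH_CUTOFF = 0.6          # hypothetical: minimum similarity to count as a match


def composite_score(similarities, match_cutoff=MATCH_CUTOFF):
    """score = ln(match_count + 1) * avg_similarity, over matched prompts only."""
    matches = [s for s in similarities if s >= match_cutoff]
    if not matches:
        return 0.0
    avg_similarity = sum(matches) / len(matches)
    return math.log(len(matches) + 1) * avg_similarity


def is_essential(similarities):
    """A segment is injected when its composite score exceeds the threshold."""
    return composite_score(similarities) > ESSENTIAL_THRESHOLD


one = composite_score([0.8])
two = composite_score([0.8, 0.8])
ten = composite_score([0.8] * 10)
print(f"1 match:    {one:.3f}")
print(f"2 matches:  {two:.3f}")
print(f"10 matches: {ten:.3f} ({ten / one:.2f}x the single-match score)")
```

One consequence worth noticing: if similarity is bounded by 1 (as with cosine similarity), a single match can score at most ln 2 ≈ 0.693, so under this reading a segment needs at least two matching prompts to clear the 0.75 threshold.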
After targeted retraining - regenerating hypothetical prompts for underperforming segments - essential recall hit 100% across all tasks (up from ~70-80% on initial cold-start).
The breakthrough: less is more
The biggest performance jump didn’t come from retrieval. It came from how I framed the injected context.
In earlier versions, I had two tiers: essential content (injected in full) and supporting content (listed as “available if needed”). Savings were around 24%. Then I removed the supporting tier entirely - just essential content, with framing that told the agent “you have everything you need.” Savings jumped to 51%.
The reason is behavioral. When you tell Claude “here’s some context, and here’s more if you need it,” it treats the supporting tier as an invitation to explore. The cost driver in agent workflows isn’t the size of the initial context injection - it’s the number of turns. Each additional turn re-sends the full conversation.
Injecting 3,000 tokens of essential context once costs about $0.002. That same content, discovered across 5 file reads over 5 turns, costs roughly $0.03, about 15x more.
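The re-sending effect can be illustrated by counting billed input tokens. This is a simplified sketch with made-up token counts: it ignores output tokens, tool-call overhead, system prompts, and prompt caching, so it shows the shape of the growth rather than reproducing the dollar figures above.

```python
def billed_input_tokens(base_tokens: int, tokens_added_per_turn: list[int]) -> int:
    """Total input tokens billed when every turn re-sends the whole conversation.

    base_tokens: tokens already in the conversation on the first turn.
    tokens_added_per_turn: tokens appended after each turn (for example the
        result of a file read); one entry per turn, 0 for the final turn.
    """
    total = 0
    context = base_tokens
    for added in tokens_added_per_turn:
        total += context   # this turn sends everything accumulated so far
        context += added   # new material grows the context for the next turn
    return total


# Hypothetical numbers: a 500-token task prompt and 3,000 tokens of knowledge.
injected = billed_input_tokens(500 + 3_000, [0])         # one turn, context up front
explored = billed_input_tokens(500, [600] * 5 + [0])     # 5 reads, then the answer

print(f"injected once:  {injected} input tokens")
print(f"explored:       {explored} input tokens ({explored / injected:.1f}x)")
```

Both runs end with the same 3,500 tokens in context, but the exploring agent pays for the growing conversation on every turn. Any fixed per-turn overhead widens the gap further.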
Architecture
The system is a three-step pipeline:
1. Ingest. Each file gets segmented by an LLM that identifies semantic boundaries - not arbitrary chunk sizes, but meaningful sections. One LLM call per file.
2. Generate use cases. For each segment, an LLM generates 10 hypothetical prompts plus constraints. Each prompt gets embedded with Google’s Gemini embedding model (768 dimensions). About $0.0075 per segment, but only runs at index time.
3. Retrieve. At runtime, the user’s prompt gets embedded (one API call, ~$0.0001) and compared against all hypothetical prompt embeddings via pgvector in Supabase. No LLM calls. Retrieval latency is around 700ms.
The runtime cost is essentially zero. A 60-segment corpus costs about $0.76 to fully index, and retrieval is a rounding error after that.
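To show where the retrieve step could plug into Claude Code, here is a minimal sketch of a hook handler. It assumes a `UserPromptSubmit`-style hook, which receives a JSON payload containing the submitted `prompt` and whose standard output is added to the agent's context; whether Curator uses this exact hook is an assumption. `retrieve_essential`, its canned return values, and the exact `FRAMING` wording are hypothetical placeholders.

```python
import json

# Hypothetical wording, based on the "you have everything you need" framing.
FRAMING = "You have everything you need for this task in the context below."


def retrieve_essential(prompt: str) -> list[str]:
    """Hypothetical stand-in for the vector search; returns essential segment texts."""
    if "competitive" in prompt.lower():
        return [
            "[segment] Competitor pricing notes (placeholder text)",
            "[segment] Brand voice guidelines (placeholder text)",
        ]
    return []


def build_injection(hook_payload: str, retrieve=retrieve_essential) -> str:
    """Turn the hook's JSON payload into text to inject (empty if nothing matched)."""
    event = json.loads(hook_payload)
    segments = retrieve(event.get("prompt", ""))
    if not segments:
        return ""
    return "\n\n".join([FRAMING, *segments])


# Stand-in for the payload a hook would read from standard input.
sample_payload = json.dumps({
    "hook_event_name": "UserPromptSubmit",
    "prompt": "Write a competitive analysis email",
})
print(build_injection(sample_payload))
```

In an actual hook script the payload would be read from standard input and the result written to standard output. Returning an empty string when nothing matches leaves the agent's default behavior untouched.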
What I learned
Turns are the dominant cost driver. Not tokens, not context size - turns. Every time an agent takes another action, the full conversation gets re-sent. Reducing turns from 8 to 3 is worth more than any token-level optimization.
“If needed” is expensive framing. Giving an AI agent optional context is like giving a developer optional documentation - they’ll read all of it. If you know what’s essential, inject it and frame it as complete.
Targeted retraining is cheap and effective. When a specific task underperforms, you can regenerate hypothetical prompts for just the underperforming segments with task-specific hints. Email retrieval went from 33% to 100% essential recall by retraining 4 segments. Zero runtime cost increase.
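The retraining loop needs two things: an essential-recall number per task and the list of segments that were missed. A sketch of both, assuming recall is the share of a task's essential segments that retrieval returned. The segment IDs are invented, and the counts are chosen only to echo the email example above, not taken from the actual evaluation data.

```python
def essential_recall(retrieved: set[str], essential: set[str]) -> float:
    """Share of the task's essential segments that retrieval actually returned."""
    if not essential:
        return 1.0
    return len(retrieved & essential) / len(essential)


def segments_to_retrain(retrieved: set[str], essential: set[str]) -> list[str]:
    """Essential segments that were missed; candidates for prompt regeneration."""
    return sorted(essential - retrieved)


# Hypothetical labels for one task: six essential segments, two retrieved.
essential = {"voice-01", "voice-02", "pricing-03", "persona-01", "product-04", "competitor-02"}
retrieved = {"voice-01", "pricing-03", "calendar-07"}

print(f"essential recall: {essential_recall(retrieved, essential):.0%}")
print("retrain:", segments_to_retrain(retrieved, essential))
```

Only the missed segments get new hypothetical prompts, which is why the loop stays cheap: the rest of the index is left alone.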
Limitations
Long-form generation resists injected context. Curator saves ~60% on focused tasks like emails and competitive analysis, but only ~17% on blog posts. When Claude perceives a task as requiring “research,” it reads source files regardless of what you inject. I haven’t solved this yet.
Cold-start retrieval isn’t perfect. The initial 10 hypothetical prompts per segment get you to ~70-80% essential recall. Hitting 100% requires targeted retraining. The retraining loop is cheap (minutes of work, fractions of a cent), but it’s not zero.
Longer context windows don’t eliminate this problem. 1M+ token context windows won’t make retrieval irrelevant, because each additional turn re-sends the entire conversation. An agent that takes 7 turns to explore files pays for the full context 7 times over. Curator’s savings come from reducing turns to 1, which matters regardless of context window size.
The corpus is small. 14 files, 62 segments, 620 hypothetical prompt embeddings. At 1,000+ segments, the embedding space could get noisy enough that precision degrades. I expect HyPE’s advantage to grow with corpus size (the semantic gap problem gets worse with more content), but I haven’t proven that yet.
Why this matters
The AI industry is scaling toward agents that take autonomous actions. The standard approach is to give these agents access to everything and let them figure it out. That works, but it’s slow and expensive.
Context retrieval that actually understands the relationship between tasks and knowledge can make these agents dramatically more efficient. Not by making the models smarter, but by making sure they start with the right information.
If you’re looking to get more out of your existing AI tools or struggling with AI context problems, I’d love to connect. You can find me on LinkedIn or reach me at me@patrickjohnkelly.com.