Patrick John Kelly

Curator: Single-Turn Context Injection for AI Agents

TLDR: Most AI agents waste over half their cost exploring for context across multiple turns. I built a context injection system that reduces most tasks to a single turn - cutting costs by 58% on focused tasks, with measurable savings across every task type tested.


Most AI agent platforms have a context problem. They either dump everything into the prompt and pay for tokens the model doesn’t need, or they leave context out and watch the agent fumble through file reads and tool calls trying to find what it needs. Both approaches are expensive and slow.

I built Curator to solve this. It’s a context retrieval system that sits upstream of your AI agent, figures out exactly what knowledge the agent needs for a given task, and injects it before the agent starts working. In benchmarks against Claude Code with no retrieval, Curator reduces costs by 58% on competitive tasks, cuts task completion from 7 turns to 1, and improves output quality by ensuring the agent always has the right context.

The core retrieval technique is called HyPE - Hypothetical Prompt Embeddings. It’s a twist on HyDE (Hypothetical Document Embeddings) that generates hypothetical tasks to improve retrieval instead of hypothetical answers to improve search. I arrived at this approach independently while building Curator, then found existing research describing the same core idea. The technique itself isn’t new. What’s new is the applied system I built around it - a production retrieval pipeline with a custom scoring formula, a Claude Code hook that injects context at runtime, benchmarked cost and quality improvements, and a retraining loop for iterative improvement.

The problem with standard retrieval

The standard approach to context retrieval is straightforward: embed the user’s query, search a vector database for similar content, return the top results. This works well enough for document search, where the user is looking for content that resembles their question.

But agent context retrieval is a different problem. When someone asks an AI agent to “write a competitive analysis email,” the content they need - brand voice guidelines, product positioning docs, competitor pricing - looks nothing like the query. There’s a semantic gap between what the user is asking and what the agent needs to do a good job.

You can throw a reranker at this and hope a cross-encoder bridges the gap. I tried Jina reranker-v2 as a post-retrieval reranking step. It made things worse - degrading retrieval quality on 5 out of 6 test tasks. The issue is that generic rerankers optimize for query-document similarity, which is exactly the wrong objective when the query is a task description and the documents are reference material.

How HyPE works

HyPE closes the semantic gap at index time instead of query time. For each segment of indexed content, I generate 10 hypothetical task descriptions - prompts that a user might write where this content would be essential.

Take a segment containing brand voice guidelines. Standard embedding would place it near other content about brand voice. HyPE generates prompts like “write a marketing email for our spring campaign,” “draft social copy for a product launch,” “create a competitive positioning one-pager” - the actual tasks where those guidelines matter. Each hypothetical prompt gets embedded alongside the segment.

At runtime, the user’s real prompt gets embedded and compared against all the hypothetical prompt embeddings. A brand voice segment now matches “write a competitive analysis email” because one of its hypothetical prompts is semantically close to that task. No LLM call at retrieval time - just an embedding lookup and vector search.
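The runtime matching step can be sketched in a few lines. The segment names and tiny 3-d vectors below are illustrative stand-ins - Curator stores 768-d Gemini embeddings - but the logic is the same: embed the real prompt once, compare against every stored hypothetical prompt embedding.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Index time: each segment stores embeddings of its hypothetical
# prompts (toy 3-d vectors stand in for real 768-d embeddings).
index = {
    "brand-voice-guidelines": [(0.9, 0.1, 0.0), (0.7, 0.3, 0.1)],
    "api-changelog": [(0.0, 0.2, 0.9)],
}

# Runtime: embed the user's real prompt once, then compare it against
# every hypothetical prompt embedding -- no LLM call involved.
query = (0.8, 0.2, 0.1)
best = {seg: max(cosine(query, v) for v in vecs) for seg, vecs in index.items()}
print(max(best, key=best.get))  # brand-voice-guidelines
```

The brand voice segment wins even though its raw content shares no vocabulary with the query - the hypothetical prompts do the bridging.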

The scoring formula

Each segment can have up to 10 hypothetical prompt embeddings. When a query comes in, I count how many of those embeddings exceed a similarity threshold of 0.45, and compute the average similarity across all matches. The composite score is:

score = ln(match_count + 1) × avg_similarity

The natural log dampens the effect of multiple matches while still rewarding each additional one - a segment that matches on 6 hypothetical prompts scores meaningfully higher than one matching on 2, but not 3x higher. This rewards breadth of relevance without letting a single segment dominate just because it has many loosely related use cases.

Segments scoring above 0.75 are classified as essential and their full content gets injected into the agent’s context. That threshold was tuned empirically - below it, you start injecting content that’s nice-to-have but not load-bearing.
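The scoring step is small enough to show in full. This is a minimal sketch of the formula as described above - the function name and input shape are mine, not Curator's actual code:

```python
import math

SIM_THRESHOLD = 0.45     # similarity floor for a hypothetical prompt to count
ESSENTIAL_CUTOFF = 0.75  # composite score above which a segment is injected

def composite_score(similarities):
    """Score a segment from the similarities between the user's prompt
    and each of the segment's hypothetical prompt embeddings."""
    matches = [s for s in similarities if s >= SIM_THRESHOLD]
    if not matches:
        return 0.0
    avg = sum(matches) / len(matches)
    return math.log(len(matches) + 1) * avg

# A segment matching 6 prompts outscores one matching 2, but the log
# keeps the gap well under 3x.
print(round(composite_score([0.8] * 6), 2))  # ln(7) * 0.8 ≈ 1.56
print(round(composite_score([0.8] * 2), 2))  # ln(3) * 0.8 ≈ 0.88
```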

The breakthrough: snippets mode

The biggest performance jump didn’t come from the retrieval algorithm. It came from how I framed the injected context.

In earlier versions, I had two tiers: essential content (injected in full) and supporting content (listed as “available if needed”). The idea was reasonable - give the agent core context upfront and let it pull in more if required. Cost savings were around 24%.

Then I removed the supporting tier entirely. Just essential content, with framing that told the agent “you have everything you need.”
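The framing looks roughly like this - the exact wording below is illustrative rather than the production template, but the key move is the explicit completeness claim:

```python
# Illustrative snippets-mode framing -- not Curator's exact production
# wording. The completeness claim is what suppresses exploration.
SNIPPETS_TEMPLATE = """\
The following context was retrieved for this task. It is complete:
you have everything you need. Do not read additional files.

{snippets}
"""

def build_injection(segments):
    """Join the essential segments into one injected block."""
    snippets = "\n\n".join(f"## {s['title']}\n{s['content']}" for s in segments)
    return SNIPPETS_TEMPLATE.format(snippets=snippets)

print(build_injection([{"title": "Brand voice", "content": "Friendly, concise."}]))
```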

Savings jumped from 24% to 51%.

The reason is behavioral. When you tell Claude “here’s some context, and here’s more if you need it,” it treats the supporting tier as an invitation to explore. It reads more files, takes more turns, and each turn re-sends the full conversation at Anthropic’s cache creation rates. The cost driver in agent workflows isn’t the size of the initial context injection - it’s the number of turns. Each additional turn costs $0.01-0.03 in cache overhead as the full conversation gets re-processed.

Injecting 3,000 tokens of essential context once costs about $0.002. That same content, discovered across 5 file reads over 5 turns, costs roughly $0.03 - 15x more, plus the latency of those extra round trips.

The numbers

I benchmarked Curator against a control condition (Claude Code with full file access, no retrieval) across a marketing content corpus. The competitive task was run N=10 times per condition for statistical confidence.

Marketing corpus (14 files, 69 segments)

| Task | N | Control | Curator | Savings |
| --- | --- | --- | --- | --- |
| Competitive analysis email | 10 | $0.155 | $0.065 | -58% |

On the competitive task, file reads dropped from 5.8 to 0. Turns dropped from 6.8 to 1.0. End-to-end task completion time dropped from 37s (control) to 28s (Curator). Every single snippets run completed in a single turn with zero file reads - Claude had everything it needed from the injected context.

The control runs showed high variance ($0.13-$0.53), driven by cold Anthropic caches on the first runs. Snippets runs were remarkably consistent ($0.054-$0.098), because the cost is determined by the fixed context injection rather than unpredictable exploration behavior. Control steady-state (excluding cold-cache outliers) was ~$0.155; even against that conservative baseline, savings hold at 58%.

HyPE vs. naive RAG

To validate that the HyPE technique actually matters - and that you can’t just embed raw segment content and get the same results - I ran the same competitive task at N=10 with a standard naive RAG baseline. Same corpus, same injection template, same everything except the retrieval method: naive RAG embeds each segment’s content directly (1 embedding per segment) instead of generating hypothetical task descriptions (10 per segment).

| Condition | N | Cost | Reads | Turns | Duration |
| --- | --- | --- | --- | --- | --- |
| No retrieval (control) | 10 | $0.155 | 5.8 | 6.8 | 37s |
| Naive RAG | 10 | $0.171 | 5.3 | 6.3 | 41s |
| Full context dump | 10 | $0.121 | 2.1 | 2.9 | 33s |
| HyPE (Curator) | 10 | $0.065 | 0.0 | 1.0 | 28s |

Three findings here:

Naive RAG is no better than no retrieval at all. It actually costs more - $0.171 vs. $0.155 - because it injects context (costing tokens) but injects the wrong context. The segments most textually similar to “write a competitive analysis email” aren’t the segments you actually need to write a good competitive analysis email. Claude gets the injected context, doesn’t find what it needs, and falls back to reading files anyway: 5.3 reads, 6.3 turns.

Full context dump helps, but inconsistently. Injecting all 69 segments gives the agent everything it needs - somewhere in the pile. Cost drops to $0.121 and reads drop to 2.1. But variance is massive ($0.06-$0.20). Some runs complete in 1 turn; in others, Claude still explores despite having everything in context. Giving the model the right information mixed with a lot of irrelevant information produces unpredictable behavior. The lowest-cost runs also benefited from prompt caching - the full dump injects identical content every time, so Anthropic’s cache kicks in on later runs. In production with varied queries, you wouldn’t see that effect.

HyPE wins because it’s selective. It injects only what’s essential and nothing else. Every run completes in 1 turn with 0 reads. Cost is consistent ($0.054-$0.098). The agent doesn’t explore because the framing tells it the injected context is complete - and because the context actually is complete for the task, that framing holds.

This is the semantic gap in action. Standard retrieval can’t bridge the gap between “what the user asks” and “what the agent needs.” Dumping everything works sometimes but wastes tokens and produces inconsistent results. HyPE bridges the gap at index time, so retrieval is precise and the agent behaves predictably.

Retrieval quality

After targeted retraining - regenerating hypothetical prompts for specific underperforming segments - essential recall hit 100% across all tasks (up from ~70-80% on the initial cold-start run). The system retrieves every segment that a human evaluator marked as critical.

| Task | Essential recall | Full recall | Precision | F1 |
| --- | --- | --- | --- | --- |
| Competitive | 100% | 80% | 80% | 0.80 |
| Blog | 100% | 55% | 73% | 0.63 |
| Email | 100% | 73% | 73% | 0.73 |

Full recall (including nice-to-have segments) is lower, which is fine - the whole point of snippets mode is that you don’t need everything, just the essentials.
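For reference, the table's metrics follow the standard set-based definitions. A minimal sketch with toy segment sets (the numbers below are illustrative, not the benchmark's actual retrieval results):

```python
def prf1(retrieved, relevant):
    """Precision, recall, and F1 for one task's retrieved segment set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: 11 segments retrieved, 8 of which are among the
# 10 segments a human marked as relevant.
p, r, f1 = prf1(set(range(11)), set(range(3, 13)))
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.73 0.8 0.76
```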

Constraints are free

Each segment can also have extracted constraints - rules like “subject lines must be under 50 characters” or “always include a preview text field.” I ran an A/B test to check whether injecting constraints alongside content affected cost:

| Task | With constraints | Without | Delta |
| --- | --- | --- | --- |
| Competitive | $0.034 | $0.031 | +$0.003 |
| Email | $0.027 | $0.028 | -$0.001 |

Noise-level differences. Constraints are essentially free to inject and they measurably improve output quality - the email task now consistently follows the 50-character subject line rule and includes preview text.

Cross-domain validation

The marketing results above could be domain-specific - maybe HyPE only works for marketing content. To test this, I built a second corpus in a completely different domain: biotech clinical operations. 14 clinical documents covering a fictional drug trial (MRD-401), including protocol versions, safety reports, interim analyses, investigator brochures, and site feasibility assessments. 48 segments, each with its own set of hypothetical prompt embeddings.

The clinical corpus was deliberately designed to include contradictions - a safety report claiming “no new safety signals” contradicted by a later report with a liver signal under evaluation, protocol versions that partially supersede each other, budget figures that don’t match across documents. Real-world business knowledge is messy. If the system only works on clean data, it doesn’t work.

| Task | N | Complexity | Turns | File reads | Cost |
| --- | --- | --- | --- | --- | --- |
| Safety report summary | 3 | Simple (1-2 docs) | 1.0 | 0 | $0.038 |
| IND submission overview | 3 | Medium (3-4 docs) | 4.3 | 1.7 | $0.131 |
| Risk assessment memo | 3 | Complex (5+ docs) | 1.0 | 0 | $0.060 |

Two findings stand out. First, HyPE transfers across domains without modification. The same pipeline, same scoring formula, same thresholds - applied to clinical documents it had never seen - produced the same pattern: focused synthesis tasks complete in 1 turn with 0 file reads.

Second, task type drives cost, not domain complexity. The risk assessment memo required synthesizing information from 5+ documents with factual contradictions, yet it completed in a single turn at $0.060. The IND submission - a formal document-writing task - triggered multiple turns and file reads at $0.131. This matches the marketing corpus exactly: when Claude perceives a task as requiring formal document generation, it explores regardless of what you inject. Focused synthesis tasks stay in 1 turn.

The clinical constraints A/B test confirmed the marketing finding: constraints are cost-neutral to inject in both domains.

Architecture

The system is a three-step pipeline:

1. Ingest. Each file gets segmented by an LLM that identifies semantic boundaries - not arbitrary chunk sizes, but meaningful sections like “email formatting rules” or “competitor positioning.” One LLM call per file.

2. Generate use cases. For each segment, an LLM generates 10 diverse hypothetical prompts plus any constraints embedded in the content. Each prompt gets embedded with Google’s Gemini embedding model (768 dimensions). This is the expensive step - about $0.0075 per segment - but it only runs at index time.

3. Retrieve. At runtime, the user’s prompt gets embedded (one API call, ~$0.0001) and compared against all hypothetical prompt embeddings via pgvector in Supabase. No LLM calls. Total retrieval latency is around 700ms.

The runtime cost is essentially zero. All the intelligence is front-loaded into the indexing step. A 60-segment corpus costs about $0.76 to fully index, and retrieval is a rounding error after that.
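The pgvector lookup is conceptually a single query. The table and column names below are illustrative assumptions, not Curator's actual schema; `<=>` is pgvector's real cosine-distance operator, so similarity is 1 minus the distance:

```python
# Conceptual shape of the retrieval query. Table/column names are
# assumptions for illustration; `<=>` is pgvector's cosine-distance
# operator, so similarity = 1 - distance.
RETRIEVAL_SQL = """
SELECT segment_id,
       1 - (embedding <=> %(query_vec)s) AS similarity
FROM   hypothetical_prompt_embeddings
WHERE  1 - (embedding <=> %(query_vec)s) >= 0.45
ORDER  BY similarity DESC;
"""
```

One query per user prompt; the composite scoring and the 0.75 essential cutoff are then applied over the returned rows, grouped by segment.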

What I learned

Turns are the dominant cost driver. Not tokens, not context size - turns. Every time an agent takes another action, the full conversation gets re-sent. Reducing turns from 8 to 3 is worth more than any token-level optimization.

“If needed” is expensive framing. Giving an AI agent optional context is like giving a developer optional documentation - they’ll read all of it. If you know what’s essential, inject it and frame it as complete.

Generic rerankers hurt domain-specific retrieval. Cross-encoders optimized for query-document similarity actively degrade performance when the “query” is a task and the “documents” are reference material. HyPE’s task-oriented embeddings already capture the right relationship.

Targeted retraining is cheap and effective. When a specific task underperforms, you can regenerate hypothetical prompts for just the underperforming segments with task-specific hints. Email retrieval went from 33% to 100% essential recall by retraining 4 segments. Zero runtime cost increase.

Limitations and open questions

Designed for non-code context. Curator is deliberately focused on business knowledge - marketing guidelines, clinical protocols, product docs, brand voice, operational procedures. Code context is a crowded space with strong existing tools (LSPs, tree-sitter, IDE integrations). The unsolved problem is everything else - the unstructured business knowledge that AI agents need but can’t find through code analysis. That said, the technique should transfer to code documentation and architectural decision records. I haven’t tested that yet.

Long-form generation resists injected context. Snippets mode saves ~60% on focused tasks like emails and competitive analysis, but only ~17% on blog posts. The pattern is consistent: when Claude perceives a task as requiring “research” - long-form writing, formal documents, multi-source synthesis - it reads source files regardless of what you inject. Short synthesis tasks complete in 1 turn with zero file reads. Blog posts still take 7-10 turns. I haven’t solved this yet.

Cold-start retrieval isn’t perfect. The initial 10 hypothetical prompts per segment get you to ~70-80% essential recall out of the box. Hitting 100% requires targeted retraining - regenerating use cases for specific underperforming segments. The system improves with iteration, not magic. That retraining loop is cheap (minutes of work, fractions of a cent), but it’s not zero.

Longer context windows don’t eliminate this problem. A natural objection is that 1M+ token context windows will make retrieval irrelevant. They won’t - because turns, not input tokens, are the dominant cost driver. A model with a 1M context window that takes 7 turns to explore files still re-sends the full conversation 7 times. Curator’s savings come from reducing turns to 1, which matters regardless of context window size. Longer windows do make the full-dump approach more viable for small corpora, but the behavioral problem - models exploring when they’re uncertain about completeness - persists.

The corpus is small. 14 files, 62 segments, 620 hypothetical prompt embeddings. The scoring formula and thresholds were tuned at this scale. At 1,000+ segments, the embedding space could get noisy enough that precision degrades - segments with loosely related hypothetical prompts could start surfacing where they shouldn’t. I expect HyPE’s advantage over naive retrieval to grow with corpus size (because the semantic gap problem gets worse with more content to search through), but I haven’t proven that yet.

Some task benchmarks need more runs. The competitive task has been validated at N=10 with consistent results (58% savings, 0 reads, 1 turn on every run). Blog and email tasks have been run at smaller sample sizes. Expanding those to N=10 is planned.

Only one RAG baseline tested so far. Naive vector similarity is the most common retrieval approach, and HyPE decisively beats it. But I haven’t yet tested against reranked retrieval, query decomposition, or hybrid approaches. Those comparisons are planned.

Why this matters

The AI industry is scaling toward agents that take autonomous actions - writing code, drafting documents, managing workflows. The standard approach is to give these agents access to everything and let them figure it out. That works, but it’s slow and expensive.

Context retrieval that actually understands the relationship between tasks and knowledge can make these agents dramatically more efficient. Not by making the models smarter, but by making sure they start with the right information.

If you’re looking to get more out of your existing AI tools or struggling with AI context problems, I’d love to connect. You can find me on LinkedIn or reach me at me@patrickjohnkelly.com.