# How It Works

## Architecture Overview

```mermaid
graph TD
    subgraph "Your Code (unchanged)"
        A[Agent] --> B[LLM Framework]
    end
    subgraph "Transport Layer"
        B --> C["httpx.Client.send()"]
    end
    subgraph "AgentOpt (transparent)"
        C --> D[Interceptor]
        D --> E[Response Cache]
        D --> F[Call Recorder]
        F --> G[Token Counter]
        F --> H[Latency Timer]
    end
    C --> I[LLM API]
```
## The Interception Layer

Every major LLM SDK uses `httpx` for HTTP requests. AgentOpt patches `httpx.Client.send()` and `httpx.AsyncClient.send()` at the class level, intercepting every LLM API call transparently:

```
your_agent(input)
+-- framework internals (LangChain, CrewAI, etc.)
    +-- httpx.Client.send()   <-- intercepted here
        +-- LLM API (OpenAI, Anthropic, etc.)
```

This means:

- Zero code changes to your agent or framework
- No proxy server to configure
- Works with any framework that uses httpx (OpenAI, Anthropic, LangChain, CrewAI, LlamaIndex, AG2, ...)
!!! note "Supported providers"

    Any LLM provider whose SDK uses httpx is automatically supported. This includes OpenAI, Anthropic, Google (Gemini), Mistral, Cohere, and any OpenAI-compatible API endpoint.
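The class-level patch can be sketched with a stand-in client. The real interceptor targets `httpx.Client.send`; the `Client` class and `calls` list below are illustrative only:

```python
import functools

class Client:
    """Stand-in for httpx.Client; the real patch targets httpx.Client.send."""
    def send(self, request):
        return f"response-to-{request}"

_original_send = Client.send   # keep a reference to the unpatched method
calls = []                     # stands in for the call recorder

@functools.wraps(_original_send)
def _intercepted_send(self, request):
    calls.append(request)      # token counting / latency timing would hook in here
    return _original_send(self, request)

# Patch at the class level: every existing and future instance is intercepted.
Client.send = _intercepted_send
```

Because the patch replaces the method on the class object itself, instances created before or after the patch are both intercepted, which is why no agent or framework code needs to change.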
## What Gets Tracked

For each intercepted call, AgentOpt records:

| Field | Description |
|---|---|
| `model` | Model name (e.g., `gpt-4o`) |
| `prompt_tokens` | Input token count |
| `completion_tokens` | Output token count |
| `latency_seconds` | Wall-clock API call duration |
| `data_id` | Which datapoint triggered this call |
| `combo_id` | Which model combination was being evaluated |
| `cached` | Whether the response was served from cache |
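One way to picture a single record is as a dataclass whose fields mirror the table above. This is a hypothetical shape for illustration, not AgentOpt's internal structure:

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    # Field names follow the tracked-fields table; types are assumptions.
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_seconds: float
    data_id: str
    combo_id: str
    cached: bool

rec = CallRecord("gpt-4o", 120, 45, 0.83, "dp_1", "gpt4o+haiku", False)
```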
## Attribution with ContextVars

AgentOpt uses Python's `contextvars` to attribute LLM calls to the correct datapoint and model combination. Each thread and async task gets its own context, so parallel evaluations never interfere:

```python
with tracker.track(data_id="dp_1", combo_id="gpt4o+haiku"):
    result = agent(input_data)
    # All LLM calls in this block are tagged with dp_1 + gpt4o+haiku
```

This is what makes `parallel=True` safe — concurrent evaluations are fully isolated.
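The isolation guarantee can be demonstrated with a minimal `contextvars` sketch (the `data_id` variable here is illustrative, not AgentOpt's internal name):

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

data_id = contextvars.ContextVar("data_id", default=None)

def run_eval(dp):
    # Each thread (and each asyncio task) sees its own copy of the ContextVar,
    # so concurrent set() calls never clobber one another.
    data_id.set(dp)
    return data_id.get()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_eval, ["dp_1", "dp_2", "dp_3"]))
```

Each `run_eval` reads back exactly the value it set, regardless of which worker thread it ran on — the same property `tracker.track(...)` relies on.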
## Response Caching

AgentOpt caches at the HTTP level:

| Property | Detail |
|---|---|
| Cache key | SHA-256 of request body (model + messages + params), excluding `stream` |
| In-memory | Thread-safe dict, always active |
| On disk | Optional SQLite database, flushed by background thread |
When the same prompt hits the same model, the cached response is returned instantly — preserving the original latency measurement for fair comparisons.
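A key derivation along these lines would work — the exact canonicalization AgentOpt uses is an assumption here; only the hashed fields and the `stream` exclusion come from the table above:

```python
import hashlib
import json

def cache_key(body: dict) -> str:
    # Drop `stream` so streamed and non-streamed requests share a key,
    # then hash a canonical JSON encoding of the remaining body.
    keyed = {k: v for k, v in body.items() if k != "stream"}
    canon = json.dumps(keyed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

a = cache_key({"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}], "stream": True})
b = cache_key({"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}], "stream": False})
```

Sorting keys makes the key stable under dict ordering, and excluding `stream` means a streamed and a non-streamed request for the same prompt hit the same cache entry.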
!!! tip "Cache savings in practice"

    If two model combinations share the same planner model, the planner call for each datapoint is identical and hits the cache. With 9 combinations and 3 planner models, you only pay for 3 unique planner calls per datapoint — not 9.
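The arithmetic from that example, spelled out with a hypothetical 20-datapoint dataset:

```python
combos = 9            # model combinations under evaluation
planner_models = 3    # distinct planner models across those combos
datapoints = 20       # hypothetical dataset size

without_cache = combos * datapoints        # every combo re-pays the planner call: 180
with_cache = planner_models * datapoints   # only unique (model, prompt) pairs are paid: 60
```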
## The Selection Loop

For each model combination:

1. **Build** — Instantiate the agent class with the candidate models via `agent(models)`
2. **Run** — Execute the agent on every datapoint in the evaluation dataset
3. **Track** — Record token usage, latency, and cost via the interception layer
4. **Score** — Evaluate each output using `eval_fn(expected, actual)`
5. **Aggregate** — Compute mean accuracy, latency, and estimated cost
6. **Rank** — Sort by accuracy (ties broken by latency), report results

Different selection algorithms vary in how they choose which combinations to evaluate, but steps 1-6 are the same for all of them.
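The loop above can be sketched as follows. This is a simplified illustration, assuming `agent_cls`, `eval_fn`, and the dataset follow the shapes described in this section; tracking (step 3) happens inside the interception layer, not in this loop:

```python
from statistics import mean

def evaluate_combo(agent_cls, models, dataset, eval_fn):
    agent = agent_cls(models)                          # 1. Build
    scores = []
    for dp in dataset:
        out = agent(dp["input"])                       # 2. Run (3. Track happens in the interceptor)
        scores.append(eval_fn(dp["expected"], out))    # 4. Score
    return {"models": models, "accuracy": mean(scores)}  # 5. Aggregate

def rank(results):
    # 6. Rank: accuracy descending, ties broken by latency ascending.
    return sorted(results, key=lambda r: (-r["accuracy"], r.get("latency", 0.0)))
```

A selection algorithm then decides which `models` tuples to feed through `evaluate_combo` — exhaustive search tries them all, while smarter strategies prune the space.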