Benchmark Results
We evaluated 9 models across 4 benchmarks using LangGraph-based agents, then ran 8 model selection algorithms to measure how efficiently each finds the best model without exhaustive search. All results use 198–200 samples per benchmark, with brute-force evaluation of every combo serving as ground truth. Selector comparisons were run with 50 random seeds.
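To make the selector comparison concrete, here is a minimal sketch of one strategy in that family, successive arm elimination, written against a hypothetical `evaluate(arm, n)` callback. It is illustrative only; the benchmarked selectors differ in their batching, confidence bounds, and stopping rules.

```python
import random

def arm_elimination(evaluate, arms, batch_size=20, rounds=4):
    """Illustrative successive arm elimination: score every surviving arm
    on a fresh batch of samples each round, then drop the bottom half."""
    scores = {arm: [] for arm in arms}
    survivors = list(arms)
    for _ in range(rounds):
        for arm in survivors:
            # evaluate(arm, n) -> n per-sample correctness values (0 or 1)
            scores[arm].extend(evaluate(arm, batch_size))
        survivors.sort(key=lambda a: sum(scores[a]) / len(scores[a]), reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0], scores

# Toy usage with made-up per-arm accuracies standing in for real agent runs.
truth = {"opus": 0.75, "kimi": 0.72, "oss-120b": 0.68, "haiku-4.5": 0.60}
evaluate = lambda arm, n: [int(random.random() < truth[arm]) for _ in range(n)]
best, scores = arm_elimination(evaluate, list(truth))
print(best, {arm: len(s) for arm, s in scores.items()})  # winner + evals per arm
```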
Models
All models were accessed via AWS Bedrock Application Inference Profiles (on-demand pricing, March 2026).
| Model | Provider | Input $/MTok | Output $/MTok |
| --- | --- | --- | --- |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 |
| gpt-oss-120b | OpenAI | $0.15 | $0.60 |
| gpt-oss-20b | OpenAI | $0.07 | $0.30 |
| Kimi K2.5 | Moonshot AI | $0.60 | $3.00 |
| Qwen3 Next 80B A3B | Qwen | $0.15 | $1.20 |
| Qwen3 32B | Qwen | $0.15 | $0.60 |
| Ministral 3 8B | Mistral | $0.15 | $0.15 |
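The per-run costs reported throughout are token usage priced at these rates. A minimal sketch of that arithmetic, with illustrative (not measured) token counts:

```python
# Cost in dollars from token counts and the table's $/MTok rates.
def run_cost(input_tokens, output_tokens, input_rate, output_rate):
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# Illustrative: 1M input + 200K output tokens on Claude Opus 4.6.
print(f"${run_cost(1_000_000, 200_000, 5.00, 25.00):.2f}")  # $10.00
```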
Cross-Benchmark Summary
| Benchmark | Tuple | Samples | Combos | Best Combo | Accuracy | Brute-Force Cost | Arm Elimination Savings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPQA Diamond | 1-tuple | 198 | 9 | Claude Opus 4.6 | 74.75% | $4.71 | 24% |
| BFCL Multi-Turn | 1-tuple | 200 | 9 | Kimi K2.5 (tied with Opus, Qwen3 Next) | 70.00% | $84.80 | 12% |
| HotpotQA | 2-tuple | 200 | 81 | planner=Ministral 3 8B + solver=Claude Opus 4.6 | 74.27% | $51.90 | 67% |
| MathQA | 2-tuple | 200 | 81 | answer=Claude Opus 4.6 + critic=Claude Haiku 4.5 | 98.84% | $123.87 | 58% |
GPQA Diamond
Graduate-level science QA — 198 multiple-choice questions from the GPQA Diamond dataset. Single-agent architecture: one LLM answers directly.
Model Results
| Rank | Model | Accuracy | Avg Latency (s) | Cost |
| --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | 74.75% | 9.16 | $2.48 |
| 2 | Kimi K2.5 | 72.73% | 16.41 | $1.13 |
| 3 | gpt-oss-120b | 68.18% | 6.46 | $0.20 |
| 4 | Claude Haiku 4.5 | 59.60% | 3.70 | $0.51 |
| 5 | Qwen3 Next 80B A3B | 51.01% | 10.33 | $0.14 |
| 6 | gpt-oss-20b | 50.00% | 6.21 | $0.14 |
| 7 | Qwen3 32B | 46.97% | 1.54 | $0.08 |
| 8 | Ministral 3 8B | 36.87% | 0.25 | $0.00 |
| 9 | Claude 3 Haiku | 34.85% | 1.79 | $0.06 |
Selector Comparison
| Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
| --- | --- | --- | --- | --- | --- |
| Brute Force | 100% | 74.75% | 1,782 | $4.71 | -- |
| LM Proposal | 100% | 74.75% | 198 | $2.47 | 48% |
| Hill Climbing | 90% | 74.55% | 1,501 | $4.03 | 14% |
| Arm Elimination | 94% | 74.10% | 666 | $3.57 | 24% |
| Epsilon LUCB | 72% | 73.14% | 380 | $2.51 | 47% |
| Bayesian Opt | 56% | 72.43% | 990 | $2.59 | 45% |
| Random Search | 36% | 68.57% | 594 | $1.73 | 63% |
| Threshold SE | 16% | 57.48% | 252 | $1.80 | 62% |
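Reading the columns: Find Rate is the fraction of the 50 seeds in which the selector's final pick matched the brute-force winner, Mean Accuracy is the picked combo's accuracy averaged over seeds, and Savings is the cost reduction relative to brute force. A quick check of that last column:

```python
# Savings = 1 - selector cost / brute-force cost, checked against the table.
def savings(selector_cost, brute_force_cost):
    return 1 - selector_cost / brute_force_cost

print(f"{savings(3.57, 4.71):.0%}")  # Arm Elimination -> 24%
print(f"{savings(2.47, 4.71):.0%}")  # LM Proposal -> 48%
```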
Thinking Effort Ablation
Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort) and Haiku 4.5 (budget_tokens). Baseline "none" rows use brute force results.
| Model | Effort/Budget Tokens | Accuracy | Cost/Sample | Server Latency/Sample (s) |
| --- | --- | --- | --- | --- |
| Opus | high | 83.90% | $0.113 | 70.4 |
| Opus | medium | 79.30% | $0.0341 | 23.9 |
| Opus | none | 74.75% | $0.0125 | 9.16 |
| Haiku 4.5 | 16K | 71.20% | $0.0192 | 30.6 |
| Haiku 4.5 | 32K | 71.20% | $0.0361 | 57.4 |
| Opus | low | 61.60% | $0.00302 | 3.06 |
| Haiku 4.5 | 5K | 60.10% | $0.00925 | 15.0 |
| Haiku 4.5 | none | 59.60% | $0.0026 | 3.70 |
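For reference, a minimal sketch of how the Haiku 4.5 budget_tokens setting is passed through the Bedrock Converse API. The `thinking` field is Anthropic's documented extended-thinking shape; the profile ARN is a placeholder, and we assume Opus's adaptive effort levels would ride through the same `additionalModelRequestFields` hook.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder: substitute your Application Inference Profile ARN.
MODEL_ID = "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/example"

response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": "A GPQA question ..."}]}],
    # maxTokens must cover the thinking budget plus the visible answer.
    inferenceConfig={"maxTokens": 20000},
    # Anthropic extended thinking: reserve up to 16K reasoning tokens.
    additionalModelRequestFields={
        "thinking": {"type": "enabled", "budget_tokens": 16000}
    },
)
# The final content block holds the visible answer text.
print(response["output"]["message"]["content"][-1]["text"])
```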
BFCL Multi-Turn
Multi-turn function calling — 200 samples from the Berkeley Function Calling Leaderboard (BFCL v3). Each sample has multiple turns with tool-calling loops. Models that don't support native function calling (Qwen3 32B, Kimi K2.5, Ministral 3 8B) use a text-based prompting fallback.
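The fallback works roughly like this (a simplified sketch, not the exact prompt or parser we used): the agent lists the tools in the prompt, asks for one JSON object per tool call, and extracts it from the raw completion.

```python
import json
import re

TOOL_PROMPT = """You have these tools:
{tools}
To call one, reply with exactly one JSON object on its own line:
{{"tool": "<name>", "args": {{...}}}}
Reply in plain text (no JSON) when you are ready to give the final answer."""

def parse_tool_call(text):
    """Return (tool, args) if the reply contains a JSON tool call, else None."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "tool" in obj:
        return obj["tool"], obj.get("args", {})
    return None

print(parse_tool_call('{"tool": "get_flight_status", "args": {"id": "UA42"}}'))
print(parse_tool_call("The flight is on time."))  # None -> treat as final answer
```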
Comparison with official BFCL leaderboard
Our evaluation uses a live LangGraph agent that executes tool calls against real backend state machines, whereas the official BFCL leaderboard uses static response matching. Our accuracy numbers are not directly comparable to the leaderboard — they reflect end-to-end agent performance including tool execution, state management, and multi-step reasoning.
Model Results
| Rank | Model | Accuracy | Avg Latency (s) | Cost |
| --- | --- | --- | --- | --- |
| 1 | Kimi K2.5 | 70.00% | 21.30 | $3.86 |
| 2 | Claude Opus 4.6 | 70.00% | 42.35 | $60.14 |
| 3 | Qwen3 Next 80B A3B | 70.00% | 60.54 | $1.90 |
| 4 | Claude Haiku 4.5 | 65.00% | 20.90 | $11.98 |
| 5 | gpt-oss-120b | 58.50% | 20.01 | $1.16 |
| 6 | Qwen3 32B | 47.00% | 10.78 | $1.00 |
| 7 | Claude 3 Haiku | 43.50% | 17.96 | $3.42 |
| 8 | gpt-oss-20b | 42.00% | 10.03 | $0.42 |
| 9 | Ministral 3 8B | 34.00% | 29.03 | $0.92 |
Selector Comparison
| Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
| --- | --- | --- | --- | --- | --- |
| Brute Force | 100% | 70.00% | 1,800 | $84.80 | -- |
| Hill Climbing | 100% | 70.00% | 1,664 | $72.12 | 15% |
| Epsilon LUCB | 28% | 69.90% | 399 | $40.03 | 53% |
| Arm Elimination | 88% | 69.37% | 912 | $74.39 | 12% |
| Bayesian Opt | 44% | 69.27% | 1,000 | $50.64 | 40% |
| Random Search | 36% | 67.13% | 600 | $31.39 | 63% |
| Threshold SE | 10% | 58.19% | 186 | $18.82 | 78% |
| LM Proposal | 0% | 44.03% | 200 | $3.39 | 96% |
HotpotQA
Multi-hop question answering — 200 samples from the HotpotQA distractor setting. Two-agent architecture: a planner proposes search steps, and a solver executes them with tool access. 81 model combinations (9 planners x 9 solvers).
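The topology, as a minimal LangGraph sketch with the node bodies stubbed out (the real nodes call their assigned Bedrock models, and the solver runs a tool loop over search results):

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class State(TypedDict):
    question: str
    plan: str
    answer: str

def planner(state: State) -> dict:
    # Stub: the planner model proposes search steps for the question.
    return {"plan": f"1. search entities mentioned in: {state['question']}"}

def solver(state: State) -> dict:
    # Stub: the solver model executes the plan with search-tool access.
    return {"answer": f"(answer derived from plan: {state['plan']})"}

graph = StateGraph(State)
graph.add_node("planner", planner)
graph.add_node("solver", solver)
graph.add_edge(START, "planner")
graph.add_edge("planner", "solver")
graph.add_edge("solver", END)
app = graph.compile()

print(app.invoke({"question": "Which magazine was started first?"})["answer"])
```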
Top 15 Combos
| Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost |
| --- | --- | --- | --- | --- | --- |
| 1 | Ministral 3 8B | Claude Opus 4.6 | 74.27% | 4.97 | $2.64 |
| 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | 4.52 | $2.79 |
| 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | 4.26 | $2.65 |
| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | 4.67 | $2.67 |
| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | 3.07 | $0.13 |
| 6 | Qwen3 32B | gpt-oss-120b | 70.04% | 2.66 | $0.13 |
| 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | 4.49 | $2.43 |
| 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | 3.21 | $0.17 |
| 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | 5.66 | $0.09 |
| 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | 3.00 | $0.16 |
| 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | 2.82 | $0.09 |
| 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | 3.65 | $0.12 |
| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | 2.69 | $0.11 |
| 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | 3.85 | $0.11 |
| 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | 3.51 | $0.11 |
Bottom 15 Combos
| Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost |
| --- | --- | --- | --- | --- | --- |
| 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 |
| 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 |
| 69 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 |
| 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 |
| 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 |
| 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 |
| 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 |
| 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 |
| 75 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | 4.60 | $2.02 |
| 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 |
| 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 |
| 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 |
| 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | 4.19 | $2.02 |
| 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | 3.47 | $0.69 |
| 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | 3.40 | $0.79 |
Capability as Liability
Claude Opus 4.6 as planner achieves only ~32% accuracy regardless of solver, making it the worst planner in the benchmark. Opus is "too smart" for the planner role: it calls terminate() and answers directly instead of delegating, so the solver is never invoked (flagged role2_never_called in the full results below). Meanwhile, the cheapest model, Ministral 3 8B, as planner with Opus as solver achieves the best accuracy at 74.27%. Stronger models can underperform in multi-agent architectures when the role calls for delegation rather than direct answering.
Full 81 Combo Results
| Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost | Note |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Ministral 3 8B | Claude Opus 4.6 | 74.27% | 4.97 | $2.64 | |
| 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | 4.52 | $2.79 | |
| 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | 4.26 | $2.65 | |
| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | 4.67 | $2.67 | |
| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | 3.07 | $0.13 | |
| 6 | Qwen3 32B | gpt-oss-120b | 70.04% | 2.66 | $0.13 | |
| 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | 4.49 | $2.43 | |
| 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | 3.21 | $0.17 | |
| 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | 5.66 | $0.09 | |
| 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | 3.00 | $0.16 | |
| 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | 2.82 | $0.09 | |
| 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | 3.65 | $0.12 | |
| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | 2.69 | $0.11 | |
| 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | 3.85 | $0.11 | |
| 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | 3.51 | $0.11 | |
| 16 | Qwen3 32B | gpt-oss-20b | 66.95% | 2.48 | $0.09 | |
| 17 | Claude 3 Haiku | Ministral 3 8B | 65.98% | 3.73 | $0.14 | |
| 18 | Ministral 3 8B | Kimi K2.5 | 65.24% | 3.27 | $0.26 | |
| 19 | gpt-oss-120b | Qwen3 Next 80B A3B | 64.93% | 4.68 | $0.10 | |
| 20 | Ministral 3 8B | Ministral 3 8B | 64.89% | 3.55 | $0.09 | |
| 21 | Claude 3 Haiku | gpt-oss-20b | 64.79% | 2.90 | $0.13 | |
| 22 | Kimi K2.5 | gpt-oss-120b | 64.70% | 4.16 | $0.29 | |
| 23 | gpt-oss-120b | Claude Opus 4.6 | 64.59% | 4.57 | $1.61 | |
| 24 | gpt-oss-120b | Claude Haiku 4.5 | 64.11% | 4.26 | $0.38 | |
| 25 | Kimi K2.5 | Qwen3 Next 80B A3B | 63.99% | 4.39 | $0.30 | |
| 26 | Kimi K2.5 | Ministral 3 8B | 63.95% | 6.42 | $0.28 | |
| 27 | Claude 3 Haiku | Kimi K2.5 | 63.85% | 2.89 | $0.31 | |
| 28 | gpt-oss-120b | Ministral 3 8B | 63.70% | 7.37 | $0.09 | |
| 29 | Qwen3 Next 80B A3B | Kimi K2.5 | 63.69% | 2.89 | $0.27 | |
| 30 | Kimi K2.5 | gpt-oss-20b | 63.35% | 6.80 | $0.26 | |
| 31 | Qwen3 32B | Kimi K2.5 | 63.17% | 3.26 | $0.28 | |
| 32 | gpt-oss-120b | Claude 3 Haiku | 62.72% | 3.72 | $0.13 | |
| 33 | Kimi K2.5 | Kimi K2.5 | 62.28% | 4.56 | $0.44 | |
| 34 | gpt-oss-120b | gpt-oss-120b | 62.15% | 4.59 | $0.10 | |
| 35 | Qwen3 Next 80B A3B | Ministral 3 8B | 62.11% | 4.27 | $0.10 | |
| 36 | gpt-oss-120b | gpt-oss-20b | 61.51% | 2.71 | $0.08 | |
| 37 | Qwen3 32B | Ministral 3 8B | 61.17% | 2.89 | $0.09 | |
| 38 | gpt-oss-120b | Kimi K2.5 | 60.85% | 4.09 | $0.18 | |
| 39 | gpt-oss-120b | Qwen3 32B | 58.80% | 4.06 | $0.10 | |
| 40 | Claude 3 Haiku | Qwen3 32B | 56.02% | 2.87 | $0.15 | |
| 41 | Claude 3 Haiku | Claude 3 Haiku | 55.91% | 2.41 | $0.21 | |
| 42 | gpt-oss-20b | Claude Opus 4.6 | 55.86% | 2.84 | $1.04 | |
| 43 | Ministral 3 8B | Qwen3 32B | 55.02% | 3.63 | $0.11 | |
| 44 | Kimi K2.5 | Claude 3 Haiku | 54.90% | 3.42 | $0.34 | |
| 45 | Qwen3 32B | Qwen3 32B | 54.82% | 2.53 | $0.11 | |
| 46 | Kimi K2.5 | Qwen3 32B | 54.73% | 4.57 | $0.30 | |
| 47 | gpt-oss-20b | Claude Haiku 4.5 | 54.28% | 2.19 | $0.26 | |
| 48 | gpt-oss-20b | Ministral 3 8B | 54.25% | 4.35 | $0.05 | |
| 49 | Qwen3 Next 80B A3B | Qwen3 32B | 54.13% | 2.83 | $0.11 | |
| 50 | gpt-oss-20b | Qwen3 Next 80B A3B | 53.89% | 2.11 | $0.06 | |
| 51 | gpt-oss-20b | Claude 3 Haiku | 52.66% | 2.04 | $0.08 | |
| 52 | gpt-oss-20b | gpt-oss-120b | 52.17% | 2.11 | $0.06 | |
| 53 | Ministral 3 8B | Claude 3 Haiku | 51.33% | 4.10 | $0.16 | |
| 54 | gpt-oss-20b | Kimi K2.5 | 51.01% | 1.96 | $0.12 | |
| 55 | gpt-oss-20b | gpt-oss-20b | 50.09% | 2.12 | $0.05 | |
| 56 | Qwen3 Next 80B A3B | Claude 3 Haiku | 49.98% | 2.56 | $0.17 | |
| 57 | gpt-oss-20b | Qwen3 32B | 49.16% | 2.05 | $0.06 | |
| 58 | Qwen3 32B | Claude 3 Haiku | 48.77% | 2.23 | $0.16 | |
| 59 | Claude 3 Haiku | Claude Haiku 4.5 | 46.50% | 3.35 | $0.71 | |
| 60 | Claude Haiku 4.5 | Claude Opus 4.6 | 43.54% | 4.06 | $1.80 | |
| 61 | Claude Haiku 4.5 | gpt-oss-20b | 41.49% | 3.03 | $0.45 | |
| 62 | Claude Haiku 4.5 | gpt-oss-120b | 41.20% | 3.14 | $0.47 | |
| 63 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 41.17% | 2.95 | $0.46 | |
| 64 | Claude Haiku 4.5 | Ministral 3 8B | 41.09% | 3.75 | $0.45 | |
| 65 | Claude Haiku 4.5 | Kimi K2.5 | 41.00% | 6.16 | $0.54 | |
| 66 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | 4.23 | $0.88 | |
| 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 | |
| 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 | |
| 69 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 | |
| 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 | |
| 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 | role2_never_called |
| 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 | role2_never_called |
| 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 | role2_never_called |
| 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 | role2_never_called |
| 75 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | 4.60 | $2.02 | role2_never_called |
| 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 | role2_never_called |
| 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 | role2_never_called |
| 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 | role2_never_called |
| 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | 4.19 | $2.02 | |
| 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | 3.47 | $0.69 | |
| 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | 3.40 | $0.79 | |
Selector Comparison
| Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
| --- | --- | --- | --- | --- | --- |
| Brute Force | 100% | 74.27% | 16,168 | $51.90 | -- |
| Bayesian Opt | 8% | 73.33% | 3,996 | $12.29 | 76% |
| Arm Elimination | 86% | 73.19% | 4,283 | $16.92 | 67% |
| Hill Climbing | 52% | 73.13% | 4,635 | $19.39 | 63% |
| Random Search | 30% | 72.25% | 4,192 | $13.37 | 74% |
| Epsilon LUCB | 10% | 69.71% | 478 | $1.75 | 97% |
| Threshold SE | 4% | 65.42% | 1,642 | $6.45 | 88% |
| LM Proposal | 0% | 34.13% | 200 | $1.84 | 96% |
MathQA
Self-reflective math reasoning — 200 samples from the MathQA dataset. Two-agent architecture: an answer model solves problems, and a critic model checks the work. If the critic rejects, the answer model retries (up to 3 iterations). 81 model combinations (9 answer models x 9 critics).
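The retry loop, as a stub-level sketch (the real answer and critic agents are Bedrock-backed LangGraph nodes; here they are plain callables):

```python
MAX_ITERS = 3

def solve_with_critic(question, answer_model, critic_model):
    """Propose, critique, revise: the critic either accepts the answer or
    returns feedback for the next attempt, up to MAX_ITERS iterations."""
    answer, feedback = None, None
    for _ in range(MAX_ITERS):
        answer = answer_model(question, feedback)           # propose or revise
        verdict, feedback = critic_model(question, answer)  # "accept" or critique
        if verdict == "accept":
            break
    return answer

# Toy usage: the critic rejects the first attempt, accepts the revision.
attempts = iter(["x = 4", "x = 5"])
answer_model = lambda q, fb: next(attempts)
critic_model = lambda q, a: ("accept", None) if a == "x = 5" else ("revise", "check step 2")
print(solve_with_critic("Solve 2x = 10", answer_model, critic_model))  # x = 5
```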
Top 15 Combos
| Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.84% | 16.15 | $6.19 |
| 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | 14.30 | $5.77 |
| 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | 14.03 | $5.26 |
| 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | 16.50 | $5.93 |
| 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | 15.40 | $6.30 |
| 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | 15.05 | $6.68 |
| 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | 15.94 | $6.97 |
| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | 18.37 | $6.58 |
| 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | 14.85 | $5.37 |
| 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | 6.81 | $0.97 |
| 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | 12.45 | $0.26 |
| 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | 4.04 | $0.08 |
| 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | 12.68 | $2.51 |
| 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | 6.19 | $0.37 |
| 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | 4.94 | $0.11 |
Bottom 15 Combos
| Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost |
| --- | --- | --- | --- | --- | --- |
| 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | 36.37 | $0.79 |
| 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | 32.70 | $0.48 |
| 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | 32.23 | $0.95 |
| 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | 25.65 | $0.77 |
| 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | 44.39 | $1.34 |
| 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | 28.62 | $2.79 |
| 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | 26.98 | $1.36 |
| 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | 8.39 | $0.32 |
| 75 | Kimi K2.5 | Qwen3 32B | 72.16% | 30.32 | $0.92 |
| 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | 8.42 | $0.32 |
| 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | 17.12 | $0.39 |
| 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | 14.23 | $0.53 |
| 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | 12.40 | $0.32 |
| 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | 6.29 | $0.29 |
| 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | 7.28 | $0.30 |
Full 81 Combo Results
| Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.84% | 16.15 | $6.19 |
| 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | 14.30 | $5.77 |
| 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | 14.03 | $5.26 |
| 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | 16.50 | $5.93 |
| 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | 15.40 | $6.30 |
| 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | 15.05 | $6.68 |
| 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | 15.94 | $6.97 |
| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | 18.37 | $6.58 |
| 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | 14.85 | $5.37 |
| 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | 6.81 | $0.97 |
| 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | 12.45 | $0.26 |
| 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | 4.04 | $0.08 |
| 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | 12.68 | $2.51 |
| 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | 6.19 | $0.37 |
| 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | 4.94 | $0.11 |
| 16 | gpt-oss-20b | Qwen3 Next 80B A3B | 94.02% | 8.67 | $0.14 |
| 17 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | 14.31 | $2.59 |
| 18 | gpt-oss-20b | Ministral 3 8B | 93.99% | 8.27 | $0.10 |
| 19 | gpt-oss-120b | Claude Opus 4.6 | 93.81% | 9.10 | $1.25 |
| 20 | Claude Haiku 4.5 | gpt-oss-20b | 93.50% | 12.51 | $2.20 |
| 21 | Claude Haiku 4.5 | Claude Opus 4.6 | 93.50% | 15.82 | $3.77 |
| 22 | Claude Haiku 4.5 | Ministral 3 8B | 93.50% | 14.70 | $2.57 |
| 23 | Claude Haiku 4.5 | Kimi K2.5 | 93.50% | 17.50 | $2.60 |
| 24 | gpt-oss-20b | Qwen3 32B | 93.48% | 4.30 | $0.09 |
| 25 | gpt-oss-20b | Claude 3 Haiku | 93.44% | 6.10 | $0.15 |
| 26 | gpt-oss-120b | Ministral 3 8B | 93.26% | 10.42 | $0.19 |
| 27 | gpt-oss-120b | Qwen3 32B | 93.26% | 5.53 | $0.16 |
| 28 | Claude Haiku 4.5 | gpt-oss-120b | 93.00% | 14.65 | $2.90 |
| 29 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 93.00% | 20.98 | $7.81 |
| 30 | gpt-oss-120b | Claude Haiku 4.5 | 92.82% | 7.77 | $0.47 |
| 31 | gpt-oss-120b | gpt-oss-20b | 92.78% | 6.45 | $0.18 |
| 32 | gpt-oss-120b | gpt-oss-120b | 92.78% | 6.94 | $0.19 |
| 33 | gpt-oss-120b | Kimi K2.5 | 92.78% | 12.09 | $0.32 |
| 34 | gpt-oss-120b | Qwen3 Next 80B A3B | 92.78% | 10.98 | $0.23 |
| 35 | gpt-oss-120b | Claude 3 Haiku | 92.75% | 6.42 | $0.20 |
| 36 | Claude Haiku 4.5 | Claude 3 Haiku | 92.50% | 13.43 | $2.46 |
| 37 | Claude 3 Haiku | Claude Opus 4.6 | 89.66% | 13.32 | $2.26 |
| 38 | Qwen3 32B | Qwen3 Next 80B A3B | 88.83% | 8.02 | $0.24 |
| 39 | Ministral 3 8B | Claude 3 Haiku | 88.15% | 10.24 | $0.05 |
| 40 | Qwen3 32B | gpt-oss-120b | 87.83% | 7.11 | $0.47 |
| 41 | Ministral 3 8B | Qwen3 Next 80B A3B | 87.82% | 9.22 | $0.03 |
| 42 | Qwen3 32B | Claude Opus 4.6 | 87.56% | 12.33 | $3.43 |
| 43 | Ministral 3 8B | Kimi K2.5 | 87.04% | 14.43 | $0.09 |
| 44 | Ministral 3 8B | gpt-oss-120b | 86.63% | 10.58 | $0.07 |
| 45 | Claude 3 Haiku | Claude Haiku 4.5 | 86.55% | 9.32 | $0.69 |
| 46 | Ministral 3 8B | Ministral 3 8B | 86.52% | 7.29 | $0.03 |
| 47 | Ministral 3 8B | Claude Opus 4.6 | 86.47% | 11.46 | $0.93 |
| 48 | Qwen3 32B | Claude Haiku 4.5 | 86.46% | 7.47 | $0.90 |
| 49 | Ministral 3 8B | Claude Haiku 4.5 | 86.23% | 11.66 | $0.30 |
| 50 | Ministral 3 8B | gpt-oss-20b | 86.13% | 12.33 | $0.05 |
| 51 | Qwen3 32B | Ministral 3 8B | 86.10% | 17.57 | $0.21 |
| 52 | Qwen3 32B | Kimi K2.5 | 85.94% | 13.50 | $0.78 |
| 53 | Qwen3 32B | gpt-oss-20b | 85.86% | 6.43 | $0.49 |
| 54 | Ministral 3 8B | Qwen3 32B | 85.80% | 9.41 | $0.04 |
| 55 | Qwen3 32B | Qwen3 32B | 84.82% | 5.98 | $0.62 |
| 56 | Kimi K2.5 | Claude 3 Haiku | 80.41% | 35.09 | $0.98 |
| 57 | Qwen3 32B | Claude 3 Haiku | 80.00% | 7.86 | $0.67 |
| 58 | Qwen3 Next 80B A3B | Claude 3 Haiku | 80.00% | 35.17 | $0.59 |
| 59 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 78.00% | 31.01 | $2.96 |
| 60 | Kimi K2.5 | Ministral 3 8B | 77.84% | 40.79 | $0.97 |
| 61 | Kimi K2.5 | Qwen3 Next 80B A3B | 77.20% | 37.64 | $1.00 |
| 62 | Qwen3 Next 80B A3B | Ministral 3 8B | 77.00% | 38.55 | $0.55 |
| 63 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 76.50% | 32.33 | $1.21 |
| 64 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.50% | 34.72 | $0.52 |
| 65 | Qwen3 Next 80B A3B | Qwen3 32B | 76.00% | 30.64 | $0.42 |
| 66 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 76.00% | 36.44 | $0.54 |
| 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | 36.37 | $0.79 |
| 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | 32.70 | $0.48 |
| 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | 32.23 | $0.95 |
| 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | 25.65 | $0.77 |
| 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | 44.39 | $1.34 |
| 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | 28.62 | $2.79 |
| 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | 26.98 | $1.36 |
| 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | 8.39 | $0.32 |
| 75 | Kimi K2.5 | Qwen3 32B | 72.16% | 30.32 | $0.92 |
| 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | 8.42 | $0.32 |
| 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | 17.12 | $0.39 |
| 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | 14.23 | $0.53 |
| 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | 12.40 | $0.32 |
| 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | 6.29 | $0.29 |
| 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | 7.28 | $0.30 |
Selector Comparison
| Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
| --- | --- | --- | --- | --- | --- |
| Brute Force | 100% | 98.84% | 14,961 | $123.87 | -- |
| Arm Elimination | 86% | 98.83% | 3,356 | $51.86 | 58% |
| Hill Climbing | 80% | 98.76% | 3,926 | $54.22 | 56% |
| Random Search | 28% | 98.17% | 3,880 | $31.77 | 74% |
| Epsilon LUCB | 4% | 96.99% | 447 | $6.10 | 95% |
| LM Proposal | 0% | 95.82% | 158 | $5.61 | 95% |
| Bayesian Opt | 4% | 95.41% | 3,666 | $35.56 | 71% |
| Threshold SE | 0% | 74.52% | 1,355 | $6.90 | 94% |