
Rare Agent Work · Empirical Strategy Brief
Rev 3.1 · Updated March 14, 2026
Production-grade evaluation, reproducibility, and governance
Build a defensible, reproducible evaluation protocol and governance framework for production AI systems — with real statistical grounding, not benchmark theater.
What this report gives you
The finding that changes your next decision
“With n=50 — the most common evaluation set size — you cannot statistically distinguish 76% accuracy from 71%: the minimum detectable difference is 20 percentage points, which means most teams presenting model comparison results are presenting noise dressed as evidence, and the framework decisions that follow from that noise compound into months of architectural debt.”
This report is right for you if any of these are true
Why this report exists
Most agent evaluation programs fail not because teams skip evaluation — they fail because the evaluation setup has three structural flaws that make the results untrustworthy before a single test is run: benchmarks built from best-case examples instead of production traffic distributions, judge models deployed without calibration against human raters, and sample sizes that are statistically incapable of detecting the differences they are being asked to measure. This report replaces all three with a defensible, reproducible protocol that enterprise teams can present under procurement scrutiny.
Honest disqualification. If none of the above matches you, this report was not written for you.
Task decomposition accuracy, tool use precision, hallucination rate, trajectory efficiency, latency P95 — complete 7-metric measurement framework with statistical grounding.
Inter-rater reliability scoring (Cohen's kappa), systematic bias identification, and 5-step calibration procedure. Includes evaluation prompt templates that have been validated against human raters.
Sample sizing formulas, confidence interval calculation, and the minimum detectable effect at common n values. Know before you run whether your evaluation can answer the question you're asking.
Each item mapped to the specific incident class it prevents — not compliance boxes, but documented failure modes with evidence requirements.
The artifact set that makes evaluation results reproducible: model version, prompt hash, evaluation set manifest, judge calibration record. Critical for model rotation and procurement.
Structured adversarial test suite covering prompt injection, context exhaustion, tool failure cascades, and rug-pull server behavior. Three-day exercise design included.
All 7 sections — scroll down to read.
Why most agent evaluations are unreliable: three failure modes, grounded in research (ReAct, SWE-bench), that explain the 15–25 point benchmark-to-production gap.
The evaluation problem in agent systems is significantly harder than in static NLP benchmarks, and most teams underestimate this by an order of magnitude. A static model evaluation asks: given input X, does the model produce output Y? An agent evaluation asks: given environment E and goal G, does the agent achieve G over a trajectory of N steps, using tools T, while satisfying constraints C? The state space explodes combinatorially.
Three failure modes dominate production evaluation programs:
Evaluating the demo, not the distribution. Teams build evaluation sets from their best-case examples — clear prompts, cooperative environments, well-specified goals. Production traffic is messier: ambiguous requests, edge cases, adversarial inputs, compounding errors. The SWE-bench benchmark found that leading models resolve 50–70% of curated GitHub issues in controlled conditions — but the same models operating as autonomous agents on unstructured real-world tasks show dramatically higher failure rates when the environment is not cooperative. An agent that scores 94% on a curated benchmark and 71% on production traffic is not a rare exception. It is the norm.
Treating LLM-as-judge as ground truth without calibration. Using a capable model (GPT-4o, Claude Sonnet) to evaluate agent outputs is a valid and scalable methodology. The problem is that uncalibrated judge models have systematic biases: they favor longer responses, responses that sound confident, and responses that match their own stylistic patterns. Research on LLM-as-judge consistency (Ye et al., 2024) found systematic length bias across all major models — longer responses received higher scores independent of quality. Without a calibration step against human judgments on a representative sample, your eval pipeline has an unknown and potentially large directional error.
Ignoring trajectory evaluation in favor of output evaluation. The ReAct framework (Yao et al., 2023) demonstrated that the reasoning trace — not just the final answer — is the primary signal for evaluating agent quality. If your agent uses 14 tool calls to accomplish a task that should require 3, and produces the correct final output, most output-only evaluation systems will score it as a success. In production, that 14-call trajectory means higher latency, 4.7x higher cost, and an error surface 5x larger than the efficient path. Trajectory efficiency is a first-class metric.
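One minimal way to formalize trajectory efficiency as a first-class metric (an illustrative sketch, not a standard definition; the reference path length is assumed to come from an expert-annotated or known-efficient solution):

```python
def trajectory_efficiency(actual_tool_calls: int, reference_tool_calls: int) -> float:
    """Ratio of the known-efficient path length to the observed path length,
    capped at 1.0. A score of 1.0 means no wasted steps."""
    if actual_tool_calls <= 0:
        raise ValueError("actual_tool_calls must be positive")
    return min(1.0, reference_tool_calls / actual_tool_calls)

# The 14-call trajectory from the example above, against a 3-call reference path:
print(f"{trajectory_efficiency(14, 3):.2f}")
```

An output-only evaluator scores both trajectories as successes; this metric separates them.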
The MDE table: minimum detectable effect at n=25, 50, 100, 200, 500 — and why most architectural decisions are being made from statistical noise.
Most production agent evaluations are statistically underpowered. This is not a minor methodological issue — it means the evaluations cannot detect real performance differences from random variation, and the architectural decisions made from them are based on noise.
The core problem: teams run evaluations on 20, 30, or 50 examples because larger sets are expensive to create and review. They observe a difference — say, Model A scores 76% versus Model B's 71% — and make an architectural decision. What they don't calculate is whether this difference is statistically distinguishable from chance.
The minimum detectable effect at common evaluation set sizes (80% power, α = 0.05):
n = 25: Minimum detectable difference ≈ 28 percentage points. You cannot reliably distinguish 76% from 71%, or 80% from 60%, with 25 examples.
n = 50: Minimum detectable difference ≈ 20 percentage points. You can detect 76% vs. 56%, but not 76% vs. 66%.
n = 100: Minimum detectable difference ≈ 14 percentage points. Sufficient for detecting differences of practical significance in most agent evaluation contexts.
n = 200: Minimum detectable difference ≈ 10 percentage points. Recommended minimum for production evaluation sets where architectural decisions carry real cost and risk.
n = 500: Minimum detectable difference ≈ 6 percentage points. Required for high-stakes model selection decisions where you need to detect subtle quality differences.
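The table above can be reproduced with the standard two-proportion power approximation. The figures match under an assumed baseline accuracy around 85% (our assumption; the MDE grows somewhat as the baseline moves toward 50%):

```python
from statistics import NormalDist

def mde_two_proportion(n: int, p_bar: float = 0.85,
                       alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate minimum detectable difference between two proportions,
    two-sided test, n examples per arm, pooled baseline rate p_bar."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_beta = NormalDist().inv_cdf(power)           # ~0.84
    return (z_alpha + z_beta) * (2 * p_bar * (1 - p_bar) / n) ** 0.5

for n in (25, 50, 100, 200, 500):
    print(f"n={n:3d}  MDE ~ {mde_two_proportion(n):.0%}")
```

Running this prints MDEs of roughly 28, 20, 14, 10, and 6 percentage points, matching the table.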
Why most teams evaluate with n < 50 and what to do about it:
The cost of human labeling drives evaluation set sizes down. Teams annotate 30–50 examples, run their eval, get a number, and make a decision. The statistical reality is that they are making a decision from data that cannot distinguish 10-point differences from random chance.
The practical fix: Stratified sampling over LLM-generated test cases, with human validation only on a random 20% subsample. This lets you build 500-example evaluation sets with the labeling cost of 100-example sets. The LLM generates plausible test cases across all task categories in your distribution; humans validate a random sample to verify the generated test cases are representative. The remaining 80% are used with LLM-as-judge scoring only, which is valid because the calibration procedure (Section 3) ensures your judge is aligned with human ratings.
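The stratified 20% human-validation split can be sketched in a few lines; the category names and data shapes here are placeholders, not a prescribed schema:

```python
import random

def split_for_validation(cases_by_category, human_fraction=0.20, seed=42):
    """Stratified split: within each task category, route a fixed fraction of
    LLM-generated cases to human validation, the rest to LLM-as-judge only."""
    rng = random.Random(seed)
    human_review, judge_only = [], []
    for category, cases in cases_by_category.items():
        shuffled = list(cases)
        rng.shuffle(shuffled)
        k = max(1, round(human_fraction * len(shuffled)))  # at least one per stratum
        human_review += [(category, c) for c in shuffled[:k]]
        judge_only += [(category, c) for c in shuffled[k:]]
    return human_review, judge_only
```

Stratifying per category (rather than sampling 20% globally) guarantees every slice of the task distribution gets some human eyes.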
Confidence interval reporting: Every evaluation result should be reported with a 95% confidence interval, not just a point estimate. '76% accuracy (95% CI: 64–88%)' from n=50 is honest. '76% accuracy' from n=50 without a CI is misleading — the true value could plausibly be anywhere from 64% to 88%.
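The interval width at n=50 can be checked with the normal-approximation (Wald) formula. This is a minimal sketch; exact bounds vary slightly by method, and a Wilson interval is preferable at small n or extreme proportions:

```python
from statistics import NormalDist

def wald_ci(p_hat: float, n: int, conf: float = 0.95):
    """Normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    half = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

lo, hi = wald_ci(0.76, 50)
print(f"76% accuracy (95% CI: {lo:.0%}-{hi:.0%})")  # roughly 64%-88% at n=50
```

A 24-point-wide interval is exactly why a 5-point observed difference at n=50 is not evidence.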
Judge model bias taxonomy: four systematic biases that corrupt your eval pipeline, with the exact 5-step calibration process and rubric corrections.
LLM-as-judge is the right approach for scaling evaluation — but only after calibration. The research literature on LLM judge reliability (Ye et al., 2024) identifies four systematic biases that appear consistently across all major judge models and corrupt evaluation results at scale. Here is the exact calibration process that produces defensible automated evaluation.
The four systematic biases you must correct before deploying an LLM judge:
Length bias is the most pervasive and the most dangerous for agent evaluation specifically. Judge models consistently assign higher scores to longer responses, independent of accuracy or relevance. In agent evaluation, where responses involve multi-step reasoning traces, this bias actively selects for verbose, overconfident trajectories over concise, efficient ones. Correction: add explicit rubric language penalizing unnecessary verbosity and rewarding the minimum steps to achieve correct task completion.
Self-similarity bias occurs when a judge model rates its own outputs, or outputs from models with similar training distributions, more favorably. Teams using GPT-4o to evaluate GPT-4o outputs will consistently see inflated scores relative to human ratings. Correction: when possible, use a judge model from a different family than the model being evaluated.
Confidence bias causes judge models to reward responses that sound certain, even when certainty is unwarranted. This is particularly damaging for agent evaluation because it rewards hallucinated specificity. Correction: add explicit rubric criteria that penalize unsubstantiated confidence and reward appropriate hedging on uncertain outputs.
Position bias in pairwise comparisons causes judge models to prefer the first response shown, independent of quality. Correction: for any pairwise evaluation, run both orderings and take the average.
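The both-orderings correction is mechanical. In this sketch, `judge` is a hypothetical callable returning the probability that the first response shown is better; the toy judge below is invented purely to demonstrate the cancellation:

```python
def position_debiased_score(judge, response_a, response_b):
    """Run the pairwise judge in both orders and average, cancelling position bias.
    judge(first, second) is assumed to return P(first is better) in [0, 1]."""
    ab = judge(response_a, response_b)   # A shown first
    ba = judge(response_b, response_a)   # B shown first
    return (ab + (1.0 - ba)) / 2.0       # order-averaged P(A is better)

def toy_judge(first, second):
    base = 0.5 + (0.1 if len(first) > len(second) else -0.1)  # pretend quality signal
    return min(1.0, base + 0.2)                               # hard-coded position bump

score = position_debiased_score(toy_judge, "a longer response", "short")
print(f"{score:.2f}")  # the +0.2 position bump cancels, leaving only the quality signal
```

Whatever constant boost the first slot receives appears in both orderings with opposite signs and drops out of the average.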
The 5-step calibration process:
Step 1: Build a calibration set of 50–100 representative examples drawn from actual production traffic — not synthetic examples. Include clearly good outputs, clearly bad outputs, and the ambiguous middle (approximately 40% of real cases).
Step 2: Have two independent human raters score every example using a 1–5 scale with explicit per-level criteria. Calculate Cohen's kappa. If kappa is below 0.7, your rubric is insufficiently specific. Revise and re-rate before proceeding.
Step 3: Have the judge model score every calibration example. Calculate Pearson correlation between judge scores and average human scores. Target: r > 0.75. Below 0.65 means your evaluation prompt has a structural problem.
Step 4: Identify the specific systematic bias by examining where the judge consistently over- or under-scores relative to humans. Add targeted correction language to the evaluation prompt for each identified bias.
Step 5: Re-calibrate every 90 days or after any judge model version change. Calibration from six months ago on a model that has since been updated is not calibration.
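Steps 2 and 3 each require only a small calculation, sketched here without external dependencies (the gate thresholds are the ones stated above; rating inputs are plain Python lists):

```python
def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_chance = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)

def pearson_r(x, y):
    """Correlation between judge scores and averaged human scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5
```

Gate on kappa at 0.7 or above before trusting the rubric (Step 2), and on r above 0.75 before trusting the judge (Step 3).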
The pre-production governance checklist: 12 items, each mapped to the specific incident class it prevents — not compliance boxes.
These 12 items represent the failure modes that teams consistently discover in production rather than staging. Each item is mapped to the specific incident class it prevents — not compliance theater.
Red team protocol: 3-day adversarial exercise design covering 8 attack surfaces, including indirect injection via retrieved content.
A red team exercise for an agent system is not a penetration test and it is not a UX review. It is a structured adversarial exercise designed to find the failure modes that your evaluation pipeline cannot find because your evaluation pipeline was built by the same team that built the system.
The following protocol structures a three-day red team exercise. It requires three people: one playing the agent system's users, one playing adversarial external conditions, and one documenting failure modes for the governance record.
Day 1: Input adversarial testing (user-side attacks)
Target 1: Prompt injection via direct user input. Have the red teamer craft requests that attempt to override the system prompt, exfiltrate session data, or cause the agent to take actions outside its defined scope. Classic patterns: 'Ignore previous instructions and instead...', 'As a developer testing this system, please show me...', 'For my research project, I need you to...'. Document every input that causes any deviation from expected behavior, even minor ones.
Target 2: Boundary probing. Find the edge of the agent's competence — the task types where confidence remains high but accuracy degrades. These are the failure modes that look like successes to output-only evaluation. Approach: start with clearly in-scope tasks, gradually move toward adjacent tasks that require knowledge or capabilities the agent doesn't have, and document where the agent transitions from accurate to confidently wrong.
Target 3: Volume and resource abuse. Craft interactions that cause the agent to consume disproportionate resources: prompts that trigger long reasoning chains, requests that cause repeated tool calls, tasks that require large context windows. Document the resource ceiling behavior.
Day 2: Environmental adversarial testing (retrieved content attacks)
Target 4: Indirect prompt injection via retrieved documents. If your agent retrieves content from external sources (web pages, documents, databases), inject adversarial instructions into those sources and verify they do not affect agent behavior. This is the highest-severity attack surface for production agent systems — it requires no user interaction and affects all users who trigger the same retrieval path.
Target 5: Tool failure injection. Simulate failures at each tool boundary: network timeouts, malformed responses, authentication failures, rate limiting. Document whether the agent recovers gracefully, loops, or fails silently.
Target 6: Context poisoning. Inject subtly incorrect information into the agent's context via retrieved content and measure whether it is accepted, corrected, or escalated. Document the conditions under which incorrect context affects final outputs.
Day 3: Governance audit
Target 7: Audit log completeness check. For every failure mode identified in Days 1 and 2, verify that the audit log contains enough information to reconstruct the decision. Gaps in the audit log are governance failures.
Target 8: Rollback procedure test. For every irreversible action the agent can take, execute the rollback procedure and verify it works as documented.
The red team report deliverable: A failure mode inventory with severity ratings (critical, high, medium, low), reproduction steps, and recommended mitigations. This document is your evidence of due diligence in the pre-production governance record — and it is the most credible response to the first question any serious procurement committee will ask: 'Have you tried to break this system?'
The real cost architecture: per-session token budget controls, tool call compounding, and why a model routing layer cuts costs 40–60% versus uniform frontier model use.
The economics of production AI agent systems are not what they look like in prototypes. Here is the cost breakdown that should inform your architecture decisions before you are six months in.
Token cost has a floor and a ceiling problem. The floor: even simple classification tasks now run through models that cost real money at scale. 10,000 agent interactions per day at an average of 2,000 tokens each — a modest enterprise deployment — costs between $100 and $1,000 per day depending on model choice. The ceiling: without hard token budgets enforced at the infrastructure level, individual runaway sessions can generate 100x the expected cost. Research on compute-optimal inference (Snell et al., 2024) demonstrates that increasing test-time compute can improve quality, but without hard budget ceilings this creates unbounded cost exposure. Both the floor and the ceiling require explicit architecture decisions.
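The floor arithmetic is straightforward to verify. The per-1K-token prices here are illustrative assumptions chosen to bracket the stated range, not vendor quotes:

```python
interactions_per_day = 10_000
avg_tokens = 2_000
daily_tokens = interactions_per_day * avg_tokens  # 20M tokens/day

# Illustrative price points per 1K tokens (assumptions, not published pricing):
for tier, usd_per_1k in [("budget-tier model", 0.005), ("frontier model", 0.05)]:
    cost = daily_tokens / 1_000 * usd_per_1k
    print(f"{tier}: ${cost:,.0f}/day")
```

At these assumed prices the same 20M-token workload lands at $100/day or $1,000/day, which is why model choice is the first cost lever.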
Tool call cost compounds invisibly. Most teams budget for LLM token costs and underestimate or ignore the compound cost of tool calls: external API fees, database query costs, web search credits, and function execution compute. In a production multi-agent system, tool call costs often exceed LLM costs by 2x–3x once the system is handling real workloads.
The right cost architecture has three mandatory controls:
Control 1: Per-session token budget enforced at the gateway layer, not the prompt layer. Prompts can be overridden by the model; gateway limits cannot. Set limits at 3x expected maximum session cost.
Control 2: Tool call rate limiting per agent role, with automatic escalation to human review when a session exceeds expected tool usage by 3x.
Control 3: Daily cost alerts at 50%, 80%, and 100% of budget, routed to the named team member responsible for each agent deployment — not a shared channel where alerts are ignored.
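Control 1 reduces to a few lines at the gateway. This is a minimal in-process sketch under the stated 3x rule; a production gateway would persist the counter outside the agent process so the agent cannot reset it:

```python
class SessionTokenBudget:
    """Gateway-side hard ceiling, checked before each model call is made,
    so no prompt content can talk its way past it."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def reserve(self, tokens: int) -> None:
        if self.used + tokens > self.max_tokens:
            raise RuntimeError(
                f"session budget exceeded: {self.used + tokens} > {self.max_tokens}"
            )
        self.used += tokens

budget = SessionTokenBudget(max_tokens=30_000)  # 3x an expected 10K-token session
budget.reserve(12_000)  # within budget; a later oversized reserve() raises
```

The key property is that the check runs in infrastructure code, not in anything the model reads or writes.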
Model selection is a cost architecture decision, not a quality decision. The right model for a given task is the least capable model that reliably achieves the required quality level. Build a model routing layer early. Route simple classification and extraction tasks to cheaper models. Reserve frontier models for tasks that genuinely require their capabilities. A well-designed routing layer typically reduces per-session costs by 40–60% versus using frontier models uniformly.
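The 40–60% figure follows from simple mix arithmetic. The traffic mix, tier names, and prices below are assumptions chosen for illustration, not measurements:

```python
# Hypothetical traffic mix and per-1K-token prices (illustrative assumptions):
mix = {"classification": 0.30, "extraction": 0.20, "complex_reasoning": 0.50}
price_per_1k = {"small": 0.0005, "frontier": 0.0100}
tokens_per_session_k = 2.0  # thousands of tokens per session

uniform_cost = tokens_per_session_k * price_per_1k["frontier"]
routed_cost = (
    (mix["classification"] + mix["extraction"]) * tokens_per_session_k * price_per_1k["small"]
    + mix["complex_reasoning"] * tokens_per_session_k * price_per_1k["frontier"]
)
print(f"routing saves {1 - routed_cost / uniform_cost:.0%} per session")
```

With half the traffic routable to the small tier at these prices, savings land near the middle of the 40–60% band; a mix with more simple tasks pushes toward the top of it.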
How enterprise procurement actually evaluates agent systems: the three gates most teams fail — reproducibility audit, incident record, and governance control evidence.
Enterprise procurement of AI agent systems is fundamentally different from traditional software procurement, and most vendors — and most internal teams presenting to procurement — do not understand how to present the right evidence.
The old model was: run a demo, provide an uptime SLA, show SOC 2 certification, done. The new model has three additional gates that most teams are not prepared for.
Gate 1: Reproducibility audit. Enterprise procurement teams are now asking: 'Can you reproduce your benchmark results?' This means: given the same inputs, the same model version, the same prompt, and the same evaluation criteria, does your system produce the same outputs with the same quality scores? Most teams cannot answer yes because they did not instrument for reproducibility from the start. The reproducibility reporting standard in the full report covers the exact artifact set required.
Gate 2: Incident record. Sophisticated buyers are asking: 'What has gone wrong in production, and how did you handle it?' This is not a disqualifying question — it is a maturity signal. A team that can describe three specific production incidents, the root cause of each, the remediation applied, and the governance change that followed is demonstrably more trustworthy than a team that claims zero incidents. Zero incidents usually means insufficient monitoring, not perfect execution.
Gate 3: Governance control evidence. Procurement teams want a completed controls checklist with test results — not a vendor promise. Teams that produce evidence-backed answers on first submission move 3x faster through procurement. The evidence pack that converts fastest: (1) completed governance checklist with specific test results for each item, (2) one documented production incident with root cause and remediation, (3) model version pinning policy with a change management procedure.
The internal presentation mistake that kills enterprise deals: Teams presenting to procurement committees almost universally lead with capabilities and accuracy metrics. Procurement committees care first about liability, control, and reversibility. The conversion sequence that works: (1) what can go wrong and what is the blast radius, (2) what controls prevent or contain each failure mode, (3) what is the evidence those controls work, (4) only then — what the system does when it works correctly.
Every claim in this report traces to a verifiable source.
Last reviewed March 14, 2026
Who wrote this, what evidence shaped it, and how the recommendations are framed.
Author: Rare Agent Work · Written and maintained by the Rare Agent Work research team.
Proof 1: Calibration procedure grounded in inter-rater reliability methodology — Cohen's kappa thresholds, not guesswork.
Proof 2: 12-item pre-production governance checklist mapped to specific incident classes, not generic compliance boxes.
Proof 3: Statistical significance section covers sample sizing, confidence intervals, and the p-value mistakes that invalidate most agent evaluations.
When the report isn't enough
Architecture review, implementation rescue, and strategy calls for teams with real blockers. Every intake is read by a human before any next step.