Rare Agent Work

Rare Agent Work · Empirical Strategy Brief

Rev 3.1 · Updated March 14, 2026

Free · open access · 44-minute strategy brief + governance scorecard + red team worksheet · For technical leaders, architects, and B2B operators deploying AI at scale

Agent Architecture: Empirical Research Edition

Production-grade evaluation, reproducibility, and governance

Build a defensible, reproducible evaluation protocol and governance framework for production AI systems — with real statistical grounding, not benchmark theater.

What this report gives you

  • 01 · Statistical validity section: minimum evaluation set sizes (n=100 for a 14-point MDE, n=200 recommended) — most teams are making decisions from noise
  • 02 · Red team exercise protocol: 3-day structured adversarial test covering prompt injection, boundary probing, indirect injection via retrieved content, and tool failure cascades
  • 03 · Judge model bias taxonomy: length bias, self-similarity bias, confidence bias, position bias — and the exact rubric corrections for each
  • 04 · Model routing economics: a well-designed routing layer reduces per-session costs 40–60% versus uniform frontier model use
  • 05 · The reproducibility artifact set procurement actually asks for: model version pin, prompt hash, evaluation manifest, calibration record

The finding that changes your next decision

“With n=50 — the most common evaluation set size — you cannot statistically distinguish 76% accuracy from 71%: the minimum detectable difference is 20 percentage points, which means most teams presenting model comparison results are presenting noise dressed as evidence, and the framework decisions that follow from that noise compound into months of architectural debt.”

This report is right for you if any of these are true

  • ✓ You are building or presenting an agent evaluation program and need to know if your results are statistically valid before a procurement review or board presentation.
  • ✓ Your team is using an LLM as a judge and you have never calibrated it against human raters — or you suspect your evaluation scores are inflated.
  • ✓ You have an agent going to production and need a defensible 12-item governance checklist with evidence requirements, not a generic compliance template.
Read the full report — free ↓
✓ All sections open · No sign-up · No paywall

Why this report exists

Most agent evaluation programs fail not because teams skip evaluation — they fail because the evaluation setup has three structural flaws that make the results untrustworthy before a single test is run: benchmarks built from best-case examples instead of production traffic distributions, judge models deployed without calibration against human raters, and sample sizes that are statistically incapable of detecting the differences they are being asked to measure. This report replaces all three with a defensible, reproducible protocol that enterprise teams can present under procurement scrutiny.
🏛️

What's at stake

  • → Enterprise procurement committees now routinely ask for calibration evidence, reproducibility records, and red team results as part of AI vendor qualification. Teams that cannot produce this evidence are losing deals — not on product quality, but on their inability to demonstrate process maturity.
  • → Model selection and framework decisions made from uncalibrated evaluation pipelines carry false precision that compounds into architectural debt. The same judge-model bias that inflated your evaluation scores also made your framework comparison look more decisive than it was — meaning you may be operating on the wrong architecture for the wrong reasons.
  • → The cost of establishing evaluation rigor before production is roughly 2–3 weeks of engineering time. The cost of retrofitting evaluation governance after a procurement challenge or a production incident ranges from 6 weeks to contract loss. The asymmetry is not subtle.
⚡

Decision sequence

  • 01 · Before running any evaluation: calculate your minimum detectable effect at your planned sample size. If n=50, you cannot distinguish 76% from 71%. If that gap matters to your decision, you need more data — and knowing this before you run saves weeks of false confidence.
  • 02 · Before presenting results to procurement or a board: assemble the reproducibility pack — model version pin, prompt hash, evaluation set manifest, judge calibration record. Teams that cannot produce this on first request lose enterprise deals that are otherwise won.
  • 03 · Before any agent system goes to production: run the 12-item governance checklist with test evidence for each item, not assertions. Items 4 (adversarial prompt test) and 8 (cost budget enforcement) are the two items that generate the most post-launch incidents when skipped.
⚠️

Cost of skipping this

  • ✕ Teams that rely on curated benchmark sets for model selection are often choosing the wrong model — the benchmark-to-production performance gap averages 15–25 percentage points. A model that won the internal benchmark may perform worse in production than the alternative you didn't ship.
  • ✕ Without model version pinning, a silent upstream model update can change your agent's behavior in ways that break your evaluation baseline and make production incidents hard to attribute. This has happened to multiple enterprise teams post-deployment.
  • ✕ Judge-model bias that goes uncorrected doesn't just produce wrong scores — it produces wrong decisions. Teams optimizing against a length-biased judge tend to ship agents that are verbose and confidently wrong, which is exactly the failure mode that surfaces in customer-facing quality complaints and procurement security reviews.
✋

Who this report is NOT for

  • ✕ Individual developers or small startups where the buyer and builder are the same person — the governance and procurement content is enterprise-specific
  • ✕ Teams using agent systems purely for internal tooling with no external user exposure or compliance requirements
  • ✕ Anyone looking for a model benchmarking guide or platform comparison — this report is about evaluation methodology and governance, not model selection

Honest disqualification: if none of the above describes you, this report was written for you.

What's Inside

6 deliverables
📐

Evaluation Protocol Template

Task decomposition accuracy, tool use precision, hallucination rate, trajectory efficiency, latency P95, and more — a complete 7-metric measurement framework with statistical grounding.

⚖️

LLM-as-Judge Calibration Guide

Inter-rater reliability scoring (Cohen's kappa), systematic bias identification, and 5-step calibration procedure. Includes evaluation prompt templates that have been validated against human raters.

📊

Statistical Significance Reference Card

Sample sizing formulas, confidence interval calculation, and the minimum detectable effect at common n values. Know before you run whether your evaluation can answer the question you're asking.

🏛️

12-Item Pre-Production Governance Checklist

Each item mapped to the specific incident class it prevents — not compliance boxes, but documented failure modes with evidence requirements.

🔁

Reproducibility Reporting Standard

The artifact set that makes evaluation results reproducible: model version, prompt hash, evaluation set manifest, judge calibration record. Critical for model rotation and procurement.

🎯

Red Team Exercise Protocol

Structured adversarial test suite covering prompt injection, context exhaustion, tool failure cascades, and rug-pull server behavior. Three-day exercise design included.

Full Report

All 7 sections — scroll down to read.

7 sections
01 · Why Most Agent Evaluations Are Unreliable · free

Why most agent evaluations are unreliable: three failure modes, grounded in research (ReAct, SWE-bench), that explain the 15–25 point benchmark-to-production gap.

02 · Statistical Validity: The Evaluation Mistake That Makes Your Results Meaningless · free

The MDE table: minimum detectable effect at n=25, 50, 100, 200, 500 — and why most architectural decisions are being made from statistical noise.

03 · Building a Judge Model That You Can Actually Trust · free

Judge model bias taxonomy: four systematic biases that corrupt your eval pipeline, with the exact 5-step calibration process and rubric corrections.

04 · The Pre-Production Governance Checklist · free

The pre-production governance checklist: 12 items, each mapped to the specific incident class it prevents — not compliance boxes.

05 · Red Team Protocol: Finding the Failures Before Production Does · free

Red team protocol: 3-day adversarial exercise design covering 8 attack surfaces, including indirect injection via retrieved content.

06 · The Cost Architecture Nobody Talks About · free

The real cost architecture: per-session token budget controls, tool call compounding, and why a model routing layer cuts costs 40–60% versus uniform frontier model use.

07 · How Procurement Teams Actually Evaluate Agent Systems — And What Most Vendors Miss · free

How enterprise procurement actually evaluates agent systems: the three gates most teams fail — reproducibility audit, incident record, and governance control evidence.

01

Why Most Agent Evaluations Are Unreliable

Why most agent evaluations are unreliable: three failure modes, grounded in research (ReAct, SWE-bench), that explain the 15–25 point benchmark-to-production gap.

The evaluation problem in agent systems is significantly harder than in static NLP benchmarks, and most teams underestimate this by an order of magnitude. A static model evaluation asks: given input X, does the model produce output Y? An agent evaluation asks: given environment E and goal G, does the agent achieve G over a trajectory of N steps, using tools T, while satisfying constraints C? The state space explodes combinatorially.

Three failure modes dominate production evaluation programs:

Evaluating the demo, not the distribution. Teams build evaluation sets from their best-case examples — clear prompts, cooperative environments, well-specified goals. Production traffic is messier: ambiguous requests, edge cases, adversarial inputs, compounding errors. On SWE-bench (Jimenez et al., 2024), leading models resolve 50–70% of curated GitHub issues under controlled conditions — but the same models operating as autonomous agents on unstructured real-world tasks show dramatically higher failure rates when the environment is not cooperative. An agent that scores 94% on a curated benchmark and 71% on production traffic is not a rare exception. It is the norm.

Treating LLM-as-judge as ground truth without calibration. Using a capable model (GPT-4o, Claude Sonnet) to evaluate agent outputs is a valid and scalable methodology. The problem is that uncalibrated judge models have systematic biases: they favor longer responses, responses that sound confident, and responses that match their own stylistic patterns. Research on LLM-as-judge consistency (Ye et al., 2024) found systematic length bias across all major models — longer responses received higher scores independent of quality. Without a calibration step against human judgments on a representative sample, your eval pipeline has an unknown and potentially large directional error.

Ignoring trajectory evaluation in favor of output evaluation. The ReAct framework (Yao et al., 2023) demonstrated that the reasoning trace — not just the final answer — is the primary signal for evaluating agent quality. If your agent uses 14 tool calls to accomplish a task that should require 3, and produces the correct final output, most output-only evaluation systems will score it as a success. In production, that 14-call trajectory means higher latency, 4.7x higher cost, and an error surface 5x larger than the efficient path. Trajectory efficiency is a first-class metric.

02

Statistical Validity: The Evaluation Mistake That Makes Your Results Meaningless

The MDE table: minimum detectable effect at n=25, 50, 100, 200, 500 — and why most architectural decisions are being made from statistical noise.

Most production agent evaluations are statistically underpowered. This is not a minor methodological issue — it means the evaluations cannot detect real performance differences from random variation, and the architectural decisions made from them are based on noise.

The core problem: teams run evaluations on 20, 30, or 50 examples because larger sets are expensive to create and review. They observe a difference — say, Model A scores 76% versus Model B's 71% — and make an architectural decision. What they don't calculate is whether this difference is statistically distinguishable from chance.

The minimum detectable effect at common evaluation set sizes (80% power, α = 0.05):

n = 25: Minimum detectable difference ≈ 28 percentage points. You cannot reliably distinguish 76% from 71%, or 80% from 60%, with 25 examples.

n = 50: Minimum detectable difference ≈ 20 percentage points. You can detect 76% vs. 56%, but not 76% vs. 66%.

n = 100: Minimum detectable difference ≈ 14 percentage points. Sufficient for detecting differences of practical significance in most agent evaluation contexts.

n = 200: Minimum detectable difference ≈ 10 percentage points. Recommended minimum for production evaluation sets where architectural decisions carry real cost and risk.

n = 500: Minimum detectable difference ≈ 6 percentage points. Required for high-stakes model selection decisions where you need to detect subtle quality differences.
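The MDE figures above follow from a one-sample normal approximation with worst-case variance (p = 0.5) at 80% power and α = 0.05. A minimal stdlib-only sketch, assuming that formulation:

```python
from statistics import NormalDist

def mde(n: int, alpha: float = 0.05, power: float = 0.80, p: float = 0.5) -> float:
    """Minimum detectable effect in percentage points for an accuracy
    comparison at sample size n, using the conservative worst-case
    binomial variance at p = 0.5."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided, ≈ 1.96
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84 at 80% power
    return 100 * (z_alpha + z_beta) * (p * (1 - p) / n) ** 0.5

for n in (25, 50, 100, 200, 500):
    print(f"n = {n}: MDE ≈ {mde(n):.0f} percentage points")
```

Run this before building the evaluation set: if the MDE at your planned n is larger than the difference you care about, the evaluation cannot answer your question.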

Why most teams evaluate with n < 50 and what to do about it:

The cost of human labeling drives evaluation set sizes down. Teams annotate 30–50 examples, run their eval, get a number, and make a decision. The statistical reality is that they are making a decision from data that cannot distinguish 10-point differences from random chance.

The practical fix: Stratified sampling over LLM-generated test cases, with human validation only on a random 20% subsample. This lets you build 500-example evaluation sets with the labeling cost of 100-example sets. The LLM generates plausible test cases across all task categories in your distribution; humans validate a random sample to verify the generated test cases are representative. The remaining 80% are used with LLM-as-judge scoring only, which is valid because the calibration procedure (Section 3) ensures your judge is aligned with human ratings.

Confidence interval reporting: Every evaluation result should be reported with a 95% confidence interval, not just a point estimate. '76% accuracy (95% CI: 69–83%)' is honest. '76% accuracy' from n=50 without a CI is misleading — the true value could be anywhere from 62% to 88%.
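The interval itself can be computed with the Wilson score method, which behaves better than the plain normal approximation at small n. A stdlib-only sketch:

```python
from statistics import NormalDist

def wilson_ci(correct: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Wilson score confidence interval for an accuracy estimate."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * ((p * (1 - p) + z * z / (4 * n)) / n) ** 0.5 / denom
    return center - half, center + half

lo, hi = wilson_ci(38, 50)  # 76% observed accuracy on n = 50
print(f"76% accuracy (95% CI: {lo:.0%}–{hi:.0%})")
```

Report the interval, not the point estimate, in every results table.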


03

Building a Judge Model That You Can Actually Trust

Judge model bias taxonomy: four systematic biases that corrupt your eval pipeline, with the exact 5-step calibration process and rubric corrections.

LLM-as-judge is the right approach for scaling evaluation — but only after calibration. The research literature on LLM judge reliability (Ye et al., 2024) identifies four systematic biases that appear consistently across all major judge models and corrupt evaluation results at scale. Here is the exact calibration process that produces defensible automated evaluation.

The four systematic biases you must correct before deploying an LLM judge:

Length bias is the most pervasive and the most dangerous for agent evaluation specifically. Judge models consistently assign higher scores to longer responses, independent of accuracy or relevance. In agent evaluation, where responses involve multi-step reasoning traces, this bias actively selects for verbose, overconfident trajectories over concise, efficient ones. Correction: add explicit rubric language penalizing unnecessary verbosity and rewarding the minimum steps to achieve correct task completion.

Self-similarity bias occurs when a judge model rates its own outputs, or outputs from models with similar training distributions, more favorably. Teams using GPT-4o to evaluate GPT-4o outputs will consistently see inflated scores relative to human ratings. Correction: when possible, use a judge model from a different family than the model being evaluated.

Confidence bias causes judge models to reward responses that sound certain, even when certainty is unwarranted. This is particularly damaging for agent evaluation because it rewards hallucinated specificity. Correction: add explicit rubric criteria that penalize unsubstantiated confidence and reward appropriate hedging on uncertain outputs.

Position bias in pairwise comparisons causes judge models to prefer the first response shown, independent of quality. Correction: for any pairwise evaluation, run both orderings and take the average.
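The position-bias correction can be sketched as a wrapper around the judge call. The judge interface here is an assumption for illustration: a callable returning the probability (0–1) that the first-listed response is better.

```python
def debiased_preference(judge, prompt: str, resp_a: str, resp_b: str) -> float:
    """Pairwise judging with position-bias correction: score both
    orderings and average. An additive bias toward the first position
    cancels exactly under this scheme."""
    s_ab = judge(prompt, resp_a, resp_b)  # A shown in first position
    s_ba = judge(prompt, resp_b, resp_a)  # B shown in first position
    return (s_ab + (1 - s_ba)) / 2        # preference for A, debiased

# Toy judge with pure position bias: equal-quality responses, but the
# first-listed one always gets +0.1.
def biased_judge(prompt, first, second):
    return min(1.0, 0.5 + 0.1)

score = debiased_preference(biased_judge, "q", "answer A", "answer B")
print(score)  # equal quality recovered despite the bias
```

The same two-orderings trick applies to any pairwise rubric; it doubles judge cost, which is the price of a defensible comparison.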

The 5-step calibration process:

Step 1: Build a calibration set of 50–100 representative examples drawn from actual production traffic — not synthetic examples. Include clearly good outputs, clearly bad outputs, and the ambiguous middle (approximately 40% of real cases).

Step 2: Have two independent human raters score every example using a 1–5 scale with explicit per-level criteria. Calculate Cohen's kappa. If kappa is below 0.7, your rubric is insufficiently specific. Revise and re-rate before proceeding.

Step 3: Have the judge model score every calibration example. Calculate Pearson correlation between judge scores and average human scores. Target: r > 0.75. Below 0.65 means your evaluation prompt has a structural problem.

Step 4: Identify the specific systematic bias by examining where the judge consistently over- or under-scores relative to humans. Add targeted correction language to the evaluation prompt for each identified bias.

Step 5: Re-calibrate every 90 days or after any judge model version change. Calibration from six months ago on a model that has since been updated is not calibration.

04

The Pre-Production Governance Checklist

The pre-production governance checklist: 12 items, each mapped to the specific incident class it prevents — not compliance boxes.

These 12 items represent the failure modes that teams consistently discover in production rather than staging. Each item is mapped to the specific incident class it prevents — not compliance theater.

1 · Idempotency verification — Every irreversible action (send, create, charge, post) has been tested for duplicate execution. What happens if the agent runs the same action twice? Maps to: bulk-send incident class.
2 · Rate limit handling — All external API calls have retry logic with exponential backoff. The agent degrades gracefully when rate-limited rather than looping. Maps to: tool failure cascade class.
3 · Context window exhaustion test — What happens in session 50, after the context is full? Has this been tested explicitly? Maps to: memory degradation and orchestration drift class.
4 · Adversarial prompt test — Has the system been tested against prompt injection via user input, retrieved documents, and tool outputs? Maps to: MCP poisoning and indirect injection class.
5 · Tool failure cascade test — What happens when a tool the agent depends on returns an error? Does the agent recover gracefully or spin? Maps to: orchestration deadlock class.
6 · Human escalation path — Is there a defined and tested path for the agent to escalate to a human when it detects it is operating outside its competence boundary? Maps to: confidence boundary violation class.
7 · Audit log completeness — Every agent action is logged with enough context to reconstruct the decision. Logs are stored outside the agent's own memory. Maps to: incident investigation and regulatory compliance class.
8 · Cost budget enforcement — There is a hard ceiling on token spend and tool call count per session, enforced at the infrastructure level, not the prompt level. Maps to: cost explosion class.
9 · PII handling verification — Any personally identifiable information that enters the agent's context has a documented handling policy and is not logged in plaintext. Maps to: data exposure and regulatory breach class.
10 · Rollback procedure — There is a documented and tested procedure to reverse any action the agent can take that has real-world consequences. Maps to: production incident recovery class.
11 · Model version pinning — The production deployment is pinned to a specific model version. Automatic model updates are disabled. Maps to: reproducibility failure and silent behavior drift class.
12 · Evaluation pipeline coverage — The automated eval pipeline covers at least 80% of the task categories present in production traffic. Maps to: evaluation blindspot class.

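Item 1 can be sketched with a deterministic idempotency key: the same session, action, and payload always produce the same key, so a replayed action is detectable before it executes. The function names and the in-memory dedupe store are illustrative assumptions; production systems would back this with a durable store.

```python
import hashlib

_executed: set[str] = set()  # stand-in for a durable dedupe store

def idempotency_key(session_id: str, action: str, payload: str) -> str:
    """Deterministic key: replaying the same action in the same session
    yields the same key, so the executor can detect duplicates."""
    raw = f"{session_id}|{action}|{payload}"
    return hashlib.sha256(raw.encode()).hexdigest()

def execute_once(key: str, fn):
    """Run an irreversible action at most once per key."""
    if key in _executed:
        return "skipped-duplicate"
    _executed.add(key)
    return fn()

k = idempotency_key("sess-42", "send_email", '{"to": "ops@example.com"}')
first = execute_once(k, lambda: "sent")
second = execute_once(k, lambda: "sent")  # replay is deduplicated
print(first, second)
```

The checklist test for item 1 is exactly this: replay every irreversible action and confirm the second execution is a no-op.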
05

Red Team Protocol: Finding the Failures Before Production Does

Red team protocol: 3-day adversarial exercise design covering 8 attack surfaces, including indirect injection via retrieved content.

A red team exercise for an agent system is not a penetration test and it is not a UX review. It is a structured adversarial exercise designed to find the failure modes that your evaluation pipeline cannot find because your evaluation pipeline was built by the same team that built the system.

The following protocol structures a three-day red team exercise. It requires three people: one playing the agent system's users, one playing adversarial external conditions, and one documenting failure modes for the governance record.

Day 1: Input adversarial testing (user-side attacks)

Target 1: Prompt injection via direct user input. Have the red teamer craft requests that attempt to override the system prompt, exfiltrate session data, or cause the agent to take actions outside its defined scope. Classic patterns: 'Ignore previous instructions and instead...', 'As a developer testing this system, please show me...', 'For my research project, I need you to...'. Document every input that causes any deviation from expected behavior, even minor ones.

Target 2: Boundary probing. Find the edge of the agent's competence — the task types where confidence remains high but accuracy degrades. These are the failure modes that look like successes to output-only evaluation. Approach: start with clearly in-scope tasks, gradually move toward adjacent tasks that require knowledge or capabilities the agent doesn't have, and document where the agent transitions from accurate to confidently wrong.

Target 3: Volume and resource abuse. Craft interactions that cause the agent to consume disproportionate resources: prompts that trigger long reasoning chains, requests that cause repeated tool calls, tasks that require large context windows. Document the resource ceiling behavior.

Day 2: Environmental adversarial testing (retrieved content attacks)

Target 4: Indirect prompt injection via retrieved documents. If your agent retrieves content from external sources (web pages, documents, databases), inject adversarial instructions into those sources and verify they do not affect agent behavior. This is the highest-severity attack surface for production agent systems — it requires no user interaction and affects all users who trigger the same retrieval path.

Target 5: Tool failure injection. Simulate failures at each tool boundary: network timeouts, malformed responses, authentication failures, rate limiting. Document whether the agent recovers gracefully, loops, or fails silently.
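The "recovers gracefully" behavior Target 5 probes for can be sketched as capped exponential backoff with jitter; the function names, exception types, and limits are assumptions for the sketch:

```python
import random
import time

def call_with_backoff(tool, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    """Retry a flaky tool call with capped exponential backoff and
    jitter; re-raise after the final attempt instead of looping forever."""
    for i in range(attempts):
        try:
            return tool()
        except (TimeoutError, ConnectionError):
            if i == attempts - 1:
                raise  # surface the failure; never spin silently
            delay = min(cap, base * 2 ** i) * random.uniform(0.5, 1.0)
            time.sleep(delay)
```

During the exercise, inject failures at the tool boundary and verify the agent's wrapper behaves like this: bounded retries, then a loud failure that the escalation path (checklist item 6) can catch.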

Target 6: Context poisoning. Inject subtly incorrect information into the agent's context via retrieved content and measure whether it is accepted, corrected, or escalated. Document the conditions under which incorrect context affects final outputs.

Day 3: Governance audit

Target 7: Audit log completeness check. For every failure mode identified in Days 1 and 2, verify that the audit log contains enough information to reconstruct the decision. Gaps in the audit log are governance failures.

Target 8: Rollback procedure test. For every irreversible action the agent can take, execute the rollback procedure and verify it works as documented.

The red team report deliverable: A failure mode inventory with severity ratings (critical, high, medium, low), reproduction steps, and recommended mitigations. This document is your evidence of due diligence in the pre-production governance record — and it is the most credible response to the first question any serious procurement committee will ask: 'Have you tried to break this system?'

06

The Cost Architecture Nobody Talks About

The real cost architecture: per-session token budget controls, tool call compounding, and why a model routing layer cuts costs 40–60% versus uniform frontier model use.

The economics of production AI agent systems are not what they look like in prototypes. Here is the cost breakdown that should inform your architecture decisions before you are six months in.

Token cost has a floor and a ceiling problem. The floor: even simple classification tasks now run through models that cost real money at scale. 10,000 agent interactions per day at an average of 2,000 tokens each — a modest enterprise deployment — costs between $100 and $1,000 per day depending on model choice. The ceiling: without hard token budgets enforced at the infrastructure level, individual runaway sessions can generate 100x the expected cost. Research on compute-optimal inference (Snell et al., 2024) demonstrates that increasing test-time compute can improve quality, but without hard budget ceilings this creates unbounded cost exposure. Both the floor and the ceiling require explicit architecture decisions.

Tool call cost compounds invisibly. Most teams budget for LLM token costs and underestimate or ignore the compound cost of tool calls: external API fees, database query costs, web search credits, and function execution compute. In a production multi-agent system, tool call costs often exceed LLM costs by 2x–3x once the system is handling real workloads.

The right cost architecture has three mandatory controls:

Control 1: Per-session token budget enforced at the gateway layer, not the prompt layer. Prompts can be overridden by the model; gateway limits cannot. Set limits at 3x expected maximum session cost.

Control 2: Tool call rate limiting per agent role, with automatic escalation to human review when a session exceeds expected tool usage by 3x.

Control 3: Daily cost alerts at 50%, 80%, and 100% of budget, routed to the named team member responsible for each agent deployment — not a shared channel where alerts are ignored.

Model selection is a cost architecture decision, not a quality decision. The right model for a given task is the least capable model that reliably achieves the required quality level. Build a model routing layer early. Route simple classification and extraction tasks to cheaper models. Reserve frontier models for tasks that genuinely require their capabilities. A well-designed routing layer typically reduces per-session costs by 40–60% versus using frontier models uniformly.
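The routing layer and Control 1 can be sketched together. Model names, task types, and limits here are illustrative assumptions, not vendor facts:

```python
# Hypothetical routing table: cheap tier for routine tasks, frontier
# tier only where capability is genuinely required.
ROUTES = {
    "classification": "small-model",
    "extraction": "small-model",
    "multi_step_reasoning": "frontier-model",
}

def route(task_type: str) -> str:
    # Unknown task types fail safe to the cheap tier for triage.
    return ROUTES.get(task_type, "small-model")

class SessionBudget:
    """Hard per-session ceiling enforced outside the prompt layer —
    prompts can be overridden by the model; this guard cannot."""
    def __init__(self, max_tokens: int, max_tool_calls: int):
        self.tokens_left = max_tokens
        self.tool_calls_left = max_tool_calls

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        self.tokens_left -= tokens
        self.tool_calls_left -= tool_calls
        if self.tokens_left < 0 or self.tool_calls_left < 0:
            raise RuntimeError("session budget exceeded — halt and escalate")
```

In a real deployment the equivalent of `SessionBudget` lives in the gateway, not the agent process, so a runaway session is stopped by infrastructure rather than by its own prompt.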

07

How Procurement Teams Actually Evaluate Agent Systems — And What Most Vendors Miss

How enterprise procurement actually evaluates agent systems: the three gates most teams fail — reproducibility audit, incident record, and governance control evidence.

Enterprise procurement of AI agent systems is fundamentally different from traditional software procurement, and most vendors — and most internal teams presenting to procurement — do not understand how to present the right evidence.

The old model was: demonstrate a demo, provide uptime SLA, show SOC 2 certification, done. The new model has three additional gates that most teams are not prepared for.

Gate 1: Reproducibility audit. Enterprise procurement teams are now asking: 'Can you reproduce your benchmark results?' This means: given the same inputs, the same model version, the same prompt, and the same evaluation criteria, does your system produce the same outputs with the same quality scores? Most teams cannot answer yes because they did not instrument for reproducibility from the start. The reproducibility reporting standard in the full report covers the exact artifact set required.
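The artifact set Gate 1 asks for can be assembled mechanically. A minimal sketch, assuming the four-artifact pack described in this report; all names and example values are hypothetical:

```python
import hashlib
import json

def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def reproducibility_manifest(model_version, system_prompt, eval_examples, calibration):
    """Assemble the four-artifact pack: model pin, prompt hash,
    evaluation set manifest, and judge calibration record."""
    return {
        "model_version": model_version,  # exact pinned version string
        "prompt_hash": sha256_text(system_prompt),
        "eval_set_manifest": [
            sha256_text(json.dumps(ex, sort_keys=True)) for ex in eval_examples
        ],
        "judge_calibration": calibration,  # e.g. {"kappa": 0.75, "pearson_r": 0.81}
    }

m = reproducibility_manifest(
    "example-model-2026-01-15",  # hypothetical version label
    "You are a support agent...",
    [{"input": "q1", "expected": "a1"}],
    {"kappa": 0.75, "pearson_r": 0.81},
)
```

Because every field is either a pin or a content hash, re-running the same evaluation later either reproduces the manifest byte-for-byte or makes the drift visible.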

Gate 2: Incident record. Sophisticated buyers are asking: 'What has gone wrong in production, and how did you handle it?' This is not a disqualifying question — it is a maturity signal. A team that can describe three specific production incidents, the root cause of each, the remediation applied, and the governance change that followed is demonstrably more trustworthy than a team that claims zero incidents. Zero incidents usually means insufficient monitoring, not perfect execution.

Gate 3: Governance control evidence. Procurement teams want a completed controls checklist with test results — not a vendor promise. Teams that produce evidence-backed answers on first submission move 3x faster through procurement. The evidence pack that converts fastest: (1) completed governance checklist with specific test results for each item, (2) one documented production incident with root cause and remediation, (3) model version pinning policy with a change management procedure.

The internal presentation mistake that kills enterprise deals: Teams presenting to procurement committees almost universally lead with capabilities and accuracy metrics. Procurement committees care first about liability, control, and reversibility. The conversion sequence that works: (1) what can go wrong and what is the blast radius, (2) what controls prevent or contain each failure mode, (3) what is the evidence those controls work, (4) only then — what the system does when it works correctly.

Evidence & Citations

Every claim in this report traces to a verifiable source.

Last reviewed March 14, 2026

ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023)
https://arxiv.org/abs/2210.03629
Accessed March 14, 2026
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024)
https://arxiv.org/abs/2310.06770
Accessed March 14, 2026
AgentBench: Evaluating LLMs as Agents (Liu et al., 2023)
https://arxiv.org/abs/2308.03688
Accessed March 14, 2026
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (Ye et al., 2024)
https://arxiv.org/abs/2406.12624
Accessed March 14, 2026
NIST AI Risk Management Framework 1.0
https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
Accessed March 14, 2026
Anthropic's Model Card and Evaluation Methodology
https://www.anthropic.com/claude/model-card
Accessed March 14, 2026
Scaling LLM Test-Time Compute Optimally (Snell et al., 2024)
https://arxiv.org/abs/2408.03314
Accessed March 14, 2026
OWASP Top 10 for LLM Applications 2025
https://owasp.org/www-project-top-10-for-large-language-model-applications/
Accessed March 14, 2026

Methodology

Who wrote this, what evidence shaped it, and how the recommendations are framed.

  • ●Synthesizes evaluation methodology from peer-reviewed AI research (ReAct, SWE-bench, AgentBench) and production deployment practice.
  • ●Centers trajectory measurement, statistical validity, judge-model calibration, and reproducibility as first-class production requirements.
  • ●Governance framework maps each control to documented incident classes, not theoretical risk categories.

Author: Rare Agent Work · Written and maintained by the Rare Agent Work research team.

Why This Report Earns Attention

Proof 1

Calibration procedure grounded in inter-rater reliability methodology — Cohen's kappa thresholds, not guesswork.

Proof 2

12-item pre-production governance checklist mapped to specific incident classes, not generic compliance boxes.

Proof 3

Statistical significance section covers sample sizing, confidence intervals, and the p-value mistakes that invalidate most agent evaluations.


When the report isn't enough

Bring a real problem for direct human review.

Architecture review, implementation rescue, and strategy calls for teams with real blockers. Every intake is read by a human before any next step.

Start an Assessment · Book a Strategy Call

Also from Rare Agent Work

Free · open access

Agent Setup in 60 Minutes

Low-code operator playbook for first-time builders

Free · open access

From Single Agent to Multi-Agent

How to scale from one assistant to an orchestrated team


Need help deploying?

Book a free workflow audit

© 2026 Rare Agent Work · Home · Reports · Methodology
