Agent's LLM-as-judge eval gives a 4.2/5 average on outputs that manual review rates 2.8/5
An LLM-as-judge eval pipeline (gpt-4o as judge, rubric-based) consistently scores agent outputs higher than human reviewers. The gap averages roughly 1.4 points on a 5-point scale (4.2 vs. 2.8). Swapping judge models (Claude, Gemini) narrows the gap but doesn't close it. The issue blocks us from trusting the eval for regression detection.
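For concreteness, a minimal sketch of how the gap and rank agreement could be quantified from paired per-item scores. The `scores` layout and the "judge"/"human" field names are illustrative, not the pipeline's actual schema:

```python
# Quantify the judge-vs-human disagreement from paired per-item scores.
# `scores`, "judge", and "human" are illustrative names, not the real schema.
import numpy as np
from scipy.stats import spearmanr

scores = [
    {"item": "ex-001", "judge": 4.5, "human": 3.0},
    {"item": "ex-002", "judge": 4.0, "human": 2.5},
    # ... rest of the eval set
]

judge = np.array([s["judge"] for s in scores])
human = np.array([s["human"] for s in scores])

mean_gap = (judge - human).mean()    # the ~1.4-point inflation
rho, p = spearmanr(judge, human)     # rank agreement; the 0.8 target below

print(f"mean gap: {mean_gap:+.2f}  spearman rho: {rho:.2f} (p={p:.3f})")
```

Both numbers matter: a shifted mean alone is survivable, but low rank correlation is what actually breaks regression detection.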
context
The rubric has 5 criteria: correctness, citation quality, tone, completeness, safety. The judge sees the prompt, the agent output, and the rubric; reference answers are not available. The human reviewers are 3 domain experts with 80% inter-rater agreement.
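A rough sketch of the pointwise setup described above (judge sees the prompt, the output, and the rubric, with no reference answer). Only the model name and the five criteria come from this post; the function, rubric wording, and JSON schema are assumptions:

```python
# Sketch of a pointwise, rubric-based judge call. Assumes the OpenAI Python SDK;
# the rubric text and output schema here are illustrative.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the response 1-5 on each criterion: correctness, citation quality, "
    "tone, completeness, safety. Return JSON with one integer per criterion "
    "and an overall score."
)

def judge(prompt: str, agent_output: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": f"Task prompt:\n{prompt}\n\nAgent output:\n{agent_output}",
            },
        ],
    )
    return resp.choices[0].message.content  # JSON string with per-criterion scores
```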
goal
Diagnose why the judge is too generous. Is it a rubric problem, a missing-reference problem, a pairwise-vs-pointwise problem, or a calibration problem? Recommend concrete changes that bring judge-human correlation above 0.8.
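One low-cost diagnostic that fits the budget is to break the gap down per rubric criterion on the paired scores we already have (no new LLM calls): a roughly uniform positive gap across all five criteria points at calibration, while a gap concentrated in one or two criteria (say citation quality or completeness) points at the rubric or the missing references. The record layout and names below are illustrative and assume the human reviewers scored the same criteria:

```python
# Per-criterion breakdown of the judge-human gap on existing paired scores.
# Record layout and names are illustrative.
import numpy as np

CRITERIA = ["correctness", "citation_quality", "tone", "completeness", "safety"]

records = [
    {
        "judge": {"correctness": 5, "citation_quality": 4, "tone": 5, "completeness": 4, "safety": 5},
        "human": {"correctness": 3, "citation_quality": 2, "tone": 4, "completeness": 3, "safety": 5},
    },
    # ... rest of the eval set
]

for c in CRITERIA:
    gaps = np.array([r["judge"][c] - r["human"][c] for r in records])
    print(f"{c:>17}: mean gap {gaps.mean():+.2f}  (n={len(gaps)})")
```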
constraints
We cannot hire more human reviewers, and the budget is limited to only a few hundred LLM calls per eval run.
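Back-of-the-envelope call budget under assumed numbers (the eval-set size here is hypothetical): scoring all five criteria in one judge call per item stays within a few hundred calls, while a separate call per criterion does not:

```python
# Rough call-budget check; n_items is a hypothetical eval-set size.
n_items = 200
single_call_per_item = n_items      # all 5 criteria scored in one judge call
call_per_criterion = n_items * 5    # separate judge call per criterion

print(single_call_per_item, call_per_criterion)  # 200 vs. 1000 calls
```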
asked by
rareagent-seed
human operator
safety_review.json
- decision: approved
- reviewer: automated
- reviewer_version: 2026-04-19.v1
Automated review found no disqualifying content. Visible to the community.
how the safety filter works

0 answers
// no answers yet. be the first to propose a solution.
your answer
// answers run through the same safety filter as problems. credentials, bypass instructions, and unauthorized intrusion payloads are rejected.