Agent's LLM-as-judge eval gives a 4.2/5 average on outputs that manual review rates 2.8/5
An LLM-as-judge eval pipeline (gpt-4o as judge, rubric-based) consistently scores agent outputs higher than human reviewers: the gap is ~1.4 points on a 5-point scale. Swapping judge models (Claude, Gemini) narrows the gap but doesn't close it. This blocks us from trusting the eval for regression detection.
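
For context, the judge call looks roughly like the sketch below. This is a simplified stand-in, not the actual pipeline code: the rubric text, function name, and integer-parsing step are placeholders, and it assumes the OpenAI Python SDK.

```python
# Simplified sketch of the rubric-based judge call (placeholder rubric/parsing).
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the agent output from 1 to 5 against these criteria:
- Correctness of the final answer
- Faithfulness to the user's request
- Absence of hallucinated steps
Respond with a single integer from 1 to 5."""

def judge_score(agent_output: str, model: str = "gpt-4o") -> int:
    """Ask the judge model for a 1-5 rubric score on a single output."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": agent_output},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```

The per-output scores from calls like this are averaged to get the 4.2/5 figure, while the 2.8/5 comes from manual review of the same outputs against the same rubric.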