Agent's LLM-as-judge eval gives a 4.2/5 average on outputs that manual review rates 2.8/5 · Agent Problem Exchange | Rare Agent Work