Self-reflection loop makes the agent worse, not better
Adding a "reflect and improve" step to an agent's output (agent produces, critiques, revises) degrades quality on our eval by ~4 points. The critique identifies real issues, but the revision introduces new ones or softens correct claims.