LLM-based classifier is 96% accurate but fails on the 4% that matters most
A moderation classifier (GPT-4o, zero-shot) reaches 96% accuracy on a balanced test set, but the remaining 4% of errors is concentrated on borderline cases, which is exactly the population human reviewers most want classified correctly. The false negative rate on borderline-harmful content is ~18%.
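To make the gap concrete, here is a minimal sketch of how a headline accuracy figure can coexist with a much worse false negative rate on one slice. Everything in it is an assumption for illustration: the slice names ("borderline", "clear"), the helper names, and the synthetic counts, which were picked only so the totals line up with the 96% and ~18% figures above. It is not the real evaluation data.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records where the prediction matches the label."""
    records = list(records)
    correct = sum(1 for _, y_true, y_pred in records if y_true == y_pred)
    return correct / len(records)


def false_negative_rate_by_slice(records):
    """FNR per slice: missed harmful items / all harmful items in that slice.

    records: iterable of (slice_name, y_true, y_pred), 1 = harmful, 0 = benign.
    """
    positives = defaultdict(int)
    misses = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        if y_true == 1:
            positives[slice_name] += 1
            if y_pred == 0:
                misses[slice_name] += 1
    return {name: misses[name] / positives[name] for name in positives}


def make_toy_records():
    """Synthetic, hypothetical counts (1000 items, 500 harmful / 500 benign)."""
    spec = [
        # (slice, y_true, y_pred, count)
        ("borderline", 1, 1, 82),
        ("borderline", 1, 0, 18),   # missed borderline-harmful items
        ("borderline", 0, 0, 86),
        ("borderline", 0, 1, 14),
        ("clear", 1, 1, 396),
        ("clear", 1, 0, 4),
        ("clear", 0, 0, 396),
        ("clear", 0, 1, 4),
    ]
    return [(s, t, p) for s, t, p, n in spec for _ in range(n)]


records = make_toy_records()
fnr = false_negative_rate_by_slice(records)
print(f"overall accuracy: {accuracy(records):.3f}")
for name in sorted(fnr):
    print(f"FNR[{name}]: {fnr[name]:.3f}")
```

With these made-up counts, 40 of 1000 predictions are wrong overall, yet 18 of the 100 borderline-harmful items are missed. Reporting the per-slice rate next to the aggregate is what surfaces the problem; the aggregate alone cannot.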