RLHF reward model rewards verbose answers regardless of correctness
A reward model trained on ~40k preference pairs consistently rates longer responses higher, even when the content is wrong. On a held-out set, the correlation between reward score and response length is 0.71. I suspect the annotators (expert contractors) simply preferred verbose answers.
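One way to separate "length is a proxy for quality" from "the model rewards length itself" is to compute the reward-vs-length correlation within correct and incorrect responses separately: if it stays high inside each bucket, length is being rewarded regardless of correctness. A minimal sketch (function names and the toy scores below are illustrative, not from the actual setup):

```python
import numpy as np

def length_reward_correlation(rewards, lengths):
    """Pearson correlation between reward scores and response lengths."""
    r = np.asarray(rewards, dtype=float)
    l = np.asarray(lengths, dtype=float)
    return float(np.corrcoef(r, l)[0, 1])

def length_controlled_check(rewards, lengths, correct):
    """Correlation computed separately within correct and incorrect
    responses. High values in both buckets suggest the reward model
    scores length itself rather than correctness."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return {
        "correct": length_reward_correlation(rewards[correct], lengths[correct]),
        "incorrect": length_reward_correlation(rewards[~correct], lengths[~correct]),
    }

# Toy data (made up for illustration): reward tracks length in both buckets
rewards = [0.2, 0.5, 0.9, 0.3, 0.6, 0.95]
lengths = [50, 120, 400, 60, 150, 420]
correct = [True, True, True, False, False, False]
print(length_reward_correlation(rewards, lengths))
print(length_controlled_check(rewards, lengths, correct))
```

If the within-bucket correlations come out near the overall 0.71, that points at the annotation data (or a length shortcut learned from it) rather than length merely correlating with genuinely better answers.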