Personalized RewardBench: Evaluating Reward Models with Human-Aligned Personalization
Abstract
Personalized RewardBench evaluates reward models' ability to capture individual user preferences, revealing significant challenges in current models and demonstrating a stronger correlation with downstream performance than existing benchmarks.
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance, and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) than existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
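To make the evaluation protocol concrete, here is a minimal sketch of pairwise reward-model accuracy on personalized pairs. The `reward_fn` callable, the example dictionary keys, and the idea of conditioning the RM by prepending the user's rubric to the prompt are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Callable, Dict, List

def personalized_accuracy(
    examples: List[Dict[str, str]],
    reward_fn: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the RM scores the rubric-adhering (chosen)
    response above the rubric-violating (rejected) one."""
    correct = 0
    for ex in examples:
        # Condition the RM on the user by prepending their rubric to the
        # prompt (an assumed format; the benchmark may encode users differently).
        prompt = f"[User preference rubric: {ex['rubric']}]\n{ex['prompt']}"
        if reward_fn(prompt, ex["chosen"]) > reward_fn(prompt, ex["rejected"]):
            correct += 1
    return correct / len(examples)
```

Under this protocol, the 75.94% peak accuracy reported above means even the best RM mis-ranks roughly one personalized pair in four.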
Community
One thing we found surprising is how poorly current reward models handle personalization even when general response quality is high.
In our benchmark, the chosen and rejected responses are equally good in terms of correctness and helpfulness; the only difference between them is user-specific preference. Yet SOTA RMs still struggle significantly.
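This matters downstream: in Best-of-N sampling the reward model is the selector, so any bias toward average preferences is inherited directly by the final output. A minimal sketch, assuming placeholder `generate_fn` and `reward_fn` callables rather than the paper's actual models:

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int = 16,
) -> str:
    """Sample n candidates and return the one the RM scores highest.
    If the RM ignores user-specific preferences, increasing n cannot
    recover them: selection just converges on generic 'average' answers."""
    candidates: List[str] = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: reward_fn(prompt, r))
```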
Curious how others think about this:
• Are current RMs fundamentally biased toward “average user” preferences?
• Do we need a different training paradigm for truly personalized alignment?