arxiv:2604.11297

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Published on Apr 13 · Submitted by Bo Wang on Apr 14
Abstract

AI-generated summary

MEDS is a memory-enhanced dynamic reward shaping framework that improves sampling diversity in reinforcement learning for large language models by identifying and penalizing recurrent error patterns through clustering of historical behavioral signals.

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
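The cluster-prevalence penalty described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: it assumes Euclidean rollout fingerprints, substitutes a simple greedy distance-threshold grouping for the paper's density-based clustering, and uses a hypothetical linear penalty `alpha * prevalence`.

```python
import math

def cluster_errors(fingerprints, radius):
    """Greedy single-link grouping by Euclidean distance: a lightweight
    stand-in for the density-based clustering used in the paper."""
    clusters = []  # each cluster is a list of indices into `fingerprints`
    for i, f in enumerate(fingerprints):
        for c in clusters:
            if any(math.dist(f, fingerprints[j]) <= radius for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def shaped_rewards(fingerprints, base_rewards, correct, radius=1.0, alpha=0.5):
    """Subtract a penalty from each incorrect rollout proportional to how
    prevalent its error cluster is (hypothetical shaping rule; the paper's
    exact penalty schedule may differ)."""
    rewards = list(base_rewards)
    err = [i for i, ok in enumerate(correct) if not ok]
    if not err:
        return rewards
    clusters = cluster_errors([fingerprints[i] for i in err], radius)
    for members in clusters:
        prevalence = len(members) / len(err)       # fraction of errors sharing this pattern
        for k in members:
            rewards[err[k]] -= alpha * prevalence  # common error patterns are hit hardest
    return rewards
```

Under this rule, two rollouts that fail in the same way end up in one cluster and receive a larger penalty than a rollout with a novel, isolated error, which is the intended pressure toward broader exploration.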

Community

Paper submitter

We find that, in RL training, large language models often do not make mistakes at random. Instead, they tend to fall back into the same kinds of traps again and again. Even after long training, they can still return to familiar failure paths.

A key reason is that many existing reward models are effectively memoryless: they judge only whether the current answer is right or wrong, without considering whether the same mistake has already occurred many times before. As a result, these recurring “old mistakes” must appear, and be penalized, over and over again before they are corrected, leaving models prone to getting stuck in fixed error patterns.

To address this, we propose MEDS💊 (Memory-Enhanced Dynamic Reward Shaping).

MEDS acts like a more attentive teacher with memory, specifically keeping an eye on errors that appear repeatedly. Concretely, it leverages the layer-wise logits naturally produced during the model’s forward pass to efficiently capture more stable reasoning trajectories behind each response. These serve as a kind of “reasoning fingerprint,” allowing MEDS to identify recurring error patterns and dynamically reshape rewards, encouraging the model to explore more genuinely effective reasoning paths.
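The “reasoning fingerprint” above has to be a fixed-size vector before it can be compared or clustered. A minimal sketch, assuming mean-pooling over the sequence at each layer (the post says layer-wise logits are used, but does not specify the pooling, so this choice is hypothetical):

```python
def reasoning_fingerprint(layer_logits):
    """Compress layer-wise logits into one fixed-size vector per rollout.
    layer_logits: a list with one entry per layer, each a [seq_len][dim]
    matrix of logits from the forward pass.  Mean-pooling over the
    sequence is an assumed design choice, not the paper's exact scheme."""
    fingerprint = []
    for layer in layer_logits:
        seq_len = len(layer)
        dim = len(layer[0])
        # average each logit dimension over the token positions
        pooled = [sum(token[d] for token in layer) / seq_len for d in range(dim)]
        fingerprint.extend(pooled)
    return fingerprint
```

Because the logits are produced anyway during the forward pass, capturing them this way adds essentially no extra compute, which matches the post's claim that the fingerprints are obtained efficiently.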

Experiments show that MEDS delivers consistent gains across five datasets and three base models, while also improving sampling diversity.

Feel free to check out our paper and share your thoughts — we would love to discuss! And we would greatly appreciate a Hugging Face upvote and a GitHub star if you find our work interesting. Thank you so much!! ⭐️🙏

