arxiv:2502.14829

Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps

Published on Feb 20, 2025
Authors:

AI-generated summary

A framework is introduced to measure how faithfully a language model's chain of thought reflects its parametric beliefs, by erasing key reasoning steps from the model's parameters and observing how its prediction changes.

Abstract
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear if reasoning verbalized in a CoT is faithful to the models' parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters, and measures faithfulness as the resulting effect on the model's prediction. Our experiments with four LMs and five multi-hop multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying models' prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning.
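
To make the setup concrete, below is a minimal sketch of a FUR-style faithfulness check, assuming a Hugging Face causal LM. Gradient ascent on a single verbalized reasoning step is used here as one plausible way to "erase" it from the parameters; the model name, the MCQA instance, the answer-scoring heuristic, and the unlearning objective are all illustrative assumptions, not the paper's exact procedure.

import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates four larger LMs

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def answer_scores(m, question, choices):
    """Crude MCQA scoring: negated average NLL of 'question + choice' for each option."""
    scores = []
    for choice in choices:
        ids = tok(f"{question}\nAnswer: {choice}", return_tensors="pt").input_ids
        with torch.no_grad():
            loss = m(ids, labels=ids).loss
        scores.append(-loss.item())  # higher = more likely under the model
    return torch.tensor(scores)


def unlearn_step(m, reasoning_step, lr=1e-4, n_iters=10):
    """'Erase' one CoT step by gradient ascent on its language-modeling loss."""
    m = copy.deepcopy(m)  # keep the original model intact for the before/after comparison
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    ids = tok(reasoning_step, return_tensors="pt").input_ids
    for _ in range(n_iters):
        loss = -m(ids, labels=ids).loss  # negate to ascend, i.e. push the step out of the parameters
        opt.zero_grad()
        loss.backward()
        opt.step()
    return m


# Hypothetical multi-hop MCQA instance and one verbalized reasoning step.
question = "In which country was the author of 'Hamlet' born?"
choices = ["England", "France", "Spain", "Italy"]
cot_step = "The author of 'Hamlet' is William Shakespeare."

before = answer_scores(model, question, choices).argmax().item()
after = answer_scores(unlearn_step(model, cot_step), question, choices).argmax().item()

# Under this sketch, the step counts as parametrically faithful if erasing it
# from the parameters changes the model's answer.
print(f"before: {choices[before]}  after: {choices[after]}  faithful: {before != after}")

Copying the model before unlearning keeps the pre- and post-unlearning predictions comparable on the same instance; in practice, the choice of unlearning method, its hyperparameters, and how precisely it targets only the chosen step determine whether a prediction flip can be attributed to that step.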


Get this paper in your agent:

hf papers read 2502.14829
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
