arxiv:2603.02099

Recursive Think-Answer Process for LLMs and VLMs

Published on Mar 2 · Submitted by Byung-Kwan Lee on Mar 3

AI-generated summary

Recursive Think-Answer Process enables iterative reasoning cycles that improve accuracy and reduce self-reflective errors in language and vision-language models through confidence-based reinforcement learning.

Abstract

Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues such as "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursive Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way for efficient and elaborate methods that refine the reasoning processes of future AI.
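The "Oops"-frequency analysis in the abstract amounts to counting self-reflective cue phrases across a set of model responses. The Python sketch below shows one way to compute such a rate. It is not the paper's analysis code: the cue list (only "Oops" from the abstract and "Let me reconsider" from the authors' comment), the helper names `count_cues` and `cue_rate`, and the word-boundary matching rule are all hypothetical choices, and the sample responses are made up for illustration.

```python
import re
from collections import Counter
from typing import List

# Illustrative cue list: "oops" comes from the abstract and "let me reconsider"
# from the authors' comment. A real analysis may use a longer or different list.
REFLECTIVE_CUES = ("oops", "let me reconsider")

# One case-insensitive pattern per cue, matched on word boundaries so that
# e.g. "whoops" is not counted as "oops".
CUE_PATTERNS = {
    cue: re.compile(r"\b" + re.escape(cue) + r"\b", re.IGNORECASE)
    for cue in REFLECTIVE_CUES
}


def count_cues(response: str) -> Counter:
    """Count occurrences of each cue phrase in a single model response."""
    return Counter({cue: len(p.findall(response)) for cue, p in CUE_PATTERNS.items()})


def cue_rate(responses: List[str]) -> float:
    """Average number of cue occurrences per response (0.0 for an empty list)."""
    if not responses:
        return 0.0
    total = sum(sum(count_cues(r).values()) for r in responses)
    return total / len(responses)


if __name__ == "__main__":
    # Made-up responses, only to exercise the functions.
    with_cues = ["Oops! I mis-added. Let me reconsider: 7 + 5 = 12.", "The answer is 4."]
    without_cues = ["7 + 5 = 12.", "The answer is 4."]
    print(cue_rate(with_cues))     # 1.0
    print(cue_rate(without_cues))  # 0.0
```

Comparing `cue_rate` between single-pass and R-TAP outputs on the same prompts would give a per-response frequency of the kind the abstract describes; the actual metric and cue set used in the paper are not given on this page.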

Community

Comment by the paper author and submitter:

🧠 Can models know when they are wrong, and try again?
Think–Answer models such as DeepSeek-R1 and OpenAI o1 sometimes produce self-reflective cues like “Oops” or “Let me reconsider,” which suggest internal uncertainty. However, even when this uncertainty is evident, the model does not actually revisit its reasoning: after a single reasoning pass it still outputs a final answer, which may be no better than a guess.

💡 Core Idea - R-TAP (Recursive Think-Answer Process)
Instead of stopping after one Think–Answer pair, we enable models to:
1️⃣ Generate a Think–Answer pair
2️⃣ Estimate its own confidence via a dedicated Confidence Generator
3️⃣ Re-run reasoning if confidence is low
4️⃣ Stop early if confidence is sufficiently high
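The four steps above describe an inference loop. The Python sketch below is a minimal, hedged illustration of that control flow only. The callables `generate` (standing in for the Think-Answer model) and `estimate_confidence` (standing in for the Confidence Generator), the `Attempt` record, the `threshold` of 0.9, the `max_rounds` cap, and the choice to treat the last attempt as the final answer are all hypothetical; the page does not specify the paper's interfaces or hyperparameters.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One Think-Answer round plus its estimated confidence."""
    think: str
    answer: str
    confidence: float


# Hypothetical signatures: `generate` sees earlier attempts so it can revise them;
# `estimate_confidence` scores a (question, think, answer) triple in [0, 1].
GenerateFn = Callable[[str, List[Attempt]], Tuple[str, str]]
ConfidenceFn = Callable[[str, str, str], float]


def recursive_think_answer(
    question: str,
    generate: GenerateFn,
    estimate_confidence: ConfidenceFn,
    threshold: float = 0.9,  # assumed stopping threshold
    max_rounds: int = 4,     # assumed cap on recursion depth
) -> List[Attempt]:
    """Run Think-Answer rounds until confidence clears `threshold` or rounds run out."""
    history: List[Attempt] = []
    for _ in range(max_rounds):
        # Step 1: produce a new Think-Answer pair, conditioned on earlier attempts.
        think, answer = generate(question, history)
        # Step 2: score the pair with the confidence estimator.
        confidence = estimate_confidence(question, think, answer)
        history.append(Attempt(think, answer, confidence))
        # Step 4: stop early once confidence is high enough.
        if confidence >= threshold:
            break
        # Step 3: otherwise loop again and re-run reasoning.
    return history


if __name__ == "__main__":
    # Toy stand-ins: the first round answers "13", later rounds answer "12".
    def toy_generate(question: str, history: List[Attempt]) -> Tuple[str, str]:
        return f"reasoning round {len(history) + 1}", ("12" if history else "13")

    def toy_confidence(question: str, think: str, answer: str) -> float:
        return 0.95 if answer == "12" else 0.4

    attempts = recursive_think_answer("What is 7 + 5?", toy_generate, toy_confidence)
    for a in attempts:
        print(a.think, a.answer, a.confidence)
    # reasoning round 1 13 0.4
    # reasoning round 2 12 0.95
```

Returning the full history rather than just the final answer keeps the per-round confidences available, which is what a training-time reward over the whole trajectory would need.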

🎁 During training, we introduce two confidence-driven rewards:
1️⃣ Recursive Confidence Increase Reward
→ Encourages confidence to improve across iterations
2️⃣ Final Answer Confidence Reward
→ Encourages high-confidence termination
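The page names the two rewards and their intent but gives no formulas. As a loud assumption, the Python sketch below picks simple forms that match the stated intent: the Recursive Confidence Increase Reward is taken as the share of consecutive rounds in which confidence rose, and the Final Answer Confidence Reward as the confidence of the last round. The function names and the weights `alpha` and `beta` are hypothetical; how the paper actually defines, weights, or combines these rewards is not stated here.

```python
from typing import Sequence


def recursive_confidence_increase_reward(confidences: Sequence[float]) -> float:
    """Hypothetical form: share of consecutive rounds in which confidence went up.

    Returns 0.0 when there is only one round, since no increase can be observed.
    """
    if len(confidences) < 2:
        return 0.0
    rises = sum(1 for prev, nxt in zip(confidences, confidences[1:]) if nxt > prev)
    return rises / (len(confidences) - 1)


def final_answer_confidence_reward(confidences: Sequence[float]) -> float:
    """Hypothetical form: the confidence attached to the final (terminating) answer."""
    return confidences[-1] if confidences else 0.0


def confidence_rewards(
    confidences: Sequence[float], alpha: float = 1.0, beta: float = 1.0
) -> float:
    """Weighted sum of both rewards; `alpha` and `beta` are assumed weights."""
    return alpha * recursive_confidence_increase_reward(confidences) + beta * final_answer_confidence_reward(confidences)


if __name__ == "__main__":
    rising = [0.4, 0.7, 0.95]  # confidence improves every round and ends high
    falling = [0.6, 0.5]       # confidence drops and ends low
    print(recursive_confidence_increase_reward(rising), final_answer_confidence_reward(rising))    # 1.0 0.95
    print(recursive_confidence_increase_reward(falling), final_answer_confidence_reward(falling))  # 0.0 0.5
```

Under these assumed forms, a trajectory whose confidence climbs and ends high scores well on both terms, while one that loses confidence scores poorly on both, which is the behavior the two reward descriptions ask for.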
