Papers
arxiv:2602.05327

ProAct: Agentic Lookahead in Interactive Environments

Published on Feb 5
· Submitted by
taesiri
on Feb 6
Authors:
,
,
,
,
,
,
,
,
,
,
,

Abstract

ProAct enhances LLM agents' long-horizon planning by combining supervised fine-tuning with search-derived trajectories and a Monte-Carlo critic for improved policy optimization.

AI-generated summary

Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. The codes and models are available at https://github.com/GreatX3/ProAct

Community

Paper submitter

ProAct trains LLM-based agents to perform accurate lookahead planning in interactive environments via Grounded LookAhead Distillation and a Monte-Carlo Critic, improving long-horizon decision accuracy.

Sign up or log in to comment

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.05327 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.05327 in a Space README.md to link it from this page.

Collections including this paper 2