A Rubric-Supervised Critic from Sparse Real-World Outcomes
Paper: arXiv:2603.03800 (https://arxiv.org/abs/2603.03800)
GitHub (definitions & prompts)
A 4B-parameter critic model for evaluating AI agent trajectories, trained to predict behavioral rubrics and task success.
We serve this model using vLLM’s classification task:
vllm serve <MODEL_PATH> \
--host 0.0.0.0 \
--port 8000 \
--api-key <API_KEY> \
--served-model-name <MODEL_NAME> \
--task classify \
--max-model-len 262144 \
--dtype bfloat16 \
--trust-remote-code \
--enable-prefix-caching
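Once the server is up, a quick smoke test can confirm that it answers classification requests. The sketch below is only an illustration: it assumes the server exposes vLLM's `/classify` route at the host and port given above, and the helper names `build_classify_request` and `send_request` are hypothetical. The placeholder text is not a properly formatted trajectory, so any score it returns is meaningless; real inputs need the formatting handled by the SDK client.

```python
import json
import urllib.request


def build_classify_request(base_url, api_key, model_name, text):
    """Build a POST request for the server's /classify route (hypothetical helper).

    Assumes the request body takes `model` and `input` fields, as in vLLM's
    classification API; check the docs for the vLLM version you run.
    """
    payload = {"model": model_name, "input": text}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/classify",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The value passed to --api-key is sent as a Bearer token.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_request(request, timeout=60.0):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_request(build_classify_request("http://localhost:8000", "<API_KEY>", "<MODEL_NAME>", "placeholder input"))` should return a JSON object. Its exact fields depend on the vLLM version, so inspect the response before relying on any of them.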
We recommend using the OpenHands SDK for inference instead of calling the vLLM classification endpoint directly.
Follow the SDK guide: https://docs.openhands.dev/sdk/guides/critic
In particular, reuse the SDK client implementation here (it already handles formatting and API calls): https://github.com/OpenHands/software-agent-sdk/blob/main/openhands-sdk/openhands/sdk/critic/impl/api/critic.py
At a high level, you will:
@misc{wang2026rubricsupervisedcriticsparserealworld,
  title={A Rubric-Supervised Critic from Sparse Real-World Outcomes},
  author={Xingyao Wang and Valerie Chen and Heng Ji and Graham Neubig},
  year={2026},
  eprint={2603.03800},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.03800},
}