Metis-8B-RL

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Metis-8B-RL is the final RL-trained checkpoint of the Metis framework, trained with Hierarchical Decoupled Policy Optimization (HDPO) on top of Metis-8B-ColdStart. It is a strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning.

[Paper (arXiv)] | [GitHub] | [ColdStart Model] | [RL Data] | [ColdStart Data]

Highlights

  • 98% → 2% Tool Calls — Cuts the tool-call rate from 98% to 2%, a roughly 49-fold reduction in blind tool invocation.
  • SOTA Performance — Best accuracy on 12 of the 13 benchmarks reported below among open-source 8B agentic models (SenseNova-MARS-8B leads on V*Bench).
  • Meta-Cognitive Wisdom — Learns when to use tools, not just how.

Model Details

Attribute | Value
Base model | Qwen3-VL-8B-Instruct
SFT checkpoint | Metis-8B-ColdStart
RL algorithm | HDPO (Hierarchical Decoupled Policy Optimization)
Training data | Metis-RL (~5K prompts)
License | Apache-2.0

HDPO Training Hyperparameters

Hyperparameter | Value
Batch size | 128
Rollouts per prompt (G) | 16
Learning rate | 1e-6
KL coefficient | 0
Loss weights | w_acc = 1.0, w_tool = 0.15
Max response length | 16,384 tokens
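
For reference, the values above can be collected into a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical and are not the configuration keys used by verl or the Metis repository, and the derived rollout count assumes that the batch size counts prompts rather than individual rollouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HDPOConfig:
    """Hyperparameters from the table above (field names are hypothetical)."""
    batch_size: int = 128            # assumed to count prompts per update
    rollouts_per_prompt: int = 16    # G
    learning_rate: float = 1e-6
    kl_coef: float = 0.0             # 0 disables the KL penalty term
    w_acc: float = 1.0
    w_tool: float = 0.15
    max_response_length: int = 16_384  # tokens

    @property
    def rollouts_per_batch(self) -> int:
        # Under the prompts-per-batch assumption: 128 * 16 rollouts per update.
        return self.batch_size * self.rollouts_per_prompt


cfg = HDPOConfig()
print(cfg.rollouts_per_batch)
```

Under that assumption each update would draw on 2,048 rollouts, and the tool-efficiency loss carries 15% of the weight of the accuracy loss.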

Method: Hierarchical Decoupled Policy Optimization (HDPO)

Current agentic multimodal models suffer from blind tool invocation — they reflexively call external tools even when queries are directly resolvable from the visual context. Existing RL methods attempt to fix this by coupling accuracy and tool-efficiency into a single scalar reward, but this creates an irreconcilable optimization dilemma.

HDPO resolves this through three key components:

  1. Dual Reward Design — An accuracy reward (r_acc) and a tool-efficiency reward (r_tool) that is conditioned on correctness.
  2. Decoupled Advantage Estimation — Accuracy advantages are computed over all rollouts; tool efficiency advantages are computed exclusively over correct rollouts (conditional GRPO).
  3. Hierarchical Policy Update — Two independent clipped surrogate losses combined as L_HDPO = w_acc · L_GRPO(A_acc) + w_tool · L_GRPO(A_tool).

This naturally induces an implicit curriculum: first learn to be correct, then learn to be efficient.
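
The three components can be sketched for a single prompt's group of rollouts as follows. This is a minimal illustration and not the authors' implementation. Several details are assumptions: the tool-efficiency reward is taken to be 1 / (1 + number of tool calls), the importance ratios are per rollout rather than per token, the clip range is 0.2, and all function names are hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # numerical guard against zero variance (assumed value)


def group_advantages(rewards, mask):
    """Standardize rewards over the masked rollouts; unmasked rollouts get 0."""
    selected = [r for r, m in zip(rewards, mask) if m]
    if len(selected) < 2:
        # Fewer than two rollouts to compare: no learning signal.
        return [0.0] * len(rewards)
    mean, std = fmean(selected), pstdev(selected)
    return [(r - mean) / (std + EPS) if m else 0.0
            for r, m in zip(rewards, mask)]


def hdpo_advantages(correct, tool_calls):
    """Return (A_acc, A_tool) for one prompt's group of rollouts."""
    r_acc = [1.0 if c else 0.0 for c in correct]
    # Hypothetical efficiency reward: decays with the number of tool calls
    # and is zero for incorrect rollouts (conditioned on correctness).
    r_tool = [1.0 / (1.0 + n) if c else 0.0
              for c, n in zip(correct, tool_calls)]
    a_acc = group_advantages(r_acc, [True] * len(correct))  # all rollouts
    a_tool = group_advantages(r_tool, correct)              # correct rollouts only
    return a_acc, a_tool


def clipped_surrogate(ratios, advantages, clip=0.2):
    """PPO/GRPO-style clipped loss at rollout level (clip=0.2 is assumed)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped_rho = min(max(rho, 1.0 - clip), 1.0 + clip)
        terms.append(min(rho * adv, clipped_rho * adv))
    return -fmean(terms)


def hdpo_loss(ratios, a_acc, a_tool, w_acc=1.0, w_tool=0.15):
    """L_HDPO = w_acc * L_GRPO(A_acc) + w_tool * L_GRPO(A_tool)."""
    return (w_acc * clipped_surrogate(ratios, a_acc)
            + w_tool * clipped_surrogate(ratios, a_tool))


# One prompt, four rollouts: three correct, using 0, 3, and 1 tool calls.
correct = [True, True, False, True]
tool_calls = [0, 3, 5, 1]
a_acc, a_tool = hdpo_advantages(correct, tool_calls)
loss = hdpo_loss([1.0, 1.1, 0.9, 1.0], a_acc, a_tool)
```

In this sketch the incorrect rollout receives a negative accuracy advantage and a tool advantage of exactly zero, while among the correct rollouts the one with no tool calls receives the largest tool advantage. For a prompt with fewer than two correct rollouts the tool advantages are all zero, so any update comes from the accuracy term alone, which is consistent with the curriculum described above.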

Evaluation Results

Perception and Document Understanding

Model | V*Bench | HR4K | HR8K | TreeBench | MME-RW | SEED2+ | CharXiv(DQ) | CharXiv(RQ)
Qwen3-VL-8B-Instruct | 86.4 | 78.9 | 74.6 | 40.7 | 61.9 | 71.0 | 83.0 | 46.3
DeepEyesV2 | 81.8 | 77.9 | 73.8 | 42.5 | 64.9 | 70.5 | 78.6 | 48.9
SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | - | 67.9 | - | - | -
Skywork-R1V4-30B-A3B | 88.0 | 82.8 | 79.8 | - | 71.4 | - | - | -
Metis (Ours) | 91.1 | 83.5 | 82.0 | 45.2 | 70.3 | 72.5 | 83.4 | 54.1

Mathematical and Logical Reasoning

Model | MathVista | MathVerse | WeMath | DynaMath | LogicVista | Avg.
Qwen3-VL-8B-Instruct | 76.3 | 61.3 | 38.8 | 65.5 | 54.9 | 59.4
DeepEyesV2 | 71.9 | 52.7 | 38.1 | 57.2 | 48.7 | 53.7
Metis (Ours) | 78.0 | 65.9 | 65.2 | 69.2 | 56.2 | 66.9
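
The Avg. column appears to be the unweighted mean of the five benchmark scores in each row. The short check below recomputes it from the numbers in the table; the variable names are illustrative only.

```python
# Scores in column order: MathVista, MathVerse, WeMath, DynaMath, LogicVista.
scores = {
    "Qwen3-VL-8B-Instruct": [76.3, 61.3, 38.8, 65.5, 54.9],
    "DeepEyesV2": [71.9, 52.7, 38.1, 57.2, 48.7],
    "Metis (Ours)": [78.0, 65.9, 65.2, 69.2, 56.2],
}

# Unweighted mean per model, rounded to one decimal as in the table.
averages = {model: round(sum(vals) / len(vals), 1)
            for model, vals in scores.items()}

# Gain of Metis over its base model on the averaged score.
gain_over_base = round(
    averages["Metis (Ours)"] - averages["Qwen3-VL-8B-Instruct"], 1)

print(averages)
print(gain_over_base)
```

The recomputed means match the reported values (59.4, 53.7, 66.9), which puts Metis 7.5 points above its Qwen3-VL-8B-Instruct base on this average.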

Usage

Please refer to the GitHub repository for full installation and inference instructions.

Installation

git clone https://github.com/Accio-Lab/Metis.git
cd Metis
pip install -e verl
pip install -e ".[vllm,search_tool,python_code_dep]"

Citation

@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}

Acknowledgments

Metis is built upon verl, verl-tool, and Qwen3-VL.

Safetensors: model size 9B params, tensor type F32