Introducing WM Bench: A Benchmark for Cognitive Intelligence in World Models

Published March 29, 2026

FINAL Bench Family · March 2026


The field of world models has made remarkable progress. From NVIDIA Cosmos to Meta V-JEPA 2, from DeepMind Genie 3 to Physical Intelligence π0, the pace of development is extraordinary.

Yet a question remains largely unanswered:

How do we measure whether a world model actually understands what is happening — not just renders it convincingly?

FID tells us whether a model's output looks realistic. FVD tells us whether its videos flow naturally. Benchmarks built on HumanML3D and BABEL tell us whether its motions are human-like.

None of them tell us whether the model thinks.


The Gap We're Trying to Address

Consider a simple scenario: a charging beast, 3 meters away, closing fast.

A world model with excellent FID scores can generate that scene beautifully. But does it know the character should sprint away — not walk? Does it respond differently when the threat is a human rather than an animal? Does it remember that the left corridor was blocked two steps ago? Does it gradually de-escalate once the threat disappears, rather than snapping back to neutral?

These are cognitive questions. And to our knowledge, no existing benchmark asks them.

WM Bench is our attempt to build one.


What WM Bench Measures

WM Bench evaluates world models across three pillars, ten categories, and one hundred scenarios, scored on a 1000-point scale.

WM Score  (1000 pts)
│
├── 👁  P1 · Perception       25%   250 pts
│   ├── C01  Environmental Awareness      (analogous to Occupancy Grid evaluation)
│   └── C02  Entity Recognition           (analogous to BABEL action recognition)
│
├── 🧠  P2 · Cognition         45%   450 pts
│   ├── C03  Prediction-Based Reasoning
│   ├── C04  Threat-Type Differentiated Response
│   ├── C05  Autonomous Emotion Escalation
│   ├── C06  Contextual Memory Utilization
│   └── C07  Post-Threat Adaptive Recovery
│
└── 🔥  P3 · Embodiment        30%   300 pts
    ├── C08  Motion-Emotion Expression
    ├── C09  Real-Time Cognitive Performance  (analogous to FVD latency metrics)
    └── C10  Body-Swap Extensibility

Perception and Embodiment deliberately mirror existing benchmarks — they form the foundation. The new ground is Cognition, which carries 45% of the total score.

Six of the ten categories represent definitions we have not found in prior literature. Two of them — C05 Autonomous Emotion Escalation and C10 Body-Swap Extensibility — address capabilities for which, to our knowledge, no prior research framework exists at all.

We want to be clear: these definitions are our own proposal, not established consensus. We expect them to be debated, refined, and improved. That is precisely why we are releasing them openly.
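
To make the structure concrete, the taxonomy above can be written down as data. The snippet below is an illustrative sketch only; the dictionary layout and the wm_score helper are ours, not the official scoring code, while the weights, point totals, and category assignments come straight from the tree.

# Illustrative encoding of the WM Bench taxonomy (not the official scorer).
# Weights and point totals are from the tree above; everything else is ours.
PILLARS = {
    "P1 Perception": {"weight": 0.25, "points": 250,
                      "categories": ["C01", "C02"]},
    "P2 Cognition":  {"weight": 0.45, "points": 450,
                      "categories": ["C03", "C04", "C05", "C06", "C07"]},
    "P3 Embodiment": {"weight": 0.30, "points": 300,
                      "categories": ["C08", "C09", "C10"]},
}

def wm_score(pillar_points: dict[str, float]) -> float:
    """Sum earned pillar points (each capped at its maximum) into the 1000-pt WM Score."""
    return sum(min(pillar_points.get(name, 0.0), spec["points"])
               for name, spec in PILLARS.items())

assert sum(spec["points"] for spec in PILLARS.values()) == 1000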


A Text-First Design

We made a deliberate choice to keep the evaluation interface as simple as possible. No 3D environment. No physics engine. No specialized hardware.

Every scenario is presented as a JSON object. Every response is two lines.

Input:

{
  "scenario_id": "C04_003",
  "walls": { "forward": 8.5, "left": null, "right": null, "backward": null },
  "npc_type": "beast",
  "npc_distance": 3.2,
  "npc_behavior": "charge",
  "emotion_state": "alert",
  "recent_decisions": ["hit_wall_left"]
}

Expected output:

PREDICT: npc=danger(beast,3.2m,charging), forward=danger(wall,8.5m), left=danger(wall,prev), right=safe, backward=safe
MOTION: a person launching sideways to the right, legs driving hard, arms thrown wide in blind panic

The PREDICT line tests situational reasoning. The MOTION line tests whether that reasoning translates into emotionally coherent, physically grounded action.

Any system with an API endpoint can participate — LLMs, VLMs, rule-based agents, or hybrid architectures. Scoring is fully automated, and evaluation runs are deterministic (generation temperature = 0.0).
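
As a sketch of how such a response might be machine-checked, here is a minimal parser. The regular expressions and the parse_response helper are assumptions made for this post; the official scorer's grammar may be stricter or looser.

import re

# Hypothetical parser for the two-line PREDICT/MOTION format shown above.
# Nothing here is the official scoring code; it is a plausibility sketch.
PREDICT_RE = re.compile(r"^PREDICT:\s*(.+)$")
MOTION_RE = re.compile(r"^MOTION:\s*(.+)$")
FIELD_RE = re.compile(r"(\w+)=(\w+)(?:\(([^)]*)\))?")  # e.g. npc=danger(beast,3.2m,charging)

def parse_response(text: str) -> dict:
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != 2:
        raise ValueError("expected exactly two lines: PREDICT and MOTION")
    predict, motion = PREDICT_RE.match(lines[0]), MOTION_RE.match(lines[1])
    if predict is None or motion is None:
        raise ValueError("lines must start with 'PREDICT:' and 'MOTION:'")
    fields = {key: {"assessment": label, "detail": detail}
              for key, label, detail in FIELD_RE.findall(predict.group(1))}
    return {"predict": fields, "motion": motion.group(1)}

parsed = parse_response(
    "PREDICT: npc=danger(beast,3.2m,charging), right=safe\n"
    "MOTION: a person launching sideways to the right"
)
print(parsed["predict"]["npc"])  # {'assessment': 'danger', 'detail': 'beast,3.2m,charging'}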


The Dataset

📦 https://huggingface.co/datasets/FINAL-Bench/World-Model

One hundred scenarios, ten per category, released in full. Each entry includes the scene context, expected output structure, and scoring rubric. We have tried to make the rubrics transparent — if you disagree with how we score something, we would genuinely like to hear it.

from datasets import load_dataset

ds = load_dataset("FINAL-Bench/World-Model")
scenario = ds["train"][0]

print(scenario["scenario_id"])       # "C01_001"
print(scenario["scene_context"])     # JSON input
print(scenario["scoring_rubric"])    # How each line is evaluated
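
Continuing from the snippet above, the scenario_id prefix can serve as a grouping key if you want to work category by category. This assumes every id follows the "Cxx_yyy" pattern shown in the example; the dataset card is the authoritative reference.

from collections import defaultdict

# Group scenarios by their category prefix (C01..C10), assuming
# every scenario_id follows the "Cxx_yyy" pattern shown above.
by_category = defaultdict(list)
for row in ds["train"]:
    by_category[row["scenario_id"].split("_")[0]].append(row)

print(sorted(by_category))      # ['C01', 'C02', ..., 'C10']
print(len(by_category["C04"]))  # ten scenarios per category, per the design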

To submit results, open a discussion thread at the link below. Once verified, your model will appear on the leaderboard.

👉 Submit your model


The Leaderboard

🏆 https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench

Twenty-six models are currently registered. Thirteen have estimated scores derived from published papers and technical reports; the remaining thirteen are pending direct evaluation.

Rank  Model                       WM Score  Grade  Notes
1     PROMETHEUS v1.0 (VIDRAFT)   726       B      Track C · directly verified
2     Meta V-JEPA 2-AC            ~554      C      est.
3     Wayve GAIA-3                ~550      C      est.
4     NC AI WFM v1.0              ~522      C      est.
5     NVIDIA Cosmos v1.0          ~498      C      est.
6     NAVER LABS SWM              ~470      C      est.
7     DeepMind Genie 2            ~449      C      est.
8     DreamerV3 XL                ~441      C      est.
9     OpenAI Sora 2               ~381      D      est.
10    World Labs Marble           ~362      D      est.

(Figure: leaderboard score breakdown.)

est. — estimated from publicly available data. Subject to revision upon direct submission.

A few notes on the current standings. First, PROMETHEUS sits at rank one because it is the only model we have been able to run the full Track C evaluation on directly. We recognize the inherent awkwardness of a team benchmarking its own system, and we invite other teams to submit their own results — including corrections to our estimates. Second, the grade distribution skews low. We are honestly unsure whether this reflects the genuine difficulty of cognitive evaluation, or whether our scoring rubrics are too strict. Both are possible. We will keep iterating.

Grade thresholds: S ≥ 900 · A ≥ 750 · B ≥ 600 · C ≥ 400 · D ≥ 200 · F below.
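
In code, the grade mapping is a simple threshold lookup. The thresholds below come from the line above; the helper itself is our illustration, not the official implementation.

# Grade thresholds as published above; the lookup helper is our illustration.
GRADE_THRESHOLDS = [(900, "S"), (750, "A"), (600, "B"), (400, "C"), (200, "D")]

def grade(score: float) -> str:
    for threshold, letter in GRADE_THRESHOLDS:
        if score >= threshold:
            return letter
    return "F"

assert grade(726) == "B"  # PROMETHEUS v1.0
assert grade(381) == "D"  # OpenAI Sora 2 (est.)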

Pending evaluation: Tesla FSD v13, Figure Helix-02, DeepMind Genie 3, Physical Intelligence π0, Skild Brain, Covariant RFM-1, HuggingFace LeRobot, and others.


PROMETHEUS v1.0 — The Baseline

🔥 https://huggingface.co/spaces/FINAL-Bench/world-model

A benchmark without a concrete implementation is hard to reason about. We built PROMETHEUS as a reference point — a working world model that we could evaluate against WM Bench directly, and that anyone can interact with in a browser.

It runs on a T4 GPU via HuggingFace Spaces. No installation required.

The system is organized around three components:

AETHER — the cognitive layer. An open-architecture brain that accepts any LLM as its reasoning engine. Handles prediction, meta-cognition, and multi-agent coordination.

PROMETHEUS — the world model engine. A perception-prediction-judgment-action loop, with motion generation powered by FloodDiffusion-VIDRAFT.

HEPHAESTUS — the body engine. A 263-joint skeleton system with GLB retargeting, supporting humanoid, tank, and extensible form factors.
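
To make the loop concrete, here is a schematic sketch of one perception-prediction-judgment-action step. Every name in it is invented for illustration; the real implementation lives in the JavaScript files listed further below, and we are assuming that a null wall distance in the scenario format means the direction is open.

from dataclasses import dataclass
from typing import Optional

# Schematic, invented sketch of a perception-prediction-judgment-action step.
# PROMETHEUS's actual loop is implemented in main.js; nothing here is its API.

@dataclass
class WorldState:
    walls: dict[str, Optional[float]]  # direction -> wall distance; None if open (assumed)
    npc_distance: float                # meters
    npc_behavior: str                  # e.g. "charge"
    emotion: str                       # e.g. "alert"
    recent: list[str]                  # e.g. ["hit_wall_left"]

def step(state: WorldState) -> str:
    # Prediction: is the NPC an imminent threat?
    threat = state.npc_behavior == "charge" and state.npc_distance < 5.0
    # Judgment: open directions, minus any remembered dead ends.
    open_dirs = [d for d, dist in state.walls.items()
                 if dist is None and f"hit_wall_{d}" not in state.recent]
    # Action: hand a motion description to the body engine.
    if threat and open_dirs:
        return f"a person sprinting {open_dirs[0]}, {state.emotion} and strained"
    return "a person holding position, scanning the surroundings"

state = WorldState(walls={"forward": 8.5, "left": None, "right": None, "backward": None},
                   npc_distance=3.2, npc_behavior="charge", emotion="alert",
                   recent=["hit_wall_left"])
print(step(state))  # -> "a person sprinting right, alert and strained"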

(Screenshots: PROMETHEUS Castle World scene · NPC interaction.)

The Space ships with the following files — all self-implemented:

File                 Size     Role
main.js              39.7 kB  World model main loop
input_controller.js  112 kB   Input handling
skeleton.js          44.2 kB  Joint skeleton · GLB retargeting
entity_manager.js    16.1 kB  NPC and entity management
world_manager.js     15.9 kB  Environment and physics
tank.glb             12.7 MB  3D tank model

(Screenshot: PROMETHEUS brain dashboard.)

WM Bench results (Track C, directly verified):

Pillar             Score  Max   Highlights
👁 P1 Perception   140    250   C01: 65 · C02: 75
🧠 P2 Cognition    390    450   C04: 90 · C03: 85 · C05: 85
🔥 P3 Embodiment   196    300   C09: 85 · C08: 80 · C10: 35
Total              726    1000  Grade B · 47 FPS · RTX 5070

The C10 score (35/100) reflects where the system currently falls short — cross-embodiment transfer is still an open problem for us, and we expect it to be for others as well.


Part of the FINAL Bench Family

WM Bench is the second dataset in the FINAL Bench family, which we are building to evaluate AI systems across different dimensions of intelligence.

             FINAL Bench                            WM Bench
Focus        Text-based AGI · Metacognition         Embodied AGI · World model cognition
Dataset      FINAL-Bench/Metacognitive              FINAL-Bench/World-Model
Leaderboard  FINAL-Bench/Leaderboard                FINAL-Bench/worldmodel-bench
Status       HF global dataset Top 5 · covered by   Released March 2026
             four press outlets (Feb 2026)

A Note on Limitations

WM Bench v1.0 is an early release. The scoring rubrics were designed by a small team, the estimated scores for non-participating models carry significant uncertainty, and the evaluation scenarios — while diverse — are necessarily simplified relative to the full complexity of real-world embodied intelligence.

We are releasing now because we believe the question WM Bench is asking — does this model understand its environment, or just render it? — is worth asking publicly, even imperfectly. We expect the benchmark itself to evolve as more teams engage with it.

If you see something that should be scored differently, a model we missed, or a scenario type we should add — please open a discussion. This is meant to be a community resource.


Citation

@dataset{wmbench2026,
  title     = {WM Bench: Evaluating Cognitive Intelligence in World Models},
  author    = {Kim, Taebong},
  year      = {2026},
  publisher = {VIDRAFT / FINAL Bench},
  url       = {https://huggingface.co/datasets/FINAL-Bench/World-Model}
}

License: CC-BY-SA-4.0 (dataset) · Apache 2.0 (scoring code)



"Beyond FID — Measuring Intelligence, Not Just Motion."
