DW-KhotTaeVL-2B-QueryFrames
Built on Qwen3-VL-2B-Instruct (Apache 2.0).
A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct for video multiple-choice / decision-style question answering. No model weights are modified: this method ships a CLIP-ViT-L/14-driven frame selector plus an optional task-type-aware uniform-fallback policy as a wrapper around the stock model.
On Video-MME mini at an 8-frame budget, this recovers ~44 % of the 8-frame → 64-frame stock baseline gap in MCQ mode, and ~56 % in task-aware MCQ mode, with zero training, zero parameter changes, and ~+0.4 s overhead per question.
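(The gap-recovery figures follow directly from the TL;DR numbers below: the 8 f → 64 f stock gap is 73.7 − 57.0 = 16.7 pp, so +7.3 pp recovers 7.3 / 16.7 ≈ 44 % and +9.3 pp recovers 9.3 / 16.7 ≈ 56 %.)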
Scope
This release evaluates query-aware frame selection in a video multiple-choice / decision-style QA setting. The selector may use both the question text and the answer options as its CLIP query. This is appropriate for Video-MME-style MCQ benchmarks and for operational triage workflows where the system chooses among predefined actions or alert categories (e.g. normal passage / restricted-zone entry / staff activity / false alarm). It should not be read as an open-ended video-understanding benchmark claim.
Motivation
This work started from CCTV / video-security R&D, where only a small number of frames can be sent to a VLM under latency and compute constraints. The released artifact is a general-purpose query-aware frame selector for video MCQ / decision-style video QA, not a product-specific CCTV model.
TL;DR
| Method | Trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
|---|---|---|---|
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
| QueryFrames, MCQ mode (no task_type) | 0 | 64.3 % | +7.3 pp |
| QueryFrames, task-aware MCQ mode (task_type from dataset) | 0 | 66.3 % | +9.3 pp |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |
12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp); 0 regressions in task-aware MCQ mode (task_type from the Video-MME dataset).
Scope note. This method targets short-clip, low-frame-budget video QA. The 300 Q numbers above are inside that design envelope. On the full 2700 Q split, the overall Δ is +0.22 pp; see Scope on the full Video-MME mini (2700 Q) below.
Why it works
Stock Qwen3-VL-2B at 8 frames trails its own 64-frame performance by ~17 pp. The gap is by definition a frame-coverage problem (same model, same prompt, only the frame budget changes). The bottleneck is which 8 frames you give the model, not the model itself.
DW-KhotTaeVL-2B-QueryFrames picks the 8 frames that best match the question via CLIP-ViT-L/14 cosine similarity. For the two task types where 64-frame stock does not outperform 8-frame stock (Object Reasoning and Temporal Reasoning, per the Video-MME taxonomy), the hybrid policy reverts to uniform sampling: frame coverage is not the bottleneck for those questions, and CLIP scoring can mis-pick.
Pipeline
For each (video, question, options[A,B,C,D]):
1. Sample 32 uniformly spaced candidate frames.
2. Encode the question text with CLIP-ViT-L/14 → 768-d text vector.
3. Encode the candidate frames → 768-d image vectors.
4. Cosine similarity → pick top-8 (or uniform-8 if the task is Object Reasoning / Temporal Reasoning, when task_type is given).
5. Sort selected 8 frames by original temporal index.
6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
7. Extract letter from output.
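For concreteness, here is a minimal sketch of steps 1-5. It is illustrative, not the shipped implementation (that lives in dw_queryframes.py): the function name select_frames and its parameters are assumptions, and, per the Scope section above, the real selector may concatenate the answer options into the CLIP query rather than using the question alone.

```python
# Minimal sketch of steps 1-5; dw_queryframes.py is the reference implementation.
import numpy as np
import torch
from decord import VideoReader
from transformers import CLIPModel, CLIPProcessor

CLIP_ID = "openai/clip-vit-large-patch14"
clip_model = CLIPModel.from_pretrained(CLIP_ID).eval()
clip_proc = CLIPProcessor.from_pretrained(CLIP_ID)

UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}

def select_frames(video_path, query, task_type=None, n_candidates=32, k=8):
    # 1. Sample uniformly spaced candidate frames.
    vr = VideoReader(video_path)
    idx = np.linspace(0, len(vr) - 1, n_candidates).astype(int)
    frames = [vr[int(i)].asnumpy() for i in idx]  # HWC uint8 arrays

    # Hybrid policy: uniform-8 where frame coverage is not the bottleneck.
    if task_type in UNIFORM_FALLBACK_TASKS:
        keep = np.linspace(0, n_candidates - 1, k).astype(int)
        return [frames[int(i)] for i in keep]

    with torch.no_grad():
        # 2. Encode the query text -> 768-d vector.
        text_in = clip_proc(text=[query], return_tensors="pt", truncation=True)
        t = clip_model.get_text_features(**text_in)
        # 3. Encode candidate frames -> 768-d vectors.
        img_in = clip_proc(images=frames, return_tensors="pt")
        v = clip_model.get_image_features(**img_in)

    # 4. Cosine similarity -> top-k frames.
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    top = torch.topk((v @ t.T).squeeze(-1), k).indices.tolist()

    # 5. Restore temporal order before handing frames to the VLM.
    return [frames[i] for i in sorted(top)]
```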
Usage
Install dependencies
pip install torch transformers pillow decord huggingface_hub
Minimal example
from dw_queryframes import QueryFrames

fv = QueryFrames(device="auto")  # auto-resolves to cuda / mps / cpu

result = fv.answer_mcq(
    video_path="cooking.mp4",
    question="What does the chef do after pouring the oil into the pot?",
    options=[
        "Chops fresh green herbs",
        "Pours broth into the pot",
        "Stirs the oil in the pot",
        "Adds salt to the pot",
    ],
    task_type=None,  # or e.g. "Action Recognition" for task-aware MCQ mode
)

print(result["pred"])            # e.g. 'B'
print(result["frames_used"])     # 'query_aware' or 'uniform_fallback'
print(result["latency_clip_s"])  # ~0.4 s
print(result["latency_gen_s"])   # ~3 s on Apple M4 MPS
Two operating modes
| Mode | Input | Use case | Acc (300 Q) |
|---|---|---|---|
| MCQ mode (no task_type) | video + question + answer options | video MCQ / decision-style QA without a task taxonomy | 64.3 % |
| Task-aware MCQ mode | + task_type string | benchmark or controlled workflows where a task taxonomy is supplied | 66.3 % |
Pass any of the Video-MME task labels (e.g. "Action Recognition",
"Object Reasoning", "Counting Problem") to task_type. Two values
trigger the uniform-fallback path: "Object Reasoning" and
"Temporal Reasoning". All other task strings (or None) use the
query-aware path.
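As a sketch, the routing reduces to a set lookup. The function name below is hypothetical, but the two return values match the frames_used field returned by answer_mcq:

```python
# Illustrative routing policy; the shipped module implements the actual mapping.
UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}

def frame_selection_path(task_type):
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    return "query_aware"  # default, including task_type=None
```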
MCQ mode without task_type (64.3 %, +7.3 pp) is the default reported setting: it uses only the video, question, and answer options, with no task taxonomy.
Task-aware MCQ mode (66.3 %, +9.3 pp) uses the task_type label supplied by Video-MME to route Object Reasoning and Temporal Reasoning questions to uniform sampling. This is a benchmark / controlled-workflow setting and is reported separately from the default MCQ mode.
Per-task accuracy on Video-MME mini 300 Q
| Task | n | Stock 8 f | QueryFrames | Δ |
|---|---|---|---|---|
| Action Reasoning | 9 | 0.444 | 0.667 | +0.222 ★ |
| Action Recognition | 45 | 0.489 | 0.644 | +0.156 ★ |
| Attribute Perception | 37 | 0.730 | 0.811 | +0.081 ★ |
| Counting Problem | 34 | 0.265 | 0.353 | +0.088 |
| Information Synopsis | 30 | 0.800 | 0.800 | +0.000 |
| OCR Problems | 23 | 0.391 | 0.609 | +0.217 ★ |
| Object Reasoning | 36 | 0.722 | 0.722 | +0.000 |
| Object Recognition | 51 | 0.588 | 0.667 | +0.078 ★ |
| Spatial Perception | 10 | 0.600 | 0.700 | +0.100 ★ |
| Spatial Reasoning | 9 | 0.778 | 1.000 | +0.222 ★ |
| Temporal Perception | 8 | 0.625 | 0.750 | +0.125 ★ |
| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
(Task-aware MCQ mode shown; task_type provided by the Video-MME dataset. ★ = Δ ≥ 5 pp.)
What this is NOT
- It is not a fine-tuned model. Qwen3-VL-2B-Instruct weights are unchanged; you can verify this by comparing file hashes against the upstream Qwen/Qwen3-VL-2B-Instruct checkpoint on Hugging Face.
- It is not a leaderboard submission claim. The numbers above are on the publicly available Video-MME mini split (300 Q, filtered to videos available locally via the standard mini chunks).
- It is not a replacement for fine-tuning when you have abundant domain data. For domain-shifted deployments (e.g. surveillance video), training-based adaptation may be required.
Hardware
Runs on:
| Device | Notes |
|---|---|
| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s/q at 8 frames |
| NVIDIA A100 / H100 (CUDA) | works; faster |
| CPU (BF16-capable) | works but slow |
VRAM / unified memory needed: ~6-8 GB at max_pixels=262144 with 8 frames. Lower max_pixels (e.g. to 153600) if memory-constrained.
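If you drive the base model directly rather than through the wrapper, the pixel budget can be set on the processor. This is a hedged sketch: it assumes the Qwen3-VL processor accepts the same max_pixels keyword as earlier Qwen-VL releases; check the wrapper's own configuration first.

```python
# Assumption: Qwen3-VL's processor accepts max_pixels like earlier Qwen-VL releases.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    max_pixels=153600,  # down from the 262144 used for the reported evals
)
```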
Reproducibility
All numbers in this card are reproducible from a fresh clone of this
repo, using the official Video-MME parquet
(filtered to its videos_chunked_01.zip mini split).
The shipped scripts (eval_videomme.py and build_hybrid.py) are self-contained: they have no external project dependencies beyond the local dw_queryframes.py module and standard Python / Hugging Face / PyTorch packages.
Three-command reproduction recipe
# Install deps
pip install torch transformers pillow decord huggingface_hub pandas pyarrow
# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
python eval_videomme.py --mode stock-uniform --n-questions 300 \
--out-json stock_uniform_300q.json
# 2. Reproduce MCQ mode (no task_type) (writes wild_300q.json)
python eval_videomme.py --mode wild --n-questions 300 \
--out-json wild_300q.json
# 3. Combine into task-aware MCQ mode via the hybrid policy
python build_hybrid.py \
--wild-json wild_300q.json \
--stock-uniform-json stock_uniform_300q.json \
--out-json hybrid_300q.json
Expected results at 300 Q (greedy decoding, do_sample=False,
max_pixels=262144):
| Output | Accuracy | Δ vs stock |
|---|---|---|
| stock_uniform_300q.json | 0.5700 | – |
| wild_300q.json (MCQ mode) | 0.6433 | +7.3 pp |
| hybrid_300q.json (task-aware MCQ mode) | 0.6633 | +9.3 pp |
This artifact is fully deterministic at greedy decoding: re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 % in task-aware MCQ mode.
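To spot-check that figure from the output file, something along these lines works. This is a hypothetical sketch: it assumes each output JSON holds a list of per-question records with pred and answer fields, whereas the actual schema is whatever eval_videomme.py writes.

```python
# Assumption: the JSON is a list of records with "pred" and "answer" keys.
import json

with open("hybrid_300q.json") as f:
    records = json.load(f)
correct = sum(r["pred"] == r["answer"] for r in records)
print(f"{correct} / {len(records)} = {correct / len(records):.4f}")
```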
Caveat: sample size and split. The 300 Q numbers above are on the videos_chunked_01.zip mini subset, which happens to be mostly short clips. For full-split numbers on Video-MME mini 2700 Q (balanced short / medium / long), see Scope on the full Video-MME mini (2700 Q) below. This release is not a leaderboard submission.
Scope on the full Video-MME mini (2700 Q)
After the 300 Q release, the eval was extended to the full 2700 Q
split (MCQ mode without task_type). Stock 53.11 %, QueryFrames
53.33 %, Δ +0.22 pp.
This method targets short-clip, low-frame-budget video QA. The 2700 Q split is balanced across short / medium / long-form clips; averaging across that range dilutes the gain to roughly neutral.
Acknowledgements / Related Work
This project builds on Qwen3-VL-2B-Instruct and uses a simple CLIP-based query-aware frame selection policy at inference time.
Query-aware and adaptive frame selection for Video-LLMs is an active research direction. This release is an independent, simple CLIP-based inference-time implementation focused on small-model video MCQ / decision-style video QA under tight frame budgets.
License
| Component | License | Source |
|---|---|---|
| This wrapper code | Apache 2.0 | this repo |
| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |
When using or citing this work, please credit the base model:
Built on Qwen3-VL-2B-Instruct (Apache 2.0). Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).
Citation
@misc{dw-khottaevl-2b-queryframes-2026,
author = {Deaw},
title = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
for Video MCQ on Qwen3-VL-2B-Instruct},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
}
@misc{qwen3vl2025,
title = {Qwen3-VL: Multilingual Vision-Language Models},
author = {Qwen Team},
year = {2025},
}
@inproceedings{radford2021clip,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Radford, Alec and Kim, Jong Wook and others},
booktitle = {ICML},
year = {2021},
}
@misc{videomme2024,
title = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
of Multi-modal LLMs in Video Analysis},
author = {Fu, Chaoyou and others},
year = {2024},
}
Author
Deaw (@commandeaw), independent ML practitioner. Personal research release.
Issues / questions: open an issue on the model repo.