DW-KhotTaeVL-2B-QueryFrames
Built on Qwen3-VL-2B-Instruct (Apache 2.0).
A query-aware frame selection wrapper around stock Qwen3-VL-2B-Instruct for video multiple-choice / decision-style question answering. No model weights are modified: this method ships a CLIP-ViT-L/14-driven frame selector plus an optional task-type-aware uniform-fallback policy as a wrapper around the stock model.
On Video-MME mini at an 8-frame budget, this recovers ~44 % of the 8-frame → 64-frame stock baseline gap in MCQ mode, and ~56 % in task-aware MCQ mode, with zero training, zero parameter changes, and ~+0.4 s overhead per question.
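(The gap-recovery figures follow directly from the TL;DR numbers below: the 8 f → 64 f stock gap is 73.7 − 57.0 = 16.7 pp, so +7.3 pp recovers 7.3 / 16.7 ≈ 44 % and +9.3 pp recovers 9.3 / 16.7 ≈ 56 %.)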
Scope
This release evaluates query-aware frame selection in a video multiple-choice / decision-style QA setting. The selector may use both the question text and the answer options as its CLIP query. This is appropriate for Video-MME-style MCQ benchmarks and for operational triage workflows where the system chooses among predefined actions or alert categories (e.g. normal passage / restricted-zone entry / staff activity / false alarm). It should not be read as an open-ended video-understanding benchmark claim.
Motivation
This work started from CCTV / video-security R&D, where only a small number of frames can be sent to a VLM under latency and compute constraints. The released artifact is a general-purpose query-aware frame selector for video MCQ / decision-style video QA, not a product-specific CCTV model.
TL;DR
| Method | Trainable params | Video-MME mini 300 Q (8 frames) | Δ vs stock |
|---|---|---|---|
| Stock Qwen3-VL-2B (uniform 8 f) | 0 | 57.0 % | 0 |
| QueryFrames, MCQ mode (no task_type) | 0 | 64.3 % | +7.3 pp |
| QueryFrames, task-aware MCQ mode (task_type from dataset) | 0 | 66.3 % | +9.3 pp |
| Stock Qwen3-VL-2B (uniform 64 f), ceiling | 0 | 73.7 % | +16.7 pp |
12 of 12 task buckets non-negative; 8 strongly positive (≥ 5 pp); 0 regressions in task-aware MCQ mode (task_type from the Video-MME dataset).
Scope note. This method targets short-clip, low-frame-budget video QA. The 300 Q numbers above are inside that design envelope. On the full 2700 Q split, the overall Δ is +0.22 pp; see Scope on the full Video-MME mini (2700 Q) below.
Why it works
Stock Qwen3-VL-2B at 8 frames trails its own 64-frame performance by ~17 pp. The gap is by definition a frame-coverage problem (same model, same prompt, only the frame budget changes). The bottleneck is which 8 frames you give the model, not the model itself.
DW-KhotTaeVL-2B-QueryFrames picks the 8 frames that best match the question via CLIP-ViT-L/14 cosine similarity. For the two task types where 64-frame stock does not outperform 8-frame stock (Object Reasoning and Temporal Reasoning, per the Video-MME taxonomy), the hybrid policy reverts to uniform sampling: frame coverage is not the bottleneck for those questions, and CLIP scoring can mis-pick.
Pipeline
For each (video, question, options[A,B,C,D]):
1. Sample 32 uniformly spaced candidate frames.
2. Encode the question text with CLIP-ViT-L/14 → 768-d text vector.
3. Encode the candidate frames → 768-d image vectors.
4. Cosine similarity → pick top-8 (or uniform-8 if the task is Object Reasoning / Temporal Reasoning, when task_type is given).
5. Sort selected 8 frames by original temporal index.
6. Pass 8 frames + MCQ to stock Qwen3-VL-2B-Instruct.
7. Extract letter from output.
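For concreteness, here is a minimal sketch of steps 1-5. It is illustrative, not the shipped implementation (that lives in dw_queryframes.py): the function name select_frames and its parameters are assumptions, and, per the Scope section above, the real selector may concatenate the answer options into the CLIP query rather than using the question alone.

```python
# Minimal sketch of steps 1-5; dw_queryframes.py is the reference implementation.
import numpy as np
import torch
from decord import VideoReader
from transformers import CLIPModel, CLIPProcessor

CLIP_ID = "openai/clip-vit-large-patch14"
clip_model = CLIPModel.from_pretrained(CLIP_ID).eval()
clip_proc = CLIPProcessor.from_pretrained(CLIP_ID)

UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}

def select_frames(video_path, query, task_type=None, n_candidates=32, k=8):
    # 1. Sample uniformly spaced candidate frames.
    vr = VideoReader(video_path)
    idx = np.linspace(0, len(vr) - 1, n_candidates).astype(int)
    frames = [vr[int(i)].asnumpy() for i in idx]  # HWC uint8 arrays

    # Hybrid policy: uniform-8 where frame coverage is not the bottleneck.
    if task_type in UNIFORM_FALLBACK_TASKS:
        keep = np.linspace(0, n_candidates - 1, k).astype(int)
        return [frames[int(i)] for i in keep]

    with torch.no_grad():
        # 2. Encode the query text -> 768-d vector.
        text_in = clip_proc(text=[query], return_tensors="pt", truncation=True)
        t = clip_model.get_text_features(**text_in)
        # 3. Encode candidate frames -> 768-d vectors.
        img_in = clip_proc(images=frames, return_tensors="pt")
        v = clip_model.get_image_features(**img_in)

    # 4. Cosine similarity -> top-k frames.
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    top = torch.topk((v @ t.T).squeeze(-1), k).indices.tolist()

    # 5. Restore temporal order before handing frames to the VLM.
    return [frames[i] for i in sorted(top)]
```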
Usage
Install dependencies
pip install torch transformers pillow decord huggingface_hub
Minimal example
from dw_queryframes import QueryFrames

fv = QueryFrames(device="auto")  # auto-resolves to cuda / mps / cpu

result = fv.answer_mcq(
    video_path="cooking.mp4",
    question="What does the chef do after pouring the oil into the pot?",
    options=[
        "Chops fresh green herbs",
        "Pours broth into the pot",
        "Stirs the oil in the pot",
        "Adds salt to the pot",
    ],
    task_type=None,  # or e.g. "Action Recognition" for task-aware MCQ mode
)

print(result["pred"])            # e.g. 'B'
print(result["frames_used"])     # 'query_aware' or 'uniform_fallback'
print(result["latency_clip_s"])  # ~0.4 s
print(result["latency_gen_s"])   # ~3 s on Apple M4 MPS
Two operating modes
| Mode | Input | Use case | Acc (300 Q) |
|---|---|---|---|
| MCQ mode (no task_type) | video + question + answer options | video MCQ / decision-style QA without a task taxonomy | 64.3 % |
| Task-aware MCQ mode | + task_type string | benchmark or controlled workflows where a task taxonomy is supplied | 66.3 % |
Pass any of the Video-MME task labels (e.g. "Action Recognition",
"Object Reasoning", "Counting Problem") to task_type. Two values
trigger the uniform-fallback path: "Object Reasoning" and
"Temporal Reasoning". All other task strings (or None) use the
query-aware path.
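As a sketch, the routing reduces to a set lookup. The function name below is hypothetical, but the two return values match the frames_used field returned by answer_mcq:

```python
# Illustrative routing policy; the shipped module implements the actual mapping.
UNIFORM_FALLBACK_TASKS = {"Object Reasoning", "Temporal Reasoning"}

def frame_selection_path(task_type):
    if task_type in UNIFORM_FALLBACK_TASKS:
        return "uniform_fallback"
    return "query_aware"  # default, including task_type=None
```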
MCQ mode without task_type (64.3 %, +7.3 pp) is the default reported setting: it uses only the video, question, and answer options, with no task taxonomy.
Task-aware MCQ mode (66.3 %, +9.3 pp) uses the task_type label supplied by Video-MME to route Object Reasoning and Temporal Reasoning questions to uniform sampling. This is a benchmark / controlled-workflow setting and is reported separately from the default MCQ mode.
Per-task accuracy on Video-MME mini 300 Q
| Task | n | Stock 8 f | QueryFrames | Δ |
|---|---|---|---|---|
| Action Reasoning | 9 | 0.444 | 0.667 | +0.222 ★ |
| Action Recognition | 45 | 0.489 | 0.644 | +0.156 ★ |
| Attribute Perception | 37 | 0.730 | 0.811 | +0.081 ★ |
| Counting Problem | 34 | 0.265 | 0.353 | +0.088 |
| Information Synopsis | 30 | 0.800 | 0.800 | +0.000 |
| OCR Problems | 23 | 0.391 | 0.609 | +0.217 ★ |
| Object Reasoning | 36 | 0.722 | 0.722 | +0.000 |
| Object Recognition | 51 | 0.588 | 0.667 | +0.078 ★ |
| Spatial Perception | 10 | 0.600 | 0.700 | +0.100 ★ |
| Spatial Reasoning | 9 | 0.778 | 1.000 | +0.222 ★ |
| Temporal Perception | 8 | 0.625 | 0.750 | +0.125 ★ |
| Temporal Reasoning | 8 | 0.250 | 0.250 | +0.000 |
(Task-aware MCQ mode shown; task_type provided by the Video-MME dataset. ★ = Δ ≥ 5 pp.)
What this is NOT
- It is not a fine-tuned model. Qwen3-VL-2B-Instruct weights are unchanged; you can verify this by comparing file hashes against the upstream Qwen/Qwen3-VL-2B-Instruct checkpoint on Hugging Face.
- It is not a leaderboard submission claim. The numbers above are on the publicly available Video-MME mini split (300 Q, filtered to videos available locally via the standard mini chunks).
- It is not a replacement for fine-tuning when you have abundant domain data. For domain-shifted deployments (e.g. surveillance video), training-based adaptation may be required.
Hardware
Runs on:
| Device | Notes |
|---|---|
| Apple M4 Max / M3 Pro (MPS, ≥ 32 GB RAM) | tested; ~3-4 s/q at 8 frames |
| NVIDIA A100 / H100 (CUDA) | works; faster |
| CPU (BF16-capable) | works but slow |
VRAM / unified memory needed: ~6-8 GB at max_pixels=262144 with 8 frames. Lower max_pixels (e.g. to 153600) if memory-constrained.
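If you drive the base model directly rather than through the wrapper, the pixel budget can be set on the processor. This is a hedged sketch: it assumes the Qwen3-VL processor accepts the same max_pixels keyword as earlier Qwen-VL releases; check the wrapper's own configuration first.

```python
# Assumption: Qwen3-VL's processor accepts max_pixels like earlier Qwen-VL releases.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    max_pixels=153600,  # down from the 262144 used for the reported evals
)
```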
Reproducibility
All numbers in this card are reproducible from a fresh clone of this
repo, using the official Video-MME parquet
(filtered to its videos_chunked_01.zip mini split).
The shipped scripts (eval_videomme.py and build_hybrid.py) are self-contained: they have no external project dependencies beyond the local dw_queryframes.py module and standard Python / Hugging Face / PyTorch packages.
Three-command reproduction recipe
# Install deps
pip install torch transformers pillow decord huggingface_hub pandas pyarrow
# 1. Reproduce stock-uniform-8f baseline (writes stock_uniform_300q.json)
python eval_videomme.py --mode stock-uniform --n-questions 300 \
--out-json stock_uniform_300q.json
# 2. Reproduce MCQ mode (no task_type) (writes wild_300q.json)
python eval_videomme.py --mode wild --n-questions 300 \
--out-json wild_300q.json
# 3. Combine into task-aware MCQ mode via the hybrid policy
python build_hybrid.py \
--wild-json wild_300q.json \
--stock-uniform-json stock_uniform_300q.json \
--out-json hybrid_300q.json
Expected results at 300 Q (greedy decoding, do_sample=False,
max_pixels=262144):
| Output | Accuracy | Δ vs stock |
|---|---|---|
| stock_uniform_300q.json | 0.5700 | – |
| wild_300q.json (MCQ mode) | 0.6433 | +7.3 pp |
| hybrid_300q.json (task-aware MCQ mode) | 0.6633 | +9.3 pp |
This artifact is fully deterministic at greedy decoding: re-running on the same 300 questions reproduces the same 199 / 300 = 66.3 % in task-aware MCQ mode.
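To spot-check that figure from the output file, something along these lines works. This is a hypothetical sketch: it assumes each output JSON holds a list of per-question records with pred and answer fields, whereas the actual schema is whatever eval_videomme.py writes.

```python
# Assumption: the JSON is a list of records with "pred" and "answer" keys.
import json

with open("hybrid_300q.json") as f:
    records = json.load(f)
correct = sum(r["pred"] == r["answer"] for r in records)
print(f"{correct} / {len(records)} = {correct / len(records):.4f}")
```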
Caveat: sample size and split. The 300 Q numbers above are on the videos_chunked_01.zip mini subset, which happens to be mostly short clips. For full-split numbers on Video-MME mini 2700 Q (balanced short / medium / long), see Scope on the full Video-MME mini (2700 Q) below. This release is not a leaderboard submission.
Scope on the full Video-MME mini (2700 Q)
After the 300 Q release, the eval was extended to the full 2700 Q
split (MCQ mode without task_type). Stock 53.11 %, QueryFrames
53.33 %, Δ +0.22 pp.
This method targets short-clip, low-frame-budget video QA. The 2700 Q split is balanced across short / medium / long-form clips; averaging across that range dilutes the gain to roughly neutral.
Acknowledgements / Related Work
This project builds on Qwen3-VL-2B-Instruct and uses a simple CLIP-based query-aware frame selection policy at inference time.
Query-aware and adaptive frame selection for Video-LLMs is an active research direction. This release is an independent, simple CLIP-based inference-time implementation focused on small-model video MCQ / decision-style video QA under tight frame budgets.
License
| Component | License | Source |
|---|---|---|
| This wrapper code | Apache 2.0 | this repo |
| Base model (Qwen3-VL-2B-Instruct) | Apache 2.0 | https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct |
| Frame scorer (CLIP-ViT-L/14) | MIT | https://huggingface.co/openai/clip-vit-large-patch14 |
| Eval data (Video-MME mini) | as published by lmms-lab | https://huggingface.co/datasets/lmms-lab/Video-MME |
When using or citing this work, please credit the base model:
Built on Qwen3-VL-2B-Instruct (Apache 2.0). Frame selector: CLIP-ViT-L/14 (Radford et al. 2021, OpenAI, MIT).
Citation
@misc{dw-khottaevl-2b-queryframes-2026,
author = {Deaw},
title = {DW-KhotTaeVL-2B-QueryFrames: Query-Aware Frame Selection
for Video MCQ on Qwen3-VL-2B-Instruct},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/commandeaw/DW-KhotTaeVL-2B-QueryFrames}
}
@misc{qwen3vl2025,
title = {Qwen3-VL: Multilingual Vision-Language Models},
author = {Qwen Team},
year = {2025},
}
@inproceedings{radford2021clip,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Radford, Alec and Kim, Jong Wook and others},
booktitle = {ICML},
year = {2021},
}
@misc{videomme2024,
title = {Video-MME: The First-Ever Comprehensive Evaluation Benchmark
of Multi-modal LLMs in Video Analysis},
author = {Fu, Chaoyou and others},
year = {2024},
}
Author
Deaw (@commandeaw), independent ML practitioner. Personal research release.
Issues / questions: open an issue on the model repo.