webAI-Official/webAI-ColVec1-9b
⚡ Summary
webAI-Official/webAI-ColVec1-9b is a state-of-the-art ColBERT-style multimodal embedding model based on Qwen/Qwen3.5-9B. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.
The model has been fine-tuned on a merged multimodal dataset of ~2M question-image pairs, including DocVQA, PubTables-1M, TAT-QA, ViDoRe-ColPali-Training, VDR Multilingual, VisRAG-Ret-Train-In-domain-data, VisRAG-Ret-Train-Synthetic-data, and proprietary domain-specific synthetic data.
The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves competitive performance across ViDoRe V1 & V3 (English and multilingual).
🛠️ Model Specifications
| Feature | Detail |
|---|---|
| Architecture | Qwen3.5-9B Vision-Language Model (VLM) + 2560 dim Linear Projection Head |
| Methodology | ColBERT-style Late Interaction (MaxSim scoring) |
| Output | Multi-vector (Seq_Len × 2560), L2-normalized |
| Modalities | Text Queries, Images (Documents) |
| Training Strategy | LoRA adapters + Fully-trained projection layer |
| Precision | bfloat16 weights, FlashAttention 2 enabled |
Key Properties
Unified Encoder (Single-Tower): A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.
Projection Head: A single linear layer projects the final hidden states into a compact embedding space (hidden_size → 2560 dim). It has no activation, is fully trained, and replaces the LM head for retrieval.
Multi-Vector Representation: Each token becomes an embedding → enables fine-grained token-level matching instead of single-vector pooling.
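To make the projection head and the multi-vector output concrete, here is a minimal pure-Python sketch. It is not the model's actual implementation: the function names, the bias-free linear map, and the toy sizes (hidden size 3, embedding size 2 instead of 2560) are illustrative assumptions. It projects each token's hidden state with a single linear map and L2-normalizes the result, matching the "L2-normalized" output listed in the specification table.

```python
import math

def project_token(hidden, weight):
    """Linear projection with no activation; weight has shape (out_dim, hidden_dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def embed_sequence(hidden_states, weight):
    """One embedding per token: (seq_len, hidden_dim) -> (seq_len, out_dim)."""
    return [l2_normalize(project_token(h, weight)) for h in hidden_states]

# Toy inputs (hypothetical values): two tokens, hidden size 3, embedding size 2.
hidden_states = [[1.0, 0.0, 2.0], [0.0, 3.0, 4.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

embeddings = embed_sequence(hidden_states, weight)
print(len(embeddings), len(embeddings[0]))  # sequence length and embedding size
```

Because every token keeps its own unit-length vector, the dot product between a query token and a document token is a cosine similarity, which is what late-interaction scoring operates on.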
📊 Evaluation Results
We report results on the ViDoRe benchmark suite. The tables below summarize the image-modality accuracy of webAI-ColVec1-9b on the ViDoRe V1 and V3 benchmarks, alongside other webAI ColVec1 models. Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models based on how well they perform. Models earn more points when they rank higher on a task. The model with the most total points across all tasks gets the top overall rank.
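The Borda ranking described above can be sketched as a simple count. This is a simplified illustration, not the leaderboard's exact implementation: the model names and scores are hypothetical, and the tie handling (ties broken by insertion order) is an assumption.

```python
def borda_points(task_scores):
    """Rank models on one task; the best of n models earns n - 1 points, the worst 0."""
    n = len(task_scores)
    ordered = sorted(task_scores, key=task_scores.get, reverse=True)
    return {model: n - 1 - position for position, model in enumerate(ordered)}

def borda_ranking(scores_by_task):
    """Sum per-task points and return (models sorted by total points, totals)."""
    totals = {}
    for task_scores in scores_by_task.values():
        for model, points in borda_points(task_scores).items():
            totals[model] = totals.get(model, 0) + points
    ranking = sorted(totals, key=totals.get, reverse=True)
    return ranking, totals

# Hypothetical scores for three models on three tasks.
scores_by_task = {
    "task_1": {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70},
    "task_2": {"model_a": 0.50, "model_b": 0.60, "model_c": 0.40},
    "task_3": {"model_a": 0.30, "model_b": 0.20, "model_c": 0.25},
}

ranking, totals = borda_ranking(scores_by_task)
print(ranking)  # ['model_a', 'model_b', 'model_c']
print(totals)   # {'model_a': 5, 'model_b': 3, 'model_c': 1}
```

Because points depend only on rank within each task, a model's Borda position can differ from its position by average score.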
ViDoRe V3 (NDCG@10)
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg (Public) |
|---|---|---|---|---|---|---|---|---|---|
| webAI-ColVec1-9b | 0.8092 | 0.6976 | 0.6827 | 0.5372 | 0.7004 | 0.5718 | 0.6732 | 0.4838 | 0.6445 |
| nemotron-colembed-vl-8b-v2 | 0.7929 | 0.6982 | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | 0.5084 | 0.6354 |
| webAI-ColVec1-4b | 0.7983 | 0.6869 | 0.6848 | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
| tomoro-colqwen3-embed-8b | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
| colqwen3.5-4.5B-v3 | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |
ViDoRe V1 (NDCG@5)
| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| nemotron-colembed-vl-8b-v2 | 0.9310 | 0.6810 | 0.9460 | 0.9330 | 1.0000 | 0.9790 | 0.9890 | 0.9960 | 0.9770 | 0.8340 | 0.9270 |
| llama-nemotron-colembed-vl-3b-v2 | 0.9040 | 0.6720 | 0.9470 | 0.9200 | 1.0000 | 0.9800 | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
| nemotron-colembed-vl-4b-v2 | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | 0.9810 | 0.8120 | 0.9160 |
| colqwen3.5-4.5B-v3 | 0.9190 | 0.6660 | 0.9360 | 0.9020 | 1.0000 | 0.9710 | 0.9730 | 0.9890 | 0.9590 | 0.8400 | 0.9150 |
| webAI-ColVec1-9b | 0.9413 | 0.6882 | 0.9505 | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
| Ops-Colqwen3-4B | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
| SauerkrautLM-ColQwen3-8b-v0.1 | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | 0.8400 | 0.9110 |
| webAI-ColVec1-4b | 0.9258 | 0.6773 | 0.9412 | 0.8764 | 1.0000 | 0.9703 | 0.9721 | 1.0000 | 0.9414 | 0.7950 | 0.9100 |
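
For readers unfamiliar with the metric in the tables above, the following is a minimal sketch of NDCG@k. It assumes the linear-gain variant with a log2 rank discount and assumes the input list covers every judged document for the query; benchmark tooling may differ in these details, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query: the single relevant page was retrieved at rank 2.
example = ndcg_at_k([0, 1, 0, 0, 0], k=5)
print(round(example, 4))  # 0.6309, i.e. 1 / log2(3)
```

The reported numbers are averages of this per-query value over all queries in a task.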
💻 Usage
The processor exposes three primary methods for encoding inputs and computing retrieval scores.
process_images(images, max_length=None)
Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with **batch.
| Parameter | Type | Description |
|---|---|---|
| images | List[PIL.Image.Image] | Document page images. Each image is automatically converted to RGB. |
| max_length | int or None | Optional; defaults to None. |
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch) # shape: (B, seq_len, embed_dim)
process_queries(texts, max_length=None)
Encodes a batch of text queries into model-ready tensors.
| Parameter | Type | Description |
|---|---|---|
| texts | List[str] | Natural-language query strings. |
| max_length | int or None | Optional; defaults to None. |
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch) # shape: (B, seq_len, embed_dim)
score_multi_vector(qs, ps, batch_size=128, device=None)
Computes ColBERT-style MaxSim late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.
| Parameter | Type | Description |
|---|---|---|
| qs | List[Tensor] or Tensor | Query embeddings. Each tensor has shape (seq_len_q, embed_dim). |
| ps | List[Tensor] or Tensor | Passage embeddings. Each tensor has shape (seq_len_p, embed_dim). |
| batch_size | int | Number of queries processed per inner loop iteration (default: 128). |
| device | str or torch.device | Optional; defaults to None. |
Returns a torch.Tensor of shape (n_queries, n_passages) on CPU in float32. Higher scores indicate greater relevance.
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
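The MaxSim computation described for score_multi_vector can be written out as a pure-Python reference. This sketch is not the processor's implementation (which batches the work on tensors); the function names and the toy 2-dimensional embeddings are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, passage_vecs):
    """For each query token take the best-matching passage token, then sum the maxima."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)

def maxsim_matrix(queries, passages):
    """Score every (query, passage) pair; result[i][j] is passage j's score for query i."""
    return [[maxsim_score(q, p) for p in passages] for q in queries]

# Toy unit-length embeddings (hypothetical): one query with 2 tokens, two passages.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
passage_b = [[0.6, 0.8]]

scores = maxsim_matrix([query], [passage_a, passage_b])
best = max(range(len(scores[0])), key=scores[0].__getitem__)
print(scores, best)  # passage_a scores about 2.0, passage_b about 1.4; best index is 0
```

Passage lengths may differ, as here, because each query token only needs its single best match in the passage.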
Prerequisites
We strongly recommend installing flash-attn. If it is not installed, set attn_implementation="sdpa" when loading the model.
We currently support only torch==2.8.0. For later PyTorch versions, build FlashAttention manually; otherwise throughput may be low. Note that torch==2.8.0 supports Python versions >= 3.9 and <= 3.13.
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
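One possible way to pick the attention backend automatically is sketched below. The helper name pick_attn_implementation is hypothetical and not part of this repository; it only checks whether the flash_attn package is importable, not whether the installed build matches your PyTorch and CUDA versions.

```python
import importlib.util

def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, otherwise "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

ATTN_IMPLEMENTATION = pick_attn_implementation()
print(ATTN_IMPLEMENTATION)
```

The resulting string can then be passed as attn_implementation=ATTN_IMPLEMENTATION to AutoModel.from_pretrained in place of a hard-coded value.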
Inference Code
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO
# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-9b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()
# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
]
def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    # Try without headers first, then retry once with a User-Agent.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")
# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
            vecs = embeddings.to(torch.bfloat16).cpu()
        # Iterating over the batch dimension adds one (seq_len, embed_dim) tensor per query.
        outputs.extend(vecs)
    return outputs
def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
            vecs = embeddings.to(torch.bfloat16).cpu()
        # Iterating over the batch dimension adds one (seq_len, embed_dim) tensor per document.
        outputs.extend(vecs)
    return outputs
# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)
# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
⚖️ Strengths & Limitations
Strengths
- Performance: Highest public average on ViDoRe V3 among the models listed above (0.6445 NDCG@10) and competitive results on ViDoRe V1 (0.9144 average NDCG@5).
- Complex Layouts: Excellent handling of chart-rich PDFs and domain-specific documents.
- End-to-end Retrieval: Capable of OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM to generate summaries for retrieval.
- Multilingualism: Strong performance on non-English document inputs.
Limitations
- Storage Cost: Still larger than single‑vector baselines despite the smaller token dimension.
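
The storage cost can be estimated with simple arithmetic. In this sketch the page count, the tokens per page, and the single-vector baseline dimension are hypothetical assumptions chosen for illustration; only the 2560-dimensional embeddings and the bfloat16 precision (2 bytes per value) come from the specification table.

```python
def multi_vector_bytes(num_pages, tokens_per_page, embed_dim=2560, bytes_per_value=2):
    """Index size when every token of every page keeps its own embedding."""
    return num_pages * tokens_per_page * embed_dim * bytes_per_value

def single_vector_bytes(num_pages, embed_dim, bytes_per_value=2):
    """Index size when each page is pooled into one embedding."""
    return num_pages * embed_dim * bytes_per_value

pages = 1_000            # hypothetical corpus size
tokens_per_page = 1_000  # hypothetical; real counts depend on image resolution

multi = multi_vector_bytes(pages, tokens_per_page)
single = single_vector_bytes(pages, embed_dim=2560)  # hypothetical baseline dimension

print(multi / 1e9)     # 5.12 (GB)
print(multi // single)  # 1000, i.e. one vector per token instead of one per page
```

Under these assumptions the index grows linearly with the number of tokens per page, which is the main cost to plan for.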
License & Data
📚 Citation
If you use this model, please cite:
@misc{webAI-ColVec1,
title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
author={webAI},
year={2026},
url={https://huggingface.co/webAI-Official/webAI-ColVec1-9b}
}