webAI-Official/webAI-ColVec1-4b

⚡ Summary

webAI-Official/webAI-ColVec1-4b is a state-of-the-art ColBERT-style multimodal embedding model based on Qwen/Qwen3.5-4B. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

The model was fine-tuned on a merged multimodal dataset of ~2M question-image pairs drawn from DocVQA, PubTables-1M, TAT-QA, ViDoRe-ColPali-Training, VDR Multilingual, VisRAG-Ret-Train-In-domain-data, VisRAG-Ret-Train-Synthetic-data, and proprietary domain-specific synthetic data.

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves competitive performance across ViDoRe V1 & V3 (English and multilingual).

🛠️ Model Specifications

Feature	Detail
Architecture	Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head
Methodology	ColBERT-style Late Interaction (MaxSim scoring)
Output	Multi-vector (Seq_Len × 640), L2-normalized
Modalities	Text Queries, Images (Documents)
Training Strategy	LoRA adapters + Fully-trained projection layer
Precision	`bfloat16` weights, FlashAttention 2 enabled

Key Properties

Unified Encoder (Single-Tower): A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream, so there are no separate dual encoders.
Projection Head: A single linear layer projects the final hidden states into a compact embedding space (hidden_size → 640 dim). It has no activation, is fully trained, and replaces the LM head for retrieval.
Multi-Vector Representation: Each token becomes its own embedding, which enables fine-grained token-level matching instead of single-vector pooling.
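
As a rough illustration of the projection head and the L2 normalization listed in the specifications, the plain-Python sketch below maps per-token hidden states to unit-length embeddings. The toy dimensions, the weight values, the absence of a bias term, and the helper names `project` and `l2_normalize` are assumptions made for this example; the actual model performs the hidden_size → 640 projection inside its own code.

```python
import math

def project(hidden, weight):
    """Apply a bias-free linear layer to every token (no activation)."""
    return [[sum(h * w for h, w in zip(token, row)) for row in weight]
            for token in hidden]

def l2_normalize(vectors, eps=1e-12):
    """Scale each token vector to unit L2 norm."""
    out = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        out.append([x / max(norm, eps) for x in vec])
    return out

# Toy sizes (assumed): 3 tokens, hidden size 4, embedding dim 2.
hidden_states = [[1.0, 0.0, 2.0, 0.0],
                 [0.0, 3.0, 0.0, 1.0],
                 [1.0, 1.0, 1.0, 1.0]]
weight = [[0.5, 0.0, 0.5, 0.0],   # shape: (embed_dim, hidden_size)
          [0.0, 0.5, 0.0, 0.5]]

# One unit-length embedding per input token: shape (seq_len, embed_dim).
embeddings = l2_normalize(project(hidden_states, weight))
for vec in embeddings:
    print([round(x, 4) for x in vec])
```

Because every token vector has unit length, the dot products used during late-interaction scoring are cosine similarities.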

📊 Evaluation Results

We report results on the ViDoRe benchmark suite. The tables below summarize the image-modality accuracy of webAI-ColVec1-4b on the ViDoRe V1 and V3 benchmarks, alongside other webAI ColVec1 models. Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models based on how well they perform. Models earn more points when they rank higher on a task. The model with the most total points across all tasks gets the top overall rank.
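
The Borda idea can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the classic Borda count, where a model ranked r-th out of n on a task earns n - r points, and the leaderboard's exact point scheme and tie handling may differ. The function name `borda_rank` and all scores below are invented for the example.

```python
from collections import defaultdict

def borda_rank(task_scores):
    """Rank models by Borda points.

    task_scores maps task name -> {model name: score}; higher is better.
    On each task, a model ranked r-th out of n (1-based) earns n - r points.
    Tied scores keep their insertion order, which is a simplification.
    """
    points = defaultdict(int)
    for scores in task_scores.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        n = len(ordered)
        for position, model in enumerate(ordered):
            points[model] += n - 1 - position
    return sorted(points.items(), key=lambda item: item[1], reverse=True)

# Toy scores, invented purely for illustration.
toy_scores = {
    "task_a": {"model_x": 0.80, "model_y": 0.70, "model_z": 0.60},
    "task_b": {"model_x": 0.50, "model_y": 0.65, "model_z": 0.55},
    "task_c": {"model_x": 0.95, "model_y": 0.85, "model_z": 0.90},
}
ranking = borda_rank(toy_scores)
print(ranking)
```

Because points depend only on per-task rank, a Borda ordering can differ from an ordering by average score.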

ViDoRe V3 (NDCG@10)

Model	CompSci	Energy	FinanceEn	FinanceFr	HR	Industrial	Pharma	Physics	Avg (Public)
webAI-ColVec1-9b	0.8092	0.6976	0.6827	0.5372	0.7004	0.5718	0.6732	0.4838	0.6445
nemotron-colembed-vl-8b-v2	0.7929	0.6982	0.6729	0.5154	0.6632	0.5603	0.6719	0.5084	0.6354
webAI-ColVec1-4b	0.7983	0.6869	0.6848	0.5111	0.6739	0.5573	0.6567	0.5014	0.6338
tomoro-colqwen3-embed-8b	0.7535	0.6841	0.6508	0.4910	0.6398	0.5441	0.6636	0.5013	0.6160
colqwen3.5-4.5B-v3	0.7866	0.6804	0.6406	0.4856	0.6206	0.5520	0.6559	0.5034	0.6156

ViDoRe V1 (NDCG@5)

Model	ArxivQA	DocVQA	InfoVQA	Shift	Syn-AI	Syn-Eng	Syn-Gov	Syn-Health	TabFQuAD	Tatdqa	Avg
nemotron-colembed-vl-8b-v2	0.9310	0.6810	0.9460	0.9330	1.0000	0.9790	0.9890	0.9960	0.9770	0.8340	0.9270
llama-nemotron-colembed-vl-3b-v2	0.9040	0.6720	0.9470	0.9200	1.0000	0.9800	0.9800	0.9890	0.9730	0.8100	0.9170
nemotron-colembed-vl-4b-v2	0.9200	0.6740	0.9330	0.9230	0.9930	0.9620	0.9800	0.9850	0.9810	0.8120	0.9160
colqwen3.5-4.5B-v3	0.9190	0.6660	0.9360	0.9020	1.0000	0.9710	0.9730	0.9890	0.9590	0.8400	0.9150
webAI-ColVec1-9b	0.9413	0.6882	0.9505	0.8758	0.9963	0.9739	0.9839	0.9926	0.9460	0.7956	0.9144
Ops-Colqwen3-4B	0.9180	0.6650	0.9400	0.9080	0.9960	0.9730	0.9800	0.9960	0.9360	0.8240	0.9140
SauerkrautLM-ColQwen3-8b-v0.1	0.9380	0.6470	0.9450	0.9040	0.9860	0.9650	0.9680	0.9930	0.9220	0.8400	0.9110
webAI-ColVec1-4b	0.9258	0.6773	0.9412	0.8764	1.0000	0.9703	0.9721	1.0000	0.9414	0.7950	0.9100
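
Both tables report NDCG@k. For reference, here is a minimal sketch of the metric for a single query, assuming the common formulation with linear gain and a log2 rank discount. The benchmark's own evaluation code may differ in details such as the gain function or tie handling, and the function name `ndcg_at_k` and the example labels are illustrative.

```python
import math

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k for one query.

    ranked_relevances: graded relevance labels of the candidate documents,
    in the order the retriever returned them (best-scored first). The list
    should cover every candidate so the ideal ordering is well defined.
    """
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

first = ndcg_at_k([1, 0, 0, 0, 0], k=5)   # the only relevant page is ranked first
second = ndcg_at_k([0, 1, 0, 0, 0], k=5)  # the only relevant page is ranked second
print(first, second)
```

The reported benchmark numbers are averages of this per-query value over all queries in a task.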

💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

`process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

Parameter	Type	Description
`images`	`List[PIL.Image.Image]`	Document page images. Each image is automatically converted to RGB.
`max_length`	`int` or `None`	Optional; defaults to `None`.

batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)

`process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

Parameter	Type	Description
`texts`	`List[str]`	Natural-language query strings.
`max_length`	`int` or `None`	Optional; defaults to `None`.

batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)

`score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style MaxSim late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

Parameter	Type	Description
`qs`	`List[Tensor]` or `Tensor`	Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.
`ps`	`List[Tensor]` or `Tensor`	Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.
`batch_size`	`int`	Number of queries processed per inner loop iteration (default: `128`).
`device`	`str` or `torch.device`	Optional; defaults to `None`.

Returns a torch.Tensor of shape (n_queries, n_passages) on CPU in float32. Higher scores indicate greater relevance.

scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
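
To make the scoring rule concrete, the following plain-Python reference computes MaxSim exactly as described above: for each query token, take the maximum dot product over passage tokens, then sum the maxima. It is a deliberately unoptimized sketch, not the processor's implementation; the function names and toy vectors are assumptions for illustration.

```python
def maxsim_score(query_vecs, passage_vecs):
    """Sum over query tokens of the max dot product with any passage token."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, p)) for p in passage_vecs)
    return total

def score_matrix(queries, passages):
    """Return scores[i][j] = MaxSim(query i, passage j)."""
    return [[maxsim_score(q, p) for p in passages] for q in queries]

# Toy unit-length 2-D token embeddings (invented for illustration).
query_a = [[1.0, 0.0], [0.0, 1.0]]                    # 2 query tokens
passage_0 = [[1.0, 0.0], [0.6, 0.8]]                  # 2 passage tokens
passage_1 = [[0.0, -1.0], [-1.0, 0.0], [0.6, 0.8]]    # 3 passage tokens

toy_scores = score_matrix([query_a], [passage_0, passage_1])
best = max(range(len(toy_scores[0])), key=toy_scores[0].__getitem__)
print(toy_scores, best)
```

Note that queries and passages may have different token counts; only the embedding dimension has to match.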

Prerequisites

We strongly recommend installing flash-attn. If it is not installed, set attn_implementation="sdpa" when loading the model.

Currently, only torch==2.8.0 is supported. For newer PyTorch versions, build FlashAttention manually; otherwise throughput may be low. Note that torch==2.8.0 supports Python versions >= 3.9 and <= 3.13.

pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
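
If flash-attn may be missing in some environments, the attention backend can be picked at runtime. The sketch below uses only the standard library; the helper name `pick_attn_implementation` is hypothetical, and the returned string is intended to be passed as `attn_implementation` to `AutoModel.from_pretrained`. It only checks that the package is importable, not that it works with the installed GPU or PyTorch build.

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

attn_impl = pick_attn_implementation()
print(attn_impl)
```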

Inference Code

import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing"
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)

⚖️ Strengths & Limitations

Strengths

Performance: Strong retrieval performance on the ViDoRe V1 and V3 benchmarks (average NDCG@5 of 0.9100 on V1 and average NDCG@10 of 0.6338 on V3).
Complex Layouts: Excellent handling of chart-rich PDFs and domain-specific documents.
End-to-end Retrieval: Capable of OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM generating summaries for retrieval.
Multilingualism: Strong performance on non-English document inputs.

Limitations

Storage Cost: Because one embedding is stored per token, the index is still larger than single-vector baselines despite the compact 640-dim token embeddings.
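
A back-of-the-envelope estimate shows where the storage cost comes from. The 640-dim embedding size and bfloat16 precision come from the specifications above; the number of tokens per page (1,000) and the single-vector dimension (1,024) are hypothetical values chosen only for illustration, so the resulting ratio is not a measured figure.

```python
EMBED_DIM = 640            # per-token embedding size (from the model specifications)
BYTES_PER_VALUE = 2        # bfloat16
TOKENS_PER_PAGE = 1000     # hypothetical; depends on image size and processor settings
SINGLE_VECTOR_DIM = 1024   # hypothetical single-vector baseline

# Multi-vector: one embedding per token. Single-vector: one embedding per page.
multi_vector_bytes = TOKENS_PER_PAGE * EMBED_DIM * BYTES_PER_VALUE
single_vector_bytes = SINGLE_VECTOR_DIM * BYTES_PER_VALUE
ratio = multi_vector_bytes / single_vector_bytes

print(f"multi-vector:  {multi_vector_bytes / 1e6:.2f} MB per page")
print(f"single-vector: {single_vector_bytes / 1e3:.2f} KB per page")
print(f"ratio:         {ratio:.0f}x")
```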

License & Data

LICENSE

📚 Citation

If you use this model, please cite:

@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}