---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
- multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3_5
- multilingual-embedding
---

# webAI-Official/webAI-ColVec1-4b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDF pages) into aligned multi-vector embeddings.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA), [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m), [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and proprietary domain-specific synthetic data.

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications

| Feature | Detail |
| --------------------- | ------------------------------------------------------------------------- |
| **Architecture** | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
| **Methodology** | ColBERT-style Late Interaction (MaxSim scoring) |
| **Output** | Multi-vector (Seq_Len × *640*), L2-normalized |
| **Modalities** | Text Queries, Images (Documents) |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer |
| **Precision** | `bfloat16` weights, FlashAttention 2 enabled |

---

### Key Properties

- **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.

- **Projection Head:** A single linear layer projects final hidden states → compact embedding space (*hidden_size → 640 dim*).
  - No activation
  - Fully trained
  - Replaces the LM head for retrieval

- **Multi-Vector Representation:** Each token becomes an embedding, enabling fine-grained token-level matching instead of single-vector pooling.

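The late-interaction (MaxSim) scoring behind these properties can be illustrated with a minimal, framework-free sketch. The tiny 2-dim vectors below are invented stand-ins for the model's L2-normalized 640-dim token embeddings:

```python
# Minimal MaxSim sketch: for each query token, take the maximum dot product
# against all document tokens, then sum those maxima into one score.
def maxsim_score(query_vecs, doc_vecs):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query tokens, three document tokens (toy 2-dim embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

print(maxsim_score(query, doc))  # ≈ 1.8 (0.8 from token 1 + 1.0 from token 2)
```

Because matching happens per token rather than on one pooled vector, a query term can latch onto the single most relevant region of a document page.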
## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality retrieval performance of `webAI-ColVec1-4b` on the ViDoRe V1 and V3 benchmarks, alongside other webAI `ColVec1` models. Note that the (M)MTEB leaderboards use Borda ranking: each task acts as a voter that ranks models by performance, models earn more points for higher ranks, and the model with the most total points across all tasks takes the top overall rank.

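As a hedged illustration of the Borda scheme described above (the task names and scores below are invented, not benchmark numbers):

```python
# Toy Borda count: each task ranks the models by score; last place earns 0
# points, first place earns n-1; total points decide the overall order.
def borda_rank(task_scores):
    """task_scores: {task: {model: score}} -> models sorted by Borda points."""
    points = {}
    for scores in task_scores.values():
        ordered = sorted(scores, key=scores.get)  # ascending: worst first
        for pts, model in enumerate(ordered):
            points[model] = points.get(model, 0) + pts
    return sorted(points, key=points.get, reverse=True)

toy = {
    "task_a": {"model_x": 0.91, "model_y": 0.88, "model_z": 0.85},
    "task_b": {"model_x": 0.70, "model_y": 0.72, "model_z": 0.65},
    "task_c": {"model_x": 0.80, "model_y": 0.79, "model_z": 0.70},
}
print(borda_rank(toy))  # → ['model_x', 'model_y', 'model_z']
```

Note that `model_x` wins overall despite losing `task_b`, because ranks, not raw score margins, are what count.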
### ViDoRe V3 (NDCG@10)

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg (Public)** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.8092** | 0.6976 | 0.6827 | **0.5372** | **0.7004** | **0.5718** | **0.6732** | 0.4838 | **0.6445** |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.7929 | **0.6982** | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | **0.5084** | 0.6354 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.7983 | 0.6869 | **0.6848** | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
| [tomoro-colqwen3-embed-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |

### ViDoRe V1 (NDCG@5)

| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
| :------------------------------------------------------------------------------------------------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.9310 | 0.6810 | 0.9460 | **0.9330** | **1.0000** | 0.9790 | **0.9890** | 0.9960 | 0.9770 | 0.8340 | **0.9270** |
| [llama-nemotron-colembed-vl-3b-v2](https://huggingface.co/nvidia/llama-nemotron-colembed-vl-3b-v2) | 0.9040 | 0.6720 | 0.9470 | 0.9200 | **1.0000** | **0.9800** | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
| [nemotron-colembed-vl-4b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2) | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | **0.9810** | 0.8120 | 0.9160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.9190 | 0.6660 | 0.9360 | 0.9020 | **1.0000** | 0.9710 | 0.9730 | 0.9890 | 0.9590 | **0.8400** | 0.9150 |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.9413** | **0.6882** | **0.9505** | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
| [Ops-Colqwen3-4B](https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B) | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
| [SauerkrautLM-ColQwen3-8b-v0.1](https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1) | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | 0.8400 | 0.9110 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.9258 | 0.6773 | 0.9412 | 0.8764 | **1.0000** | 0.9703 | 0.9721 | **1.0000** | 0.9414 | 0.7950 | 0.9100 |

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter | Type | Description |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images` | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int` or `None` | Optional maximum sequence length (default: `None`). |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter | Type | Description |
| ------------ | --------------- | --------------------------------------------------- |
| `texts` | `List[str]` | Natural-language query strings. |
| `max_length` | `int` or `None` | Optional maximum sequence length (default: `None`). |

```python
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter | Type | Description |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs` | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`. |
| `ps` | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`. |
| `batch_size` | `int` | Number of queries processed per inner loop iteration (default: `128`). |
| `device` | `str` or `torch.device` | Optional device on which scores are computed (default: `None`). |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```

### Prerequisites

We strongly recommend installing `flash-attn`. If it is unavailable, pass `attn_implementation="sdpa"` instead.

Currently only `torch==2.8.0` is supported. For newer PyTorch versions, please build flash-attention from source; otherwise throughput may degrade. Also note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing"
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** State-of-the-art retrieval on the ViDoRe V1 & V3 benchmarks, with particularly strong results on multimodal document retrieval.
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without relying on an intermediate vision LLM to generate summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Multi-vector indexes are still larger than single-vector baselines, despite the compact 640-dim token embeddings.

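To make the storage trade-off concrete, here is a back-of-envelope estimate. The per-page token count is a hypothetical assumption (it varies with image resolution); the 640-dim embeddings and `bfloat16` precision match the specifications above:

```python
# Rough index-size estimate per document page.
tokens_per_page = 768   # assumed illustrative value; depends on page resolution
embed_dim = 640         # projection head output dimension (from the spec table)
bytes_per_value = 2     # bfloat16

multi_vector_mib = tokens_per_page * embed_dim * bytes_per_value / 1024**2
single_vector_mib = embed_dim * bytes_per_value / 1024**2  # one pooled vector

print(f"multi-vector: {multi_vector_mib:.3f} MiB/page, "
      f"single-vector: {single_vector_mib:.5f} MiB/page")
```

Under these assumptions a multi-vector index is a few hundred times larger per page than a single pooled vector, which is the cost paid for token-level matching.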
### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```