---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
  - multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
  - text
  - image
  - video
  - multimodal-embedding
  - vidore
  - colpali
  - colqwen3_5
  - multilingual-embedding
---

# webAI-Official/webAI-ColVec1-4b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA), [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m), [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and proprietary domain-specific synthetic data.

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications


| Feature               | Detail                                                                    |
| --------------------- | ------------------------------------------------------------------------- |
| **Architecture**      | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
| **Methodology**       | ColBERT-style Late Interaction (MaxSim scoring)                           |
| **Output**            | Multi-vector (Seq_Len × *640*), L2-normalized                             |
| **Modalities**        | Text Queries, Images (Documents)                                          |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer                            |
| **Precision**         | `bfloat16` weights, FlashAttention 2 enabled                              |


---

### Key Properties

- **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream, so there are no separate query and document encoders.

- **Projection Head:** A single linear layer projects final hidden states → compact embedding space (*hidden_size → 640 dim*).
  - No activation
  - Fully trained
  - Replaces LM head for retrieval

- **Multi-Vector Representation:** Each token becomes an embedding → enables fine-grained token-level matching instead of single-vector pooling.
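
As a rough illustration of the projection head and multi-vector output described above, the sketch below applies a single linear map to per-token hidden states and L2-normalizes each row. It is not the model's actual code: the helper name `project_tokens`, the toy `hidden_size` of 16, the random weights, and the absence of a bias term are all assumptions made for the example. Only the 640-dim output and the L2 normalization come from the specification table.

```python
import numpy as np

EMBED_DIM = 640  # output dimension stated in the specification table


def project_tokens(hidden_states: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Map (seq_len, hidden_size) states to L2-normalized (seq_len, EMBED_DIM) vectors."""
    projected = hidden_states @ weight.T  # one linear layer, no activation
    norms = np.linalg.norm(projected, axis=-1, keepdims=True)
    return projected / np.clip(norms, 1e-12, None)  # guard against zero-norm rows


rng = np.random.default_rng(0)
hidden_size = 16  # toy value (assumption); the real hidden size is larger
hidden_states = rng.standard_normal((5, hidden_size))   # 5 tokens
weight = rng.standard_normal((EMBED_DIM, hidden_size))  # stand-in for trained weights

token_embeddings = project_tokens(hidden_states, weight)
print(token_embeddings.shape)  # (5, 640): one unit-length vector per token
```

Because every row has unit length, the dot product between a query token and a document token is their cosine similarity.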

## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality retrieval quality (NDCG) of `webAI-ColVec1-4b` on the ViDoRe V1 and V3 benchmarks, alongside other webAI `ColVec1` models and publicly available baselines. Note that the (M)MTEB leaderboards rank models by Borda count rather than by average score: each task acts like a voter that ranks the models by how well they perform on it, a model earns more points the higher it ranks on a task, and the model with the most total points across all tasks gets the top overall rank.
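
To make the Borda idea concrete, here is a simplified sketch applied to the five models and eight public tasks of the ViDoRe V3 table in this section. The function `borda_points` is a hypothetical helper, and the scheme (worst model on a task gets 0 points, best gets `n_models - 1`) is a plain textbook Borda count; the actual leaderboard computation, including tie handling and the full set of models, may differ.

```python
# NDCG@10 scores copied from the ViDoRe V3 table in this section, in column order:
# CompSci, Energy, FinanceEn, FinanceFr, HR, Industrial, Pharma, Physics
V3_SCORES = {
    "webAI-ColVec1-9b":           [0.8092, 0.6976, 0.6827, 0.5372, 0.7004, 0.5718, 0.6732, 0.4838],
    "nemotron-colembed-vl-8b-v2": [0.7929, 0.6982, 0.6729, 0.5154, 0.6632, 0.5603, 0.6719, 0.5084],
    "webAI-ColVec1-4b":           [0.7983, 0.6869, 0.6848, 0.5111, 0.6739, 0.5573, 0.6567, 0.5014],
    "tomoro-colqwen3-embed-8b":   [0.7535, 0.6841, 0.6508, 0.4910, 0.6398, 0.5441, 0.6636, 0.5013],
    "colqwen3.5-4.5B-v3":         [0.7866, 0.6804, 0.6406, 0.4856, 0.6206, 0.5520, 0.6559, 0.5034],
}


def borda_points(scores_by_model: dict[str, list[float]]) -> dict[str, int]:
    """Simplified Borda count: per task, the worst model gets 0 points, the best n-1."""
    models = list(scores_by_model)
    n_tasks = len(next(iter(scores_by_model.values())))
    points = {model: 0 for model in models}
    for task in range(n_tasks):
        # Sort ascending by score so the list index equals the points awarded.
        ranked = sorted(models, key=lambda m: scores_by_model[m][task])
        for pts, model in enumerate(ranked):
            points[model] += pts
    return points


points = borda_points(V3_SCORES)
for model, pts in sorted(points.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {pts}")
```

Restricted to these five models, the Borda order happens to match the order by average score, but the two can diverge when a model wins a few tasks by a wide margin and ranks low elsewhere.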

### ViDoRe V3 (NDCG@10)

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg (Public)** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.8092** | 0.6976 | 0.6827 | **0.5372** | **0.7004** | **0.5718** | **0.6732** | 0.4838 | **0.6445** |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.7929 | **0.6982** | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | **0.5084** | 0.6354 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.7983 | 0.6869 | **0.6848** | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
| [tomoro-colqwen3-embed-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |


### ViDoRe V1 (NDCG@5)

| Model                                                                                              | ArxivQA    | DocVQA     | InfoVQA    | Shift      | Syn-AI     | Syn-Eng    | Syn-Gov    | Syn-Health | TabFQuAD   | Tatdqa     | **Avg**    |
| :------------------------------------------------------------------------------------------------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2)         | 0.9310 | 0.6810 | 0.9460     | **0.9330** | **1.0000** | 0.9790     | **0.9890** | 0.9960 | 0.9770 | 0.8340     | **0.9270** |
| [llama-nemotron-colembed-vl-3b-v2](https://huggingface.co/nvidia/llama-nemotron-colembed-vl-3b-v2) | 0.9040     | 0.6720     | 0.9470 | 0.9200     | **1.0000** | **0.9800** | 0.9800     | 0.9890     | 0.9730     | 0.8100     | 0.9170     |
| [nemotron-colembed-vl-4b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2)             | 0.9200     | 0.6740     | 0.9330     | 0.9230     | 0.9930     | 0.9620     | 0.9800     | 0.9850     | **0.9810** | 0.8120     | 0.9160     |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3)                       | 0.9190     | 0.6660     | 0.9360     | 0.9020     | **1.0000** | 0.9710     | 0.9730     | 0.9890     | 0.9590     | **0.8400** | 0.9150     |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)**         | **0.9413** | **0.6882** | **0.9505**     | 0.8758 | 0.9963 | 0.9739     | 0.9839 | 0.9926 | 0.9460 | 0.7956     | 0.9144 |
| [Ops-Colqwen3-4B](https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B)                            | 0.9180     | 0.6650     | 0.9400     | 0.9080     | 0.9960     | 0.9730     | 0.9800     | 0.9960 | 0.9360     | 0.8240     | 0.9140     |
| [SauerkrautLM-ColQwen3-8b-v0.1](https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1)         | 0.9380 | 0.6470 | 0.9450     | 0.9040 | 0.9860 | 0.9650     | 0.9680 | 0.9930 | 0.9220 | **0.8400** | 0.9110 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)**         | 0.9258 | 0.6773 | 0.9412     | 0.8764 | **1.0000** | 0.9703     | 0.9721 | **1.0000** | 0.9414 | 0.7950     | 0.9100 |

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter    | Type                    | Description                                                         |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images`     | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int` or `None`         | Optional maximum sequence length (default: `None`).                 |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter    | Type        | Description                     |
| ------------ | ----------- | ------------------------------- |
| `texts`      | `List[str]` | Natural-language query strings. |
| `max_length` | `int` or `None` | Optional maximum sequence length (default: `None`). |

```python
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.


| Parameter    | Type                       | Description                                                            |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs`         | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.      |
| `ps`         | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.    |
| `batch_size` | `int`                      | Number of queries processed per inner loop iteration (default: `128`). |
| `device`     | `str` or `torch.device`    | Device on which the scores are computed (default: `None`).             |


Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```
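
For readers who want to see the MaxSim rule spelled out, the following NumPy sketch scores each (query, passage) pair exactly as described above: take the dot product of every query token with every passage token, keep the maximum per query token, and sum. It is an illustration only, not the processor's implementation; `maxsim_score` and `maxsim_matrix` are hypothetical names, the 2-dim embeddings are toy values, and batching, padding, and device handling are left out.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """Late-interaction score for one (query, passage) pair.

    query_emb:   (seq_len_q, embed_dim)
    passage_emb: (seq_len_p, embed_dim)
    """
    sim = query_emb @ passage_emb.T      # (seq_len_q, seq_len_p) dot products
    return float(sim.max(axis=1).sum())  # best passage token per query token, summed


def maxsim_matrix(queries, passages) -> np.ndarray:
    """Score every query against every passage -> (n_queries, n_passages)."""
    return np.array(
        [[maxsim_score(q, p) for p in passages] for q in queries],
        dtype=np.float32,
    )


# Toy 2-dim, unit-length embeddings (hypothetical values).
q = np.array([[1.0, 0.0], [0.0, 1.0]])
p_match = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
p_other = np.array([[0.6, 0.8]])

scores = maxsim_matrix([q], [p_match, p_other])
print(scores)  # scores[0, 0] is 2.0 and scores[0, 1] is about 1.4
```

Since the sum runs over query tokens, scores grow with query length and are meaningful for ranking passages against the same query, not for comparing across queries.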

### Prerequisites

We strongly recommend installing `flash-attn`. If it is not installed, load the model with `attn_implementation="sdpa"` instead.

Currently, only `torch==2.8.0` is supported. For newer PyTorch versions, build FlashAttention manually; otherwise throughput may be low. Note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```
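
One possible way to apply the fallback automatically is to check whether the `flash_attn` package is importable before loading the model. The helper `pick_attn_implementation` below is a hypothetical convenience, not part of this repository, and it only checks that the package is present; it does not verify that the installed build matches your GPU or PyTorch version.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


ATTN_IMPL = pick_attn_implementation()
print(ATTN_IMPL)
```

The result can then be passed as `attn_implementation=ATTN_IMPL` to `AutoModel.from_pretrained` in the inference code.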

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",  # use "sdpa" if flash-attn is not installed
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing"
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** Competitive retrieval performance on the ViDoRe V1 and V3 benchmarks (0.9100 average NDCG@5 on V1, 0.6338 average NDCG@10 on V3).
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM to generate summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** The index stores one 640-dim vector per token, so it remains larger than single-vector baselines despite the smaller per-token dimension.
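
A back-of-the-envelope calculation shows where the storage cost comes from. The 640-dim output and `bfloat16` precision are taken from the specification table; the token count per page (1024) and the single-vector baseline dimension (4096) are hypothetical values chosen only for illustration, and real token counts depend on image resolution.

```python
EMBED_DIM = 640      # per-token embedding size, from the specification table
BYTES_PER_VALUE = 2  # bfloat16 uses 2 bytes per value


def multi_vector_bytes(num_tokens: int) -> int:
    """Raw size of one document's multi-vector embedding."""
    return num_tokens * EMBED_DIM * BYTES_PER_VALUE


def single_vector_bytes(dim: int) -> int:
    """Raw size of one single-vector embedding of the given dimension."""
    return dim * BYTES_PER_VALUE


tokens_per_page = 1024  # hypothetical token count for one page image
single_dim = 4096       # hypothetical single-vector baseline dimension

page_bytes = multi_vector_bytes(tokens_per_page)
baseline_bytes = single_vector_bytes(single_dim)
print(page_bytes)                    # 1310720 bytes, i.e. 1.25 MiB per page
print(page_bytes // baseline_bytes)  # 160 times the hypothetical baseline
```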

### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```