---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
- multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3_5
- multilingual-embedding
---

# webAI-Official/webAI-ColVec1-4b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA), [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m), [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and proprietary domain-specific synthetic data.

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications

| Feature               | Detail                                                                    |
| --------------------- | ------------------------------------------------------------------------- |
| **Architecture**      | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
| **Methodology**       | ColBERT-style Late Interaction (MaxSim scoring)                           |
| **Output**            | Multi-vector (Seq_Len × *640*), L2-normalized                             |
| **Modalities**        | Text Queries, Images (Documents)                                          |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer                            |
| **Precision**         | `bfloat16` weights, FlashAttention 2 enabled                              |

---

### Key Properties

- **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.
- **Projection Head:** A single linear layer projects the final hidden states into a compact embedding space (*hidden_size → 640 dim*).
  - No activation
  - Fully trained
  - Replaces LM head for retrieval
- **Multi-Vector Representation:** Each token becomes an embedding, which enables fine-grained token-level matching instead of single-vector pooling.

## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality retrieval scores of `webAI-ColVec1-4b` on the ViDoRe V1 (NDCG@5) and V3 (NDCG@10) benchmarks, alongside other webAI `ColVec1` models and publicly available baselines.

Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models by how well they perform on it. Models earn more points the higher they rank on a task, and the model with the most total points across all tasks gets the top overall rank.

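The Borda ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual implementation: the point scheme (`n - 1` points for first place down to `0` for last) and the tie handling (input order) are assumptions, and `borda_points` is a hypothetical helper name. The scores are three columns taken from the ViDoRe V3 table in this card.

```python
from collections import defaultdict


def borda_points(task_scores):
    """Map {task: {model: score}} to {model: total Borda points}.

    Assumed scheme: on each task the best model earns n - 1 points and the
    worst earns 0. Ties fall back to input order (a simplification).
    """
    totals = defaultdict(int)
    for scores in task_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)  # best first
        n = len(ranked)
        for position, model in enumerate(ranked):
            totals[model] += n - 1 - position
    return dict(totals)


# Three ViDoRe V3 tasks (NDCG@10) copied from the table in this card.
task_scores = {
    "CompSci": {
        "webAI-ColVec1-9b": 0.8092,
        "nemotron-colembed-vl-8b-v2": 0.7929,
        "webAI-ColVec1-4b": 0.7983,
    },
    "Energy": {
        "webAI-ColVec1-9b": 0.6976,
        "nemotron-colembed-vl-8b-v2": 0.6982,
        "webAI-ColVec1-4b": 0.6869,
    },
    "Physics": {
        "webAI-ColVec1-9b": 0.4838,
        "nemotron-colembed-vl-8b-v2": 0.5084,
        "webAI-ColVec1-4b": 0.5014,
    },
}

points = borda_points(task_scores)
leaderboard = sorted(points, key=points.get, reverse=True)
print(points)
print(leaderboard)
```

Because only per-task rank positions count, a model's Borda rank can differ from its rank by average score, so the two orderings should not be expected to match.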
### ViDoRe V3 (NDCG@10)

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg (Public)** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.8092** | 0.6976 | 0.6827 | **0.5372** | **0.7004** | **0.5718** | **0.6732** | 0.4838 | **0.6445** |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.7929 | **0.6982** | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | **0.5084** | 0.6354 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.7983 | 0.6869 | **0.6848** | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
| [tomoro-colqwen3-embed-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |

### ViDoRe V1 (NDCG@5)

| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.9310 | 0.6810 | 0.9460 | **0.9330** | **1.0000** | 0.9790 | **0.9890** | 0.9960 | 0.9770 | 0.8340 | **0.9270** |
| [llama-nemotron-colembed-vl-3b-v2](https://huggingface.co/nvidia/llama-nemotron-colembed-vl-3b-v2) | 0.9040 | 0.6720 | 0.9470 | 0.9200 | **1.0000** | **0.9800** | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
| [nemotron-colembed-vl-4b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2) | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | **0.9810** | 0.8120 | 0.9160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.9190 | 0.6660 | 0.9360 | 0.9020 | **1.0000** | 0.9710 | 0.9730 | 0.9890 | 0.9590 | **0.8400** | 0.9150 |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.9413** | **0.6882** | **0.9505** | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
| [Ops-Colqwen3-4B](https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B) | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
| [SauerkrautLM-ColQwen3-8b-v0.1](https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1) | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | **0.8400** | 0.9110 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.9258 | 0.6773 | 0.9412 | 0.8764 | **1.0000** | 0.9703 | 0.9721 | **1.0000** | 0.9414 | 0.7950 | 0.9100 |

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter    | Type                    | Description                                                         |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images`     | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int` or `None`         | Optional; defaults to `None`.                                       |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter    | Type            | Description                     |
| ------------ | --------------- | ------------------------------- |
| `texts`      | `List[str]`     | Natural-language query strings. |
| `max_length` | `int` or `None` | Optional; defaults to `None`.   |

```python
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter    | Type                       | Description                                                            |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs`         | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.      |
| `ps`         | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.    |
| `batch_size` | `int`                      | Number of queries processed per inner loop iteration (default: `128`). |
| `device`     | `str` or `torch.device`    | Optional; defaults to `None`.                                          |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```

### Prerequisites

We strongly recommend installing `flash-attn`. If it is not installed, set `attn_implementation="sdpa"` when loading the model. Currently we only support `torch==2.8.0`; for newer PyTorch versions, please build FlashAttention manually, otherwise throughput may be low. Note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

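As a rough sketch of the fallback mentioned above, the attention backend can be picked at load time instead of being hard-coded. `pick_attn_implementation` is a hypothetical helper, not part of this model's API, and it assumes FlashAttention 2 is only usable when the `flash_attn` package is importable and a CUDA device is present.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" when it looks usable, otherwise "sdpa".

    Assumption: FlashAttention 2 needs both the `flash_attn` package and a
    CUDA device; anything else falls back to PyTorch SDPA.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


# Typical call site: pick_attn_implementation(torch.cuda.is_available())
print(pick_attn_implementation(cuda_available=False))
```

The returned string can then be passed as `attn_implementation=...` to `AutoModel.from_pretrained`.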
```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",  # use "sdpa" if flash-attn is not installed
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing"
]

docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue  # retry with the next set of headers
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)  # one (seq_len, embed_dim) tensor per query
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)  # one (seq_len, embed_dim) tensor per document
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** Competitive retrieval performance on the ViDoRe V1 and V3 benchmarks (average NDCG@5 of 0.9100 on V1 and average NDCG@10 of 0.6338 on V3).
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM generating a summary for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Multi-vector embeddings (one 640-dim vector per token) still require more storage than single-vector baselines, despite the compact token dimension.

### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```