webAI-Official/webAI-ColVec1-4b

⚡ Summary

webAI-Official/webAI-ColVec1-4b is a state-of-the-art ColBERT-style multimodal embedding model based on Qwen/Qwen3.5-4B. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

The model has been fine-tuned on a merged multimodal dataset of ~2M question-image pairs, including DocVQA, PubTables-1M, TAT-QA, ViDoRe-ColPali-Training, VDR Multilingual, VisRAG-Ret-Train-In-domain-data, VisRAG-Ret-Train-Synthetic-data, and proprietary domain-specific synthetic data.

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves competitive performance across ViDoRe V1 & V3 (English and multilingual).

🛠️ Model Specifications

| Feature | Detail |
|---|---|
| Architecture | Qwen3.5-4B Vision-Language Model (VLM) + 640-dim linear projection head |
| Methodology | ColBERT-style late interaction (MaxSim scoring) |
| Output | Multi-vector (seq_len × 640), L2-normalized |
| Modalities | Text queries, images (documents) |
| Training Strategy | LoRA adapters + fully trained projection layer |
| Precision | bfloat16 weights, FlashAttention 2 enabled |

Key Properties

  • Unified Encoder (Single-Tower): A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.

  • Projection Head: A single linear layer projects the final hidden states into a compact embedding space (hidden_size → 640 dim). The layer has no activation, is fully trained, and replaces the LM head for retrieval.

  • Multi-Vector Representation: Each token becomes an embedding → enables fine-grained token-level matching instead of single-vector pooling.
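
The projection and normalization steps described in the properties above can be sketched in PyTorch as follows. This is a minimal illustration rather than the model's actual code: the class name `ProjectionHead`, the toy `hidden_size` of 64, the use of a bias term, and the random input are all assumptions. Only the 640-dim output, the absence of an activation, and the L2 normalization come from the specification table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Hypothetical stand-in for the retrieval head: one linear layer, no activation."""

    def __init__(self, hidden_size: int, embed_dim: int = 640):
        super().__init__()
        # Whether the real layer has a bias is not stated; nn.Linear's default is assumed.
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the language model
        embeddings = self.proj(hidden_states)
        # L2-normalize every token vector so dot products equal cosine similarities
        return F.normalize(embeddings, p=2, dim=-1)


torch.manual_seed(0)
head = ProjectionHead(hidden_size=64)   # toy size, not the real hidden_size
hidden = torch.randn(2, 5, 64)          # 2 sequences of 5 tokens each
multi_vectors = head(hidden)
print(multi_vectors.shape)              # torch.Size([2, 5, 640])
```

Each of the 5 tokens per sequence keeps its own 640-dim unit vector, which is what makes token-level matching possible at scoring time.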

📊 Evaluation Results

We report results on the ViDoRe benchmark suite. The tables below summarize the image-modality accuracy of webAI-ColVec1-4b on the ViDoRe V1 and V3 benchmarks, alongside other webAI ColVec1 models. Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models based on how well they perform. Models earn more points when they rank higher on a task. The model with the most total points across all tasks gets the top overall rank.
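
To illustrate the Borda ranking described above, here is a minimal sketch with made-up scores. The function name `borda_points`, the model and task names, and the point scheme (the best of n models earns n - 1 points, the worst earns 0) are assumptions for illustration; the exact point and tie-breaking rules used by the leaderboards are not reproduced here.

```python
def borda_points(task_scores):
    """task_scores: {task: {model: score}} -> {model: total Borda points}."""
    totals = {}
    for scores in task_scores.values():
        # Each task "votes" by ranking models from best to worst score
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for position, model in enumerate(ranked):
            # Best model earns n - 1 points, worst earns 0
            totals[model] = totals.get(model, 0) + (n - 1 - position)
    return totals


# Toy scores, not real benchmark results
toy = {
    "task_a": {"model_x": 0.80, "model_y": 0.70, "model_z": 0.60},
    "task_b": {"model_x": 0.50, "model_y": 0.65, "model_z": 0.55},
    "task_c": {"model_x": 0.95, "model_y": 0.85, "model_z": 0.90},
}
totals = borda_points(toy)
leaderboard = sorted(totals, key=totals.get, reverse=True)
print(totals)       # {'model_x': 4, 'model_y': 3, 'model_z': 2}
print(leaderboard)  # ['model_x', 'model_y', 'model_z']
```

Because only per-task ranks count, a model's overall position can differ from its position when sorted by average score.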

ViDoRe V3 (NDCG@10)

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg (Public) |
|---|---|---|---|---|---|---|---|---|---|
| webAI-ColVec1-9b | 0.8092 | 0.6976 | 0.6827 | 0.5372 | 0.7004 | 0.5718 | 0.6732 | 0.4838 | 0.6445 |
| nemotron-colembed-vl-8b-v2 | 0.7929 | 0.6982 | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | 0.5084 | 0.6354 |
| webAI-ColVec1-4b | 0.7983 | 0.6869 | 0.6848 | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
| tomoro-colqwen3-embed-8b | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
| colqwen3.5-4.5B-v3 | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |

ViDoRe V1 (NDCG@5)

| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| nemotron-colembed-vl-8b-v2 | 0.9310 | 0.6810 | 0.9460 | 0.9330 | 1.0000 | 0.9790 | 0.9890 | 0.9960 | 0.9770 | 0.8340 | 0.9270 |
| llama-nemotron-colembed-vl-3b-v2 | 0.9040 | 0.6720 | 0.9470 | 0.9200 | 1.0000 | 0.9800 | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
| nemotron-colembed-vl-4b-v2 | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | 0.9810 | 0.8120 | 0.9160 |
| colqwen3.5-4.5B-v3 | 0.9190 | 0.6660 | 0.9360 | 0.9020 | 1.0000 | 0.9710 | 0.9730 | 0.9890 | 0.9590 | 0.8400 | 0.9150 |
| webAI-ColVec1-9b | 0.9413 | 0.6882 | 0.9505 | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
| Ops-Colqwen3-4B | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
| SauerkrautLM-ColQwen3-8b-v0.1 | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | 0.8400 | 0.9110 |
| webAI-ColVec1-4b | 0.9258 | 0.6773 | 0.9412 | 0.8764 | 1.0000 | 0.9703 | 0.9721 | 1.0000 | 0.9414 | 0.7950 | 0.9100 |
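
Both tables report NDCG at a cutoff (NDCG@10 for V3, NDCG@5 for V1). As a reminder of what the metric measures, here is a minimal sketch using the common linear-gain form, rel / log2(rank + 1). The function names and the toy relevance labels are illustrative assumptions, and the benchmark's own evaluation code may differ in details such as the gain function. The sketch also assumes the input list covers every relevant document for the query, so that the ideal ordering can be derived from it.

```python
import math


def dcg_at_k(relevances, k):
    # relevances[i] is the graded relevance of the document at rank i + 1
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the best possible ordering of the same labels
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy ranking: the two relevant pages were retrieved at ranks 2 and 5
ranked_relevance = [0, 1, 0, 0, 1]
print(round(ndcg_at_k(ranked_relevance, k=5), 4))
```

A perfect ranking scores 1.0, and relevant pages pushed further down the list are discounted logarithmically.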

💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

process_images(images, max_length=None)

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with **batch.

| Parameter | Type | Description |
|---|---|---|
| images | List[PIL.Image.Image] | Document page images. Each image is automatically converted to RGB. |
| max_length | int or None | Optional; defaults to None. |

batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)

process_queries(texts, max_length=None)

Encodes a batch of text queries into model-ready tensors.

| Parameter | Type | Description |
|---|---|---|
| texts | List[str] | Natural-language query strings. |
| max_length | int or None | Optional; defaults to None. |

batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)

score_multi_vector(qs, ps, batch_size=128, device=None)

Computes ColBERT-style MaxSim late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter | Type | Description |
|---|---|---|
| qs | List[Tensor] or Tensor | Query embeddings. Each tensor has shape (seq_len_q, embed_dim). |
| ps | List[Tensor] or Tensor | Passage embeddings. Each tensor has shape (seq_len_p, embed_dim). |
| batch_size | int | Number of queries processed per inner loop iteration (default: 128). |
| device | str or torch.device | Optional; defaults to None. |

Returns a torch.Tensor of shape (n_queries, n_passages) on CPU in float32. Higher scores indicate greater relevance.

scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
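
To make the scoring rule concrete, here is a pure-Python sketch of MaxSim on tiny 2-dim unit vectors. It is not the processor's implementation, which operates on batched tensors; the helper names `dot`, `maxsim`, and `score_matrix` and the toy vectors are assumptions for illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, passage_vecs):
    # For each query token keep its best-matching passage token, then sum the maxima
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


def score_matrix(queries, passages):
    # Row i, column j holds the MaxSim score of passage j for query i
    return [[maxsim(q, p) for p in passages] for q in queries]


# Toy embeddings: one query with 2 tokens, two passages with 3 and 1 tokens
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
passage_b = [[0.6, 0.8]]

scores = score_matrix([query], [passage_a, passage_b])
print(scores)  # passage_a scores about 2.0, passage_b about 1.4
```

Passage A wins because each query token finds an exact match in it, while passage B has only one token to match both query tokens against.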

Prerequisites

We strongly recommend installing flash-attn. If it is not installed, set attn_implementation="sdpa" when loading the model.

We currently support only torch==2.8.0. For later PyTorch versions, build flash-attn manually; otherwise throughput may be low. Note that torch==2.8.0 supports Python versions >= 3.9 and <= 3.13.

pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
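
If you are unsure whether flash-attn is available in a given environment, one option is to pick the attention backend at runtime. The sketch below is a suggestion rather than part of the model's API: the helper name `pick_attn_implementation` is hypothetical, and it only checks that the `flash_attn` package can be found, not that the installed build matches your GPU and PyTorch version.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash else "sdpa"


ATTN_IMPL = pick_attn_implementation()
print(ATTN_IMPL)
# Pass attn_implementation=ATTN_IMPL to AutoModel.from_pretrained(...)
```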

Inference Code

import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing"
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
            vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)

⚖️ Strengths & Limitations

Strengths

  • Performance: Competitive retrieval performance on the ViDoRe V1 and V3 benchmarks (0.9100 average NDCG@5 on V1 and 0.6338 average NDCG@10 on V3).
  • Complex Layouts: Excellent handling of chart-rich PDFs and domain-specific documents.
  • End-to-end Retrieval: Capable of OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM generating summaries for retrieval.
  • Multilingualism: Strong performance on non-English document inputs.

Limitations

  • Storage Cost: Still larger than single‑vector baselines despite the smaller token dimension.
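
A back-of-the-envelope estimate shows where the storage cost comes from. The numbers below are assumptions chosen for illustration: 10,000 pages, roughly 1,000 tokens per page (the real count depends on image resolution and processor settings), bfloat16 storage at 2 bytes per value, and a single-vector baseline with the same 640 dimensions.

```python
def index_size_bytes(num_pages, tokens_per_page, embed_dim=640, bytes_per_value=2):
    # One embed_dim-sized vector is stored per token of every page
    return num_pages * tokens_per_page * embed_dim * bytes_per_value


multi = index_size_bytes(num_pages=10_000, tokens_per_page=1_000)  # multi-vector index
single = index_size_bytes(num_pages=10_000, tokens_per_page=1)     # one vector per page
print(f"{multi / 1e9:.1f} GB vs {single / 1e6:.1f} MB")  # 12.8 GB vs 12.8 MB
```

Under these assumptions the multi-vector index is larger by a factor equal to the number of tokens per page.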

License & Data

LICENSE

📚 Citation

If you use this model, please cite:

@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}