---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
- multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3_5
- multilingual-embedding
---

# webAI-Official/webAI-ColVec1-4b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDF pages) into aligned multi-vector embeddings.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA), [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m), [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and proprietary domain-specific synthetic data.

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications

| Feature | Detail |
| --------------------- | ------------------------------------------------------------------------- |
| **Architecture** | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
| **Methodology** | ColBERT-style Late Interaction (MaxSim scoring) |
| **Output** | Multi-vector (Seq_Len × *640*), L2-normalized |
| **Modalities** | Text Queries, Images (Documents) |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer |
| **Precision** | `bfloat16` weights, FlashAttention 2 enabled |

---

### Key Properties

- **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.

- **Projection Head:** A single linear layer projects final hidden states → compact embedding space (*hidden_size → 640 dim*).
  - No activation
  - Fully trained
  - Replaces the LM head for retrieval

- **Multi-Vector Representation:** Each token becomes an embedding, enabling fine-grained token-level matching instead of single-vector pooling.

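The late-interaction (MaxSim) scoring behind these properties can be illustrated with a minimal, framework-free sketch. The tiny 2-dim vectors below are invented stand-ins for the model's L2-normalized 640-dim token embeddings:

```python
# Minimal MaxSim sketch: for each query token, take the maximum dot product
# against all document tokens, then sum those maxima into one score.
def maxsim_score(query_vecs, doc_vecs):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query tokens, three document tokens (toy 2-dim embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

print(maxsim_score(query, doc))  # ≈ 1.8 (0.8 from token 1 + 1.0 from token 2)
```

Because matching happens per token rather than on one pooled vector, a query term can latch onto the single most relevant region of a document page.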
## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality retrieval performance of `webAI-ColVec1-4b` on the ViDoRe V1 and V3 benchmarks, alongside other webAI `ColVec1` models. Note that the (M)MTEB leaderboards use Borda ranking: each task acts as a voter that ranks models by performance, models earn more points for higher ranks, and the model with the most total points across all tasks takes the top overall rank.

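As a hedged illustration of the Borda scheme described above (the task names and scores below are invented, not benchmark numbers):

```python
# Toy Borda count: each task ranks the models by score; last place earns 0
# points, first place earns n-1; total points decide the overall order.
def borda_rank(task_scores):
    """task_scores: {task: {model: score}} -> models sorted by Borda points."""
    points = {}
    for scores in task_scores.values():
        ordered = sorted(scores, key=scores.get)  # ascending: worst first
        for pts, model in enumerate(ordered):
            points[model] = points.get(model, 0) + pts
    return sorted(points, key=points.get, reverse=True)

toy = {
    "task_a": {"model_x": 0.91, "model_y": 0.88, "model_z": 0.85},
    "task_b": {"model_x": 0.70, "model_y": 0.72, "model_z": 0.65},
    "task_c": {"model_x": 0.80, "model_y": 0.79, "model_z": 0.70},
}
print(borda_rank(toy))  # → ['model_x', 'model_y', 'model_z']
```

Note that `model_x` wins overall despite losing `task_b`, because ranks, not raw score margins, are what count.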
### ViDoRe V3 (NDCG@10)

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg (Public)** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.8092** | 0.6976 | 0.6827 | **0.5372** | **0.7004** | **0.5718** | **0.6732** | 0.4838 | **0.6445** |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.7929 | **0.6982** | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | **0.5084** | 0.6354 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.7983 | 0.6869 | **0.6848** | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
| [tomoro-colqwen3-embed-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |

### ViDoRe V1 (NDCG@5)

| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
| :------------------------------------------------------------------------------------------------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- | :--------- |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.9310 | 0.6810 | 0.9460 | **0.9330** | **1.0000** | 0.9790 | **0.9890** | 0.9960 | 0.9770 | 0.8340 | **0.9270** |
| [llama-nemotron-colembed-vl-3b-v2](https://huggingface.co/nvidia/llama-nemotron-colembed-vl-3b-v2) | 0.9040 | 0.6720 | 0.9470 | 0.9200 | **1.0000** | **0.9800** | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
| [nemotron-colembed-vl-4b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2) | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | **0.9810** | 0.8120 | 0.9160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.9190 | 0.6660 | 0.9360 | 0.9020 | **1.0000** | 0.9710 | 0.9730 | 0.9890 | 0.9590 | **0.8400** | 0.9150 |
| **[webAI-ColVec1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.9413** | **0.6882** | **0.9505** | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
| [Ops-Colqwen3-4B](https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B) | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
| [SauerkrautLM-ColQwen3-8b-v0.1](https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1) | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | 0.8400 | 0.9110 |
| **[webAI-ColVec1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.9258 | 0.6773 | 0.9412 | 0.8764 | **1.0000** | 0.9703 | 0.9721 | **1.0000** | 0.9414 | 0.7950 | 0.9100 |

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter | Type | Description |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images` | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int` or `None` | Optional maximum sequence length (default: `None`). |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter | Type | Description |
| ------------ | --------------- | --------------------------------------------------- |
| `texts` | `List[str]` | Natural-language query strings. |
| `max_length` | `int` or `None` | Optional maximum sequence length (default: `None`). |

```python
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter | Type | Description |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs` | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`. |
| `ps` | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`. |
| `batch_size` | `int` | Number of queries processed per inner loop iteration (default: `128`). |
| `device` | `str` or `torch.device` | Optional device on which scores are computed (default: `None`). |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```

### Prerequisites

We strongly recommend installing `flash-attn`. If it is unavailable, pass `attn_implementation="sdpa"` instead.

Currently only `torch==2.8.0` is supported. For newer PyTorch versions, please build flash-attention from source; otherwise throughput may degrade. Also note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing"
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** State-of-the-art retrieval on the ViDoRe V1 & V3 benchmarks, with particularly strong results on multimodal document retrieval.
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without relying on an intermediate vision LLM to generate summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Multi-vector indexes are still larger than single-vector baselines, despite the compact 640-dim token embeddings.

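To make the storage trade-off concrete, here is a back-of-envelope estimate. The per-page token count is a hypothetical assumption (it varies with image resolution); the 640-dim embeddings and `bfloat16` precision match the specifications above:

```python
# Rough index-size estimate per document page.
tokens_per_page = 768   # assumed illustrative value; depends on page resolution
embed_dim = 640         # projection head output dimension (from the spec table)
bytes_per_value = 2     # bfloat16

multi_vector_mib = tokens_per_page * embed_dim * bytes_per_value / 1024**2
single_vector_mib = embed_dim * bytes_per_value / 1024**2  # one pooled vector

print(f"multi-vector: {multi_vector_mib:.3f} MiB/page, "
      f"single-vector: {single_vector_mib:.5f} MiB/page")
```

Under these assumptions a multi-vector index is a few hundred times larger per page than a single pooled vector, which is the cost paid for token-level matching.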
### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```