Model Description
mmbert-cap is a compact multilingual transformer for classifying text into Comparative Agendas Project (CAP) policy categories. It is designed to deliver strong, consistent performance across multiple languages and document types while remaining computationally efficient. Further details are available in the accompanying documentation.
- Model type: Multilingual transformer (mmbert-small, ~110M parameters)
- Language(s): Danish, Dutch, English, German, Norwegian, Spanish, Swedish
- Finetuned from model: jhu-clsp/mmBERT-small (Marone et al. 2025)
Uses
Direct Use
- Classification of political texts into CAP policy categories
- Applicable to news articles, press releases, and social media posts
- Multilingual political text analysis
Downstream Use
- Policy agenda research
- Political communication analysis
- Dataset labeling / annotation support
Out-of-Scope Use
- Non-political text classification
- Tasks outside CAP taxonomy
- Decision-making without human validation
Bias, Risks, and Limitations
- CAP annotation is inherently subjective; model reflects annotation biases
- Lower performance for some categories (e.g., public lands, foreign trade)
Recommendations
- Evaluate on a small in-domain dataset before deployment
- Combine quantitative metrics with qualitative inspection
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("Sami92/mmbert-cap-int8")
model = AutoModelForSequenceClassification.from_pretrained("Sami92/mmbert-cap-int8")
inputs = tokenizer("Your text here", return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# map the highest-scoring logit to its CAP category name
label = model.config.id2label[outputs.logits.argmax(dim=-1).item()]
Training Details
Training Data
- ~171k manually labeled documents (cleaned)
- ~442k additional documents with soft labels
- Sources: news, press releases, social media
- Languages: 7 European languages
Training Procedure
- Data cleaning using confident learning (Cleanlab)
- Ensemble of teacher models (XLM-R XL, mmbert-base)
- Knowledge distillation to mmbert-small
- Training on soft labels
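The distillation step above can be sketched as a temperature-scaled KL divergence between the teacher ensemble's soft labels and the student's output distribution. The function below is an illustrative sketch in PyTorch; the temperature value, reduction, and function name are assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher soft labels and the student's
    temperature-softened distribution, scaled by T^2 as is conventional."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p, teacher_probs, reduction="batchmean") * temperature ** 2
```

The loss is zero when the student's softened distribution exactly matches the teacher's soft labels, and positive otherwise.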
Training Hyperparameters
- Batch size: 64
- Learning rate: 2e-5
Evaluation
Metrics
- Macro F1 score
- Accuracy
- Accuracy@2 (multi-label setting)
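Accuracy@2 counts a prediction as correct when the gold label appears among the model's two highest-scoring categories. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def accuracy_at_2(probs: np.ndarray, labels: list[int]) -> float:
    """Fraction of examples whose gold label is among the top-2 predicted classes."""
    top2 = np.argsort(probs, axis=1)[:, -2:]  # indices of the two largest scores per row
    return float(np.mean([label in row for label, row in zip(labels, top2)]))

probs = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.2, 0.7]])
accuracy_at_2(probs, [1, 0])  # first label is in the top 2, second is not -> 0.5
```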
Results
- Macro F1: 0.80
- Accuracy: 0.81
- Category range (F1): 0.65–0.90
- Languages (F1): 0.74–0.84
- Document types (F1): 0.77–0.79
Summary
The model achieves competitive performance while remaining efficient and consistent across languages and document types.
Technical Specifications
Model Architecture and Objective
- Transformer-based multilingual classifier
- ~110M parameters
- Single-label CAP classification
Compute Infrastructure
Hardware
- NVIDIA H100 GPU (evaluation)
- AMD EPYC CPU (qint8 inference)
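One common way to produce a qint8 model for CPU inference is PyTorch dynamic quantization, which converts Linear layers to int8 on the fly. The sketch below uses a toy network for illustration; it is not necessarily the exact procedure used for the released int8 checkpoint.

```python
import torch
import torch.nn as nn

# toy stand-in for a transformer classifier head
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# quantize all Linear layers to int8 weights for CPU inference
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
out = quantized(torch.randn(1, 8))  # runs entirely on CPU
```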
Software
- PyTorch
- Hugging Face Transformers
Acknowledgments
We thank the researchers who shared their datasets with us. Gunnar Thesen and Erik de Vries provided the MaML dataset of news articles, Rens Vliegenhart contributed a dataset of Dutch newspaper articles, and Cornelius Erfort shared a dataset of press releases. High-quality data are essential for reliable machine learning classifiers, and this model could not have been trained without their support. We also thank Thomas Haase for his work in annotating the social media test data.