Model Description
mmbert-cap is a compact multilingual transformer for classifying text into Comparative Agendas Project (CAP) policy categories. It is designed to deliver strong, consistent performance across multiple languages and document types while remaining computationally efficient. Further details are available in the accompanying documentation.
- Model type: Multilingual transformer (mmbert-small, ~110M parameters)
- Language(s): Danish, Dutch, English, German, Norwegian, Spanish, Swedish
- Finetuned from model: jhu-clsp/mmBERT-small (Marone et al. 2025)
Uses
Direct Use
- Classification of political texts into CAP policy categories
- Applicable to news articles, press releases, and social media posts
- Multilingual political text analysis
Downstream Use
- Policy agenda research
- Political communication analysis
- Dataset labeling / annotation support
Out-of-Scope Use
- Non-political text classification
- Tasks outside CAP taxonomy
- Decision-making without human validation
Bias, Risks, and Limitations
- CAP annotation is inherently subjective; model reflects annotation biases
- Lower performance for some categories (e.g., public lands, foreign trade)
Recommendations
- Evaluate on a small in-domain dataset before deployment
- Combine quantitative metrics with qualitative inspection
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("Sami92/mmbert-cap-int8")
model = AutoModelForSequenceClassification.from_pretrained("Sami92/mmbert-cap-int8")
inputs = tokenizer("Your text here", return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# map the highest-scoring logit to its CAP category name
label = model.config.id2label[outputs.logits.argmax(dim=-1).item()]
Training Details
Training Data
- ~171k manually labeled documents (cleaned)
- ~442k additional documents with soft labels
- Sources: news, press releases, social media
- Languages: 7 European languages
Training Procedure
- Data cleaning using confident learning (Cleanlab)
- Ensemble of teacher models (XLM-R XL, mmbert-base)
- Knowledge distillation to mmbert-small
- Training on soft labels
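The distillation step above can be sketched as a temperature-scaled KL divergence between the teacher ensemble's soft labels and the student's output distribution. The function below is an illustrative sketch in PyTorch; the temperature value, reduction, and function name are assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher soft labels and the student's
    temperature-softened distribution, scaled by T^2 as is conventional."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p, teacher_probs, reduction="batchmean") * temperature ** 2
```

The loss is zero when the student's softened distribution exactly matches the teacher's soft labels, and positive otherwise.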
Training Hyperparameters
- Batch size: 64
- Learning rate: 2e-5
Evaluation
Metrics
- Macro F1 score
- Accuracy
- Accuracy@2 (multi-label setting)
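Accuracy@2 counts a prediction as correct when the gold label appears among the model's two highest-scoring categories. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def accuracy_at_2(probs: np.ndarray, labels: list[int]) -> float:
    """Fraction of examples whose gold label is among the top-2 predicted classes."""
    top2 = np.argsort(probs, axis=1)[:, -2:]  # indices of the two largest scores per row
    return float(np.mean([label in row for label, row in zip(labels, top2)]))

probs = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.2, 0.7]])
accuracy_at_2(probs, [1, 0])  # first label is in the top 2, second is not -> 0.5
```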
Results
- Macro F1: 0.80
- Accuracy: 0.81
- Category range (F1): 0.65–0.90
- Languages (F1): 0.74–0.84
- Document types (F1): 0.77–0.79
Summary
The model achieves competitive performance while remaining efficient and consistent across languages and document types.
Technical Specifications
Model Architecture and Objective
- Transformer-based multilingual classifier
- ~110M parameters
- Single-label CAP classification
Compute Infrastructure
Hardware
- NVIDIA H100 GPU (evaluation)
- AMD EPYC CPU (qint8 inference)
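One common way to produce a qint8 model for CPU inference is PyTorch dynamic quantization, which converts Linear layers to int8 on the fly. The sketch below uses a toy network for illustration; it is not necessarily the exact procedure used for the released int8 checkpoint.

```python
import torch
import torch.nn as nn

# toy stand-in for a transformer classifier head
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# quantize all Linear layers to int8 weights for CPU inference
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
out = quantized(torch.randn(1, 8))  # runs entirely on CPU
```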
Software
- PyTorch
- Hugging Face Transformers
Acknowledgments
We thank the researchers who shared their datasets with us. Gunnar Thesen and Erik de Vries provided the MaML dataset of news articles, Rens Vliegenhart contributed a dataset of Dutch newspaper articles, and Cornelius Erfort shared a dataset of press releases. High-quality data are essential for reliable machine learning classifiers, and this model could not have been trained without their support. We also thank Thomas Haase for his work in annotating the social media test data.