Model Description

mmbert-cap is a compact multilingual transformer model for classifying text into Comparative Agendas Project (CAP) policy categories. It is designed to deliver strong, consistent performance across multiple languages and document types while remaining computationally efficient. For further details, see the documentation.

  • Model type: Multilingual transformer (mmbert-small, ~110M parameters)
  • Language(s): Danish, Dutch, English, German, Norwegian, Spanish, Swedish
  • Finetuned from model: mmbert-small (Marone et al. 2025)

Uses

Direct Use

  • Classification of political texts into CAP policy categories
  • Applicable to news articles, press releases, and social media posts
  • Multilingual political text analysis

Downstream Use

  • Policy agenda research
  • Political communication analysis
  • Dataset labeling / annotation support

Out-of-Scope Use

  • Non-political text classification
  • Tasks outside CAP taxonomy
  • Decision-making without human validation

Bias, Risks, and Limitations

  • CAP annotation is inherently subjective; model reflects annotation biases
  • Lower performance for some categories (e.g., public lands, foreign trade)

Recommendations

  • Evaluate on a small in-domain dataset before deployment
  • Combine quantitative metrics with qualitative inspection

How to Get Started with the Model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Sami92/mmbert-cap-int8")
model = AutoModelForSequenceClassification.from_pretrained("Sami92/mmbert-cap-int8")

inputs = tokenizer("Your text here", return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Map the highest-scoring logit to its CAP category label
predicted_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])

Training Details

Training Data

  • ~171k manually labeled documents (cleaned)
  • ~442k additional documents with soft labels
  • Sources: news, press releases, social media
  • Languages: 7 European languages

Training Procedure

  • Data cleaning using confident learning (Cleanlab)
  • Ensemble teacher models (XLM-R, XL, mmbert-base)
  • Knowledge distillation to mmbert-small
  • Training on soft labels
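The distillation step above can be sketched as a temperature-scaled KL divergence between the teacher ensemble's soft labels and the student's logits. This is a minimal sketch assuming the standard Hinton-style objective; the exact loss used for mmbert-cap is not stated in this card.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions, rescaled by T^2 so gradient magnitudes are preserved."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```

When student and teacher logits agree, the loss is zero; it grows as the student's distribution drifts from the teacher's soft labels.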

Training Hyperparameters

  • Batch size: 64
  • Learning rate: 2e-5
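With Hugging Face's Trainer, the two documented hyperparameters map onto TrainingArguments roughly as follows; a configuration sketch in which the output directory is an illustrative assumption, not a setting from the card.

```python
from transformers import TrainingArguments

# Batch size and learning rate follow the card; everything else is
# left at Trainer defaults since the card does not document it.
training_args = TrainingArguments(
    output_dir="mmbert-cap-finetune",  # hypothetical output path
    per_device_train_batch_size=64,
    learning_rate=2e-5,
)
```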

Evaluation

Metrics

  • Macro F1 score
  • Accuracy
  • Accuracy@2 (multi-label setting)
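These metrics can be computed as follows; a minimal NumPy sketch with illustrative function names, where accuracy@2 is read as the gold label appearing among the model's two highest-scoring categories.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

def accuracy_at_k(y_true, scores, k=2):
    """Fraction of examples whose gold label is among the top-k scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([t in row for t, row in zip(y_true, topk)]))
```

Macro F1 weights every CAP category equally, which is why rare categories such as public lands can pull the average down even when overall accuracy is high.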

Results

  • Macro F1: 0.80
  • Accuracy: 0.81
  • Category range (F1): 0.65–0.90
  • Languages (F1): 0.74–0.84
  • Document types (F1): 0.77–0.79

Summary

The model achieves competitive performance while remaining efficient and consistent across languages and document types.

Technical Specifications

Model Architecture and Objective

  • Transformer-based multilingual classifier
  • ~110M parameters
  • Single-label CAP classification

Compute Infrastructure

Hardware

  • NVIDIA H100 GPU (evaluation)
  • AMD EPYC CPU (qint8 inference)
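The qint8 CPU inference mentioned above can be approximated with PyTorch dynamic quantization. A minimal sketch on a toy stand-in model; applying the same call to a loaded classifier's Linear modules is the usual approach, though the card does not specify the exact procedure used.

```python
import torch

# Toy stand-in for a transformer classifier's dense layers.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 4),
)

# Convert Linear weights to qint8 for faster CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    logits = quantized(torch.randn(1, 8))
print(logits.shape)  # torch.Size([1, 4])
```

Dynamic quantization stores weights in int8 and quantizes activations on the fly, trading a small amount of accuracy for reduced memory and faster CPU matmuls.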

Software

  • PyTorch
  • Hugging Face Transformers

Acknowledgments

We thank the researchers who shared their datasets with us. Gunnar Thesen and Erik de Vries provided the MaML dataset of news articles, Rens Vliegenhart contributed a dataset of Dutch newspaper articles, and Cornelius Erfort shared a dataset of press releases. High-quality data are essential for reliable machine learning classifiers, and this model could not have been trained without their support. We also thank Thomas Haase for his work in annotating the social media test data.
