SwissBERT for Token-Level Language Identification in Multilingual Child Speech

This repository contains a fine-tuned version of SwissBERT for word-level language identification in multilingual child–caregiver interactions.
The model predicts a language label for each word and supports downstream analyses such as:

  • Inter-sentential code-switching
  • Intra-sentential code-switching
  • Cross-speaker switching
  • Switch-point detection
  • Multilingual child speech profiling

The model was trained on manually annotated child speech transcripts containing Swiss German, English, French, Italian, and an “other” category.
Because Swiss German child speech data is scarce, the SwissDial dataset was added to the training data to improve Swiss German coverage.


Model Description

  • Base model: ZurichNLP/swissbert (X-MOD architecture, an adapter-based extension of XLM-RoBERTa)
  • Task: Token classification (word-level language ID)
  • Labels: Swiss German (gsw), Standard German (deu), English (eng), French (fra), Italian (ita), Other (other)
  • Tokenizer: SentencePiece (slow tokenizer), extended with the special tokens below (see the sketch after this list):
    • <medium>
    • <year>
    • <month>
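
The card does not include the extension code itself; a minimal sketch of adding such special tokens to the slow tokenizer (base checkpoint name taken from above) might look like:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert", use_fast=False)
tokenizer.add_tokens(["<medium>", "<year>", "<month>"], special_tokens=True)

# The model's embedding matrix must grow to match the extended vocabulary,
# e.g. model.resize_token_embeddings(len(tokenizer)).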

The model is designed for:

  • Child multilingualism research
  • Code-switching analysis
  • Annotation pipelines
  • Automatic language tagging in naturalistic child speech
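
A minimal inference sketch follows. It assumes the fine-tuned checkpoint identifier ZurichNLP/SwissBERT-CS; because the tokenizer is slow, word_ids() is unavailable and the word-to-subword alignment is reconstructed by hand:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "ZurichNLP/SwissBERT-CS"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)  # slow SentencePiece tokenizer
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
model.eval()
# For X-MOD-based checkpoints without a stored default language, select an
# adapter first, e.g. model.set_default_language("de_CH").

words = ["Das", "isch", "good", "gäll"]

# Track where each word's first subword lands (offset 1 for the leading CLS/<s>)
subwords, first_idx = [], []
for word in words:
    pieces = tokenizer.tokenize(word)
    first_idx.append(len(subwords) + 1)
    subwords.extend(pieces)

input_ids = [tokenizer.cls_token_id] + tokenizer.convert_tokens_to_ids(subwords) + [tokenizer.sep_token_id]
with torch.no_grad():
    logits = model(input_ids=torch.tensor([input_ids])).logits[0]

for word, idx in zip(words, first_idx):
    print(word, model.config.id2label[logits[idx].argmax().item()])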

Training Data

The training dataset is a tab-separated file with the following structure:

sentence_id    token    label
12             Das      gsw
12             isch     gsw
12             good     eng
12             gäll     gsw

Tokens are grouped by sentence_id to form sequences for token-level classification.


Training Pipeline

The model was trained using the Hugging Face Trainer API.

1. Load labeled data

  • Read the TSV file with columns (sentence_id, token, label)
  • Remove empty tokens and labels
  • Normalize labels to lowercase
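
A sketch of this step with pandas (the filename train.tsv is a placeholder; the card does not name the file):

import pandas as pd

df = pd.read_csv("train.tsv", sep="\t", dtype=str)   # columns: sentence_id, token, label
df = df.dropna(subset=["token", "label"])            # drop empty tokens and labels
df = df[df["token"].str.strip() != ""]
df["label"] = df["label"].str.lower()                # normalize labels to lowercase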

2. Group tokens into sentences

Tokens and labels are grouped by sentence_id to form input sequences.
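
Continuing the sketch above:

grouped = df.groupby("sentence_id", sort=False)      # keep original sentence order
sentences = [list(g["token"]) for _, g in grouped]
sentence_labels = [list(g["label"]) for _, g in grouped]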


3. Build label mappings

label2id = {
    "gsw": 0,
    "deu": 1,
    "eng": 2,
    "fra": 3,
    "ita": 4,
    "other": 5
}

id2label = {v: k for k, v in label2id.items()}

4. Tokenization and label alignment

Because SwissBERT uses a SentencePiece tokenizer, each word may be split into multiple subword units.

Manual label alignment was therefore implemented (sketched after this list):

  • First subword receives the label
  • Remaining subwords receive -100 (ignored in loss)
  • CLS and SEP tokens also receive -100
  • Sequences padded/truncated to MAX_LENGTH = 128
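
A sketch of that alignment for a single sentence, using the label2id mapping above (function and variable names are illustrative, not the card's exact code):

def encode_sentence(words, labels, tokenizer, label2id, max_length=128):
    input_ids = [tokenizer.cls_token_id]
    aligned = [-100]                                   # CLS is ignored in the loss
    for word, label in zip(words, labels):
        piece_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
        if not piece_ids:
            continue                                   # skip words that tokenize to nothing
        input_ids.extend(piece_ids)
        aligned.append(label2id[label])                # first subword carries the label
        aligned.extend([-100] * (len(piece_ids) - 1))  # remaining subwords are ignored
    input_ids.append(tokenizer.sep_token_id)
    aligned.append(-100)                               # SEP is ignored
    input_ids, aligned = input_ids[:max_length], aligned[:max_length]  # naive truncation
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [tokenizer.pad_token_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": aligned + [-100] * pad,              # padding is ignored too
    }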

5. Training Configuration

  • Epochs: 5
  • Batch size: 8
  • Learning rate: 5e‑5
  • Weight decay: 0.01
  • Evaluation: every epoch
  • Metric: F1 (seqeval)
  • Best model selection: enabled
  • Tokenizer: slow SentencePiece tokenizer
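
Expressed with the Trainer API, this configuration might look as follows (output_dir is a placeholder; on older transformers versions the argument eval_strategy is spelled evaluation_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="swissbert-cs-lid",       # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    eval_strategy="epoch",               # evaluate every epoch
    save_strategy="epoch",               # required for load_best_model_at_end
    load_best_model_at_end=True,         # best model selection
    metric_for_best_model="f1",
)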

6. Model Setup

from transformers import AutoModelForTokenClassification

MODEL_NAME = "ZurichNLP/swissbert"  # base checkpoint (see Model Description)

model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)

7. Metrics

Evaluation uses seqeval, reporting:

  • token‑level F1
  • per‑label precision and recall
  • full classification report printed during training
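
The card does not include the metrics code; a sketch following the standard Hugging Face token-classification recipe with seqeval (using the id2label mapping defined above) could be:

import numpy as np
from seqeval.metrics import classification_report, f1_score

def compute_metrics(eval_pred):
    logits, gold = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Keep only positions with real labels, dropping the -100 placeholders
    true_labels = [[id2label[l] for l in row if l != -100] for row in gold]
    true_preds = [
        [id2label[p] for p, l in zip(p_row, g_row) if l != -100]
        for p_row, g_row in zip(preds, gold)
    ]
    print(classification_report(true_labels, true_preds))  # per-label precision/recall
    return {"f1": f1_score(true_labels, true_preds)}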

References

Agnese D'Angelo, Sina Ahmadi, Moritz M. Daum, and Stephanie Wermelinger. 2026. Code-Switching Detection in Multilingual Child Speech with SwissBERT. In Proceedings of the 11th Swiss Text Analytics Conference (SwissText 2026), Zurich, Switzerland.
