SwissBERT for Token-Level Language Identification in Multilingual Child Speech

This repository contains a fine-tuned version of SwissBERT for word-level language identification in multilingual child–caregiver interactions.
The model predicts a language label for each word and supports downstream analyses such as:

  • Inter-sentential code-switching
  • Intra-sentential code-switching
  • Cross-speaker switching
  • Switch-point detection
  • Multilingual child speech profiling

The model was trained on manually annotated child speech transcripts containing Swiss German, English, French, Italian, and an “other” category.
Because Swiss German child speech data is scarce, the SwissDial dataset was added to the training data to improve Swiss German coverage.


Model Description

  • Base model: ZurichNLP/swissbert (X-MOD architecture, an adapter-based extension of XLM-RoBERTa)
  • Task: Token classification (word-level language ID)
  • Labels: Swiss German (gsw), Standard German (deu), English (eng), French (fra), Italian (ita), Other (other)
  • Tokenizer: SentencePiece (slow tokenizer), extended with the special tokens below (see the sketch after this list):
    • <medium>
    • <year>
    • <month>
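
The card does not include the extension code itself; a minimal sketch of adding such special tokens to the slow tokenizer (base checkpoint name taken from above) might look like:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert", use_fast=False)
tokenizer.add_tokens(["<medium>", "<year>", "<month>"], special_tokens=True)

# The model's embedding matrix must grow to match the extended vocabulary,
# e.g. model.resize_token_embeddings(len(tokenizer)).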

The model is designed for:

  • Child multilingualism research
  • Code-switching analysis
  • Annotation pipelines
  • Automatic language tagging in naturalistic child speech
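
A minimal inference sketch follows. It assumes the fine-tuned checkpoint identifier ZurichNLP/SwissBERT-CS; because the tokenizer is slow, word_ids() is unavailable and the word-to-subword alignment is reconstructed by hand:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "ZurichNLP/SwissBERT-CS"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)  # slow SentencePiece tokenizer
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
model.eval()
# For X-MOD-based checkpoints without a stored default language, select an
# adapter first, e.g. model.set_default_language("de_CH").

words = ["Das", "isch", "good", "gäll"]

# Track where each word's first subword lands (offset 1 for the leading CLS/<s>)
subwords, first_idx = [], []
for word in words:
    pieces = tokenizer.tokenize(word)
    first_idx.append(len(subwords) + 1)
    subwords.extend(pieces)

input_ids = [tokenizer.cls_token_id] + tokenizer.convert_tokens_to_ids(subwords) + [tokenizer.sep_token_id]
with torch.no_grad():
    logits = model(input_ids=torch.tensor([input_ids])).logits[0]

for word, idx in zip(words, first_idx):
    print(word, model.config.id2label[logits[idx].argmax().item()])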

Training Data

The training dataset is a tab-separated file with the following structure:

sentence_id    token    label
12             Das      gsw
12             isch     gsw
12             good     eng
12             gäll     gsw

Tokens are grouped by sentence_id to form sequences for token-level classification.


Training Pipeline

The model was trained using the Hugging Face Trainer API.

1. Load labeled data

  • Read the TSV file with columns (sentence_id, token, label)
  • Remove empty tokens and labels
  • Normalize labels to lowercase
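
A sketch of this step with pandas (the filename train.tsv is a placeholder; the card does not name the file):

import pandas as pd

df = pd.read_csv("train.tsv", sep="\t", dtype=str)   # columns: sentence_id, token, label
df = df.dropna(subset=["token", "label"])            # drop empty tokens and labels
df = df[df["token"].str.strip() != ""]
df["label"] = df["label"].str.lower()                # normalize labels to lowercase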

2. Group tokens into sentences

Tokens and labels are grouped by sentence_id to form input sequences.
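
Continuing the sketch above:

grouped = df.groupby("sentence_id", sort=False)      # keep original sentence order
sentences = [list(g["token"]) for _, g in grouped]
sentence_labels = [list(g["label"]) for _, g in grouped]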


3. Build label mappings

label2id = {
    "gsw": 0,
    "deu": 1,
    "eng": 2,
    "fra": 3,
    "ita": 4,
    "other": 5
}

id2label = {v: k for k, v in label2id.items()}

4. Tokenization and label alignment

Because SwissBERT uses a SentencePiece tokenizer, each word may be split into multiple subword units.

Manual label alignment was therefore implemented (sketched after this list):

  • First subword receives the label
  • Remaining subwords receive -100 (ignored in loss)
  • CLS and SEP tokens also receive -100
  • Sequences padded/truncated to MAX_LENGTH = 128
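
A sketch of that alignment for a single sentence, using the label2id mapping above (function and variable names are illustrative, not the card's exact code):

def encode_sentence(words, labels, tokenizer, label2id, max_length=128):
    input_ids = [tokenizer.cls_token_id]
    aligned = [-100]                                   # CLS is ignored in the loss
    for word, label in zip(words, labels):
        piece_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
        if not piece_ids:
            continue                                   # skip words that tokenize to nothing
        input_ids.extend(piece_ids)
        aligned.append(label2id[label])                # first subword carries the label
        aligned.extend([-100] * (len(piece_ids) - 1))  # remaining subwords are ignored
    input_ids.append(tokenizer.sep_token_id)
    aligned.append(-100)                               # SEP is ignored
    input_ids, aligned = input_ids[:max_length], aligned[:max_length]  # naive truncation
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [tokenizer.pad_token_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": aligned + [-100] * pad,              # padding is ignored too
    }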

5. Training Configuration

  • Epochs: 5
  • Batch size: 8
  • Learning rate: 5e‑5
  • Weight decay: 0.01
  • Evaluation: every epoch
  • Metric: F1 (seqeval)
  • Best model selection: enabled
  • Tokenizer: slow SentencePiece tokenizer
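
Expressed with the Trainer API, this configuration might look as follows (output_dir is a placeholder; on older transformers versions the argument eval_strategy is spelled evaluation_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="swissbert-cs-lid",       # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    eval_strategy="epoch",               # evaluate every epoch
    save_strategy="epoch",               # required for load_best_model_at_end
    load_best_model_at_end=True,         # best model selection
    metric_for_best_model="f1",
)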

6. Model Setup

from transformers import AutoModelForTokenClassification

MODEL_NAME = "ZurichNLP/swissbert"  # base checkpoint (see Model Description)

model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)

7. Metrics

Evaluation uses seqeval, reporting:

  • token‑level F1
  • per‑label precision and recall
  • full classification report printed during training
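
The card does not include the metrics code; a sketch following the standard Hugging Face token-classification recipe with seqeval (using the id2label mapping defined above) could be:

import numpy as np
from seqeval.metrics import classification_report, f1_score

def compute_metrics(eval_pred):
    logits, gold = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Keep only positions with real labels, dropping the -100 placeholders
    true_labels = [[id2label[l] for l in row if l != -100] for row in gold]
    true_preds = [
        [id2label[p] for p, l in zip(p_row, g_row) if l != -100]
        for p_row, g_row in zip(preds, gold)
    ]
    print(classification_report(true_labels, true_preds))  # per-label precision/recall
    return {"f1": f1_score(true_labels, true_preds)}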

References

Agnese D'Angelo, Sina Ahmadi, Moritz M. Daum, and Stephanie Wermelinger. 2026. Code-Switching Detection in Multilingual Child Speech with SwissBERT. In Proceedings of the 11th Swiss Text Analytics Conference (SwissText 2026), Zurich, Switzerland.
