Truecaser Models
Truecasing models for restoring proper capitalization in lowercase ASR output. Used by CrispASR via --truecase-model.
Available Models
| File | Type | Language | Size | F1 | License | Flag |
|---|---|---|---|---|---|---|
| truecaser-lstm-de.bin | BiLSTM char-level | German | 3.2 MB | 97.9% | Apache-2.0 | lstm or lstm-de |
| truecaser-lstm-en.bin | BiLSTM char-level | English | 3.2 MB | 93.0% | Apache-2.0 | lstm-en |
| truecaser-lstm-es.bin | BiLSTM char-level | Spanish | 3.2 MB | — | Apache-2.0 | lstm-es |
| truecaser-lstm-ru.bin | BiLSTM char-level | Russian | 4.1 MB | — | Apache-2.0 | lstm-ru |
| truecaser-crf-de.bin | CRF + context | German | 8.5 MB | ~95% | MIT | crf |
| truecaser-de.bin | Statistical freq | German | 1.7 MB | ~93% | MIT | auto |
BiLSTM Truecaser (recommended)
Converted from mayhewsw/pytorch-truecaser (Apache-2.0).
- Architecture: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
- Labels: L (lowercase), U (uppercase) — per character
- Training data by language: de: WMT monolingual text (2.6M tokens), 97.86% F1; en: Wikipedia, 93.01% F1; es: WMT; ru: LORELEI (no F1 listed for es and ru)
- Original paper: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
- Source: mayhewsw/pytorch-truecaser v1.0 (wmt-truecaser-model-de.tar.gz)
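The architecture and label scheme listed above can be sketched in a few lines of plain Python. This is an illustration only, not the converter or the CrispASR runtime. The helper names `lstm_param_count` and `apply_labels` are hypothetical. The parameter count assumes PyTorch's `nn.LSTM` weight layout (two bias vectors per direction and layer), and the label example assumes that every character, including spaces, receives exactly one label.

```python
def lstm_param_count(vocab=202, emb=50, hidden=150, layers=2, labels=2):
    """Count weights for Embedding -> BiLSTM -> Linear.

    Assumes PyTorch's nn.LSTM layout: per direction and layer there are
    4 gates, each with input weights, recurrent weights and two biases.
    """
    total = vocab * emb                      # embedding table
    in_size = emb
    for _ in range(layers):
        per_direction = 4 * hidden * (in_size + hidden + 2)
        total += 2 * per_direction           # forward + backward direction
        in_size = 2 * hidden                 # next layer sees both directions
    total += 2 * hidden * labels + labels    # output layer weights + bias
    return total


def apply_labels(text, labels):
    """Uppercase each character labelled 'U'; leave 'L' characters as-is."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(
        ch.upper() if tag == "U" else ch for ch, tag in zip(text, labels)
    )


if __name__ == "__main__":
    n = lstm_param_count()
    print(n, "parameters,", round(n * 4 / 1e6, 2), "MB as float32")
    print(apply_labels("die katze", "LLLLULLLL"))
```

Under these assumptions the count comes to 795,502 parameters, about 3.18 MB if each is stored as a 32-bit float. That is in line with the 3.2 MB listed for the de, en and es files, but it is not confirmation of the storage format.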
Example
Input: die schnelle braune katze springt über den faulen hund
Output: Die schnelle braune Katze springt über den faulen Hund
Correctly handles:
- Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalize)
- Formal pronouns: "Ihnen" (capitalize)
- Compound words and proper nouns
CRF Truecaser
Trained on 245K sentences of WMT News Crawl German using python-crfsuite.
- Features: word identity, 3-char suffix, noun suffixes, previous/next word, article context
- Decode: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
- Training data: WMT News Crawl 2023 German (245K sentences)
- Model: 8.5 MB, MIT license
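For orientation, here is a rough sketch of what per-word features of the kind listed above might look like, and how a predicted label could be applied to a lowercase word. It is not the training script. The article list, the noun-suffix list, and the function names are illustrative assumptions, as is the reading of the labels as lc = leave lowercase, u1 = capitalize the first letter, uc = uppercase the whole word.

```python
# Hypothetical feature extractor; the real feature set may differ in detail.
ARTICLES = {"der", "die", "das", "den", "dem", "des",
            "ein", "eine", "einen", "einem", "einer", "eines"}
NOUN_SUFFIXES = ("ung", "heit", "keit", "schaft", "tion", "nis", "tum")


def word_features(tokens, i):
    """Build a feature dict for the lowercase token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    return {
        "word": word,                                # word identity
        "suffix3": word[-3:],                        # 3-char suffix
        "noun_suffix": word.endswith(NOUN_SUFFIXES), # typical noun ending
        "prev": prev_word,                           # previous word
        "next": next_word,                           # next word
        "prev_is_article": prev_word in ARTICLES,    # article context
    }


def apply_label(word, label):
    """Apply an assumed casing label (lc, u1, uc) to a lowercase word."""
    if label == "uc":
        return word.upper()
    if label == "u1":
        return word[:1].upper() + word[1:]
    return word


if __name__ == "__main__":
    tokens = "die zeitung schreibt".split()
    print(word_features(tokens, 1))
    print(apply_label("zeitung", "u1"))
```

A linear-chain CRF scores a whole label sequence over such feature dicts, so the label chosen for one word can depend on the labels of its neighbours. That dependency is what the Viterbi decode step resolves.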
Statistical Truecaser
Simple word-frequency lookup trained on WMT News Crawl 2023 German.
- Entries: 71,142 unique words
- Size: 1.7 MB
- Approach: for each word, pick the casing variant (lowercase/capitalize/uppercase) seen most often
- Training data: WMT News Crawl 2023 German (278K sentences), MIT license
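The frequency-lookup approach is simple enough to sketch end to end. The snippet below is a minimal illustration, not the code used to build truecaser-de.bin. The function names and the three-sentence toy corpus are made up for the example, and leaving unknown words unchanged is an assumption.

```python
from collections import Counter, defaultdict


def casing_class(token):
    """Map a token to one of three casing variants."""
    if len(token) > 1 and token.isupper():
        return "uppercase"
    if token[:1].isupper():
        return "capitalize"
    return "lowercase"


def train(sentences):
    """For each lowercased word, keep the casing variant seen most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][casing_class(token)] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def truecase(text, table):
    """Restore casing word by word; unknown words are left unchanged."""
    out = []
    for word in text.split():
        style = table.get(word.lower(), "lowercase")
        if style == "uppercase":
            out.append(word.upper())
        elif style == "capitalize":
            out.append(word[:1].upper() + word[1:])
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    corpus = ["die Katze schläft", "eine Katze und die NATO", "die katze"]
    table = train(corpus)
    print(truecase("die katze und die nato", table))
```

Because each word is looked up in isolation, a lookup like this cannot tell a noun from an identically spelled verb or adjective. That limitation is one plausible reason the context-aware models score higher in the table above.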
Usage with CrispASR
# BiLSTM (recommended)
crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav
# CRF
crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav
# Statistical
crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav
# Combined with punctuation restoration
crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
Conversion
# BiLSTM: download from mayhewsw, convert to binary
wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
tar xzf wmt-truecaser-model-de.tar.gz
python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin
# CRF: train on WMT News Crawl German (see CRF Truecaser above)
python models/train-truecaser-crf.py --output truecaser-crf-de.bin