Truecaser Models

Truecasing models for restoring proper capitalization in lowercase ASR output. Used by CrispASR via --truecase-model.

Available Models

| File | Type | Language | Size | F1 | License | Flag |
|---|---|---|---|---|---|---|
| truecaser-lstm-de.bin | BiLSTM char-level | German | 3.2 MB | 97.9% | Apache-2.0 | lstm or lstm-de |
| truecaser-lstm-en.bin | BiLSTM char-level | English | 3.2 MB | 93.0% | Apache-2.0 | lstm-en |
| truecaser-lstm-es.bin | BiLSTM char-level | Spanish | 3.2 MB | | Apache-2.0 | lstm-es |
| truecaser-lstm-ru.bin | BiLSTM char-level | Russian | 4.1 MB | | Apache-2.0 | lstm-ru |
| truecaser-crf-de.bin | CRF + context | German | 8.5 MB | ~95% | MIT | crf |
| truecaser-de.bin | Statistical freq | German | 1.7 MB | ~93% | MIT | auto |

BiLSTM Truecaser (recommended)

Converted from mayhewsw/pytorch-truecaser (Apache-2.0).

  • Architecture: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
  • Labels: L (lowercase), U (uppercase) — per character
  • Training data and F1 by language: de: WMT monolingual text (2.6M tokens), 97.86% F1; en: Wikipedia, 93.01% F1; es: WMT; ru: LORELEI (F1 is not listed for es and ru)
  • Original paper: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
  • Source: mayhewsw/pytorch-truecaser, release v1.0 (wmt-truecaser-model-de.tar.gz)
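
As a rough sanity check on the architecture line above, the layer shapes can be turned into a parameter count. This is a sketch under two assumptions that the model card does not state: the LSTM uses PyTorch-style parameters (input weights, recurrent weights, and two bias vectors per direction), and the weights are stored as 32-bit floats. The helper names are illustrative only.

```python
# Parameter count for Embedding(202, 50) -> BiLSTM(50 -> 150, 2 layers) -> Linear(300, 2).
# Assumption: PyTorch-style LSTM parameters (W_ih, W_hh, b_ih, b_hh per direction).

def lstm_direction_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM direction in one layer (4 gates)."""
    gates = 4
    weights = gates * hidden_size * (input_size + hidden_size)
    biases = 2 * gates * hidden_size  # b_ih and b_hh
    return weights + biases

def truecaser_param_count(vocab: int = 202, emb: int = 50,
                          hidden: int = 150, labels: int = 2) -> int:
    embedding = vocab * emb
    layer1 = 2 * lstm_direction_params(emb, hidden)         # forward + backward
    layer2 = 2 * lstm_direction_params(2 * hidden, hidden)  # input is the 300-dim concat
    linear = 2 * hidden * labels + labels                   # Linear(300, 2) with bias
    return embedding + layer1 + layer2 + linear

params = truecaser_param_count()
size_mb = params * 4 / 1e6  # assumption: float32 storage, no header overhead
print(params, round(size_mb, 2))
```

Under these assumptions the count is about 0.8M parameters, or roughly 3.2 MB, which is in line with the size listed for the de, en, and es files. The larger ru file would be consistent with a bigger character vocabulary, but that is a guess.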

Example

Input:  die schnelle braune katze springt über den faulen hund
Output: Die schnelle braune Katze springt über den faulen Hund

Correctly handles:

  • Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalize)
  • Formal pronouns: "Ihnen" (capitalize)
  • Compound words and proper nouns
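
To make the per-character labelling concrete, the sketch below applies a sequence of L/U labels to lowercase text. The function name `apply_char_labels` and the hand-written label string are hypothetical; in the real pipeline the labels would come from the BiLSTM, which is not reproduced here.

```python
def apply_char_labels(text: str, labels: str) -> str:
    """Uppercase every character whose label is 'U'; leave the rest unchanged."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    # Note: str.upper() can change length for some characters (e.g. 'ß' -> 'SS').
    return "".join(ch.upper() if lab == "U" else ch for ch, lab in zip(text, labels))

text = "die schnelle braune katze"
# Hand-written labels for illustration: 'U' on the first letters of "die" and "katze".
labels = "U" + "L" * 19 + "U" + "L" * 4
result = apply_char_labels(text, labels)
print(result)
```

Because the decision is made per character rather than per word, the same mechanism can in principle produce mixed-case tokens as well as plain capitalization.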

CRF Truecaser

Trained on 245K sentences of WMT News Crawl German using python-crfsuite.

  • Features: word identity, 3-char suffix, noun suffixes, previous/next word, article context
  • Decode: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
  • Training data: WMT News Crawl 2023 German (8.5 MB model, MIT license)
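
The feature list above can be sketched as a per-token feature dictionary of the kind python-crfsuite accepts. The suffix and article lists, the feature names, and the reading of the labels (lc = keep lowercase, u1 = capitalize the first letter, uc = all uppercase) are assumptions made for illustration; the actual training script may differ.

```python
# Hypothetical lists; the real feature set is defined by the training script.
NOUN_SUFFIXES = ("ung", "heit", "keit", "schaft", "tion")
ARTICLES = {"der", "die", "das", "den", "dem", "des",
            "ein", "eine", "einen", "einem", "einer", "eines"}

def word_features(tokens: list[str], i: int) -> dict:
    """Features for token i: identity, suffixes, neighbours, article context."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    return {
        "word": word,
        "suffix3": word[-3:],
        "noun_suffix": word.endswith(NOUN_SUFFIXES),
        "prev_word": prev_word,
        "next_word": next_word,
        "prev_is_article": prev_word in ARTICLES,
    }

def apply_word_label(word: str, label: str) -> str:
    """Apply a decoded label (assumed meaning: lc, u1, uc) to a lowercase word."""
    if label == "u1":
        return word[:1].upper() + word[1:]
    if label == "uc":
        return word.upper()
    return word  # "lc": leave as is

tokens = ["die", "zeitung", "liegt"]
features = word_features(tokens, 1)
print(features)
print(apply_word_label("zeitung", "u1"))
```

A linear-chain CRF scores whole label sequences, so the Viterbi step picks the best joint labelling for the sentence instead of deciding each word in isolation.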

Statistical Truecaser

Simple word-frequency lookup trained on WMT News Crawl 2023 German.

  • Entries: 71,142 unique words
  • Size: 1.7 MB
  • Approach: for each word, pick the casing variant (lowercase/capitalize/uppercase) seen most often
  • Training data: WMT News Crawl 2023 German (278K sentences), MIT license
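
The frequency-lookup approach can be sketched in a few lines. This is a toy version trained on three made-up sentences, not the shipped model; `build_table` and `truecase` are hypothetical names, and leaving unknown words unchanged is an assumption.

```python
from collections import Counter, defaultdict

def build_table(sentences: list[str]) -> dict[str, str]:
    """Map each lowercased word to the casing variant seen most often."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {word: variants.most_common(1)[0][0] for word, variants in counts.items()}

def truecase(text: str, table: dict[str, str]) -> str:
    """Look up each lowercase token; unknown words are left unchanged (assumption)."""
    return " ".join(table.get(token, token) for token in text.split())

corpus = ["die Katze schläft", "eine Katze und ein Hund", "der Hund"]
table = build_table(corpus)
print(truecase("die katze und der hund", table))
```

A lookup like this has no notion of context, so a word that is sometimes a noun and sometimes not always receives its single most frequent casing. That limitation is what the CRF and BiLSTM models address.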

Usage with CrispASR

# BiLSTM (recommended)
crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav

# CRF
crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav

# Statistical
crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav

# Combined with punctuation restoration
crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav

Conversion

# BiLSTM: download from mayhewsw, convert to binary
wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
tar xzf wmt-truecaser-model-de.tar.gz
python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin

# CRF: train from scratch (the CRF section above lists WMT News Crawl 2023 German as the training data)
python models/train-truecaser-crf.py --output truecaser-crf-de.bin