Tibetan Normalisation - S2S Model (Tokenised)
A character-level sequence-to-sequence (S2S) encoder-decoder transformer model for the normalisation of Old/Classical Tibetan, converting diplomatic (non-standard, abbreviated) Tibetan manuscript text into Standard Classical Tibetan. This is the tokenised variant of the model — input and output have been pre-segmented into tokens using a customised version of the Botok Tibetan tokeniser prior to training (see Data Preparation).
Important: Results from Meelen & Griffiths (2026) indicate that for most use cases, normalisation performs better when applied to non-tokenised text. Tokenisation is best deferred until after normalisation in the processing pipeline. For general use, the non-tokenised model pagantibet/normalisationS2S-nontokenised is therefore recommended. The tokenised model is provided for research purposes and for direct comparison of the two approaches.
This model is part of the PaganTibet project and accompanies the paper:
Meelen, M. & Griffiths, R.M. (2026) 'Historical Tibetan Normalisation: rule-based vs neural & n-gram LM methods for extremely low-resource languages' in Proceedings of the AI4CHIEF conference, Springer.
Please cite the paper and the code repository when using this model.
Model Overview
Old/Classical Tibetan manuscripts present major normalisation challenges: extensive abbreviations, non-standard orthography, scribal variation, and a near-complete absence of gold-standard parallel data. This model addresses these challenges using a hybrid approach combining a neural sequence-to-sequence transformer with optional rule-based pre-/post-processing and KenLM n-gram language model ranking (the latter applied at inference time; see the Inference scripts).
The model operates at the character level on tokenised input — that is, the source text has been segmented into Tibetan word tokens using a customised version of Botok before being passed to the model (see Data Preparation). Both source (diplomatic) and target (normalised) sequences in the training data were tokenised in this way. At inference time, input text must likewise be tokenised using the same tool before being fed to this model.
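Conceptually, "character-level on tokenised input" means the model sees token boundaries only as delimiter characters in an otherwise character-by-character sequence. Below is a minimal illustrative sketch of this encoding step; the actual vocabulary, special symbols, and delimiter conventions are defined in the training scripts, and the ID assignments here are assumptions.

```python
# Illustrative sketch: character-level encoding of pre-tokenised Tibetan text.
# The special symbols and the space delimiter are assumptions for illustration.

def build_char_vocab(sentences, specials=("<pad>", "<s>", "</s>", "<unk>")):
    """Map every character (including the token delimiter) to an integer ID."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sent in sentences:
        for ch in sent:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Encode a tokenised sentence as <s>, per-character IDs, </s>."""
    unk = vocab["<unk>"]
    return [vocab["<s>"]] + [vocab.get(ch, unk) for ch in sentence] + [vocab["</s>"]]

# Tokenised input: word tokens separated by spaces; the delimiter is itself a character.
src = "བཀྲ་ཤིས་ བདེ་ལེགས །"
vocab = build_char_vocab([src])
ids = encode(src, vocab)
```

Because the delimiter is part of the character vocabulary, the model can in principle learn normalisations that interact with token boundaries.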
Architecture
- Type: Character-level encoder-decoder transformer (Seq2Seq)
- Layers: 4
- Attention heads: 8
- Optimiser: Adam (lr = 0.0005, β1 = 0.9, β2 = 0.997)
- Label smoothing: 0.1
- Framework: PyTorch
- Training hardware: NVIDIA RTX 6000 Ada GPU (~5–6 hours training time)
Full hyperparameter settings are reported in the Appendix of Meelen & Griffiths (2026).
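The listed settings can be sketched with PyTorch's built-in `nn.Transformer`. Only the layer/head counts, optimiser settings, and label smoothing come from this card; the embedding size (512) and vocabulary size are assumptions, and the real implementation (masking, positional encoding, beam search) lives in the repository.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL = 200, 512  # assumed values; character vocabularies are small

class CharSeq2Seq(nn.Module):
    """Minimal sketch of a character-level encoder-decoder transformer."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.tgt_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,               # 8 attention heads
            num_encoder_layers=4, num_decoder_layers=4,  # 4 layers each
            batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src, tgt):
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(h)

model = CharSeq2Seq()
# Optimiser and loss settings as reported above
optimiser = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.997))
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

src = torch.randint(0, VOCAB_SIZE, (2, 16))   # (batch, src_len)
tgt = torch.randint(0, VOCAB_SIZE, (2, 12))   # (batch, tgt_len)
logits = model(src, tgt)                      # (batch, tgt_len, VOCAB_SIZE)
```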
Training Data
The model was trained on a tokenised version of the dataset pagantibet/normalisation-S2S-training (~2 million rows), which combines:
- Gold-standard data: 7,421 manually normalised parallel sentence pairs from the PaganTibet corpus, tokenised using a customised version of Botok (see botokenise_src-tgt.py in the Data Preparation scripts).
- Augmented data: The gold data was substantially expanded using four data augmentation strategies, each designed to simulate the kinds of variation found in historical Tibetan manuscripts:
- Random noise injection: Probabilistic character substitutions, diacritic variations, and orthographic inconsistencies calibrated to realistic manuscript variation frequencies (following Huang et al. 2023).
- OCR-based noise simulation: OCR-realistic noise patterns generated using the nlpaug library.
- Rule-based diplomatic transformations: Stochastic application of character replacements reflecting common scribal conventions in historical Tibetan manuscripts.
- Dictionary-based augmentation: Insertion of entries from a custom Tibetan abbreviation dictionary (~10,000 abbreviation–expansion pairs) applied to tokenised text to help the model learn abbreviation resolution.
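The noise-injection and dictionary-based strategies above can be sketched as follows. This is a toy illustration only: the substitution table, abbreviation entry, and probability values are placeholders, not the calibrated frequencies or the ~10,000-entry dictionary used in the actual pipeline.

```python
import random

SUBSTITUTIONS = {"ི": "ེ", "ས": "ཟ"}      # hypothetical confusable character pairs
ABBREVIATIONS = {"བདེ་ལེགས": "བདེགས"}      # hypothetical abbreviation entry

def inject_noise(tokens, p_sub=0.05, rng=None):
    """Probabilistically substitute characters to mimic scribal variation."""
    rng = rng or random.Random(0)
    noisy = []
    for tok in tokens:
        chars = [SUBSTITUTIONS[c] if c in SUBSTITUTIONS and rng.random() < p_sub else c
                 for c in tok]
        noisy.append("".join(chars))
    return noisy

def abbreviate(tokens, p_abbr=0.5, rng=None):
    """Dictionary-based augmentation: replace a token with its abbreviated form."""
    rng = rng or random.Random(0)
    return [ABBREVIATIONS[t] if t in ABBREVIATIONS and rng.random() < p_abbr else t
            for t in tokens]

# Synthetic "diplomatic" source paired with the clean target sentence
tgt = ["བཀྲ་ཤིས", "བདེ་ལེགས"]
src = abbreviate(inject_noise(tgt))
```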
Additional training data was derived from the Standard Classical Tibetan ACTib corpus (>180 million words; Meelen & Roux 2020), processed into manuscript-length lines and tokenised accordingly.
Full details of the data preparation and augmentation pipeline are described in the GitHub repository.
Intended Use
This model is intended for:
- Research comparing tokenised and non-tokenised approaches to Classical Tibetan normalisation, as described in Meelen & Griffiths (2026).
- Normalisation of diplomatic Old/Classical Tibetan texts in workflows where tokenised input is already available or required.
- Digital humanities work on historical Tibetan manuscripts, particularly when studying the interaction between tokenisation and normalisation.
Note on pipeline order: Results in Meelen & Griffiths (2026) show that tokenisation is best left until after normalisation in the processing pipeline. For most use cases, the non-tokenised model pagantibet/normalisationS2S-nontokenised is recommended. For particularly challenging diplomatic corpora, combining either model with the KenLM n-gram ranker and rule-based pre/post-processing (see Inference) yields the best results.
How to Use
Input text must first be tokenised using a customised version of the Botok Tibetan tokeniser:
python3 botokenise_src-tgt.py
See the Custom Botok ReadMe for full tokenisation details.
The model can then be used with the inference scripts provided in the PaganTibet normalisation repository. Six inference modes are available, ranging from rule-based only to combined neural + n-gram + rule-based pipelines:
# Run on a GPU cluster via Slurm
sbatch tibetan-inference-flexible.sh
# Or run directly
python3 tibetan-inference-flexible.py
See the Inference ReadMe for full usage details and configuration options.
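The neural + n-gram combination mentioned above amounts to reranking the beam-search candidates with a language model. The sketch below uses a toy unigram scorer as a stand-in for a KenLM model's log-probability (`kenlm.Model(...).score(...)`); the interpolation weight `alpha` is a hypothetical tuning parameter, and the actual interface is defined in the repository's inference scripts.

```python
def lm_score(sentence, unigram_logprobs, oov=-8.0):
    """Toy unigram LM: sum of per-token log-probabilities (stand-in for KenLM)."""
    return sum(unigram_logprobs.get(tok, oov) for tok in sentence.split())

def rerank(candidates, unigram_logprobs, alpha=0.5):
    """Combine the seq2seq log-probability with the LM score; return the best candidate."""
    def combined(candidate):
        text, s2s_logprob = candidate
        return (1 - alpha) * s2s_logprob + alpha * lm_score(text, unigram_logprobs)
    return max(candidates, key=combined)

lm = {"བཀྲ་ཤིས": -1.0, "བདེ་ལེགས": -1.5}
beams = [("བཀྲ་ཤིས བདེ་ལེགས", -2.0), ("བཀྲ་ཤས བདེ་ལེགས", -1.8)]
best, _ = rerank(beams, lm)  # the LM penalises the out-of-vocabulary variant
```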
Evaluation
The training script includes built-in beam-search evaluation. Separate evaluation is available via the evaluation scripts, which report:
- CER (Character Error Rate)
- Precision, Recall, F1
- Correction Precision (CP) and Correction Recall (CR) (following Huang et al. 2023) for a more accurate picture of normalisation effectiveness
- Bootstrapped Confidence Intervals (1,000 iterations) for small test sets (optional)
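For reference, CER is the Levenshtein (edit) distance between hypothesis and reference, normalised by reference length. A minimal sketch, assuming plain character-level comparison (the repository's scripts may differ in edge-case handling):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    """Character Error Rate: edits needed per reference character."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)
```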
Two versions of the evaluation script are available:
- evaluate_model.py — the standard script
- evaluate-model-withCIs.py — an extended version that additionally computes 95% bootstrap confidence intervals (CI) for all metrics
sbatch evaluate-model.sh
# or
python3 evaluate_model.py
# or
python3 evaluate-model-withCIs.py
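The bootstrap procedure behind the CI variant can be sketched as follows: resample sentence-level scores with replacement 1,000 times and take the 2.5th/97.5th percentiles as a 95% interval. The sentence-level aggregation here is an assumption; the actual script may aggregate differently (e.g. corpus-level CER).

```python
import random

def bootstrap_ci(scores, n_iter=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_iter)
    )
    lo_idx = int(((1 - level) / 2) * n_iter)            # 2.5th percentile
    hi_idx = int((1 - (1 - level) / 2) * n_iter) - 1    # 97.5th percentile
    return means[lo_idx], means[hi_idx]

# Hypothetical per-sentence CER scores from a small test set
lo, hi = bootstrap_ci([0.10, 0.05, 0.20, 0.00, 0.15])
```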
Full evaluation results including confidence intervals and example predictions are available in the tokenised Evaluations directory of the repository.
Related Models and Resources
All models and datasets from the PaganTibet normalisation project are collected in the Normalisation collection on Hugging Face.
| Resource | Link |
|---|---|
| Non-tokenised model (recommended) | pagantibet/normalisationS2S-nontokenised |
| Training dataset | pagantibet/normalisation-S2S-training |
| Abbreviation dictionary | pagantibet/Tibetan-abbreviation-dictionary |
| Training & inference code | github.com/pagantibet/normalisation |
| ACTib corpus | Zenodo (Meelen & Roux 2020) |
| PaganTibet project | pagantibet.com |
License
This model is released under CC BY-NC-SA 4.0. It may be used freely for non-commercial research and educational purposes, with attribution and under the same licence terms.
Funding
This work was partially funded by the European Union (ERC, Pagan Tibet, grant no. 101097364). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency.