Southeast Asian IPA Transliteration Models
Khmer and Lao IPA transliteration models, trained using MekongPhon (https://github.com/byu-matrix-lab/MekongPhon).
A sequence-to-sequence model that converts Khmer script to IPA (International Phonetic Alphabet) transcriptions. Trained from scratch using a BERT-based encoder-decoder architecture with character-level tokenizers for both input (Khmer) and output (IPA).
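To make "character-level" concrete, the following toy sketch shows one way such a tokenizer could work. It assumes one token per Unicode code point and BERT-style special tokens. The class name `ToyCharTokenizer` and its tiny vocabularies are hypothetical; the released tokenizers may differ, for example in how Khmer subscript signs are handled.

```python
# Illustrative only: a toy character-level tokenizer, not the model's actual one.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]  # assumed BERT-style specials


class ToyCharTokenizer:
    def __init__(self, corpus):
        # One vocabulary entry per distinct Unicode code point seen in the corpus.
        chars = sorted({ch for text in corpus for ch in text})
        self.id_to_token = SPECIAL_TOKENS + chars
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        # Unseen characters fall back to [UNK]; the sequence is wrapped in [CLS] ... [SEP].
        unk = self.token_to_id["[UNK]"]
        body = [self.token_to_id.get(ch, unk) for ch in text]
        return [self.token_to_id["[CLS]"]] + body + [self.token_to_id["[SEP]"]]

    def decode(self, ids):
        # Drop special tokens and join the remaining characters.
        n_special = len(SPECIAL_TOKENS)
        return "".join(self.id_to_token[i] for i in ids if i >= n_special)


# Separate vocabularies for the input script and the output alphabet.
khmer_tok = ToyCharTokenizer(["សួស្តី", "ខ្ញុំជាសិស្ស"])
ipa_tok = ToyCharTokenizer(["suǝsdǝy", "kɲomciesəh"])

ids = khmer_tok.encode("សួស្តី")
print(ids)
print(khmer_tok.decode(ids))
```

Keeping two vocabularies means the encoder never sees IPA symbols and the decoder never has to score Khmer characters, which keeps both embedding tables small.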
| Property | Value |
|---|---|
| Architecture | EncoderDecoderModel (BERT encoder + BERT decoder) |
| Hidden size | 512 |
| Layers (enc + dec) | 6 each |
| Attention heads | 8 |
| Feed-forward size | 1024 |
| Encoder vocab size | 1000 (Khmer characters) |
| Decoder vocab size | 1000 (IPA characters) |
| Max sequence length | 128 |
| Best eval loss | 0.1736 (checkpoint 26000, ~11 epochs) |
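As a rough sketch, the sizes in the table above could be expressed as `transformers` configuration objects like this. The value of `max_position_embeddings` is an assumption (the table only gives a maximum sequence length of 128), and the helper name `build_untrained_model` is hypothetical; the actual training setup lives in the MekongPhon repository and may differ.

```python
# Hyperparameters copied from the table above; max_position_embeddings is an
# assumption (the table only states a maximum sequence length of 128).
SHARED = dict(
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=1024,
    max_position_embeddings=128,
)
ENCODER_KWARGS = dict(vocab_size=1000, **SHARED)
DECODER_KWARGS = dict(vocab_size=1000, **SHARED)


def build_untrained_model():
    """Return a randomly initialised EncoderDecoderModel with these sizes."""
    # Imported lazily so the settings above can be inspected without transformers.
    from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

    encoder_config = BertConfig(**ENCODER_KWARGS)
    decoder_config = BertConfig(**DECODER_KWARGS)
    # from_encoder_decoder_configs marks the decoder as a decoder and
    # enables cross-attention over the encoder outputs.
    config = EncoderDecoderConfig.from_encoder_decoder_configs(
        encoder_config, decoder_config
    )
    return EncoderDecoderModel(config=config)


# Each attention head works on hidden_size / num_attention_heads dimensions.
head_dim = SHARED["hidden_size"] // SHARED["num_attention_heads"]
print(head_dim)  # 64
```

Calling `build_untrained_model()` yields random weights only; to use the trained model, load it with `from_pretrained` instead.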
This model uses two separate tokenizers, one for Khmer input and one for IPA output, stored in the `encoder_tokenizer` and `decoder_tokenizer` subfolders of the model repository.
```python
from transformers import EncoderDecoderModel, AutoTokenizer

# Load the model and the two tokenizers from their subfolders.
model = EncoderDecoderModel.from_pretrained("byumatrixlab/khmer-ipa")
encoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="decoder_tokenizer")


def khmer_to_ipa(text, num_beams=4):
    # Tokenize the Khmer input, truncating to the model's maximum length.
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    )
    # Generate IPA token ids with beam search.
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    return decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(khmer_to_ipa("សួស្តី"))
# → suǝsdǝy
print(khmer_to_ipa("ខ្ញុំជាសិស្ស"))
# → kɲomciesəh
```
```python
def khmer_to_ipa_batch(texts, num_beams=4):
    # padding=True pads every input to the longest text in the batch.
    inputs = encoder_tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    )
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode each generated sequence separately.
    return [decoder_tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```
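Both helpers pass `truncation=True` with `max_length=128`, so any input beyond that length is cut off before it reaches the model. One possible workaround is to split long passages into shorter pieces first. The sketch below is an illustration under stated assumptions, not part of the released code: the name `chunk_text` is hypothetical, it assumes one token per character, and the default budget of 126 characters assumes the tokenizer adds two special tokens.

```python
import re


def chunk_text(text, max_chars=126):
    """Split text into chunks of at most max_chars characters.

    Breaks are preferred after whitespace or the Khmer full stop (U+17D4);
    a run with no such break point is hard-split. Joining the chunks
    reproduces the original text exactly.
    """
    # Each piece is a run of ordinary characters plus any trailing delimiters,
    # or a run of delimiters on its own (e.g. at the start of the text).
    pieces = re.findall(r"[^\s\u17d4]+[\s\u17d4]*|[\s\u17d4]+", text)
    chunks, current = [], ""
    for piece in pieces:
        # Hard-split any single piece that is itself longer than the budget.
        while len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        # Start a new chunk when the piece no longer fits in the current one.
        if len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


print([len(c) for c in chunk_text("ក" * 300)])  # [126, 126, 48]
```

The resulting list could then be passed to `khmer_to_ipa_batch` and the outputs concatenated. Whether chunk boundaries affect transcription quality has not been checked here, so treat the split points as a starting point.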
Developed by the BYU MATRIX Lab.

Training code and data processing scripts: MekongPhon (https://github.com/byu-matrix-lab/MekongPhon)