# Khmer IPA

A sequence-to-sequence model that converts Khmer script to IPA (International Phonetic Alphabet) transcriptions. It was trained from scratch using a BERT-based encoder-decoder architecture, with character-level tokenizers on both the input (Khmer) and output (IPA) sides.

## Model Details

| Property | Value |
|---|---|
| Architecture | `EncoderDecoderModel` (BERT encoder + BERT decoder) |
| Hidden size | 512 |
| Layers (enc + dec) | 6 each |
| Attention heads | 8 |
| Feed-forward size | 1024 |
| Encoder vocab size | 1000 (Khmer characters) |
| Decoder vocab size | 1000 (IPA characters) |
| Max sequence length | 128 |
| Best eval loss | 0.1736 (checkpoint 26000, ~11 epochs) |
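For readers curious how the hyperparameters above map onto `transformers` configuration objects, here is a rough sketch that rebuilds a randomly initialized model of the same shape. This is an illustration only (it assumes standard `BertConfig` defaults for everything not listed in the table); to use the trained weights, load the released checkpoint with `from_pretrained` as shown under Usage.

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Hyperparameters taken from the table above; all other settings are BertConfig defaults.
enc_cfg = BertConfig(
    vocab_size=1000, hidden_size=512, num_hidden_layers=6,
    num_attention_heads=8, intermediate_size=1024, max_position_embeddings=128,
)
dec_cfg = BertConfig(
    vocab_size=1000, hidden_size=512, num_hidden_layers=6,
    num_attention_heads=8, intermediate_size=1024, max_position_embeddings=128,
    is_decoder=True, add_cross_attention=True,  # decoder attends over encoder states
)

config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config=config)  # random init, same architecture

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```

The parameter count of this reconstruction lands in the same ballpark as the released checkpoint's reported size.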

## Usage

This model uses two separate tokenizers — one for Khmer input and one for IPA output — stored in subfolders.
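As a rough illustration of what "character-level" means here, a vocabulary is simply a mapping from individual characters to integer ids. The sketch below is hypothetical (`build_char_vocab` is not part of the released tokenizers, which should be loaded from the model repository as shown in the next snippet):

```python
def build_char_vocab(texts, specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]")):
    """Map each special token, then every distinct character seen, to an integer id."""
    vocab = {token: i for i, token in enumerate(specials)}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

vocab = build_char_vocab(["សួស្តី"])  # 4 specials + the distinct Khmer characters
ids = [vocab.get(ch, vocab["[UNK]"]) for ch in "សួស្តី"]
```

Unseen characters fall back to `[UNK]`, which is why the encoder and decoder keep separate vocabularies: the Khmer and IPA character inventories barely overlap.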

```python
from transformers import EncoderDecoderModel, AutoTokenizer

model = EncoderDecoderModel.from_pretrained("byumatrixlab/khmer-ipa")
encoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="decoder_tokenizer")

def khmer_to_ipa(text, num_beams=4):
    # Tokenize the Khmer input at the character level.
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    )
    # Beam search gives more stable transcriptions than greedy decoding.
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    return decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(khmer_to_ipa("សួស្តី"))
# → suǝsdǝy

print(khmer_to_ipa("ខ្ញុំជាសិស្ស"))
# → kɲomciesəh
```

### Batched inference

```python
def khmer_to_ipa_batch(texts, num_beams=4):
    # padding=True aligns the batch to the longest sequence.
    inputs = encoder_tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    )
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    return [decoder_tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```

## Training Data and Repository

Developed by the BYU MATRIX Lab. Training code and data-processing scripts are available in the MekongPhon repository.

Model size: 33.6M parameters (F32, Safetensors).