MagpieTTS Multilingual 357M


🤗 HuggingFace MagpieTTS Multilingual demo: magpie_tts_multilingual_demo

💻 NeMo Framework: github.com/NVIDIA/NeMo

Description:

The model is a text-to-speech model that generates speech in the voices of five English speakers: Sofia, Aria, Jason, Leo, and John Van Stan. Each speaker can speak seven languages (En, Es, De, Fr, Vi, It, Zh). The model predicts discrete audio codec tokens autoregressively using a transformer encoder-decoder architecture. It employs multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement for high-quality audio generation, and leverages techniques such as attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment. The generated codec tokens are then converted to a speech waveform using NanoCodec.
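
To make the classifier-free guidance step concrete, here is a minimal, self-contained sketch of how conditional and unconditional logits are typically blended at each autoregressive decoding step. The function name and toy tensors are illustrative, not the NeMo API; the default cfg_scale matches the inference flag used later in this card.

import torch

def cfg_combine(cond_logits: torch.Tensor,
                uncond_logits: torch.Tensor,
                cfg_scale: float = 2.5) -> torch.Tensor:
    # cfg_scale = 1.0 recovers the purely conditional logits (CFG disabled)
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)

# Toy example: batch of 1, vocabulary of 4 codec tokens
cond = torch.tensor([[2.0, 0.5, 0.1, -1.0]])
uncond = torch.tensor([[1.0, 0.8, 0.2, -0.5]])
print(cfg_combine(cond, uncond))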

This model is ready for commercial use.

Key Features of the model

  • Multilingual Support: Synthesizes natural speech in English, Spanish, German, French, Vietnamese, Italian, and Mandarin
  • Expressive Voices: Multiple voice options with emotional tones and gender variations, including 4 proprietary voices and 1 public voice
  • Text Normalization: Built-in text normalization for handling numbers, abbreviations, and special characters in all languages except Vietnamese

Explore more from NVIDIA:

  • For the enterprise offering, see the MagpieTTS NIM, which includes additional native voices in the supported languages, emotional speech capabilities, and optimized inference pipelines for batch throughput and latency.
  • What is Nemotron?
  • NVIDIA Developer Nemotron
  • Build a voice agent with the model: code repo.

Deployment Geography:

Global

Use Case:

Wherever NVIDIA's text-to-speech (TTS) models are used, Multilingual MagpieTTS can generate multilingual speech for a given text.

Model Architecture:

Architecture Type: Transformer Encoder, Transformer Decoder, Local Transformer, and feedforward layers

Figure 1: MagpieTTS Model Architecture

Network Architecture:

  1. Causal Transformer Encoder with 6 layers, a learnable positional encoding of length 2048, and one LayerNorm output layer.
  2. Causal Transformer Decoder with 12 layers, a learnable positional encoding of length 2048, and one LayerNorm output layer.

Number of model parameters: 3.57 × 10^8 (357M)
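
For intuition, a rough PyTorch sketch of the encoder stack described above (6 causal layers, learned positional embeddings of length 2048, final LayerNorm). The hidden size and head count are assumptions, since this card does not state them, and causality would be enforced with an attention mask at call time.

import torch.nn as nn

d_model, n_heads = 768, 12  # assumed; not stated in this card

encoder = nn.ModuleDict({
    "pos_emb": nn.Embedding(2048, d_model),  # learnable positions, length 2048
    "layers": nn.ModuleList([
        nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        for _ in range(6)                    # 6 encoder layers
    ]),
    "norm": nn.LayerNorm(d_model),           # LayerNorm output layer
})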

Input:

Input Type(s): Text
Input Format: String
Input Parameters: One-Dimensional (1D)

Output:

Output Type(s): Audio
Output Format: .wav file
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Audio output with dimensions (B x T), where B is batch size and T is time dimension.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

How to Use this Model

NeMo Installation

To train, fine-tune, or perform TTS with this model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version and are using Python ≥ 3.10.12.

pip install "nemo_toolkit[tts] @ git+https://github.com/NVIDIA/NeMo.git@main"
pip install kaldialign

The model is available for use in the NeMo Framework, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Method 1: Single TTS Inference

In this method, the model infers on a single (text, language) pair. Text normalization can also be applied, if needed, for the En, Es, De, Fr, It, and Zh languages. Load the open-source MagpieTTS checkpoint from Hugging Face and call the do_tts(transcript, language, apply_TN) method. This returns the generated audio and the length of the audio.

from nemo.collections.tts.models import MagpieTTSModel

# Available speakers and their indices
speaker_map = {
    "John": 0,
    "Sofia": 1,
    "Aria": 2,
    "Jason": 3,
    "Leo": 4
}
transcript = "Hello world from NeMo Text to Speech."
language = "en"
speaker = "Sofia"
speaker_idx = speaker_map[speaker]

# Load the pre-trained checkpoint from Hugging Face and synthesize
model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m")
audio, audio_len = model.do_tts(transcript, language=language, apply_TN=False, speaker_index=speaker_idx)
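
The returned audio can then be written to disk. A minimal sketch, assuming audio is a (B x T) torch tensor and the 22.05 kHz output rate of the nemo-nano-codec-22khz checkpoint referenced later in this card:

import soundfile as sf

# Assumption: 22,050 Hz matches the NanoCodec checkpoint used with this model
sf.write("output.wav", audio[0].cpu().numpy(), samplerate=22050)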

Method 2: Batch Inference

This section explains how to run batch inference and evaluation on MagpieTTS models using the examples/tts/magpietts_inference.py script.

Key Points

The MagpieTTS inference script supports:

  • Batch inference from .nemo files or .ckpt checkpoints
  • Optional evaluation with metrics (CER, WER, Speaker Similarity, UTMOSv2)
  • Multiple datasets in a single run

Dataset Configuration (examples/tts/evalset_config.json)

The script requires a JSON configuration file that defines the metadata for the datasets to process.

Format

{
    "dataset_name_1": {
        "manifest_path": "/absolute/path/to/manifest.json",
        "audio_dir": "/",
        "feature_dir": null
    },
    "dataset_name_2": {
        "manifest_path": "/path/to/another_manifest.json",
        "audio_dir": "/base/audio/path",
        "feature_dir": "/path/to/features"
    }
}

Fields

Field            | Required | Description
manifest_path    | Yes      | Absolute path to the NeMo manifest JSON file
audio_dir        | Yes      | Base directory for audio files; use "/" if the manifest contains absolute paths
feature_dir      | No       | Directory for pre-computed features (set to null if not used)
whisper_language | No       | Language code for ASR evaluation (default: "en")

Example

{
    "libritts_test_clean": {
        "manifest_path": "/data/libritts/test_clean_manifest.json",
        "audio_dir": "/",
        "feature_dir": null,
        "whisper_language": "en"
    },
    "vctk": {
        "manifest_path": "/data/vctk/manifest.json",
        "audio_dir": "/data/vctk/wav48",
        "feature_dir": null
    }
}
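
Before launching a long run, it can help to sanity-check the config file. The helper below is illustrative and not part of the NeMo scripts:

import json

REQUIRED_FIELDS = {"manifest_path", "audio_dir"}

def validate_evalset_config(path: str) -> None:
    # Raise if any dataset entry is missing a required field
    with open(path) as f:
        config = json.load(f)
    for name, entry in config.items():
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"dataset '{name}' is missing fields: {missing}")

validate_evalset_config("examples/tts/evalset_config.json")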

Manifest Format

The manifest is a JSON-lines file where each line is a JSON object representing one utterance.

Minimum Required Fields

For models with fixed speaker context embeddings (no audio/text conditioning needed):

{"audio_filepath": "/path/to/audio.wav", "text": "The transcript text.", "duration": 3.5}

Field          | Type   | Description
audio_filepath | string | Path to the target audio file
text           | string | Text transcript to synthesize
duration       | float  | Audio duration in seconds
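
A manifest with these fields can be generated in a few lines of Python; the utterance list here is illustrative:

import json

# Illustrative (path, transcript, duration-in-seconds) tuples
utterances = [
    ("/data/audio/utt1.wav", "The first transcript.", 3.5),
    ("/data/audio/utt2.wav", "The second transcript.", 2.1),
]

with open("manifest.json", "w") as f:
    for audio_filepath, text, duration in utterances:
        entry = {"audio_filepath": audio_filepath, "text": text, "duration": duration}
        f.write(json.dumps(entry) + "\n")  # one JSON object per line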

Run Inference and Evaluation

# Basic inference (no evaluation)
python examples/tts/magpietts_inference.py \
    --nemo_files "nvidia/magpie_tts_multilingual_357m" \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
    --use_cfg \
    --cfg_scale 2.5

# Inference with evaluation
python examples/tts/magpietts_inference.py \
    --nemo_files "nvidia/magpie_tts_multilingual_357m" \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
    --run_evaluation \
    --use_cfg \
    --cfg_scale 2.5

Check Outputs

After running, you'll find:

  • Generated audio files in <out_dir>/<checkpoint_name>/
  • Evaluation metrics in metrics.json
  • Visualization plots (if evaluation enabled)
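
A quick way to confirm the run produced audio is to walk the output directory; a minimal sketch, assuming the layout above:

from pathlib import Path

out_dir = Path("/path/to/output")
# Generated audio lands under <out_dir>/<checkpoint_name>/
for wav in sorted(out_dir.rglob("*.wav")):
    print(wav)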

Evaluation Metrics

When --run_evaluation is enabled, the following metrics are computed:

Metric              | Description
CER                 | Character Error Rate (lower is better)
WER                 | Word Error Rate (lower is better)
SSIM (pred-gt)      | Speaker similarity between predicted and ground-truth audio
SSIM (pred-context) | Speaker similarity between predicted and context audio
UTMOSv2             | Audio quality score (higher is better; requires the utmosv2 package)
RTF                 | Real-time factor (processing time / audio duration)
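
RTF can also be measured directly around a single do_tts call, reusing the model and transcript from Method 1. The 22.05 kHz sample rate is an assumption based on the codec checkpoint named above:

import time

start = time.perf_counter()
audio, audio_len = model.do_tts(transcript, language="en", apply_TN=False, speaker_index=1)
elapsed = time.perf_counter() - start

sample_rate = 22050  # assumption: nemo-nano-codec-22khz output rate
audio_seconds = audio.shape[-1] / sample_rate
print(f"RTF: {elapsed / audio_seconds:.3f}")  # processing time / audio duration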

Software Integration:

Runtime Engine(s): NeMo Framework 25.11

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA A10 GPU
  • NVIDIA A30 GPU
  • NVIDIA A100 GPU
  • NVIDIA H100 GPU

Preferred/Supported Operating System(s):

  • Linux
  • Linux 4 Tegra

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

Multilingual MagpieTTS-357M

Training and Evaluation Datasets:

Training Dataset:

The datasets described below were used to train the model, along with additional datasets focused on speech and ASR.

Data Modality

  • Audio

Audio Training Data Size

  • 60,000 Hours

Data Collection Method by dataset

  • Publicly available dataset
  • Human

Labeling Method by dataset

  • Hybrid: Human, Synthetic - Human recorded data points were preprocessed algorithmically.

Properties:

  • Number of data items in training set: 38k hours
  • Modality: Audio (speech signal)
  • Nature of the content: Audio books
  • Language: Multilingual (En, Es, De, Fr, Vi, It, Zh)
  • Sensor Type: Microphones

Evaluation Dataset:

Benchmark Score

Data Collection Method by dataset:

  • Publicly available dataset
  • Human

Labeling Method by dataset:

  • Human
  • Hybrid: Human, Synthetic - Human labeled data points are mixed and matched to create more variabilities.

Properties:

  • Modality: Audio (speech signal)
  • Nature of the content: Audio books and newspaper passages
  • Language: Multilingual (En, Es, De, Fr)
  • Sensor Type: Microphones

Dataset             | CER (%) | SV-SSIM
LibriTTS test-clean | 0.38    | 0.823
Spanish CML         | 1.0     | 0.719
French CML          | 2.8     | 0.708
German CML          | 1.1     | 0.646

Inference:

Acceleration Engine: None
Test Hardware:

  • NVIDIA H100 GPU
  • NVIDIA A100 GPU
  • NVIDIA A6000 GPU
  • NVIDIA T4 GPU

Technical Limitations & Mitigation:

There are two modes of inference: standard and long-form. In standard mode, the model can generate up to twenty (20) seconds of multilingual (En, Es, De, Fr, Vi, It, Zh) speech at a time. In long-form mode, the model performs optimally when the input text contains punctuation and capitalization. The model was trained on a mix of publicly available speech datasets and internally recorded datasets in seven languages; as a result, it is not suitable for speech generation in any other language. We have removed the zero-shot capabilities of this model for this release. Text normalization is required.
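
One simple mitigation for the 20-second standard-mode limit is to split input text at sentence boundaries and synthesize each chunk separately; a minimal sketch (the splitting heuristic is illustrative):

import re

def chunk_text(text: str) -> list[str]:
    # Split on sentence-final punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(chunk_text("First sentence. Second one! And a third?"))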

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.

Reference(s):

  1. Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
  2. Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance