MagpieTTS Multilingual 357M
🤗 HuggingFace MagpieTTS Multilingual demo: magpie_tts_multilingual_demo
💻 NeMo Framework: github.com/NVIDIA/NeMo
Description:
The model is a text-to-speech model that generates speech in the voices of five English speakers: Sofia, Aria, Jason, Leo, and John Van Stan. Each speaker can speak seven different languages (En, Es, De, Fr, Vi, It, Zh). The model predicts discrete audio codec tokens autoregressively using a transformer encoder-decoder architecture. It employs multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement for high-quality audio generation, and leverages techniques like attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment. The generated codec tokens are then converted to a speech waveform using NanoCodec.
This model is ready for commercial use.
Key Features of the model
- Multilingual Support – Synthesizes natural speech in English, Spanish, German, French, Vietnamese, Italian, and Mandarin
- Expressive Voices – Multiple voice options with emotional tones and gender variations, including 4 proprietary voices and 1 public voice
- Text Normalization – Built-in text normalization for handling numbers, abbreviations, and special characters for all languages except Vietnamese
Explore more from NVIDIA:
- For the enterprise offering, see the MagpieTTS NIM which includes additional native voices in the supported languages, emotional speech capabilities, and optimized batch and latency inference pipeline.
- What is Nemotron?
- NVIDIA Developer Nemotron
- Build a voice agent with the model (code repo).
Deployment Geography:
Global
Use Case:
Wherever NVIDIA's text-to-speech (TTS) models are used, Multilingual MagpieTTS can generate multilingual speech for a given text.
Model Architecture:
Architecture Type: Transformer Encoder, Transformer Decoder, Local Transformer, and feedforward layers
Figure 1: MagpieTTS Model Architecture
Network Architecture:
- Causal Transformer Encoder with 6 layers, a learnable positional encoding of length 2048, and a Layer Normalization output layer.
- Causal Transformer Decoder with 12 layers, a learnable positional encoding of length 2048, and a Layer Normalization output layer.
Number of model parameters: 3.57 × 10^8 (357M)
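As an optional sanity check, the parameter count can be verified directly in PyTorch once the checkpoint is loaded (loading is shown in the inference example below). This is a minimal sketch, not part of the official workflow:

```python
from nemo.collections.tts.models import MagpieTTSModel

# Load the released checkpoint and count all parameters; expect roughly 3.57e8.
model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
```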
Input:
Input Type(s): Text
Input Format: String
Input Parameters: One-Dimensional (1D)
Output:
Output Type(s): Audio
Output Format: .wav file
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Audio output with dimensions (B x T), where B is batch size and T is time dimension.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
How to Use this Model
NeMo Installation
To train, fine-tune, or perform TTS with this model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version and Python ≥ 3.10.12.
```bash
pip install "nemo_toolkit[tts] @ git+https://github.com/NVIDIA/NeMo.git@main"
pip install kaldialign
```
The model is available for use in the NeMo Framework, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Method 1: Single TTS Inference
In this method, the model runs inference on a single (text, language) pair. Text normalization can also be applied if needed for the En, Es, De, Fr, It, and Zh languages.
Load the open-source MagpieTTS checkpoint from Hugging Face and call the `do_tts(transcript, language, apply_TN)` method. This returns the generated audio and its length.
```python
from nemo.collections.tts.models import MagpieTTSModel

# Available speaker indices for this checkpoint
speaker_map = {
    "John": 0,
    "Sofia": 1,
    "Aria": 2,
    "Jason": 3,
    "Leo": 4,
}

transcript = "Hello world from NeMo Text to Speech."
language = "en"
speaker = "Sofia"
speaker_idx = speaker_map[speaker]

# Load the checkpoint from Hugging Face and synthesize
model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m")
audio, audio_len = model.do_tts(transcript, language=language, apply_TN=False, speaker_index=speaker_idx)
```
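To store the result, the tensors returned by `do_tts` can be written to disk. A minimal sketch, assuming `audio` is a (batch, time) PyTorch tensor, `audio_len` holds per-item sample counts, and the output sample rate is 22,050 Hz (matching the 22 kHz NanoCodec used to decode the tokens):

```python
import soundfile as sf

# Trim to the reported length and write a mono .wav.
# Assumes a (batch, time) tensor and 22,050 Hz output; adjust if your version differs.
wav = audio[0, : int(audio_len[0])].detach().cpu().numpy()
sf.write("magpie_output.wav", wav, samplerate=22050)
```

Setting `apply_TN=True` in the `do_tts` call enables the built-in text normalization for En, Es, De, Fr, It, and Zh.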
Method 2: Batch Inference
This section explains how to run batch inference and evaluation on MagpieTTS models using the examples/tts/magpietts_inference.py script.
Key Points
The MagpieTTS inference script supports:
- Batch inference from `.nemo` files or `.ckpt` checkpoints
- Optional evaluation with metrics (CER, WER, Speaker Similarity, UTMOSv2)
- Multiple datasets in a single run
Dataset Configuration (examples/tts/evalset_config.json)
The script requires a JSON configuration file that defines the metadata for the datasets to process.
Format
```json
{
  "dataset_name_1": {
    "manifest_path": "/absolute/path/to/manifest.json",
    "audio_dir": "/",
    "feature_dir": null
  },
  "dataset_name_2": {
    "manifest_path": "/path/to/another_manifest.json",
    "audio_dir": "/base/audio/path",
    "feature_dir": "/path/to/features"
  }
}
```
Fields
| Field | Required | Description |
|---|---|---|
| `manifest_path` | Yes | Absolute path to the NeMo manifest JSON file |
| `audio_dir` | Yes | Base directory for audio files. Use `"/"` if the manifest contains absolute paths |
| `feature_dir` | No | Directory for pre-computed features (set to `null` if not used) |
| `whisper_language` | No | Language code for ASR evaluation (default: `"en"`) |
Example
```json
{
  "libritts_test_clean": {
    "manifest_path": "/data/libritts/test_clean_manifest.json",
    "audio_dir": "/",
    "feature_dir": null,
    "whisper_language": "en"
  },
  "vctk": {
    "manifest_path": "/data/vctk/manifest.json",
    "audio_dir": "/data/vctk/wav48",
    "feature_dir": null
  }
}
```
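If it is more convenient to generate this configuration programmatically, the following is a minimal sketch; the dataset name and paths are placeholders:

```python
import json

# Build a dataset config matching the format above and write it to disk.
eval_config = {
    "my_eval_set": {
        "manifest_path": "/data/my_eval/manifest.json",  # absolute path to a NeMo manifest
        "audio_dir": "/",                                 # "/" because the manifest uses absolute paths
        "feature_dir": None,                              # serialized as null (no pre-computed features)
        "whisper_language": "en",                         # ASR language for CER/WER evaluation
    }
}

with open("evalset_config.json", "w") as f:
    json.dump(eval_config, f, indent=2)
```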
Manifest Format
The manifest is a JSON-lines file where each line is a JSON object representing one utterance.
Minimum Required Fields
For models with fixed speaker context embeddings (no audio/text conditioning needed):
{"audio_filepath": "/path/to/audio.wav", "text": "The transcript text.", "duration": 3.5}
| Field | Type | Description |
|---|---|---|
| `audio_filepath` | string | Path to the target audio file |
| `text` | string | Text transcript to synthesize |
| `duration` | float | Audio duration in seconds |
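For reference, a manifest in this format can be produced with a few lines of Python. This is an illustrative sketch; the paths and transcripts are placeholders, and `soundfile` is used only to read durations from the audio headers:

```python
import json
import soundfile as sf

# Placeholder (audio_path, transcript) pairs for the evaluation set.
utterances = [
    ("/data/my_eval/wavs/utt_0001.wav", "The transcript text."),
    ("/data/my_eval/wavs/utt_0002.wav", "Another sentence to synthesize."),
]

# Write one JSON object per line (JSON-lines manifest).
with open("/data/my_eval/manifest.json", "w") as f:
    for audio_filepath, text in utterances:
        info = sf.info(audio_filepath)  # read duration from the audio header
        entry = {
            "audio_filepath": audio_filepath,
            "text": text,
            "duration": round(info.duration, 2),
        }
        f.write(json.dumps(entry) + "\n")
```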
Run Inference and Evaluation
```bash
# Basic inference (no evaluation)
python examples/tts/magpietts_inference.py \
    --nemo_files "nvidia/magpie_tts_multilingual_357m" \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
    --use_cfg \
    --cfg_scale 2.5

# Inference with evaluation
python examples/tts/magpietts_inference.py \
    --nemo_files "nvidia/magpie_tts_multilingual_357m" \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
    --run_evaluation \
    --use_cfg \
    --cfg_scale 2.5
```
Check Outputs
After running, you'll find:
- Generated audio files in `<out_dir>/<checkpoint_name>/`
- Evaluation metrics in `metrics.json`
- Visualization plots (if evaluation enabled)
Evaluation Metrics
When --run_evaluation is enabled, the following metrics are computed:
| Metric | Description |
|---|---|
| CER | Character Error Rate (lower is better) |
| WER | Word Error Rate (lower is better) |
| SSIM (pred-gt) | Speaker similarity between predicted and ground truth |
| SSIM (pred-context) | Speaker similarity between predicted and context |
| UTMOSv2 | Audio quality score (higher is better; requires the `utmosv2` package) |
| RTF | Real-time factor (processing time / audio duration) |
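To inspect the results after an evaluation run, the metrics file can be loaded with a few lines of Python. This is a hedged sketch: it assumes `metrics.json` is written somewhere under the `--out_dir` passed to `magpietts_inference.py` and that it is a plain JSON object, so it simply locates and pretty-prints it:

```python
import json
from pathlib import Path

# Search the output directory for metrics.json files produced by --run_evaluation.
out_dir = Path("/path/to/output")
for metrics_path in out_dir.rglob("metrics.json"):
    with open(metrics_path) as f:
        print(metrics_path)
        print(json.dumps(json.load(f), indent=2))
```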
Software Integration:
Runtime Engine(s): NeMo Framework 25.11
Supported Hardware Microarchitecture Compatibility:
- NVIDIA A10 GPU
- NVIDIA A30 GPU
- NVIDIA A100 GPU
- NVIDIA H100 GPU
Preferred/Supported Operating System(s):
- Linux
- Linux 4 Tegra
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
Multilingual MagpieTTS-357M
Training and Evaluation Datasets:
Training Dataset:
The following datasets were used to train the model, including additional datasets focused on speech and ASR.
- Hi-FiTTS En
- HiFiTTS-2 A Large-Scale High Bandwidth Speech Dataset En
- LibriTTS En
- Internal English Dataset
- CML-TTS Es
- Internal Spanish Dataset
- CML-TTS Fr
- Internal French Dataset
- CML-TTS It
- CML-TTS De
- Large-scale Vietnamese speech corpus (LSVSC) Vi
- InfoRe-2 Vi
- InfoRe-1 Vi
- Internal Vietnamese Dataset
- Internal Mandarin Dataset
Data Modality
- Audio
Audio Training Data Size
- 60,000 Hours
Data Collection Method by dataset
- Publicly available dataset
- Human
Labeling Method by dataset
- Hybrid: Human, Synthetic - Human recorded data points were preprocessed algorithmically.
Properties:
Number of data items in training set: 38k hours
Modality: Audio (speech signal)
Nature of the content: Audio books
Language: Multilingual (En, Es, De, Fr, Vi, It, Zh)
Sensor Type: Microphones
Evaluation Dataset:
Benchmark Score
Data Collection Method by dataset:
- Publicly available dataset
- Human
Labeling Method by dataset:
- Human
- Hybrid: Human, Synthetic - Human-labeled data points are mixed and matched to create more variability.
Properties:
Modality: Audio (speech signal)
Nature of the content: Audio books and Newspaper passages
Language: Multilingual (En, Es, De, Fr)
Sensor Type: Microphones
| Dataset | CER (%) | SV-SSIM |
|---|---|---|
| LibriTTS test-clean | 0.38 | 0.823 |
| Spanish CML | 1.0 | 0.719 |
| French CML | 2.8 | 0.708 |
| German CML | 1.1 | 0.646 |
- These results are based on the MagpieTTS model (Hugging Face checkpoint).
Inference:
Acceleration Engine: None
Test Hardware:
- NVIDIA H100 GPU
- NVIDIA A100 GPU
- NVIDIA A6000 GPU
- NVIDIA T4 GPU
Technical Limitations & Mitigation:
There are two modes of inference: standard and long-form. In standard mode, the model can generate up to twenty (20) seconds of multilingual (En, Es, De, Fr, Vi, It, Zh) speech at a time. In long-form mode, the model performs optimally when the input text contains punctuation and capitalization. The model was trained on a mix of publicly available speech datasets and internally recorded datasets in seven languages; as a result, it is not suitable for speech generation in languages other than those seven. Zero-shot capabilities have been removed from this model for this release. Text normalization is required.
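Because standard mode is limited to roughly 20 seconds per call, one simple workaround (not an official long-form API) is to synthesize longer text sentence by sentence and concatenate the clips. A hedged sketch, assuming `model` and `speaker_idx` from the Method 1 example and a 22,050 Hz output rate:

```python
import re

import numpy as np
import soundfile as sf

# Split on sentence-ending punctuation; the model works best on punctuated, capitalized text.
long_text = "First sentence of a longer passage. Second sentence. Third sentence."
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_text) if s.strip()]

clips = []
for sentence in sentences:
    audio, audio_len = model.do_tts(sentence, language="en", apply_TN=False, speaker_index=speaker_idx)
    clips.append(audio[0, : int(audio_len[0])].detach().cpu().numpy())

# Concatenate the per-sentence clips and write a single .wav (22,050 Hz assumed).
sf.write("long_form_output.wav", np.concatenate(clips), samplerate=22050)
```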
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
License/Terms of Use
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.