MagpieTTS Multilingual 357M
🤗 HuggingFace MagpieTTS Multilingual demo: magpie_tts_multilingual_demo
💻 NeMo Framework: github.com/NVIDIA/NeMo
Description:
The model is a text-to-speech model that generates speech in the voices of five English speakers: Sofia, Aria, Jason, Leo, and John Van Stan. Each speaker can speak seven different languages (En, Es, De, Fr, Vi, It, Zh). The model predicts discrete audio codec tokens autoregressively using a transformer encoder-decoder architecture. It employs multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement for high-quality audio generation, and leverages techniques like attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment. The generated codec tokens are then converted to a speech waveform using NanoCodec.
This model is ready for commercial use.
Key Features of the model
- Multilingual Support – Synthesizes natural speech in English, Spanish, German, French, Vietnamese, Italian, and Mandarin
- Expressive Voices – Multiple voice options with emotional tones and gender variations, including 4 proprietary voices and 1 public voice
- Text Normalization – Built-in text normalization for handling numbers, abbreviations, and special characters for all languages except Vietnamese
Explore more from NVIDIA:
- For the enterprise offering, see the MagpieTTS NIM which includes additional native voices in the supported languages, emotional speech capabilities, and optimized batch and latency inference pipeline.
- What is Nemotron?
- NVIDIA Developer Nemotron
- Build a voice agent with the model (code repo).
Deployment Geography:
Global
Use Case:
Wherever NVIDIA's text-to-speech (TTS) models are used, Multilingual MagpieTTS can generate multilingual speech for a given text.
Model Architecture:
Architecture Type: Transformer Encoder, Transformer Decoder, Local Transformer, and feedforward layers
Figure 1: MagpieTTS Model Architecture
Network Architecture:
- Causal Transformer Encoder with 6 layers, a learnable positional encoding of length 2048, and a Layer Normalization output layer.
- Causal Transformer Decoder with 12 layers, a learnable positional encoding of length 2048, and a Layer Normalization output layer.
Number of model parameters: 3.57 × 10^8 (357M)
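As an optional sanity check, the parameter count can be verified directly in PyTorch once the checkpoint is loaded (loading is shown in the inference example below). This is a minimal sketch, not part of the official workflow:

```python
from nemo.collections.tts.models import MagpieTTSModel

# Load the released checkpoint and count all parameters; expect roughly 3.57e8.
model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
```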
Input:
Input Type(s): Text
Input Format: String
Input Parameters: One-Dimensional (1D)
Output:
Output Type(s): Audio
Output Format: .wav file
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Audio output with dimensions (B x T), where B is batch size and T is time dimension.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
How to Use this Model
NeMo Installation
To train, fine-tune, or perform TTS with this model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version and Python ≥ 3.10.12.
```bash
pip install "nemo_toolkit[tts] @ git+https://github.com/NVIDIA/NeMo.git@main"
pip install kaldialign
```
The model is available for use in the NeMo Framework, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Method 1: Single TTS Inference
In this method, the model runs inference on a single (text, language) pair. Text normalization can also be applied if needed for the En, Es, De, Fr, It, and Zh languages.
Load the open-source MagpieTTS checkpoint from Hugging Face and call the `do_tts(transcript, language, apply_TN)` method. This returns the generated audio and its length.
```python
from nemo.collections.tts.models import MagpieTTSModel

# Available speaker indices for this checkpoint
speaker_map = {
    "John": 0,
    "Sofia": 1,
    "Aria": 2,
    "Jason": 3,
    "Leo": 4,
}

transcript = "Hello world from NeMo Text to Speech."
language = "en"
speaker = "Sofia"
speaker_idx = speaker_map[speaker]

# Load the checkpoint from Hugging Face and synthesize
model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m")
audio, audio_len = model.do_tts(transcript, language=language, apply_TN=False, speaker_index=speaker_idx)
```
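To store the result, the tensors returned by `do_tts` can be written to disk. A minimal sketch, assuming `audio` is a (batch, time) PyTorch tensor, `audio_len` holds per-item sample counts, and the output sample rate is 22,050 Hz (matching the 22 kHz NanoCodec used to decode the tokens):

```python
import soundfile as sf

# Trim to the reported length and write a mono .wav.
# Assumes a (batch, time) tensor and 22,050 Hz output; adjust if your version differs.
wav = audio[0, : int(audio_len[0])].detach().cpu().numpy()
sf.write("magpie_output.wav", wav, samplerate=22050)
```

Setting `apply_TN=True` in the `do_tts` call enables the built-in text normalization for En, Es, De, Fr, It, and Zh.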
Method 2: Batch Inference
This section explains how to run batch inference and evaluation on MagpieTTS models using the examples/tts/magpietts_inference.py script.
Key Points
The MagpieTTS inference script supports:
- Batch inference from `.nemo` files or `.ckpt` checkpoints
- Optional evaluation with metrics (CER, WER, Speaker Similarity, UTMOSv2)
- Multiple datasets in a single run
Dataset Configuration (examples/tts/evalset_config.json)
The script requires a JSON configuration file that defines the metadata for the datasets to process.
Format
```json
{
  "dataset_name_1": {
    "manifest_path": "/absolute/path/to/manifest.json",
    "audio_dir": "/",
    "feature_dir": null
  },
  "dataset_name_2": {
    "manifest_path": "/path/to/another_manifest.json",
    "audio_dir": "/base/audio/path",
    "feature_dir": "/path/to/features"
  }
}
```
Fields
| Field | Required | Description |
|---|---|---|
| `manifest_path` | Yes | Absolute path to the NeMo manifest JSON file |
| `audio_dir` | Yes | Base directory for audio files. Use `"/"` if the manifest contains absolute paths |
| `feature_dir` | No | Directory for pre-computed features (set to `null` if not used) |
| `whisper_language` | No | Language code for ASR evaluation (default: `"en"`) |
Example
```json
{
  "libritts_test_clean": {
    "manifest_path": "/data/libritts/test_clean_manifest.json",
    "audio_dir": "/",
    "feature_dir": null,
    "whisper_language": "en"
  },
  "vctk": {
    "manifest_path": "/data/vctk/manifest.json",
    "audio_dir": "/data/vctk/wav48",
    "feature_dir": null
  }
}
```
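If it is more convenient to generate this configuration programmatically, the following is a minimal sketch; the dataset name and paths are placeholders:

```python
import json

# Build a dataset config matching the format above and write it to disk.
eval_config = {
    "my_eval_set": {
        "manifest_path": "/data/my_eval/manifest.json",  # absolute path to a NeMo manifest
        "audio_dir": "/",                                 # "/" because the manifest uses absolute paths
        "feature_dir": None,                              # serialized as null (no pre-computed features)
        "whisper_language": "en",                         # ASR language for CER/WER evaluation
    }
}

with open("evalset_config.json", "w") as f:
    json.dump(eval_config, f, indent=2)
```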
Manifest Format
The manifest is a JSON-lines file where each line is a JSON object representing one utterance.
Minimum Required Fields
For models with fixed speaker context embeddings (no audio/text conditioning needed):
{"audio_filepath": "/path/to/audio.wav", "text": "The transcript text.", "duration": 3.5}
| Field | Type | Description |
|---|---|---|
| `audio_filepath` | string | Path to the target audio file |
| `text` | string | Text transcript to synthesize |
| `duration` | float | Audio duration in seconds |
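For reference, a manifest in this format can be produced with a few lines of Python. This is an illustrative sketch; the paths and transcripts are placeholders, and `soundfile` is used only to read durations from the audio headers:

```python
import json
import soundfile as sf

# Placeholder (audio_path, transcript) pairs for the evaluation set.
utterances = [
    ("/data/my_eval/wavs/utt_0001.wav", "The transcript text."),
    ("/data/my_eval/wavs/utt_0002.wav", "Another sentence to synthesize."),
]

# Write one JSON object per line (JSON-lines manifest).
with open("/data/my_eval/manifest.json", "w") as f:
    for audio_filepath, text in utterances:
        info = sf.info(audio_filepath)  # read duration from the audio header
        entry = {
            "audio_filepath": audio_filepath,
            "text": text,
            "duration": round(info.duration, 2),
        }
        f.write(json.dumps(entry) + "\n")
```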
Run Inference and Evaluation
```bash
# Basic inference (no evaluation)
python examples/tts/magpietts_inference.py \
    --nemo_files "nvidia/magpie_tts_multilingual_357m" \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
    --use_cfg \
    --cfg_scale 2.5

# Inference with evaluation
python examples/tts/magpietts_inference.py \
    --nemo_files "nvidia/magpie_tts_multilingual_357m" \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
    --run_evaluation \
    --use_cfg \
    --cfg_scale 2.5
```
Check Outputs
After running, you'll find:
- Generated audio files in `<out_dir>/<checkpoint_name>/`
- Evaluation metrics in `metrics.json`
- Visualization plots (if evaluation enabled)
Evaluation Metrics
When --run_evaluation is enabled, the following metrics are computed:
| Metric | Description |
|---|---|
| CER | Character Error Rate (lower is better) |
| WER | Word Error Rate (lower is better) |
| SSIM (pred-gt) | Speaker similarity between predicted and ground truth |
| SSIM (pred-context) | Speaker similarity between predicted and context |
| UTMOSv2 | Audio quality score (higher is better; requires the `utmosv2` package) |
| RTF | Real-time factor (processing time / audio duration) |
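To inspect the results after an evaluation run, the metrics file can be loaded with a few lines of Python. This is a hedged sketch: it assumes `metrics.json` is written somewhere under the `--out_dir` passed to `magpietts_inference.py` and that it is a plain JSON object, so it simply locates and pretty-prints it:

```python
import json
from pathlib import Path

# Search the output directory for metrics.json files produced by --run_evaluation.
out_dir = Path("/path/to/output")
for metrics_path in out_dir.rglob("metrics.json"):
    with open(metrics_path) as f:
        print(metrics_path)
        print(json.dumps(json.load(f), indent=2))
```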
Software Integration:
Runtime Engine(s): NeMo Framework 25.11
Supported Hardware Microarchitecture Compatibility:
- NVIDIA A10 GPU
- NVIDIA A30 GPU
- NVIDIA A100 GPU
- NVIDIA H100 GPU
Preferred/Supported Operating System(s):
- Linux
- Linux 4 Tegra
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
Multilingual MagpieTTS-357M
Training and Evaluation Datasets:
Training Dataset:
The following datasets were used to train the model, including additional datasets focused on speech and ASR.
- Hi-FiTTS En
- HiFiTTS-2 A Large-Scale High Bandwidth Speech Dataset En
- LibriTTS En
- Internal English Dataset
- CML-TTS Es
- Internal Spanish Dataset
- CML-TTS Fr
- Internal French Dataset
- CML-TTS It
- CML-TTS De
- Large-scale Vietnamese speech corpus (LSVSC) Vi
- InfoRe-2 Vi
- InfoRe-1 Vi
- Internal Vietnamese Dataset
- Internal Mandarin Dataset
Data Modality
- Audio
Audio Training Data Size
- 60,000 Hours
Data Collection Method by dataset
- Publicly available dataset
- Human
Labeling Method by dataset
- Hybrid: Human, Synthetic - Human recorded data points were preprocessed algorithmically.
Properties:
Number of data items in training set: 38k hours
Modality: Audio (speech signal)
Nature of the content: Audio books
Language: Multilingual (En, Es, De, Fr, Vi, It, Zh)
Sensor Type: Microphones
Evaluation Dataset:
Benchmark Score
Data Collection Method by dataset:
- Publicly available dataset
- Human
Labeling Method by dataset:
- Human
- Hybrid: Human, Synthetic - Human-labeled data points are mixed and matched to create more variability.
Properties:
Modality: Audio (speech signal)
Nature of the content: Audio books and Newspaper passages
Language: Multilingual (En, Es, De, Fr)
Sensor Type: Microphones
| Dataset | CER (%) | SV-SSIM |
|---|---|---|
| LibriTTS test-clean | 0.38 | 0.823 |
| Spanish CML | 1.0 | 0.719 |
| French CML | 2.8 | 0.708 |
| German CML | 1.1 | 0.646 |
- These results are based on the MagpieTTS model (Hugging Face checkpoint).
Inference:
Acceleration Engine: None
Test Hardware:
- NVIDIA H100 GPU
- NVIDIA A100 GPU
- NVIDIA A6000 GPU
- NVIDIA T4 GPU
Technical Limitations & Mitigation:
There are two modes of inference: standard and long-form. In standard mode, the model can generate up to twenty (20) seconds of multilingual (En, Es, De, Fr, Vi, It, Zh) speech at a time. In long-form mode, the model performs optimally when the input text contains punctuation and capitalization. The model was trained on a mix of publicly available speech datasets and internally recorded datasets in seven languages; as a result, it is not suitable for speech generation in languages other than those seven. Zero-shot capabilities have been removed from this model for this release. Text normalization is required.
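Because standard mode is limited to roughly 20 seconds per call, one simple workaround (not an official long-form API) is to synthesize longer text sentence by sentence and concatenate the clips. A hedged sketch, assuming `model` and `speaker_idx` from the Method 1 example and a 22,050 Hz output rate:

```python
import re

import numpy as np
import soundfile as sf

# Split on sentence-ending punctuation; the model works best on punctuated, capitalized text.
long_text = "First sentence of a longer passage. Second sentence. Third sentence."
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_text) if s.strip()]

clips = []
for sentence in sentences:
    audio, audio_len = model.do_tts(sentence, language="en", apply_TN=False, speaker_index=speaker_idx)
    clips.append(audio[0, : int(audio_len[0])].detach().cpu().numpy())

# Concatenate the per-sentence clips and write a single .wav (22,050 Hz assumed).
sf.write("long_form_output.wav", np.concatenate(clips), samplerate=22050)
```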
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
License/Terms of Use
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.