NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Paper: arXiv 2403.03100
This FAcodec model was trained on 50k hours of speech data. It offers more timbre diversity and is better at reconstructing speakers from podcasts, videos, games, or animations.
This is a separate decoder, designed and trained on top of the pretrained encoder specifically for the voice conversion task.
It supports zero-shot voice conversion and streaming voice conversion, and it generalizes well across timbres.
See the main repository for usage examples.