# wfloat-tts
wfloat-tts is a lightweight multi-speaker English VITS text-to-speech model with speaker, emotion, and intensity control.
This repo includes:
- `model.safetensors`: inference weights
- `config.json`: model config and token mapping
- `src/wfloat_tts/`: a small Python inference helper
The repo is set up for standalone inference from the released model files. You do not need the original training codebase to synthesize speech with it.
## Sample Outputs
### mad_scientist_woman (surprise)
- Audio: samples/08_mad_scientist_woman_surprise_080.wav
- Input text: "No, no, that's not possible. The formula should have crystallized, but it adapted instead. Do you realize what that means for the rest of my work?"
- Parameters: `sid=7`, `emotion="surprise"`, `intensity=0.8`
### fun_hero_woman (joy)
- Audio: samples/04_fun_hero_woman_joy_070.wav
- Input text: "Come on, keep up! The crowd is cheering."
- Parameters: `sid=3`, `emotion="joy"`, `intensity=0.7`
### strong_hero_man (anger)
- Audio: samples/05_strong_hero_man_anger_080.wav
- Input text: "Enough. You had your warning, and you kept pushing innocent people around. Take one more step, and I end this."
- Parameters: `sid=4`, `emotion="anger"`, `intensity=0.8`
Find more examples in the `samples` folder.
## Inputs
The intended inference inputs are:
- `text`: the utterance to synthesize
- `sid`: numeric speaker id
- `emotion`: emotion label
- `intensity`: value from `0.0` to `1.0`
You do not need to pass raw control symbols. The Python helper converts emotion and intensity into the control tokens the model was trained on.
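Since the model rejects nothing at the Python level that this README documents, a small pre-flight check can catch bad arguments before synthesis. The function below is a sketch based only on the ranges stated in this README (20 speakers, eight emotion labels, intensity in `[0.0, 1.0]`); it is not part of the `wfloat_tts` package.

```python
# Hypothetical pre-flight check for generate() arguments, based on the
# ranges documented in this README. Not part of the wfloat_tts package.

EMOTIONS = {
    "neutral", "joy", "sadness", "anger",
    "fear", "surprise", "dismissive", "confusion",
}

def validate_inputs(text: str, sid: int, emotion: str, intensity: float) -> None:
    """Raise ValueError if any argument is outside the documented ranges."""
    if not text.strip():
        raise ValueError("text must be a non-empty string")
    if not 0 <= sid <= 19:
        raise ValueError(f"sid must be in 0..19, got {sid}")
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError(f"intensity must be in [0.0, 1.0], got {intensity}")
```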
## Install
```bash
pip install -e .
pip install "piper-phonemize==1.3.0" -f https://k2-fsa.github.io/icefall/piper_phonemize
```
Runtime dependencies:
- `torch`
- `numpy`
- `safetensors`
- `piper-phonemize`
`piper-phonemize` is installed separately because the currently recommended wheels are hosted at the k2-fsa index used in the install command above.
## Python Example
```python
from wfloat_tts import load_generator, write_wave

generator = load_generator(
    checkpoint_path="model.safetensors",
    config_path="config.json",
)

audio = generator.generate(
    text="Hey there, how are you today?",
    sid=11,
    emotion="neutral",
    intensity=0.5,
)

write_wave("out.wav", audio.samples, audio.sample_rate)
```
## How It Is Conditioned
This model was trained to condition on:
- speaker id
- one emotion control token
- one intensity control token
The reference inference path processes a full utterance, appends one emotion token and one intensity token for the whole utterance, and runs synthesis over that full sequence.
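The sequence assembly described above can be sketched as follows. The token ids here are invented for illustration; the real emotion and intensity token ids come from the mapping in `config.json`.

```python
# Illustrative sketch of the conditioning scheme: the phoneme token
# sequence for the full utterance is extended with exactly one emotion
# token and one intensity token. Token ids below are made up; the real
# mapping lives in config.json.

def build_conditioned_sequence(phoneme_ids, emotion_token_id, intensity_token_id):
    """Append one emotion token and one intensity token to the utterance."""
    return list(phoneme_ids) + [emotion_token_id, intensity_token_id]

# Example with invented ids: phonemes first, then the two control tokens.
seq = build_conditioned_sequence([12, 7, 33], emotion_token_id=201, intensity_token_id=315)
```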
## Speaker IDs
Use numeric `sid` values:

| Speaker | SID |
|---|---|
| skilled_hero_man | 0 |
| skilled_hero_woman | 1 |
| fun_hero_man | 2 |
| fun_hero_woman | 3 |
| strong_hero_man | 4 |
| strong_hero_woman | 5 |
| mad_scientist_man | 6 |
| mad_scientist_woman | 7 |
| clever_villain_man | 8 |
| clever_villain_woman | 9 |
| narrator_man | 10 |
| narrator_woman | 11 |
| wise_elder_man | 12 |
| wise_elder_woman | 13 |
| outgoing_anime_man | 14 |
| outgoing_anime_woman | 15 |
| scary_villain_man | 16 |
| scary_villain_woman | 17 |
| news_reporter_man | 18 |
| news_reporter_woman | 19 |
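If you prefer to address speakers by name, the table above can be mirrored as a lookup dict. This is a convenience snippet, not part of the `wfloat_tts` package.

```python
# Convenience mapping from the speaker table above, so callers can look
# up a numeric sid by speaker name. Not part of the wfloat_tts package.

SPEAKER_IDS = {
    "skilled_hero_man": 0, "skilled_hero_woman": 1,
    "fun_hero_man": 2, "fun_hero_woman": 3,
    "strong_hero_man": 4, "strong_hero_woman": 5,
    "mad_scientist_man": 6, "mad_scientist_woman": 7,
    "clever_villain_man": 8, "clever_villain_woman": 9,
    "narrator_man": 10, "narrator_woman": 11,
    "wise_elder_man": 12, "wise_elder_woman": 13,
    "outgoing_anime_man": 14, "outgoing_anime_woman": 15,
    "scary_villain_man": 16, "scary_villain_woman": 17,
    "news_reporter_man": 18, "news_reporter_woman": 19,
}
```

For example, `SPEAKER_IDS["narrator_woman"]` gives the `sid=11` used in the Python example above.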
## Emotions
Supported emotion labels:
`neutral`, `joy`, `sadness`, `anger`, `fear`, `surprise`, `dismissive`, `confusion`

`intensity` is clamped to the range `[0.0, 1.0]` and mapped to one of ten discrete intensity levels.
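One plausible clamp-and-bucket scheme matching that description is sketched below. The exact bucketing the helper uses may differ; this is an assumption for illustration only.

```python
# Sketch of clamping intensity to [0.0, 1.0] and bucketing it into ten
# discrete levels (0..9). The helper's actual bucketing may differ.

def quantize_intensity(intensity: float, num_levels: int = 10) -> int:
    """Clamp to [0.0, 1.0], then map to a level in 0..num_levels-1."""
    clamped = max(0.0, min(1.0, intensity))
    return min(int(clamped * num_levels), num_levels - 1)
```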
## Notes
- `model.safetensors` is the main inference artifact in this repo.
- `config.json` includes the token mapping needed by the processor.
- The current release uses a multi-speaker model with 20 speakers.