---
license: mit
language:
- en
pipeline_tag: text-to-speech
---

# wfloat-tts

`wfloat-tts` is a lightweight multi-speaker English VITS text-to-speech model with speaker, emotion, and intensity control.

This repo includes:

- `model.safetensors`: inference weights
- `config.json`: model config and token mapping
- `src/wfloat_tts/`: a small Python inference helper

The repo is set up for standalone inference from the released model files. You do not need the original training codebase to synthesize speech with it.

## Sample Outputs

### `mad_scientist_woman` surprise

- Audio: [samples/08_mad_scientist_woman_surprise_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/08_mad_scientist_woman_surprise_080.wav)
- Input text: "No, no, that's not possible. The formula should have crystallized, but it adapted instead. Do you realize what that means for the rest of my work?"
- `sid`: `7`
- `emotion`: `surprise`
- `intensity`: `0.8`

### `fun_hero_woman` joy

- Audio: [samples/04_fun_hero_woman_joy_070.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/04_fun_hero_woman_joy_070.wav)
- Input text: "Come on, keep up! The crowd is cheering."
- `sid`: `3`
- `emotion`: `joy`
- `intensity`: `0.7`

### `strong_hero_man` anger

- Audio: [samples/05_strong_hero_man_anger_080.wav](https://huggingface.co/Wfloat/wfloat-tts/resolve/main/samples/05_strong_hero_man_anger_080.wav)
- Input text: "Enough. You had your warning, and you kept pushing innocent people around. Take one more step, and I end this."
- `sid`: `4`
- `emotion`: `anger`
- `intensity`: `0.8`

Find more examples in the [samples folder](https://huggingface.co/Wfloat/wfloat-tts/tree/main/samples).

## Inputs

The intended inference inputs are:

- `text`: the utterance to synthesize
- `sid`: numeric speaker id
- `emotion`: emotion label
- `intensity`: value from `0.0` to `1.0`

You do not need to pass raw control symbols.
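As an illustration of the input contract above, a small pre-flight check might look like the sketch below. This is not part of the released helper; the function name and return shape are hypothetical.

```python
EMOTIONS = {"neutral", "joy", "sadness", "anger", "fear",
            "surprise", "dismissive", "confusion"}

def validate_inputs(text: str, sid: int, emotion: str, intensity: float) -> dict:
    # Hypothetical check mirroring the documented input contract.
    if not text:
        raise ValueError("text must be a non-empty string")
    if not 0 <= sid <= 19:
        raise ValueError("sid must be in 0..19 (20 speakers)")
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    # intensity is clamped rather than rejected, matching the model card.
    intensity = min(max(float(intensity), 0.0), 1.0)
    return {"text": text, "sid": sid, "emotion": emotion, "intensity": intensity}
```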
The Python helper converts `emotion` and `intensity` into the control tokens the model was trained on.

## Install

```bash
pip install -e .
pip install "piper-phonemize==1.3.0" -f https://k2-fsa.github.io/icefall/piper_phonemize
```

Runtime dependencies:

- `torch`
- `numpy`
- `safetensors`
- `piper-phonemize`

`piper-phonemize` is installed separately because the currently recommended wheels are hosted at:

- https://k2-fsa.github.io/icefall/piper_phonemize

## Python Example

```python
from wfloat_tts import load_generator, write_wave

generator = load_generator(
    checkpoint_path="model.safetensors",
    config_path="config.json",
)

audio = generator.generate(
    text="Hey there, how are you today?",
    sid=11,
    emotion="neutral",
    intensity=0.5,
)

write_wave("out.wav", audio.samples, audio.sample_rate)
```

## How It Is Conditioned

This model was trained to condition on:

- speaker id
- one emotion control token
- one intensity control token

The reference inference path processes a full utterance, appends one emotion token and one intensity token for the whole utterance, and runs synthesis over that full sequence.

## Speaker IDs

Use numeric `sid` values:

| Speaker | SID |
| --- | ---: |
| `skilled_hero_man` | 0 |
| `skilled_hero_woman` | 1 |
| `fun_hero_man` | 2 |
| `fun_hero_woman` | 3 |
| `strong_hero_man` | 4 |
| `strong_hero_woman` | 5 |
| `mad_scientist_man` | 6 |
| `mad_scientist_woman` | 7 |
| `clever_villain_man` | 8 |
| `clever_villain_woman` | 9 |
| `narrator_man` | 10 |
| `narrator_woman` | 11 |
| `wise_elder_man` | 12 |
| `wise_elder_woman` | 13 |
| `outgoing_anime_man` | 14 |
| `outgoing_anime_woman` | 15 |
| `scary_villain_man` | 16 |
| `scary_villain_woman` | 17 |
| `news_reporter_man` | 18 |
| `news_reporter_woman` | 19 |

## Emotions

Supported emotion labels:

- `neutral`
- `joy`
- `sadness`
- `anger`
- `fear`
- `surprise`
- `dismissive`
- `confusion`

`intensity` is clamped to the range `[0.0, 1.0]` and mapped to one of ten discrete intensity levels.
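The clamp-then-bucket step can be sketched as follows. This is only an illustration of "clamped to `[0.0, 1.0]` and mapped to one of ten discrete levels"; the exact binning used by the released helper is defined in `src/wfloat_tts/` and may differ.

```python
def intensity_level(intensity: float, levels: int = 10) -> int:
    # Clamp to [0.0, 1.0], then map into one of `levels` discrete buckets.
    # Hypothetical binning: uniform buckets, with 1.0 folded into the top one.
    clamped = min(max(intensity, 0.0), 1.0)
    return min(int(clamped * levels), levels - 1)
```

Under this sketch, out-of-range values saturate at the lowest or highest level rather than raising an error.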
## Notes

- `model.safetensors` is the main inference artifact in this repo.
- `config.json` includes the token mapping needed by the processor.
- The current release uses a multi-speaker model with 20 speakers.
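For reference, the `write_wave(path, samples, sample_rate)` call in the Python example could be approximated with the standard library alone, assuming mono float samples in `[-1.0, 1.0]`. The sketch below is not the repo's implementation; it only shows one plausible way to serialize such samples as 16-bit PCM.

```python
import struct
import wave

def write_wave_sketch(path: str, samples, sample_rate: int) -> None:
    # Hypothetical stand-in for the repo's write_wave helper:
    # clamp each float sample to [-1.0, 1.0] and store as 16-bit PCM mono.
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)          # 16-bit samples
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)
```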