wfloat-tts

wfloat-tts is a lightweight multi-speaker English VITS text-to-speech model with speaker, emotion, and intensity control.

This repo includes:

  • model.safetensors: inference weights
  • config.json: model config and token mapping
  • src/wfloat_tts/: a small Python inference helper

The repo is set up for standalone inference from the released model files. You do not need the original training codebase to synthesize speech with it.

Sample Outputs

mad_scientist_woman surprise

  • Audio: samples/08_mad_scientist_woman_surprise_080.wav
  • Input text: "No, no, that's not possible. The formula should have crystallized, but it adapted instead. Do you realize what that means for the rest of my work?"
  • sid: 7
  • emotion: surprise
  • intensity: 0.8

strong_hero_man anger

  • Audio: samples/05_strong_hero_man_anger_080.wav
  • Input text: "Enough. You had your warning, and you kept pushing innocent people around. Take one more step, and I end this."
  • sid: 4
  • emotion: anger
  • intensity: 0.8

Find more examples in the samples folder.

Inputs

The intended inference inputs are:

  • text: the utterance to synthesize
  • sid: numeric speaker id
  • emotion: emotion label
  • intensity: value from 0.0 to 1.0

You do not need to pass raw control symbols. The Python helper converts emotion and intensity into the control tokens the model was trained on.

Install

pip install -e .
pip install "piper-phonemize==1.3.0" -f https://k2-fsa.github.io/icefall/piper_phonemize

Runtime dependencies:

  • torch
  • numpy
  • safetensors
  • piper-phonemize

piper-phonemize is installed separately because the currently recommended wheels are hosted at https://k2-fsa.github.io/icefall/piper_phonemize, which is why the install command above passes that URL via -f.

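As a quick check that piper-phonemize installed correctly, you can call the phonemizer directly. This assumes the standard piper_phonemize Python bindings (the bundled helper normally calls these for you):

from piper_phonemize import phonemize_espeak

# Phonemize one sentence with an English espeak voice; the result is a
# list of phoneme sequences, one per input sentence.
print(phonemize_espeak("Hello world.", "en-us"))
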
Python Example

from wfloat_tts import load_generator, write_wave

# Load the released weights and the config that holds the token mapping.
generator = load_generator(
    checkpoint_path="model.safetensors",
    config_path="config.json",
)

# Synthesize one utterance; sid selects the speaker (11 = narrator_woman,
# see the table below), and the helper converts emotion and intensity
# into the model's control tokens.
audio = generator.generate(
    text="Hey there, how are you today?",
    sid=11,
    emotion="neutral",
    intensity=0.5,
)

# Write the synthesized samples to a wav file.
write_wave("out.wav", audio.samples, audio.sample_rate)

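The same call pattern extends to simple sweeps. As an illustrative example (the output file names and chosen settings here are not from this repo), reusing the generator from above:

# Synthesize the same line across several emotions and write one wav per
# setting; uses only the API shown above.
for emotion in ["neutral", "joy", "anger", "surprise"]:
    audio = generator.generate(
        text="Hey there, how are you today?",
        sid=11,
        emotion=emotion,
        intensity=0.8,
    )
    write_wave(f"out_{emotion}.wav", audio.samples, audio.sample_rate)
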
How It Is Conditioned

This model was trained to condition on:

  • speaker id
  • one emotion control token
  • one intensity control token

The reference inference path processes the utterance as a whole: it phonemizes the full text, appends one emotion token and one intensity token for the entire utterance, and runs synthesis over that combined sequence.

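In other words, the model sees the phoneme sequence followed by exactly two extra tokens. A minimal sketch of that layout, with hypothetical token names and ids (the real mapping lives in config.json, and the processor applies it for you):

# Hypothetical ids for illustration only; config.json holds the real map.
token_map = {"<anger>": 300, "<int_8>": 318}
phoneme_ids = [12, 44, 7, 93]  # dummy ids for a phonemized utterance

# One emotion token and one intensity token are appended for the whole
# utterance before synthesis.
input_ids = phoneme_ids + [token_map["<anger>"], token_map["<int_8>"]]
print(input_ids)  # [12, 44, 7, 93, 300, 318]
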
Speaker IDs

Use numeric sid values:

Speaker                 SID
skilled_hero_man          0
skilled_hero_woman        1
fun_hero_man              2
fun_hero_woman            3
strong_hero_man           4
strong_hero_woman         5
mad_scientist_man         6
mad_scientist_woman       7
clever_villain_man        8
clever_villain_woman      9
narrator_man             10
narrator_woman           11
wise_elder_man           12
wise_elder_woman         13
outgoing_anime_man       14
outgoing_anime_woman     15
scary_villain_man        16
scary_villain_woman      17
news_reporter_man        18
news_reporter_woman      19

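If you prefer to address speakers by name in your own scripts, the table is small enough to mirror as a dict (this mapping is a convenience for your code, not part of the package):

# Name-to-sid lookup mirroring the table above.
SPEAKERS = {
    "skilled_hero_man": 0, "skilled_hero_woman": 1,
    "fun_hero_man": 2, "fun_hero_woman": 3,
    "strong_hero_man": 4, "strong_hero_woman": 5,
    "mad_scientist_man": 6, "mad_scientist_woman": 7,
    "clever_villain_man": 8, "clever_villain_woman": 9,
    "narrator_man": 10, "narrator_woman": 11,
    "wise_elder_man": 12, "wise_elder_woman": 13,
    "outgoing_anime_man": 14, "outgoing_anime_woman": 15,
    "scary_villain_man": 16, "scary_villain_woman": 17,
    "news_reporter_man": 18, "news_reporter_woman": 19,
}

print(SPEAKERS["narrator_woman"])  # 11
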
Emotions

Supported emotion labels:

  • neutral
  • joy
  • sadness
  • anger
  • fear
  • surprise
  • dismissive
  • confusion

intensity is clamped to the range [0.0, 1.0] and mapped to one of ten discrete intensity levels.

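A hedged sketch of that control preprocessing: the clamp is documented above, but the exact bucket boundaries below are an assumption, not the package's verified behavior:

# Validate the emotion label and quantize intensity to one of ten levels.
# The clamp matches the documented behavior; the bucketing is assumed.
EMOTIONS = {"neutral", "joy", "sadness", "anger", "fear",
            "surprise", "dismissive", "confusion"}

def quantize_intensity(intensity: float, levels: int = 10) -> int:
    intensity = max(0.0, min(1.0, intensity))        # clamp to [0.0, 1.0]
    return min(int(intensity * levels), levels - 1)  # discrete level 0..9

assert "surprise" in EMOTIONS
print(quantize_intensity(0.8))  # 8 under this assumed bucketing
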
Notes

  • model.safetensors is the main inference artifact in this repo.
  • config.json includes the token mapping needed by the processor.
  • The current release uses a multi-speaker model with 20 speakers.
  • The released weights total about 30.2M parameters, stored as F32.