Two questions

by mrfakename - opened 13 days ago

13 days ago

Hi,
Thanks for releasing MiraTTS! I had two questions about the model: 1) what dataset was the model trained on, approximately how large is this dataset, and are there any plans to release it? 2) what is the license of this model?
Thanks!

YatharthS

Owner 13 days ago

Thanks for checking it out!

It was finetuned on a mix of open-source, non-synthetic datasets (roughly 200 hours) with high emotion but poor audio quality. The FlashSR model is used at inference time to enhance the audio quality considerably. I might release the dataset, but it's not too high on the priority list.
Unfortunately, since it's a finetune of Spark-TTS, it also inherits the CC BY-NC-SA 4.0 license.

mrfakename

13 days ago

Oh nice! Mind if I ask where the dataset came from?

mrfakename

13 days ago

Also, if the audio quality was poor, wouldn't it make more sense to enhance the audio in the dataset itself, rather than at inference time?

YatharthS

Owner 13 days ago

A mix of several open-source ones, filtered for emotion using an arousal, dominance, and valence model and then transcribed again. There were a few flaws, but it still worked decently. I used YODAS2, instructTTS, and Common Voice. Common Voice was filtered a bit more and denoised as well.
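For readers curious what filtering on arousal, dominance, and valence scores can look like, here is a minimal sketch. It is not the actual pipeline used for MiraTTS: the `Clip` class, the score range of 0 to 1, the thresholds, and the keep rule are all assumptions made up for illustration, and the scores are hard-coded instead of coming from a real model.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical record: one audio clip plus scores from an emotion model."""
    path: str
    arousal: float    # assumed to lie in [0, 1]
    dominance: float  # carried along, but unused by the toy rule below
    valence: float    # assumed to lie in [0, 1], with 0.5 meaning neutral


def is_expressive(clip, min_arousal=0.6, min_valence_offset=0.2):
    """Toy rule: keep clips that are energetic or clearly non-neutral."""
    energetic = clip.arousal >= min_arousal
    non_neutral = abs(clip.valence - 0.5) >= min_valence_offset
    return energetic or non_neutral


def filter_expressive(clips, **thresholds):
    """Return only the clips that pass the toy expressiveness rule."""
    return [c for c in clips if is_expressive(c, **thresholds)]


# Made-up scores standing in for the output of a real emotion model.
clips = [
    Clip("a.wav", arousal=0.82, dominance=0.55, valence=0.40),
    Clip("b.wav", arousal=0.30, dominance=0.50, valence=0.52),
    Clip("c.wav", arousal=0.45, dominance=0.60, valence=0.85),
]
kept = filter_expressive(clips)
print([c.path for c in kept])  # ['a.wav', 'c.wav']
```

In a real setup the thresholds would need tuning against listening tests, since a stricter rule trades dataset size for expressiveness.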

MiraTTS predicts raw 16 kHz audio and then uses FlashSR to upsample it to 48 kHz. Enhancing the training audio was "useless" since it would get resampled to 16 kHz anyway. Since this project is gaining popularity, I am training a native 48 kHz BiCodec, similar to LayaCodec and the 44.1 kHz XCodec2 model, by modifying the decoder without training anything else.
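The bandwidth side of that argument can be checked with a bit of arithmetic. The sketch below only uses the two sample rates mentioned above and the Nyquist limit (half the sample rate); the function and variable names are illustrative and do not come from the MiraTTS or FlashSR code.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


MODEL_SR = 16_000   # rate the model predicts at, per the post above
OUTPUT_SR = 48_000  # rate after super-resolution, per the post above

upsample_factor = OUTPUT_SR // MODEL_SR
model_band = nyquist_hz(MODEL_SR)
output_band = nyquist_hz(OUTPUT_SR)

print(f"upsample factor: {upsample_factor}x")
print(f"model output band: 0-{model_band:.0f} Hz")
print(f"band left for super-resolution: {model_band:.0f}-{output_band:.0f} Hz")
```

Anything above 8 kHz in the training audio is discarded when it is resampled to 16 kHz, so enhancement that mainly restores high frequencies would not reach the model. This sketch says nothing about noise below 8 kHz.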

mrfakename

13 days ago

Cool, thanks!

mrfakename changed discussion status to closed 13 days ago

mrfakename

4 days ago

By the way, I took a look at LayaCodec and was wondering what the difference between LayaCodec and FocalCodec is. They look pretty much identical :/

YatharthS

Owner 3 days ago (edited 3 days ago)

@mrfakename I changed the decoder to use upsampler blocks similar to HiFi-GAN's, with a few modifications. Because of this, it's far "easier" to generate 44.1 kHz audio while keeping Vocos's speed.

I am working on a new 44.1 kHz, 12.5 Hz model based on another audio tokenizer, which I modified by adding a much higher-quality upsampler block using modern techniques and far more training data. It should come out sometime this week.
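To put those two numbers in perspective, here is a small worked calculation. It assumes that "12.5 Hz" refers to the token (frame) rate of the codec and that each frame maps to a fixed number of output samples; both are assumptions, and the helper names are hypothetical.

```python
from fractions import Fraction


def samples_per_frame(sample_rate_hz, frame_rate_hz):
    """Output samples the decoder must produce per codec frame."""
    return Fraction(sample_rate_hz) / Fraction(frame_rate_hz)


def frames_for(seconds, frame_rate_hz):
    """Number of codec frames needed to cover a clip of this length."""
    return Fraction(seconds) * Fraction(frame_rate_hz)


# Assumption: 12.5 Hz is the frame rate and 44.1 kHz the output sample rate.
hop = samples_per_frame(44_100, "12.5")
frames = frames_for(10, "12.5")

print(f"samples per frame: {hop}")       # 3528
print(f"frames for 10 s audio: {frames}")  # 125
```

Under these assumptions, each frame has to be expanded into 3528 samples, which gives a sense of how much work the upsampling part of the decoder does.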
