Two questions

#3
by mrfakename - opened

Hi,
Thanks for releasing MiraTTS! I had two questions about the model: 1) what dataset was the model trained on, approximately how large is this dataset, and are there any plans to release it? 2) what is the license of this model?
Thanks!

Thanks for checking it out!

  1. It was finetuned on a mix of open-source, non-synthetic datasets (roughly 200 hours) with high emotion but poor audio quality. The FlashSR model is used at inference time to enhance the audio quality considerably. I might release the dataset, but it's not too high on the priority list.
  2. Unfortunately, since it's a finetune of Spark-TTS, it also inherits the CC BY-NC-SA 4.0 license.

Oh nice! Mind if I ask where the dataset came from?

Also, if the audio quality was poor, wouldn't it make more sense to enhance the audio in the dataset itself, rather than at inference time?

Mix of several open-source ones, filtered for emotion using an arousal, dominance, and valence model and then transcribed again. There were a few flaws, but it still worked decently. I used YODAS2, instructTTS, and Common Voice. Common Voice was filtered a bit more and denoised as well.
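For readers wondering what an arousal/dominance/valence filter can look like in practice, here is a minimal sketch. It assumes each clip already has scores in [0, 1] from some external emotion model; the `Clip` class, the example scores, and the thresholds are all hypothetical and are not the values used for MiraTTS.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    path: str
    arousal: float    # 0 = calm, 1 = highly activated (assumed scale)
    dominance: float  # carried along, but unused by this toy rule
    valence: float    # 0 = negative, 0.5 = neutral, 1 = positive (assumed scale)


def is_expressive(clip, min_arousal=0.6, valence_margin=0.2):
    """Keep clips that are energetic or clearly non-neutral in valence.

    Both thresholds are illustrative placeholders.
    """
    energetic = clip.arousal >= min_arousal
    non_neutral = abs(clip.valence - 0.5) >= valence_margin
    return energetic or non_neutral


# Made-up scores for three clips.
clips = [
    Clip("a.wav", arousal=0.82, dominance=0.55, valence=0.71),
    Clip("b.wav", arousal=0.31, dominance=0.48, valence=0.52),
    Clip("c.wav", arousal=0.45, dominance=0.60, valence=0.15),
]

kept = [c.path for c in clips if is_expressive(c)]
print(kept)  # ['a.wav', 'c.wav']
```

The flat, neutral clip (`b.wav`) is dropped, while the energetic one and the strongly negative one are kept. A real pipeline would tune the thresholds against listening tests.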

MiraTTS predicts raw 16 kHz audio and then uses FlashSR to upsample it to 48 kHz. Enhancing the training audio was "useless" since it would get resampled to 16 kHz anyway. Since this project is gaining popularity, I am training a native 48 kHz BiCodec, similar to LayaCodec and the 44.1 kHz XCodec2 model, by modifying the decoder without training anything else.
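The band-limiting part of that argument can be checked with a few lines of standard-library Python. This is only an illustration of the Nyquist limit, not MiraTTS or FlashSR code: at 16 kHz nothing above 8 kHz can be represented, so a 10 kHz tone produces exactly the same samples as a 6 kHz tone. A proper resampler low-pass filters first, so that content is removed rather than aliased, but either way it never reaches the model.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def sample_tone(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude cosine at the given rate."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


SR = 16_000
tone_10k = sample_tone(10_000, SR, 64)  # above the 8 kHz Nyquist limit
tone_6k = sample_tone(6_000, SR, 64)    # its alias: 16 kHz - 10 kHz

# Largest sample-wise difference between the two sequences.
max_diff = max(abs(a - b) for a, b in zip(tone_10k, tone_6k))

print(nyquist_hz(SR), nyquist_hz(48_000))  # 8000.0 24000.0
print(max_diff < 1e-9)                     # True
```

Going from 16 kHz to 48 kHz triples the representable bandwidth (8 kHz to 24 kHz), and that extra band is what a super-resolution model has to synthesize.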

Cool, thanks!

mrfakename changed discussion status to closed

By the way, I took a look at LayaCodec and was wondering what the difference between LayaCodec and FocalCodec is. They look pretty much identical :/

@mrfakename I changed the decoder to use upsampler blocks similar to HiFi-GAN's, with a few modifications. Because of this, it's far "easier" to generate 44.1 kHz audio while keeping Vocos's speed.

I am working on a new 44.1 kHz, 12.5 Hz model based on another audio tokenizer. I modified it by adding a much higher-quality upsampler block that uses modern techniques, and I'm using far more training data. Should come sometime this week.
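As a rough sense of scale, and assuming "12.5 Hz" refers to the token (frame) rate of the tokenizer, the decoder has to turn each token into 3528 output samples at 44.1 kHz. In a HiFi-GAN-style decoder the per-block upsampling strides multiply to that number. The stride list below is a hypothetical factorization for illustration, not the configuration of the actual model.

```python
from math import prod


def samples_per_token(sample_rate_hz, token_rate_hz):
    """Number of audio samples the decoder must emit per token."""
    ratio = sample_rate_hz / token_rate_hz
    if ratio != int(ratio):
        raise ValueError("sample rate is not an integer multiple of token rate")
    return int(ratio)


hop = samples_per_token(44_100, 12.5)

# Hypothetical per-block upsampling strides; their product must equal hop.
strides = [7, 7, 6, 4, 3]

print(hop)                   # 3528
print(prod(strides) == hop)  # True
```

Since 3528 = 2^3 * 3^2 * 7^2, any valid stride list has to be built from those factors, which is one reason 44.1 kHz decoders tend to use less common strides than 48 kHz ones.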
