The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail
Abstract
A self-contained TTS↔STT flywheel significantly improves niche-domain Indic automatic speech recognition through synthetic entity-dense data generation and low-resource LoRA fine-tuning.
Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic code-mix -- is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS↔STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test (17× over open SOTA, 3× over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language: β-Hi reaches 0.337 (7× vs vasista22) and β-Ta 0.543 (22× vs vasista22, 22× vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms commercial. All three β models fall below their pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta); we report these shortfalls transparently. A native-human-recorded sanity check (n=20 Telugu) confirms transfer to real speech (β-Te EHR 0.516 on native vs 0.473 on synth). An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated on Hindi and Tamil, where vanilla SFR >= 0.98. Code, holdouts, predictions, the EDSA corpus, and entity dictionaries are released open-source.
Community
We benchmark open-source SOTA (vasista22/whisper-{te,ta,hi}-large-v2) and commercial Deepgram Nova-3 on a synthesised entity-dense Telugu test set — content that real Indian users actually speak: digit strings, currency amounts, addresses, brand names, English/Indic code-mix. Open-source SOTA gets EHR 0.027. Commercial Deepgram gets 0.16. Both are an order of magnitude below their own read-prose performance.
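For readers reproducing the numbers: Entity-Hit-Rate can be read as the fraction of annotated reference entities that appear verbatim in the hypothesis transcript. The paper's exact matching rules (normalization, digit handling) are not restated here; the following is a minimal stdlib sketch under that assumed definition:

```python
def entity_hit_rate(examples):
    """Fraction of reference entities found verbatim in the hypothesis.

    examples: iterable of (entities, hypothesis) pairs, where `entities`
    is the list of annotated entity strings for one utterance and
    `hypothesis` is the ASR output for that utterance.
    """
    hits = total = 0
    for entities, hypothesis in examples:
        hyp = hypothesis.lower()
        for entity in entities:
            total += 1
            if entity.lower() in hyp:
                hits += 1
    return hits / total if total else 0.0

# Toy example: 2 of 3 entities survive transcription.
score = entity_hit_rate([
    (["₹450", "Flipkart"], "order from flipkart for ₹450 arrived"),
    (["MG Road"], "the parcel went to mg rode"),  # entity garbled by ASR
])
```

A stricter implementation would normalize digits and whitespace before matching; substring containment is the simplest defensible baseline.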
We close the gap with a self-contained TTS↔STT flywheel: an open-source Indic TTS pipeline (Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia) synthesises ~22k entity-dense Indic-English utterances at <$50 marginal cost, and a LoRA on top of vasista22 reaches EHR 0.473 on Telugu (17× over open SOTA, 3× over commercial), 0.337 on Hindi, 0.543 on Tamil. Two of three languages beat commercial Deepgram. Native human-recorded sanity check confirms transfer: 0.516 EHR on real Telugu speech vs 0.473 on synth.
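The LoRA recipe itself is standard: freeze the pretrained Whisper weights and train only a low-rank update ΔW = B·A (rank r ≪ d), scaled by α/r. A toy numeric sketch of the forward pass in plain Python (the real adapters target the attention projections of vasista22, which is not shown here):

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0, r=1):
    """Frozen weight W plus trainable low-rank update scaled by alpha/r."""
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # rank-r trainable path: B @ (A @ x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy dimensions: 2x2 frozen W, rank-1 update via A (1x2) and B (2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]   # down-projection to rank 1
B = [[1.0], [0.0]] # up-projection back to dim 2
y = lora_forward(W, A, B, x=[2.0, 3.0], alpha=1.0, r=1)
```

Because only A and B are trained, the update touches a tiny fraction of the base model's parameters, which is what keeps the read-prose regression bounded.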
Honest reporting throughout: all three β models miss the pre-registered EHR target (0.75 Te, 0.65 Hi/Ta), Hindi underperforms commercial (Deepgram has invested there), and the secondary "fix Whisper-v3 Telugu Script Collapse via per-language LoRA" recipe is contraindicated on Hindi/Tamil where vanilla SFR ≥ 0.98. An EDSA-isolation ablation (LoRA on FLEURS-Te alone → EHR 0.020) attributes ~100% of the gain to the entity-dense corpus, not the LoRA process.
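"Script Collapse" means the model emitting Latin (or another script) instead of the target Indic script. The SFR figures above can be approximated by a simple character-level proxy — the fraction of alphabetic output characters falling in the expected Unicode block. This proxy is an assumption for illustration; the paper's SFR comes from its companion script-collapse benchmark:

```python
# Unicode blocks for the three target scripts (assumed proxy metric).
SCRIPT_BLOCKS = {
    "telugu": (0x0C00, 0x0C7F),
    "devanagari": (0x0900, 0x097F),  # Hindi
    "tamil": (0x0B80, 0x0BFF),
}

def script_fidelity_rate(text, script):
    """Fraction of alphabetic characters lying in the expected block."""
    lo, hi = SCRIPT_BLOCKS[script]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in letters) / len(letters)

native = script_fidelity_rate("తెలుగు", "telugu")               # in-script
collapsed = script_fidelity_rate("telugu vachindi", "telugu")  # romanized
```

Under this proxy, a collapsed (romanized) transcript scores near 0 even when the romanization is phonetically faithful, which is why SFR is reported separately from WER.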
Code, holdouts, predictions, EDSA corpus, and entity dictionaries open-source. Six LoRA adapters released on HF (te/hi/ta × {rb on vasista22, r2 on Whisper-v3}). Companion to arXiv:2604.25441 (Praxy Voice TTS), arXiv:2604.25476 (PSP), arXiv:2605.00777 (LASE).