The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail
Abstract
A self-contained TTS↔STT flywheel significantly improves niche-domain Indic automatic speech recognition through synthetic entity-dense data generation and low-resource LoRA fine-tuning.
Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic code-mix -- is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS↔STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test (17× over open SOTA, 3× over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language: β-Hi reaches 0.337 (7× vs vasista22) and β-Ta 0.543 (22× vs vasista22, 22× vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms commercial. All three β models fall below their pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta); we report these shortfalls transparently. A native-human-recorded sanity check (n=20 Telugu) confirms transfer to real speech (β-Te EHR 0.516 on native vs 0.473 on synth). An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated on Hindi and Tamil, where vanilla SFR >= 0.98. Code, holdouts, predictions, the EDSA corpus, and entity dictionaries are released open-source.
Community
We benchmark open-source SOTA (vasista22/whisper-{te,ta,hi}-large-v2) and commercial Deepgram Nova-3 on a synthesised entity-dense Telugu test set — content that real Indian users actually speak: digit strings, currency amounts, addresses, brand names, English/Indic code-mix. Open-source SOTA gets EHR 0.027. Commercial Deepgram gets 0.16. Both are an order of magnitude below their own read-prose performance.
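For readers reproducing the numbers: Entity-Hit-Rate can be read as the fraction of annotated reference entities that appear verbatim in the hypothesis transcript. The paper's exact matching rules (normalization, digit handling) are not restated here; the following is a minimal stdlib sketch under that assumed definition:

```python
def entity_hit_rate(examples):
    """Fraction of reference entities found verbatim in the hypothesis.

    examples: iterable of (entities, hypothesis) pairs, where `entities`
    is the list of annotated entity strings for one utterance and
    `hypothesis` is the ASR output for that utterance.
    """
    hits = total = 0
    for entities, hypothesis in examples:
        hyp = hypothesis.lower()
        for entity in entities:
            total += 1
            if entity.lower() in hyp:
                hits += 1
    return hits / total if total else 0.0

# Toy example: 2 of 3 entities survive transcription.
score = entity_hit_rate([
    (["₹450", "Flipkart"], "order from flipkart for ₹450 arrived"),
    (["MG Road"], "the parcel went to mg rode"),  # entity garbled by ASR
])
```

A stricter implementation would normalize digits and whitespace before matching; substring containment is the simplest defensible baseline.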
We close the gap with a self-contained TTS↔STT flywheel: an open-source Indic TTS pipeline (Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia) synthesises ~22k entity-dense Indic-English utterances at <$50 marginal cost, and a LoRA on top of vasista22 reaches EHR 0.473 on Telugu (17× over open SOTA, 3× over commercial), 0.337 on Hindi, 0.543 on Tamil. Two of three languages beat commercial Deepgram. Native human-recorded sanity check confirms transfer: 0.516 EHR on real Telugu speech vs 0.473 on synth.
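The LoRA recipe itself is standard: freeze the pretrained Whisper weights and train only a low-rank update ΔW = B·A (rank r ≪ d), scaled by α/r. A toy numeric sketch of the forward pass in plain Python (the real adapters target the attention projections of vasista22, which is not shown here):

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0, r=1):
    """Frozen weight W plus trainable low-rank update scaled by alpha/r."""
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # rank-r trainable path: B @ (A @ x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy dimensions: 2x2 frozen W, rank-1 update via A (1x2) and B (2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]   # down-projection to rank 1
B = [[1.0], [0.0]] # up-projection back to dim 2
y = lora_forward(W, A, B, x=[2.0, 3.0], alpha=1.0, r=1)
```

Because only A and B are trained, the update touches a tiny fraction of the base model's parameters, which is what keeps the read-prose regression bounded.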
Honest reporting throughout: all three β models miss the pre-registered EHR target (0.75 Te, 0.65 Hi/Ta), Hindi underperforms commercial (Deepgram has invested there), and the secondary "fix Whisper-v3 Telugu Script Collapse via per-language LoRA" recipe is contraindicated on Hindi/Tamil where vanilla SFR ≥ 0.98. An EDSA-isolation ablation (LoRA on FLEURS-Te alone → EHR 0.020) attributes ~100% of the gain to the entity-dense corpus, not the LoRA process.
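"Script Collapse" means the model emitting Latin (or another script) instead of the target Indic script. The SFR figures above can be approximated by a simple character-level proxy — the fraction of alphabetic output characters falling in the expected Unicode block. This proxy is an assumption for illustration; the paper's SFR comes from its companion script-collapse benchmark:

```python
# Unicode blocks for the three target scripts (assumed proxy metric).
SCRIPT_BLOCKS = {
    "telugu": (0x0C00, 0x0C7F),
    "devanagari": (0x0900, 0x097F),  # Hindi
    "tamil": (0x0B80, 0x0BFF),
}

def script_fidelity_rate(text, script):
    """Fraction of alphabetic characters lying in the expected block."""
    lo, hi = SCRIPT_BLOCKS[script]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in letters) / len(letters)

native = script_fidelity_rate("తెలుగు", "telugu")               # in-script
collapsed = script_fidelity_rate("telugu vachindi", "telugu")  # romanized
```

Under this proxy, a collapsed (romanized) transcript scores near 0 even when the romanization is phonetically faithful, which is why SFR is reported separately from WER.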
Code, holdouts, predictions, EDSA corpus, and entity dictionaries open-source. Six LoRA adapters released on HF (te/hi/ta × {rb on vasista22, r2 on Whisper-v3}). Companion to arXiv:2604.25441 (Praxy Voice TTS), arXiv:2604.25476 (PSP), arXiv:2605.00777 (LASE).