Try Orpheus TTS here
Transcribe uploaded audio to text with language detection
Generate speech from text in a choice of voices