arxiv:2603.05413

Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Published on Mar 17

Abstract

A tutorial presents a cascaded streaming pipeline approach for building self-hosted real-time voice agents using separate components for speech recognition, language modeling, and text-to-speech synthesis.

AI-generated summary

We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While end-to-end speech-to-speech models may ultimately provide the best latency for voice agents, fully self-hosted end-to-end solutions are not yet available. We evaluate the closest candidate, Qwen3-Omni, across three configurations: its cloud-only DashScope Realtime API achieves ~702 ms audio-to-audio latency with streaming, but is not self-hostable; its local vLLM deployment supports only the Thinker (text generation from audio, 516 ms), not the Talker (audio synthesis); and its local Transformers deployment runs the full pipeline but at ~146 s, far too slow for realtime. The cascaded streaming pipeline (STT → LLM → TTS) therefore remains the practical architecture for self-hosted realtime voice agents, and the focus of this tutorial. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured time-to-first-audio of 755 ms (best case 729 ms) with full function calling support. We release the full codebase as a 9-chapter progressive tutorial with working, tested code for every component.
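To make the cascaded shape concrete, the following is a minimal sketch of an STT → LLM → TTS turn built from chained async generators, with a time-to-first-audio measurement. It is not the tutorial's code. Every stage (`fake_stt`, `fake_llm`, `fake_tts`, `mic`) is a hypothetical stub that calls no real service, and the sleep durations are arbitrary. The sketch also assumes that time-to-first-audio is measured from the end of transcription to the first synthesized chunk, which may differ from the paper's definition.

```python
import asyncio
import time
from typing import AsyncIterator, List, Tuple

# Hypothetical stubs: no Deepgram, vLLM, or ElevenLabs call is made here.

async def mic(n_chunks: int = 3) -> AsyncIterator[bytes]:
    """Pretend microphone: yields a few fixed-size silent audio frames."""
    for _ in range(n_chunks):
        await asyncio.sleep(0.005)
        yield b"\x00" * 320

async def fake_stt(audio: AsyncIterator[bytes]) -> str:
    """Consume streamed audio and return a final transcript (stub)."""
    n = 0
    async for _ in audio:
        n += 1
    return f"transcript built from {n} audio chunks"

async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Yield response tokens one at a time, like a streaming LLM (stub)."""
    for token in "Sure , I can help . Let me check that .".split():
        await asyncio.sleep(0.01)  # simulated per-token latency
        yield token

async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into sentences so synthesis can begin on the first one."""
    buf: List[str] = []
    async for tok in tokens:
        buf.append(tok)
        if tok in {".", "!", "?"}:
            yield " ".join(buf)
            buf = []
    if buf:  # flush a trailing partial sentence
        yield " ".join(buf)

async def fake_tts(text: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Turn each sentence into one 'audio' chunk (stub: just UTF-8 bytes)."""
    async for sentence in text:
        await asyncio.sleep(0.01)  # simulated synthesis latency
        yield sentence.encode("utf-8")

async def run_turn() -> Tuple[List[bytes], float, float]:
    """Run one conversational turn and time the first audio chunk."""
    transcript = await fake_stt(mic())
    start = time.perf_counter()  # assumption: clock starts once STT is final
    first_audio = None
    audio: List[bytes] = []
    async for chunk in fake_tts(sentences(fake_llm(transcript))):
        if first_audio is None:
            first_audio = time.perf_counter() - start
        audio.append(chunk)
    total = time.perf_counter() - start
    return audio, first_audio, total

def run_demo() -> Tuple[List[bytes], float, float]:
    return asyncio.run(run_turn())

if __name__ == "__main__":
    chunks, ttfa, total = run_demo()
    print(f"{len(chunks)} audio chunks, first after {ttfa * 1000:.0f} ms, "
          f"all after {total * 1000:.0f} ms")
```

Because the first sentence is synthesized before the remaining tokens are generated, the first audio chunk arrives well before the full response is complete. Chained generators are pull-based, so the stages in this sketch do not actually run concurrently; a production pipeline would more likely connect them with queues and separate tasks.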
