Most voice AI tutorials show you a REST call to a TTS API. They don't show you why that approach costs ~1300ms of latency, what a streaming cascade actually looks like in code, or how to instrument latency so you know your p90 before shipping. In this episode I build a FastAPI server that implements the ElevenLabs bidirectional WebSocket API — streaming speech in, streaming audio out, with sub-200ms end-to-end latency. Every architectural decision is explained: chunk_length_schedule tuning, flush signals at sentence boundaries, async cascade coordination, and latency instrumentation.

─────────────────────────────────────
🔧 WHAT YOU'LL BUILD
─────────────────────────────────────
✅ FastAPI WebSocket server bridging browser audio → ElevenLabs → browser playback
✅ Streaming cascade: STT → LLM → TTS running concurrently, not sequentially
✅ chunk_length_schedule tuning: 3 configurations compared with real latency numbers
✅ Sentence boundary detection + flush: true for immediate audio at sentence ends
✅ Latency instrumentation: speech-to-first-token, first-token-to-audio, end-to-end p50/p90

─────────────────────────────────────
📌 KEY CONCEPTS COVERED
─────────────────────────────────────
• Sequential vs. cascade pipeline architecture — why sequential is broken for voice
• ElevenLabs bidirectional WebSocket API — text_input, try_trigger_generation, flush
• chunk_length_schedule — the single parameter that controls latency vs. prosody quality
• Sentence boundary detection — three-line pattern that eliminates tail gaps
• asyncio.gather for concurrent streaming — non-blocking coordination of three streams

─────────────────────────────────────
⏱️ TIMESTAMPS
─────────────────────────────────────
00:00 — The physics of voice latency
01:30 — Sequential pipeline problem: 1300ms before the user hears anything
03:30 — The streaming cascade architecture
06:00 — ElevenLabs bidirectional WebSocket API explained
09:00 — Building the streaming cascade (cascade.py)
12:00 — chunk_length_schedule: tuning latency vs. prosody
14:30 — flush: true at sentence boundaries
16:30 — Latency instrumentation (p50, p90, p99)
18:30 — The complete FastAPI server
20:30 — Key takeaways

─────────────────────────────────────
💻 CODE & RESOURCES
─────────────────────────────────────
GitHub repo → https://github.com/ogu83/voice-agent-architect-lab
Branch → ep1-realtime-tts
ElevenLabs bidirectional WebSocket docs → https://elevenlabs.io/docs/api-reference/text-to-speech/streaming
ElevenLabs Eleven Flash v2.5 → https://elevenlabs.io/docs/models
FastAPI WebSocket docs → https://fastapi.tiangolo.com/advanced/websockets/

─────────────────────────────────────
🤝 WORK WITH ME
─────────────────────────────────────
I help engineering teams build production voice AI systems — real-time TTS, telephony integration, and agentic voice orchestration.
Upwork profile → https://www.upwork.com/freelancers/oguzkoroglu

─────────────────────────────────────
#VoiceAI #ElevenLabs #FastAPI #RealtimeTTS #WebSocket #AIEngineering #SpeechSynthesis #LowLatency #PythonAI #StreamingAI #TTS #ConversationalAI #AIAgent #Agentic #BackendDevelopment
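The core cascade idea from the episode (STT, LLM, and TTS running concurrently rather than back-to-back) reduces to stages connected by queues and driven together with asyncio.gather. Here is a stripped-down sketch with stand-in transforms in place of real STT/LLM/TTS streams; the names and queue wiring are illustrative, not the repo's exact code:

```python
import asyncio

DONE = object()  # sentinel marking end-of-stream

async def stage(inbox: asyncio.Queue, outbox: asyncio.Queue, transform):
    """Generic streaming stage: each item is forwarded downstream as soon
    as it arrives, so later stages start before earlier stages finish."""
    while True:
        item = await inbox.get()
        if item is DONE:
            await outbox.put(DONE)  # propagate shutdown downstream
            return
        await outbox.put(transform(item))

async def run_cascade(transcript_chunks):
    text_q, tts_q, audio_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    for chunk in transcript_chunks:   # stand-in for a streaming STT feed
        text_q.put_nowait(chunk)
    text_q.put_nowait(DONE)

    audio = []
    async def collect():              # stand-in for browser playback
        while (item := await audio_q.get()) is not DONE:
            audio.append(item)

    # All three coroutines run concurrently; asyncio.gather coordinates
    # them without any stage blocking the others.
    await asyncio.gather(
        stage(text_q, tts_q, lambda t: t.upper()),        # stand-in "LLM"
        stage(tts_q, audio_q, lambda t: f"<audio:{t}>"),  # stand-in "TTS"
        collect(),
    )
    return audio

# asyncio.run(run_cascade(["hi ", "there"])) → ["<audio:HI >", "<audio:THERE>"]
```

The point of the structure: swapping the lambdas for real async API streams changes nothing about the coordination, which is why the sequential 1300ms pipeline and the sub-200ms cascade can share the same stage code.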
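For reference, chunk_length_schedule is set in the initial message on the ElevenLabs WebSocket connection. A rough sketch of that init payload is below; field names and defaults should be checked against the ElevenLabs streaming docs linked above, and the schedule values here are illustrative, not a recommendation:

```python
# Initial WebSocket message (sketch; verify fields against the ElevenLabs docs).
# Lower first values = faster first audio chunk, at some cost to prosody,
# which is the latency/quality trade-off discussed at 12:00.
init_message = {
    "text": " ",
    "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
    "generation_config": {
        "chunk_length_schedule": [120, 160, 250, 290],  # chars before each gen
    },
}
```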
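The sentence-boundary "three-line pattern" from 14:30 can be sketched as a small generator that groups streamed LLM tokens into TTS messages and flags where a flush belongs. This is a minimal illustration, not the repo's exact code:

```python
import re

# Sentence-final punctuation, optionally followed by a closing quote/bracket.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def chunk_llm_tokens(tokens):
    """Group streamed LLM tokens into (text, flush) pairs.

    flush=True marks a sentence boundary, where the TTS socket should be
    told to synthesize immediately instead of buffering for more text;
    this is what eliminates the tail gap at the end of each sentence.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if SENTENCE_END.search(buffer):  # the three-line pattern:
            yield buffer, True           # 1) detect boundary, 2) emit + flush,
            buffer = ""                  # 3) reset the buffer
    if buffer:
        yield buffer, False  # trailing partial sentence, no flush
```

In the cascade, each pair would become one WebSocket message along the lines of `{"text": text, "flush": True}`; check the linked ElevenLabs reference for the exact field names.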
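The p50/p90/p99 reporting from 16:30 needs nothing more than sorted samples and index math. A small nearest-rank helper along these lines (the function name and sample numbers are mine, for illustration):

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample >= p percent of all samples."""
    if not samples_ms:
        raise ValueError("no samples recorded")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Record speech-to-first-audio per turn, then report:
turns = [180.0, 150.0, 210.0, 190.0, 160.0, 900.0, 170.0, 175.0, 185.0, 165.0]
p50, p90 = percentile(turns, 50), percentile(turns, 90)
# → p50 = 175.0, p90 = 210.0; the 900 ms outlier only surfaces at p99
```

This is why the episode reports p90 rather than the mean: one slow turn barely moves p50/p90 but would drag the average well above the typical experience.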