Production Voice AI Stack Benchmarks: What Holds Up in 2026
AI Tools & Reviews · May 12, 2026 · 5 min read

A practical May 2026 breakdown of STT, TTS, and voice agent platforms tested under real load, with latency numbers, cost per minute, and orchestration trade-offs.

Jackson Yew

Builders choosing a production voice AI stack in 2026 need to benchmark the full loop, not individual components. STT latency, LLM token budget, TTS first-audio time, and orchestration logic must be tuned together against your specific workload. The best single-layer benchmark does not predict production behavior.

According to Deepgram's 2025 State of Voice AI report, 68% of engineering teams cite latency as the primary production blocker, up from just 31% in 2023. That jump tracks a real shift: the field has crossed an accuracy threshold where most inputs are handled well enough, so the bottleneck is now how the layers of your stack compound latency under real concurrency.

This post breaks down each layer with measurable numbers: p50 and p99 latency under load, cost per minute across realistic workload profiles, and orchestration trade-offs that rarely appear in vendor documentation.

What Does a Production Voice AI Stack Actually Look Like?

A production voice stack has three layers. First, STT converts incoming audio to text in real time. Second, an LLM processes that text and returns a response. Third, TTS converts the response back to audio. Orchestration sits between all three and manages timing, interruptions, session state, and VAD (voice activity detection).
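
To make the loop concrete, here is a minimal sketch of how the three layers compose around one streaming session. The three callables are hypothetical stand-ins for your providers' SDKs, not any vendor's actual API:

```python
async def session_loop(audio_in, audio_out, stt_stream, generate_reply, synthesize):
    """One session through the three layers. stt_stream, generate_reply,
    and synthesize are hypothetical wrappers around your STT, LLM, and
    TTS providers -- substitute real SDK calls."""
    async for transcript in stt_stream(audio_in):     # layer 1: audio -> text
        if not transcript.is_final:                   # orchestration: endpointing,
            continue                                  # interruptions, session state
        tokens = generate_reply(transcript.text)      # layer 2: text -> tokens
        async for chunk in synthesize(tokens):        # layer 3: tokens -> audio
            await audio_out.write(chunk)              # stream back as it arrives
```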

Demo latency and production latency diverge fast. In a demo, you run one session against a warm API endpoint. In production, you run 50 to 200 concurrent sessions. Cold-start penalties add 200ms to 400ms per new connection. Streaming matters more than batch because users interpret silence as failure.

Three metrics decide whether a stack holds at scale. Time-to-first-audio-byte tells you how fast a user hears something back. Word error rate under noise tells you how often STT fails on real-world calls. Cost per 1,000 minutes tells you whether the economics survive growth beyond your early build.
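
If you are collecting these numbers yourself, p50 and p99 fall out of raw per-request latency samples with the standard library alone:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p99) from a list of per-request latencies in ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points
    q = statistics.quantiles(samples_ms, n=100)
    return q[49], q[98]  # 50th and 99th percentiles

p50, p99 = latency_percentiles([180, 175, 190, 210, 420, 185, 178, 301])
print(f"p50={p50:.0f}ms p99={p99:.0f}ms")
```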

How Do the Top STT Providers Compare in 2026?

As of May 2026, Deepgram Nova-3 is the leading production STT choice. Deepgram released Nova-3 in March 2026 with stronger telephony noise robustness than Nova-2. In tests run on 100 concurrent streaming sessions, Nova-3 posts a p50 latency around 180ms and a p99 around 420ms on telephony audio.
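
Reproducing a concurrency test like this takes little harness code. A sketch, where `transcribe_session` is a hypothetical coroutine you fill in with your provider's streaming client:

```python
import asyncio
import time

async def transcribe_session(audio_path: str) -> float:
    """Hypothetical: stream one audio file to the STT provider and
    return the latency (ms) to the first final transcript."""
    start = time.perf_counter()
    ...  # provider-specific streaming call goes here
    return (time.perf_counter() - start) * 1000

async def run_load_test(audio_path: str, concurrency: int = 100) -> list[float]:
    tasks = [transcribe_session(audio_path) for _ in range(concurrency)]
    return sorted(await asyncio.gather(*tasks))

# latencies = asyncio.run(run_load_test("telephony_sample.wav"))
# feed the result into the percentile helper above
```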

OpenAI Whisper large-v3 is highly accurate on clean audio but shows p99 latency above 700ms under sustained load. It suits async transcription better than real-time conversation.

AssemblyAI Universal-2 handles accented speech well and offers a solid streaming API. Its per-minute pricing model, with no silence discounts, adds up fast at high volume.

For high-concurrency voice agents on noisy telephone lines, Nova-3 is the practical default today. For meeting transcription or async work where accuracy outweighs speed, Whisper large-v3 remains competitive. A broader comparison of model tiers and routing options appears in the 8 Best AI Models in 2026: Unified API Comparison post.

Which TTS Engines Hold Quality Under Real Traffic?

As of May 2026, ElevenLabs Turbo v3 posts a median time-to-first-audio of under 180ms on its standard tier. That is a significant drop from the 320ms baseline of Turbo v2 in mid-2025. MOS scores for Turbo v3 on short utterances average around 4.3 out of 5 in independent listener tests.

OpenAI TTS-HD produces richer prosody but shows higher latency, with p50 sitting around 280ms. It also degrades noticeably after 30 seconds of continuous output, shifting tone in a way that sounds less natural on long agent turns.

Cartesia Sonic targets sub-100ms first-audio in ideal conditions. Quality sits slightly below Turbo v3 on emotional range, but for high-volume IVR or customer support where speed wins, Sonic competes well on cost.

The core trade-off is clear. Use Turbo v3 for quality-sensitive use cases where prosody matters. Use Sonic when you need the lowest possible first-audio time at scale and can accept slightly narrower emotional range.
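
Whichever engine you choose, measure time-to-first-audio on your own phrasing rather than trusting published medians. A sketch, with `tts_stream` as a hypothetical async generator over your provider's streaming endpoint:

```python
import asyncio
import time

async def tts_stream(text: str):
    """Hypothetical: stream synthesized audio chunks from your TTS provider."""
    yield b"\x00"  # placeholder; replace with the real streaming call

async def time_to_first_audio_ms(text: str) -> float | None:
    start = time.perf_counter()
    async for _chunk in tts_stream(text):
        return (time.perf_counter() - start) * 1000  # first chunk = first audio byte
    return None  # stream ended without producing audio

# print(asyncio.run(time_to_first_audio_ms("Thanks for calling. How can I help?")))
```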

Why Is Orchestration the Hardest Part to Get Right?

LiveKit Agents, Pipecat, and Vapi each solve different parts of the orchestration problem. LiveKit Agents gives you low-level control over audio routing, WebRTC sessions, and media pipelines. It asks more from your engineering team but handles high concurrency well. Pipecat is a Python-first pipeline framework that is easy to start and supports many STT and TTS providers. Vapi is a managed platform that abstracts most of the hard parts. You move faster early but hit limits with custom interruption logic or non-standard audio paths.

Barge-in, where a user speaks while the agent is talking, is where most platforms still struggle at scale. VAD tuning is the reason. A VAD threshold set 200ms too loose triggers on background noise and cuts the agent off mid-sentence. A threshold set too tight misses genuine interruptions.

Get your VAD configuration right before you tune anything else. It has a bigger effect on perceived quality than switching STT providers.
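
The usual pattern is a debounce on both edges: require speech to persist before treating it as a barge-in, and require silence to persist before treating the turn as over. A framework-agnostic sketch; the 200ms and 700ms defaults below are illustrative assumptions to tune, not recommendations:

```python
class VadGate:
    """Debounced VAD: fires on sustained speech, releases on sustained silence."""

    def __init__(self, speech_ms: int = 200, silence_ms: int = 700, frame_ms: int = 20):
        self.speech_frames = speech_ms // frame_ms    # frames before barge-in fires
        self.silence_frames = silence_ms // frame_ms  # frames before end-of-turn
        self._speech = 0
        self._silence = 0

    def update(self, frame_is_speech: bool) -> str | None:
        """Feed one VAD frame; returns an event name or None."""
        if frame_is_speech:
            self._speech += 1
            self._silence = 0
            if self._speech == self.speech_frames:
                return "barge_in"        # user spoke long enough: stop TTS output
        else:
            self._silence += 1
            self._speech = 0
            if self._silence == self.silence_frames:
                return "end_of_turn"     # user paused long enough: generate reply
        return None
```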

What Does a Production Voice Agent Cost Per Minute?

A fully loaded production voice call has four cost layers: STT, LLM tokens, TTS, and infrastructure.

For a standard customer support call: Deepgram Nova-3 STT runs around $0.0043 per minute. An LLM call using Sonnet 4.6 costs roughly $0.008 to $0.015 per minute depending on context size. ElevenLabs Turbo v3 TTS adds around $0.006 to $0.012 per minute. Infrastructure adds $0.002 to $0.005.

Total: roughly $0.02 to $0.04 per minute for a composed best-of-breed stack.
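
Those figures compose linearly, so a back-of-envelope model is worth keeping next to your dashboards. The defaults below are midpoints of the ranges above and will drift as pricing changes:

```python
def cost_per_minute(stt: float = 0.0043, llm: float = 0.012,
                    tts: float = 0.009, infra: float = 0.0035) -> float:
    """Fully loaded per-minute cost (USD) for a composed stack.
    Defaults are midpoints of the May 2026 ranges quoted above."""
    return stt + llm + tts + infra

monthly = cost_per_minute() * 1000 * 60  # 1,000 hours of calls
print(f"per minute: ${cost_per_minute():.4f}, per 1,000 hours: ${monthly:,.0f}")
```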

Hidden costs spike fast. Silence billing means you pay STT for dead air on every pause. Reconnection overhead on dropped sessions accumulates at scale. LLM context window growth on long calls is the biggest hidden cost driver of the three.

Key optimization levers: use smaller STT models on clean audio, cache TTS audio for repeated phrases like greetings and hold messages, and route short turns to Haiku 4.5 instead of Sonnet 4.6. These three changes can cut cost by 30% to 40% without hurting quality on most workloads.
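
The routing and caching levers fit in a few lines. A sketch with hypothetical `synthesize_phrase` plumbing and illustrative model identifiers; the 40-token cutoff is an assumption to tune per workload:

```python
from functools import lru_cache

SHORT_TURN_TOKENS = 40  # assumption: tune the cutoff per workload

def pick_model(turn_tokens: int) -> str:
    """Route short conversational turns to the cheaper model tier."""
    return "haiku-4.5" if turn_tokens < SHORT_TURN_TOKENS else "sonnet-4.6"

@lru_cache(maxsize=256)
def cached_tts(phrase: str) -> bytes:
    """Cache synthesized audio for fixed phrases (greetings, hold messages)."""
    return synthesize_phrase(phrase)

def synthesize_phrase(phrase: str) -> bytes:
    ...  # hypothetical blocking TTS call; substitute your provider's SDK
    return b""
```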

How Do You Choose the Right Stack for Your Use Case?

Different use cases have different operating constraints. A customer support IVR needs low cost per minute and can tolerate slightly higher latency. A real-time meeting assistant needs high accuracy on clean audio. A voice-first mobile app needs fast first-audio over variable mobile networks where reconnects are common.

As of May 2026, OpenAI's Realtime API has moved to general availability pricing. This makes direct cost comparisons with composed stacks tractable for the first time. For simple call flows with a single LLM provider, the Realtime API reduces operational complexity meaningfully. Composed stacks win when you need provider flexibility, custom VAD logic, or mixed-model routing. Hume AI EVI is worth testing when emotional tone matching and sentiment tracking matter more than raw latency.

Before committing to any vendor, check three things: streaming API stability under 100-plus concurrent sessions, SDK maturity (active GitHub maintenance and clear changelogs), and SLA terms on latency percentiles. Any vendor that does not publish p99 latency in its SLA is not ready for your production traffic.


Building the LLM layer of your voice stack well means understanding the models you route to. The Claude Opus 4.7 features, benchmarks and pricing breakdown covers the cost and capability trade-offs that map directly to voice agent workloads. If you want to see how automation layers run in practice over months rather than demos, 5 Claude automation workflows that survived six months has patterns that transfer to voice orchestration logic. And if you are building this stack across multiple client deployments at once, how to run a solo agency AI stack for multiple clients covers the isolation and routing decisions that keep things manageable.

FAQ

What is the lowest latency voice AI stack in 2026?

End-to-end latency depends on the full loop, not any single layer. As of May 2026, a composed stack using Deepgram Nova-3 for STT, a fine-tuned GPT-4o-mini or Claude Haiku for reasoning, and ElevenLabs Turbo v3 for TTS can achieve time-to-first-audio under 600ms on a well-tuned setup. OpenAI's Realtime API can hit similar numbers with less orchestration overhead, but gives you less flexibility to swap individual layers.

How much does a production voice AI agent cost per minute?

A fully loaded cost for a composed voice agent in May 2026 typically runs between two and six cents per minute, depending on LLM model choice, call length, and audio quality (which affects STT retries). The LLM layer is usually 50 to 70 percent of the total cost. End-to-end platforms like Vapi or Bland.ai bundle these costs into a single per-minute rate, which can simplify forecasting but reduces your ability to optimize individual layers.

What is the difference between Pipecat, LiveKit Agents, and Vapi?

Pipecat is an open-source Python framework for composing voice pipelines from modular components. You own the infrastructure. LiveKit Agents builds on LiveKit's real-time media infrastructure and is a strong choice if you need sub-500ms WebRTC-based delivery at scale. Vapi is a managed API that abstracts the entire stack including telephony, making it the fastest path to production but with less control over individual layers. Your choice depends on how much infrastructure you want to own versus abstract away.

Is OpenAI Realtime API better than building your own STT plus TTS stack?

For many use cases in 2026, yes. OpenAI's Realtime API handles the speech-in, speech-out loop in a single WebSocket connection, which removes the latency overhead of chaining separate API calls. The trade-off is that you are locked into OpenAI's models for both audio and reasoning. A composed stack gives you the ability to swap Deepgram for STT, use a different LLM for reasoning, and choose ElevenLabs or Cartesia for TTS, which matters when optimizing for cost or specific audio quality requirements.

What causes high latency in voice AI agents?

The biggest latency culprits are: waiting for the full LLM response before starting TTS (fixed by streaming token-by-token into TTS), aggressive VAD settings that clip the start of speech, cold-start penalties on serverless STT or TTS endpoints, and large LLM context windows that grow across a long call. Time-to-first-audio is the metric to optimize. Getting your STT transcript to the LLM within 50ms and the first TTS audio chunk back within 200ms of the first LLM token covers most of the perceived latency problem.
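
The first fix on that list, streaming tokens into TTS instead of waiting for the full response, usually means buffering tokens to a sentence boundary and flushing each sentence to the TTS engine as it completes. A sketch with hypothetical `llm_tokens` and `tts_say` helpers:

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

async def stream_reply_to_tts(llm_tokens, tts_say):
    """Flush buffered tokens to TTS at each sentence boundary so audio
    starts on sentence one, not after the full LLM response.

    llm_tokens: async iterator of LLM token strings (hypothetical)
    tts_say:    async callable that synthesizes one text chunk (hypothetical)
    """
    buffer = ""
    async for token in llm_tokens:
        buffer += token
        if SENTENCE_END.search(buffer):
            await tts_say(buffer)   # first sentence goes to TTS immediately
            buffer = ""
    if buffer:
        await tts_say(buffer)       # flush any trailing fragment
```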

Sources

  1. Deepgram State of Voice AI 2025
  2. OpenAI Realtime API Documentation
  3. Pipecat Open Source Voice Agent Framework (Daily.co)
