Text-to-Speech Latency Benchmark

Sub-200 ms text-to-speech, built for live voice agents

Our streaming TTS delivers first audio in ~197 ms (median) for both Hindi and English — among the fastest available, with the tightest latency consistency, and native Hindi-English code-switching out of the box.

Benchmark June 2026 · Lower latency is better

~197 ms · Median first audio
~300 ms · p95 first audio
0.93 · Intelligibility (STOI)
Sub-200 ms · Hindi + English

Time to first audio

The metric that matters most for a live voice agent: how fast the first audio is heard after the text is ready. Lower is better.

Provider / model	First-audio (median)	Source
Cartesia Sonic-3	188 ms	Coval independent benchmark, May 2026
60dB (streaming) (us)	~197 ms	Measured (this benchmark)
Sarvam Bulbul v2	~250 ms (vendor claim)	Sarvam documentation
ElevenLabs Turbo v2.5	264 ms	Coval independent benchmark, May 2026
ElevenLabs Flash v2.5	288 ms	Coval independent benchmark, May 2026
Deepgram Aura-2	313 ms	Coval independent benchmark, May 2026

Faster than ElevenLabs and Deepgram, and on par with the fastest. Cartesia Sonic-3 (188 ms) edges our ~197 ms median by about 9 ms, and we report that transparently. Where 60dB stands out is consistency: a tight p95 across both languages (288 ms Hindi, 323 ms English), which is what keeps a conversation feeling natural turn after turn.
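For readers who want the gaps spelled out, the comparison above reduces to simple subtraction. The snippet below is only an illustrative sketch: the figures are copied from the table, the variable names are our own, and the Sarvam value is an approximate vendor claim rather than a measurement.

```python
# Median first-audio figures (ms) copied from the table above.
# Names are illustrative; the Sarvam value is an approximate vendor claim.
OURS_MS = 197

medians_ms = {
    "Cartesia Sonic-3": 188,
    "Sarvam Bulbul v2": 250,
    "ElevenLabs Turbo v2.5": 264,
    "ElevenLabs Flash v2.5": 288,
    "Deepgram Aura-2": 313,
}

# Positive gap: the other provider's median is slower than ours.
# Negative gap: the other provider's median is faster.
gaps_ms = {name: ms - OURS_MS for name, ms in medians_ms.items()}

for name, gap in sorted(gaps_ms.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:+d} ms relative to 60dB")
```

Because the competitor figures come from different sources and were not re-measured, these differences are indicative rather than exact.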

Consistent across languages

Median is only half the story — a stable, low p95 is what makes every turn feel fast. All values are first-audio latency in milliseconds.

Language	Median	p90	p95	Best case
Hindi	196 ms	246 ms	288 ms	148 ms
English	198 ms	278 ms	323 ms	160 ms
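As a rough illustration of how columns like these can be derived from raw per-prompt latencies, here is a minimal sketch. It assumes a nearest-rank percentile definition, which may differ from the method actually used for this benchmark, and the sample values are made up for demonstration only.

```python
import math
from statistics import median


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Summarize first-audio latencies (ms) into the columns used in the table."""
    return {
        "median": median(samples_ms),
        "p90": percentile(samples_ms, 90),
        "p95": percentile(samples_ms, 95),
        "best": min(samples_ms),
    }


# Illustrative values only, not benchmark data.
demo_ms = [150, 180, 190, 195, 200, 205, 210, 240, 260, 320]
summary = summarize(demo_ms)
print(summary)
```

With only ten samples, p95 lands on the slowest value; a corpus of several hundred prompts gives the tail percentiles more meaning.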

Quality

0.93 · Intelligibility (STOI): clear, easily understood speech (1.0 = perfect)
16 kHz · Native sample rate: clean wideband audio; telephony-ready
Bidirectional WebSocket · Streaming: audio begins streaming on the first generated chunk

Why this matters for voice agents

Sub-200 ms first audio keeps replies feeling instant and human.
A tight p95 means every turn is fast — not just the average one.
Native Hindi-English code-switching handles how people actually speak.

Methodology

Corpus: 330 Hindi and English prompts from the public sarvamai/tts-general-benchmark dataset — real voice-agent use cases (assistants, support, sales, announcements, conversational bots), including code-switched Hindi-English and 8 kHz telephony-condition prompts.
Latency: measured as Time to First Audio over the streaming API — elapsed time from submitting the final text to receiving the first audio bytes, on a warm connection in steady state.
Quality: a reference-free objective measure — STOI (intelligibility) via TorchAudio SQUIM.
Competitor latency figures are from the independent Coval TTS Latency Benchmark (May 2026) and published vendor specifications, as cited.
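The Time to First Audio measurement described above can be sketched roughly as follows. This is not the benchmark harness: `fake_tts_stream` is a hypothetical stand-in for a warm streaming connection, and in a real client the clock would start at the moment the final text is submitted over the socket.

```python
import asyncio
import time


async def time_to_first_audio(chunks):
    """Return (elapsed_ms, first_chunk) for the first non-empty chunk of an async stream."""
    # Stands in for "final text submitted"; a real client would start the clock at send time.
    start = time.perf_counter()
    async for chunk in chunks:
        if chunk:  # skip empty keep-alive frames, if any
            return (time.perf_counter() - start) * 1000.0, chunk
    raise RuntimeError("stream ended before any audio arrived")


async def fake_tts_stream(first_delay_s=0.05, n_chunks=3):
    """Hypothetical stand-in for a streaming TTS connection (not a real API)."""
    await asyncio.sleep(first_delay_s)  # simulated synthesis plus network delay
    for _ in range(n_chunks):
        yield b"\x00\x00" * 160  # 10 ms of 16-bit silence at 16 kHz
        await asyncio.sleep(0.01)


ttfa_ms, first_chunk = asyncio.run(time_to_first_audio(fake_tts_stream()))
print(f"time to first audio: {ttfa_ms:.1f} ms, first chunk: {len(first_chunk)} bytes")
```

Using a monotonic clock such as `time.perf_counter` avoids skew from wall-clock adjustments, and timing only the first non-empty chunk matches the "first audio bytes" definition rather than total synthesis time.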

Notes & disclosures

1. Competitor numbers are sourced (Coval independent benchmark, May 2026, and vendor documentation), not re-measured by us. They are end-to-end figures provided for context.
2. Cartesia Sonic-3 (188 ms) edges our median first-audio time; we report this transparently. 60dB is on par with the fastest and leads on latency consistency (a tight p95 across languages).
3. Vendor “model-only” latency claims (e.g. 75–90 ms) exclude network and are not directly comparable to delivered first-audio time.
4. The Sarvam Bulbul v2 figure is a vendor claim from documentation, not an independent measurement.
