Text-to-Speech Latency Benchmark

Sub-200 ms text-to-speech, built for live voice agents

Our streaming TTS delivers first audio in ~197 ms (median) for both Hindi and English, among the fastest in this comparison, with p95 holding near 300 ms and native Hindi-English code-switching out of the box.

Benchmark June 2026 · Lower latency is better

~197 ms
Median first audio
~300 ms
p95 first audio
0.93
Intelligibility (STOI)
Sub-200 ms
Hindi + English

Time to first audio

The metric that matters most for a live voice agent: how fast the first audio is heard after the text is ready. Lower is better.

Provider / model      | First-audio (median)   | Source
Cartesia Sonic-3      | 188 ms                 | Coval independent benchmark, May 2026
60dB (streaming) (us) | ~197 ms                | Measured (this benchmark)
Sarvam Bulbul v2      | ~250 ms (vendor claim) | Sarvam documentation
ElevenLabs Turbo v2.5 | 264 ms                 | Coval independent benchmark, May 2026
ElevenLabs Flash v2.5 | 288 ms                 | Coval independent benchmark, May 2026
Deepgram Aura-2       | 313 ms                 | Coval independent benchmark, May 2026

Faster than ElevenLabs (264 to 288 ms) and Deepgram (313 ms), and on par with the fastest. Cartesia Sonic-3 (188 ms) edges our median by about 9 ms, and we report that transparently. Where 60dB stands out is consistency: p95 stays at 288 ms for Hindi and 323 ms for English, which is what keeps a conversation feeling natural turn after turn.

Consistent across languages

Median is only half the story — a stable, low p95 is what makes every turn feel fast. All values are first-audio latency in milliseconds.

Language | Median | p90    | p95    | Best case
Hindi    | 196 ms | 246 ms | 288 ms | 148 ms
English  | 198 ms | 278 ms | 323 ms | 160 ms
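For readers who want to produce the same kind of summary from their own measurements, here is a minimal sketch of how a list of first-audio latencies can be reduced to median, p90, p95, and best case. It is an illustration only: the function names are hypothetical, the sample values are made up, and the nearest-rank percentile rule is an assumption, since the percentile method behind the table above is not stated.

```python
import statistics


def nearest_rank_percentile(samples_ms, pct):
    """Return the smallest sample with at least pct% of samples at or below it.

    Nearest-rank is an assumed convention; other tools interpolate instead.
    pct is expected to be an integer between 1 and 100.
    """
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_latency(samples_ms):
    """Summarize first-audio latencies (milliseconds) into the table's columns."""
    return {
        "median": statistics.median(samples_ms),
        "p90": nearest_rank_percentile(samples_ms, 90),
        "p95": nearest_rank_percentile(samples_ms, 95),
        "best": min(samples_ms),
    }


# Hypothetical samples, not data from this benchmark.
example = [150, 180, 190, 200, 210, 250, 300, 320, 400, 500]
summary = summarize_latency(example)
```

With only ten samples, p95 falls on the slowest request, which is why a tail figure is only meaningful over a corpus of a few hundred prompts or more.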

Quality

0.93
Intelligibility (STOI)

Clear, easily-understood speech (1.0 = perfect)

16 kHz
Native sample rate

Clean wideband audio; telephony-ready

Bidirectional WebSocket
Streaming

Audio begins streaming on the first generated chunk

Why this matters for voice agents

  • Sub-200 ms first audio keeps replies feeling instant and human.
  • A tight p95 means every turn is fast — not just the average one.
  • Native Hindi-English code-switching handles how people actually speak.

Methodology

  • Corpus: 330 Hindi and English prompts from the public sarvamai/tts-general-benchmark dataset — real voice-agent use cases (assistants, support, sales, announcements, conversational bots), including code-switched Hindi-English and 8 kHz telephony-condition prompts.
  • Latency: measured as Time to First Audio over the streaming API — elapsed time from submitting the final text to receiving the first audio bytes, on a warm connection in steady state.
  • Quality: an objective measure that needs no clean reference recording: STOI (intelligibility), estimated with TorchAudio SQUIM.
  • Competitor latency figures are from the independent Coval TTS Latency Benchmark (May 2026) and published vendor specifications, as cited.
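The Time to First Audio definition in the Latency bullet above can be sketched as a small timing helper. This is a minimal illustration, not the harness used for this benchmark: `time_to_first_audio_ms`, `send_text`, and `audio_chunks` are hypothetical names, and the caller is assumed to wrap an already-open (warm) streaming connection so that connection setup is not counted.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_audio_ms(
    send_text: Callable[[], None],
    audio_chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Milliseconds from submitting the final text to the first audio bytes.

    send_text: hypothetical callable that submits the final text on a warm
        connection.
    audio_chunks: hypothetical iterable yielding audio frames as they arrive.
    Returns None if the stream ends without any audio.
    """
    start = clock()  # start timing immediately before the text is submitted
    send_text()
    for chunk in audio_chunks:
        if chunk:  # ignore empty frames; only real audio bytes stop the timer
            return (clock() - start) * 1000.0
    return None


# Usage with a fake stream and a fake clock (no network involved).
fake_times = iter([1.0, 1.25])
sent = []
ttfa = time_to_first_audio_ms(
    send_text=lambda: sent.append("namaste, how can I help?"),
    audio_chunks=iter([b"", b"\x00\x01"]),
    clock=lambda: next(fake_times),
)
```

Injecting the clock keeps the helper testable without a live service; in real use the default `time.perf_counter` is the monotonic timer suited to short intervals.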

Notes & disclosures

  1. Competitor numbers are sourced (Coval independent benchmark, May 2026, and vendor documentation), not re-measured by us. They are end-to-end figures provided for context.
  2. Cartesia Sonic-3 (188 ms) edges our median first-audio time; we report this transparently. 60dB is on par with the fastest and leads on latency consistency (a tight p95 across languages).
  3. Vendor “model-only” latency claims (e.g. 75–90 ms) exclude network and are not directly comparable to delivered first-audio time.
  4. The Sarvam Bulbul v2 figure is a vendor claim from documentation, not an independent measurement.

Build voice agents with sub-200 ms speech