Sub-200 ms text-to-speech, built for live voice agents
Our streaming TTS delivers first audio at a median of ~197 ms across Hindi and English (196 ms and 198 ms respectively). That places it among the fastest available, with consistently low tail latency (p95 of 288 ms for Hindi and 323 ms for English) and native Hindi-English code-switching out of the box.
Benchmark June 2026 · Lower latency is better
Time to first audio
The metric that matters most for a live voice agent: how fast the first audio is heard after the text is ready. Lower is better.
| Provider / model | First-audio (median) | Source |
|---|---|---|
| Cartesia Sonic-3 | 188 ms | Coval independent benchmark, May 2026 |
| 60dB (streaming, us) | ~197 ms | Measured (this benchmark) |
| Sarvam Bulbul v2 | ~250 ms (vendor claim) | Sarvam documentation |
| ElevenLabs Turbo v2.5 | 264 ms | Coval independent benchmark, May 2026 |
| ElevenLabs Flash v2.5 | 288 ms | Coval independent benchmark, May 2026 |
| Deepgram Aura-2 | 313 ms | Coval independent benchmark, May 2026 |
Faster than ElevenLabs and Deepgram, on par with the fastest. Cartesia Sonic-3 (188 ms) edges our ~197 ms median by about 9 ms, and we report that transparently. Where 60dB stands out is consistency: a tight p95 across both languages, which is what keeps a conversation feeling natural turn after turn.
Consistent across languages
Median is only half the story — a stable, low p95 is what makes every turn feel fast. All values are first-audio latency in milliseconds.
| Language | Median | p90 | p95 | Best case |
|---|---|---|---|---|
| Hindi | 196 ms | 246 ms | 288 ms | 148 ms |
| English | 198 ms | 278 ms | 323 ms | 160 ms |
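For readers who want to produce the same kind of summary from their own measurements, here is a minimal sketch of how median, p90, p95, and best case can be derived from a list of first-audio latencies. The percentile method used for the table above is not stated in this benchmark; the nearest-rank method below is an assumption, and the sample values are made up for illustration, not benchmark data.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Summarize latencies with the same columns as the table above."""
    return {
        "median": statistics.median(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p95": percentile(latencies_ms, 95),
        "best": min(latencies_ms),
    }


# Hypothetical first-audio latencies in milliseconds (illustration only).
example_ms = [150, 180, 190, 195, 200, 205, 210, 240, 280, 320]
summary = summarize(example_ms)

# Gap between p95 and median: one simple way to express "consistency".
tail_spread_ms = summary["p95"] - summary["median"]
print(summary, tail_spread_ms)
```

A smaller gap between p95 and median means slow turns stay closer to the typical turn. Applied to the table above, that gap is 92 ms for Hindi (288 − 196) and 125 ms for English (323 − 198).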
Quality
Clear, easily understood speech (1.0 = perfect)
Clean, full-bandwidth audio; telephony-ready
Audio begins streaming on the first generated chunk
Why this matters for voice agents
- Sub-200 ms first audio keeps replies feeling instant and human.
- A tight p95 means every turn is fast — not just the average one.
- Native Hindi-English code-switching handles how people actually speak.
Methodology
- Corpus: 330 Hindi and English prompts from the public sarvamai/tts-general-benchmark dataset — real voice-agent use cases (assistants, support, sales, announcements, conversational bots), including code-switched Hindi-English and 8 kHz telephony-condition prompts.
- Latency: measured as Time to First Audio over the streaming API — elapsed time from submitting the final text to receiving the first audio bytes, on a warm connection in steady state.
- Quality: a reference-free objective measure — STOI (intelligibility) via TorchAudio SQUIM.
- Competitor latency figures are from the independent Coval TTS Latency Benchmark (May 2026) and published vendor specifications, as cited.
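The Time to First Audio measurement described above can be sketched as a timer that starts when the text is submitted and stops at the first non-empty audio chunk. This is an illustrative sketch, not the benchmark harness: `fake_tts_stream` is a hypothetical stand-in for a real streaming client, and the warm-up handling is an assumption about what "warm connection in steady state" means in practice.

```python
import time
from typing import Callable, Iterable, Iterator, List


def time_to_first_audio_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from submitting the request to the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without delivering audio")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: waits, then yields
    dummy audio bytes. Replace with a real client to measure a real service."""
    time.sleep(delay_s)
    yield b"\x00" * 320
    yield b"\x00" * 320


def measure(open_stream: Callable[[], Iterable[bytes]],
            warmup: int = 1, runs: int = 5) -> List[float]:
    """Discard warm-up requests, then record one latency per measured run."""
    for _ in range(warmup):
        time_to_first_audio_ms(open_stream)
    return [time_to_first_audio_ms(open_stream) for _ in range(runs)]


samples = measure(lambda: fake_tts_stream(0.05), warmup=1, runs=3)
print([round(s, 1) for s in samples])
```

With a real client, `open_stream` would send the final text over an already-open connection and return the response's chunk iterator, so that connection setup is excluded from the timing.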
Notes & disclosures
- Competitor numbers are sourced (Coval independent benchmark, May 2026, and vendor documentation), not re-measured by us. They are end-to-end figures provided for context.
- Cartesia Sonic-3 (188 ms) edges our median first-audio time; we report this transparently. 60dB is on par with the fastest and leads on latency consistency (a tight p95 across languages).
- Vendor “model-only” latency claims (e.g. 75–90 ms) exclude network and are not directly comparable to delivered first-audio time.
- The Sarvam Bulbul v2 figure is a vendor claim from documentation, not an independent measurement.
