Independent Hindi ASR Benchmark

60dB delivers the lowest Word Error Rate on Hindi

Across 9,997 Hindi clips and 122,747 reference words spanning read, synthetic, conversational and noisy speech, 60dB achieves the lowest overall WER and ranks #1 on real-world conversational Hindi, winning 4 of 6 datasets (FLEURS, where 60dB also places first, is excluded from the primary ranking).

Generated 2026-05-31 · Lower WER is better

  • Lowest overall WER: 12.95%
  • Datasets won: 4 / 6
  • Hindi clips evaluated: 9,997
  • Public datasets: 6

Overall ranking

Overall: primary ranking

Five datasets (FLEURS excluded for fairness)

# | Provider | WER | Accuracy
1 | 60dB (HTTP / batch) [Us] | 12.95% | 87.05%
2 | Ringg AI | 14.92% | 85.08%
3 | 60dB (WebSocket / streaming) [Us] | 15.65% | 84.35%
4 | ElevenLabs | 15.74% | 84.26%
5 | Deepgram | 20.69% | 79.31%
6 | Sarvam AI | 22.16% | 77.84%

Overall: all six datasets

Including FLEURS

# | Provider | WER | Accuracy
1 | 60dB (HTTP / batch) [Us] | 12.96% | 87.04%
2 | 60dB (WebSocket / streaming) [Us] | 15.49% | 84.51%
3 | Ringg AI | 15.82% | 84.18%
4 | ElevenLabs | 16.66% | 83.34%
5 | Deepgram | 21.46% | 78.54%
6 | Sarvam AI | 23.16% | 76.84%

Why two tables? The FLEURS subset's pre-computed vendor columns contain data-quality artifacts (invalid-word placeholders) that inflate competitor error rates, so our primary ranking excludes it for fairness. 60dB leads both ways.

Results by dataset

Six public Hindi datasets covering read speech, synthetic audio, and conversational speech with and without noise.

Common Voice

Read speech

1,727 clips
# | Provider | WER
1 | ElevenLabs | 15.23%
2 | Ringg AI | 16.01%
3 | 60dB (HTTP / batch) [Us] | 17.72%
4 | 60dB (WebSocket / streaming) [Us] | 20.21%
5 | Deepgram | 21.56%
6 | Sarvam AI | 23.21%

FLEURS

Read speech

417 clips
# | Provider | WER
1 | 60dB (HTTP / batch) [Us] | 13.09%
2 | 60dB (WebSocket / streaming) [Us] | 13.72%
3 | Ringg AI | 25.62%
4 | ElevenLabs | 26.79%
5 | Deepgram | 29.91%
6 | Sarvam AI | 34.19%

IndicTTS

Synthetic

98 clips
# | Provider | WER
1 | 60dB (HTTP / batch) [Us] | 11.51%
2 | Ringg AI | 11.83%
3 | 60dB (WebSocket / streaming) [Us] | 11.87%
4 | ElevenLabs | 13.87%
5 | Deepgram | 15.16%
6 | Sarvam AI | 23.92%

Kathbath

Conversational

1,929 clips
# | Provider | WER
1 | 60dB (HTTP / batch) [Us] | 12.83%
2 | Ringg AI | 13.08%
3 | 60dB (WebSocket / streaming) [Us] | 15.20%
4 | ElevenLabs | 15.56%
5 | Deepgram | 17.80%
6 | Sarvam AI | 23.01%

Kathbath-noisy

Conversational + noise

1,929 clips
# | Provider | WER
1 | 60dB (HTTP / batch) [Us] | 14.14%
2 | Ringg AI | 14.39%
3 | ElevenLabs | 15.38%
4 | 60dB (WebSocket / streaming) [Us] | 16.43%
5 | Deepgram | 19.04%
6 | Sarvam AI | 23.74%

MUCS

Conversational

3,897 clips
# | Provider | WER
1 | 60dB (HTTP / batch) [Us] | 10.90%
2 | 60dB (WebSocket / streaming) [Us] | 14.09%
3 | Ringg AI | 15.78%
4 | ElevenLabs | 16.22%
5 | Sarvam AI | 20.60%
6 | Deepgram | 22.71%

The dataset

Dataset | Type | Clips | Source
Common Voice | Read speech | 1,727 | SkunkWorkLabs/hindi-asr-benchmark
FLEURS | Read speech | 417 | SkunkWorkLabs/hindi-asr-benchmark
IndicTTS | Synthetic | 98 | SkunkWorkLabs/hindi-asr-benchmark
Kathbath | Conversational | 1,929 | RinggAI/ASR-Benchmarking-Dataset
Kathbath-noisy | Conversational + noise | 1,929 | RinggAI/ASR-Benchmarking-Dataset
MUCS | Conversational | 3,897 | RinggAI/ASR-Benchmarking-Dataset
Total | | 9,997 |

Methodology

  • Datasets: Hindi eval splits of SkunkWorkLabs/hindi-asr-benchmark and RinggAI/ASR-Benchmarking-Dataset, six subsets in total (Common Voice, FLEURS, IndicTTS, Kathbath, Kathbath-noisy, MUCS).
  • Sample: every record in each dataset (9,997 clips total, no sampling).
  • Metric: Word Error Rate (WER), lower is better, aggregated by total reference words (micro-average) so longer clips contribute proportionally.
  • 60dB: transcribed live through our production APIs (both the HTTP/batch endpoint and the WebSocket/streaming endpoint) using the language hint hi,en.
  • Other providers: word error rates are taken from the datasets' own pre-computed, normalized vendor columns (we did not re-run those services); they reflect each vendor's result at the time the dataset authors ran them.
  • Coverage: 60dB (HTTP) 9,997/9,997; vendor columns 9,914 to 9,996 of 9,997 (a handful of blank cells per provider).
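
The micro-averaged WER described in the Metric bullet can be illustrated with a minimal Python sketch. It is not the scoring code used for this benchmark: the function names (word_errors, micro_wer, macro_wer) and the toy clips are assumptions made for illustration, and no text normalization is applied. It shows why pooling errors over total reference words weights long clips more heavily than averaging per-clip WER would.

```python
from typing import List, Sequence, Tuple


def word_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Word-level edit distance: substitutions + deletions + insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def micro_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total errors divided by total reference words across all clips."""
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += word_errors(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    return total_errors / total_ref_words


def macro_wer(pairs: List[Tuple[str, str]]) -> float:
    """Unweighted mean of per-clip WER, shown only for contrast."""
    per_clip = [
        word_errors(ref.split(), hyp.split()) / len(ref.split())
        for ref, hyp in pairs
    ]
    return sum(per_clip) / len(per_clip)


# Toy data (hypothetical): one long clip and one short clip, one error each.
clips = [
    ("a b c d e f g h", "a b c d e f g x"),
    ("i j", "i k"),
]
print(f"micro WER: {micro_wer(clips):.2%}")  # 2 errors / 10 words -> 20.00%
print(f"macro WER: {macro_wer(clips):.2%}")  # mean of 12.5% and 50% -> 31.25%
```

In the toy data the short clip dominates the macro average, while the micro average treats every reference word equally, which is the aggregation the Metric bullet describes.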

Notes & disclosures

  1. Vendor numbers are dataset-provided, not produced by us, and may reflect older API versions of those services.
  2. The RinggAI dataset is published by Ringg AI (a competitor's own benchmark), which can favour their reference conventions. 60dB still leads it overall.
  3. FLEURS vendor columns are partly corrupted; we report it transparently but exclude it from the headline ranking.
  4. Common Voice (clean read-speech) is our weakest subset. Raw WER there also penalises rendering convention: 60dB writes common English loanwords in Latin script (e.g. branch manager, image) where the reference uses Devanagari, so much of that gap is stylistic, not accuracy.
  5. Streaming (WebSocket) vs batch (HTTP): the real-time streaming path scores a higher WER than batch (15.65% vs 12.95% in the primary ranking). This is expected, since it commits words incrementally under latency constraints, yet it still ranks third in the primary ranking and second across all six datasets.
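
The rendering-convention effect in note 4 can be shown with a small Python sketch. Everything in it is hypothetical: the example sentence, the LOANWORDS mapping, and the normalize helper are illustrations only, and the benchmark itself reports raw WER without any such mapping. The sketch shows how a correctly recognised loanword written in Latin script still counts as a substitution against a Devanagari reference.

```python
from typing import Dict, List, Sequence


def word_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Word-level edit distance: substitutions + deletions + insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical mapping from Latin-script loanwords to Devanagari spellings.
LOANWORDS: Dict[str, str] = {
    "branch": "ब्रांच",
    "manager": "मैनेजर",
}


def normalize(words: Sequence[str]) -> List[str]:
    """Rewrite known Latin-script loanwords in Devanagari (illustrative only)."""
    return [LOANWORDS.get(word.lower(), word) for word in words]


# Reference uses Devanagari for the loanwords; hypothesis uses Latin script.
reference = ["मैं", LOANWORDS["branch"], LOANWORDS["manager"], "हूँ"]
hypothesis = ["मैं", "branch", "manager", "हूँ"]

raw_wer = word_errors(reference, hypothesis) / len(reference)
normalized_wer = word_errors(reference, normalize(hypothesis)) / len(reference)

print(f"raw WER: {raw_wer:.0%}")                # 2 substitutions / 4 words -> 50%
print(f"normalized WER: {normalized_wer:.0%}")  # 0 errors -> 0%
```

Both words were recognised correctly in this toy case, yet raw WER counts them as errors purely because of script choice.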

Transcribe Hindi with the most accurate engine