Experience real-time transcription with industry-leading accuracy
Transcribe with zero latency
Scribe v2 Realtime is built for conversational AI. With latency as low as 150ms, it enables fluid, natural voice interactions for any application.
Ultra-low Latency
Built for the speed of conversation, faster than human reaction time.
State-of-the-Art Accuracy
Industry-leading WER (Word Error Rate) for real-time transcription.
"The speed of 60db is a game changer for our AI agents."
Transcribe, tag, and caption
Perfect for long-form content. Scribe v2 provides the highest standard of accuracy for audio files, complete with speaker diarization and automated captioning.
Detect and label multiple speakers automatically with high precision.
Generate SRT and VTT files for video content in seconds.
Provide rare words or technical terms to guide the transcription model.
Support for 22 languages with localized accents and context.
"Our goal at 60db is to make audio content accessible globally."
Built for ultimate creativity
Highly accurate, performant and secure Speech to Text models designed to power the next generation of audio apps.
Keyterm Prompting
Guide the model with rare words, acronyms, or technical jargon to ensure perfect transcription.
Dynamic Audio Tagging
Automatically detect and tag the type of audio, whether it's speech, background music, or noise.
Speaker Detection
Accurately separate and label different speakers in an audio file, and detect entity types.
Enterprise Grade
Secure, SOC 2 and ISO 27001 compliant infrastructure built for critical business workflows.
Timestamp Accuracy
Get word-level timestamps that are perfectly synchronized with your audio input.
Multilingual Support
One model for the whole world. Scribe v2 supports 22 languages and 70+ accents.
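To make the captioning and timestamp features concrete, here is a minimal sketch of how word-level timestamps could be grouped into SRT cues. It does not call any Scribe API: the `Word` shape, the `words_to_srt` helper, and the fixed seven-words-per-cue rule are assumptions for illustration only, and the hosted captioning feature may segment cues differently.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level result: text plus start/end time in seconds."""
    text: str
    start: float
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = i // max_words + 1  # SRT cue numbers start at 1
        text = " ".join(w.text for w in chunk)
        timing = f"{srt_time(chunk[0].start)} --> {srt_time(chunk[-1].end)}"
        cues.append(f"{index}\n{timing}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [Word("Hello", 0.0, 0.4), Word("world", 0.5, 1.25)]
    print(words_to_srt(demo))
```

A production captioner would usually also break cues on pauses, punctuation, or speaker changes rather than on a fixed word count.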
Frequently Asked Questions
Everything you need to know about Scribe v2 and transcription.
Scribe v2 is our most accurate Speech to Text model yet, with industry-leading word error rates (WER). It's trained on over 1 million hours of diverse audio content to handle various accents, background noise, and overlapping speech.
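For readers unfamiliar with the metric, word error rate is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. The function name `word_error_rate` is illustrative, and the normalization (lowercasing and whitespace splitting only) is an assumption; published benchmarks typically apply stricter text normalization, so this is not necessarily how the figures quoted here were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quick fox"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.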
