Speech-to-Text (STT)

Convert audio to text using multiple providers with support for batch and streaming transcription.

Quick Start

provider, _ := omnivoice.GetSTTProvider("deepgram",
    omnivoice.WithAPIKey(apiKey))

result, _ := provider.TranscribeFile(ctx, "audio.mp3", omnivoice.TranscriptionConfig{
    Language: "en",
})

fmt.Println(result.Text)

Available Providers

Provider     Registry Name   Accuracy    Latency    Best For
Deepgram     "deepgram"      Excellent   Very Low   Real-time, high volume
OpenAI       "openai"        Excellent   Medium     General purpose, multilingual
ElevenLabs   "elevenlabs"    Very Good   Medium     Integration with TTS
Twilio       "twilio"        Good        Low        Phone calls

Transcription Methods

TranscribeFile

Transcribe a local audio file:

result, err := provider.TranscribeFile(ctx, "recording.mp3", omnivoice.TranscriptionConfig{
    Language:             "en",
    EnableWordTimestamps: true,
})

TranscribeURL

Transcribe audio from a URL:

result, err := provider.TranscribeURL(ctx, "https://example.com/audio.mp3", omnivoice.TranscriptionConfig{
    Language: "en",
})

Transcribe

Transcribe from an io.Reader:

file, err := os.Open("audio.mp3")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

result, err := provider.Transcribe(ctx, file, omnivoice.TranscriptionConfig{
    Language: "en",
})

TranscribeStream

Real-time streaming transcription:

stream, err := provider.TranscribeStream(ctx, omnivoice.TranscriptionConfig{
    Language: "en",
})
if err != nil {
    panic(err)
}

// Send audio chunks
go func() {
    defer stream.Close()
    for chunk := range audioSource {
        if err := stream.Write(chunk); err != nil {
            log.Printf("stream write failed: %v", err)
            return
        }
    }
}()

// Receive transcription results
for result := range stream.Results() {
    if result.IsFinal {
        fmt.Printf("Final: %s\n", result.Text)
    } else {
        fmt.Printf("Interim: %s\n", result.Text)
    }
}
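A common pattern is to keep only final results when assembling the full transcript, since interim results are revised as more audio arrives. A self-contained sketch — the `streamResult` struct here is a stand-in for the library's result type, mirroring only the two fields used in the loop above:

```go
package main

import (
	"fmt"
	"strings"
)

// streamResult mirrors the two fields used in the streaming loop;
// the real result type may carry more information.
type streamResult struct {
	IsFinal bool
	Text    string
}

// assembleTranscript keeps only final results: interim results are
// revised and then superseded by the final result for the same segment.
func assembleTranscript(results []streamResult) string {
	var finals []string
	for _, r := range results {
		if r.IsFinal {
			finals = append(finals, r.Text)
		}
	}
	return strings.Join(finals, " ")
}

func main() {
	results := []streamResult{
		{false, "hello"},
		{false, "hello wor"},
		{true, "Hello, world."},
		{true, "How are you?"},
	}
	fmt.Println(assembleTranscript(results)) // Hello, world. How are you?
}
```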

Configuration Options

config := omnivoice.TranscriptionConfig{
    // Language
    Language: "en-US",  // BCP-47 language code

    // Features
    EnableWordTimestamps:     true, // Word-level timing
    EnableSpeakerDiarization: true, // Speaker identification

    // Model selection (provider-specific)
    Model: "nova-2",

    // Provider-specific extensions
    Extensions: map[string]any{
        "smart_format": true,
        "punctuate":    true,
    },
}

Provider-Specific Examples

Deepgram

provider, _ := omnivoice.GetSTTProvider("deepgram",
    omnivoice.WithAPIKey(apiKey))

result, _ := provider.TranscribeFile(ctx, "audio.mp3", omnivoice.TranscriptionConfig{
    Language:             "en",
    Model:                "nova-2",
    EnableWordTimestamps: true,
    Extensions: map[string]any{
        "smart_format": true,
        "punctuate":    true,
        "diarize":      true,
        "utterances":   true,
    },
})

OpenAI Whisper

provider, _ := omnivoice.GetSTTProvider("openai",
    omnivoice.WithAPIKey(apiKey))

result, _ := provider.TranscribeFile(ctx, "audio.mp3", omnivoice.TranscriptionConfig{
    Language: "en",
    Extensions: map[string]any{
        "model":            "whisper-1",
        "response_format":  "verbose_json",
        "temperature":      0,
    },
})

ElevenLabs Scribe

provider, _ := omnivoice.GetSTTProvider("elevenlabs",
    omnivoice.WithAPIKey(apiKey))

result, _ := provider.TranscribeFile(ctx, "audio.mp3", omnivoice.TranscriptionConfig{
    Language:             "en",
    EnableWordTimestamps: true,
})

Working with Results

Basic Text

fmt.Println(result.Text)

Word Timestamps

for _, word := range result.Words {
    fmt.Printf("[%.2fs - %.2fs] %s\n",
        word.Start.Seconds(),
        word.End.Seconds(),
        word.Text)
}

Speaker Diarization

for _, segment := range result.Segments {
    fmt.Printf("Speaker %d: %s\n", segment.Speaker, segment.Text)
}

Confidence Scores

fmt.Printf("Confidence: %.2f%%\n", result.Confidence*100)

for _, word := range result.Words {
    if word.Confidence < 0.8 {
        fmt.Printf("Low confidence word: %s (%.2f)\n", word.Text, word.Confidence)
    }
}
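When a provider omits an utterance-level score, one option is to average the per-word scores. A small self-contained helper — the `scoredWord` struct stands in for the library's word type:

```go
package main

import "fmt"

// scoredWord stands in for the library's word type.
type scoredWord struct {
	Text       string
	Confidence float64
}

// meanConfidence averages per-word scores, a fallback when no
// utterance-level Confidence is populated.
func meanConfidence(words []scoredWord) float64 {
	if len(words) == 0 {
		return 0
	}
	var sum float64
	for _, w := range words {
		sum += w.Confidence
	}
	return sum / float64(len(words))
}

func main() {
	words := []scoredWord{{"hello", 0.5}, {"world", 1.0}}
	fmt.Printf("%.2f\n", meanConfidence(words)) // 0.75
}
```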

Language Codes

OmniVoice accepts BCP-47 language codes:

Code    Language
en      English
en-US   English (US)
en-GB   English (UK)
es      Spanish
fr      French
de      German
ja      Japanese
zh      Chinese

Most providers support automatic language detection when no code is specified.
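Some providers accept only the bare primary subtag; extracting it from a full BCP-47 code is a one-liner:

```go
package main

import (
	"fmt"
	"strings"
)

// primarySubtag extracts the primary language subtag from a BCP-47
// code, e.g. "en-US" -> "en".
func primarySubtag(code string) string {
	if i := strings.IndexByte(code, '-'); i >= 0 {
		return code[:i]
	}
	return code
}

func main() {
	fmt.Println(primarySubtag("en-US")) // en
	fmt.Println(primarySubtag("ja"))    // ja
}
```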

Error Handling

result, err := provider.TranscribeFile(ctx, path, config)
if err != nil {
    switch {
    case errors.Is(err, context.DeadlineExceeded):
        log.Println("Request timed out")
    case os.IsNotExist(err):
        log.Println("Audio file not found")
    case strings.Contains(err.Error(), "unsupported_format"):
        // String matching is a fallback; prefer typed errors where the provider exposes them.
        log.Println("Audio format not supported")
    default:
        log.Printf("STT error: %v", err)
    }
    return
}

Audio Format Support

Format   Extension   Providers
MP3      .mp3        All
WAV      .wav        All
FLAC     .flac       Deepgram, OpenAI
OGG      .ogg        Deepgram, OpenAI
WebM     .webm       Deepgram, OpenAI
M4A      .m4a        Deepgram, OpenAI
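The support matrix can be encoded as a lookup to validate files before upload. The map below mirrors the table above; treat it as illustrative and verify against each provider's current documentation:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedFormats maps a lowercase extension to the registry names
// of providers that accept it (per the table above).
var supportedFormats = map[string][]string{
	".mp3":  {"deepgram", "openai", "elevenlabs", "twilio"},
	".wav":  {"deepgram", "openai", "elevenlabs", "twilio"},
	".flac": {"deepgram", "openai"},
	".ogg":  {"deepgram", "openai"},
	".webm": {"deepgram", "openai"},
	".m4a":  {"deepgram", "openai"},
}

// formatSupported reports whether a provider accepts the file's
// format, judging by extension alone.
func formatSupported(provider, path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	for _, p := range supportedFormats[ext] {
		if p == provider {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(formatSupported("twilio", "call.flac"))   // false
	fmt.Println(formatSupported("deepgram", "audio.MP3")) // true
}
```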

Best Practices

  1. Use appropriate models - Nova-2 for accuracy, Base for speed
  2. Enable word timestamps - Essential for subtitles and alignment
  3. Handle streaming errors - Reconnect on connection drops
  4. Choose the right provider - Deepgram for real-time, OpenAI for accuracy
  5. Preprocess audio - Normalize volume, remove silence

Next Steps