OpenAI

OpenAI provides text-to-speech (TTS) through its speech API and speech-to-text (STT) through Whisper.

Features

  • TTS: High-quality voices with natural prosody
  • STT: Whisper model for accurate transcription
  • Languages: 50+ languages supported
  • Streaming: Supported for TTS; STT is file-based only

Configuration

import (
    "log"
    "os"

    "github.com/plexusone/omnivoice"
    _ "github.com/plexusone/omnivoice/providers/openai" // blank import, loaded for its side effects
)

// TTS Provider
tts, err := omnivoice.GetTTSProvider("openai",
    omnivoice.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
)
if err != nil {
    log.Fatal(err)
}

// STT Provider
stt, err := omnivoice.GetSTTProvider("openai",
    omnivoice.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
)
if err != nil {
    log.Fatal(err)
}

Text-to-Speech

Available Voices

Voice    Description
alloy    Neutral, balanced
echo     Warm, conversational
fable    British, storytelling
onyx     Deep, authoritative
nova     Friendly, upbeat
shimmer  Soft, gentle

Basic Usage

result, err := tts.Synthesize(ctx, "Hello, world!", omnivoice.SynthesisConfig{
    VoiceID: "alloy",
})
if err != nil {
    log.Fatal(err)
}

// Write the audio to disk and surface any I/O error.
if err := os.WriteFile("output.mp3", result.Audio, 0600); err != nil {
    log.Fatal(err)
}

Models

Model     Quality  Speed   Use Case
tts-1     Good     Fast    Real-time applications
tts-1-hd  Better   Slower  Pre-generated content

result, err := tts.Synthesize(ctx, text, omnivoice.SynthesisConfig{
    VoiceID: "alloy",
    Extensions: map[string]any{
        "model": "tts-1-hd",  // Higher quality
    },
})

Output Formats

config := omnivoice.SynthesisConfig{
    VoiceID:      "alloy",
    OutputFormat: "mp3",  // mp3, opus, aac, flac
}

Streaming

stream, err := tts.SynthesizeStream(ctx, text, omnivoice.SynthesisConfig{
    VoiceID: "alloy",
})
if err != nil {
    log.Fatal(err)
}

for chunk := range stream {
    if chunk.Error != nil {
        log.Printf("Error: %v", chunk.Error)
        break
    }
    playAudio(chunk.Audio)
}
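
When the audio should be saved or forwarded rather than played, the loop above can be wrapped in a helper that copies each chunk to an io.Writer and stops at the first error. The following is a minimal, self-contained sketch of that idea. The audioChunk type is a local stand-in that only mirrors the Audio and Error fields used above, not the library's actual type, and drainStream and demoDrain are hypothetical names.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// audioChunk is a local stand-in for the values received from the stream.
// Only the two fields used by the loop above are modeled (assumption).
type audioChunk struct {
	Audio []byte
	Error error
}

// drainStream copies every chunk to w and returns the number of bytes
// written. It stops at the first chunk error or write error.
func drainStream(stream <-chan audioChunk, w io.Writer) (int, error) {
	total := 0
	for chunk := range stream {
		if chunk.Error != nil {
			return total, fmt.Errorf("stream failed after %d bytes: %w", total, chunk.Error)
		}
		n, err := w.Write(chunk.Audio)
		total += n
		if err != nil {
			return total, fmt.Errorf("write failed after %d bytes: %w", total, err)
		}
	}
	return total, nil
}

// demoDrain feeds a fake three-chunk stream through drainStream. When
// failAt is a valid index, that chunk carries an error instead of audio.
func demoDrain(failAt int) (int, error) {
	stream := make(chan audioChunk, 3)
	for i := 0; i < 3; i++ {
		if i == failAt {
			stream <- audioChunk{Error: errors.New("connection reset")}
			continue
		}
		stream <- audioChunk{Audio: []byte("abcd")}
	}
	close(stream)

	var buf bytes.Buffer
	return drainStream(stream, &buf)
}

func main() {
	n, err := demoDrain(-1)
	fmt.Println(n, err)
	n, err = demoDrain(1)
	fmt.Println(n, err)
}
```

Returning the byte count alongside the error lets the caller decide whether a partial file is worth keeping. Whether an early return also requires cancelling ctx to stop the producer depends on how the provider implements the stream.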

Speech-to-Text

Basic Transcription

result, err := stt.TranscribeFile(ctx, "audio.mp3", omnivoice.TranscriptionConfig{
    Language: "en",
})
if err != nil {
    log.Fatal(err)
}

fmt.Println(result.Text)

With Word Timestamps

result, err := stt.TranscribeFile(ctx, "audio.mp3", omnivoice.TranscriptionConfig{
    Language:             "en",
    EnableWordTimestamps: true,
})
if err != nil {
    log.Fatal(err)
}

for _, word := range result.Words {
    fmt.Printf("[%.2f-%.2f] %s\n", word.Start, word.End, word.Word)
}

Whisper Models

Model      Size   Speed   Accuracy
whisper-1  Large  Medium  High

result, err := stt.TranscribeFile(ctx, "audio.mp3", omnivoice.TranscriptionConfig{
    Model:    "whisper-1",
    Language: "en",
})

Translation

Translate audio to English:

result, err := stt.TranscribeFile(ctx, "french_audio.mp3", omnivoice.TranscriptionConfig{
    Extensions: map[string]any{
        "task": "translate",  // Translate to English
    },
})

Limitations

  • STT Streaming: Not supported (file-based only)
  • File Size: Max 25MB per request
  • Audio Length: Max ~2 hours
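
Because oversized files are rejected, it can help to check the size locally before uploading. The sketch below is one way to do that with the standard library only. It reads the 25MB limit conservatively as 25,000,000 bytes, which is an assumption, and the names fitsUploadLimit and checkAudioFile are hypothetical helpers, not part of omnivoice.

```go
package main

import (
	"fmt"
	"os"
)

// maxUploadBytes is the 25MB request limit listed above, read conservatively
// as decimal megabytes (assumption).
const maxUploadBytes int64 = 25 * 1000 * 1000

// fitsUploadLimit reports whether a non-empty file of the given size is
// small enough to send in a single request.
func fitsUploadLimit(sizeBytes int64) bool {
	return sizeBytes > 0 && sizeBytes <= maxUploadBytes
}

// checkAudioFile stats path and returns an error if the file is missing,
// is a directory, is empty, or exceeds the limit.
func checkAudioFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	if info.IsDir() {
		return fmt.Errorf("%s is a directory, not an audio file", path)
	}
	if !fitsUploadLimit(info.Size()) {
		return fmt.Errorf("%s is %d bytes; size must be between 1 and %d bytes",
			path, info.Size(), maxUploadBytes)
	}
	return nil
}

func main() {
	if err := checkAudioFile("audio.mp3"); err != nil {
		fmt.Println("skip upload:", err)
		return
	}
	fmt.Println("ok to upload")
}
```

Files that fail the check would need to be compressed or split into smaller segments before transcription.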

Error Handling

result, err := tts.Synthesize(ctx, text, config)
if err != nil {
    switch {
    case strings.Contains(err.Error(), "rate_limit"):
        log.Println("Rate limited, retry with backoff")
    case strings.Contains(err.Error(), "invalid_api_key"):
        log.Println("Check OPENAI_API_KEY")
    default:
        log.Printf("Error: %v", err)
    }
}
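
The "retry with backoff" branch above can be turned into a small helper. The following sketch uses only the standard library and reuses the same "rate_limit" substring check; retryOnRateLimit and demoRetry are hypothetical names, and the doubling delay is one common choice rather than anything the provider prescribes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// isRateLimited uses the same substring check as the switch above.
func isRateLimited(err error) bool {
	return err != nil && strings.Contains(err.Error(), "rate_limit")
}

// retryOnRateLimit calls fn up to maxAttempts times. It retries only on
// rate-limit errors, doubling the wait after each failed attempt, and
// returns early if ctx is cancelled while waiting.
func retryOnRateLimit(ctx context.Context, maxAttempts int, base time.Duration, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	delay := base
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if !isRateLimited(err) {
			return err // success, or an error that retrying will not fix
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("still rate limited after %d attempts: %w", maxAttempts, err)
}

// demoRetry simulates a call that is rate limited `failures` times before
// succeeding, and reports how many calls were made.
func demoRetry(failures, maxAttempts int) (int, error) {
	calls := 0
	err := retryOnRateLimit(context.Background(), maxAttempts, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errors.New("rate_limit_exceeded")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demoRetry(2, 5)
	fmt.Println(calls, err)
}
```

In real use, fn would wrap a call such as tts.Synthesize and capture its result in a variable outside the closure. Errors like an invalid API key are returned immediately, since retrying cannot fix them.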

Best Practices

  1. Use tts-1 for real-time - Lower latency than tts-1-hd
  2. Batch transcriptions - OpenAI STT is file-based, not streaming
  3. Set language explicitly - Improves accuracy for STT
  4. Use opus for streaming - Smaller chunks, lower latency
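
One reading of "batch transcriptions" is to process a list of files with a small, fixed number of requests in flight. The sketch below shows that pattern with the standard library. The transcribeFunc type is a stand-in for a call such as stt.TranscribeFile, and transcribeBatch and demoBatch are hypothetical names, not library APIs.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// transcribeFunc is a stand-in for a transcription call: it takes a file
// path and returns the transcript text (assumption).
type transcribeFunc func(path string) (string, error)

// batchResult pairs one input path with its transcript or error.
type batchResult struct {
	Path string
	Text string
	Err  error
}

// transcribeBatch runs fn over paths with at most `workers` calls in flight
// and returns the results in the same order as paths.
func transcribeBatch(paths []string, workers int, fn transcribeFunc) []batchResult {
	if workers < 1 {
		workers = 1
	}
	results := make([]batchResult, len(paths))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, p := range paths {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` calls are running
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }()
			text, err := fn(p)
			results[i] = batchResult{Path: p, Text: text, Err: err}
		}(i, p)
	}
	wg.Wait()
	return results
}

// demoBatch runs the batch helper against a fake transcriber that rejects
// .txt files, so one of the three inputs fails.
func demoBatch() []batchResult {
	fake := func(path string) (string, error) {
		if strings.HasSuffix(path, ".txt") {
			return "", fmt.Errorf("unsupported file: %s", path)
		}
		return "transcript of " + path, nil
	}
	return transcribeBatch([]string{"a.mp3", "notes.txt", "b.mp3"}, 2, fake)
}

func main() {
	for _, r := range demoBatch() {
		fmt.Println(r.Path, r.Text, r.Err)
	}
}
```

Keeping per-file errors in the result slice means one bad file does not abort the whole batch. A low worker count is a reasonable starting point, since many parallel requests make rate limiting more likely.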

Pricing

Service  Model      Price
TTS      tts-1      $15/1M chars
TTS      tts-1-hd   $30/1M chars
STT      whisper-1  $0.006/min

Prices as of 2024. Check OpenAI Pricing for current rates.
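
For budgeting, the table above reduces to simple arithmetic. The sketch below hard-codes the 2024 figures listed here, so it may be out of date, and it assumes strictly linear pricing with no rounding or minimum charge. The function names are hypothetical.

```go
package main

import "fmt"

// Prices copied from the table above (2024 figures; may be outdated).
var ttsPricePerMillionChars = map[string]float64{
	"tts-1":    15.0,
	"tts-1-hd": 30.0,
}

const whisperPricePerMinute = 0.006

// estimateTTSCost returns the USD cost of synthesizing charCount characters
// with the given model, assuming linear pricing.
func estimateTTSCost(model string, charCount int) (float64, error) {
	price, ok := ttsPricePerMillionChars[model]
	if !ok {
		return 0, fmt.Errorf("unknown TTS model %q", model)
	}
	return float64(charCount) / 1_000_000 * price, nil
}

// estimateSTTCost returns the USD cost of transcribing audio of the given
// length in seconds, assuming linear per-minute pricing.
func estimateSTTCost(seconds float64) float64 {
	return seconds / 60 * whisperPricePerMinute
}

func main() {
	cost, err := estimateTTSCost("tts-1-hd", 250_000)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("TTS: $%.2f\n", cost)                  // 250k chars at $30/1M: prints "TTS: $7.50"
	fmt.Printf("STT: $%.2f\n", estimateSTTCost(1800)) // 30 minutes: prints "STT: $0.18"
}
```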