# Voice Processing

OmniChat supports voice message handling with transcription and synthesis.
## Overview

Voice processing enables:

- **Transcription**: convert incoming voice messages to text
- **Synthesis**: convert text responses to voice notes
- **Smart routing**: respond with voice to voice input
## VoiceProcessor Interface

```go
type VoiceProcessor interface {
    // TranscribeAudio converts audio to text.
    TranscribeAudio(ctx context.Context, audio []byte, mimeType string) (string, error)

    // SynthesizeSpeech converts text to audio, returning the audio
    // bytes and their MIME type.
    SynthesizeSpeech(ctx context.Context, text string) ([]byte, string, error)

    // ResponseMode controls when to use voice responses:
    //   "auto"   - voice reply to voice input
    //   "always" - always respond with voice
    //   "never"  - never respond with voice
    ResponseMode() string
}
```
## Basic Usage

```go
// Implement VoiceProcessor
type MyVoiceProcessor struct {
    // STT/TTS clients
}

func (p *MyVoiceProcessor) TranscribeAudio(ctx context.Context, audio []byte, mimeType string) (string, error) {
    // Use your STT service
    return transcribe(audio, mimeType)
}

func (p *MyVoiceProcessor) SynthesizeSpeech(ctx context.Context, text string) ([]byte, string, error) {
    audio, err := synthesize(text)
    return audio, "audio/ogg; codecs=opus", err
}

func (p *MyVoiceProcessor) ResponseMode() string {
    return "auto"
}

// Use with the router
router.SetAgent(agent)
router.OnMessage(provider.All(), router.ProcessWithVoice(voiceProcessor))
```
## ProcessWithVoice Flow

```text
Incoming Message
       │
       ▼
┌──────────────┐
│  Has Voice?  │──No──▶ Process as text
└──────┬───────┘
       │ Yes
       ▼
┌──────────────┐
│  Transcribe  │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Process text │
│ (via Agent)  │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ ResponseMode │
└──────┬───────┘
       │
   ┌───┴────────┐
   │            │
 auto      always/never
   │            │
   ▼            ▼
 Voice     Based on mode
 reply
```
## Response Modes

### Auto Mode

Respond with voice only when the input was voice:

- Voice input → Voice response
- Text input → Text response

### Always Mode

Respond with voice regardless of the input type.

### Never Mode

Never respond with voice; incoming voice is still transcribed and processed as text.
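The three modes boil down to a small pure function. The sketch below is ours, not part of the API; unknown modes fall back to text as the safe default.

```go
package main

import "fmt"

// shouldUseVoice implements the ResponseMode rules: "always" and
// "never" ignore the input type, while "auto" mirrors it.
func shouldUseVoice(mode string, inputWasVoice bool) bool {
    switch mode {
    case "always":
        return true
    case "auto":
        return inputWasVoice
    default: // "never" or an unrecognized mode
        return false
    }
}

func main() {
    fmt.Println(shouldUseVoice("auto", true))  // true: voice in, voice out
    fmt.Println(shouldUseVoice("auto", false)) // false: text in, text out
}
```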
## Audio Formats

### Incoming Audio

Common formats from platforms:

| Platform | Format | MIME Type |
|---|---|---|
| WhatsApp | Opus in Ogg | `audio/ogg; codecs=opus` |
| Telegram | Ogg Opus | `audio/ogg` |
| Discord | Opus | `audio/opus` |
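Transcription backends usually want an explicit encoding rather than a raw MIME string, so it is common to normalize the table above into the backend's encoding names. A sketch; the helper name is ours and the labels follow Google Cloud Speech naming:

```go
package main

import (
    "fmt"
    "strings"
)

// encodingFor maps common platform MIME types to an STT encoding
// label, stripping parameters such as "; codecs=opus" first.
func encodingFor(mimeType string) string {
    base := strings.TrimSpace(strings.Split(mimeType, ";")[0])
    switch base {
    case "audio/ogg", "audio/opus":
        return "OGG_OPUS"
    case "audio/mpeg":
        return "MP3"
    case "audio/wav", "audio/x-wav":
        return "LINEAR16"
    default:
        return "" // unknown; let the caller decide how to handle it
    }
}

func main() {
    fmt.Println(encodingFor("audio/ogg; codecs=opus")) // OGG_OPUS
}
```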
### Outgoing Audio

Recommended format for voice notes is Ogg Opus:

```go
func (p *MyVoiceProcessor) SynthesizeSpeech(ctx context.Context, text string) ([]byte, string, error) {
    audio, err := synthesize(text, "ogg_opus")
    return audio, "audio/ogg; codecs=opus", err
}
```
## URL Handling

When a response contains URLs, both voice and text are sent:

```go
// Response: "Check out https://example.com"
// Results in:
//   1. Voice note saying "Check out example.com"
//   2. Text message with the clickable URL
```

This keeps URLs clickable while still providing a voice response.
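One way to produce the spoken form is to replace each URL with just its host before synthesis, while sending the untouched text as a separate message. A sketch; the `spokenForm` helper is ours and the regex is deliberately simple:

```go
package main

import (
    "fmt"
    "net/url"
    "regexp"
)

// urlPattern is a simple matcher for http(s) URLs; real messages may
// need something more robust.
var urlPattern = regexp.MustCompile(`https?://[^\s]+`)

// spokenForm rewrites each URL in text to just its host, so the
// synthesized audio says "example.com" instead of spelling out the
// full URL.
func spokenForm(text string) string {
    return urlPattern.ReplaceAllStringFunc(text, func(raw string) string {
        u, err := url.Parse(raw)
        if err != nil || u.Host == "" {
            return raw // leave anything unparseable untouched
        }
        return u.Host
    })
}

func main() {
    fmt.Println(spokenForm("Check out https://example.com/docs?ref=1"))
    // Prints: Check out example.com
}
```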
## Integration with STT/TTS Services

### Google Cloud Speech

```go
import (
    speech "cloud.google.com/go/speech/apiv1"
    "cloud.google.com/go/speech/apiv1/speechpb"
    texttospeech "cloud.google.com/go/texttospeech/apiv1"
)

type GoogleVoiceProcessor struct {
    sttClient *speech.Client
    ttsClient *texttospeech.Client
}

func (p *GoogleVoiceProcessor) TranscribeAudio(ctx context.Context, audio []byte, mimeType string) (string, error) {
    req := &speechpb.RecognizeRequest{
        Config: &speechpb.RecognitionConfig{
            Encoding:     speechpb.RecognitionConfig_OGG_OPUS,
            LanguageCode: "en-US",
        },
        Audio: &speechpb.RecognitionAudio{
            AudioSource: &speechpb.RecognitionAudio_Content{Content: audio},
        },
    }
    resp, err := p.sttClient.Recognize(ctx, req)
    if err != nil {
        return "", err
    }
    // Concatenate the top alternative of each result.
    var sb strings.Builder
    for _, result := range resp.Results {
        if len(result.Alternatives) > 0 {
            sb.WriteString(result.Alternatives[0].Transcript)
        }
    }
    return sb.String(), nil
}
```
### OpenAI Whisper

```go
type OpenAIVoiceProcessor struct {
    client *openai.Client
}

func (p *OpenAIVoiceProcessor) TranscribeAudio(ctx context.Context, audio []byte, mimeType string) (string, error) {
    resp, err := p.client.CreateTranscription(ctx, openai.AudioRequest{
        Model:    openai.Whisper1,
        Reader:   bytes.NewReader(audio),
        FilePath: "audio.ogg", // used only to infer the format
    })
    return resp.Text, err
}
```
### ElevenLabs TTS

```go
type ElevenLabsVoiceProcessor struct {
    apiKey  string
    voiceID string
}

func (p *ElevenLabsVoiceProcessor) SynthesizeSpeech(ctx context.Context, text string) ([]byte, string, error) {
    // Call the ElevenLabs API (illustrative client call)
    audio, err := elevenlabs.TextToSpeech(p.apiKey, p.voiceID, text)
    return audio, "audio/mpeg", err
}
```
## Receiving Voice Messages

Voice messages arrive in the `Media` field:

```go
router.OnMessage(provider.All(), func(ctx context.Context, msg provider.IncomingMessage) error {
    for _, media := range msg.Media {
        if media.Type == provider.MediaTypeVoice {
            // media.Data holds the audio bytes, media.MimeType the format
            transcript, err := voiceProcessor.TranscribeAudio(ctx, media.Data, media.MimeType)
            if err != nil {
                return err
            }
            log.Printf("Voice message said: %s", transcript)
        }
    }
    return nil
})
```
## Sending Voice Messages

```go
audio, mimeType, err := voiceProcessor.SynthesizeSpeech(ctx, "Hello!")
if err != nil {
    return err
}

router.Send(ctx, "whatsapp", chatID, provider.OutgoingMessage{
    Media: []provider.Media{{
        Type:     provider.MediaTypeVoice,
        Data:     audio,
        MimeType: mimeType,
    }},
})
```
## Platform Support

| Platform | Receive Voice | Send Voice |
|---|---|---|
| WhatsApp | Yes | Yes (PTT) |
| Telegram | Yes | Yes |
| Discord | Limited | Limited |
| Slack | No | No |
| Gmail | No | No |
## Best Practices

### Handle transcription errors gracefully

```go
transcript, err := processor.TranscribeAudio(ctx, audio, mimeType)
if err != nil {
    // Fall back to asking the user to type
    return router.Send(ctx, provider, chatID, provider.OutgoingMessage{
        Content: "Sorry, I couldn't understand the audio. Please type your message.",
    })
}
```
### Limit audio length

```go
const maxAudioBytes = 10 * 1024 * 1024 // 10 MB
if len(audio) > maxAudioBytes {
    return errors.New("audio too long")
}
```
### Cache synthesized audio

Cache the MIME type along with the audio, so a cached entry is returned with the format the underlying processor actually produced:

```go
type cachedAudio struct {
    data     []byte
    mimeType string
}

// Not safe for concurrent use; guard with a sync.Mutex in production.
var audioCache = make(map[string]cachedAudio)

func (p *CachedProcessor) SynthesizeSpeech(ctx context.Context, text string) ([]byte, string, error) {
    if c, ok := audioCache[text]; ok {
        return c.data, c.mimeType, nil
    }
    audio, mimeType, err := p.underlying.SynthesizeSpeech(ctx, text)
    if err == nil {
        audioCache[text] = cachedAudio{audio, mimeType}
    }
    return audio, mimeType, err
}
```