How to Configure OpenClaw Voice Mode with ElevenLabs
Voice mode transforms OpenClaw into a conversational AI assistant you can speak to naturally. This guide covers the complete voice pipeline: speech-to-text with Whisper, text-to-speech with ElevenLabs, wake word detection, audio configuration, and latency optimization.
Why This Is Hard to Do Yourself
These are the common pitfalls that trip people up.
Audio pipeline complexity
Microphone input, speech-to-text, LLM processing, text-to-speech, speaker output: each step can fail independently
Voice latency
The round trip from your speech to the assistant's spoken reply must stay under 2 seconds to feel natural
Wake word reliability
False positives (triggers on random words) and false negatives (doesn't trigger on the wake word) both frustrate users
ElevenLabs costs
High-quality voice synthesis is expensive. A chatty voice setup can cost $50-100+/month in ElevenLabs API fees alone (rough math below).
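To ballpark your own bill, multiply characters per reply by replies per day. The per-character rates below are assumptions for illustration; confirm against ElevenLabs' current pricing page:
# Back-of-the-envelope ElevenLabs cost estimate (rates are assumptions):
#   100 spoken replies/day x ~200 characters each = 20,000 chars/day
#   20,000 chars/day x 30 days = 600,000 chars/month
#   At ~$0.10-$0.30 per 1,000 characters, that's roughly $60-$180/month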
Step-by-Step Guide
Some links on this page are affiliate links. We may earn a commission at no extra cost to you.
Create an ElevenLabs account and API key
# 1. Sign up at elevenlabs.io
# 2. Go to Profile → API Keys
# 3. Generate a new API key
# 4. Choose or clone a voice (note the Voice ID)
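If you'd rather grab the Voice ID from the command line than hunt for it in the dashboard, ElevenLabs' REST API has a voices endpoint. A quick sketch (substitute your real key; jq is optional but makes the output readable):
# List available voices and their IDs:
curl -s -H "xi-api-key: YOUR_ELEVENLABS_API_KEY" \
  https://api.elevenlabs.io/v1/voices | jq '.voices[] | {name, voice_id}'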
Configure speech-to-text (STT)
# In config/voice/stt.yaml:
stt:
  provider: whisper  # or "deepgram", "google"
  model: whisper-large-v3
  language: en
  # For local Whisper:
  whisper:
    model_path: ~/.openclaw/models/whisper-large-v3
    device: auto  # "cpu", "cuda", or "mps" for Apple Silicon
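Before wiring Whisper into OpenClaw, confirm the model actually transcribes on your hardware. A quick sanity check, assuming you have the standalone openai-whisper CLI installed (a recent version that ships large-v3) and a short recording on hand:
# Install the CLI and transcribe a sample clip with local Whisper:
pip install openai-whisper
whisper sample.wav --model large-v3 --language en
If this is slow on CPU, expect the same inside OpenClaw; consider a smaller model or a hosted provider like Deepgram.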
Configure text-to-speech (TTS) with ElevenLabs
# In config/voice/tts.yaml:
tts:
  provider: elevenlabs
  elevenlabs:
    api_key: "YOUR_ELEVENLABS_API_KEY"
    voice_id: "YOUR_VOICE_ID"
    model: eleven_turbo_v2_5  # Fastest model
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    use_streaming: true  # Stream audio for lower latency
Warning: ElevenLabs charges per character. The `eleven_turbo_v2_5` model is cheaper and faster than `eleven_monolingual_v1` but slightly lower quality. Start with turbo for most use cases.
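To confirm the key and voice work before starting OpenClaw, synthesize a test clip directly against ElevenLabs' text-to-speech endpoint. A minimal sketch using the public REST API (substitute your own key and Voice ID):
# Synthesize a short test phrase to an MP3 file:
curl -s -X POST "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID" \
  -H "xi-api-key: YOUR_ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Voice mode is working.", "model_id": "eleven_turbo_v2_5"}' \
  -o test.mp3
Play test.mp3 to judge whether the stability and similarity_boost settings suit the voice before tuning them in the config.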
Set up wake word detection
# In config/voice/wake.yaml:
wake_word:
  enabled: true
  engine: porcupine  # or "snowboy", "custom"
  keyword: "hey claw"  # Custom wake word
  sensitivity: 0.5  # 0.0 (strict) to 1.0 (lenient)
  # For Porcupine:
  porcupine:
    access_key: "YOUR_PICOVOICE_KEY"
    keyword_path: ~/.openclaw/models/hey-claw.ppn
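To check that the wake word fires at all, and to tune sensitivity outside of OpenClaw, Picovoice ships a standalone microphone demo. A sketch, assuming you install it via pip:
# Test the custom keyword against your live microphone:
pip install pvporcupinedemo
porcupine_demo_mic --access_key YOUR_PICOVOICE_KEY \
  --keyword_paths ~/.openclaw/models/hey-claw.ppn
Say "hey claw" a few times at normal volume and distance. If it misses you, raise sensitivity; if it triggers on random speech, lower it.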
Configure the audio pipeline
# In config/voice/pipeline.yaml:
pipeline:
  input_device: default  # Or specify device name
  output_device: default
  sample_rate: 16000
  channels: 1
  vad:  # Voice Activity Detection
    enabled: true
    threshold: 0.5
    min_speech_ms: 250
    max_silence_ms: 1000
  latency:
    target_ms: 1500
    stt_timeout_ms: 5000
    tts_buffer_ms: 200
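If `default` picks the wrong device, list what the OS actually sees and do a quick record-and-playback loop before blaming the pipeline. On Linux with ALSA, for example:
# List capture devices, then record 3 seconds at the pipeline's format and play it back:
arecord -l
arecord -d 3 -f S16_LE -r 16000 -c 1 mic-test.wav
aplay mic-test.wav
On macOS, `system_profiler SPAudioDataType` lists audio devices instead. If the recording is silent or distorted here, fix that first; no amount of config tuning upstream will help.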
Test voice mode
# Start OpenClaw with voice mode:
npm start -- --voice
# Or enable in config:
# In config/openclaw.yaml:
# voice:
#   enabled: true
# Test: Say "hey claw" followed by a question
# Check logs:
tail -f ~/.openclaw/logs/voice.log
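While testing, a filtered view of the log surfaces problems faster. This assumes OpenClaw's log lines carry conventional level names; adjust the pattern to match what your log actually shows:
# Surface only warnings, errors, and latency lines while testing:
tail -f ~/.openclaw/logs/voice.log | grep -iE "warn|error|latency"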
Voice Mode Has Many Moving Parts
STT, TTS, wake word, audio pipeline, latency tuning: voice mode requires careful configuration of 5+ systems working together. Our experts get it running smoothly so you can just start talking.
Get matched with a specialist who can help.
Sign Up for Expert Help →