๐ŸขEnterprise & Advanced

How to Configure OpenClaw Voice Mode with ElevenLabs

Advanced · 2-4 hours · Updated 2025-01-22

Voice mode transforms OpenClaw into a conversational AI assistant you can speak to naturally. This guide covers the complete voice pipeline: speech-to-text with Whisper, text-to-speech with ElevenLabs, wake word detection, audio configuration, and latency optimization.

Why This Is Hard to Do Yourself

These are the common pitfalls that trip people up.

🎤

Audio pipeline complexity

Microphone input, speech-to-text, LLM processing, text-to-speech, speaker output: each step can fail independently

🗣️

Voice latency

The round trip from the end of speech to audible AI response must stay under 2 seconds to feel natural

🔊

Wake word reliability

False positives (it triggers on random words) and false negatives (it misses the wake word) both frustrate users

💰

ElevenLabs costs

High-quality voice synthesis is expensive. A chatty voice setup can cost $50-100+/month in ElevenLabs API fees alone.
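To get a feel for the bill before committing, estimate your monthly character usage. The per-character rate below is an illustrative assumption, not official pricing; check ElevenLabs' current pricing page for your plan.

```python
# Rough ElevenLabs cost estimator. The usd_per_1k_chars rate is an
# illustrative assumption -- check the current pricing page.

def estimate_monthly_cost(replies_per_day: int,
                          avg_chars_per_reply: int,
                          usd_per_1k_chars: float = 0.30) -> float:
    """Return estimated monthly spend in USD (30-day month)."""
    monthly_chars = replies_per_day * avg_chars_per_reply * 30
    return monthly_chars / 1000 * usd_per_1k_chars

# Example: 50 spoken replies/day averaging 150 characters each.
print(round(estimate_monthly_cost(50, 150), 2))  # ~67.5 USD/month
```

At 50 replies a day that already lands in the $50-100/month range mentioned above, which is why keeping spoken responses short matters.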

Step-by-Step Guide

Some links on this page are affiliate links. We may earn a commission at no extra cost to you.

Step 1

Create an ElevenLabs account and API key

# 1. Sign up at elevenlabs.io
# 2. Go to Profile → API Keys
# 3. Generate a new API key
# 4. Choose or clone a voice (note the Voice ID)
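Before wiring the key into OpenClaw, it is worth confirming it works. ElevenLabs authenticates REST calls with an `xi-api-key` header, and listing voices is a cheap read-only check (verify the endpoint against the current API docs; this sketch uses only the standard library and needs network access to actually run):

```python
# Sanity-check an ElevenLabs API key by listing available voices.
# Endpoint and header per ElevenLabs' public REST API; confirm
# against the current documentation before relying on it.
import json
import urllib.request

def build_voices_request(api_key: str) -> urllib.request.Request:
    return urllib.request.Request(
        "https://api.elevenlabs.io/v1/voices",
        headers={"xi-api-key": api_key},
    )

def list_voice_ids(api_key: str) -> list[str]:
    with urllib.request.urlopen(build_voices_request(api_key)) as resp:
        data = json.load(resp)
    return [v["voice_id"] for v in data["voices"]]

# list_voice_ids("YOUR_ELEVENLABS_API_KEY")  # returns your Voice IDs
```

The Voice ID you note in this step is what goes into `voice_id` in Step 3.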
Step 2

Configure speech-to-text (STT)

# In config/voice/stt.yaml:
stt:
  provider: whisper  # or "deepgram", "google"
  model: whisper-large-v3
  language: en

  # For local Whisper:
  whisper:
    model_path: ~/.openclaw/models/whisper-large-v3
    device: auto  # "cpu", "cuda", or "mps" for Apple Silicon
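The `device: auto` setting typically resolves to the fastest available backend. A hypothetical resolver illustrating the usual precedence (CUDA GPU first, then Apple-silicon MPS, then CPU) — the availability flags are passed in explicitly so the logic is testable without a GPU:

```python
# Illustrative resolver for the `device: auto` setting: prefer CUDA,
# then Apple-silicon MPS, then fall back to CPU.

def resolve_device(setting: str,
                   cuda_available: bool,
                   mps_available: bool) -> str:
    if setting != "auto":
        return setting          # honor an explicit "cpu"/"cuda"/"mps"
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(resolve_device("auto", cuda_available=False, mps_available=True))  # mps
```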
Step 3

Configure text-to-speech (TTS) with ElevenLabs

# In config/voice/tts.yaml:
tts:
  provider: elevenlabs
  elevenlabs:
    api_key: "YOUR_ELEVENLABS_API_KEY"  # better: load from an env var
    voice_id: "YOUR_VOICE_ID"
    model: eleven_turbo_v2_5  # Fastest model
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    use_streaming: true  # Stream audio for lower latency

Warning: ElevenLabs charges per character. The `eleven_turbo_v2_5` model is cheaper and faster than `eleven_monolingual_v1` but slightly lower quality. Start with turbo for most use cases.
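For reference, a TTS call is a POST to `https://api.elevenlabs.io/v1/text-to-speech/{voice_id}` whose JSON body mirrors the settings above. This sketch builds that body and counts the billable characters; field names follow ElevenLabs' public REST API, but verify them against the current docs:

```python
# Build the JSON body for an ElevenLabs text-to-speech request,
# mirroring the tts.yaml settings above. Field names per the public
# REST API -- confirm against current documentation before use.

def build_tts_payload(text: str, cfg: dict) -> dict:
    return {
        "text": text,
        "model_id": cfg["model"],
        "voice_settings": {
            "stability": cfg["stability"],
            "similarity_boost": cfg["similarity_boost"],
            "style": cfg["style"],
        },
    }

cfg = {"model": "eleven_turbo_v2_5", "stability": 0.5,
       "similarity_boost": 0.75, "style": 0.0}
payload = build_tts_payload("Hello from OpenClaw!", cfg)
print(len(payload["text"]))  # billable characters for this request
```

Since billing is per character, logging `len(text)` per request is a cheap way to watch your spend.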

Step 4

Set up wake word detection

# In config/voice/wake.yaml:
wake_word:
  enabled: true
  engine: porcupine  # or "snowboy", "custom"
  keyword: "hey claw"  # Custom wake word
  sensitivity: 0.5     # 0.0 (strict) to 1.0 (lenient)

  # For Porcupine:
  porcupine:
    access_key: "YOUR_PICOVOICE_KEY"
    keyword_path: ~/.openclaw/models/hey-claw.ppn
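The `sensitivity` knob trades false accepts against false rejects. As a toy illustration (this is not Porcupine's internal algorithm), think of sensitivity as lowering the confidence threshold a detection must clear:

```python
# Toy illustration of the sensitivity knob (NOT Porcupine internals):
# higher sensitivity lowers the confidence threshold, catching more
# real wake words but also more false triggers.

def is_wake_word(confidence: float, sensitivity: float) -> bool:
    threshold = 1.0 - sensitivity   # sensitivity 0.5 -> threshold 0.5
    return confidence >= threshold

# A borderline detection with confidence 0.6:
print(is_wake_word(0.6, sensitivity=0.5))  # True  (threshold 0.5)
print(is_wake_word(0.6, sensitivity=0.3))  # False (threshold 0.7)
```

In practice: if the assistant triggers on random speech, lower `sensitivity`; if it ignores you, raise it in small steps.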
Step 5

Configure the audio pipeline

# In config/voice/pipeline.yaml:
pipeline:
  input_device: default  # Or specify device name
  output_device: default
  sample_rate: 16000
  channels: 1

  vad:  # Voice Activity Detection
    enabled: true
    threshold: 0.5
    min_speech_ms: 250
    max_silence_ms: 1000

  latency:
    target_ms: 1500
    stt_timeout_ms: 5000
    tts_buffer_ms: 200
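`min_speech_ms` and `max_silence_ms` control when an utterance is considered started and finished. A simplified endpointing sketch over fixed 20 ms frames, assuming per-frame speech flags like a VAD would emit (the real pipeline is more involved, but the thresholds behave like this):

```python
# Simplified endpointing over fixed 20 ms frames, using per-frame
# speech flags such as a VAD would emit. An utterance "starts" after
# min_speech_ms of continuous speech and "ends" after max_silence_ms
# of continuous silence, mirroring the pipeline.yaml knobs above.
FRAME_MS = 20

def detect_utterance(frames: list,
                     min_speech_ms: int = 250,
                     max_silence_ms: int = 1000) -> bool:
    """Return True once an utterance has started and then ended."""
    speech_ms = silence_ms = 0
    started = False
    for is_speech in frames:
        if is_speech:
            speech_ms += FRAME_MS
            silence_ms = 0
            if speech_ms >= min_speech_ms:
                started = True
        else:
            silence_ms += FRAME_MS
            speech_ms = 0
            if started and silence_ms >= max_silence_ms:
                return True   # utterance complete
    return False

# 300 ms of speech followed by 1 s of silence -> complete utterance
print(detect_utterance([True] * 15 + [False] * 50))  # True
```

Raising `max_silence_ms` lets you pause mid-sentence without being cut off, at the cost of a slower-feeling response.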
Step 6

Test voice mode

# Start OpenClaw with voice mode:
npm start -- --voice

# Or enable in config:
# In config/openclaw.yaml:
# voice:
#   enabled: true

# Test: Say "hey claw" followed by a question
# Check logs:
tail -f ~/.openclaw/logs/voice.log
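Once voice mode responds, check whether you are meeting the `target_ms` budget. A back-of-the-envelope model of the round trip, with illustrative component times (measure your real numbers from `voice.log`):

```python
# Back-of-the-envelope latency budget for the voice round trip.
# Component times below are illustrative assumptions, not benchmarks.

def round_trip_ms(stt_ms: int, llm_ms: int, tts_first_audio_ms: int,
                  buffer_ms: int = 200) -> int:
    """Total time from end of speech to first audible output."""
    return stt_ms + llm_ms + tts_first_audio_ms + buffer_ms

total = round_trip_ms(stt_ms=300, llm_ms=600, tts_first_audio_ms=250)
print(total, "ms:", "within budget" if total <= 1500 else "over budget")
```

If you are over budget, the usual levers are a smaller Whisper model, streaming TTS (`use_streaming: true`), and a faster LLM.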

Voice Mode Has Many Moving Parts

STT, TTS, wake word, audio pipeline, latency tuning: voice mode requires careful configuration of 5+ systems working together. Our experts get it running smoothly so you can just start talking.

Get matched with a specialist who can help.

Sign Up for Expert Help →

Frequently Asked Questions