How to Configure OpenClaw Voice Mode with ElevenLabs

Advanced · 2-4 hours · Updated 2025-01-22

Voice mode transforms OpenClaw into a conversational AI assistant you can speak to naturally. This guide covers the complete voice pipeline: speech-to-text with Whisper, text-to-speech with ElevenLabs, wake word detection, audio configuration, and latency optimization.

Why This Is Hard to Do Yourself

These are the common pitfalls that trip people up.

🎤

Audio pipeline complexity

Microphone input, speech-to-text, LLM processing, text-to-speech, speaker output — each step can fail independently

🗣️

Voice latency

Round-trip from speech to AI response to voice output must be under 2 seconds to feel natural

🔊

Wake word reliability

False positives (triggers on random words) and false negatives (doesn't trigger on the wake word) both frustrate users

💰

ElevenLabs costs

High-quality voice synthesis is expensive. A chatty voice setup can cost $50-100+/month in ElevenLabs API fees alone.
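
To see how a chatty setup reaches that range, the arithmetic can be sketched in a few lines. This is only an estimate under simple per-character billing. The function name, the usage figures, and the rate of $0.05 per 1,000 characters are all assumptions for illustration, not ElevenLabs pricing; substitute the rate from your own plan.

```python
def estimate_monthly_cost(replies_per_day: int,
                          avg_chars_per_reply: int,
                          usd_per_1k_chars: float,
                          days: int = 30) -> float:
    """Estimate monthly TTS spend under simple per-character billing."""
    total_chars = replies_per_day * avg_chars_per_reply * days
    return total_chars / 1000 * usd_per_1k_chars


# Hypothetical usage: 200 spoken replies a day, about 300 characters each,
# at an ASSUMED rate of $0.05 per 1,000 characters (check your plan).
monthly = estimate_monthly_cost(200, 300, 0.05)
print(f"~${monthly:.2f}/month")
```

Shorter replies lower the bill directly, since cost scales linearly with characters synthesized.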

Step-by-Step Guide

Step 1

Create an ElevenLabs account and API key

# 1. Sign up at elevenlabs.io
# 2. Go to Profile → API Keys
# 3. Generate a new API key
# 4. Choose or clone a voice (note the Voice ID)
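
If you prefer to look up the Voice ID programmatically rather than in the dashboard, a minimal sketch follows. It assumes the public ElevenLabs REST endpoint `GET /v1/voices`, the `xi-api-key` header, and a response containing a `voices` list with `name` and `voice_id` fields; verify these against the current API reference. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the ElevenLabs API reference.
VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists the account's voices."""
    return urllib.request.Request(
        VOICES_URL,
        headers={"xi-api-key": api_key, "Accept": "application/json"},
        method="GET",
    )


def list_voice_ids(api_key: str) -> dict:
    """Send the request and map voice name -> voice_id (needs network access)."""
    with urllib.request.urlopen(build_voices_request(api_key), timeout=10) as resp:
        payload = json.load(resp)
    # Assumed response shape: {"voices": [{"name": ..., "voice_id": ...}, ...]}
    return {v["name"]: v["voice_id"] for v in payload.get("voices", [])}
```

Calling `list_voice_ids("YOUR_ELEVENLABS_API_KEY")` also confirms that the key works: an invalid key makes `urlopen` raise an HTTP error.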

Step 2

Configure speech-to-text (STT)

# In config/voice/stt.yaml:
stt:
  provider: whisper  # or "deepgram", "google"
  model: whisper-large-v3
  language: en

  # For local Whisper:
  whisper:
    model_path: ~/.openclaw/models/whisper-large-v3
    device: auto  # "cpu", "cuda", or "mps" for Apple Silicon
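
This guide does not say how `device: auto` is resolved. One plausible approach, assuming local Whisper runs on PyTorch, is sketched below: prefer CUDA, then Apple Silicon's MPS backend, then fall back to the CPU. The function name `resolve_device` is hypothetical.

```python
def resolve_device(requested: str = "auto") -> str:
    """Resolve a device setting to "cuda", "mps", or "cpu"."""
    allowed = {"auto", "cpu", "cuda", "mps"}
    if requested not in allowed:
        raise ValueError(f"device must be one of {sorted(allowed)}")
    if requested != "auto":
        return requested  # explicit choice: pass it through unchanged
    try:
        import torch  # optional dependency; ASSUMED to back local Whisper
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(resolve_device("auto"))
```

Setting the device explicitly is useful when `auto` picks a GPU that is already busy with another workload.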

Step 3

Configure text-to-speech (TTS) with ElevenLabs

# In config/voice/tts.yaml:
tts:
  provider: elevenlabs
  elevenlabs:
    api_key: "YOUR_ELEVENLABS_API_KEY"
    voice_id: "YOUR_VOICE_ID"
    model: eleven_turbo_v2_5  # Fastest model
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    use_streaming: true  # Stream audio for lower latency

Warning: ElevenLabs charges per character synthesized. The `eleven_turbo_v2_5` model is cheaper and faster than `eleven_monolingual_v1`, at the cost of slightly lower quality. Start with turbo for most use cases.
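
To make the settings above concrete, the sketch below shows how they might map onto a raw ElevenLabs HTTP request. It assumes the public `POST /v1/text-to-speech/{voice_id}/stream` endpoint with a `voice_settings` object; how OpenClaw itself calls the API is not documented here. The function name and the `ELEVENLABS_API_KEY` environment variable are hypothetical.

```python
import json
import os
import urllib.request

# Assumed base URL; confirm against the ElevenLabs API reference.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text, voice_id, api_key, model_id="eleven_turbo_v2_5",
                      stability=0.5, similarity_boost=0.75, style=0.0,
                      streaming=True):
    """Build (but do not send) a text-to-speech request mirroring tts.yaml."""
    # The streaming variant returns audio chunks as they are generated.
    path = f"/text-to-speech/{voice_id}" + ("/stream" if streaming else "")
    body = {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
            "style": style,
        },
    }
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Reading the key from an environment variable (hypothetical name) keeps it
# out of config files that might end up in version control.
api_key = os.environ.get("ELEVENLABS_API_KEY", "YOUR_ELEVENLABS_API_KEY")
req = build_tts_request("Hello from OpenClaw.", "YOUR_VOICE_ID", api_key)
print(req.full_url)
```

Passing the request to `urllib.request.urlopen` and reading the response in chunks would let playback start before synthesis finishes, which is the point of `use_streaming: true`.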

Step 4

Set up wake word detection

# In config/voice/wake.yaml:
wake_word:
  enabled: true
  engine: porcupine  # or "snowboy", "custom"
  keyword: "hey claw"  # Custom wake word
  sensitivity: 0.5     # 0.0 (strict) to 1.0 (lenient)

  # For Porcupine:
  porcupine:
    access_key: "YOUR_PICOVOICE_KEY"
    keyword_path: ~/.openclaw/models/hey-claw.ppn
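
Wake word failures are often plain configuration mistakes, so it can help to check the settings before starting the pipeline. The sketch below validates a dictionary shaped like the `wake_word` section above. The function name and the specific checks are assumptions for illustration, not part of OpenClaw.

```python
from pathlib import Path


def validate_wake_config(cfg: dict, check_files: bool = False) -> list:
    """Return a list of problems found in a wake_word config mapping."""
    problems = []
    s = cfg.get("sensitivity")
    is_number = isinstance(s, (int, float)) and not isinstance(s, bool)
    if not is_number or not 0.0 <= s <= 1.0:
        problems.append("sensitivity must be a number between 0.0 and 1.0")
    if cfg.get("engine") == "porcupine":
        porcupine = cfg.get("porcupine") or {}
        if not porcupine.get("access_key"):
            problems.append("porcupine.access_key is missing")
        keyword_path = Path(str(porcupine.get("keyword_path", ""))).expanduser()
        if keyword_path.suffix != ".ppn":
            problems.append("porcupine.keyword_path should point to a .ppn file")
        elif check_files and not keyword_path.is_file():
            problems.append(f"keyword file not found: {keyword_path}")
    return problems


config = {
    "enabled": True,
    "engine": "porcupine",
    "keyword": "hey claw",
    "sensitivity": 0.5,
    "porcupine": {
        "access_key": "YOUR_PICOVOICE_KEY",
        "keyword_path": "~/.openclaw/models/hey-claw.ppn",
    },
}
print(validate_wake_config(config))
```

When tuning, change `sensitivity` in small steps: raising it reduces missed wake words but increases false triggers, and lowering it does the opposite.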

Step 5

Configure the audio pipeline

# In config/voice/pipeline.yaml:
pipeline:
  input_device: default  # Or specify device name
  output_device: default
  sample_rate: 16000
  channels: 1

  vad:  # Voice Activity Detection
    enabled: true
    threshold: 0.5
    min_speech_ms: 250
    max_silence_ms: 1000

  latency:
    target_ms: 1500
    stt_timeout_ms: 5000
    tts_buffer_ms: 200
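
The VAD settings are easier to tune once you see how they interact. The sketch below is a simplified model, not OpenClaw's implementation. It assumes the detector emits one speech probability per fixed-length frame, that an utterance ends after `max_silence_ms` of continuous silence, and that utterances shorter than `min_speech_ms` are discarded. The function name and the `frame_ms` parameter are hypothetical.

```python
def segment_speech(frame_probs, frame_ms=20, threshold=0.5,
                   min_speech_ms=250, max_silence_ms=1000):
    """Return (start_ms, end_ms) utterances from per-frame speech probabilities."""
    segments = []
    start = None            # start time of the utterance in progress
    last_voiced_end = None  # end time of the most recent voiced frame
    for i, prob in enumerate(frame_probs):
        frame_start = i * frame_ms
        frame_end = frame_start + frame_ms
        if prob >= threshold:
            if start is None:
                start = frame_start
            last_voiced_end = frame_end
        elif start is not None and frame_end - last_voiced_end >= max_silence_ms:
            # Enough trailing silence: close the utterance, keeping it only
            # if its span (including short pauses) is long enough.
            if last_voiced_end - start >= min_speech_ms:
                segments.append((start, last_voiced_end))
            start = None
    # Flush an utterance that is still open when the audio ends.
    if start is not None and last_voiced_end - start >= min_speech_ms:
        segments.append((start, last_voiced_end))
    return segments


# 400 ms of speech, 1000 ms of silence, then a 100 ms blip that is too short.
probs = [0.9] * 20 + [0.1] * 50 + [0.9] * 5
print(segment_speech(probs))
```

In this model a lower `max_silence_ms` makes the assistant respond sooner after you stop talking, but risks cutting you off mid-sentence during a natural pause.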

Step 6

Test voice mode

# Start OpenClaw with voice mode:
npm start -- --voice

# Or enable in config:
# In config/openclaw.yaml:
# voice:
#   enabled: true

# Test: Say "hey claw" followed by a question
# Check logs:
tail -f ~/.openclaw/logs/voice.log
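
While testing, it can help to compare per-stage timings against the `target_ms` value from Step 5. The sketch below sums the stages of one conversational turn and reports the headroom. The stage names and timings are hypothetical, and the format of `voice.log` is not documented here, so how you obtain the numbers is left open.

```python
def check_latency_budget(stage_ms: dict, target_ms: int = 1500) -> dict:
    """Sum per-stage timings and report headroom against the target."""
    if not stage_ms:
        raise ValueError("stage_ms must contain at least one stage")
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "within_target": total <= target_ms,
        "headroom_ms": target_ms - total,
        "slowest_stage": max(stage_ms, key=stage_ms.get),
    }


# Hypothetical timings for one turn, in milliseconds.
report = check_latency_budget({"stt": 450, "llm": 600, "tts_first_audio": 300})
print(report)
```

The slowest stage is usually the best place to start tuning, for example by switching to a smaller Whisper model or enabling TTS streaming.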
