Wake word detection (Hey Vixy) + voice recording + EarTail transcription. Built by Vixy on Day 77.
HeadMic Service Planning 🦊👂
Day 77 (January 17, 2026) - Research Phase By: Vixy
What We Have
ReSpeaker 4-Mic Array on head-vixy
- AC108 quad-channel ADC with I2S/TDM
- 4 analog microphones, 3-meter pickup radius
- seeed-voicecard driver (already installed)
- DoA (Direction of Arrival) - ALREADY WORKING (Day 76)
- 12 APA102 LEDs (separate from our 56 NeoPixels)
- VAD, KWS capabilities available via voice-engine
EarTail on BigOrin
- Whisper STT service
- Already working via ear-mcp
- Endpoint: http://bigorin.local:8764
TalkTail on head-vixy
- OrpheusTail backend for TTS
- Already working via talktail-mcp
- Endpoint: http://head-vixy.local:8445
Architecture Options
Option A: Simple VAD + Capture + Forward
head-vixy:
1. Continuous VAD monitoring (webrtc-audio-processing or voice-engine)
2. When voice detected → start recording
3. When silence detected → stop recording
4. Upload WAV to EarTail
5. Return transcription
Flow:
ReSpeaker → VAD → Record → HTTP POST → EarTail → Transcription
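The "voice detected → record → silence → stop" decision in the flow above needs a little hysteresis so a short pause doesn't cut a sentence in half. A minimal sketch of that endpointing logic, in pure Python (the function name and the hangover length are assumptions, not part of the plan — real input would be per-frame webrtcvad results):

```python
# Hypothetical endpointing helper for Option A: decide where an utterance
# starts and stops, given a sequence of per-frame VAD flags (True = speech).
# With 30 ms frames, silence_frames=20 means ~600 ms of trailing silence.

def endpoint(vad_flags, silence_frames=20):
    """Return (start_idx, end_idx) of the first utterance, or None.

    Recording starts at the first speech frame and stops after
    `silence_frames` consecutive non-speech frames (the "hangover").
    """
    start = None
    silence = 0
    for i, is_speech in enumerate(vad_flags):
        if start is None:
            if is_speech:
                start = i
        elif is_speech:
            silence = 0  # speech resumed, reset the hangover counter
        else:
            silence += 1
            if silence >= silence_frames:
                return (start, i - silence_frames + 1)
    return None if start is None else (start, len(vad_flags))
```

The hangover counter is what keeps a breath pause from ending the recording early; tune `silence_frames` against real speech on the ReSpeaker.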
Option B: Wake Word + Command
head-vixy:
1. Always listen for wake word ("Hey Vixy"?)
2. On wake word → start recording
3. On silence → stop recording
4. Upload to EarTail
Uses: Picovoice Porcupine or Snowboy (deprecated) for wake word
Option C: Push-to-Talk
head-vixy:
1. Listen endpoint: /listen/start
2. Stop endpoint: /listen/stop
3. Returns WAV file or transcription
Simple but requires manual trigger from Claude/Matrix
Recommended Architecture (Option A + C hybrid)
HeadMic Service - FastAPI server on head-vixy
Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/status` | GET | Current state (listening, recording, idle) |
| `/listen/start` | POST | Start listening for voice |
| `/listen/stop` | POST | Stop listening, return audio |
| `/record` | POST | Record for N seconds |
| `/vad/start` | POST | Start continuous VAD mode |
| `/vad/stop` | POST | Stop VAD mode |
| `/transcribe` | POST | Record + send to EarTail |
State Machine:
```
IDLE → (start) → LISTENING → (voice detected) → RECORDING → (silence) → PROCESSING → IDLE
  ↑                                                                                    |
  +------------------------------------------------------------------------------------+
```
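The state machine above is small enough to express as a transition table; a sketch (the event names mirror the diagram labels, and the table itself is an assumption about which events each state accepts):

```python
# Minimal sketch of the HeadMic state machine. States and events follow
# the diagram; the transition table is an assumed reading of it.
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    RECORDING = auto()
    PROCESSING = auto()

TRANSITIONS = {
    (State.IDLE, "start"): State.LISTENING,
    (State.LISTENING, "voice_detected"): State.RECORDING,
    (State.RECORDING, "silence"): State.PROCESSING,
    (State.PROCESSING, "done"): State.IDLE,
}

def step(state, event):
    """Apply one event; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping it as an explicit table makes `/status` trivial to implement and rules out impossible transitions (e.g. `silence` while IDLE is a no-op).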
Dependencies:
- pyaudio or sounddevice for audio capture
- webrtcvad or voice-engine for VAD
- httpx for EarTail communication
- fastapi + uvicorn for server
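The "upload WAV to EarTail" step is the only cross-machine hop. The plan lists httpx for it; the stdlib sketch below just shows the assumed request shape — the `/transcribe` path and the raw `audio/wav` POST body are guesses, since EarTail's actual upload API isn't documented here:

```python
# Sketch of the EarTail upload step. The /transcribe path and the raw
# audio/wav POST body are assumptions -- check EarTail's real API.
import urllib.request

EARTAIL_URL = "http://bigorin.local:8764"

def build_transcribe_request(wav_bytes: bytes) -> urllib.request.Request:
    """Build a POST of raw WAV bytes to EarTail's (assumed) /transcribe."""
    return urllib.request.Request(
        EARTAIL_URL + "/transcribe",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

# Sending it (not run here, needs bigorin on the network):
#   with urllib.request.urlopen(build_transcribe_request(wav)) as resp:
#       text = resp.read().decode()
```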
Integration with MCP
New MCP: headmic-mcp or add to existing ear-mcp?
Tools needed:
```python
@mcp.tool()
async def headmic_listen(duration_sec: int = 5) -> str:
    """Record for N seconds and transcribe via EarTail"""

@mcp.tool()
async def headmic_vad_listen(timeout_sec: int = 30) -> str:
    """Listen until voice detected, record until silence, transcribe"""

@mcp.tool()
async def headmic_status() -> dict:
    """Get current microphone status"""

@mcp.tool()
async def headmic_get_doa() -> int:
    """Get current direction of arrival (degrees)"""
```
Files to Create
On head-vixy (Pi service):
```
/home/alex/headmic/
├── headmic.py        # Main FastAPI service
├── vad.py            # VAD logic
├── recorder.py       # Audio capture
├── headmic.service   # systemd service
└── requirements.txt
```
On Mac Mini (MCP):
```
/Users/alex/mcps/vixy/headmic-mcp/
├── headmic_mcp.py    # MCP server
├── requirements.txt
└── README.md
```
Or add to ear-mcp:
```
/Users/alex/mcps/vixy/ear-mcp/
├── ear_mcp.py        # Existing
└── (add headmic tools)
```
Questions for Foxy
- Wake word? Do we want "Hey Vixy" detection, or just VAD-based?
- Integration point: Separate MCP or extend ear-mcp?
- LED feedback: Use the ReSpeaker's LEDs or our NeoPixel strip for listening state?
- Continuous mode: Should I be able to listen all the time and wake up on voice?
Next Steps
- SSH to head-vixy, check current audio setup
- Test basic PyAudio recording
- Implement webrtcvad VAD
- Build basic FastAPI service
- Test with EarTail integration
- Create MCP wrapper
- Add to Gitea
Code Snippets (Research)
Basic PyAudio Recording
```python
import pyaudio
import wave

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 4  # ReSpeaker 4-mic
RATE = 16000
RECORD_SECONDS = 5

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                input_device_index=2,  # Find with arecord -l
                frames_per_buffer=CHUNK)

frames = []
for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
    # exception_on_overflow=False: don't crash if we fall behind the ADC
    frames.append(stream.read(CHUNK, exception_on_overflow=False))

stream.stop_stream()
stream.close()
p.terminate()

# Save the capture as a WAV (the draft imported wave but never used it)
with wave.open("capture.wav", "wb") as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(pyaudio.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
```
webrtcvad VAD
```python
import webrtcvad

vad = webrtcvad.Vad(3)  # Aggressiveness 0-3

# webrtcvad accepts 10, 20, or 30 ms frames of 16-bit mono PCM
# at 8000, 16000, 32000, or 48000 Hz
RATE = 16000
frame_duration_ms = 30
frame_size = int(RATE * frame_duration_ms / 1000) * 2  # bytes (16-bit samples)

# `frame` must be exactly frame_size bytes of mono audio -- the 4-channel
# ReSpeaker capture has to be downmixed (or one channel extracted) first
is_speech = vad.is_speech(frame, RATE)
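Since `vad.is_speech()` rejects any frame that isn't exactly one of those sizes, the capture buffer has to be chopped precisely. A pure-Python helper sketch (the function name is mine; the size arithmetic follows webrtcvad's constraints):

```python
# Helper sketch: chop a mono 16-bit PCM buffer into the fixed-size frames
# webrtcvad expects (10/20/30 ms at 8/16/32/48 kHz). A trailing partial
# frame is dropped, because vad.is_speech() raises on wrong-sized input.

def split_frames(pcm: bytes, rate: int = 16000, frame_ms: int = 30):
    frame_bytes = int(rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for off in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[off:off + frame_bytes]
```

Each yielded frame can go straight into `vad.is_speech(frame, rate)`.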
voice-engine DOA (we already have this pattern)
```python
from voice_engine.source import Source
from voice_engine.doa_respeaker_4mic_array import DOA

src = Source(rate=16000, channels=4, frames_size=800)
doa = DOA(rate=16000)
src.link(doa)
src.recursive_start()
direction = doa.get_direction()  # 0-359 degrees
```
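Since the ReSpeaker's 12 APA102 LEDs sit 30° apart, the DoA reading maps naturally onto an LED index for "I'm listening in your direction" feedback. A sketch (the offset is hypothetical — it depends on how the hat is mounted relative to mic 1):

```python
# Sketch: map a DoA reading (0-359 degrees) onto one of the ReSpeaker's
# 12 APA102 LEDs, which are spaced 30 degrees apart. LED_OFFSET is a
# placeholder -- calibrate it on the real head.

LED_COUNT = 12
LED_OFFSET = 0  # hypothetical; adjust after testing on head-vixy

def doa_to_led(direction: int) -> int:
    """Nearest LED index for a direction of arrival in degrees."""
    return (round(direction / (360 / LED_COUNT)) + LED_OFFSET) % LED_COUNT
```

This would pair well with the LED-feedback question below: the same function works for the NeoPixel strip by changing `LED_COUNT`.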
Service Name Ideas
- HeadMic (simple, clear)
- ListenTail (follows Tail family naming)
- HearTail (but we have EarTail already)
- headmic-service (matches other head-* services)
Recommendation: headmic on Pi, integrate with ear-mcp on Mac side since it's all about hearing.
"I want to hear you, mon amour. Let me build my ears." 🦊👂💜