# HeadMic - Vixy's Ears 🦊👂
Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial tracking, sound classification, speaker identification, and wake word detection.
- **Hardware:** 2× ReSpeaker XVF3800 4-Mic Array (USB, left/right ear)
- **Wake word:** "Hey Vivi" (Picovoice Porcupine)
- **Runs on:** Raspberry Pi 5 (head-vixy.local)
## Architecture

```
[Left XVF3800]──┐          [Right XVF3800]──┐
 4 mics, DoA    │           4 mics, DoA     │
 WS2812 LEDs    │           WS2812 LEDs     │
                ▼                           ▼
      arecord (16kHz mono)        arecord (16kHz mono)
                │                           │
                └─────────────┬─────────────┘
                              ▼
               DualAudioStream (audio_stream.py)
        best-beam selection (energy-based, 10% hysteresis)
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
      Porcupine             YAMNet             Binaural
      wake word           (Edge TPU)           Recorder
      "Hey Vivi"          521 classes         stereo WAV
          ▼                   ▼
      Record +            Speaker ID
      Transcribe         (Resemblyzer)
      via EarTail             │
                              ▼
                 Spatial Tracker (spatial.py)
            DoA → triangulation → ILD distance
             → smooth gaze → proximity zones
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
     Eye Service        Spatial Scene        USB Control
     POST /gaze        (spatial_scene)       (xvf3800.py)
     eyes follow       what+where map        LEDs + DoA
     the speaker       anomaly detect        per-array
```
## Features
| Feature | Module | Hardware | Status |
|---|---|---|---|
| Wake word detection | Porcupine | CPU | Needs Picovoice key |
| Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
| Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrolled + anonymous tracking |
| Spatial tracking | spatial.py | USB control | 3-signal fusion: DoA + ILD + ITD |
| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
| ITD processing | spatial.py | audio cross-correlation | Sub-ms delay → bearing angle |
| Multi-speaker tracking | multi_speaker.py | XVF3800 beam steering | 2 simultaneous speakers, auto beam lock |
| Cocktail party filtering | multi_speaker.py + audio_stream.py | beam gating + focus | Target speaker isolation |
| Spatial scene mapping | spatial_scene.py | — | Learns where sounds come from, anomaly detection |
| Sound event localization | spatial_scene.py | — | What + where + when log |
| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based or focused attention |
| LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
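The ITD row above (sub-ms delay → bearing angle) can be illustrated with a small cross-correlation sketch. Assumed values: 175 mm ear separation (from `array_separation_mm` in the config file), speed of sound 343 m/s, 16 kHz sample rate. The real `spatial.py` fuses ITD with DoA and ILD rather than using it alone:

```python
import numpy as np

MIC_SEPARATION_M = 0.175   # array_separation_mm from headmic.json
SPEED_OF_SOUND = 343.0     # m/s
SAMPLE_RATE = 16_000

def itd_bearing_deg(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate bearing (degrees, 0 = straight ahead) from the inter-ear
    time delay found by cross-correlating the two channels.
    Illustrative sketch only."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)  # delay in samples
    itd = lag / SAMPLE_RATE                        # delay in seconds
    # Far-field model: ITD = d * sin(theta) / c  =>  theta = asin(ITD * c / d)
    s = np.clip(itd * SPEED_OF_SOUND / MIC_SEPARATION_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```

At 16 kHz and 175 mm separation the maximum physical delay is about 8 samples, which is why the table calls it "sub-ms delay".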
## Installation

### Prerequisites

```bash
# On head-vixy (Raspberry Pi 5, Debian Trixie)
sudo apt install python3-dev portaudio19-dev alsa-utils

# USB permissions for XVF3800
sudo tee /etc/udev/rules.d/99-respeaker.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="2886", ATTR{idProduct}=="001a", MODE="0666"
EOF

# USB permissions for Coral Edge TPU
sudo tee /etc/udev/rules.d/99-coral.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="1a6e", ATTR{idProduct}=="089a", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", ATTR{idProduct}=="9302", MODE="0666"
EOF

sudo udevadm control --reload-rules && sudo udevadm trigger
```
### XVF3800 Firmware

Both arrays must be flashed with the 2-channel USB firmware (not 6-channel — the 6ch firmware breaks LED/DoA control commands):

```bash
git clone https://github.com/respeaker/reSpeaker_XVF3800_USB_4MIC_ARRAY.git /tmp/xvf3800

# Unplug one array, flash the other:
sudo dfu-util -R -e -a 1 -D /tmp/xvf3800/xmos_firmwares/usb/respeaker_xvf3800_usb_dfu_firmware_v2.0.7.bin

# Swap and repeat
```

Verify: `arecord -l` should show two capture devices.
### Edge TPU Runtime

The packaged `libedgetpu` from Google's apt repo is ABI-incompatible with `ai-edge-litert` on Debian Trixie / Python 3.13. A custom build is required:

```bash
# Install build deps
sudo apt install libabsl-dev libflatbuffers-dev libusb-1.0-0-dev binutils-gold cmake

# Clone sources
cd /tmp
git clone --depth 1 https://github.com/google-coral/libedgetpu.git
git clone --depth 1 --branch v2.16.1 https://github.com/tensorflow/tensorflow.git
git clone --depth 1 --branch v23.5.26 https://github.com/google/flatbuffers.git flatbuffers-23

# Build flatc v23
cd /tmp/flatbuffers-23 && cmake -B build -DFLATBUFFERS_BUILD_TESTS=OFF && cmake --build build -j4 -- flatc

# Patch libedgetpu Makefile (see below), then:
cd /tmp/libedgetpu
TFROOT=/tmp/tensorflow make -f makefile_build/Makefile -j4 libedgetpu

# Install
sudo cp out/direct/k8/libedgetpu.so.1.0 /usr/lib/aarch64-linux-gnu/libedgetpu.so.1.0
sudo ldconfig
```
Makefile patches required (TF 2.16 moved files):

- Replace `FLATC=flatc` with `FLATC=/tmp/flatbuffers-23/build/flatc`
- Add `/tmp/flatbuffers-23/include` to `LIBEDGETPU_INCLUDES`
- Add `-Wno-return-type` to `LIBEDGETPU_CXXFLAGS`
- Remove `$(TFROOT)/tensorflow/lite/c/common.c` from `LIBEDGETPU_CSRCS`
- Add `$(TFROOT)/tensorflow/lite/core/c/common.cc` and `$(TFROOT)/tensorflow/lite/array.cc` to `LIBEDGETPU_CCSRCS`
- Add `-labsl_bad_optional_access` to `LIBEDGETPU_LDFLAGS`

A backup of the working binary is saved at `~/headmic/libedgetpu.so.1.0.custom`.
### Python Setup

```bash
cd /home/alex/headmic
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/pip install setuptools   # Python 3.13 compatibility
.venv/bin/pip install resemblyzer  # Speaker ID (pulls PyTorch)
```
### Learn Mic Array Positions

Both arrays must be plugged in. This lights up one array at a time and asks you to confirm left/right:

```bash
sudo .venv/bin/python headmic.py --learn
```

Config is saved to `~/.vixy/headmic.json` with USB serial numbers for stable identification.
### Install Service

```bash
sudo cp headmic.service /etc/systemd/system/
# Edit to add your PORCUPINE_ACCESS_KEY:
sudo nano /etc/systemd/system/headmic.service
sudo systemctl daemon-reload
sudo systemctl enable headmic
sudo systemctl start headmic
```
## API Endpoints

### Core

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service info |
| `/health` | GET | Health check (listening, recording, features enabled) |
| `/status` | GET | Current state (transcription, scene, speaker, active side) |
| `/last` | GET | Last transcription + timestamp |
### Spatial

| Endpoint | Method | Description |
|---|---|---|
| `/doa` | GET | DoA + triangulated position + ILD + ITD + gaze + distance + proximity |
| `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
| `/speakers/tracked` | GET | Multi-speaker positions, beam mode, lock state, target |
| `/speakers/focus` | POST | Switch cocktail party attention (query: `speaker=0|1`) |
| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
| `/scene/events` | GET | Recent sound events with what + where + when (query: `seconds`, `category`) |
| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
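The proximity field that `/doa` returns buckets the estimated distance into the four zones named in the features table. The thresholds below are illustrative guesses, not the cutoffs actually used in `spatial.py`:

```python
def proximity_zone(distance_m: float) -> str:
    """Map an estimated distance to a proximity zone.
    Threshold values are hypothetical; spatial.py's may differ."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    if distance_m < 5.0:
        return "across_room"
    return "far"
```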
### Sound

| Endpoint | Method | Description |
|---|---|---|
| `/sounds` | GET | Current audio scene (category, top 5 classes, speaker) |
| `/sounds/history` | GET | Classification history (last N seconds) |
### Speakers

| Endpoint | Method | Description |
|---|---|---|
| `/speakers` | GET | List all speakers (enrolled + anonymous) |
| `/speakers/enroll` | POST | Enroll from uploaded audio (multipart: `name` + WAV) |
| `/speakers/enroll-from-mic` | POST | Record 5s from mic + enroll (query: `name`) |
| `/speakers/promote` | POST | Promote anonymous → enrolled (query: `anon_id`, `name`) |
| `/speakers/{name}` | DELETE | Remove a speaker |
### Recording

| Endpoint | Method | Description |
|---|---|---|
| `/recording` | GET | Binaural recording stats |
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PORCUPINE_ACCESS_KEY` | (none) | Picovoice API key for wake word |
| `WAKE_WORD_PATH` | `~/headmic/Hey-Vivi_*.ppn` | Wake word model path |
| `EARTAIL_URL` | `http://bigorin.local:8764` | Transcription service |
| `EYE_SERVICE_URL` | `http://localhost:8780` | Eye service for gaze push |
| `BINAURAL_RECORD` | `0` | Set to `1` to enable stereo recording |
| `BINAURAL_DIR` | `~/headmic/recordings` | Output directory for WAV segments |
### Config File (`~/.vixy/headmic.json`)

```json
{
  "ears": {
    "left": {"usb_serial": "101991441254500541", "alsa_card": "Array"},
    "right": {"usb_serial": "101991441254500556", "alsa_card": "Array_1"}
  },
  "array_separation_mm": 175.0
}
```
## Speaker Identification

Three-tier recognition using Resemblyzer 256-dim GE2E embeddings:

| Tier | Name format | How it works |
|---|---|---|
| Enrolled | `"Alex"` | Matched against stored embeddings (cosine ≥ 0.75) |
| Anonymous | `"unknown_bfa1"` | Clustered online from unrecognized voices (cosine ≥ 0.70) |
| Unidentified | `null` | Audio too short or no speech detected |
Anonymous speakers get a stable 4-character hex ID derived from their voice embedding. The same person consistently gets the same ID across observations. IDs expire after 1 hour of silence, max 10 tracked simultaneously.
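One way to get a stable 4-hex-char ID from a voice embedding is to hash a quantized cluster centroid. This is a hypothetical sketch of the idea; `speaker_id.py`'s actual derivation may differ:

```python
import hashlib
import numpy as np

def anon_speaker_id(centroid: np.ndarray) -> str:
    """Derive a stable "unknown_XXXX" label from a voice-cluster centroid.
    Hypothetical sketch: the point is that hashing the quantized embedding
    yields the same 4-hex-char ID for the same cluster every time."""
    quantized = np.round(centroid.astype(np.float32), 2).tobytes()
    return "unknown_" + hashlib.sha1(quantized).hexdigest()[:4]
```

Because the hash is taken over the cluster centroid rather than each raw utterance, small per-observation variation in the embedding does not change the ID.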
Workflow:

```
Unknown person speaks → "unknown_bfa1" (auto-created)
        ↓
You ask "who's that?" → check /speakers
        ↓
curl -X POST "http://head:8446/speakers/promote?anon_id=unknown_bfa1&name=Bob"
        ↓
Now recognized as "Bob" going forward (embedding saved to voices.db)
```

Alternatively, enroll directly from the mic:

```bash
curl -X POST "http://head:8446/speakers/enroll-from-mic?name=Alex"
# Speak for 5 seconds
```
## LED States
| State | Effect | Color |
|---|---|---|
| Idle | Off | — |
| Wake word detected | Solid | White (flash) |
| Listening/Recording | DoA indicator | Cyan |
| Processing | Breath | Purple |
| Enrolling speaker | Solid | Orange |
## File Structure

```
headmic/
├── headmic.py            # Main FastAPI service
├── audio_stream.py       # Dual arecord streams + best-beam selection
├── spatial.py            # 3-signal fusion (DoA + ILD + ITD) + gaze + proximity
├── spatial_scene.py      # Spatial audio scene map + anomaly detection
├── multi_speaker.py      # Multi-speaker tracking + beam steering + cocktail party
├── xvf3800.py            # USB vendor control (DoA + LEDs + beam steering)
├── sound_id.py           # YAMNet sound classification (CPU/Edge TPU)
├── speaker_id.py         # Resemblyzer speaker identification
├── binaural_recorder.py  # Stereo WAV recording from both ears
├── headmic.service       # systemd service file
├── requirements.txt      # Python dependencies
├── BINAURAL_ROADMAP.md   # Roadmap for binaural features
├── models/
│   ├── yamnet.tflite          # YAMNet CPU model
│   ├── yamnet_edgetpu.tflite  # YAMNet Edge TPU model
│   └── yamnet_class_map.csv   # 521 class names
└── voices.db             # Speaker embeddings (SQLite, runtime)
```
## XVF3800 USB Control Protocol

Commands use USB vendor control transfers: `wValue = cmdid`, `wIndex = resid`.

Key findings during development:

- Payload format: single bytes for effects (`bytes([3])`), not packed uint32
- Color format: `[R, G, B, 0]` (4 bytes)
- Read responses have a 1-byte status header before data
- Read `wLength` must be `count * type_size + 1` (exact, not rounded up)
- `DOA_VALUE` (resid=20, cmdid=18) is sluggish/cached — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35, cmdid=11) for real-time tracking
- `AUDIO_MGR_SELECTED_AZIMUTHS` returns 2 floats (radians): index 0 = processed DoA (NaN = no speech = VAD indicator), index 1 = auto-select beam (always tracks strongest source)
- `AEC_SPENERGY_VALUES` (resid=33, cmdid=80) is always zero on 2-channel firmware — don't rely on it
- 2-channel firmware only — 6-channel firmware silently ignores LED/control commands
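The payload and `wLength` rules above can be captured as small helpers. Only those two rules come from the findings list; the pyusb call in the comment (bRequest value and request-type byte) is an assumption, not verified against `xvf3800.py`:

```python
def led_color_payload(r: int, g: int, b: int) -> bytes:
    """Colors are sent as [R, G, B, 0] — exactly 4 bytes."""
    return bytes([r, g, b, 0])

def read_wlength(count: int, type_size: int) -> int:
    """Reads carry a 1-byte status header before the data, so wLength
    must be exactly count * type_size + 1 (not rounded up)."""
    return count * type_size + 1

# With pyusb, a write might look like this (assumptions: bRequest=0,
# bmRequestType 0x40 for host-to-device vendor transfers):
#   import usb.core
#   dev = usb.core.find(idVendor=0x2886, idProduct=0x001A)
#   dev.ctrl_transfer(0x40, 0, wValue=cmdid, wIndex=resid,
#                     data_or_wLength=led_color_payload(0, 255, 255))
```

For example, reading `AUDIO_MGR_SELECTED_AZIMUTHS` (2 float32 values) needs a `wLength` of 2 × 4 + 1 = 9 bytes.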
---

Built by Vixy on Day 77 (January 17, 2026)
Upgraded to dual XVF3800 binaural hearing on Day 161 (April 2026)
Full binaural suite (10/12 features) built Day 162

*"Hey Vivi" — the words that summon me 💜*