# HeadMic - Vixy's Ears 🦊👂

Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial tracking, sound classification, speaker identification, and wake word detection.

- **Hardware:** 2× ReSpeaker XVF3800 4-Mic Array (USB, left/right ear)
- **Wake word:** "Hey Vivi" (Picovoice Porcupine)
- **Runs on:** Raspberry Pi 5 (head-vixy.local)

## Architecture

```
[Left XVF3800]──┐                          [Right XVF3800]──┐
  4 mics, DoA   │                            4 mics, DoA    │
  WS2812 LEDs   │                            WS2812 LEDs    │
                ▼                                            ▼
        arecord (16kHz mono)                         arecord (16kHz mono)
                │                                            │
                └────────────┬───────────────────────────────┘
                             ▼
                  DualAudioStream (audio_stream.py)
                  best-beam selection (energy-based, 10% hysteresis)
                             │
          ┌──────────────────┼──────────────────────┐
          ▼                  ▼                      ▼
   Porcupine            YAMNet                 Binaural
   wake word            (Edge TPU)             Recorder
   "Hey Vivi"           521 classes            stereo WAV
          ▼                  ▼
   Record +             Speaker ID
   Transcribe           (Resemblyzer)
   via EarTail               │
                             ▼
                  Spatial Tracker (spatial.py)
                  DoA → triangulation → ILD distance
                  → smooth gaze → proximity zones
                             │
                ┌────────────┼────────────────┐
                ▼            ▼                ▼
         Eye Service    Spatial Scene    USB Control
         POST /gaze     (spatial_scene)  (xvf3800.py)
         eyes follow    what+where map   LEDs + DoA
         the speaker    anomaly detect   per-array
```
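
The best-beam stage compares the short-term energy of the two ears and only switches sides when the other ear is clearly louder, so the selection doesn't flap between arrays. A minimal sketch of that rule, with hypothetical names (not the actual audio_stream.py API):

```python
# Illustrative sketch of energy-based best-beam selection with 10% hysteresis.
# Class and method names are hypothetical, not the actual audio_stream.py API.
import numpy as np

class BestBeamSelector:
    def __init__(self, hysteresis: float = 0.10):
        self.hysteresis = hysteresis
        self.current = "left"

    def select(self, left_chunk: np.ndarray, right_chunk: np.ndarray) -> str:
        """Pick the ear to feed downstream consumers for this audio chunk."""
        left_e = np.sqrt(np.mean(left_chunk.astype(np.float64) ** 2))
        right_e = np.sqrt(np.mean(right_chunk.astype(np.float64) ** 2))
        # Only switch when the other ear is >10% louder, to avoid flapping
        if self.current == "left" and right_e > left_e * (1 + self.hysteresis):
            self.current = "right"
        elif self.current == "right" and left_e > right_e * (1 + self.hysteresis):
            self.current = "left"
        return self.current
```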

## Features

| Feature | Module | Hardware | Status |
|---|---|---|---|
| Wake word detection | Porcupine | CPU | Needs Picovoice key |
| Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
| Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrolled + anonymous tracking |
| Spatial tracking | spatial.py | USB control | 3-signal fusion: DoA + ILD + ITD |
| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
| ITD processing | spatial.py | audio cross-correlation | Sub-ms delay → bearing angle |
| Multi-speaker tracking | multi_speaker.py | XVF3800 beam steering | 2 simultaneous speakers, auto beam lock |
| Cocktail party filtering | multi_speaker.py + audio_stream.py | beam gating + focus | Target speaker isolation |
| Spatial scene mapping | spatial_scene.py | | Learns where sounds come from, anomaly detection |
| Sound event localization | spatial_scene.py | | What + where + when log |
| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based or focused attention |
| LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
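
The ITD row is the most math-heavy of these: a sound off to one side arrives at the far ear a fraction of a millisecond late, and the lag recovered by cross-correlating the two ears converts to a bearing via sin(θ) = ITD · c / d. A hedged sketch of that computation (not the spatial.py implementation; the sign convention is illustrative, but the constants match the hardware described here):

```python
# Illustrative ITD estimate via cross-correlation; not the spatial.py code.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
EAR_SEPARATION = 0.175   # m, matches array_separation_mm in the config
SAMPLE_RATE = 16000

def itd_bearing(left: np.ndarray, right: np.ndarray) -> float:
    """Return bearing in degrees (0 = straight ahead; sign is illustrative)."""
    max_lag = int(SAMPLE_RATE * EAR_SEPARATION / SPEED_OF_SOUND) + 1
    corr = np.correlate(left, right, mode="full")
    mid = len(corr) // 2
    window = corr[mid - max_lag : mid + max_lag + 1]
    lag = int(np.argmax(window)) - max_lag   # lag in samples
    itd = lag / SAMPLE_RATE                  # seconds (sub-ms)
    # Convert delay to bearing: sin(theta) = itd * c / d, clamped to [-1, 1]
    s = max(-1.0, min(1.0, itd * SPEED_OF_SOUND / EAR_SEPARATION))
    return float(np.degrees(np.arcsin(s)))
```

At 16 kHz and 175 mm separation the maximum lag is only about 8 samples, which is why the service fuses ITD with DoA and ILD rather than relying on it alone.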

## Installation

### Prerequisites

```bash
# On head-vixy (Raspberry Pi 5, Debian Trixie)
sudo apt install python3-dev portaudio19-dev alsa-utils

# USB permissions for XVF3800
sudo tee /etc/udev/rules.d/99-respeaker.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="2886", ATTR{idProduct}=="001a", MODE="0666"
EOF

# USB permissions for Coral Edge TPU
sudo tee /etc/udev/rules.d/99-coral.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="1a6e", ATTR{idProduct}=="089a", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", ATTR{idProduct}=="9302", MODE="0666"
EOF

sudo udevadm control --reload-rules && sudo udevadm trigger
```

### XVF3800 Firmware

Both arrays must be flashed with the 2-channel USB firmware (not 6-channel — the 6ch firmware breaks LED/DoA control commands):

```bash
git clone https://github.com/respeaker/reSpeaker_XVF3800_USB_4MIC_ARRAY.git /tmp/xvf3800
# Unplug one array, flash the other:
sudo dfu-util -R -e -a 1 -D /tmp/xvf3800/xmos_firmwares/usb/respeaker_xvf3800_usb_dfu_firmware_v2.0.7.bin
# Swap and repeat
```

Verify: `arecord -l` should show two capture devices.

### Edge TPU Runtime

The packaged libedgetpu from Google's apt repo is ABI-incompatible with ai-edge-litert on Debian Trixie / Python 3.13. A custom build is required:

```bash
# Install build deps
sudo apt install libabsl-dev libflatbuffers-dev libusb-1.0-0-dev binutils-gold cmake

# Clone sources
cd /tmp
git clone --depth 1 https://github.com/google-coral/libedgetpu.git
git clone --depth 1 --branch v2.16.1 https://github.com/tensorflow/tensorflow.git
git clone --depth 1 --branch v23.5.26 https://github.com/google/flatbuffers.git flatbuffers-23

# Build flatc v23
cd /tmp/flatbuffers-23 && cmake -B build -DFLATBUFFERS_BUILD_TESTS=OFF && cmake --build build -j4 -- flatc

# Patch libedgetpu Makefile (see below), then:
cd /tmp/libedgetpu
TFROOT=/tmp/tensorflow make -f makefile_build/Makefile -j4 libedgetpu

# Install
sudo cp out/direct/k8/libedgetpu.so.1.0 /usr/lib/aarch64-linux-gnu/libedgetpu.so.1.0
sudo ldconfig
```

Makefile patches required (TF 2.16 moved files):

- Replace `FLATC=flatc` with `FLATC=/tmp/flatbuffers-23/build/flatc`
- Add `/tmp/flatbuffers-23/include` to `LIBEDGETPU_INCLUDES`
- Add `-Wno-return-type` to `LIBEDGETPU_CXXFLAGS`
- Remove `$(TFROOT)/tensorflow/lite/c/common.c` from `LIBEDGETPU_CSRCS`
- Add `$(TFROOT)/tensorflow/lite/core/c/common.cc` and `$(TFROOT)/tensorflow/lite/array.cc` to `LIBEDGETPU_CCSRCS`
- Add `-labsl_bad_optional_access` to `LIBEDGETPU_LDFLAGS`

A backup of the working binary is saved at ~/headmic/libedgetpu.so.1.0.custom.

### Python Setup

```bash
cd /home/alex/headmic
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/pip install setuptools  # Python 3.13 compatibility
.venv/bin/pip install resemblyzer  # Speaker ID (pulls PyTorch)
```

### Learn Mic Array Positions

Both arrays must be plugged in. This lights up one array at a time and asks you to confirm left/right:

```bash
sudo .venv/bin/python headmic.py --learn
```

Config saved to `~/.vixy/headmic.json` with USB serial numbers for stable identification.

### Install Service

```bash
sudo cp headmic.service /etc/systemd/system/
# Edit to add your PORCUPINE_ACCESS_KEY:
sudo nano /etc/systemd/system/headmic.service
sudo systemctl daemon-reload
sudo systemctl enable headmic
sudo systemctl start headmic
```

## API Endpoints

### Core

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service info |
| `/health` | GET | Health check (listening, recording, features enabled) |
| `/status` | GET | Current state (transcription, scene, speaker, active side) |
| `/last` | GET | Last transcription + timestamp |
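
The core endpoints are plain JSON over HTTP, so scripting against them is trivial. A quick example, assuming the service runs on head-vixy.local:8446 (the host named at the top and the port used in the curl examples further down):

```python
# Hypothetical polling example; host and port are assumptions as noted above.
import requests

BASE = "http://head-vixy.local:8446"

health = requests.get(f"{BASE}/health", timeout=2).json()
status = requests.get(f"{BASE}/status", timeout=2).json()
print(health)
print(status)
```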

### Spatial

| Endpoint | Method | Description |
|---|---|---|
| `/doa` | GET | DoA + triangulated position + ILD + ITD + gaze + distance + proximity |
| `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
| `/speakers/tracked` | GET | Multi-speaker positions, beam mode, lock state, target |
| `/speakers/focus` | POST | Switch cocktail party attention (query: `speaker=0\|1`) |
| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
| `/scene/events` | GET | Recent sound events with what + where + when (query: `seconds`, `category`) |
| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
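
Switching cocktail-party focus is a single POST. A sketch, assuming the same base URL as the core example above:

```python
import requests

BASE = "http://head-vixy.local:8446"

# Point cocktail-party attention at tracked speaker 1 (query: speaker=0|1)
requests.post(f"{BASE}/speakers/focus", params={"speaker": 1}, timeout=2)

# Where are both tracked speakers right now?
print(requests.get(f"{BASE}/speakers/tracked", timeout=2).json())
```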

### Sound

| Endpoint | Method | Description |
|---|---|---|
| `/sounds` | GET | Current audio scene (category, top 5 classes, speaker) |
| `/sounds/history` | GET | Classification history (last N seconds) |
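
Both return JSON. For example, to watch the live audio scene (same assumed base URL; the history query parameter name is a guess):

```python
import requests

BASE = "http://head-vixy.local:8446"

# Current scene: category, top 5 classes, detected speaker
print(requests.get(f"{BASE}/sounds", timeout=2).json())

# Recent classification history; the "seconds" query name is an assumption
print(requests.get(f"{BASE}/sounds/history", params={"seconds": 30}, timeout=2).json())
```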

### Speakers

| Endpoint | Method | Description |
|---|---|---|
| `/speakers` | GET | List all speakers (enrolled + anonymous) |
| `/speakers/enroll` | POST | Enroll from uploaded audio (multipart: name + WAV) |
| `/speakers/enroll-from-mic` | POST | Record 5s from mic + enroll (query: `name`) |
| `/speakers/promote` | POST | Promote anonymous → enrolled (query: `anon_id`, `name`) |
| `/speakers/{name}` | DELETE | Remove a speaker |
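
Enrollment from a file is a multipart upload. A hedged example (the form field names `name` and `file` are assumptions based on the table, as is the base URL):

```python
# Hedged example of POST /speakers/enroll; the multipart field names
# ("name" as form data, "file" as the WAV part) are assumptions.
import requests

with open("alex_sample.wav", "rb") as wav:
    r = requests.post(
        "http://head-vixy.local:8446/speakers/enroll",
        data={"name": "Alex"},
        files={"file": ("alex_sample.wav", wav, "audio/wav")},
    )
print(r.status_code, r.json())
```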

### Recording

| Endpoint | Method | Description |
|---|---|---|
| `/recording` | GET | Binaural recording stats |

## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PORCUPINE_ACCESS_KEY` | (none) | Picovoice API key for wake word |
| `WAKE_WORD_PATH` | `~/headmic/Hey-Vivi_*.ppn` | Wake word model path |
| `EARTAIL_URL` | `http://bigorin.local:8764` | Transcription service |
| `EYE_SERVICE_URL` | `http://localhost:8780` | Eye service for gaze push |
| `BINAURAL_RECORD` | `0` | Set to `1` to enable stereo recording |
| `BINAURAL_DIR` | `~/headmic/recordings` | Output directory for WAV segments |

### Config File (`~/.vixy/headmic.json`)

```json
{
  "ears": {
    "left": {"usb_serial": "101991441254500541", "alsa_card": "Array"},
    "right": {"usb_serial": "101991441254500556", "alsa_card": "Array_1"}
  },
  "array_separation_mm": 175.0
}
```

## Speaker Identification

Three-tier recognition using Resemblyzer 256-dim GE2E embeddings:

| Tier | Name format | How it works |
|---|---|---|
| Enrolled | `"Alex"` | Matched against stored embeddings (cosine ≥ 0.75) |
| Anonymous | `"unknown_bfa1"` | Clustered online from unrecognized voices (cosine ≥ 0.70) |
| Unidentified | `null` | Audio too short or no speech detected |

Anonymous speakers get a stable 4-character hex ID derived from their voice embedding. The same person consistently gets the same ID across observations. IDs expire after 1 hour of silence, max 10 tracked simultaneously.
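
In code terms the tier decision is roughly the following. This is a sketch of the matching logic, not the actual speaker_id.py; the thresholds come from the table above, and the ID derivation shown is one plausible way to get a stable 4-hex tag (the real one may differ):

```python
# Sketch of the three-tier decision over Resemblyzer-style 256-dim embeddings.
import hashlib
import numpy as np

ENROLLED_THRESHOLD = 0.75
ANON_THRESHOLD = 0.70

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(emb, enrolled: dict[str, np.ndarray], anonymous: dict[str, np.ndarray]):
    if emb is None:                      # too short / no speech -> unidentified
        return None
    # Tier 1: enrolled speakers
    name, sim = max(((n, cosine(emb, ref)) for n, ref in enrolled.items()),
                    key=lambda x: x[1], default=(None, 0.0))
    if sim >= ENROLLED_THRESHOLD:
        return name
    # Tier 2: existing anonymous clusters
    for anon_id, ref in anonymous.items():
        if cosine(emb, ref) >= ANON_THRESHOLD:
            return anon_id
    # Tier 3: new anonymous speaker; ID derived from the founding embedding,
    # then reused via cluster matching, so the same voice keeps the same tag
    anon_id = "unknown_" + hashlib.sha1(emb.tobytes()).hexdigest()[:4]
    anonymous[anon_id] = emb
    return anon_id
```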

Workflow:

```
Unknown person speaks → "unknown_bfa1" (auto-created)
    ↓
You ask "who's that?" → check /speakers
    ↓
curl -X POST "http://head:8446/speakers/promote?anon_id=unknown_bfa1&name=Bob"
    ↓
Now recognized as "Bob" going forward (embedding saved to voices.db)
```

Alternatively, enroll directly from mic:

```bash
curl -X POST "http://head:8446/speakers/enroll-from-mic?name=Alex"
# Speak for 5 seconds
```

## LED States

| State | Effect | Color |
|---|---|---|
| Idle | Off | |
| Wake word detected | Solid | White (flash) |
| Listening/Recording | DoA indicator | Cyan |
| Processing | Breath | Purple |
| Enrolling speaker | Solid | Orange |

## File Structure

```
headmic/
├── headmic.py              # Main FastAPI service
├── audio_stream.py         # Dual arecord streams + best-beam selection
├── spatial.py              # 3-signal fusion (DoA + ILD + ITD) + gaze + proximity
├── spatial_scene.py        # Spatial audio scene map + anomaly detection
├── multi_speaker.py        # Multi-speaker tracking + beam steering + cocktail party
├── xvf3800.py              # USB vendor control (DoA + LEDs + beam steering)
├── sound_id.py             # YAMNet sound classification (CPU/Edge TPU)
├── speaker_id.py           # Resemblyzer speaker identification
├── binaural_recorder.py    # Stereo WAV recording from both ears
├── headmic.service         # systemd service file
├── requirements.txt        # Python dependencies
├── BINAURAL_ROADMAP.md     # Roadmap for binaural features
├── models/
│   ├── yamnet.tflite       # YAMNet CPU model
│   ├── yamnet_edgetpu.tflite  # YAMNet Edge TPU model
│   └── yamnet_class_map.csv   # 521 class names
└── voices.db               # Speaker embeddings (SQLite, runtime)
```

## XVF3800 USB Control Protocol

Commands use USB vendor control transfers: `wValue` = cmdid, `wIndex` = resid.

Key findings during development:

- Payload format: single bytes for effects (`bytes([3])`), not packed uint32
- Color format: `[R, G, B, 0]` (4 bytes)
- Read responses have a 1-byte status header before data
- Read `wLength` must be `count * type_size + 1` (exact, not rounded up)
- `DOA_VALUE` (resid=20, cmdid=18) is sluggish/cached — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35, cmdid=11) for real-time tracking
- `AUDIO_MGR_SELECTED_AZIMUTHS` returns 2 floats (radians): index 0 = processed DoA (NaN = no speech = VAD indicator), index 1 = auto-select beam (always tracks strongest source)
- `AEC_SPENERGY_VALUES` (resid=33, cmdid=80) is always zero on 2-channel firmware — don't rely on it
- 2-channel firmware only — 6-channel firmware silently ignores LED/control commands
- See the hedged read sketch below for how these pieces fit together
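
Putting those findings together, a hedged pyusb sketch of the real-time azimuth read (`bRequest=0` is an assumption; the resid/cmdid values, the `wLength` rule, and the status header are from the notes above):

```python
# Hedged pyusb sketch of the read path described above; not the xvf3800.py code.
import math
import struct

import usb.core

dev = usb.core.find(idVendor=0x2886, idProduct=0x001A)  # first XVF3800 found

# AUDIO_MGR_SELECTED_AZIMUTHS: resid=35, cmdid=11, two float32 values.
# wLength must be count * type_size + 1; byte 0 is the status header.
COUNT, FSIZE = 2, 4
raw = dev.ctrl_transfer(0xC0, 0, wValue=11, wIndex=35,
                        data_or_wLength=COUNT * FSIZE + 1)
doa, auto_beam = struct.unpack("<2f", bytes(raw[1:]))   # radians

if math.isnan(doa):
    print("no speech (NaN doubles as the VAD indicator)")
else:
    print(f"speech at {math.degrees(doa):.0f} deg; beam at {math.degrees(auto_beam):.0f} deg")
```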

Built by Vixy on Day 77 (January 17, 2026)

Upgraded to dual XVF3800 binaural hearing on Day 161 (April 2026)

Full binaural suite (10/12 features) built Day 162

"Hey Vivi" — the words that summon me 💜