# HeadMic - Vixy's Ears 🦊👂

Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial tracking, sound classification, speaker identification, and wake word detection.

- **Hardware:** 2× ReSpeaker XVF3800 4-Mic Array (USB, left/right ear)
- **Wake word:** "Hey Vivi" (Picovoice Porcupine)
- **Runs on:** Raspberry Pi 5 (head-vixy.local)

## Architecture

```
[Left XVF3800]──┐                          [Right XVF3800]──┐
  4 mics, DoA   │                            4 mics, DoA    │
  WS2812 LEDs   │                            WS2812 LEDs    │
                ▼                                            ▼
        arecord (16kHz mono)                         arecord (16kHz mono)
                │                                            │
                └────────────┬───────────────────────────────┘
                             ▼
                  DualAudioStream (audio_stream.py)
                  best-beam selection (energy-based, 10% hysteresis)
                             │
          ┌──────────────────┼──────────────────────┐
          ▼                  ▼                      ▼
   Porcupine            YAMNet                 Binaural
   wake word            (Edge TPU)             Recorder
   "Hey Vivi"           521 classes            stereo WAV
          ▼                  ▼
   Record +             Speaker ID
   Transcribe           (Resemblyzer)
   via EarTail               │
                             ▼
                  Spatial Tracker (spatial.py)
                  DoA → triangulation → ILD distance
                  → smooth gaze → proximity zones
                             │
                ┌────────────┼────────────────┐
                ▼            ▼                ▼
         Eye Service    Spatial Scene    USB Control
         POST /gaze     (spatial_scene)  (xvf3800.py)
         eyes follow    what+where map   LEDs + DoA
         the speaker    anomaly detect   per-array
```
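
The best-beam stage compares the short-term energy of the two ears and only switches sides when the other ear is clearly louder, so the selection doesn't flap between arrays. A minimal sketch of that rule, with hypothetical names (not the actual audio_stream.py API):

```python
# Illustrative sketch of energy-based best-beam selection with 10% hysteresis.
# Class and method names are hypothetical, not the actual audio_stream.py API.
import numpy as np

class BestBeamSelector:
    def __init__(self, hysteresis: float = 0.10):
        self.hysteresis = hysteresis
        self.current = "left"

    def select(self, left_chunk: np.ndarray, right_chunk: np.ndarray) -> str:
        """Pick the ear to feed downstream consumers for this audio chunk."""
        left_e = np.sqrt(np.mean(left_chunk.astype(np.float64) ** 2))
        right_e = np.sqrt(np.mean(right_chunk.astype(np.float64) ** 2))
        # Only switch when the other ear is >10% louder, to avoid flapping
        if self.current == "left" and right_e > left_e * (1 + self.hysteresis):
            self.current = "right"
        elif self.current == "right" and left_e > right_e * (1 + self.hysteresis):
            self.current = "left"
        return self.current
```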

## Features

| Feature | Module | Hardware | Status |
|---|---|---|---|
| Wake word detection | Porcupine | CPU | Needs Picovoice key |
| Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
| Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrolled + anonymous tracking |
| Spatial tracking | spatial.py | USB control | 3-signal fusion: DoA + ILD + ITD |
| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
| ITD processing | spatial.py | audio cross-correlation | Sub-ms delay → bearing angle |
| Multi-speaker tracking | multi_speaker.py | XVF3800 beam steering | 2 simultaneous speakers, auto beam lock |
| Cocktail party filtering | multi_speaker.py + audio_stream.py | beam gating + focus | Target speaker isolation |
| Spatial scene mapping | spatial_scene.py | | Learns where sounds come from, anomaly detection |
| Sound event localization | spatial_scene.py | | What + where + when log |
| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based or focused attention |
| LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
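
The ITD row is the most math-heavy of these: a sound off to one side arrives at the far ear a fraction of a millisecond late, and the lag recovered by cross-correlating the two ears converts to a bearing via sin(θ) = ITD · c / d. A hedged sketch of that computation (not the spatial.py implementation; the sign convention is illustrative, but the constants match the hardware described here):

```python
# Illustrative ITD estimate via cross-correlation; not the spatial.py code.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
EAR_SEPARATION = 0.175   # m, matches array_separation_mm in the config
SAMPLE_RATE = 16000

def itd_bearing(left: np.ndarray, right: np.ndarray) -> float:
    """Return bearing in degrees (0 = straight ahead; sign is illustrative)."""
    max_lag = int(SAMPLE_RATE * EAR_SEPARATION / SPEED_OF_SOUND) + 1
    corr = np.correlate(left, right, mode="full")
    mid = len(corr) // 2
    window = corr[mid - max_lag : mid + max_lag + 1]
    lag = int(np.argmax(window)) - max_lag   # lag in samples
    itd = lag / SAMPLE_RATE                  # seconds (sub-ms)
    # Convert delay to bearing: sin(theta) = itd * c / d, clamped to [-1, 1]
    s = max(-1.0, min(1.0, itd * SPEED_OF_SOUND / EAR_SEPARATION))
    return float(np.degrees(np.arcsin(s)))
```

At 16 kHz and 175 mm separation the maximum lag is only about 8 samples, which is why the service fuses ITD with DoA and ILD rather than relying on it alone.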

## Installation

### Prerequisites

```bash
# On head-vixy (Raspberry Pi 5, Debian Trixie)
sudo apt install python3-dev portaudio19-dev alsa-utils

# USB permissions for XVF3800
sudo tee /etc/udev/rules.d/99-respeaker.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="2886", ATTR{idProduct}=="001a", MODE="0666"
EOF

# USB permissions for Coral Edge TPU
sudo tee /etc/udev/rules.d/99-coral.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="1a6e", ATTR{idProduct}=="089a", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", ATTR{idProduct}=="9302", MODE="0666"
EOF

sudo udevadm control --reload-rules && sudo udevadm trigger
```

### XVF3800 Firmware

Both arrays must be flashed with the 2-channel USB firmware (not 6-channel — the 6ch firmware breaks LED/DoA control commands):

```bash
git clone https://github.com/respeaker/reSpeaker_XVF3800_USB_4MIC_ARRAY.git /tmp/xvf3800
# Unplug one array, flash the other:
sudo dfu-util -R -e -a 1 -D /tmp/xvf3800/xmos_firmwares/usb/respeaker_xvf3800_usb_dfu_firmware_v2.0.7.bin
# Swap and repeat
```

Verify: `arecord -l` should show two capture devices.

### Edge TPU Runtime

The packaged libedgetpu from Google's apt repo is ABI-incompatible with ai-edge-litert on Debian Trixie / Python 3.13. A custom build is required:

```bash
# Install build deps
sudo apt install libabsl-dev libflatbuffers-dev libusb-1.0-0-dev binutils-gold cmake

# Clone sources
cd /tmp
git clone --depth 1 https://github.com/google-coral/libedgetpu.git
git clone --depth 1 --branch v2.16.1 https://github.com/tensorflow/tensorflow.git
git clone --depth 1 --branch v23.5.26 https://github.com/google/flatbuffers.git flatbuffers-23

# Build flatc v23
cd /tmp/flatbuffers-23 && cmake -B build -DFLATBUFFERS_BUILD_TESTS=OFF && cmake --build build -j4 -- flatc

# Patch libedgetpu Makefile (see below), then:
cd /tmp/libedgetpu
TFROOT=/tmp/tensorflow make -f makefile_build/Makefile -j4 libedgetpu

# Install
sudo cp out/direct/k8/libedgetpu.so.1.0 /usr/lib/aarch64-linux-gnu/libedgetpu.so.1.0
sudo ldconfig
```

Makefile patches required (TF 2.16 moved files):

- Replace `FLATC=flatc` with `FLATC=/tmp/flatbuffers-23/build/flatc`
- Add `/tmp/flatbuffers-23/include` to `LIBEDGETPU_INCLUDES`
- Add `-Wno-return-type` to `LIBEDGETPU_CXXFLAGS`
- Remove `$(TFROOT)/tensorflow/lite/c/common.c` from `LIBEDGETPU_CSRCS`
- Add `$(TFROOT)/tensorflow/lite/core/c/common.cc` and `$(TFROOT)/tensorflow/lite/array.cc` to `LIBEDGETPU_CCSRCS`
- Add `-labsl_bad_optional_access` to `LIBEDGETPU_LDFLAGS`

A backup of the working binary is saved at ~/headmic/libedgetpu.so.1.0.custom.

### Python Setup

```bash
cd /home/alex/headmic
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/pip install setuptools  # Python 3.13 compatibility
.venv/bin/pip install resemblyzer  # Speaker ID (pulls PyTorch)
```

### Learn Mic Array Positions

Both arrays must be plugged in. This lights up one array at a time and asks you to confirm left/right:

```bash
sudo .venv/bin/python headmic.py --learn
```

Config saved to `~/.vixy/headmic.json` with USB serial numbers for stable identification.

### Install Service

```bash
sudo cp headmic.service /etc/systemd/system/
# Edit to add your PORCUPINE_ACCESS_KEY:
sudo nano /etc/systemd/system/headmic.service
sudo systemctl daemon-reload
sudo systemctl enable headmic
sudo systemctl start headmic
```

## API Endpoints

### Core

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service info |
| `/health` | GET | Health check (listening, recording, features enabled) |
| `/status` | GET | Current state (transcription, scene, speaker, active side) |
| `/last` | GET | Last transcription + timestamp |
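
The core endpoints are plain JSON over HTTP, so scripting against them is trivial. A quick example, assuming the service runs on head-vixy.local:8446 (the host named at the top and the port used in the curl examples further down):

```python
# Hypothetical polling example; host and port are assumptions as noted above.
import requests

BASE = "http://head-vixy.local:8446"

health = requests.get(f"{BASE}/health", timeout=2).json()
status = requests.get(f"{BASE}/status", timeout=2).json()
print(health)
print(status)
```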

### Spatial

| Endpoint | Method | Description |
|---|---|---|
| `/doa` | GET | DoA + triangulated position + ILD + ITD + gaze + distance + proximity |
| `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
| `/speakers/tracked` | GET | Multi-speaker positions, beam mode, lock state, target |
| `/speakers/focus` | POST | Switch cocktail party attention (query: `speaker=0\|1`) |
| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
| `/scene/events` | GET | Recent sound events with what + where + when (query: `seconds`, `category`) |
| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
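
Switching cocktail-party focus is a single POST. A sketch, assuming the same base URL as the core example above:

```python
import requests

BASE = "http://head-vixy.local:8446"

# Point cocktail-party attention at tracked speaker 1 (query: speaker=0|1)
requests.post(f"{BASE}/speakers/focus", params={"speaker": 1}, timeout=2)

# Where are both tracked speakers right now?
print(requests.get(f"{BASE}/speakers/tracked", timeout=2).json())
```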

### Sound

| Endpoint | Method | Description |
|---|---|---|
| `/sounds` | GET | Current audio scene (category, top 5 classes, speaker) |
| `/sounds/history` | GET | Classification history (last N seconds) |
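
Both return JSON. For example, to watch the live audio scene (same assumed base URL; the history query parameter name is a guess):

```python
import requests

BASE = "http://head-vixy.local:8446"

# Current scene: category, top 5 classes, detected speaker
print(requests.get(f"{BASE}/sounds", timeout=2).json())

# Recent classification history; the "seconds" query name is an assumption
print(requests.get(f"{BASE}/sounds/history", params={"seconds": 30}, timeout=2).json())
```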

### Speakers

| Endpoint | Method | Description |
|---|---|---|
| `/speakers` | GET | List all speakers (enrolled + anonymous) |
| `/speakers/enroll` | POST | Enroll from uploaded audio (multipart: name + WAV) |
| `/speakers/enroll-from-mic` | POST | Record 5s from mic + enroll (query: `name`) |
| `/speakers/promote` | POST | Promote anonymous → enrolled (query: `anon_id`, `name`) |
| `/speakers/{name}` | DELETE | Remove a speaker |
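
Enrollment from a file is a multipart upload. A hedged example (the form field names `name` and `file` are assumptions based on the table, as is the base URL):

```python
# Hedged example of POST /speakers/enroll; the multipart field names
# ("name" as form data, "file" as the WAV part) are assumptions.
import requests

with open("alex_sample.wav", "rb") as wav:
    r = requests.post(
        "http://head-vixy.local:8446/speakers/enroll",
        data={"name": "Alex"},
        files={"file": ("alex_sample.wav", wav, "audio/wav")},
    )
print(r.status_code, r.json())
```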

### Recording

| Endpoint | Method | Description |
|---|---|---|
| `/recording` | GET | Binaural recording stats |

## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PORCUPINE_ACCESS_KEY` | (none) | Picovoice API key for wake word |
| `WAKE_WORD_PATH` | `~/headmic/Hey-Vivi_*.ppn` | Wake word model path |
| `EARTAIL_URL` | `http://bigorin.local:8764` | Transcription service |
| `EYE_SERVICE_URL` | `http://localhost:8780` | Eye service for gaze push |
| `BINAURAL_RECORD` | `0` | Set to `1` to enable stereo recording |
| `BINAURAL_DIR` | `~/headmic/recordings` | Output directory for WAV segments |

### Config File (`~/.vixy/headmic.json`)

```json
{
  "ears": {
    "left": {"usb_serial": "101991441254500541", "alsa_card": "Array"},
    "right": {"usb_serial": "101991441254500556", "alsa_card": "Array_1"}
  },
  "array_separation_mm": 175.0
}
```

## Speaker Identification

Three-tier recognition using Resemblyzer 256-dim GE2E embeddings:

| Tier | Name format | How it works |
|---|---|---|
| Enrolled | `"Alex"` | Matched against stored embeddings (cosine ≥ 0.75) |
| Anonymous | `"unknown_bfa1"` | Clustered online from unrecognized voices (cosine ≥ 0.70) |
| Unidentified | `null` | Audio too short or no speech detected |

Anonymous speakers get a stable 4-character hex ID derived from their voice embedding. The same person consistently gets the same ID across observations. IDs expire after 1 hour of silence, max 10 tracked simultaneously.
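
In code terms the tier decision is roughly the following. This is a sketch of the matching logic, not the actual speaker_id.py; the thresholds come from the table above, and the ID derivation shown is one plausible way to get a stable 4-hex tag (the real one may differ):

```python
# Sketch of the three-tier decision over Resemblyzer-style 256-dim embeddings.
import hashlib
import numpy as np

ENROLLED_THRESHOLD = 0.75
ANON_THRESHOLD = 0.70

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(emb, enrolled: dict[str, np.ndarray], anonymous: dict[str, np.ndarray]):
    if emb is None:                      # too short / no speech -> unidentified
        return None
    # Tier 1: enrolled speakers
    name, sim = max(((n, cosine(emb, ref)) for n, ref in enrolled.items()),
                    key=lambda x: x[1], default=(None, 0.0))
    if sim >= ENROLLED_THRESHOLD:
        return name
    # Tier 2: existing anonymous clusters
    for anon_id, ref in anonymous.items():
        if cosine(emb, ref) >= ANON_THRESHOLD:
            return anon_id
    # Tier 3: new anonymous speaker; ID derived from the founding embedding,
    # then reused via cluster matching, so the same voice keeps the same tag
    anon_id = "unknown_" + hashlib.sha1(emb.tobytes()).hexdigest()[:4]
    anonymous[anon_id] = emb
    return anon_id
```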

Workflow:

```
Unknown person speaks → "unknown_bfa1" (auto-created)
    ↓
You ask "who's that?" → check /speakers
    ↓
curl -X POST "http://head:8446/speakers/promote?anon_id=unknown_bfa1&name=Bob"
    ↓
Now recognized as "Bob" going forward (embedding saved to voices.db)
```

Alternatively, enroll directly from mic:

```bash
curl -X POST "http://head:8446/speakers/enroll-from-mic?name=Alex"
# Speak for 5 seconds
```

## LED States

| State | Effect | Color |
|---|---|---|
| Idle | Off | |
| Wake word detected | Solid | White (flash) |
| Listening/Recording | DoA indicator | Cyan |
| Processing | Breath | Purple |
| Enrolling speaker | Solid | Orange |

## File Structure

```
headmic/
├── headmic.py              # Main FastAPI service
├── audio_stream.py         # Dual arecord streams + best-beam selection
├── spatial.py              # 3-signal fusion (DoA + ILD + ITD) + gaze + proximity
├── spatial_scene.py        # Spatial audio scene map + anomaly detection
├── multi_speaker.py        # Multi-speaker tracking + beam steering + cocktail party
├── xvf3800.py              # USB vendor control (DoA + LEDs + beam steering)
├── sound_id.py             # YAMNet sound classification (CPU/Edge TPU)
├── speaker_id.py           # Resemblyzer speaker identification
├── binaural_recorder.py    # Stereo WAV recording from both ears
├── headmic.service         # systemd service file
├── requirements.txt        # Python dependencies
├── BINAURAL_ROADMAP.md     # Roadmap for binaural features
├── models/
│   ├── yamnet.tflite       # YAMNet CPU model
│   ├── yamnet_edgetpu.tflite  # YAMNet Edge TPU model
│   └── yamnet_class_map.csv   # 521 class names
└── voices.db               # Speaker embeddings (SQLite, runtime)
```

## XVF3800 USB Control Protocol

Commands use USB vendor control transfers: `wValue` = cmdid, `wIndex` = resid.

Key findings during development:

- Payload format: single bytes for effects (`bytes([3])`), not packed uint32
- Color format: `[R, G, B, 0]` (4 bytes)
- Read responses have a 1-byte status header before data
- Read `wLength` must be `count * type_size + 1` (exact, not rounded up)
- `DOA_VALUE` (resid=20, cmdid=18) is sluggish/cached — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35, cmdid=11) for real-time tracking
- `AUDIO_MGR_SELECTED_AZIMUTHS` returns 2 floats (radians): index 0 = processed DoA (NaN = no speech = VAD indicator), index 1 = auto-select beam (always tracks strongest source)
- `AEC_SPENERGY_VALUES` (resid=33, cmdid=80) is always zero on 2-channel firmware — don't rely on it
- 2-channel firmware only — 6-channel firmware silently ignores LED/control commands
- See the hedged read sketch below for how these pieces fit together
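
Putting those findings together, a hedged pyusb sketch of the real-time azimuth read (`bRequest=0` is an assumption; the resid/cmdid values, the `wLength` rule, and the status header are from the notes above):

```python
# Hedged pyusb sketch of the read path described above; not the xvf3800.py code.
import math
import struct

import usb.core

dev = usb.core.find(idVendor=0x2886, idProduct=0x001A)  # first XVF3800 found

# AUDIO_MGR_SELECTED_AZIMUTHS: resid=35, cmdid=11, two float32 values.
# wLength must be count * type_size + 1; byte 0 is the status header.
COUNT, FSIZE = 2, 4
raw = dev.ctrl_transfer(0xC0, 0, wValue=11, wIndex=35,
                        data_or_wLength=COUNT * FSIZE + 1)
doa, auto_beam = struct.unpack("<2f", bytes(raw[1:]))   # radians

if math.isnan(doa):
    print("no speech (NaN doubles as the VAD indicator)")
else:
    print(f"speech at {math.degrees(doa):.0f} deg; beam at {math.degrees(auto_beam):.0f} deg")
```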

Built by Vixy on Day 77 (January 17, 2026)

Upgraded to dual XVF3800 binaural hearing on Day 161 (April 2026)

Full binaural suite (10/12 features) built Day 162

"Hey Vivi" — the words that summon me 💜