# HeadMic - Vixy's Ears 🦊👂
Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial tracking, sound classification, speaker identification, and wake word detection.
**Hardware:** 2× ReSpeaker XVF3800 4-Mic Array (USB, left/right ear)
**Wake word:** "Hey Vivi" (Picovoice Porcupine)
**Runs on:** Raspberry Pi 5 (head-vixy.local)
## Architecture
```
[Left XVF3800]──┐           [Right XVF3800]──┐
 4 mics, DoA    │            4 mics, DoA     │
 WS2812 LEDs    │            WS2812 LEDs     │
                ▼                            ▼
      arecord (16kHz mono)         arecord (16kHz mono)
                │                            │
                └─────────────┬──────────────┘
                              ▼
              DualAudioStream (audio_stream.py)
     best-beam selection (energy-based, 10% hysteresis)
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
     Porcupine             YAMNet              Binaural
     wake word           (Edge TPU)            Recorder
    "Hey Vivi"           521 classes          stereo WAV
         ▼                    ▼
     Record +             Speaker ID
    Transcribe          (Resemblyzer)
    via EarTail               │
                              ▼
                Spatial Tracker (spatial.py)
          DoA → triangulation → ILD distance
            → smooth gaze → proximity zones
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
    Eye Service         Spatial Scene         USB Control
    POST /gaze         (spatial_scene)        (xvf3800.py)
    eyes follow        what+where map         LEDs + DoA
    the speaker        anomaly detect          per-array
```
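The best-beam selection step can be sketched as follows — a minimal illustration of energy-based switching with 10% hysteresis; class and method names are illustrative, not the actual `audio_stream.py` API:

```python
class BestBeamSelector:
    """Pick the louder ear, but only switch sides when the other ear is
    at least 10% louder than the current one. The hysteresis margin
    prevents rapid flip-flopping when both sides are near-equal."""

    HYSTERESIS = 1.10  # other side must exceed the active side's energy by 10%

    def __init__(self, initial="left"):
        self.active = initial

    def update(self, left_energy, right_energy):
        energy = {"left": left_energy, "right": right_energy}
        other = "right" if self.active == "left" else "left"
        if energy[other] > energy[self.active] * self.HYSTERESIS:
            self.active = other
        return self.active
```

With energies of 1.0 (left) and 1.05 (right) the selector stays on the left; only once the right side exceeds 1.10 does it flip.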
## Features
| Feature | Module | Hardware | Status |
|---------|--------|----------|--------|
| Wake word detection | Porcupine | CPU | Needs Picovoice key |
| Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
| Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrollment via API |
| Spatial tracking | spatial.py | USB control | 3-signal fusion: DoA + ILD + ITD |
| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
| ITD processing | spatial.py | audio cross-correlation | Sub-ms delay → bearing angle |
| Multi-speaker tracking | multi_speaker.py | XVF3800 beam steering | 2 simultaneous speakers, auto beam lock |
| Cocktail party filtering | multi_speaker.py + audio_stream.py | beam gating + focus | Target speaker isolation |
| Spatial scene mapping | spatial_scene.py | — | Learns where sounds come from, anomaly detection |
| Sound event localization | spatial_scene.py | — | What + where + when log |
| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based or focused attention |
| LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
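The ITD row above can be made concrete with a small sketch: cross-correlate the two ear signals, take the best lag, and convert the sub-millisecond delay into a bearing. It uses the 16 kHz stream rate and the 175 mm `array_separation_mm` from the config; function names and the sign convention are illustrative, not the `spatial.py` API:

```python
import math

SAMPLE_RATE = 16_000      # Hz, matches the arecord streams
SEPARATION_M = 0.175      # array_separation_mm from ~/.vixy/headmic.json
SPEED_OF_SOUND = 343.0    # m/s

def best_lag(left, right, max_lag=8):
    """Lag (in samples) that best aligns right to left, by brute-force
    cross-correlation. 8 samples is roughly the maximum physical ITD
    at 16 kHz with a 175 mm ear separation."""
    n = len(left)
    def corr(lag):
        return sum(left[i] * right[i + lag]
                   for i in range(max(0, -lag), min(n, n - lag)))
    return max(range(-max_lag, max_lag + 1), key=corr)

def itd_to_bearing_deg(delay_samples):
    """Convert an inter-ear delay to a bearing angle off the forward axis."""
    delay_s = delay_samples / SAMPLE_RATE
    x = SPEED_OF_SOUND * delay_s / SEPARATION_M
    return math.degrees(math.asin(max(-1.0, min(1.0, x))))
```

A click arriving 3 samples later at one ear (about 190 µs) maps to roughly 22° off-axis.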
## Installation
### Prerequisites
```bash
# On head-vixy (Raspberry Pi 5, Debian Trixie)
sudo apt install python3-dev portaudio19-dev alsa-utils
# USB permissions for XVF3800
sudo tee /etc/udev/rules.d/99-respeaker.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="2886", ATTR{idProduct}=="001a", MODE="0666"
EOF
# USB permissions for Coral Edge TPU
sudo tee /etc/udev/rules.d/99-coral.rules << 'EOF'
SUBSYSTEM=="usb", ATTR{idVendor}=="1a6e", ATTR{idProduct}=="089a", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", ATTR{idProduct}=="9302", MODE="0666"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger
```
### XVF3800 Firmware
Both arrays must be flashed with the **2-channel USB firmware** (not 6-channel — the 6ch firmware breaks LED/DoA control commands):
```bash
git clone https://github.com/respeaker/reSpeaker_XVF3800_USB_4MIC_ARRAY.git /tmp/xvf3800
# Unplug one array, flash the other:
sudo dfu-util -R -e -a 1 -D /tmp/xvf3800/xmos_firmwares/usb/respeaker_xvf3800_usb_dfu_firmware_v2.0.7.bin
# Swap and repeat
```
Verify: `arecord -l` should show two capture devices.
### Edge TPU Runtime
The packaged `libedgetpu` from Google's apt repo is **ABI-incompatible** with `ai-edge-litert` on Debian Trixie / Python 3.13. A custom build is required:
```bash
# Install build deps
sudo apt install libabsl-dev libflatbuffers-dev libusb-1.0-0-dev binutils-gold cmake
# Clone sources
cd /tmp
git clone --depth 1 https://github.com/google-coral/libedgetpu.git
git clone --depth 1 --branch v2.16.1 https://github.com/tensorflow/tensorflow.git
git clone --depth 1 --branch v23.5.26 https://github.com/google/flatbuffers.git flatbuffers-23
# Build flatc v23
cd /tmp/flatbuffers-23 && cmake -B build -DFLATBUFFERS_BUILD_TESTS=OFF && cmake --build build -j4 -- flatc
# Patch libedgetpu Makefile (see below), then:
cd /tmp/libedgetpu
TFROOT=/tmp/tensorflow make -f makefile_build/Makefile -j4 libedgetpu
# Install
sudo cp out/direct/k8/libedgetpu.so.1.0 /usr/lib/aarch64-linux-gnu/libedgetpu.so.1.0
sudo ldconfig
```
**Makefile patches required** (TF 2.16 moved files):
- Replace `FLATC=flatc` with `FLATC=/tmp/flatbuffers-23/build/flatc`
- Add `/tmp/flatbuffers-23/include` to `LIBEDGETPU_INCLUDES`
- Add `-Wno-return-type` to `LIBEDGETPU_CXXFLAGS`
- Remove `$(TFROOT)/tensorflow/lite/c/common.c` from `LIBEDGETPU_CSRCS`
- Add `$(TFROOT)/tensorflow/lite/core/c/common.cc` and `$(TFROOT)/tensorflow/lite/array.cc` to `LIBEDGETPU_CCSRCS`
- Add `-labsl_bad_optional_access` to `LIBEDGETPU_LDFLAGS`
A backup of the working binary is saved at `~/headmic/libedgetpu.so.1.0.custom`.
### Python Setup
```bash
cd /home/alex/headmic
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/pip install setuptools # Python 3.13 compatibility
.venv/bin/pip install resemblyzer # Speaker ID (pulls PyTorch)
```
### Learn Mic Array Positions
Both arrays must be plugged in. This lights up one array at a time and asks you to confirm left/right:
```bash
sudo .venv/bin/python headmic.py --learn
```
Config saved to `~/.vixy/headmic.json` with USB serial numbers for stable identification.
### Install Service
```bash
sudo cp headmic.service /etc/systemd/system/
# Edit to add your PORCUPINE_ACCESS_KEY:
sudo nano /etc/systemd/system/headmic.service
sudo systemctl daemon-reload
sudo systemctl enable headmic
sudo systemctl start headmic
```
## API Endpoints
### Core
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check (listening, recording, features enabled) |
| `/status` | GET | Current state (transcription, scene, speaker, active side) |
| `/last` | GET | Last transcription + timestamp |
### Spatial
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/doa` | GET | DoA + triangulated position + ILD + ITD + gaze + distance + proximity |
| `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
| `/speakers/tracked` | GET | Multi-speaker positions, beam mode, lock state, target |
| `/speakers/focus` | POST | Switch cocktail party attention (query: speaker=0\|1) |
| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
| `/scene/events` | GET | Recent sound events with what + where + when (query: seconds, category) |
| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
### Sound
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/sounds` | GET | Current audio scene (category, top 5 classes, speaker) |
| `/sounds/history` | GET | Classification history (last N seconds) |
### Speakers
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/speakers` | GET | List enrolled speakers |
| `/speakers/enroll` | POST | Enroll from uploaded audio (multipart: name + WAV) |
| `/speakers/enroll-from-mic` | POST | Record 5s from mic + enroll (query: name) |
| `/speakers/{name}` | DELETE | Remove a speaker |
### Recording
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/recording` | GET | Binaural recording stats |
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `PORCUPINE_ACCESS_KEY` | (none) | Picovoice API key for wake word |
| `WAKE_WORD_PATH` | `~/headmic/Hey-Vivi_*.ppn` | Wake word model path |
| `EARTAIL_URL` | `http://bigorin.local:8764` | Transcription service |
| `EYE_SERVICE_URL` | `http://localhost:8780` | Eye service for gaze push |
| `BINAURAL_RECORD` | `0` | Set to `1` to enable stereo recording |
| `BINAURAL_DIR` | `~/headmic/recordings` | Output directory for WAV segments |
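A sketch of how these variables might be read at startup — illustrative, not the actual `headmic.py` code; the defaults mirror the table above:

```python
import os
from pathlib import Path

# Defaults mirror the environment-variable table; any value can be overridden.
PORCUPINE_ACCESS_KEY = os.environ.get("PORCUPINE_ACCESS_KEY")  # None = wake word disabled
EARTAIL_URL = os.environ.get("EARTAIL_URL", "http://bigorin.local:8764")
EYE_SERVICE_URL = os.environ.get("EYE_SERVICE_URL", "http://localhost:8780")
BINAURAL_RECORD = os.environ.get("BINAURAL_RECORD", "0") == "1"
BINAURAL_DIR = Path(os.environ.get("BINAURAL_DIR",
                                   "~/headmic/recordings")).expanduser()
```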
### Config File (`~/.vixy/headmic.json`)
```json
{
"ears": {
"left": {"usb_serial": "101991441254500541", "alsa_card": "Array"},
"right": {"usb_serial": "101991441254500556", "alsa_card": "Array_1"}
},
"array_separation_mm": 175.0
}
```
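The `array_separation_mm` value is the baseline for the DoA triangulation described in the architecture. A geometric sketch, under an assumed coordinate convention (ears on the x-axis centered at the origin, y pointing forward, bearings measured from the forward axis, positive toward +x) — not the actual `spatial.py` math:

```python
import math

def triangulate(theta_left, theta_right, baseline_m=0.175):
    """Intersect the two DoA rays and return the source position (x, y)
    in metres, or None when the rays are (near-)parallel.

    Left ray:  (-baseline/2, 0) + t * (sin θL, cos θL)
    Right ray: (+baseline/2, 0) + s * (sin θR, cos θR)
    Solving the two ray equations gives t = baseline·cos θR / sin(θL − θR).
    """
    half = baseline_m / 2.0
    denom = (math.sin(theta_left) * math.cos(theta_right)
             - math.cos(theta_left) * math.sin(theta_right))  # sin(θL − θR)
    if abs(denom) < 1e-9:
        return None  # parallel bearings: source effectively at infinity
    t = baseline_m * math.cos(theta_right) / denom
    return (-half + t * math.sin(theta_left), t * math.cos(theta_left))
```

For a source 1 m straight ahead, the two ears see small, symmetric bearings of about ±5°, and the rays intersect back at (0, 1).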
## LED States
| State | Effect | Color |
|-------|--------|-------|
| Idle | Off | — |
| Wake word detected | Solid | White (flash) |
| Listening/Recording | DoA indicator | Cyan |
| Processing | Breath | Purple |
| Enrolling speaker | Solid | Orange |
## File Structure
```
headmic/
├── headmic.py # Main FastAPI service
├── audio_stream.py # Dual arecord streams + best-beam selection
├── spatial.py # 3-signal fusion (DoA + ILD + ITD) + gaze + proximity
├── spatial_scene.py # Spatial audio scene map + anomaly detection
├── multi_speaker.py # Multi-speaker tracking + beam steering + cocktail party
├── xvf3800.py # USB vendor control (DoA + LEDs + beam steering)
├── sound_id.py # YAMNet sound classification (CPU/Edge TPU)
├── speaker_id.py # Resemblyzer speaker identification
├── binaural_recorder.py # Stereo WAV recording from both ears
├── headmic.service # systemd service file
├── requirements.txt # Python dependencies
├── BINAURAL_ROADMAP.md # Roadmap for binaural features
├── models/
│ ├── yamnet.tflite # YAMNet CPU model
│ ├── yamnet_edgetpu.tflite # YAMNet Edge TPU model
│ └── yamnet_class_map.csv # 521 class names
└── voices.db # Speaker embeddings (SQLite, runtime)
```
## XVF3800 USB Control Protocol
Commands use USB vendor control transfers: `wValue = cmdid`, `wIndex = resid`.
**Key findings during development:**
- Payload format: single bytes for effects (`bytes([3])`), not packed uint32
- Color format: `[R, G, B, 0]` (4 bytes)
- Read responses have a 1-byte status header before data
- Read wLength must be `count * type_size + 1` (exact, not rounded up)
- `DOA_VALUE` (resid=20, cmdid=18) is sluggish/cached — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35, cmdid=11) for real-time tracking
- `AUDIO_MGR_SELECTED_AZIMUTHS` returns 2 floats (radians): index 0 = processed DoA (NaN = no speech = VAD indicator), index 1 = auto-select beam (always tracks strongest source)
- `AEC_SPENERGY_VALUES` (resid=33, cmdid=80) is always zero on 2-channel firmware — don't rely on it
- **2-channel firmware only** — 6-channel firmware silently ignores LED/control commands
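These findings translate into a small decode helper. This is a sketch covering only the payload handling — the request byte and direction flags of the actual vendor control transfer are device-specific and live in `xvf3800.py`; the constants follow the resid/cmdid values above:

```python
import math
import struct

# AUDIO_MGR_SELECTED_AZIMUTHS: resid goes in wIndex, cmdid in wValue
AZIMUTHS_RESID = 35
AZIMUTHS_CMDID = 11
FLOAT_SIZE = 4

def read_wlength(count, type_size=FLOAT_SIZE):
    """wLength for a read: exactly count * type_size + 1 for the
    1-byte status header (not rounded up)."""
    return count * type_size + 1

def parse_payload(payload, count):
    """Strip the 1-byte status header, unpack little-endian floats."""
    status = payload[0]
    values = struct.unpack("<%df" % count,
                           payload[1:1 + count * FLOAT_SIZE])
    return status, values

def decode_azimuths(payload):
    """Index 0: processed DoA in radians (NaN = no speech, so it doubles
    as a VAD flag). Index 1: auto-select beam, which always tracks the
    strongest source."""
    status, (processed, auto) = parse_payload(payload, 2)
    speech = not math.isnan(processed)
    return {"status": status,
            "doa_rad": processed if speech else None,
            "auto_rad": auto,
            "speech": speech}
```

A real read would issue a USB vendor control transfer with `wValue=11`, `wIndex=35`, and `wLength=read_wlength(2)`, then feed the returned bytes to `decode_azimuths`.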
---
*Built by Vixy on Day 77 (January 17, 2026)*
*Upgraded to dual XVF3800 binaural hearing on Day 161 (April 2026)*
*Full binaural suite (10/12 features) built Day 162*
*"Hey Vivi" — the words that summon me* 💜