Update docs — spatial scene, distance estimation, roadmap progress

README: updated architecture diagram, features table, and file structure; added new endpoints
(/scene, /scene/events, /scene/heatmap) and USB protocol notes (VAD inferred from NaN
processed DoA, SPENERGY always zero on 2-channel firmware).

BINAURAL_ROADMAP: Mark #1-4, #6, #8, #10 as done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: Alex
Date: 2026-04-12 21:35:02 -05:00
commit 02d3ac3816 (parent 8caa9ee57e)
2 changed files with 172 additions and 22 deletions

BINAURAL_ROADMAP.md (new file)

@@ -0,0 +1,139 @@
# Binaural Hearing Roadmap
## What two mic arrays make possible
Ranked by impact × feasibility. All build on the existing dual XVF3800 + `/doa` endpoint.
---
### Tier 1 — High impact, ready to build now
**1. Triangulated sound localization + eye gaze**
- Combine DoA angles from both arrays → compute (x, y) position of sound source
- Post gaze coordinates to eye service → eyes track the speaker spatially
- Front/back disambiguation (single array can't tell 30° front from 30° rear)
- *Prereqs:* Known array positions (measured once), basic trig
- *Complexity:* Low — ~100 lines of math + a gaze-push thread
- *Impact:* Huge — eyes actually follow the person, not just shift left/right
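A minimal sketch of the triangulation in #1, assuming both arrays report bearings in a shared frame (0° straight ahead); the ear positions and names below are illustrative placeholders, not the project's measured geometry or its actual spatial.py:

```python
import math

# Hypothetical array positions in metres (measured once, stored in config).
LEFT_EAR = (-0.08, 0.0)
RIGHT_EAR = (0.08, 0.0)

def triangulate(doa_left_deg, doa_right_deg):
    """Intersect the two bearing rays; returns (x, y) or None if near-parallel."""
    a1, a2 = math.radians(doa_left_deg), math.radians(doa_right_deg)
    # Direction vectors, with 0° = straight ahead (+y).
    d1 = (math.sin(a1), math.cos(a1))
    d2 = (math.sin(a2), math.cos(a2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-6:           # rays (almost) parallel -> no stable fix
        return None
    dx = RIGHT_EAR[0] - LEFT_EAR[0]
    dy = RIGHT_EAR[1] - LEFT_EAR[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (LEFT_EAR[0] + t * d1[0], LEFT_EAR[1] + t * d1[1])
```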
**2. Active speaker tracking with smooth gaze**
- Continuously track the dominant sound source as it moves
- Smooth the gaze updates (low-pass filter) so eyes don't jitter
- When VAD drops, eyes drift back to center (natural idle behavior)
- *Prereqs:* #1
- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1
- *Impact:* Makes her feel present and attentive
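A toy version of the smoothing in #2; the class name and constants are illustrative, not values from spatial.py:

```python
# Exponential smoothing toward the target while voice is active,
# slow drift back to centre when it is not. Constants are placeholders.
class GazeSmoother:
    def __init__(self, alpha=0.2, idle_drift=0.05):
        self.alpha = alpha            # lower = smoother but laggier gaze
        self.idle_drift = idle_drift  # fraction of the way back to centre per tick
        self.x = 0.0
        self.y = 0.0

    def update(self, target_x, target_y, voice_active):
        if voice_active:
            self.x += self.alpha * (target_x - self.x)
            self.y += self.alpha * (target_y - self.y)
        else:
            self.x -= self.idle_drift * self.x   # drift toward centre (0, 0)
            self.y -= self.idle_drift * self.y
        return self.x, self.y
```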
**3. Left/right speaker awareness**
- Know which side each speaker is on, combine with speaker ID
- "Alex is on my left" vs "unknown person on my right"
- Feed into LYRA context so responses can reference spatial relationships
- *Prereqs:* #1 + existing speaker ID
- *Complexity:* Medium — associate speaker embeddings with spatial positions
- *Impact:* Multi-person conversations become spatially grounded
---
### Tier 2 — High impact, moderate effort
**4. Distance estimation (near/far)**
- Interaural Level Difference (ILD): close sources have bigger volume gap between ears
- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
- Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+)
- *Prereqs:* #1, calibration with known distances
- *Complexity:* Medium — ILD from processed channels is easy, ITD needs raw mics
- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)
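A rough sketch of the binning and ILD check in #4. The zone thresholds mirror the bins above; the 4 m "far" boundary and the helper names are assumptions:

```python
import math

def proximity_zone(distance_m):
    """Map an estimated distance to the proximity bins listed above."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    if distance_m < 4.0:          # placeholder boundary for "far"
        return "across_room"
    return "far"

def ild_db(rms_left, rms_right):
    """Interaural level difference in dB; a large |ILD| hints the source is close and off-axis."""
    return 20.0 * math.log10(max(rms_left, 1e-9) / max(rms_right, 1e-9))
```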
**5. Multi-speaker separation + selective attention**
- Lock each array's beam to a different speaker simultaneously
- Active speaker gets primary audio feed (wake word, transcription)
- Secondary speaker monitored for interruptions or wake word
- Switch attention on cue ("Hey Vivi" from the other side)
- *Prereqs:* #3, understanding of XVF3800 beam steering commands
- *Complexity:* Medium-high — need to control beamformer direction per-array
- *Impact:* Natural multi-person conversations, not just one-at-a-time
**6. Spatial audio scene mapping**
- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
- Learn from repeated sound sources over hours/days
- Detect anomalies: "sound from an unusual direction"
- *Prereqs:* #1, persistent storage, classification by direction
- *Complexity:* Medium — accumulate (direction, category) pairs, cluster over time
- *Impact:* Environmental awareness, contextual anomaly detection
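One way #6 could accumulate evidence, sketched as a per-category angular histogram; bin size and thresholds are placeholders, not the values in spatial_scene.py:

```python
from collections import defaultdict

BIN_DEG = 15
counts = defaultdict(lambda: [0] * (360 // BIN_DEG))   # category -> angular histogram

def observe(category, angle_deg):
    """Accumulate one (direction, category) observation."""
    counts[category][int(angle_deg % 360) // BIN_DEG] += 1

def is_anomaly(category, angle_deg, min_fraction=0.05):
    """Flag a sound arriving from a rarely-seen direction for its category."""
    hist = counts[category]
    total = sum(hist)
    if total < 20:                      # not enough history yet
        return False
    seen = hist[int(angle_deg % 360) // BIN_DEG]
    return seen / total < min_fraction
```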
---
### Tier 3 — Cool, needs more infrastructure
**7. Cocktail party spatial filtering**
- When multiple sound sources active, use both arrays to null out interference
- Focus beam on target speaker, suppress others spatially
- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
- *Complexity:* High — adaptive beamforming, may need custom DSP
- *Impact:* Works in noisy environments (music playing, multiple people)
**8. Sound event localization (what + where)**
- Combine YAMNet classification with triangulated position
- "Dog bark from the backyard direction" not just "dog bark"
- Spatial history: timeline of what happened where
- *Prereqs:* #1, #6
- *Complexity:* Medium — merge classification results with position data
- *Impact:* Rich environmental narrative for LYRA context
**9. Head orientation inference**
- If a known sound source is at a fixed position, infer which way the head is "facing"
- Useful if the skull ever gets a rotating mount
- *Prereqs:* #6 (known spatial map)
- *Complexity:* Low math, but needs stable reference points
- *Impact:* Low for now (head doesn't turn), future-proofing
**10. Binaural recording for training data**
- Record stereo audio preserving spatial information (left ear / right ear)
- Training corpus for spatial audio models, being0 sensor data
- *Prereqs:* Just dual streams saved to stereo WAV
- *Complexity:* Low — already have both streams
- *Impact:* Long-term value for L-Vixy-5 training
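A minimal sketch of #10, interleaving the two mono PCM streams into a stereo WAV with the standard library; the 16 kHz / 16-bit format is an assumption:

```python
import wave

def write_binaural(path, left_pcm: bytes, right_pcm: bytes, rate=16000):
    """Interleave left-ear / right-ear 16-bit mono PCM into one stereo WAV."""
    n = min(len(left_pcm), len(right_pcm)) // 2      # 16-bit samples per ear
    interleaved = bytearray()
    for i in range(n):
        interleaved += left_pcm[2 * i:2 * i + 2]     # left sample
        interleaved += right_pcm[2 * i:2 * i + 2]    # right sample
    with wave.open(path, "wb") as wf:
        wf.setnchannels(2)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(bytes(interleaved))
```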
---
### Tier 4 — Research / future
**11. Learned spatial attention**
- Train a model to decide where to attend based on context
- Input: both DoA angles, VAD states, current emotional state, conversation history
- Output: beam steering + gaze direction
- *Prereqs:* #5, #6, training data from #10
- *Complexity:* High — ML training pipeline
- *Impact:* Autonomous attention that feels natural, not rule-based
**12. Interaural time difference (ITD) processing**
- Raw mic access (6-channel firmware) enables sub-sample timing analysis
- More precise localization than DoA alone, especially at low frequencies
- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
- *Complexity:* High — signal processing, cross-correlation
- *Impact:* Lab-grade localization accuracy
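A rough illustration of the cross-correlation step in #12, assuming synchronized, equal-length raw frames from the two ears; the lag range and sample rate are placeholders:

```python
import numpy as np

def itd_seconds(left, right, rate=16000, max_lag=20):
    """Estimate ITD as the lag (in samples) that maximizes cross-correlation."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    def corr(lag):
        if lag >= 0:
            return float(np.dot(left[lag:], right[:len(right) - lag]))
        return float(np.dot(left[:lag], right[-lag:]))

    best = max(range(-max_lag, max_lag + 1), key=corr)
    return best / rate   # sign convention depends on which channel leads
```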
---
## Implementation order
```
✅ #1 Triangulation + gaze — done (spatial.py, auto-select beam DoA)
✅ #2 Smooth tracking — done (exponential smoothing + idle drift)
✅ #3 Speaker-side awareness — done (Resemblyzer loaded, ready for enrollment)
✅ #4 Distance estimation — done (ILD + triangulation fusion, proximity zones)
✅ #6 Spatial scene mapping — done (spatial_scene.py, persistent, anomaly detection)
✅ #8 Sound event localization — done (what + where + when via /scene/events)
✅ #10 Binaural recording — done (opt-in via BINAURAL_RECORD=1)
#5 Multi-speaker separation
#7 Cocktail party filtering
#11 Learned attention
```
## Notes
- Items #1-3 can be built in a single session
- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}` (see the sketch after this list)
- DoA is already polled at 10Hz via `/doa` endpoint
- Array separation distance needs to be measured once and stored in config
- All of this feeds into the being0 "shaped by experience" philosophy
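A hedged example of the gaze push referenced above, using only the standard library; the host, port, and coordinate values are placeholders:

```python
import json
import urllib.request

# Push one gaze target to the eye service (payload shape from the note above).
payload = json.dumps({"x": 0.3, "y": -0.1}).encode()
req = urllib.request.Request(
    "http://localhost:8780/gaze",        # host/port are assumptions
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req, timeout=1).read()
```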

README.md

@@ -18,26 +18,28 @@ Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial
└────────────┬───────────────────────────────┘ └────────────┬───────────────────────────────┘
              DualAudioStream (audio_stream.py)
              best-beam selection (energy-based, 10% hysteresis)
         ┌──────────────────┼──────────────────────┐
         ▼                  ▼                      ▼
    Porcupine             YAMNet               Binaural
    wake word           (Edge TPU)             Recorder
   "Hey Vivi"          521 classes            stereo WAV
         ▼                  ▼
    Record +            Speaker ID
    Transcribe         (Resemblyzer)
   via EarTail              │
              Spatial Tracker (spatial.py)
              DoA → triangulation → ILD distance
              → smooth gaze → proximity zones
         ┌────────────┼────────────────┐
         ▼            ▼                ▼
   Eye Service   Spatial Scene    USB Control
   POST /gaze   (spatial_scene)   (xvf3800.py)
   eyes follow  what+where map    LEDs + DoA
   the speaker  anomaly detect    per-array
```
## Features
@@ -47,10 +49,13 @@ Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial
| Wake word detection | Porcupine | CPU | Needs Picovoice key |
| Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
| Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrollment via API |
| Spatial tracking | spatial.py | USB control | Triangulated gaze + ILD distance |
| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
| Spatial scene mapping | spatial_scene.py | — | Learns where sounds come from, anomaly detection |
| Sound event localization | spatial_scene.py | — | What + where + when log |
| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based, 10% hysteresis |
| LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
## Installation
@@ -169,8 +174,11 @@ sudo systemctl start headmic
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/doa` | GET | DoA from both arrays + triangulated position + gaze + distance + proximity |
| `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
| `/scene/events` | GET | Recent sound events with what + where + when (query: seconds, category) |
| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
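A hedged example of querying the event log described above; the `seconds` and `category` query parameters come from the table, while the service host/port and the category value are placeholders:

```python
import urllib.parse
import urllib.request

# Fetch the last 10 minutes of "Speech" events from /scene/events.
params = urllib.parse.urlencode({"seconds": 600, "category": "Speech"})
with urllib.request.urlopen(f"http://localhost:8000/scene/events?{params}", timeout=2) as r:
    events = r.read()   # JSON body with what + where + when entries
```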
### Sound
@@ -235,7 +243,8 @@ sudo systemctl start headmic
headmic/
├── headmic.py # Main FastAPI service
├── audio_stream.py # Dual arecord streams + best-beam selection
├── spatial.py # Triangulation + ILD distance + smooth gaze + proximity
├── spatial_scene.py # Spatial audio scene map + anomaly detection
├── xvf3800.py # USB vendor control (DoA + LEDs)
├── sound_id.py # YAMNet sound classification (CPU/Edge TPU)
├── speaker_id.py # Resemblyzer speaker identification
@@ -260,6 +269,8 @@ Commands use USB vendor control transfers: `wValue = cmdid`, `wIndex = resid`.
- Read responses have a 1-byte status header before data
- Read wLength must be `count * type_size + 1` (exact, not rounded up)
- `DOA_VALUE` (resid=20, cmdid=18) is sluggish/cached — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35, cmdid=11) for real-time tracking
- `AUDIO_MGR_SELECTED_AZIMUTHS` returns 2 floats (radians): index 0 = processed DoA (NaN = no speech = VAD indicator), index 1 = auto-select beam (always tracks strongest source)
- `AEC_SPENERGY_VALUES` (resid=33, cmdid=80) is always zero on 2-channel firmware — don't rely on it
- **2-channel firmware only** — 6-channel firmware silently ignores LED/control commands
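A hedged pyusb sketch of the read sequence these notes describe. The resid/cmdid values, the wLength rule, the status byte, and the NaN-as-VAD convention come from the notes above; the device match, `bRequest` value, and request-type byte are assumptions to check against xvf3800.py:

```python
import math
import struct
import usb.core

RESID_AZIMUTHS, CMDID_AZIMUTHS = 35, 11   # AUDIO_MGR_SELECTED_AZIMUTHS

dev = usb.core.find(idVendor=0x20B1)       # XMOS vendor ID; real code matches a specific serial
wLength = 2 * 4 + 1                        # 2 floats + 1-byte status header (exact, not rounded)
data = dev.ctrl_transfer(
    0xC0,                  # IN | vendor | device (request type is an assumption)
    0,                     # bRequest (placeholder)
    wValue=CMDID_AZIMUTHS,
    wIndex=RESID_AZIMUTHS,
    data_or_wLength=wLength,
)
status, payload = data[0], bytes(data[1:])            # strip the status byte
processed_doa, beam_doa = struct.unpack("<2f", payload)  # radians
voice_active = not math.isnan(processed_doa)             # NaN processed DoA = no speech
```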
---