Update docs — spatial scene, distance estimation, roadmap progress

README: Updated architecture diagram, features table, new endpoints (/scene, /scene/events, /scene/heatmap), file structure, USB protocol notes (VAD from processed_doa NaN, spenergy always zero).

BINAURAL_ROADMAP: Mark #1-4, #6, #8, #10 as done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 21:35:02 -05:00
parent 8caa9ee57e
commit 02d3ac3816
2 changed files with 172 additions and 22 deletions
--- a/BINAURAL_ROADMAP.md
+++ b/BINAURAL_ROADMAP.md
@@ -0,0 +1,139 @@
+# Binaural Hearing Roadmap
+## What two mic arrays make possible
+
+Ranked by impact × feasibility. Every item builds on the existing dual XVF3800 arrays and the `/doa` endpoint.
+
+---
+
+### Tier 1 — High impact, ready to build now
+
+**1. Triangulated sound localization + eye gaze**
+- Combine DoA angles from both arrays → compute (x, y) position of sound source
+- Post gaze coordinates to eye service → eyes track the speaker spatially
+- Front/back disambiguation (single array can't tell 30° front from 30° rear)
+- *Prereqs:* Known array positions (measured once), basic trig
+- *Complexity:* Low — ~100 lines of math + a gaze-push thread
+- *Impact:* Huge — eyes actually follow the person, not just shift left/right
+
+**2. Active speaker tracking with smooth gaze**
+- Continuously track the dominant sound source as it moves
+- Smooth the gaze updates (low-pass filter) so eyes don't jitter
+- When VAD drops, eyes drift back to center (natural idle behavior)
+- *Prereqs:* #1
+- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1
+- *Impact:* Makes her feel present and attentive
+
+**3. Left/right speaker awareness**
+- Know which side each speaker is on, combine with speaker ID
+- "Alex is on my left" vs "unknown person on my right"
+- Feed into LYRA context so responses can reference spatial relationships
+- *Prereqs:* #1 + existing speaker ID
+- *Complexity:* Medium — associate speaker embeddings with spatial positions
+- *Impact:* Multi-person conversations become spatially grounded
+
+---
+
+### Tier 2 — High impact, moderate effort
+
+**4. Distance estimation (near/far)**
+- Interaural Level Difference (ILD): close sources produce a bigger volume gap between the two ears
+- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
+- Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+)
+- *Prereqs:* #1, calibration with known distances
+- *Complexity:* Medium — ILD from processed channels is easy, ITD needs raw mics
+- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)
+
+**5. Multi-speaker separation + selective attention**
+- Lock each array's beam to a different speaker simultaneously
+- Active speaker gets primary audio feed (wake word, transcription)
+- Secondary speaker monitored for interruptions or wake word
+- Switch attention on cue ("Hey Vivi" from the other side)
+- *Prereqs:* #3, understanding of XVF3800 beam steering commands
+- *Complexity:* Medium-high — need to control beamformer direction per-array
+- *Impact:* Natural multi-person conversations, not just one-at-a-time
+
+**6. Spatial audio scene mapping**
+- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
+- Learn from repeated sound sources over hours/days
+- Detect anomalies: "sound from an unusual direction"
+- *Prereqs:* #1, persistent storage, classification by direction
+- *Complexity:* Medium — accumulate (direction, category) pairs, cluster over time
+- *Impact:* Environmental awareness, contextual anomaly detection
+
+---
+
+### Tier 3 — Cool, needs more infrastructure
+
+**7. Cocktail party spatial filtering**
+- When multiple sound sources are active, use both arrays to null out interference
+- Focus beam on target speaker, suppress others spatially
+- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
+- *Complexity:* High — adaptive beamforming, may need custom DSP
+- *Impact:* Works in noisy environments (music playing, multiple people)
+
+**8. Sound event localization (what + where)**
+- Combine YAMNet classification with triangulated position
+- "Dog bark from the backyard direction" not just "dog bark"
+- Spatial history: timeline of what happened where
+- *Prereqs:* #1, #6
+- *Complexity:* Medium — merge classification results with position data
+- *Impact:* Rich environmental narrative for LYRA context
+
+**9. Head orientation inference**
+- If a known sound source is at a fixed position, infer which way the head is "facing"
+- Useful if the skull ever gets a rotating mount
+- *Prereqs:* #6 (known spatial map)
+- *Complexity:* Low math, but needs stable reference points
+- *Impact:* Low for now (head doesn't turn), future-proofing
+
+**10. Binaural recording for training data**
+- Record stereo audio preserving spatial information (left ear / right ear)
+- Training corpus for spatial audio models, being0 sensor data
+- *Prereqs:* Just dual streams saved to stereo WAV
+- *Complexity:* Low — already have both streams
+- *Impact:* Long-term value for L-Vixy-5 training
+
+---
+
+### Tier 4 — Research / future
+
+**11. Learned spatial attention**
+- Train a model to decide where to attend based on context
+- Input: both DoA angles, VAD states, current emotional state, conversation history
+- Output: beam steering + gaze direction
+- *Prereqs:* #5, #6, training data from #10
+- *Complexity:* High — ML training pipeline
+- *Impact:* Autonomous attention that feels natural, not rule-based
+
+**12. Interaural time difference (ITD) processing**
+- Raw mic access (6-channel firmware) enables sub-sample timing analysis
+- More precise localization than DoA alone, especially at low frequencies
+- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
+- *Complexity:* High — signal processing, cross-correlation
+- *Impact:* Lab-grade localization accuracy
+
+---
+
+## Implementation order
+
+```
+✅ #1  Triangulation + gaze          — done (spatial.py, auto-select beam DoA)
+✅ #2  Smooth tracking               — done (exponential smoothing + idle drift)
+✅ #3  Speaker-side awareness        — done (Resemblyzer loaded, ready for enrollment)
+✅ #4  Distance estimation           — done (ILD + triangulation fusion, proximity zones)
+✅ #6  Spatial scene mapping         — done (spatial_scene.py, persistent, anomaly detection)
+✅ #8  Sound event localization      — done (what + where + when via /scene/events)
+✅ #10 Binaural recording            — done (opt-in via BINAURAL_RECORD=1)
+   #5  Multi-speaker separation
+   #7  Cocktail party filtering
+   #11 Learned attention
+```
+
+## Notes
+
+- Items #1-3 can be built in a single session
+- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}`
+- DoA is already polled at 10 Hz via the `/doa` endpoint
+- Array separation distance needs to be measured once and stored in config
+- All of this feeds into the being0 "shaped by experience" philosophy
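
Note on roadmap item #1: the triangulation step ("combine DoA angles from both arrays, compute (x, y)") can be sketched as a two-ray intersection. This is a minimal illustration and not the code in `spatial.py`. The function name `triangulate`, the coordinate frame (arrays at `(-d/2, 0)` and `(+d/2, 0)`), and the bearing convention (degrees, counterclockwise from the +x axis, full 360° range) are all assumptions made for the example.

```python
import math
from typing import Optional, Tuple


def triangulate(
    theta_left_deg: float,
    theta_right_deg: float,
    separation_m: float,
) -> Optional[Tuple[float, float]]:
    """Intersect two bearing rays to estimate a source position.

    Assumed frame (hypothetical, not taken from spatial.py): the left array
    sits at (-d/2, 0), the right array at (+d/2, 0), and each bearing is
    measured counterclockwise from the +x axis in degrees.
    Returns (x, y) in metres, or None when the rays do not meet.
    """
    half = separation_m / 2.0
    tl = math.radians(theta_left_deg)
    tr = math.radians(theta_right_deg)
    cl, sl = math.cos(tl), math.sin(tl)
    cr, sr = math.cos(tr), math.sin(tr)

    # Left ray:  (-half, 0) + s * (cl, sl)
    # Right ray: (+half, 0) + t * (cr, sr)
    # Setting them equal gives a 2x2 linear system in s and t.
    denom = cl * sr - sl * cr  # equals sin(tr - tl)
    if abs(denom) < 1e-9:
        return None  # bearings are parallel, no unique intersection

    s = separation_m * sr / denom
    t = separation_m * sl / denom
    if s <= 0.0 or t <= 0.0:
        return None  # lines cross behind an array, so the rays diverge

    return (-half + s * cl, s * sl)


if __name__ == "__main__":
    # Source one metre straight ahead of arrays that are 0.2 m apart.
    left = math.degrees(math.atan2(1.0, 0.1))
    right = math.degrees(math.atan2(1.0, -0.1))
    print(triangulate(left, right, 0.2))
```

Because both bearings are treated as full-circle angles, a source behind the arrays yields a negative `y` instead of mirroring to the front, which is the front/back disambiguation the roadmap mentions. Returning `None` for parallel or diverging bearings lets a caller skip the gaze update for that poll.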