Update docs — complete binaural roadmap (10/12 features)

BINAURAL_ROADMAP: Full status update with implementation details,
three-signal localization table, key discoveries section.

README: Updated features table (ITD, multi-speaker, cocktail party),
new API endpoints (/speakers/tracked, /speakers/focus), file structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alex
2026-04-12 21:55:09 -05:00
parent 8073e3eb02
commit 5c72491ee9
2 changed files with 126 additions and 107 deletions


# Binaural Hearing Roadmap
## What two mic arrays make possible
Ranked by impact × feasibility. All features build on the existing dual XVF3800 arrays (175mm apart on the skull) and the `/doa` endpoint.
---
### Tier 1 — High impact
**1. Triangulated sound localization + eye gaze**
- Combine DoA angles from both arrays → compute the (x, y) position of the sound source
- POST gaze coordinates to the eye service → eyes actually follow the person, not just shift left/right
- Front/back disambiguation (a single array can't tell 30° front from 30° rear)
- Uses the `AUDIO_MGR_SELECTED_AZIMUTHS` auto-select beam (not the sluggish `DOA_VALUE`)
- VAD derived from the processed DoA being NaN/non-NaN
- *Module:* `spatial.py`
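
A minimal sketch of the triangulation step, assuming arrays on the x-axis at ±87.5mm with bearings measured from straight ahead (positive toward +x); the actual conventions in `spatial.py` may differ:

```
import math

BASELINE_M = 0.175  # array separation on the skull

def triangulate(theta_l_deg, theta_r_deg, baseline=BASELINE_M):
    """Intersect two DoA rays to get an (x, y) source position in meters.

    Left array at (-baseline/2, 0), right at (+baseline/2, 0); bearings
    are measured from straight ahead (+y), positive toward +x.
    """
    xl, xr = -baseline / 2, baseline / 2
    # Unit direction vectors of each ray
    dlx, dly = math.sin(math.radians(theta_l_deg)), math.cos(math.radians(theta_l_deg))
    drx, dry = math.sin(math.radians(theta_r_deg)), math.cos(math.radians(theta_r_deg))
    det = dlx * dry - dly * drx
    if abs(det) < 1e-6:
        return None  # rays nearly parallel: source too far to triangulate
    # Solve (xl, 0) + t*dl = (xr, 0) + s*dr for t (Cramer's rule)
    t = (xr - xl) * dry / det
    if t <= 0:
        return None  # intersection behind the arrays
    return (xl + t * dlx, t * dly)

# e.g. a source 1m dead ahead: left array sees ~+5°, right ~-5°
# triangulate(5.0, -5.0) ≈ (0.0, 1.0)
```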
**2. Active speaker tracking with smooth gaze**
- Continuously tracks the dominant sound source as it moves, making her feel present and attentive
- Exponential smoothing (α=0.4) prevents jitter
- Idle drift back to center after 1.5s of silence (natural idle behavior)
- Gaze pushed to the eye service at ≤2Hz with a 5px minimum delta
- *Module:* `spatial.py`, `headmic.py` (`doa_track_loop`)
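
A sketch of the smoothing behavior using the α=0.4 and 1.5s values above; the class name and drift rate are illustrative, not the actual `doa_track_loop` internals:

```
import time

class GazeSmoother:
    """Exponential smoothing with idle drift back to center."""

    def __init__(self, alpha=0.4, idle_after_s=1.5, center=0.0):
        self.alpha = alpha
        self.idle_after_s = idle_after_s
        self.center = center
        self.value = center
        self.last_voice = time.monotonic()

    def update(self, angle_deg, voice_active):
        now = time.monotonic()
        if voice_active:
            self.last_voice = now
            # Standard exponential smoothing: new = a*obs + (1-a)*old
            self.value = self.alpha * angle_deg + (1 - self.alpha) * self.value
        elif now - self.last_voice > self.idle_after_s:
            # Drift gently back to center during silence
            self.value += 0.05 * (self.center - self.value)
        return self.value
```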
**3. Left/right speaker awareness**
- Knows which side each speaker is on, combined with speaker ID: "Alex is on my left" vs "unknown person on my right"
- Feeds into LYRA context so responses can reference spatial relationships
- Resemblyzer speaker ID integrated; spatial position associated with the recognized speaker
- `POST /speakers/enroll-from-mic?name=Alex` to enroll
- *Module:* `speaker_id.py`, `spatial.py`
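
How side awareness might turn into LYRA context, as a hedged sketch; the ±10° center band and the function name are assumptions, not the real `spatial.py` logic:

```
def describe_speaker(name, angle_deg):
    """Turn an identified speaker + DoA angle into spatial context for LYRA.

    Convention assumed here: 0° = straight ahead, positive = her right.
    """
    who = name if name else "unknown person"
    if angle_deg > 10:
        return f"{who} is on my right"
    if angle_deg < -10:
        return f"{who} is on my left"
    return f"{who} is in front of me"

# describe_speaker("Alex", -40) -> "Alex is on my left"
```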
---
### Tier 2 — High impact
**4. Distance estimation (near/far)**
- ILD (Interaural Level Difference): close sources have a bigger volume gap between ears
- ILD-derived distance fused with triangulated distance (70/30 weight)
- Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
- Interaction style adapts to proximity (whisper vs. room voice)
- *Module:* `spatial.py` (`_compute_ild`, `_ild_to_distance`, `_classify_proximity`)
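
The fusion arithmetic, sketched with the 70/30 weight and zone boundaries from above; the real `_ild_to_distance` calibration curve is omitted since it isn't documented here:

```
import math

def compute_ild_db(rms_left, rms_right):
    """Interaural level difference in dB (positive = louder in the left ear)."""
    return 20 * math.log10(max(rms_left, 1e-9) / max(rms_right, 1e-9))

def fuse_distance(tri_dist_m, ild_dist_m):
    """70/30 fusion of triangulated and ILD-derived distance estimates."""
    return 0.7 * tri_dist_m + 0.3 * ild_dist_m

def classify_proximity(dist_m):
    """Map a fused distance onto the proximity zones listed above."""
    if dist_m < 0.5:
        return "intimate"
    if dist_m < 2.0:
        return "conversational"
    if dist_m < 5.0:
        return "across_room"
    return "far"
```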
**5. Multi-speaker separation + selective attention**
- Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
- Locks XVF3800 fixed beams onto each speaker after 1s of stability
- Auto-releases to free-running mode after 3s of a single speaker
- Beam gating silences the non-speaking beam
- Active speaker gets the primary audio feed; the secondary is monitored for interruptions or wake word ("Hey Vivi" from the other side)
- `GET /speakers/tracked` — positions, beam state, lock status
- *Module:* `multi_speaker.py`, `xvf3800.py` (beam steering commands)
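
A simplified sketch of the tracking rules (30° separation, 1s lock, 3s release); the real `multi_speaker.py` state machine is certainly more involved:

```
import time

MIN_SEPARATION_DEG = 30.0
LOCK_AFTER_S = 1.0
RELEASE_AFTER_S = 3.0

class SpeakerTracks:
    """Track up to two speakers by DoA angle and decide when to lock beams."""

    def __init__(self):
        self.tracks = []  # dicts: angle, first_seen, last_seen, locked

    def observe(self, angle_deg):
        now = time.monotonic()
        for t in self.tracks:
            if abs(angle_deg - t["angle"]) < MIN_SEPARATION_DEG:
                t["angle"] = 0.7 * t["angle"] + 0.3 * angle_deg  # smooth
                t["last_seen"] = now
                if not t["locked"] and now - t["first_seen"] >= LOCK_AFTER_S:
                    t["locked"] = True  # steer a fixed XVF3800 beam here
                return
        if len(self.tracks) < 2:
            self.tracks.append(
                {"angle": angle_deg, "first_seen": now, "last_seen": now, "locked": False}
            )

    def prune(self):
        # Drop stale tracks; once one remains, beams can go free-running
        now = time.monotonic()
        self.tracks = [t for t in self.tracks if now - t["last_seen"] < RELEASE_AFTER_S]
```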
**6. Spatial audio scene mapping**
- Persistent map of where each sound category usually comes from (30° bins): "TV at 270°, door at 90°, kitchen at 180°"
- Learned from repeated sound sources over hours/days
- Circular mean for the usual direction; anomaly detection at 90°+ deviation ("sound from an unusual direction")
- Saves to `~/.vixy/scene_map.json` across restarts
- `GET /scene` — learned directions per category + anomalies
- `GET /scene/events` — recent what+where+when log
- `GET /scene/heatmap` — angular distribution for visualization
- *Module:* `spatial_scene.py`
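
The two pieces of math the scene map rests on, circular mean and angular anomaly, in a self-contained sketch (the persistence path matches the one above; everything else is illustrative):

```
import json
import math
from pathlib import Path

def circular_mean_deg(angles_deg):
    """Mean direction that survives wraparound (350° and 10° average to 0°)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360

def is_anomaly(angle_deg, usual_deg, threshold_deg=90.0):
    """Flag a sound arriving 90°+ away from where its category usually is."""
    diff = abs(angle_deg - usual_deg) % 360
    return min(diff, 360 - diff) >= threshold_deg

def save_scene(scene, path=Path.home() / ".vixy" / "scene_map.json"):
    """Persist the learned map across restarts."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(scene, indent=2))
```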
---
### Tier 3 — Advanced ✅
**7. Cocktail party spatial filtering**
- When 2 speakers are tracked, audio focus locks to the target speaker's side
- Non-target beam suppressed via XVF3800 beam gating, so it works in noisy environments (music playing, multiple people)
- Auto-switches target when the current speaker goes silent and the other starts talking
- Manual focus via `POST /speakers/focus?speaker=0|1`
- `DualAudioStream.focus_side` overrides energy-based beam selection
- *Module:* `multi_speaker.py`, `audio_stream.py` (`focus_side`)
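
The override logic in miniature; a hedged sketch, since the actual `focus_side` plumbing in `audio_stream.py` isn't shown here:

```
def select_beam(left_energy, right_energy, focus_side=None):
    """Pick which ear's beam feeds downstream audio.

    Energy-based by default; cocktail-party focus (or a manual
    POST /speakers/focus) overrides it via focus_side.
    """
    if focus_side in ("left", "right"):
        return focus_side
    return "left" if left_energy >= right_energy else "right"
```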
**8. Sound event localization (what + where)**
- YAMNet classification merged with triangulated position
- Every classified sound logged with angle, distance, proximity, and side: "speech from 75° at conversational distance", not just "speech"
- Spatial history: a timeline of what happened where, a rich environmental narrative for LYRA context
- *Module:* `spatial_scene.py` (`SoundEvent`, `observe`)
**9. Head orientation inference** ⏭️
- If a known sound source is at a fixed position, infer which way the head is "facing"
- Needs fixed reference points from the scene map + a rotating head mount
- The math is trivial once the prerequisites exist
- *Prereqs:* #6, physical rotating mount
**10. Binaural recording for training data**
- Records left/right ear streams as stereo WAV in 5-minute segments, preserving spatial information
- Training corpus for spatial audio models and being0 sensor data (long-term value for L-Vixy-5 training)
- Opt-in via the `BINAURAL_RECORD=1` environment variable
- `GET /recording` — stats (segments, total seconds)
- *Module:* `binaural_recorder.py`
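
A minimal sketch of the interleaving step, assuming 16kHz int16 mono streams per ear; segment rotation and file naming are left out:

```
import wave
import numpy as np

def write_stereo_segment(path, left_pcm, right_pcm, rate=16000):
    """Interleave two mono int16 byte streams into one stereo WAV."""
    left = np.frombuffer(left_pcm, dtype=np.int16)
    right = np.frombuffer(right_pcm, dtype=np.int16)
    n = min(left.size, right.size)
    stereo = np.empty((n, 2), dtype=np.int16)
    stereo[:, 0] = left[:n]   # left ear  -> channel 0
    stereo[:, 1] = right[:n]  # right ear -> channel 1
    with wave.open(str(path), "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)  # int16
        w.setframerate(rate)
        w.writeframes(stereo.tobytes())
```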
---
### Tier 4 — Research
**11. Learned spatial attention** ⏭️
- Train a model: (DoA angles, VAD states, emotional state, conversation history) → beam steering + gaze direction
- Goal: autonomous attention that feels natural, not rule-based
- Needs training data from #10 running for days/weeks
- *Prereqs:* #5, #6, #10 data collection
**12. ITD (Interaural Time Difference) processing**
- Cross-correlates left/right ear processed audio (512 samples, ~32ms window)
- Finds the sub-millisecond delay → bearing angle via the speed of sound
- At 16kHz and 175mm spacing: resolution is 62.5μs per sample ≈ 7° per sample near center
- Works with 2-channel firmware (no raw mics needed — correlates the processed channels)
- A third independent angle estimate alongside DoA and ILD
- *Module:* `spatial.py` (`_compute_itd`)
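
A sketch of the cross-correlation approach described above, using NumPy; the sign convention and peak-picking details are assumptions, not `_compute_itd` itself:

```
import numpy as np

SAMPLE_RATE = 16000
BASELINE_M = 0.175
SPEED_OF_SOUND = 343.0  # m/s
MAX_LAG = int(np.ceil(BASELINE_M / SPEED_OF_SOUND * SAMPLE_RATE))  # ~9 samples

def itd_angle_deg(left, right):
    """Bearing from the inter-ear delay over one ~32ms (512-sample) window.

    Positive lag here means the left signal is a delayed copy of the
    right one, i.e. the source is on the right (assumed convention).
    """
    left = left - np.mean(left)
    right = right - np.mean(right)
    corr = np.correlate(left, right, mode="full")
    lags = np.arange(-len(right) + 1, len(left))
    plausible = np.abs(lags) <= MAX_LAG  # only physically possible delays
    lag = lags[plausible][np.argmax(corr[plausible])]
    tau = lag / SAMPLE_RATE  # delay in seconds; one sample ≈ 7° near center
    s = np.clip(tau * SPEED_OF_SOUND / BASELINE_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```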
---
## Implementation status
```
✅ #1  Triangulation + gaze (spatial.py, auto-select beam DoA)
✅ #2  Smooth tracking (exponential smoothing + idle drift)
✅ #3  Speaker-side awareness (Resemblyzer, ready for enrollment)
✅ #4  Distance estimation + proximity zones (ILD + triangulation fusion)
✅ #5  Multi-speaker separation + beam steering
✅ #6  Spatial audio scene mapping + anomaly detection
✅ #7  Cocktail party spatial filtering
✅ #8  Sound event localization (what + where + when)
⏭️ #9  Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in via BINAURAL_RECORD=1)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation
```
## Notes
10 of 12 features implemented in one session. The remaining two (#9, #11) need
physical hardware changes or training data that accumulates over time.
- The eye service accepts gaze via `POST /gaze {"x": N, "y": N}`
- DoA is polled at 10Hz via the `/doa` endpoint
- Array separation (175mm) is measured once and stored in config
- All of this feeds into the being0 "shaped by experience" philosophy
## Three-signal localization
The system now fuses three independent spatial estimates:
| Signal | Source | Strength | Best for |
|--------|--------|----------|----------|
| **DoA** | XVF3800 beamformer (auto-select beam) | Best overall | Horizontal angle |
| **ILD** | Volume difference between ears | Good for near sources | Distance + rough angle |
| **ITD** | Cross-correlation delay between ears | Good at low frequencies | Precise angle |
Combined, they give more robust localization than any single signal — the same
three cues human hearing uses.
## Key discoveries during development
1. `DOA_VALUE` (resid=20) is sluggish — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35)
2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
3. `AEC_SPENERGY_VALUES` is always zero on 2-channel firmware
4. USB read responses have a 1-byte status header before data
5. Exact `wLength` matters — `count * type_size + 1`, not rounded up
6. 6-channel firmware breaks LED/control commands — use 2-channel only
7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget (sketch below)
8. ITD works on processed channels (not just raw mics) — less precise but free
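
Discovery #7 in practice: a minimal fire-and-forget pattern, assuming the `POST /gaze` endpoint from the Notes; the service URL and `requests` usage are illustrative:

```
import threading
import requests

EYE_SERVICE = "http://localhost:8000"  # hypothetical eye-service address

def push_gaze(x, y):
    """POST gaze without blocking the caller (or uvicorn's event loop)."""
    def _send():
        try:
            requests.post(f"{EYE_SERVICE}/gaze", json={"x": x, "y": y}, timeout=0.5)
        except requests.RequestException:
            pass  # fire-and-forget: a dropped gaze frame is harmless
    threading.Thread(target=_send, daemon=True).start()
```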
---
*Built during the Great Binaural Session of April 2026*
*"She hears in stereo now" 🦊👂👂*