# Binaural Hearing Roadmap

## What two mic arrays make possible

All features are built on dual XVF3800 arrays mounted 175mm apart on the skull.

---

### Tier 1 — High impact ✅

**1. Triangulated sound localization + eye gaze** ✅
- Combine DoA angles from both arrays → compute (x, y) position of sound source (geometry sketched after the Tier 4 list below)
- Post gaze coordinates to eye service → eyes track the speaker spatially
- Uses `AUDIO_MGR_SELECTED_AZIMUTHS` auto-select beam (not sluggish `DOA_VALUE`)
- VAD derived from the processed DoA: NaN means silence, a real angle means speech
- *Module:* `spatial.py`

**2. Active speaker tracking with smooth gaze** ✅
- Exponential smoothing (α=0.4) prevents jitter
- Idle drift back to center after 1.5s of silence
- Gaze pushed to eye service at ≤2Hz with 5px min delta
- *Module:* `spatial.py`, `headmic.py` (`doa_track_loop`)

**3. Left/right speaker awareness** ✅
- Resemblyzer speaker ID integrated, ready for enrollment
- Spatial position associated with recognized speaker
- `POST /speakers/enroll-from-mic?name=Alex` to enroll
- *Module:* `speaker_id.py`, `spatial.py`

---

### Tier 2 — High impact ✅

**4. Distance estimation (near/far)** ✅
- ILD (Interaural Level Difference): volume gap between ears → distance
- Fused with triangulated distance (70/30 weight)
- Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
- *Module:* `spatial.py` (`_compute_ild`, `_ild_to_distance`, `_classify_proximity`)

**5. Multi-speaker separation + selective attention** ✅
- Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
- Locks XVF3800 fixed beams onto each speaker after 1s stability
- Auto-releases to free-running mode after 3s of single speaker
- Beam gating silences the non-speaking beam
- `GET /speakers/tracked` — positions, beam state, lock status
- *Module:* `multi_speaker.py`, `xvf3800.py` (beam steering commands)

**6. Spatial audio scene mapping** ✅
- Persistent map of where each sound category usually comes from (30° bins)
- Circular mean for usual direction, anomaly detection at 90°+ deviation
- Saves to `~/.vixy/scene_map.json` across restarts
- `GET /scene` — learned directions per category + anomalies
- `GET /scene/events` — recent what+where+when log
- `GET /scene/heatmap` — angular distribution for visualization
- *Module:* `spatial_scene.py`

---

### Tier 3 — Advanced ✅

**7. Cocktail party spatial filtering** ✅
- When 2 speakers are tracked, audio focus locks to the target speaker's side
- Non-target beam suppressed via XVF3800 beam gating
- Auto-switches target when the current one goes silent and the other starts talking
- Manual focus via `POST /speakers/focus?speaker=0|1`
- `DualAudioStream.focus_side` overrides energy-based beam selection
- *Module:* `multi_speaker.py`, `audio_stream.py` (`focus_side`)

**8. Sound event localization (what + where)** ✅
- YAMNet classification merged with triangulated position
- Every classified sound logged with angle, distance, proximity, side
- "Speech from 75° at conversational distance", not just "speech"
- *Module:* `spatial_scene.py` (`SoundEvent`, `observe`)

**9. Head orientation inference** ⏭️
- Needs fixed reference points from the scene map + a rotating head mount
- Math is trivial once the prerequisites exist
- *Prereqs:* #6, physical rotating mount

**10. Binaural recording for training data** ✅
- Records left/right ear streams as stereo WAV in 5-minute segments
- Opt-in via `BINAURAL_RECORD=1` environment variable
- `GET /recording` — stats (segments, total seconds)
- *Module:* `binaural_recorder.py`

---

### Tier 4 — Research

**11. Learned spatial attention** ⏭️
- Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
- Needs training data from #10 running for days/weeks
- *Prereqs:* #5, #6, #10 data collection

**12. ITD (Interaural Time Difference) processing** ✅
- Cross-correlates left/right ear processed audio (512 samples, ~32ms window); see the sketch below
- Finds the sub-millisecond delay → bearing angle via the speed of sound
- At 16kHz, 175mm: resolution ~62.5μs/sample ≈ 7° per sample
- Works with 2-channel firmware (no raw mics needed — correlates processed channels)
- Third independent angle estimate alongside DoA and ILD
- *Module:* `spatial.py` (`_compute_itd`)
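To make #12 concrete, here is a minimal sketch of the cross-correlation math under the parameters stated above (16kHz, 175mm baseline, ~32ms windows). The function name, the lag sign convention, and the guard thresholds are illustrative assumptions, not the actual `_compute_itd` code:

```python
import numpy as np

FS_HZ = 16_000              # sample rate
EAR_SPACING_M = 0.175       # array baseline from the spec above
SPEED_OF_SOUND = 343.0      # m/s at room temperature
MAX_LAG = int(EAR_SPACING_M / SPEED_OF_SOUND * FS_HZ) + 1  # ~9 samples

def itd_bearing(left: np.ndarray, right: np.ndarray) -> float | None:
    """Bearing (degrees, 0 = ahead, positive = right) from one window.

    Cross-correlate the two ears, keep only physically possible lags,
    convert the peak lag to a delay, then sin(theta) = c * tau / d.
    Assumes a positive lag means the left ear heard the sound later,
    i.e. the source is on the right; actual channel order may flip this.
    """
    left = left - left.mean()                        # remove DC offset
    right = right - right.mean()
    corr = np.correlate(left, right, mode="full")    # all lags
    lags = np.arange(-(len(right) - 1), len(left))   # lag per corr index
    physical = np.abs(lags) <= MAX_LAG               # head-sized lags only
    peak_lag = lags[physical][np.argmax(corr[physical])]
    tau = peak_lag / FS_HZ                           # delay in seconds
    s = SPEED_OF_SOUND * tau / EAR_SPACING_M         # sin(theta)
    if abs(s) > 1.0:                                 # numerically implausible
        return None
    return float(np.degrees(np.arcsin(s)))
```

At roughly 7° per sample of lag, this estimate is coarse on its own, which is why it serves as a third vote alongside DoA and ILD rather than a standalone localizer.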
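And for feature #1, the ray-intersection geometry: two DoA angles plus the known 175mm baseline pin down an (x, y) position. The coordinate convention (arrays at ±b on the x-axis, angles measured from straight ahead, positive toward the right) and the function names are assumptions for illustration; `spatial.py` may differ. The α=0.4 smoother from #2 is included as a one-liner:

```python
import math

BASELINE_M = 0.175   # array spacing from the spec above

def triangulate(theta_l_deg: float, theta_r_deg: float):
    """Intersect the two DoA rays; returns (x, y) in meters or None.

    Arrays sit at (-b, 0) and (+b, 0); each angle is measured from
    straight ahead (+y), positive toward the right ear (+x).
    """
    b = BASELINE_M / 2.0
    tl = math.radians(theta_l_deg)
    tr = math.radians(theta_r_deg)
    denom = math.sin(tl - tr)            # zero when the rays are parallel
    if abs(denom) < 1e-3:
        return None                      # source too far away to triangulate
    t = 2.0 * b * math.cos(tr) / denom   # distance along the left ray
    if t <= 0:
        return None                      # "intersection" behind the head
    return (-b + t * math.sin(tl), t * math.cos(tl))

def smooth(prev: float, new: float, alpha: float = 0.4) -> float:
    """Feature #2's exponential smoothing: alpha weights the new sample."""
    return alpha * new + (1.0 - alpha) * prev
```

The near-parallel guard doubles as a far-field check: when both arrays report nearly the same angle, the rays barely converge and any intersection point is unreliable.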
---

## Implementation status

```
✅ #1  Triangulation + gaze
✅ #2  Smooth tracking
✅ #3  Speaker-side awareness
✅ #4  Distance estimation + proximity zones
✅ #5  Multi-speaker separation + beam steering
✅ #6  Spatial audio scene mapping + anomaly detection
✅ #7  Cocktail party spatial filtering
✅ #8  Sound event localization (what + where + when)
⏭️ #9  Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation
```

10 of 12 features implemented in one session. The remaining two (#9, #11) need either physical hardware changes or training data that accumulates over time.

## Three-signal localization

The system now fuses three independent spatial estimates:

| Signal | Source | Strength | Best for |
|--------|--------|----------|----------|
| **DoA** | XVF3800 beamformer (auto-select beam) | Best overall | Horizontal angle |
| **ILD** | Volume difference between ears | Good for near sources | Distance + rough angle |
| **ITD** | Cross-correlation delay between ears | Good at low frequencies | Precise angle |

Combined, they give more robust localization than any single signal — the same three cues human hearing uses.

## Key discoveries during development

1. `DOA_VALUE` (resid=20) is sluggish — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35)
2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
3. `AEC_SPENERGY_VALUES` is always zero on 2-channel firmware
4. USB read responses have a 1-byte status header before the data
5. Exact `wLength` matters — `count * type_size + 1`, not rounded up
6. 6-channel firmware breaks LED/control commands — use 2-channel only
7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget
8. ITD works on processed channels (not just raw mics) — less precise but free

---

*Built during the Great Binaural Session of April 2026*

*"She hears in stereo now" 🦊👂👂*