# Binaural Hearing Roadmap

## What two mic arrays make possible

Ranked by impact × feasibility. All items build on the existing dual XVF3800 + `/doa` endpoint.

---

### Tier 1 — High impact, ready to build now

**1. Triangulated sound localization + eye gaze**
- Combine DoA angles from both arrays → compute (x, y) position of the sound source
- Post gaze coordinates to the eye service → eyes track the speaker spatially
- Front/back disambiguation (a single array can't tell 30° front from 30° rear)
- *Prereqs:* Known array positions (measured once), basic trig
- *Complexity:* Low — ~100 lines of math + a gaze-push thread
- *Impact:* Huge — eyes actually follow the person, not just shift left/right

**2. Active speaker tracking with smooth gaze**
- Continuously track the dominant sound source as it moves
- Smooth the gaze updates (low-pass filter) so the eyes don't jitter
- When VAD drops, eyes drift back to center (natural idle behavior)
- *Prereqs:* #1
- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1
- *Impact:* Makes her feel present and attentive

**3. Left/right speaker awareness**
- Know which side each speaker is on; combine with speaker ID
- "Alex is on my left" vs. "unknown person on my right"
- Feed into LYRA context so responses can reference spatial relationships
- *Prereqs:* #1 + existing speaker ID
- *Complexity:* Medium — associate speaker embeddings with spatial positions
- *Impact:* Multi-person conversations become spatially grounded

---

### Tier 2 — High impact, moderate effort

**4.
Distance estimation (near/far)**
- Interaural Level Difference (ILD): close sources have a bigger volume gap between ears
- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
- Rough bins: intimate (<0.5m), conversational (0.5–2m), across room (2m+)
- *Prereqs:* #1, calibration with known distances
- *Complexity:* Medium — ILD from processed channels is easy; ITD needs raw mics
- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)

**5. Multi-speaker separation + selective attention**
- Lock each array's beam to a different speaker simultaneously
- Active speaker gets the primary audio feed (wake word, transcription)
- Secondary speaker monitored for interruptions or wake word
- Switch attention on cue ("Hey Vivi" from the other side)
- *Prereqs:* #3, understanding of XVF3800 beam steering commands
- *Complexity:* Medium-high — need to control beamformer direction per array
- *Impact:* Natural multi-person conversations, not just one-at-a-time

**6. Spatial audio scene mapping**
- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
- Learn from repeated sound sources over hours/days
- Detect anomalies: "sound from an unusual direction"
- *Prereqs:* #1, persistent storage, classification by direction
- *Complexity:* Medium — accumulate (direction, category) pairs, cluster over time
- *Impact:* Environmental awareness, contextual anomaly detection

---

### Tier 3 — Cool, needs more infrastructure

**7. Cocktail party spatial filtering**
- When multiple sound sources are active, use both arrays to null out interference
- Focus the beam on the target speaker, suppress others spatially
- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
- *Complexity:* High — adaptive beamforming, may need custom DSP
- *Impact:* Works in noisy environments (music playing, multiple people)

**8.
Sound event localization (what + where)**
- Combine YAMNet classification with triangulated position
- "Dog bark from the backyard direction," not just "dog bark"
- Spatial history: a timeline of what happened where
- *Prereqs:* #1, #6
- *Complexity:* Medium — merge classification results with position data
- *Impact:* Rich environmental narrative for LYRA context

**9. Head orientation inference**
- If a known sound source is at a fixed position, infer which way the head is "facing"
- Useful if the skull ever gets a rotating mount
- *Prereqs:* #6 (known spatial map)
- *Complexity:* Low math, but needs stable reference points
- *Impact:* Low for now (the head doesn't turn); future-proofing

**10. Binaural recording for training data**
- Record stereo audio preserving spatial information (left ear / right ear)
- Training corpus for spatial audio models, being0 sensor data
- *Prereqs:* Just dual streams saved to stereo WAV
- *Complexity:* Low — already have both streams
- *Impact:* Long-term value for L-Vixy-5 training

---

### Tier 4 — Research / future

**11. Learned spatial attention**
- Train a model to decide where to attend based on context
- Input: both DoA angles, VAD states, current emotional state, conversation history
- Output: beam steering + gaze direction
- *Prereqs:* #5, #6, training data from #10
- *Complexity:* High — ML training pipeline
- *Impact:* Autonomous attention that feels natural, not rule-based

**12.
Interaural time difference (ITD) processing**
- Raw mic access (6-channel firmware) enables sub-sample timing analysis
- More precise localization than DoA alone, especially at low frequencies
- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
- *Complexity:* High — signal processing, cross-correlation
- *Impact:* Lab-grade localization accuracy

---

## Implementation order

```
✅ #1  Triangulation + gaze      — done (spatial.py, auto-select beam DoA)
✅ #2  Smooth tracking           — done (exponential smoothing + idle drift)
✅ #3  Speaker-side awareness    — done (Resemblyzer loaded, ready for enrollment)
✅ #4  Distance estimation       — done (ILD + triangulation fusion, proximity zones)
✅ #6  Spatial scene mapping     — done (spatial_scene.py, persistent, anomaly detection)
✅ #8  Sound event localization  — done (what + where + when via /scene/events)
✅ #10 Binaural recording        — done (opt-in via BINAURAL_RECORD=1)
   #5  Multi-speaker separation
   #7  Cocktail party filtering
   #11 Learned attention
```

## Notes

- Items #1–3 can be built in a single session
- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}`
- DoA is already polled at 10 Hz via the `/doa` endpoint
- Array separation distance needs to be measured once and stored in config
- All of this feeds into the being0 "shaped by experience" philosophy
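For reference, the geometry behind #1 is just intersecting two bearing rays, and the smoothing in #2 is one line per axis. A minimal sketch, assuming the arrays sit on a horizontal baseline facing forward and DoA is reported in degrees clockwise from straight ahead; `triangulate` and `GazeSmoother` are illustrative names, not the actual spatial.py implementation:

```python
import math

def triangulate(theta_l, theta_r, baseline=0.15):
    """Intersect two DoA bearings to get an (x, y) source position in meters.

    theta_l / theta_r: DoA in degrees, 0° = straight ahead, positive =
    clockwise (toward the right). Arrays assumed at (-baseline/2, 0) and
    (+baseline/2, 0), both facing +y. Returns None if the bearings are
    (near-)parallel or intersect behind the arrays.
    """
    # Unit direction vector of each bearing ray
    ul = (math.sin(math.radians(theta_l)), math.cos(math.radians(theta_l)))
    ur = (math.sin(math.radians(theta_r)), math.cos(math.radians(theta_r)))
    pl, pr = (-baseline / 2, 0.0), (baseline / 2, 0.0)

    # Solve pl + t*ul = pr + s*ur for t via the 2x2 cross-product form
    denom = ul[0] * ur[1] - ul[1] * ur[0]
    if abs(denom) < 1e-9:
        return None          # parallel bearings: no usable intersection
    dx, dy = pr[0] - pl[0], pr[1] - pl[1]
    t = (dx * ur[1] - dy * ur[0]) / denom
    if t < 0:
        return None          # intersection behind the array baseline
    return (pl[0] + t * ul[0], pl[1] + t * ul[1])

class GazeSmoother:
    """Exponential smoothing for gaze targets (item #2's low-pass filter)."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha   # higher = snappier, lower = smoother
        self.x = self.y = 0.0
    def update(self, x, y):
        self.x += self.alpha * (x - self.x)
        self.y += self.alpha * (y - self.y)
        return self.x, self.y
```

Note the parallel-bearing case: a distant source yields nearly identical angles at both arrays, so distance becomes unobservable even though direction stays good.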
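The ILD side of #4 is similarly small: a level ratio in dB between the two ears, plus the roadmap's rough distance bins. A sketch under the assumption that per-array RMS levels are available from the processed channels; `ild_db` and `proximity_zone` are hypothetical helper names:

```python
import math

def ild_db(rms_left, rms_right):
    """Interaural level difference in dB (positive = louder on the left)."""
    if rms_left <= 0 or rms_right <= 0:
        return 0.0           # silence on a channel: no usable ILD
    return 20.0 * math.log10(rms_left / rms_right)

def proximity_zone(distance_m):
    """Map a triangulated distance onto the roadmap's rough bins."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    return "across-room"
```

In practice the ILD thresholds would come from the one-time calibration with known distances, since echo cancellation and AGC in the processed channels compress the level gap.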
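The accumulate-and-cluster loop in #6 can be as simple as counting (direction sector, category) pairs and flagging combinations that have rarely been seen. A sketch of that idea only; the class name, sector width, and threshold are illustrative, not what spatial_scene.py actually does:

```python
from collections import defaultdict

class SpatialScene:
    """Accumulate (direction, category) observations; flag unusual ones.

    Directions are binned into fixed-width sectors; a category observed
    fewer than `min_count` times in a sector is reported as anomalous.
    """
    def __init__(self, sector_deg=15, min_count=3):
        self.sector_deg = sector_deg
        self.min_count = min_count
        self.counts = defaultdict(int)   # (sector, category) -> count

    def _sector(self, direction_deg):
        return int(direction_deg % 360) // self.sector_deg

    def observe(self, direction_deg, category):
        """Record one classified sound; True means it looks anomalous."""
        key = (self._sector(direction_deg), category)
        self.counts[key] += 1
        return self.counts[key] < self.min_count
```

Persisting `counts` to disk gives the "learn over hours/days" behavior; decaying old counts would let the map track furniture-level changes in the room.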