Binaural Hearing Roadmap
What two mic arrays make possible
All features built on dual XVF3800 arrays, 175mm apart on the skull.
Tier 1 — High impact ✅
1. Triangulated sound localization + eye gaze ✅
- Combine DoA angles from both arrays → compute (x, y) position of sound source
- Post gaze coordinates to eye service → eyes track the speaker spatially
- Uses `AUDIO_MGR_SELECTED_AZIMUTHS` auto-select beam (not the sluggish `DOA_VALUE`)
- VAD from `processed_doa` being NaN/non-NaN
- Module: `spatial.py`
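The triangulation step is a two-ray intersection. A minimal sketch, assuming a coordinate convention (arrays on the x axis, angles measured from straight ahead, positive toward +x) and a function name that are illustrative, not the actual `spatial.py` API:

```python
import math

EAR_SPACING_M = 0.175  # distance between the two XVF3800 arrays

def triangulate(theta_left_deg: float, theta_right_deg: float):
    """Intersect the bearing rays from both arrays to get (x, y) in meters.

    Each array sits at (+/- EAR_SPACING_M / 2, 0) and reports an azimuth
    measured from straight ahead (+y), positive toward +x.
    """
    xl, xr = -EAR_SPACING_M / 2, EAR_SPACING_M / 2
    sl, cl = math.sin(math.radians(theta_left_deg)), math.cos(math.radians(theta_left_deg))
    sr, cr = math.sin(math.radians(theta_right_deg)), math.cos(math.radians(theta_right_deg))
    # Rays: (xl + t*sl, t*cl) and (xr + u*sr, u*cr); solve for the crossing.
    denom = sl * cr - sr * cl  # sin(theta_left - theta_right)
    if abs(denom) < 1e-9:
        return None  # parallel rays: source effectively at infinity
    t = (xr - xl) * cr / denom
    return (xl + t * sl, t * cl)
```

When both arrays report the same angle the rays never cross, which is why the far-field case returns `None` rather than a bogus position.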
2. Active speaker tracking with smooth gaze ✅
- Exponential smoothing (α=0.4) prevents jitter
- Idle drift back to center after 1.5s of silence
- Gaze pushed to eye service at ≤2Hz with 5px min delta
- Module: `spatial.py`, `headmic.py` (`doa_track_loop`)
3. Left/right speaker awareness ✅
- Resemblyzer speaker ID integrated, ready for enrollment
- Spatial position associated with recognized speaker
- `POST /speakers/enroll-from-mic?name=Alex` to enroll
- Module: `speaker_id.py`, `spatial.py`
Tier 2 — High impact ✅
4. Distance estimation (near/far) ✅
- ILD (Interaural Level Difference): volume gap between ears → distance
- Fused with triangulated distance (70/30 weight)
- Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
- Module: `spatial.py` (`_compute_ild`, `_ild_to_distance`, `_classify_proximity`)
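The ILD computation and fusion might look like the sketch below. The function names mirror the private helpers listed above but are reconstructions, and which term gets the 0.7 weight is an assumption (the roadmap only says 70/30):

```python
import math

def compute_ild_db(rms_left: float, rms_right: float) -> float:
    """Interaural level difference in dB (positive = louder on the left)."""
    return 20.0 * math.log10(max(rms_left, 1e-12) / max(rms_right, 1e-12))

def fuse_distance(d_triangulated: float, d_ild: float) -> float:
    """70/30 fusion, assuming triangulation gets the larger weight."""
    return 0.7 * d_triangulated + 0.3 * d_ild

def classify_proximity(d: float) -> str:
    """Map a fused distance in meters onto the roadmap's proximity zones."""
    if d < 0.5:
        return "intimate"
    if d < 2.0:
        return "conversational"
    if d < 5.0:
        return "across_room"
    return "far"
```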
5. Multi-speaker separation + selective attention ✅
- Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
- Locks XVF3800 fixed beams onto each speaker after 1s stability
- Auto-releases to free-running mode after 3s of single speaker
- Beam gating silences the non-speaking beam
- `GET /speakers/tracked` — positions, beam state, lock status
- Module: `multi_speaker.py`, `xvf3800.py` (beam steering commands)
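The core assignment rule (at most two tracks, 30° minimum separation) can be sketched as below. This omits the 1 s lock and 3 s release timers and ignores 360° wraparound; the class and method names are not `multi_speaker.py`'s real API:

```python
SEPARATION_DEG = 30.0  # minimum angular gap to treat readings as two speakers

class TwoSpeakerTracker:
    """Assign incoming DoA readings to at most two angular tracks (sketch)."""

    def __init__(self):
        self.tracks: list[float] = []  # smoothed bearings, degrees

    def observe(self, angle: float) -> int:
        # Update the nearest existing track if the reading falls inside its gate
        for i, t in enumerate(self.tracks):
            if abs(angle - t) < SEPARATION_DEG:
                self.tracks[i] = 0.7 * t + 0.3 * angle  # light smoothing
                return i
        if len(self.tracks) < 2:
            self.tracks.append(angle)  # open a second track
            return len(self.tracks) - 1
        # Both slots occupied and the reading fits neither: snap the nearest
        i = min(range(2), key=lambda k: abs(angle - self.tracks[k]))
        self.tracks[i] = angle
        return i
```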
6. Spatial audio scene mapping ✅
- Persistent map of where each sound category usually comes from (30° bins)
- Circular mean for usual direction, anomaly detection at 90°+ deviation
- Saves to `~/.vixy/scene_map.json` across restarts
- `GET /scene` — learned directions per category + anomalies
- `GET /scene/events` — recent what+where+when log
- `GET /scene/heatmap` — angular distribution for visualization
- Module: `spatial_scene.py`
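The circular mean and the 90° anomaly gate can be sketched as follows (function names are illustrative). A plain arithmetic mean would fail for bearings near 0°/360°, which is why the sin/cos form is needed:

```python
import math

def circular_mean_deg(angles) -> float:
    """Mean direction of a set of bearings, robust to 359/1 wraparound."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

def angular_deviation_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two bearings."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

ANOMALY_DEG = 90.0  # deviation threshold from the roadmap

def is_anomaly(observed: float, usual: float) -> bool:
    return angular_deviation_deg(observed, usual) >= ANOMALY_DEG
```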
Tier 3 — Advanced ✅
7. Cocktail party spatial filtering ✅
- When 2 speakers tracked, audio focus locks to target speaker's side
- Non-target beam suppressed via XVF3800 beam gating
- Auto-switches target when current goes silent and other starts talking
- Manual focus via `POST /speakers/focus?speaker=0|1`
- `DualAudioStream.focus_side` overrides energy-based beam selection
- Module: `multi_speaker.py`, `audio_stream.py` (`focus_side`)
8. Sound event localization (what + where) ✅
- YAMNet classification merged with triangulated position
- Every classified sound logged with angle, distance, proximity, side
- "Speech from 75° at conversational distance" not just "speech"
- Module: `spatial_scene.py` (`SoundEvent`, `observe`)
9. Head orientation inference ⏭️
- Needs fixed reference points from scene map + rotating head mount
- Math is trivial once prerequisites exist
- Prereqs: #6, physical rotating mount
10. Binaural recording for training data ✅
- Records left/right ear streams as stereo WAV in 5-minute segments
- Opt-in via `BINAURAL_RECORD=1` environment variable
- `GET /recording` — stats (segments, total seconds)
- Module: `binaural_recorder.py`
Tier 4 — Research
11. Learned spatial attention ⏭️
- Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
- Needs training data from #10 running for days/weeks
- Prereqs: #5, #6, #10 data collection
12. ITD (Interaural Time Difference) processing ✅
- Cross-correlates left/right ear processed audio (512 samples, ~32ms window)
- Finds sub-millisecond delay → bearing angle via speed of sound
- At 16kHz, 175mm: resolution ~62.5μs/sample ≈ 7° per sample
- Works with 2-channel firmware (no raw mics needed — correlates processed channels)
- Third independent angle estimate alongside DoA and ILD
- Module: `spatial.py` (`_compute_itd`)
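The cross-correlation estimate can be sketched as below; the windowing, normalization, and sign convention are assumptions, and the real `_compute_itd` may differ. At 16 kHz and 175 mm the physical lag tops out around ±8 samples, and near straight-ahead each sample of lag moves the arcsine by ~7°, matching the resolution figure above:

```python
import numpy as np

SAMPLE_RATE = 16000
EAR_SPACING_M = 0.175
SPEED_OF_SOUND = 343.0
# Max physically possible lag in samples (~8), plus one for slack
MAX_LAG = int(SAMPLE_RATE * EAR_SPACING_M / SPEED_OF_SOUND) + 1

def itd_bearing_deg(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate bearing from the cross-correlation lag between ears (sketch)."""
    corr = np.correlate(left, right, mode="full")
    mid = len(corr) // 2                          # zero-lag index
    lags = np.arange(-MAX_LAG, MAX_LAG + 1)
    best = lags[np.argmax(corr[mid - MAX_LAG:mid + MAX_LAG + 1])]
    tau = best / SAMPLE_RATE                      # delay in seconds
    # sin(theta) = c * tau / d; clip guards against numerical overshoot
    s = np.clip(tau * SPEED_OF_SOUND / EAR_SPACING_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))        # 0 deg = straight ahead
```

Restricting the search to `±MAX_LAG` is what keeps spurious correlation peaks outside the physically possible range from hijacking the estimate.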
Implementation status
✅ #1 Triangulation + gaze
✅ #2 Smooth tracking
✅ #3 Speaker-side awareness
✅ #4 Distance estimation + proximity zones
✅ #5 Multi-speaker separation + beam steering
✅ #6 Spatial audio scene mapping + anomaly detection
✅ #7 Cocktail party spatial filtering
✅ #8 Sound event localization (what + where + when)
⏭️ #9 Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation
10 of 12 features implemented in one session. The remaining two (#9, #11) need physical hardware changes or training data that accumulates over time.
Three-signal localization
The system now fuses three independent spatial estimates:
| Signal | Source | Strength | Best for |
|---|---|---|---|
| DoA | XVF3800 beamformer (auto-select beam) | Best overall | Horizontal angle |
| ILD | Volume difference between ears | Good for near sources | Distance + rough angle |
| ITD | Cross-correlation delay between ears | Good at low frequencies | Precise angle |
Combined, they give more robust localization than any single signal — the same three cues human hearing uses.
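A toy version of combining the three cues is a weighted circular mean of the bearing estimates; the weights below are purely illustrative, not the system's actual fusion rule:

```python
import math

def fuse_bearings(estimates) -> float:
    """Weighted circular mean of (angle_deg, weight) bearing estimates."""
    s = sum(w * math.sin(math.radians(a)) for a, w in estimates)
    c = sum(w * math.cos(math.radians(a)) for a, w in estimates)
    return math.degrees(math.atan2(s, c)) % 360.0

# e.g. DoA trusted most, then ITD, then ILD:
fused = fuse_bearings([(70.0, 0.6), (75.0, 0.3), (90.0, 0.1)])
# fused sits near 73.5, pulled toward the heavier DoA estimate
```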
Key discoveries during development
- `DOA_VALUE` (resid=20) is sluggish — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35)
- Processed DoA (index 0) = NaN when no speech → natural VAD indicator
- `AEC_SPENERGY_VALUES` is always zero on 2-channel firmware
- USB read responses have a 1-byte status header before data
- Exact `wLength` matters — `count * type_size + 1`, not rounded up
- 6-channel firmware breaks LED/control commands — use 2-channel only
- Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget
- ITD works on processed channels (not just raw mics) — less precise but free
Built during the Great Binaural Session of April 2026. "She hears in stereo now" 🦊👂👂