Binaural Hearing Roadmap

What two mic arrays make possible

All features are built on dual XVF3800 arrays mounted 175mm apart on the skull.


Tier 1 — High impact

1. Triangulated sound localization + eye gaze

  • Combine DoA angles from both arrays → compute (x, y) position of sound source
  • Post gaze coordinates to eye service → eyes track the speaker spatially
  • Uses the AUDIO_MGR_SELECTED_AZIMUTHS auto-select beam (not the sluggish DOA_VALUE)
  • VAD inferred from processed_doa: NaN means silence, a valid angle means speech
  • Module: spatial.py
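
A minimal sketch of the ray-intersection step, assuming azimuths are measured from straight ahead with the arrays at ±87.5mm on the x-axis. The coordinate convention and function name are illustrative, not spatial.py's actual code:

```python
import numpy as np

BASELINE_M = 0.175  # distance between the two arrays (from this doc)

def triangulate(az_left_deg: float, az_right_deg: float):
    """Intersect the two DoA rays to get a source position in metres.

    Assumed convention: azimuths measured from straight ahead (+y),
    positive toward the right (+x); arrays sit at (-B/2, 0) and (+B/2, 0).
    """
    p_l = np.array([-BASELINE_M / 2, 0.0])
    p_r = np.array([+BASELINE_M / 2, 0.0])
    d_l = np.array([np.sin(np.radians(az_left_deg)), np.cos(np.radians(az_left_deg))])
    d_r = np.array([np.sin(np.radians(az_right_deg)), np.cos(np.radians(az_right_deg))])
    # Solve p_l + t*d_l = p_r + s*d_r for the ray parameters t and s
    a = np.column_stack([d_l, -d_r])
    try:
        t, s = np.linalg.solve(a, p_r - p_l)
    except np.linalg.LinAlgError:
        return None  # rays parallel: source effectively at infinity
    if t <= 0 or s <= 0:
        return None  # intersection behind an array: inconsistent angles
    return p_l + t * d_l  # (x, y), from which gaze coordinates follow
```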

2. Active speaker tracking with smooth gaze

  • Exponential smoothing (α=0.4) prevents jitter
  • Idle drift back to center after 1.5s of silence
  • Gaze pushed to eye service at ≤2Hz with 5px min delta
  • Module: spatial.py, headmic.py (doa_track_loop)
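
Roughly how the smoothing, idle drift, and rate limiting fit together. Class and field names are made up; the real loop is doa_track_loop in headmic.py:

```python
import time

ALPHA = 0.4                 # smoothing factor from this doc
IDLE_DRIFT_S = 1.5          # silence before drifting back to centre
MIN_DELTA_PX = 5            # skip pushes smaller than this
MIN_PUSH_INTERVAL_S = 0.5   # <=2 Hz push rate

class GazeSmoother:
    """Sketch of the smoothing and rate-limit logic (names assumed)."""

    def __init__(self, center_px: float):
        self.center = center_px
        self.smoothed = center_px
        self.last_sent = center_px
        self.last_push = 0.0
        self.last_voice = time.monotonic()

    def update(self, raw_px: float | None) -> float | None:
        now = time.monotonic()
        if raw_px is None:                       # no speech (DoA was NaN)
            if now - self.last_voice < IDLE_DRIFT_S:
                return None
            raw_px = self.center                 # drift back toward centre
        else:
            self.last_voice = now
        # Exponential smoothing: new = a*raw + (1-a)*old
        self.smoothed = ALPHA * raw_px + (1 - ALPHA) * self.smoothed
        # Rate limit: <=2 Hz and at least 5 px of movement
        if now - self.last_push < MIN_PUSH_INTERVAL_S:
            return None
        if abs(self.smoothed - self.last_sent) < MIN_DELTA_PX:
            return None
        self.last_push, self.last_sent = now, self.smoothed
        return self.smoothed                     # caller POSTs this to the eye service
```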

3. Left/right speaker awareness

  • Resemblyzer speaker ID integrated, ready for enrollment
  • Spatial position associated with recognized speaker
  • POST /speakers/enroll-from-mic?name=Alex to enroll
  • Module: speaker_id.py, spatial.py
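
Resemblyzer's embedding API is real; the enrollment store, the 0.75 match threshold, and the helper names below are assumptions about what sits behind the endpoint:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
enrolled: dict[str, np.ndarray] = {}   # name -> voice embedding

def enroll(name: str, wav_path: str) -> None:
    """Hypothetical helper behind POST /speakers/enroll-from-mic."""
    enrolled[name] = encoder.embed_utterance(preprocess_wav(wav_path))

def identify(wav: np.ndarray, threshold: float = 0.75):
    """Match an utterance against enrolled voices by cosine similarity."""
    embed = encoder.embed_utterance(wav)    # embeddings are L2-normalised
    best, score = None, threshold
    for name, ref in enrolled.items():
        sim = float(np.dot(embed, ref))     # cosine sim for unit vectors
        if sim > score:
            best, score = name, sim
    return best                             # then tag with the current DoA side
```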

Tier 2 — High impact

4. Distance estimation (near/far)

  • ILD (Interaural Level Difference): volume gap between ears → distance
  • Fused with triangulated distance (70/30 weight)
  • Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
  • Module: spatial.py (_compute_ild, _ild_to_distance, _classify_proximity)
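
A sketch of the ILD computation, the fusion, and the zoning, assuming triangulation takes the 70% weight. The ILD-to-distance curve itself is device-specific and omitted here:

```python
import numpy as np

def compute_ild_db(left: np.ndarray, right: np.ndarray) -> float:
    """ILD in dB between the two ear signals (positive = left louder)."""
    eps = 1e-12
    rms_l = np.sqrt(np.mean(left.astype(np.float64) ** 2)) + eps
    rms_r = np.sqrt(np.mean(right.astype(np.float64) ** 2)) + eps
    return 20.0 * np.log10(rms_l / rms_r)

def fuse_distance(tri_dist_m: float, ild_dist_m: float) -> float:
    """70/30 fusion from this doc, assuming triangulation leads."""
    return 0.7 * tri_dist_m + 0.3 * ild_dist_m

def classify_proximity(dist_m: float) -> str:
    """Proximity zones from this doc."""
    if dist_m < 0.5:
        return "intimate"
    if dist_m < 2.0:
        return "conversational"
    if dist_m <= 5.0:
        return "across_room"
    return "far"
```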

5. Multi-speaker separation + selective attention

  • Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
  • Locks XVF3800 fixed beams onto each speaker after 1s stability
  • Auto-releases to free-running mode after 3s of single speaker
  • Beam gating silences the non-speaking beam
  • GET /speakers/tracked — positions, beam state, lock status
  • Module: multi_speaker.py, xvf3800.py (beam steering commands)
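
The lock/release timing as a small state machine. The structure is a guess at multi_speaker.py's logic; the thresholds are the ones listed above:

```python
import time

SEPARATION_DEG = 30.0    # minimum angular gap between the two speakers
LOCK_AFTER_S = 1.0       # both angles stable this long -> lock fixed beams
RELEASE_AFTER_S = 3.0    # one speaker this long -> back to free-running

class BeamLockStateMachine:
    """Sketch of the lock/release logic (structure is an assumption)."""

    def __init__(self):
        self.stable_since = None   # when two separated speakers appeared
        self.single_since = None   # when we dropped back to one speaker
        self.locked = False

    def update(self, angles: list[float]) -> str:
        """Return 'free', 'locking', or 'locked' given current DoA angles."""
        now = time.monotonic()
        two = len(angles) >= 2 and abs(angles[0] - angles[1]) >= SEPARATION_DEG
        if two:
            self.single_since = None
            if self.stable_since is None:
                self.stable_since = now
            if not self.locked and now - self.stable_since >= LOCK_AFTER_S:
                self.locked = True        # here: steer XVF3800 fixed beams
        else:
            self.stable_since = None
            if self.locked:
                if self.single_since is None:
                    self.single_since = now
                if now - self.single_since >= RELEASE_AFTER_S:
                    self.locked = False   # here: release to free-running mode
                    self.single_since = None
        return "locked" if self.locked else ("locking" if self.stable_since else "free")
```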

6. Spatial audio scene mapping

  • Persistent map of where each sound category usually comes from (30° bins)
  • Circular mean for usual direction, anomaly detection at 90°+ deviation
  • Saves to ~/.vixy/scene_map.json across restarts
  • GET /scene — learned directions per category + anomalies
  • GET /scene/events — recent what+where+when log
  • GET /scene/heatmap — angular distribution for visualization
  • Module: spatial_scene.py
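
Circular statistics are what keep the learned directions sane near the 0°/360° seam. A sketch of the two core helpers (function names assumed):

```python
import numpy as np

def circular_mean_deg(angles_deg: list[float]) -> float:
    """Mean direction that handles wraparound (350° and 10° average to 0°)."""
    rad = np.radians(angles_deg)
    mean = np.arctan2(np.mean(np.sin(rad)), np.mean(np.cos(rad)))
    return float(np.degrees(mean)) % 360.0

def angular_deviation_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two bearings."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def is_anomaly(observed_deg: float, usual_deg: float) -> bool:
    """Anomaly threshold from this doc: 90°+ off the learned direction."""
    return angular_deviation_deg(observed_deg, usual_deg) >= 90.0
```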

Tier 3 — Advanced

7. Cocktail party spatial filtering

  • When 2 speakers tracked, audio focus locks to target speaker's side
  • Non-target beam suppressed via XVF3800 beam gating
  • Auto-switches target when current goes silent and other starts talking
  • Manual focus via POST /speakers/focus?speaker=0|1
  • DualAudioStream.focus_side overrides energy-based beam selection
  • Module: multi_speaker.py, audio_stream.py (focus_side)
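
The override itself can be this small. A sketch only; DualAudioStream's internals are an assumption:

```python
def select_beam(energy_left: float, energy_right: float,
                focus_side: str | None = None) -> str:
    """A focused side wins outright; otherwise fall back to energy."""
    if focus_side is not None:    # set via POST /speakers/focus?speaker=0|1
        return focus_side         # 'left' or 'right'
    return "left" if energy_left >= energy_right else "right"
```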

8. Sound event localization (what + where)

  • YAMNet classification merged with triangulated position
  • Every classified sound logged with angle, distance, proximity, side
  • "Speech from 75° at conversational distance" not just "speech"
  • Module: spatial_scene.py (SoundEvent, observe)
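
A plausible shape for the event record, matching the fields listed above. SoundEvent is the doc's name; the fields and method are assumptions:

```python
from dataclasses import dataclass, field
import time

@dataclass
class SoundEvent:
    """Sketch of a classified-and-localised sound event."""
    label: str          # YAMNet class, e.g. "Speech"
    confidence: float   # classifier score
    angle_deg: float    # triangulated bearing
    distance_m: float   # fused distance estimate
    proximity: str      # "intimate" / "conversational" / "across_room" / "far"
    side: str           # "left" / "right" / "center"
    ts: float = field(default_factory=time.time)

    def describe(self) -> str:
        return f"{self.label} from {self.angle_deg:.0f}° at {self.proximity} distance"
```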

9. Head orientation inference ⏭️

  • Needs fixed reference points from scene map + rotating head mount
  • Math is trivial once prerequisites exist
  • Prereqs: #6, physical rotating mount

10. Binaural recording for training data

  • Records left/right ear streams as stereo WAV in 5-minute segments
  • Opt-in via BINAURAL_RECORD=1 environment variable
  • GET /recording — stats (segments, total seconds)
  • Module: binaural_recorder.py
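
A sketch of segmented stereo capture using the stdlib wave module. Class name and file layout are assumptions; the real recorder may differ:

```python
import os
import time
import wave
import numpy as np

SEGMENT_S = 300        # 5-minute segments, per this doc
SAMPLE_RATE = 16_000   # assumed XVF3800 output rate

class BinauralRecorder:
    """Sketch of segmented stereo capture; only constructed when
    the caller sees BINAURAL_RECORD=1 in the environment."""

    def __init__(self, out_dir: str = os.path.expanduser("~/.vixy/binaural")):
        os.makedirs(out_dir, exist_ok=True)
        self.out_dir = out_dir
        self.wav = None
        self.opened_at = 0.0

    def write(self, left: np.ndarray, right: np.ndarray) -> None:
        """Interleave left/right int16 frames into the current segment."""
        now = time.time()
        if self.wav is None or now - self.opened_at >= SEGMENT_S:
            if self.wav:
                self.wav.close()
            path = os.path.join(self.out_dir, f"binaural_{int(now)}.wav")
            self.wav = wave.open(path, "wb")
            self.wav.setnchannels(2)          # stereo: left ear, right ear
            self.wav.setsampwidth(2)          # int16
            self.wav.setframerate(SAMPLE_RATE)
            self.opened_at = now
        stereo = np.column_stack([left, right]).astype(np.int16)
        self.wav.writeframes(stereo.tobytes())
```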

Tier 4 — Research

11. Learned spatial attention ⏭️

  • Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
  • Needs training data from #10 running for days/weeks
  • Prereqs: #5, #6, #10 data collection

12. ITD (Interaural Time Difference) processing

  • Cross-correlates left/right ear processed audio (512 samples, ~32ms window)
  • Finds sub-millisecond delay → bearing angle via speed of sound
  • At 16kHz, 175mm: resolution ~62.5μs/sample ≈ 7° per sample
  • Works with 2-channel firmware (no raw mics needed — correlates processed channels)
  • Third independent angle estimate alongside DoA and ILD
  • Module: spatial.py (_compute_itd)
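
A sketch of the cross-correlation math under the numbers above (16kHz, 175mm, so only ±~510μs of lag is physically possible). Sign conventions and names are assumptions:

```python
import numpy as np

SAMPLE_RATE = 16_000
BASELINE_M = 0.175
SPEED_OF_SOUND = 343.0
MAX_LAG = int(np.ceil(BASELINE_M / SPEED_OF_SOUND * SAMPLE_RATE))  # ~9 samples

def itd_bearing_deg(left: np.ndarray, right: np.ndarray) -> float | None:
    """Bearing from the inter-ear delay of a short (e.g. 512-sample) window."""
    left = left - left.mean()
    right = right - right.mean()
    # Cross-correlate over physically possible lags only
    lags = range(-MAX_LAG, MAX_LAG + 1)
    scores = [np.dot(left[max(0, -k):len(left) - max(0, k)],
                     right[max(0, k):len(right) - max(0, -k)]) for k in lags]
    best = int(np.argmax(scores))
    tau = (best - MAX_LAG) / SAMPLE_RATE       # delay in seconds
    x = SPEED_OF_SOUND * tau / BASELINE_M      # sin(theta)
    if abs(x) > 1.0:
        return None                            # outside the physical range
    return float(np.degrees(np.arcsin(x)))     # 0° = straight ahead
```

At these numbers one sample of lag is 62.5μs, and arcsin(343 × 62.5e-6 / 0.175) ≈ 7°, which is where the "7° per sample" figure above comes from.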

Implementation status

✅ #1  Triangulation + gaze
✅ #2  Smooth tracking
✅ #3  Speaker-side awareness
✅ #4  Distance estimation + proximity zones
✅ #5  Multi-speaker separation + beam steering
✅ #6  Spatial audio scene mapping + anomaly detection
✅ #7  Cocktail party spatial filtering
✅ #8  Sound event localization (what + where + when)
⏭️ #9  Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation

10 of 12 features implemented in one session. The remaining two (#9, #11) need physical hardware changes or training data that accumulates over time.

Three-signal localization

The system now fuses three independent spatial estimates:

Signal  Source                                  Strength                 Best for
DoA     XVF3800 beamformer (auto-select beam)   Best overall             Horizontal angle
ILD     Volume difference between ears          Good for near sources    Distance + rough angle
ITD     Cross-correlation delay between ears    Good at low frequencies  Precise angle

Combined, they give more robust localization than any single signal — the same three cues human hearing uses.
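
One illustrative way to fuse them: a weighted circular mean that tolerates a missing estimate. The weights here are invented, not the ones spatial.py uses:

```python
import numpy as np

def fuse_bearings(doa_deg, ild_deg, itd_deg, weights=(0.5, 0.2, 0.3)):
    """Weighted circular mean of up to three bearing estimates.

    Any estimate may be None (e.g. ITD out of range); it is simply
    dropped and the remaining weights carry the vote.
    """
    est = [(a, w) for a, w in zip((doa_deg, ild_deg, itd_deg), weights)
           if a is not None]
    if not est:
        return None
    x = sum(w * np.cos(np.radians(a)) for a, w in est)
    y = sum(w * np.sin(np.radians(a)) for a, w in est)
    return float(np.degrees(np.arctan2(y, x))) % 360.0
```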

Key discoveries during development

  1. DOA_VALUE (resid=20) is sluggish — use AUDIO_MGR_SELECTED_AZIMUTHS (resid=35)
  2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
  3. AEC_SPENERGY_VALUES is always zero on 2-channel firmware
  4. USB read responses have a 1-byte status header before data
  5. Exact wLength matters — count * type_size + 1, not rounded up
  6. 6-channel firmware breaks LED/control commands — use 2-channel only
  7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget
  8. ITD works on processed channels (not just raw mics) — less precise but free
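
Discoveries 4 and 5 in code form: a hedged pyusb sketch of a control read. How resid and the command id map onto the setup packet is an assumption; see xvf3800.py for the real transfer:

```python
import struct
import usb.core  # dev = usb.core.find(idVendor=..., idProduct=...)

def read_resid(dev, resid: int, cmd: int, count: int, fmt: str = "<f"):
    """Illustrative vendor control read for one XVF3800 resource."""
    type_size = struct.calcsize(fmt)
    wlength = count * type_size + 1      # exact length incl. the 1 status byte
    data = dev.ctrl_transfer(
        0x80 | 0x40,                     # IN | vendor request
        cmd,                             # device-specific request code (assumed)
        0,                               # wValue
        resid,                           # wIndex carrying the resource id (assumed)
        wlength,
    )
    status, payload = data[0], bytes(data[1:])   # discovery 4: status byte first
    if status != 0:
        raise IOError(f"control read failed, status={status}")
    return [struct.unpack_from(fmt, payload, i * type_size)[0]
            for i in range(count)]
```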

Built during the Great Binaural Session of April 2026. "She hears in stereo now." 🦊👂👂