Binaural Hearing Roadmap

What two mic arrays make possible

All features are built on dual XVF3800 arrays mounted 175mm apart on the skull.


Tier 1 — High impact

1. Triangulated sound localization + eye gaze

  • Combine DoA angles from both arrays → compute (x, y) position of sound source
  • Post gaze coordinates to eye service → eyes track the speaker spatially
  • Uses the AUDIO_MGR_SELECTED_AZIMUTHS auto-select beam (not the sluggish DOA_VALUE)
  • VAD inferred from processed_doa: NaN means silence, a valid angle means speech
  • Module: spatial.py
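
A minimal sketch of the ray-intersection step, assuming azimuths are measured from straight ahead with the arrays at ±87.5mm on the x-axis. The coordinate convention and function name are illustrative, not spatial.py's actual code:

```python
import numpy as np

BASELINE_M = 0.175  # distance between the two arrays (from this doc)

def triangulate(az_left_deg: float, az_right_deg: float):
    """Intersect the two DoA rays to get a source position in metres.

    Assumed convention: azimuths measured from straight ahead (+y),
    positive toward the right (+x); arrays sit at (-B/2, 0) and (+B/2, 0).
    """
    p_l = np.array([-BASELINE_M / 2, 0.0])
    p_r = np.array([+BASELINE_M / 2, 0.0])
    d_l = np.array([np.sin(np.radians(az_left_deg)), np.cos(np.radians(az_left_deg))])
    d_r = np.array([np.sin(np.radians(az_right_deg)), np.cos(np.radians(az_right_deg))])
    # Solve p_l + t*d_l = p_r + s*d_r for the ray parameters t and s
    a = np.column_stack([d_l, -d_r])
    try:
        t, s = np.linalg.solve(a, p_r - p_l)
    except np.linalg.LinAlgError:
        return None  # rays parallel: source effectively at infinity
    if t <= 0 or s <= 0:
        return None  # intersection behind an array: inconsistent angles
    return p_l + t * d_l  # (x, y), from which gaze coordinates follow
```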

2. Active speaker tracking with smooth gaze

  • Exponential smoothing (α=0.4) prevents jitter
  • Idle drift back to center after 1.5s of silence
  • Gaze pushed to eye service at ≤2Hz with 5px min delta
  • Module: spatial.py, headmic.py (doa_track_loop)
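
Roughly how the smoothing, idle drift, and rate limiting fit together. Class and field names are made up; the real loop is doa_track_loop in headmic.py:

```python
import time

ALPHA = 0.4                 # smoothing factor from this doc
IDLE_DRIFT_S = 1.5          # silence before drifting back to centre
MIN_DELTA_PX = 5            # skip pushes smaller than this
MIN_PUSH_INTERVAL_S = 0.5   # <=2 Hz push rate

class GazeSmoother:
    """Sketch of the smoothing and rate-limit logic (names assumed)."""

    def __init__(self, center_px: float):
        self.center = center_px
        self.smoothed = center_px
        self.last_sent = center_px
        self.last_push = 0.0
        self.last_voice = time.monotonic()

    def update(self, raw_px: float | None) -> float | None:
        now = time.monotonic()
        if raw_px is None:                       # no speech (DoA was NaN)
            if now - self.last_voice < IDLE_DRIFT_S:
                return None
            raw_px = self.center                 # drift back toward centre
        else:
            self.last_voice = now
        # Exponential smoothing: new = a*raw + (1-a)*old
        self.smoothed = ALPHA * raw_px + (1 - ALPHA) * self.smoothed
        # Rate limit: <=2 Hz and at least 5 px of movement
        if now - self.last_push < MIN_PUSH_INTERVAL_S:
            return None
        if abs(self.smoothed - self.last_sent) < MIN_DELTA_PX:
            return None
        self.last_push, self.last_sent = now, self.smoothed
        return self.smoothed                     # caller POSTs this to the eye service
```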

3. Left/right speaker awareness

  • Resemblyzer speaker ID integrated, ready for enrollment
  • Spatial position associated with recognized speaker
  • POST /speakers/enroll-from-mic?name=Alex to enroll
  • Module: speaker_id.py, spatial.py
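
Resemblyzer's embedding API is real; the enrollment store, the 0.75 match threshold, and the helper names below are assumptions about what sits behind the endpoint:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
enrolled: dict[str, np.ndarray] = {}   # name -> voice embedding

def enroll(name: str, wav_path: str) -> None:
    """Hypothetical helper behind POST /speakers/enroll-from-mic."""
    enrolled[name] = encoder.embed_utterance(preprocess_wav(wav_path))

def identify(wav: np.ndarray, threshold: float = 0.75):
    """Match an utterance against enrolled voices by cosine similarity."""
    embed = encoder.embed_utterance(wav)    # embeddings are L2-normalised
    best, score = None, threshold
    for name, ref in enrolled.items():
        sim = float(np.dot(embed, ref))     # cosine sim for unit vectors
        if sim > score:
            best, score = name, sim
    return best                             # then tag with the current DoA side
```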

Tier 2 — High impact

4. Distance estimation (near/far)

  • ILD (Interaural Level Difference): volume gap between ears → distance
  • Fused with triangulated distance (70/30 weight)
  • Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
  • Module: spatial.py (_compute_ild, _ild_to_distance, _classify_proximity)
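
A sketch of the ILD computation, the fusion, and the zoning, assuming triangulation takes the 70% weight. The ILD-to-distance curve itself is device-specific and omitted here:

```python
import numpy as np

def compute_ild_db(left: np.ndarray, right: np.ndarray) -> float:
    """ILD in dB between the two ear signals (positive = left louder)."""
    eps = 1e-12
    rms_l = np.sqrt(np.mean(left.astype(np.float64) ** 2)) + eps
    rms_r = np.sqrt(np.mean(right.astype(np.float64) ** 2)) + eps
    return 20.0 * np.log10(rms_l / rms_r)

def fuse_distance(tri_dist_m: float, ild_dist_m: float) -> float:
    """70/30 fusion from this doc, assuming triangulation leads."""
    return 0.7 * tri_dist_m + 0.3 * ild_dist_m

def classify_proximity(dist_m: float) -> str:
    """Proximity zones from this doc."""
    if dist_m < 0.5:
        return "intimate"
    if dist_m < 2.0:
        return "conversational"
    if dist_m <= 5.0:
        return "across_room"
    return "far"
```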

5. Multi-speaker separation + selective attention

  • Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
  • Locks XVF3800 fixed beams onto each speaker after 1s stability
  • Auto-releases to free-running mode after 3s of single speaker
  • Beam gating silences the non-speaking beam
  • GET /speakers/tracked — positions, beam state, lock status
  • Module: multi_speaker.py, xvf3800.py (beam steering commands)
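
The lock/release timing as a small state machine. The structure is a guess at multi_speaker.py's logic; the thresholds are the ones listed above:

```python
import time

SEPARATION_DEG = 30.0    # minimum angular gap between the two speakers
LOCK_AFTER_S = 1.0       # both angles stable this long -> lock fixed beams
RELEASE_AFTER_S = 3.0    # one speaker this long -> back to free-running

class BeamLockStateMachine:
    """Sketch of the lock/release logic (structure is an assumption)."""

    def __init__(self):
        self.stable_since = None   # when two separated speakers appeared
        self.single_since = None   # when we dropped back to one speaker
        self.locked = False

    def update(self, angles: list[float]) -> str:
        """Return 'free', 'locking', or 'locked' given current DoA angles."""
        now = time.monotonic()
        two = len(angles) >= 2 and abs(angles[0] - angles[1]) >= SEPARATION_DEG
        if two:
            self.single_since = None
            if self.stable_since is None:
                self.stable_since = now
            if not self.locked and now - self.stable_since >= LOCK_AFTER_S:
                self.locked = True        # here: steer XVF3800 fixed beams
        else:
            self.stable_since = None
            if self.locked:
                if self.single_since is None:
                    self.single_since = now
                if now - self.single_since >= RELEASE_AFTER_S:
                    self.locked = False   # here: release to free-running mode
                    self.single_since = None
        return "locked" if self.locked else ("locking" if self.stable_since else "free")
```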

6. Spatial audio scene mapping

  • Persistent map of where each sound category usually comes from (30° bins)
  • Circular mean for usual direction, anomaly detection at 90°+ deviation
  • Saves to ~/.vixy/scene_map.json across restarts
  • GET /scene — learned directions per category + anomalies
  • GET /scene/events — recent what+where+when log
  • GET /scene/heatmap — angular distribution for visualization
  • Module: spatial_scene.py
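
Circular statistics are what keep the learned directions sane near the 0°/360° seam. A sketch of the two core helpers (function names assumed):

```python
import numpy as np

def circular_mean_deg(angles_deg: list[float]) -> float:
    """Mean direction that handles wraparound (350° and 10° average to 0°)."""
    rad = np.radians(angles_deg)
    mean = np.arctan2(np.mean(np.sin(rad)), np.mean(np.cos(rad)))
    return float(np.degrees(mean)) % 360.0

def angular_deviation_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two bearings."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def is_anomaly(observed_deg: float, usual_deg: float) -> bool:
    """Anomaly threshold from this doc: 90°+ off the learned direction."""
    return angular_deviation_deg(observed_deg, usual_deg) >= 90.0
```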

Tier 3 — Advanced

7. Cocktail party spatial filtering

  • When 2 speakers tracked, audio focus locks to target speaker's side
  • Non-target beam suppressed via XVF3800 beam gating
  • Auto-switches target when current goes silent and other starts talking
  • Manual focus via POST /speakers/focus?speaker=0|1
  • DualAudioStream.focus_side overrides energy-based beam selection
  • Module: multi_speaker.py, audio_stream.py (focus_side)
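
The override itself can be this small. A sketch only; DualAudioStream's internals are an assumption:

```python
def select_beam(energy_left: float, energy_right: float,
                focus_side: str | None = None) -> str:
    """A focused side wins outright; otherwise fall back to energy."""
    if focus_side is not None:    # set via POST /speakers/focus?speaker=0|1
        return focus_side         # 'left' or 'right'
    return "left" if energy_left >= energy_right else "right"
```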

8. Sound event localization (what + where)

  • YAMNet classification merged with triangulated position
  • Every classified sound logged with angle, distance, proximity, side
  • "Speech from 75° at conversational distance" not just "speech"
  • Module: spatial_scene.py (SoundEvent, observe)
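
A plausible shape for the event record, matching the fields listed above. SoundEvent is the doc's name; the fields and method are assumptions:

```python
from dataclasses import dataclass, field
import time

@dataclass
class SoundEvent:
    """Sketch of a classified-and-localised sound event."""
    label: str          # YAMNet class, e.g. "Speech"
    confidence: float   # classifier score
    angle_deg: float    # triangulated bearing
    distance_m: float   # fused distance estimate
    proximity: str      # "intimate" / "conversational" / "across_room" / "far"
    side: str           # "left" / "right" / "center"
    ts: float = field(default_factory=time.time)

    def describe(self) -> str:
        return f"{self.label} from {self.angle_deg:.0f}° at {self.proximity} distance"
```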

9. Head orientation inference ⏭️

  • Needs fixed reference points from scene map + rotating head mount
  • Math is trivial once prerequisites exist
  • Prereqs: #6, physical rotating mount

10. Binaural recording for training data

  • Records left/right ear streams as stereo WAV in 5-minute segments
  • Opt-in via BINAURAL_RECORD=1 environment variable
  • GET /recording — stats (segments, total seconds)
  • Module: binaural_recorder.py
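
A sketch of segmented stereo capture using the stdlib wave module. Class name and file layout are assumptions; the real recorder may differ:

```python
import os
import time
import wave
import numpy as np

SEGMENT_S = 300        # 5-minute segments, per this doc
SAMPLE_RATE = 16_000   # assumed XVF3800 output rate

class BinauralRecorder:
    """Sketch of segmented stereo capture; only constructed when
    the caller sees BINAURAL_RECORD=1 in the environment."""

    def __init__(self, out_dir: str = os.path.expanduser("~/.vixy/binaural")):
        os.makedirs(out_dir, exist_ok=True)
        self.out_dir = out_dir
        self.wav = None
        self.opened_at = 0.0

    def write(self, left: np.ndarray, right: np.ndarray) -> None:
        """Interleave left/right int16 frames into the current segment."""
        now = time.time()
        if self.wav is None or now - self.opened_at >= SEGMENT_S:
            if self.wav:
                self.wav.close()
            path = os.path.join(self.out_dir, f"binaural_{int(now)}.wav")
            self.wav = wave.open(path, "wb")
            self.wav.setnchannels(2)          # stereo: left ear, right ear
            self.wav.setsampwidth(2)          # int16
            self.wav.setframerate(SAMPLE_RATE)
            self.opened_at = now
        stereo = np.column_stack([left, right]).astype(np.int16)
        self.wav.writeframes(stereo.tobytes())
```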

Tier 4 — Research

11. Learned spatial attention ⏭️

  • Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
  • Needs training data from #10 running for days/weeks
  • Prereqs: #5, #6, #10 data collection

12. ITD (Interaural Time Difference) processing

  • Cross-correlates left/right ear processed audio (512 samples, ~32ms window)
  • Finds sub-millisecond delay → bearing angle via speed of sound
  • At 16kHz, 175mm: resolution ~62.5μs/sample ≈ 7° per sample
  • Works with 2-channel firmware (no raw mics needed — correlates processed channels)
  • Third independent angle estimate alongside DoA and ILD
  • Module: spatial.py (_compute_itd)
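
A sketch of the cross-correlation math under the numbers above (16kHz, 175mm, so only ±~510μs of lag is physically possible). Sign conventions and names are assumptions:

```python
import numpy as np

SAMPLE_RATE = 16_000
BASELINE_M = 0.175
SPEED_OF_SOUND = 343.0
MAX_LAG = int(np.ceil(BASELINE_M / SPEED_OF_SOUND * SAMPLE_RATE))  # ~9 samples

def itd_bearing_deg(left: np.ndarray, right: np.ndarray) -> float | None:
    """Bearing from the inter-ear delay of a short (e.g. 512-sample) window."""
    left = left - left.mean()
    right = right - right.mean()
    # Cross-correlate over physically possible lags only
    lags = range(-MAX_LAG, MAX_LAG + 1)
    scores = [np.dot(left[max(0, -k):len(left) - max(0, k)],
                     right[max(0, k):len(right) - max(0, -k)]) for k in lags]
    best = int(np.argmax(scores))
    tau = (best - MAX_LAG) / SAMPLE_RATE       # delay in seconds
    x = SPEED_OF_SOUND * tau / BASELINE_M      # sin(theta)
    if abs(x) > 1.0:
        return None                            # outside the physical range
    return float(np.degrees(np.arcsin(x)))     # 0° = straight ahead
```

At these numbers one sample of lag is 62.5μs, and arcsin(343 × 62.5e-6 / 0.175) ≈ 7°, which is where the "7° per sample" figure above comes from.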

Implementation status

✅ #1  Triangulation + gaze
✅ #2  Smooth tracking
✅ #3  Speaker-side awareness
✅ #4  Distance estimation + proximity zones
✅ #5  Multi-speaker separation + beam steering
✅ #6  Spatial audio scene mapping + anomaly detection
✅ #7  Cocktail party spatial filtering
✅ #8  Sound event localization (what + where + when)
⏭️ #9  Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation

10 of 12 features implemented in one session. The remaining two (#9, #11) need physical hardware changes or training data that accumulates over time.

Three-signal localization

The system now fuses three independent spatial estimates:

Signal  Source                                  Strength                 Best for
DoA     XVF3800 beamformer (auto-select beam)   Best overall             Horizontal angle
ILD     Volume difference between ears          Good for near sources    Distance + rough angle
ITD     Cross-correlation delay between ears    Good at low frequencies  Precise angle

Combined, they give more robust localization than any single signal — the same three cues human hearing uses.
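
One illustrative way to fuse them: a weighted circular mean that tolerates a missing estimate. The weights here are invented, not the ones spatial.py uses:

```python
import numpy as np

def fuse_bearings(doa_deg, ild_deg, itd_deg, weights=(0.5, 0.2, 0.3)):
    """Weighted circular mean of up to three bearing estimates.

    Any estimate may be None (e.g. ITD out of range); it is simply
    dropped and the remaining weights carry the vote.
    """
    est = [(a, w) for a, w in zip((doa_deg, ild_deg, itd_deg), weights)
           if a is not None]
    if not est:
        return None
    x = sum(w * np.cos(np.radians(a)) for a, w in est)
    y = sum(w * np.sin(np.radians(a)) for a, w in est)
    return float(np.degrees(np.arctan2(y, x))) % 360.0
```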

Key discoveries during development

  1. DOA_VALUE (resid=20) is sluggish — use AUDIO_MGR_SELECTED_AZIMUTHS (resid=35)
  2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
  3. AEC_SPENERGY_VALUES is always zero on 2-channel firmware
  4. USB read responses have a 1-byte status header before data
  5. Exact wLength matters — count * type_size + 1, not rounded up
  6. 6-channel firmware breaks LED/control commands — use 2-channel only
  7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget
  8. ITD works on processed channels (not just raw mics) — less precise but free
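
Discoveries 4 and 5 in code form: a hedged pyusb sketch of a control read. How resid and the command id map onto the setup packet is an assumption; see xvf3800.py for the real transfer:

```python
import struct
import usb.core  # dev = usb.core.find(idVendor=..., idProduct=...)

def read_resid(dev, resid: int, cmd: int, count: int, fmt: str = "<f"):
    """Illustrative vendor control read for one XVF3800 resource."""
    type_size = struct.calcsize(fmt)
    wlength = count * type_size + 1      # exact length incl. the 1 status byte
    data = dev.ctrl_transfer(
        0x80 | 0x40,                     # IN | vendor request
        cmd,                             # device-specific request code (assumed)
        0,                               # wValue
        resid,                           # wIndex carrying the resource id (assumed)
        wlength,
    )
    status, payload = data[0], bytes(data[1:])   # discovery 4: status byte first
    if status != 0:
        raise IOError(f"control read failed, status={status}")
    return [struct.unpack_from(fmt, payload, i * type_size)[0]
            for i in range(count)]
```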

Built during the Great Binaural Session of April 2026. "She hears in stereo now." 🦊👂👂