
Binaural Hearing Roadmap

What two mic arrays make possible

Ranked by impact × feasibility. All build on the existing dual XVF3800 + /doa endpoint.


Tier 1 — High impact, ready to build now

1. Triangulated sound localization + eye gaze

  • Combine DoA angles from both arrays → compute (x, y) position of sound source
  • Post gaze coordinates to eye service → eyes track the speaker spatially
  • Front/back disambiguation (single array can't tell 30° front from 30° rear)
  • Prereqs: Known array positions (measured once), basic trig
  • Complexity: Low — ~100 lines of math + a gaze-push thread (sketch below)
  • Impact: Huge — eyes actually follow the person, not just shift left/right
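
A minimal sketch of the triangulation step, assuming the two arrays sit 15 cm apart at the "ears", angles are measured from straight ahead (positive toward the right ear), and coordinates are in metres; the spacing, origin, and angle convention here are placeholders for the values stored in config:

```python
import math

# Assumed geometry: origin at the midpoint between the arrays, x to the right,
# y straight ahead, 15 cm "ear" spacing. Real values come from config.
LEFT_ARRAY = (-0.075, 0.0)
RIGHT_ARRAY = (0.075, 0.0)

def bearing_to_direction(deg: float):
    """Unit vector for a DoA measured from the +y axis, positive toward +x."""
    rad = math.radians(deg)
    return (math.sin(rad), math.cos(rad))

def triangulate(doa_left_deg: float, doa_right_deg: float):
    """Intersect the two bearing rays; returns (x, y) in metres or None."""
    (x1, y1), (dx1, dy1) = LEFT_ARRAY, bearing_to_direction(doa_left_deg)
    (x2, y2), (dx2, dy2) = RIGHT_ARRAY, bearing_to_direction(doa_right_deg)
    denom = dx1 * dy2 - dy1 * dx2          # 2D cross product of the directions
    if abs(denom) < 1e-6:
        return None                        # rays nearly parallel: source far away
    t1 = ((x2 - x1) * dy2 - (y2 - y1) * dx2) / denom
    if t1 < 0:
        return None                        # intersection behind the arrays
    return (x1 + t1 * dx1, y1 + t1 * dy1)
```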

2. Active speaker tracking with smooth gaze

  • Continuously track the dominant sound source as it moves
  • Smooth the gaze updates (low-pass filter) so eyes don't jitter
  • When VAD drops, eyes drift back to center (natural idle behavior)
  • Prereqs: #1
  • Complexity: Low — Kalman filter or exponential smoothing on top of #1 (sketch below)
  • Impact: Makes her feel present and attentive
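
A sketch of the exponential smoothing plus idle drift; the constants and the neutral gaze at (0, 0) are illustrative, not tuned values:

```python
class SmoothedGaze:
    """Exponential smoothing of gaze targets, drifting back to centre when
    voice activity drops. Constants are illustrative, not tuned values."""

    def __init__(self, alpha: float = 0.2, idle_decay: float = 0.05):
        self.alpha = alpha              # higher = snappier eye movement
        self.idle_decay = idle_decay    # per-update relaxation toward centre
        self.x = 0.0
        self.y = 0.0

    def update(self, target, voice_active: bool):
        """target is an (x, y) gaze coordinate or None."""
        if voice_active and target is not None:
            tx, ty = target
            self.x += self.alpha * (tx - self.x)   # exponential moving average
            self.y += self.alpha * (ty - self.y)
        else:
            self.x *= 1.0 - self.idle_decay        # idle: drift back to (0, 0)
            self.y *= 1.0 - self.idle_decay
        return self.x, self.y
```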

3. Left/right speaker awareness

  • Know which side each speaker is on, combine with speaker ID
  • "Alex is on my left" vs "unknown person on my right"
  • Feed into LYRA context so responses can reference spatial relationships
  • Prereqs: #1 + existing speaker ID
  • Complexity: Medium — associate speaker embeddings with spatial positions (sketch below)
  • Impact: Multi-person conversations become spatially grounded
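
A sketch of associating an utterance embedding with a side of the head, assuming Resemblyzer-style voice embeddings and a triangulated (x, y) from #1; the registry, threshold, and output fields are made up for illustration:

```python
import numpy as np

# Hypothetical registry: enrolled name -> reference voice embedding.
ENROLLED: dict[str, np.ndarray] = {}

def identify_and_place(embedding: np.ndarray, position_xy, threshold: float = 0.75):
    """Cosine-match the utterance embedding against enrolled speakers and tag
    the result with the side of the head the triangulated position falls on."""
    name, best = "unknown person", 0.0
    for candidate, ref in ENROLLED.items():
        score = float(np.dot(embedding, ref) /
                      (np.linalg.norm(embedding) * np.linalg.norm(ref) + 1e-9))
        if score > best:
            name, best = candidate, score
    if best < threshold:
        name = "unknown person"
    side = "left" if position_xy[0] < 0 else "right"
    return {"speaker": name, "side": side, "score": round(best, 3)}
```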

Tier 2 — High impact, moderate effort

4. Distance estimation (near/far)

  • Interaural Level Difference (ILD): close sources produce a bigger level gap between the two ears
  • Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
  • Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+)
  • Prereqs: #1, calibration with known distances
  • Complexity: Medium — ILD from processed channels is easy, ITD needs raw mics (sketch below)
  • Impact: Interaction style adapts to proximity (whisper vs. room voice)
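
A sketch of the ILD measurement and proximity binning using the bins named above; the 6 dB fallback threshold and the preference for triangulated range are assumptions that need calibration:

```python
import math
from typing import Optional

def ild_db(rms_left: float, rms_right: float) -> float:
    """Interaural level difference in dB (positive = louder at the left ear)."""
    eps = 1e-9
    return 20.0 * math.log10((rms_left + eps) / (rms_right + eps))

def proximity_zone(range_m: Optional[float], ild: float) -> str:
    """Bin into the zones above. Trust triangulated range when available;
    otherwise treat a large |ILD| as a hint the source is very close to one
    ear. The 6 dB threshold is illustrative."""
    if range_m is not None:
        if range_m < 0.5:
            return "intimate"
        if range_m < 2.0:
            return "conversational"
        return "across room"
    return "intimate" if abs(ild) > 6.0 else "conversational"
```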

5. Multi-speaker separation + selective attention

  • Lock each array's beam to a different speaker simultaneously
  • Active speaker gets primary audio feed (wake word, transcription)
  • Secondary speaker monitored for interruptions or wake word
  • Switch attention on cue ("Hey Vivi" from the other side)
  • Prereqs: #3, understanding of XVF3800 beam steering commands
  • Complexity: Medium-high — need to control beamformer direction per-array (sketch below)
  • Impact: Natural multi-person conversations, not just one-at-a-time
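
Beam steering itself needs the XVF3800 commands mentioned above, but the attention-switching logic on top of it can be sketched independently; the field names, hold time, and wake-word flag are placeholders:

```python
import time
from dataclasses import dataclass

@dataclass
class ArrayState:
    name: str                # "left" or "right"
    voice_active: bool       # VAD state for this array's beam
    heard_wake_word: bool    # e.g. "Hey Vivi" detected on this array's feed

def select_primary(arrays, current: str, last_switch: float, hold_s: float = 2.0):
    """Decide which array's beam feeds wake word + transcription.
    A wake word on the secondary side steals attention immediately; otherwise
    the current speaker is held for at least hold_s to avoid thrashing.
    Per-array beamformer steering is out of scope for this sketch."""
    now = time.time()
    for a in arrays:
        if a.name != current and a.heard_wake_word:
            return a.name, now                 # explicit cue: switch sides
    if now - last_switch < hold_s:
        return current, last_switch            # hysteresis window still open
    talking = [a for a in arrays if a.voice_active]
    if len(talking) == 1 and talking[0].name != current:
        return talking[0].name, now            # only the other side is speaking
    return current, last_switch
```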

6. Spatial audio scene mapping

  • Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
  • Learn from repeated sound sources over hours/days
  • Detect anomalies: "sound from an unusual direction"
  • Prereqs: #1, persistent storage, classification by direction
  • Complexity: Medium — accumulate (direction, category) pairs, cluster over time (sketch below)
  • Impact: Environmental awareness, contextual anomaly detection
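
A sketch of accumulating (direction, category) observations into coarse sectors with a novelty check; the sector width, file location, and the "fewer than 3 prior observations = unusual" rule are all assumptions:

```python
import json
from collections import defaultdict
from pathlib import Path

BIN_DEG = 15                              # sector width, illustrative
SCENE_FILE = Path("scene_map.json")       # hypothetical persistence location

class SceneMap:
    """Counts how often each sound category has arrived from each sector and
    flags arrivals from directions where that category is rarely heard."""

    def __init__(self):
        self.counts = defaultdict(int)    # (sector_deg, category) -> count
        if SCENE_FILE.exists():
            stored = json.loads(SCENE_FILE.read_text())
            self.counts.update({tuple(json.loads(k)): v for k, v in stored.items()})

    def observe(self, direction_deg: float, category: str) -> bool:
        sector = int(direction_deg // BIN_DEG) * BIN_DEG
        unusual = self.counts[(sector, category)] < 3
        self.counts[(sector, category)] += 1
        # Naive persistence: rewrite the whole map on every observation.
        SCENE_FILE.write_text(json.dumps(
            {json.dumps(k): v for k, v in self.counts.items()}))
        return unusual
```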

Tier 3 — Cool, needs more infrastructure

7. Cocktail party spatial filtering

  • When multiple sound sources active, use both arrays to null out interference
  • Focus beam on target speaker, suppress others spatially
  • Prereqs: #5, possibly raw mic access (6-channel firmware)
  • Complexity: High — adaptive beamforming, may need custom DSP (sketch below)
  • Impact: Works in noisy environments (music playing, multiple people)
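
One way to spatially suppress an interferer with only two channels is delay-and-subtract null steering; a minimal fixed (non-adaptive) sketch, assuming 15 cm spacing and integer-sample delays:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
EAR_SPACING_M = 0.15     # assumed distance between the two arrays

def null_direction(left: np.ndarray, right: np.ndarray,
                   interferer_deg: float, sample_rate: int = 16000) -> np.ndarray:
    """Delay-and-subtract null steering across the two ear channels.
    Aligns both channels on the interferer's direction (0 deg = straight ahead,
    positive = toward the right ear) and subtracts, cancelling that direction
    while passing others. An adaptive beamformer would use fractional delays
    and update its weights online."""
    itd_s = EAR_SPACING_M * np.sin(np.radians(interferer_deg)) / SPEED_OF_SOUND
    k = int(round(abs(itd_s) * sample_rate))
    n = min(len(left), len(right)) - k
    if itd_s >= 0:
        # Interferer nearer the right ear: right channel leads, so shift left forward.
        return left[k:k + n] - right[:n]
    # Interferer nearer the left ear: left channel leads, so shift right forward.
    return left[:n] - right[k:k + n]
```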

8. Sound event localization (what + where)

  • Combine YAMNet classification with triangulated position
  • "Dog bark from the backyard direction" not just "dog bark"
  • Spatial history: timeline of what happened where
  • Prereqs: #1, #6
  • Complexity: Medium — merge classification results with position data (sketch below)
  • Impact: Rich environmental narrative for LYRA context
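
The merge itself is mostly bookkeeping; a sketch of the event record such an endpoint could serve, with made-up field names rather than the real /scene/events schema:

```python
import math
import time

def scene_event(label: str, score: float, position_xy) -> dict:
    """Combine a YAMNet-style (label, score) with a triangulated (x, y)."""
    x, y = position_xy
    return {
        "ts": time.time(),                                      # when
        "label": label,                                         # what, e.g. "Dog bark"
        "score": round(score, 3),
        "direction_deg": round(math.degrees(math.atan2(x, y)) % 360.0, 1),  # where
        "range_m": round(math.hypot(x, y), 2),
    }
```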

9. Head orientation inference

  • If a known sound source is at a fixed position, infer which way the head is "facing"
  • Useful if the skull ever gets a rotating mount
  • Prereqs: #6 (known spatial map)
  • Complexity: Low math, but needs stable reference points (sketch below)
  • Impact: Low for now (head doesn't turn), future-proofing
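
Once a landmark from the spatial map is recognised, the math reduces to a single angle difference; a sketch, assuming both angles use the same 0-360 degree convention:

```python
def infer_heading_deg(landmark_world_deg: float, observed_doa_deg: float) -> float:
    """If a mapped source (e.g. 'TV at 270 deg' in room coordinates) is heard at
    observed_doa_deg relative to the head, the head's heading in room coordinates
    satisfies heading + observed = world bearing."""
    return (landmark_world_deg - observed_doa_deg) % 360.0
```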

10. Binaural recording for training data

  • Record stereo audio preserving spatial information (left ear / right ear)
  • Training corpus for spatial audio models, being0 sensor data
  • Prereqs: Just dual streams saved to stereo WAV
  • Complexity: Low — already have both streams (sketch below)
  • Impact: Long-term value for L-Vixy-5 training
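
A sketch of the recording step using only the standard library plus NumPy, assuming both streams arrive as equal-rate mono int16 buffers:

```python
import wave
import numpy as np

def write_binaural_wav(path: str, left: np.ndarray, right: np.ndarray,
                       sample_rate: int = 16000) -> None:
    """Interleave the left/right array streams into a stereo 16-bit WAV."""
    n = min(len(left), len(right))              # trim to the shorter stream
    stereo = np.empty((n, 2), dtype=np.int16)
    stereo[:, 0] = left[:n]                     # left ear -> channel 0
    stereo[:, 1] = right[:n]                    # right ear -> channel 1
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)                     # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(stereo.tobytes())
```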

Tier 4 — Research / future

11. Learned spatial attention

  • Train a model to decide where to attend based on context
  • Input: both DoA angles, VAD states, current emotional state, conversation history
  • Output: beam steering + gaze direction
  • Prereqs: #5, #6, training data from #10
  • Complexity: High — ML training pipeline (sketch below)
  • Impact: Autonomous attention that feels natural, not rule-based
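
Purely as a shape sketch: a toy policy network mapping a context feature vector to beam and gaze outputs; the feature layout, layer sizes, and output scaling are placeholders, not a designed architecture:

```python
import torch
import torch.nn as nn

class AttentionPolicy(nn.Module):
    """Toy stand-in: context features in, [beam_angle, gaze_x, gaze_y] out."""

    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 3),          # beam direction + gaze target
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.net(features))   # everything normalised to [-1, 1]
```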

12. Interaural time difference (ITD) processing

  • Raw mic access (6-channel firmware) enables sub-sample timing analysis
  • More precise localization than DoA alone, especially at low frequencies
  • Prereqs: 6-channel firmware (need to verify LED control works with it first)
  • Complexity: High — signal processing, cross-correlation (sketch below)
  • Impact: Lab-grade localization accuracy
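
The standard tool here is GCC-PHAT cross-correlation; a sketch, assuming 16 kHz raw channels and roughly 15 cm spacing (which bounds plausible delays to well under a millisecond):

```python
import numpy as np

def gcc_phat_itd(left: np.ndarray, right: np.ndarray,
                 sample_rate: int = 16000, max_itd_s: float = 0.0007) -> float:
    """Estimate the interaural time difference via GCC-PHAT.
    With this lag convention a positive value means the left channel lags the
    right, i.e. the sound reached the right mic first."""
    n = len(left) + len(right)
    L = np.fft.rfft(left, n=n)
    R = np.fft.rfft(right, n=n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(max_itd_s * sample_rate)  # limit to physically plausible lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / sample_rate
```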

Implementation order

✅ #1  Triangulation + gaze          — done (spatial.py, auto-select beam DoA)
✅ #2  Smooth tracking               — done (exponential smoothing + idle drift)
✅ #3  Speaker-side awareness        — done (Resemblyzer loaded, ready for enrollment)
✅ #4  Distance estimation           — done (ILD + triangulation fusion, proximity zones)
✅ #6  Spatial scene mapping         — done (spatial_scene.py, persistent, anomaly detection)
✅ #8  Sound event localization      — done (what + where + when via /scene/events)
✅ #10 Binaural recording            — done (opt-in via BINAURAL_RECORD=1)
   #5  Multi-speaker separation
   #7  Cocktail party filtering
   #11 Learned attention

Notes

  • Items #1-3 can be built in a single session
  • The eye service already accepts gaze via POST /gaze {"x": N, "y": N} (example below)
  • DoA is already polled at 10Hz via /doa endpoint
  • Array separation distance needs to be measured once and stored in config
  • All of this feeds into the being0 "shaped by experience" philosophy
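
For reference, a minimal gaze push in the documented format; the base URL is a placeholder:

```python
import requests

EYE_SERVICE = "http://localhost:8080"   # placeholder base URL for the eye service

def push_gaze(x: float, y: float) -> None:
    """POST /gaze {"x": N, "y": N}, as the eye service already accepts."""
    requests.post(f"{EYE_SERVICE}/gaze", json={"x": x, "y": y}, timeout=0.2)
```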