
Alex 02d3ac3816 Update docs — spatial scene, distance estimation, roadmap progress

README: Updated architecture diagram, features table, new endpoints
(/scene, /scene/events, /scene/heatmap), file structure, USB protocol
notes (VAD from processed_doa NaN, spenergy always zero).

BINAURAL_ROADMAP: Mark #1-4, #6, #8, #10 as done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-12 21:35:02 -05:00

6.2 KiB


Binaural Hearing Roadmap

What two mic arrays make possible

Ranked by impact × feasibility. All build on the existing dual XVF3800 + /doa endpoint.

Tier 1 — High impact, ready to build now

1. Triangulated sound localization + eye gaze

Combine DoA angles from both arrays → compute (x, y) position of sound source
Post gaze coordinates to eye service → eyes track the speaker spatially
Front/back disambiguation (single array can't tell 30° front from 30° rear)
Prereqs: Known array positions (measured once), basic trig
Complexity: Low — ~100 lines of math + a gaze-push thread
Impact: Huge — eyes actually follow the person, not just shift left/right
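The triangulation step can be sketched as a ray intersection. This is a minimal illustration, not the code in spatial.py: it assumes the left array sits at (-d/2, 0), the right array at (+d/2, 0), and both report bearings in a shared frame, in degrees counterclockwise from the +x axis. The function name and parameters are hypothetical.

```python
import math


def triangulate(left_deg, right_deg, separation_m):
    """Intersect the two bearing rays; return (x, y) in metres, or None."""
    # Assumed layout: left array at (-d/2, 0), right array at (+d/2, 0).
    half = separation_m / 2.0
    a = math.radians(left_deg)
    b = math.radians(right_deg)
    dlx, dly = math.cos(a), math.sin(a)  # unit direction of the left ray
    drx, dry = math.cos(b), math.sin(b)  # unit direction of the right ray

    # 2D cross product of the directions; near zero means parallel rays.
    denom = dlx * dry - dly * drx
    if abs(denom) < 1e-9:
        return None

    # Distances along each ray to the intersection point.
    t = separation_m * dry / denom
    s = separation_m * dly / denom
    if t <= 0 or s <= 0:
        return None  # rays diverge, so there is no source in front of both

    return (-half + t * dlx, t * dly)
```

With full 360° bearings in a shared frame, a source behind the arrays yields a negative y, which is what resolves the front/back ambiguity.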

2. Active speaker tracking with smooth gaze

Continuously track the dominant sound source as it moves
Smooth the gaze updates (low-pass filter) so eyes don't jitter
When VAD drops, eyes drift back to center (natural idle behavior)
Prereqs: #1
Complexity: Low — Kalman filter or exponential smoothing on top of #1
Impact: Makes her feel present and attentive
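A minimal sketch of the smoothing and idle drift, using exponential smoothing rather than a Kalman filter. The class name, the rate constants, and the choice of (0, 0) as "center" are assumptions for illustration only.

```python
class GazeSmoother:
    """Low-pass filter for gaze targets with drift back to center when idle."""

    def __init__(self, alpha=0.3, idle_rate=0.05):
        self.alpha = alpha          # how fast the eyes follow an active speaker
        self.idle_rate = idle_rate  # how fast the eyes drift back to center
        self.x = 0.0
        self.y = 0.0

    def update(self, target, voice_active):
        """Advance one step; target is an (x, y) tuple or None."""
        if voice_active and target is not None:
            tx, ty = target
            rate = self.alpha
        else:
            # No voice: aim at center (assumed to be 0, 0) at the slower rate.
            tx, ty = 0.0, 0.0
            rate = self.idle_rate
        self.x += rate * (tx - self.x)
        self.y += rate * (ty - self.y)
        return self.x, self.y
```

Calling update() once per DoA poll keeps the time constant tied to the polling rate; a smaller alpha gives calmer but laggier eyes.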

3. Left/right speaker awareness

Know which side each speaker is on, combine with speaker ID
"Alex is on my left" vs "unknown person on my right"
Feed into LYRA context so responses can reference spatial relationships
Prereqs: #1 + existing speaker ID
Complexity: Medium — associate speaker embeddings with spatial positions
Impact: Multi-person conversations become spatially grounded
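One way the spatial phrase for LYRA context might be produced, assuming negative x is the left side (matching a left array at negative x) and a small dead zone counts as "in front". The function name and the dead-zone value are hypothetical.

```python
def describe_speaker(name, x_m, dead_zone_m=0.15):
    """Turn a speaker ID and lateral position into a short context phrase."""
    who = name if name else "An unknown person"
    if x_m < -dead_zone_m:
        side = "on my left"
    elif x_m > dead_zone_m:
        side = "on my right"
    else:
        side = "in front of me"
    return f"{who} is {side}"
```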

Tier 2 — High impact, moderate effort

4. Distance estimation (near/far)

Interaural Level Difference (ILD): close sources produce a bigger level gap between the two ears
Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+)
Prereqs: #1, calibration with known distances
Complexity: Medium — ILD from processed channels is easy, ITD needs raw mics
Impact: Interaction style adapts to proximity (whisper vs. room voice)
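A rough sketch of the two building blocks: ILD in dB from one block of samples per ear, and the proximity bins listed above. The mapping from ILD to metres is deliberately left out because it depends on the calibration step; all names here are illustrative.

```python
import math


def rms(samples):
    """Root-mean-square level of one block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def ild_db(left, right, floor=1e-9):
    """Level difference in dB; positive means the left ear is louder."""
    # The floor avoids log(0) on silent blocks.
    return 20.0 * math.log10(max(rms(left), floor) / max(rms(right), floor))


def proximity_zone(distance_m):
    """Map an estimated distance onto the rough bins from the roadmap."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    return "across room"
```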

5. Multi-speaker separation + selective attention

Lock each array's beam to a different speaker simultaneously
Active speaker gets primary audio feed (wake word, transcription)
Secondary speaker monitored for interruptions or wake word
Switch attention on cue ("Hey Vivi" from the other side)
Prereqs: #3, understanding of XVF3800 beam steering commands
Complexity: Medium-high — need to control beamformer direction per-array
Impact: Natural multi-person conversations, not just one-at-a-time

6. Spatial audio scene mapping

Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
Learn from repeated sound sources over hours/days
Detect anomalies: "sound from an unusual direction"
Prereqs: #1, persistent storage, classification by direction
Complexity: Medium — accumulate (direction, category) pairs, cluster over time
Impact: Environmental awareness, contextual anomaly detection
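The accumulate-and-flag idea can be illustrated with a direction histogram. This is a sketch under simple assumptions (fixed 30° bins, a plain count threshold, in-memory only), not the logic in spatial_scene.py; persistence would need the counts saved to disk.

```python
from collections import Counter


class SceneMap:
    """Counts (direction bin, category) pairs and flags rarely seen ones."""

    def __init__(self, bin_deg=30):
        self.bin_deg = bin_deg
        self.counts = Counter()

    def _bin(self, direction_deg):
        # Wrap into [0, 360) first so that -90 and 270 share a bin.
        return int(direction_deg % 360 // self.bin_deg)

    def observe(self, direction_deg, category):
        self.counts[(self._bin(direction_deg), category)] += 1

    def is_anomaly(self, direction_deg, category, min_seen=3):
        """True if this category has rarely been heard from this direction."""
        return self.counts[(self._bin(direction_deg), category)] < min_seen
```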

Tier 3 — Cool, needs more infrastructure

7. Cocktail party spatial filtering

When multiple sound sources are active, use both arrays to null out interference
Focus beam on target speaker, suppress others spatially
Prereqs: #5, possibly raw mic access (6-channel firmware)
Complexity: High — adaptive beamforming, may need custom DSP
Impact: Works in noisy environments (music playing, multiple people)

8. Sound event localization (what + where)

Combine YAMNet classification with triangulated position
"Dog bark from the backyard direction" not just "dog bark"
Spatial history: timeline of what happened where
Prereqs: #1, #6
Complexity: Medium — merge classification results with position data
Impact: Rich environmental narrative for LYRA context
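A possible shape for a merged "what + where + when" record. The field names and the bearing convention (degrees counterclockwise from the +x axis) are assumptions, and the label is just whatever string the classifier emits.

```python
import math
from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str        # classifier output, e.g. "dog bark"
    x_m: float        # triangulated position
    y_m: float
    timestamp: float  # seconds since the epoch

    @property
    def bearing_deg(self):
        return math.degrees(math.atan2(self.y_m, self.x_m)) % 360

    @property
    def range_m(self):
        return math.hypot(self.x_m, self.y_m)


def describe(event):
    """One line of narrative suitable for a context feed."""
    return f"{event.label} at {event.bearing_deg:.0f}°, {event.range_m:.1f} m away"
```

A timeline is then a list of these records sorted by timestamp.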

9. Head orientation inference

If a known sound source is at a fixed position, infer which way the head is "facing"
Useful if the skull ever gets a rotating mount
Prereqs: #6 (known spatial map)
Complexity: Low math, but needs stable reference points
Impact: Low for now (head doesn't turn), future-proofing
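The math here is an angle difference, averaged on the circle when several reference sources are available. A minimal sketch, assuming both the map bearing and the observed DoA are in degrees with the same rotation sense; the function name is hypothetical.

```python
import math


def infer_heading(references):
    """Estimate head heading from (map_bearing_deg, observed_doa_deg) pairs."""
    sx = sy = 0.0
    for map_bearing, observed in references:
        # Each pair votes for a heading offset; sum the votes as unit vectors
        # so that 350 and 10 degrees average to 0 rather than 180.
        offset = math.radians(map_bearing - observed)
        sx += math.cos(offset)
        sy += math.sin(offset)
    if math.hypot(sx, sy) < 1e-9:
        return None  # no references, or votes that cancel out
    return math.degrees(math.atan2(sy, sx)) % 360
```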

10. Binaural recording for training data

Record stereo audio preserving spatial information (left ear / right ear)
Training corpus for spatial audio models, being0 sensor data
Prereqs: Just dual streams saved to stereo WAV
Complexity: Low — already have both streams
Impact: Long-term value for L-Vixy-5 training
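Saving the two streams as one stereo file needs only the standard library. This sketch assumes each ear arrives as a sequence of signed 16-bit integer samples at the same rate; the 16 kHz default and the function name are assumptions.

```python
import struct
import wave


def write_binaural_wav(path, left, right, sample_rate=16000):
    """Interleave two mono int16 streams into a stereo WAV; return frame count."""
    n = min(len(left), len(right))  # trim to the shorter stream
    frames = bytearray()
    for l, r in zip(left[:n], right[:n]):
        # Little-endian int16 pair: left channel first, then right.
        frames += struct.pack("<hh", l, r)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n
```

The two streams need a shared clock or timestamp alignment before interleaving, otherwise the spatial cues in the recording drift.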

Tier 4 — Research / future

11. Learned spatial attention

Train a model to decide where to attend based on context
Input: both DoA angles, VAD states, current emotional state, conversation history
Output: beam steering + gaze direction
Prereqs: #5, #6, training data from #10
Complexity: High — ML training pipeline
Impact: Autonomous attention that feels natural, not rule-based

12. Interaural time difference (ITD) processing

Raw mic access (6-channel firmware) enables sub-sample timing analysis
More precise localization than DoA alone, especially at low frequencies
Prereqs: 6-channel firmware (need to verify LED control works with it first)
Complexity: High — signal processing, cross-correlation
Impact: Lab-grade localization accuracy
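The core of ITD estimation is finding the lag that maximises the cross-correlation between two raw channels. The sketch below is a brute-force, integer-lag version in plain Python, meant only to show the idea; sub-sample precision would need interpolation around the peak, and a real pipeline would likely use an FFT-based method.

```python
def estimate_lag(left, right, max_lag):
    """Return the lag in samples at which right best matches left.

    A positive result means the sound reached the left channel first.
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def itd_seconds(lag_samples, sample_rate):
    """Convert a lag in samples to a time difference in seconds."""
    return lag_samples / sample_rate
```

The useful range for max_lag follows from geometry: mic spacing divided by the speed of sound, times the sample rate.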

Implementation order

✅ #1  Triangulation + gaze          — done (spatial.py, auto-select beam DoA)
✅ #2  Smooth tracking               — done (exponential smoothing + idle drift)
✅ #3  Speaker-side awareness        — done (Resemblyzer loaded, ready for enrollment)
✅ #4  Distance estimation           — done (ILD + triangulation fusion, proximity zones)
✅ #6  Spatial scene mapping         — done (spatial_scene.py, persistent, anomaly detection)
✅ #8  Sound event localization      — done (what + where + when via /scene/events)
✅ #10 Binaural recording            — done (opt-in via BINAURAL_RECORD=1)
   #5  Multi-speaker separation
   #7  Cocktail party filtering
   #11 Learned attention

Notes

Items #1-3 can be built in a single session
The eye service already accepts gaze via POST /gaze {"x": N, "y": N}
DoA is already polled at 10Hz via /doa endpoint
Array separation distance needs to be measured once and stored in config
All of this feeds into the being0 "shaped by experience" philosophy
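For reference, a gaze push matching the POST /gaze {"x": N, "y": N} shape above could be built like this. The base URL and the JSON content type are assumptions, and the valid range of x and y is whatever the eye service defines.

```python
import json
import urllib.request


def build_gaze_request(base_url, x, y):
    """Build (but do not send) a POST /gaze request with a JSON body."""
    body = json.dumps({"x": x, "y": y}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/gaze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a single urllib.request.urlopen(request, timeout=...) call from the gaze-push thread, ideally with a short timeout so a stalled eye service cannot block DoA polling.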
