# Binaural Hearing Roadmap

## What two mic arrays make possible

Ranked by impact × feasibility. All build on the existing dual XVF3800 + `/doa` endpoint.

---

### Tier 1 — High impact, ready to build now

**1. Triangulated sound localization + eye gaze**

- Combine DoA angles from both arrays → compute (x, y) position of sound source
- Post gaze coordinates to eye service → eyes track the speaker spatially
- Front/back disambiguation (single array can't tell 30° front from 30° rear)
- *Prereqs:* Known array positions (measured once), basic trig
- *Complexity:* Low — ~100 lines of math + a gaze-push thread
- *Impact:* Huge — eyes actually follow the person, not just shift left/right

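The triangulation in #1 reduces to intersecting two bearing rays. A minimal sketch, assuming both DoA angles have already been converted into a shared room frame and the array positions are known (the function name and conventions are illustrative, not the actual `spatial.py` API):

```python
import math

def triangulate(p1, theta1, p2, theta2):
    """Estimate the (x, y) of a sound source from two bearings.

    p1, p2: (x, y) array positions in metres.
    theta1, theta2: DoA bearings in radians, in the same room frame.
    Returns None when the bearings are near-parallel or inconsistent.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # Solve p1 + t1*d1 == p2 + t2*d2 for t1 (Cramer's rule on a 2x2 system).
    denom = d1[0] * -d2[1] + d2[0] * d1[1]
    if abs(denom) < 1e-9:
        return None  # parallel bearings: no stable intersection
    bx, by = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (bx * -d2[1] + d2[0] * by) / denom
    if t1 < 0:
        return None  # "intersection" lies behind array 1: reject
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])
```

Rejecting negative ray parameters is one way the second array helps with front/back disambiguation: a mirrored single-array bearing produces an intersection behind the arrays and gets discarded.
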
**2. Active speaker tracking with smooth gaze**

- Continuously track the dominant sound source as it moves
- Smooth the gaze updates (low-pass filter) so eyes don't jitter
- When VAD drops, eyes drift back to center (natural idle behavior)
- *Prereqs:* #1
- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1
- *Impact:* Makes her feel present and attentive

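Exponential smoothing plus idle drift fits in a few lines; `GazeSmoother` and its parameters are illustrative, not the project's API:

```python
class GazeSmoother:
    """Exponential smoothing of gaze targets, with drift-to-centre on silence.

    alpha: smoothing factor in (0, 1]; higher tracks faster, lower is calmer.
    idle_rate: fraction of the remaining offset shed per update while VAD is off.
    """
    def __init__(self, alpha=0.3, idle_rate=0.05):
        self.alpha = alpha
        self.idle_rate = idle_rate
        self.x, self.y = 0.0, 0.0  # current smoothed gaze, (0, 0) = centre

    def update(self, target_x, target_y, voice_active):
        if voice_active:
            # Move a fixed fraction of the way toward the new target.
            self.x += self.alpha * (target_x - self.x)
            self.y += self.alpha * (target_y - self.y)
        else:
            # VAD dropped: drift slowly back toward centre.
            self.x -= self.idle_rate * self.x
            self.y -= self.idle_rate * self.y
        return self.x, self.y
```

Each smoothed (x, y) would then be pushed to the eye service; a Kalman filter slots into the same place if velocity estimates turn out to matter.
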
**3. Left/right speaker awareness**

- Know which side each speaker is on, combine with speaker ID
- "Alex is on my left" vs "unknown person on my right"
- Feed into LYRA context so responses can reference spatial relationships
- *Prereqs:* #1 + existing speaker ID
- *Complexity:* Medium — associate speaker embeddings with spatial positions
- *Impact:* Multi-person conversations become spatially grounded

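With a triangulated (x, y) from #1, side classification is a comparison against the head's midline. A sketch, assuming a +x-is-right convention and a small deadband so the label doesn't flap near centre (both are assumptions, not the project's convention):

```python
def side_of_head(source_x, head_x=0.0, deadband=0.15):
    """Label a source as 'left', 'right', or 'center' of the head.

    Coordinates are metres in the room frame; deadband is the half-width
    of the central zone where neither side is claimed.
    """
    dx = source_x - head_x
    if dx > deadband:
        return "right"
    if dx < -deadband:
        return "left"
    return "center"
```

The label, paired with a speaker-ID embedding match, is what turns into "Alex is on my left" in LYRA context.
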
---

### Tier 2 — High impact, moderate effort

**4. Distance estimation (near/far)**

- Interaural Level Difference (ILD): close sources have a bigger volume gap between ears
- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
- Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+)
- *Prereqs:* #1, calibration with known distances
- *Complexity:* Medium — ILD from processed channels is easy, ITD needs raw mics
- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)

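The ILD measurement and the distance bins above can be sketched directly; 20·log10 of the RMS ratio is the standard level-difference formula, while the function names and zone labels mirror the bins listed (illustrative, not the shipped code):

```python
import math

def ild_db(left_rms, right_rms):
    """Interaural level difference in dB; positive = louder on the left."""
    return 20.0 * math.log10(left_rms / right_rms)

def proximity_zone(distance_m):
    """Map a distance estimate onto the roadmap's rough bins."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    return "across-room"
```

Calibration with known distances would turn |ILD| into a distance estimate to feed `proximity_zone`; fusing it with triangulated range from #1 keeps it honest.
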
**5. Multi-speaker separation + selective attention**

- Lock each array's beam to a different speaker simultaneously
- Active speaker gets primary audio feed (wake word, transcription)
- Secondary speaker monitored for interruptions or wake word
- Switch attention on cue ("Hey Vivi" from the other side)
- *Prereqs:* #3, understanding of XVF3800 beam steering commands
- *Complexity:* Medium-high — need to control beamformer direction per-array
- *Impact:* Natural multi-person conversations, not just one-at-a-time

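The switching logic is separable from the beam-steering commands themselves. A sketch where `steer(array_id, angle_deg)` is a hypothetical callback standing in for the real per-array XVF3800 command, which still needs to be identified:

```python
class AttentionManager:
    """Hold a primary and a secondary speaker; swap on a wake-word cue."""

    def __init__(self, steer):
        self.steer = steer     # hypothetical: (array_id, angle_deg) -> None
        self.primary = None    # (array_id, angle_deg) getting the main feed
        self.secondary = None  # monitored for interruptions / wake word

    def assign(self, primary, secondary):
        """Lock each array's beam onto its own speaker."""
        self.primary, self.secondary = primary, secondary
        self.steer(*primary)
        self.steer(*secondary)

    def on_wake_word(self, source):
        """'Hey Vivi' heard on the secondary beam swaps attention."""
        if source == self.secondary:
            self.primary, self.secondary = self.secondary, self.primary
        return self.primary
```
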
**6. Spatial audio scene mapping**

- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
- Learn from repeated sound sources over hours/days
- Detect anomalies: "sound from an unusual direction"
- *Prereqs:* #1, persistent storage, classification by direction
- *Complexity:* Medium — accumulate (direction, category) pairs, cluster over time
- *Impact:* Environmental awareness, contextual anomaly detection

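The accumulate-and-cluster idea can start as nothing more than counts per direction sector; the 15° bins, the anomaly threshold, and the lack of persistence/time decay are all simplifications here:

```python
from collections import defaultdict

class SceneMap:
    """Count (direction, category) observations in coarse angular sectors."""

    def __init__(self, bin_deg=15):
        self.bin_deg = bin_deg
        self.counts = defaultdict(lambda: defaultdict(int))

    def _sector(self, direction_deg):
        return int(direction_deg % 360) // self.bin_deg

    def observe(self, direction_deg, category):
        self.counts[self._sector(direction_deg)][category] += 1

    def is_anomalous(self, direction_deg, category, min_seen=3):
        # Anomalous = this category has rarely been heard from this sector.
        return self.counts[self._sector(direction_deg)][category] < min_seen
```

Persisted to disk and decayed over time, the same counts yield both the "TV at 270°" map and the unusual-direction alerts.
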
---

### Tier 3 — Cool, needs more infrastructure

**7. Cocktail party spatial filtering**

- When multiple sound sources are active, use both arrays to null out interference
- Focus beam on target speaker, suppress others spatially
- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
- *Complexity:* High — adaptive beamforming, may need custom DSP
- *Impact:* Works in noisy environments (music playing, multiple people)

**8. Sound event localization (what + where)**

- Combine YAMNet classification with triangulated position
- "Dog bark from the backyard direction" not just "dog bark"
- Spatial history: timeline of what happened where
- *Prereqs:* #1, #6
- *Complexity:* Medium — merge classification results with position data
- *Impact:* Rich environmental narrative for LYRA context

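Fusing "what" with "where" is mostly bookkeeping once #1 and #6 exist. A sketch of one possible timeline record; the field names are illustrative, not the actual `/scene/events` schema:

```python
import math
import time

def localized_event(category, position_xy, origin=(0.0, 0.0)):
    """Turn a classifier label plus a triangulated position into a
    what + where + when record, with bearing/distance from `origin`."""
    x, y = position_xy
    bearing = math.degrees(math.atan2(y - origin[1], x - origin[0])) % 360
    return {
        "t": time.time(),                  # when
        "what": category,                  # e.g. a YAMNet class name
        "bearing_deg": round(bearing, 1),  # where: direction
        "distance_m": round(math.hypot(x - origin[0], y - origin[1]), 2),
    }
```

Appending these records to a persistent list is already the "spatial history" bullet; the narrative layer just reads them back in time order.
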
**9. Head orientation inference**

- If a known sound source is at a fixed position, infer which way the head is "facing"
- Useful if the skull ever gets a rotating mount
- *Prereqs:* #6 (known spatial map)
- *Complexity:* Low math, but needs stable reference points
- *Impact:* Low for now (head doesn't turn), future-proofing

**10. Binaural recording for training data**

- Record stereo audio preserving spatial information (left ear / right ear)
- Training corpus for spatial audio models, being0 sensor data
- *Prereqs:* Just dual streams saved to stereo WAV
- *Complexity:* Low — already have both streams
- *Impact:* Long-term value for L-Vixy-5 training

---

### Tier 4 — Research / future

**11. Learned spatial attention**

- Train a model to decide where to attend based on context
- Input: both DoA angles, VAD states, current emotional state, conversation history
- Output: beam steering + gaze direction
- *Prereqs:* #5, #6, training data from #10
- *Complexity:* High — ML training pipeline
- *Impact:* Autonomous attention that feels natural, not rule-based

**12. Interaural time difference (ITD) processing**

- Raw mic access (6-channel firmware) enables sub-sample timing analysis
- More precise localization than DoA alone, especially at low frequencies
- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
- *Complexity:* High — signal processing, cross-correlation
- *Impact:* Lab-grade localization accuracy

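Cross-correlation lag estimation is the core of ITD processing. A brute-force time-domain sketch; production use would work in the frequency domain (e.g. GCC-PHAT) for speed and sub-sample precision:

```python
def itd_lag(left, right, max_lag):
    """Return the lag (in samples) that best aligns `right` with `left`.

    Positive lag means the sound reached the left ear first. Inputs are
    equal-rate sample sequences; max_lag bounds the search (the physical
    ITD limit is roughly ear_spacing / speed_of_sound * sample_rate).
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Dot product of the overlapping region at this relative shift.
        score = sum(left[i] * right[i + lag]
                    for i in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

At 16 kHz and ~20 cm ear spacing the physical lag is only about 9 samples, which is why raw (not processed) mic channels and sub-sample interpolation matter here.
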
---

## Implementation order

```
✅ #1 Triangulation + gaze — done (spatial.py, auto-select beam DoA)
✅ #2 Smooth tracking — done (exponential smoothing + idle drift)
✅ #3 Speaker-side awareness — done (Resemblyzer loaded, ready for enrollment)
✅ #4 Distance estimation — done (ILD + triangulation fusion, proximity zones)
✅ #6 Spatial scene mapping — done (spatial_scene.py, persistent, anomaly detection)
✅ #8 Sound event localization — done (what + where + when via /scene/events)
✅ #10 Binaural recording — done (opt-in via BINAURAL_RECORD=1)
   #5 Multi-speaker separation
   #7 Cocktail party filtering
   #11 Learned attention
```

## Notes

- Items #1-3 can be built in a single session
- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}`
- DoA is already polled at 10Hz via `/doa` endpoint
- Array separation distance needs to be measured once and stored in config
- All of this feeds into the being0 "shaped by experience" philosophy