# Binaural Hearing Roadmap
## What two mic arrays make possible
Ranked by impact × feasibility. All build on the existing dual XVF3800 + `/doa` endpoint.
---
### Tier 1 — High impact, ready to build now
**1. Triangulated sound localization + eye gaze**
- Combine DoA angles from both arrays → compute (x, y) position of the sound source (intersection math sketched after this item)
- Post gaze coordinates to eye service → eyes track the speaker spatially
- Front/back disambiguation (single array can't tell 30° front from 30° rear)
- *Prereqs:* Known array positions (measured once), basic trig
- *Complexity:* Low — ~100 lines of math + a gaze-push thread
- *Impact:* Huge — eyes actually follow the person, not just shift left/right
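
A minimal sketch of the intersection math, assuming each array reports a planar DoA in degrees measured counter-clockwise from the +x axis; the array coordinates are placeholders for the one-time measurement:

```python
import math

# Placeholder positions (metres); replace with the measured array locations.
LEFT_ARRAY = (-0.15, 0.0)
RIGHT_ARRAY = (0.15, 0.0)

def triangulate(doa_left_deg, doa_right_deg):
    """Intersect the two bearing rays; returns (x, y) or None if degenerate."""
    (x1, y1), (x2, y2) = LEFT_ARRAY, RIGHT_ARRAY
    d1 = (math.cos(math.radians(doa_left_deg)), math.sin(math.radians(doa_left_deg)))
    d2 = (math.cos(math.radians(doa_right_deg)), math.sin(math.radians(doa_right_deg)))
    denom = d1[0] * d2[1] - d1[1] * d2[0]  # 2D cross product of the ray directions
    if abs(denom) < 1e-6:
        return None  # rays near-parallel: source very far or readings inconsistent
    t = ((x2 - x1) * d2[1] - (y2 - y1) * d2[0]) / denom
    if t < 0:
        return None  # intersection behind the left array: bad reading
    return (x1 + t * d1[0], y1 + t * d1[1])
```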
**2. Active speaker tracking with smooth gaze**
- Continuously track the dominant sound source as it moves
- Smooth the gaze updates (low-pass filter) so eyes don't jitter
- When VAD drops, eyes drift back to center (natural idle behavior)
- *Prereqs:* #1
- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1 (the smoothing variant is sketched below)
- *Impact:* Makes her feel present and attentive
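
A sketch of the exponential-smoothing variant (the cheaper of the two options above); `ALPHA` and the centre position are illustrative, not tuned values:

```python
ALPHA = 0.2  # 0..1; smaller = smoother gaze, more lag

class GazeSmoother:
    def __init__(self):
        self.x = self.y = 0.0  # start centred

    def update(self, target_x, target_y, voice_active):
        if not voice_active:
            target_x = target_y = 0.0  # VAD dropped: drift back to centre
        self.x += ALPHA * (target_x - self.x)
        self.y += ALPHA * (target_y - self.y)
        return self.x, self.y
```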
**3. Left/right speaker awareness**
- Know which side each speaker is on, combine with speaker ID
- "Alex is on my left" vs "unknown person on my right"
- Feed into LYRA context so responses can reference spatial relationships
- *Prereqs:* #1 + existing speaker ID
- *Complexity:* Medium — associate speaker embeddings with spatial positions (see the sketch after this item)
- *Impact:* Multi-person conversations become spatially grounded
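
A hedged sketch of the association step; the `enrolled` dict, the cosine-similarity test, and the 0.75 threshold are assumptions rather than the actual enrollment flow:

```python
import numpy as np

enrolled = {}  # name -> reference embedding (e.g. from Resemblyzer enrollment)

def label_speaker(embedding, source_x):
    """Bind a speaker embedding to the side its triangulated x puts it on."""
    side = "left" if source_x < 0 else "right"
    for name, ref in enrolled.items():
        sim = float(np.dot(embedding, ref) /
                    (np.linalg.norm(embedding) * np.linalg.norm(ref) + 1e-9))
        if sim > 0.75:  # assumed similarity threshold
            return f"{name} on my {side}"
    return f"unknown person on my {side}"
```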
---
### Tier 2 — High impact, moderate effort
**4. Distance estimation (near/far)**
- Interaural Level Difference (ILD): close sources produce a bigger level gap between the ears
- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
- Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+)
- *Prereqs:* #1, calibration with known distances
- *Complexity:* Medium — ILD from the processed channels is easy (sketched below), ITD needs raw mics
- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)
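
A sketch of the easy half, assuming both processed channels arrive as int16 NumPy arrays; the zone boundaries mirror the bins above, and any ILD-to-distance mapping would come from the calibration step in the prereqs:

```python
import numpy as np

def ild_db(left, right):
    """Level gap between the ears in dB; larger magnitude suggests a closer source."""
    def rms(x):
        return float(np.sqrt(np.mean(np.square(x.astype(np.float64))) + 1e-12))
    return 20.0 * np.log10(rms(left) / rms(right))

def proximity_zone(distance_m):
    """The rough bins listed above, applied to a triangulated distance."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    return "across-room"
```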
**5. Multi-speaker separation + selective attention**
- Lock each array's beam to a different speaker simultaneously
- Active speaker gets primary audio feed (wake word, transcription)
- Secondary speaker monitored for interruptions or wake word
- Switch attention on cue ("Hey Vivi" from the other side)
- *Prereqs:* #3, understanding of XVF3800 beam steering commands
- *Complexity:* Medium-high — need to control beamformer direction per-array
- *Impact:* Natural multi-person conversations, not just one-at-a-time
**6. Spatial audio scene mapping**
- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
- Learn from repeated sound sources over hours/days
- Detect anomalies: "sound from an unusual direction"
- *Prereqs:* #1, persistent storage, classification by direction
- *Complexity:* Medium — accumulate (direction, category) pairs and cluster over time (sketch below)
- *Impact:* Environmental awareness, contextual anomaly detection
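
A minimal sketch of the accumulate-and-flag loop; the 15° bin width, the JSON persistence, and the `min_count` anomaly rule are illustrative assumptions:

```python
import json
from collections import defaultdict

BIN_DEG = 15  # assumed direction bin width
scene = defaultdict(int)  # (direction_bin, category) -> observation count

def _key(direction_deg, category):
    return (int(direction_deg) // BIN_DEG * BIN_DEG, category)

def observe(direction_deg, category):
    scene[_key(direction_deg, category)] += 1

def is_anomalous(direction_deg, category, min_count=3):
    """True if this category has rarely been heard from this direction before."""
    return scene.get(_key(direction_deg, category), 0) < min_count

def save(path="scene.json"):
    with open(path, "w") as f:
        json.dump({f"{d}:{c}": n for (d, c), n in scene.items()}, f)
```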
---
### Tier 3 — Cool, needs more infrastructure
**7. Cocktail party spatial filtering**
- When multiple sound sources active, use both arrays to null out interference
- Focus beam on target speaker, suppress others spatially
- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
- *Complexity:* High — adaptive beamforming, may need custom DSP
- *Impact:* Works in noisy environments (music playing, multiple people)
**8. Sound event localization (what + where)**
- Combine YAMNet classification with triangulated position
- "Dog bark from the backyard direction" not just "dog bark"
- Spatial history: timeline of what happened where
- *Prereqs:* #1, #6
- *Complexity:* Medium — merge classification results with position data (sketched below)
- *Impact:* Rich environmental narrative for LYRA context
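
A sketch of the merged record; the field names are illustrative, not the actual `/scene/events` schema:

```python
import time
from dataclasses import dataclass

@dataclass
class SpatialEvent:
    label: str  # classifier output, e.g. "Dog"
    x: float    # triangulated position, metres
    y: float
    ts: float   # unix time: the "when" for the spatial history

def tag_event(label, position):
    return SpatialEvent(label, position[0], position[1], time.time())
```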
**9. Head orientation inference**
- If a known sound source is at a fixed position, infer which way the head is "facing"
- Useful if the skull ever gets a rotating mount
- *Prereqs:* #6 (known spatial map)
- *Complexity:* Low math, but needs stable reference points
- *Impact:* Low for now (head doesn't turn), future-proofing
**10. Binaural recording for training data**
- Record stereo audio preserving spatial information (left ear / right ear)
- Training corpus for spatial audio models; doubles as being0 sensor data
- *Prereqs:* Just dual streams saved to stereo WAV
- *Complexity:* Low — already have both streams (recording sketch below)
- *Impact:* Long-term value for L-Vixy-5 training
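
A minimal sketch, assuming both streams arrive as int16 NumPy arrays at the same rate (16 kHz here is a placeholder):

```python
import wave
import numpy as np

def write_binaural(left, right, path="binaural.wav", rate=16000):
    """Interleave the two mono streams (left array = left ear) into stereo."""
    n = min(len(left), len(right))
    stereo = np.empty((n, 2), dtype=np.int16)
    stereo[:, 0] = left[:n]
    stereo[:, 1] = right[:n]
    with wave.open(path, "wb") as wf:
        wf.setnchannels(2)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(stereo.tobytes())
```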
---
### Tier 4 — Research / future
**11. Learned spatial attention**
- Train a model to decide where to attend based on context
- Input: both DoA angles, VAD states, current emotional state, conversation history
- Output: beam steering + gaze direction
- *Prereqs:* #5, #6, training data from #10
- *Complexity:* High — ML training pipeline
- *Impact:* Autonomous attention that feels natural, not rule-based
**12. Interaural time difference (ITD) processing**
- Raw mic access (6-channel firmware) enables sub-sample timing analysis
- More precise localization than DoA alone, especially at low frequencies
- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
- *Complexity:* High — signal processing, cross-correlation (core step sketched below)
- *Impact:* Lab-grade localization accuracy
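
A sketch of the core step on synchronized raw frames; a real version would want GCC-PHAT weighting for robustness and peak interpolation for sub-sample precision, which this whole-sample version lacks:

```python
import numpy as np

def itd_seconds(left, right, rate=16000):
    """Cross-correlate two mic frames and convert the peak lag to seconds."""
    left = left - left.mean()
    right = right - right.mean()
    corr = np.correlate(left, right, mode="full")
    # lag > 0 means the left channel is delayed, i.e. the source is to the right
    lag = int(np.argmax(corr)) - (len(right) - 1)
    return lag / rate
```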
---
## Implementation order
```
✅ #1 Triangulation + gaze — done (spatial.py, auto-select beam DoA)
✅ #2 Smooth tracking — done (exponential smoothing + idle drift)
✅ #3 Speaker-side awareness — done (Resemblyzer loaded, ready for enrollment)
✅ #4 Distance estimation — done (ILD + triangulation fusion, proximity zones)
✅ #6 Spatial scene mapping — done (spatial_scene.py, persistent, anomaly detection)
✅ #8 Sound event localization — done (what + where + when via /scene/events)
✅ #10 Binaural recording — done (opt-in via BINAURAL_RECORD=1)
#5 Multi-speaker separation
#7 Cocktail party filtering
#11 Learned attention
```
## Notes
- Items #1-3 can be built in a single session
- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}` (example below)
- DoA is already polled at 10Hz via `/doa` endpoint
- Array separation distance needs to be measured once and stored in config
- All of this feeds into the being0 "shaped by experience" philosophy
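
A one-line illustration of the `/gaze` note above; the host and port are assumptions, and the coordinate range depends on the eye service's convention:

```python
import requests

requests.post("http://eyes.local:8080/gaze", json={"x": 0.4, "y": -0.1}, timeout=1.0)
```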