# Binaural Hearing Roadmap

## What two mic arrays make possible

All features built on dual XVF3800 arrays, 175mm apart on the skull.

---

### Tier 1 — High impact ✅

**1. Triangulated sound localization + eye gaze** ✅

- Combine DoA angles from both arrays → compute (x, y) position of sound source (sketch below)
- Post gaze coordinates to eye service → eyes track the speaker spatially
- Uses `AUDIO_MGR_SELECTED_AZIMUTHS` auto-select beam (not sluggish `DOA_VALUE`)
- VAD comes free: processed DoA reads NaN when nobody is speaking, a finite angle when someone is
- *Module:* `spatial.py`
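
A minimal sketch of the triangulation step, assuming each array reports a bearing where 0° is straight ahead and positive angles point right, with the arrays ±87.5mm off the skull midline. Names and conventions here are illustrative, not the actual `spatial.py` API:

```python
import math

BASELINE_M = 0.175            # spacing between the two XVF3800 arrays
HALF = BASELINE_M / 2         # each array sits 87.5mm off the midline

def triangulate(left_deg: float, right_deg: float) -> tuple[float, float] | None:
    """Intersect two bearing rays to get an (x, y) source position in meters.

    Assumed convention: arrays at (-HALF, 0) and (+HALF, 0), y points
    forward, and each bearing is measured from straight ahead (positive
    toward +x).
    """
    tl = math.tan(math.radians(left_deg))
    tr = math.tan(math.radians(right_deg))
    if abs(tl - tr) < 1e-6:
        return None  # rays nearly parallel: source too far off to triangulate
    # Left ray:  x = -HALF + y*tl.   Right ray: x = +HALF + y*tr.
    y = BASELINE_M / (tl - tr)
    x = -HALF + y * tl
    return x, y
```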

**2. Active speaker tracking with smooth gaze** ✅

- Exponential smoothing (α=0.4) prevents jitter (sketch below)
- Idle drift back to center after 1.5s of silence
- Gaze pushed to eye service at ≤2Hz with 5px min delta
- *Module:* `spatial.py`, `headmic.py` (`doa_track_loop`)
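
The smoothing rule itself is one line; here is a sketch of how it might sit in the tracking loop, with α and the 1.5s idle timeout taken from the bullets above and everything else (names, the 0.05 drift rate) assumed:

```python
import time

ALPHA = 0.4          # exponential smoothing factor from the roadmap
IDLE_S = 1.5         # drift back to center after this much silence
CENTER_DEG = 0.0

class GazeSmoother:
    def __init__(self):
        self.angle = CENTER_DEG
        self.last_voice = 0.0

    def update(self, raw_doa_deg: float | None) -> float:
        """Blend each new DoA reading into the smoothed gaze angle.

        raw_doa_deg is None when VAD says nobody is speaking.
        """
        now = time.monotonic()
        if raw_doa_deg is not None:
            self.last_voice = now
            self.angle += ALPHA * (raw_doa_deg - self.angle)
        elif now - self.last_voice > IDLE_S:
            # idle: drift gently back toward center
            self.angle += 0.05 * (CENTER_DEG - self.angle)
        return self.angle
```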

**3. Left/right speaker awareness** ✅

- Resemblyzer speaker ID integrated, ready for enrollment
- Spatial position associated with recognized speaker
- `POST /speakers/enroll-from-mic?name=Alex` to enroll
- *Module:* `speaker_id.py`, `spatial.py`

---

### Tier 2 — High impact ✅

**4. Distance estimation (near/far)** ✅

- ILD (Interaural Level Difference): volume gap between ears → distance (sketch below)
- Fused with triangulated distance (70/30 weight)
- Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
- *Module:* `spatial.py` (`_compute_ild`, `_ild_to_distance`, `_classify_proximity`)
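
A rough sketch of the ILD computation and zone classification using the thresholds above. The 70/30 split is assumed to mean 70% triangulated / 30% ILD, and the helper names are illustrative rather than the real `_ild_to_distance` internals:

```python
import numpy as np

def compute_ild_db(left: np.ndarray, right: np.ndarray) -> float:
    """Interaural level difference in dB (positive = louder on the left)."""
    rms_l = np.sqrt(np.mean(left.astype(np.float64) ** 2)) + 1e-12
    rms_r = np.sqrt(np.mean(right.astype(np.float64) ** 2)) + 1e-12
    return 20.0 * np.log10(rms_l / rms_r)

def fuse_distance(triangulated_m: float, ild_m: float) -> float:
    """Assumed 70% triangulated / 30% ILD-derived distance fusion."""
    return 0.7 * triangulated_m + 0.3 * ild_m

def classify_proximity(distance_m: float) -> str:
    """Map a fused distance onto the roadmap's proximity zones."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    if distance_m < 5.0:
        return "across_room"
    return "far"
```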

**5. Multi-speaker separation + selective attention** ✅

- Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
- Locks XVF3800 fixed beams onto each speaker after 1s stability (sketch below)
- Auto-releases to free-running mode after 3s of single speaker
- Beam gating silences the non-speaking beam
- `GET /speakers/tracked` — positions, beam state, lock status
- *Module:* `multi_speaker.py`, `xvf3800.py` (beam steering commands)
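
A sketch of the lock/release timing described above. The state handling is illustrative, not the actual `multi_speaker.py` logic, and the angular comparison ignores wraparound for brevity:

```python
import time

SEPARATION_DEG = 30.0  # minimum angular gap to treat DoAs as two speakers
LOCK_AFTER_S = 1.0     # lock fixed beams after this much stable separation
RELEASE_AFTER_S = 3.0  # back to free-running after this long with one speaker

class BeamLockState:
    def __init__(self):
        self.locked = False
        self.two_since = None   # when two separated speakers first appeared
        self.one_since = None   # when we last dropped to a single speaker

    def update(self, doas_deg: list[float]) -> bool:
        """Return True while the fixed beams should stay locked."""
        now = time.monotonic()
        two = (len(doas_deg) == 2
               and abs(doas_deg[0] - doas_deg[1]) >= SEPARATION_DEG)
        if two:
            self.one_since = None
            self.two_since = self.two_since or now
            if not self.locked and now - self.two_since >= LOCK_AFTER_S:
                self.locked = True   # steer XVF3800 fixed beams here
        else:
            self.two_since = None
            self.one_since = self.one_since or now
            if self.locked and now - self.one_since >= RELEASE_AFTER_S:
                self.locked = False  # release to free-running mode
        return self.locked
```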

**6. Spatial audio scene mapping** ✅

- Persistent map of where each sound category usually comes from (30° bins)
- Circular mean for usual direction, anomaly detection at 90°+ deviation (sketch below)
- Saves to `~/.vixy/scene_map.json` across restarts
- `GET /scene` — learned directions per category + anomalies
- `GET /scene/events` — recent what+where+when log
- `GET /scene/heatmap` — angular distribution for visualization
- *Module:* `spatial_scene.py`
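
Circular statistics matter here because bearings wrap: naively averaging 170° and -170° gives 0° when the true mean direction is 180°. A minimal sketch of the circular mean and the 90° anomaly check, with illustrative names:

```python
import math

def circular_mean_deg(angles_deg: list[float]) -> float:
    """Mean direction on a circle, immune to the ±180° wraparound."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))

def angular_deviation_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two bearings, in [0, 180]."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def is_anomaly(observed_deg: float, usual_deg: float) -> bool:
    """Flag sounds arriving 90°+ away from their learned direction."""
    return angular_deviation_deg(observed_deg, usual_deg) >= 90.0
```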

---

### Tier 3 — Advanced ✅

**7. Cocktail party spatial filtering** ✅

- When 2 speakers tracked, audio focus locks to target speaker's side
- Non-target beam suppressed via XVF3800 beam gating
- Auto-switches target when current goes silent and other starts talking
- Manual focus via `POST /speakers/focus?speaker=0|1`
- `DualAudioStream.focus_side` overrides energy-based beam selection (sketch below)
- *Module:* `multi_speaker.py`, `audio_stream.py` (`focus_side`)
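
A sketch of how a `focus_side` override can sit on top of energy-based beam selection. `DualAudioStream` and `focus_side` are real names per the bullets above, but this body is an assumption about the shape of the logic, not the actual `audio_stream.py`:

```python
import numpy as np

class DualAudioStream:
    """Illustrative fragment: pick which beam's frame to pass downstream."""

    def __init__(self):
        self.focus_side: int | None = None  # None = free-running, 0/1 = locked

    def select_frame(self, beam0: np.ndarray, beam1: np.ndarray) -> np.ndarray:
        if self.focus_side is not None:
            # Manual/cocktail-party focus: ignore energy, gate the other beam
            return beam0 if self.focus_side == 0 else beam1
        # Default: follow whichever beam carries more energy
        e0 = float(np.mean(beam0.astype(np.float64) ** 2))
        e1 = float(np.mean(beam1.astype(np.float64) ** 2))
        return beam0 if e0 >= e1 else beam1
```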

**8. Sound event localization (what + where)** ✅

- YAMNet classification merged with triangulated position
- Every classified sound logged with angle, distance, proximity, side (sketch below)
- "Speech from 75° at conversational distance" not just "speech"
- *Module:* `spatial_scene.py` (`SoundEvent`, `observe`)
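
A plausible shape for the event record, matching the fields the bullets describe; the real `SoundEvent` in `spatial_scene.py` may differ:

```python
import time
from dataclasses import dataclass, field

@dataclass
class SoundEvent:
    """One classified sound, tagged with where it came from."""
    label: str          # YAMNet class, e.g. "Speech"
    angle_deg: float    # triangulated bearing
    distance_m: float   # fused distance estimate
    proximity: str      # "intimate" | "conversational" | "across_room" | "far"
    side: str           # "left" | "right" | "center"
    ts: float = field(default_factory=time.time)

    def describe(self) -> str:
        return f"{self.label} from {self.angle_deg:.0f}° at {self.proximity} distance"
```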

**9. Head orientation inference** ⏭️

- Needs fixed reference points from scene map + rotating head mount
- Math is trivial once prerequisites exist
- *Prereqs:* #6, physical rotating mount

**10. Binaural recording for training data** ✅

- Records left/right ear streams as stereo WAV in 5-minute segments (sketch below)
- Opt-in via `BINAURAL_RECORD=1` environment variable
- `GET /recording` — stats (segments, total seconds)
- *Module:* `binaural_recorder.py`
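
A minimal sketch of a segmented stereo writer using only the standard library plus NumPy; the filename scheme and int16 interleaving are assumptions, not `binaural_recorder.py` itself:

```python
import time
import wave
import numpy as np

RATE = 16000
SEGMENT_S = 300  # 5-minute segments

class BinauralRecorder:
    def __init__(self, out_dir: str = "."):
        self.out_dir = out_dir
        self.wav = None
        self.started = 0.0

    def _open_segment(self):
        if self.wav:
            self.wav.close()
        path = f"{self.out_dir}/binaural_{int(time.time())}.wav"
        self.wav = wave.open(path, "wb")
        self.wav.setnchannels(2)      # stereo: left ear, right ear
        self.wav.setsampwidth(2)      # 16-bit PCM
        self.wav.setframerate(RATE)
        self.started = time.monotonic()

    def write(self, left: np.ndarray, right: np.ndarray):
        """Interleave int16 left/right frames into the current segment."""
        if self.wav is None or time.monotonic() - self.started > SEGMENT_S:
            self._open_segment()
        stereo = np.column_stack([left, right]).astype(np.int16)
        self.wav.writeframes(stereo.tobytes())
```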

---

### Tier 4 — Research

**11. Learned spatial attention** ⏭️

- Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
- Needs training data from #10 running for days/weeks
- *Prereqs:* #5, #6, #10 data collection

**12. ITD (Interaural Time Difference) processing** ✅

- Cross-correlates left/right ear processed audio (512 samples, ~32ms window) (sketch below)
- Finds sub-millisecond delay → bearing angle via speed of sound
- At 16kHz with 175mm spacing: one sample of lag = 62.5μs ≈ 7° of bearing
- Works with 2-channel firmware (no raw mics needed — correlates processed channels)
- Third independent angle estimate alongside DoA and ILD
- *Module:* `spatial.py` (`_compute_itd`)
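
A sketch of the cross-correlation step on one 512-sample frame. The constants follow the bullets above; everything else is illustrative rather than the real `_compute_itd`:

```python
import numpy as np

RATE = 16000
EAR_SPACING_M = 0.175
SPEED_OF_SOUND = 343.0  # m/s
# Largest physically possible inter-ear delay, in samples (~9 at 16kHz)
MAX_LAG = int(np.ceil(EAR_SPACING_M / SPEED_OF_SOUND * RATE))

def itd_bearing_deg(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate bearing from the inter-ear delay of one audio frame."""
    l = left.astype(np.float64) - left.mean()
    r = right.astype(np.float64) - right.mean()
    # Full cross-correlation, then keep only physically possible lags
    corr = np.correlate(l, r, mode="full")
    mid = len(corr) // 2  # zero-lag index
    window = corr[mid - MAX_LAG : mid + MAX_LAG + 1]
    lag = int(np.argmax(window)) - MAX_LAG   # samples; sign picks the side
    delay_s = lag / RATE                     # sub-millisecond ITD
    # sin(theta) = c * delay / d, clipped for numerical safety
    s = np.clip(SPEED_OF_SOUND * delay_s / EAR_SPACING_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```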

---

## Implementation status

```
✅ #1 Triangulation + gaze
✅ #2 Smooth tracking
✅ #3 Speaker-side awareness
✅ #4 Distance estimation + proximity zones
✅ #5 Multi-speaker separation + beam steering
✅ #6 Spatial audio scene mapping + anomaly detection
✅ #7 Cocktail party spatial filtering
✅ #8 Sound event localization (what + where + when)
⏭️ #9 Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation
```

10 of 12 features implemented in one session. The remaining two (#9, #11) need
physical hardware changes or training data that accumulates over time.

## Three-signal localization

The system now fuses three independent spatial estimates:

| Signal | Source | Strength | Best for |
|--------|--------|----------|----------|
| **DoA** | XVF3800 beamformer (auto-select beam) | Best overall | Horizontal angle |
| **ILD** | Volume difference between ears | Good for near sources | Distance + rough angle |
| **ITD** | Cross-correlation delay between ears | Good at low frequencies | Precise angle |

Combined, they give more robust localization than any single signal — the same
three cues human hearing uses.
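
One way to fuse the three bearing estimates is a confidence-weighted circular average; the weights here are illustrative, not values the system actually uses:

```python
import math

def fuse_bearings(estimates: list[tuple[float, float]]) -> float:
    """Confidence-weighted circular average of (bearing_deg, weight) pairs.

    Example: fuse_bearings([(doa_deg, 0.6), (itd_deg, 0.3), (ild_deg, 0.1)])
    """
    s = sum(w * math.sin(math.radians(a)) for a, w in estimates)
    c = sum(w * math.cos(math.radians(a)) for a, w in estimates)
    return math.degrees(math.atan2(s, c))
```
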
## Key discoveries during development

1. `DOA_VALUE` (resid=20) is sluggish — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35)
2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
3. `AEC_SPENERGY_VALUES` is always zero on 2-channel firmware
4. USB read responses have a 1-byte status header before data
5. Exact `wLength` matters — `count * type_size + 1`, not rounded up
6. 6-channel firmware breaks LED/control commands — use 2-channel only
7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget (sketch below)
8. ITD works on processed channels (not just raw mics) — less precise but free
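
For discovery #7, a sketch of a threaded fire-and-forget gaze push; the eye-service URL and payload shape are assumptions:

```python
import threading
import requests

EYE_SERVICE = "http://localhost:8001/gaze"  # assumed endpoint

def push_gaze(x: int, y: int) -> None:
    """Fire-and-forget: never block the caller on network I/O."""
    def _send():
        try:
            requests.post(EYE_SERVICE, json={"x": x, "y": y}, timeout=0.5)
        except requests.RequestException:
            pass  # drop the update; a fresher one is coming anyway
    threading.Thread(target=_send, daemon=True).start()
```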

---

*Built during the Great Binaural Session of April 2026*
*"She hears in stereo now" 🦊👂👂*