headmic/BINAURAL_ROADMAP.md
Alex 5c72491ee9 Update docs — complete binaural roadmap (10/12 features)
BINAURAL_ROADMAP: Full status update with implementation details,
three-signal localization table, key discoveries section.

README: Updated features table (ITD, multi-speaker, cocktail party),
new API endpoints (/speakers/tracked, /speakers/focus), file structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 21:55:09 -05:00

# Binaural Hearing Roadmap
## What two mic arrays make possible
All features are built on dual XVF3800 arrays, 175mm apart on the skull.

---
### Tier 1 — High impact ✅
**1. Triangulated sound localization + eye gaze**
- Combine DoA angles from both arrays → compute the (x, y) position of the sound source
- Post gaze coordinates to the eye service → eyes track the speaker spatially
- Uses the `AUDIO_MGR_SELECTED_AZIMUTHS` auto-select beam (not the sluggish `DOA_VALUE`)
- VAD derived from the processed DoA value being NaN (silence) vs. non-NaN (speech)
- *Module:* `spatial.py`
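
The triangulation step amounts to intersecting two bearing rays. A minimal sketch, assuming a convention the doc doesn't state (0° points straight ahead, positive angles turn right, arrays on the x-axis); `triangulate` is an illustrative name, not the `spatial.py` API:

```python
import math

BASELINE_M = 0.175  # distance between the two arrays (from the doc)

def triangulate(theta_left_deg: float, theta_right_deg: float):
    """Intersect the bearing rays from the left/right arrays to get (x, y).

    Assumed convention: arrays at x = -baseline/2 and x = +baseline/2,
    0 deg points straight ahead (+y), positive angles turn toward +x.
    """
    xl, xr = -BASELINE_M / 2, BASELINE_M / 2
    tl, tr = math.radians(theta_left_deg), math.radians(theta_right_deg)
    # Ray direction for bearing theta is (sin theta, cos theta) under this convention
    dlx, dly = math.sin(tl), math.cos(tl)
    drx, dry = math.sin(tr), math.cos(tr)
    # Solve xl + s*dlx = xr + t*drx and s*dly = t*dry for s
    denom = dlx * dry - dly * drx
    if abs(denom) < 1e-9:
        return None  # rays (nearly) parallel: source too far off to triangulate
    s = (xr - xl) * dry / denom
    return (xl + s * dlx, s * dly)
```

When the two bearings are nearly equal the rays never cross, which is why distant sources lean on the ILD distance estimate from Tier 2 instead.
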

**2. Active speaker tracking with smooth gaze**
- Exponential smoothing (α=0.4) prevents jitter
- Idle drift back to center after 1.5s of silence
- Gaze pushed to the eye service at ≤2Hz with a 5px minimum delta
- *Module:* `spatial.py`, `headmic.py` (`doa_track_loop`)
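
The smoothing and idle-drift behavior is a small state machine. A minimal sketch using the doc's constants (α=0.4, 1.5s idle timeout); class and method names are illustrative, not the real `spatial.py` API, and angle 0.0 meaning "center" is an assumption:

```python
import time

ALPHA = 0.4           # smoothing factor (from the doc)
IDLE_TIMEOUT_S = 1.5  # silence duration before drifting back to center

class GazeSmoother:
    """Exponentially smoothed gaze angle with idle drift back to center."""

    def __init__(self):
        self.angle = 0.0                      # current smoothed gaze angle
        self.last_voice = time.monotonic()

    def update(self, doa_deg, now=None):
        """Feed one DoA reading (None = no speech); return the gaze angle."""
        now = time.monotonic() if now is None else now
        if doa_deg is not None:               # speech present: chase the speaker
            self.last_voice = now
            self.angle += ALPHA * (doa_deg - self.angle)
        elif now - self.last_voice > IDLE_TIMEOUT_S:
            self.angle += ALPHA * (0.0 - self.angle)   # idle: drift to center
        return self.angle
```

The same α reused for the drift keeps the return-to-center as smooth as the tracking itself.
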

**3. Left/right speaker awareness**
- Resemblyzer speaker ID is integrated and ready for enrollment
- Spatial position is associated with each recognized speaker
- `POST /speakers/enroll-from-mic?name=Alex` to enroll
- *Module:* `speaker_id.py`, `spatial.py`

---
### Tier 2 — High impact ✅
**4. Distance estimation (near/far)**
- ILD (Interaural Level Difference): volume gap between ears → distance
- Fused with triangulated distance (70/30 weight)
- Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
- *Module:* `spatial.py` (`_compute_ild`, `_ild_to_distance`, `_classify_proximity`)
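
The ILD and fusion pieces can be sketched as below. The 70/30 split and zone boundaries come from the doc; which estimate carries the 70% weight is an assumption (triangulation, being geometric, seems the likelier anchor), the epsilon guard is an assumption, and the actual ILD→distance curve (`_ild_to_distance`) is left out because the doc doesn't give it:

```python
import math

def compute_ild_db(rms_left: float, rms_right: float, eps: float = 1e-12) -> float:
    """Interaural level difference in dB (positive = louder in the left ear)."""
    return 20.0 * math.log10((rms_left + eps) / (rms_right + eps))

def fuse_distance(tri_m: float, ild_m: float, w_tri: float = 0.7) -> float:
    """70/30 fusion of triangulated and ILD-derived distance estimates."""
    return w_tri * tri_m + (1.0 - w_tri) * ild_m

def classify_proximity(d_m: float) -> str:
    """Map a fused distance onto the doc's proximity zones."""
    if d_m < 0.5:
        return "intimate"
    if d_m < 2.0:
        return "conversational"
    if d_m <= 5.0:
        return "across_room"
    return "far"
```
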

**5. Multi-speaker separation + selective attention**
- Tracks up to two speakers simultaneously by DoA angle (≥30° separation required)
- Locks XVF3800 fixed beams onto each speaker after 1s of stability
- Auto-releases to free-running mode after 3s of a single speaker
- Beam gating silences the non-speaking beam
- `GET /speakers/tracked` — positions, beam state, lock status
- *Module:* `multi_speaker.py`, `xvf3800.py` (beam steering commands)
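
The assignment rule can be sketched as: a DoA reading updates the nearest existing track, and opens a second track only when it is at least 30° away. Names are illustrative, not the `multi_speaker.py` API, and angle wrap-around at 0/360° is ignored for brevity:

```python
MIN_SEPARATION_DEG = 30.0  # from the doc
MAX_SPEAKERS = 2

class SpeakerTracks:
    """Assign DoA readings to up to two speaker tracks by angular distance."""

    def __init__(self):
        self.tracks = []  # smoothed angle per tracked speaker

    def observe(self, doa_deg: float) -> int:
        """Return the index of the track this reading was assigned to."""
        for i, ang in enumerate(self.tracks):
            if abs(doa_deg - ang) < MIN_SEPARATION_DEG:
                self.tracks[i] = 0.7 * ang + 0.3 * doa_deg  # light smoothing
                return i
        if len(self.tracks) < MAX_SPEAKERS:
            self.tracks.append(doa_deg)          # far enough: new speaker
            return len(self.tracks) - 1
        # Both slots taken and the reading fits neither: snap the nearest track
        i = min(range(len(self.tracks)), key=lambda i: abs(doa_deg - self.tracks[i]))
        self.tracks[i] = doa_deg
        return i
```
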

**6. Spatial audio scene mapping**
- Persistent map of where each sound category usually comes from (30° bins)
- Circular mean for the usual direction; anomaly detection at ≥90° deviation
- Saves to `~/.vixy/scene_map.json` across restarts
- `GET /scene` — learned directions per category + anomalies
- `GET /scene/events` — recent what+where+when log
- `GET /scene/heatmap` — angular distribution for visualization
- *Module:* `spatial_scene.py`
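
The circular mean is what makes "usual direction" correct across the 0/360° wrap (a plain average of 350° and 10° gives 180°, the exact opposite direction). A minimal sketch of the two primitives; function names are illustrative:

```python
import math

def circular_mean_deg(angles_deg) -> float:
    """Mean bearing of a set of angles, correct across the 0/360 wrap."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

def is_anomaly(angle_deg: float, usual_deg: float, threshold_deg: float = 90.0) -> bool:
    """Flag a sound arriving >= 90 deg from its usual direction (threshold from the doc)."""
    diff = abs((angle_deg - usual_deg + 180.0) % 360.0 - 180.0)  # shortest arc
    return diff >= threshold_deg
```
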

---
### Tier 3 — Advanced ✅
**7. Cocktail party spatial filtering**
- When two speakers are tracked, audio focus locks onto the target speaker's side
- Non-target beam suppressed via XVF3800 beam gating
- Auto-switches target when the current speaker goes silent and the other starts talking
- Manual focus via `POST /speakers/focus?speaker=0|1`
- `DualAudioStream.focus_side` overrides energy-based beam selection
- *Module:* `multi_speaker.py`, `audio_stream.py` (`focus_side`)
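
The override logic mirrors the `focus_side` idea above: normally the louder side wins, unless a manual focus pins one side. A tiny sketch (this class is illustrative, not the `DualAudioStream` API):

```python
class BeamSelector:
    """Pick which ear's beam feeds downstream audio."""

    def __init__(self):
        self.focus_side = None  # None = energy-based; else "left" or "right"

    def select(self, rms_left: float, rms_right: float) -> str:
        if self.focus_side in ("left", "right"):
            return self.focus_side          # manual focus wins
        return "left" if rms_left >= rms_right else "right"
```
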

**8. Sound event localization (what + where)**
- YAMNet classification merged with the triangulated position
- Every classified sound is logged with angle, distance, proximity, and side
- "Speech from 75° at conversational distance," not just "speech"
- *Module:* `spatial_scene.py` (`SoundEvent`, `observe`)
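
A what+where+when record might look like the sketch below. The doc confirms a `SoundEvent` exists in `spatial_scene.py` but not its shape, so these field names are assumptions:

```python
import time
from dataclasses import dataclass, field

@dataclass
class SoundEvent:
    """One classified sound plus its spatial context."""
    label: str            # YAMNet class, e.g. "speech"
    angle_deg: float      # triangulated bearing
    distance_m: float     # fused distance estimate
    proximity: str        # "intimate" / "conversational" / "across_room" / "far"
    side: str             # "left" / "right" / "center"
    t: float = field(default_factory=time.time)

    def describe(self) -> str:
        return f"{self.label} from {self.angle_deg:.0f}° at {self.proximity} distance"
```
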

**9. Head orientation inference** ⏭️
- Needs fixed reference points from the scene map + a rotating head mount
- The math is trivial once the prerequisites exist
- *Prereqs:* #6, physical rotating mount

**10. Binaural recording for training data**
- Records the left/right ear streams as stereo WAV in 5-minute segments
- Opt-in via the `BINAURAL_RECORD=1` environment variable
- `GET /recording` — stats (segments, total seconds)
- *Module:* `binaural_recorder.py`
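
Segmented stereo recording can be sketched with the stdlib `wave` module: left and right ear samples interleaved as 2-channel 16-bit PCM, rolled to a new file every 5 minutes. The class and file naming are illustrative, not the `binaural_recorder.py` API:

```python
import os
import wave

SAMPLE_RATE = 16_000
SEGMENT_S = 300  # 5-minute segments, per the doc

class SegmentedStereoRecorder:
    """Roll interleaved 16-bit L/R PCM into numbered stereo WAV segments."""

    def __init__(self, out_dir="recordings"):
        os.makedirs(out_dir, exist_ok=True)
        self.out_dir, self.seg, self.frames, self.wav = out_dir, 0, 0, None

    def _open_segment(self):
        self.seg += 1
        path = os.path.join(self.out_dir, f"binaural_{self.seg:04d}.wav")
        self.wav = wave.open(path, "wb")
        self.wav.setnchannels(2)           # left ear, right ear
        self.wav.setsampwidth(2)           # 16-bit PCM
        self.wav.setframerate(SAMPLE_RATE)
        self.frames = 0

    def write(self, stereo_pcm: bytes):
        """Append interleaved stereo frames, rolling the file every 5 minutes."""
        if self.wav is None:
            self._open_segment()
        self.wav.writeframes(stereo_pcm)
        self.frames += len(stereo_pcm) // 4    # 2 channels x 2 bytes per frame
        if self.frames >= SEGMENT_S * SAMPLE_RATE:
            self.close()                       # next write starts a new segment

    def close(self):
        if self.wav is not None:
            self.wav.close()
            self.wav = None
```
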

---
### Tier 4 — Research
**11. Learned spatial attention** ⏭️
- Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
- Needs training data from #10 running for days/weeks
- *Prereqs:* #5, #6, #10 data collection

**12. ITD (Interaural Time Difference) processing**
- Cross-correlates the left/right ear processed audio (512-sample, 32ms windows)
- Finds the sub-millisecond delay → bearing angle via the speed of sound
- At 16kHz with a 175mm baseline: resolution of 62.5μs per sample ≈ 7° per sample
- Works with the 2-channel firmware (no raw mics needed — correlates processed channels)
- Third independent angle estimate alongside DoA and ILD
- *Module:* `spatial.py` (`_compute_itd`)
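
A sketch of the cross-correlation technique named above, assuming the doc's 16kHz rate and 175mm baseline (the function name and sign convention are illustrative, not the `_compute_itd` internals). Restricting the search to physically possible lags, roughly ±9 samples here, keeps spurious correlation peaks out:

```python
import numpy as np

SAMPLE_RATE = 16_000
BASELINE_M = 0.175
SPEED_OF_SOUND = 343.0  # m/s
MAX_LAG = int(round(BASELINE_M / SPEED_OF_SOUND * SAMPLE_RATE)) + 1  # ~9 samples

def itd_bearing_deg(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate bearing from the left/right delay via cross-correlation.

    Operates on matched windows (e.g. 512 samples) of the two processed
    ear channels. Positive bearing = right channel leads = source to the
    right; negative = source to the left (assumed convention).
    """
    left = left - left.mean()
    right = right - right.mean()
    corr = np.correlate(left, right, mode="full")   # lags -(N-1)..(N-1)
    lags = np.arange(-len(left) + 1, len(right))
    mask = np.abs(lags) <= MAX_LAG                  # physically possible delays only
    lag = lags[mask][np.argmax(corr[mask])]
    delay_s = lag / SAMPLE_RATE
    # delay = baseline * sin(theta) / c  ->  solve for theta
    sin_t = np.clip(delay_s * SPEED_OF_SOUND / BASELINE_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_t)))
```

A one-sample delay (62.5μs) maps to arcsin(0.0625 · 343 / 0.175 / 16) ≈ 7° near center, matching the resolution figure above.
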

---
## Implementation status
```
✅ #1 Triangulation + gaze
✅ #2 Smooth tracking
✅ #3 Speaker-side awareness
✅ #4 Distance estimation + proximity zones
✅ #5 Multi-speaker separation + beam steering
✅ #6 Spatial audio scene mapping + anomaly detection
✅ #7 Cocktail party spatial filtering
✅ #8 Sound event localization (what + where + when)
⏭️ #9 Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation
```
10 of 12 features implemented in one session. The remaining two (#9, #11) need
physical hardware changes or training data that accumulates over time.
## Three-signal localization
The system now fuses three independent spatial estimates:

| Signal | Source | Strength | Best for |
|--------|--------|----------|----------|
| **DoA** | XVF3800 beamformer (auto-select beam) | Best overall | Horizontal angle |
| **ILD** | Volume difference between ears | Good for near sources | Distance + rough angle |
| **ITD** | Cross-correlation delay between ears | Good at low frequencies | Precise angle |
Combined, they give more robust localization than any single signal — the same
three cues human hearing uses.
## Key discoveries during development
1. `DOA_VALUE` (resid=20) is sluggish — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35)
2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
3. `AEC_SPENERGY_VALUES` is always zero on 2-channel firmware
4. USB read responses have a 1-byte status header before data
5. Exact `wLength` matters — `count * type_size + 1`, not rounded up
6. 6-channel firmware breaks LED/control commands — use 2-channel only
7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget
8. ITD works on processed channels (not just raw mics) — less precise but free
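
Discovery #7 suggests pushing gaze over HTTP on a detached thread so the async server never blocks on the round-trip. A minimal fire-and-forget sketch; the URL, query parameters, and the eye-service endpoint shape are illustrative, not the actual API:

```python
import threading
import urllib.request

def push_gaze(x: int, y: int, url: str = "http://localhost:8000/gaze") -> None:
    """Fire-and-forget gaze push on a daemon thread (returns immediately)."""
    def _send():
        try:
            req = urllib.request.Request(f"{url}?x={x}&y={y}", method="POST")
            urllib.request.urlopen(req, timeout=0.5).close()
        except OSError:
            pass  # gaze is best-effort: drop the update on any network error
    threading.Thread(target=_send, daemon=True).start()
```

Because the caller never waits on the socket, update rate is bounded only by the ≤2Hz throttle, not by the eye service's response time.
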

---
*Built during the Great Binaural Session of April 2026*
*"She hears in stereo now" 🦊👂👂*