diff --git a/BINAURAL_ROADMAP.md b/BINAURAL_ROADMAP.md
index ebc485f..c7288fa 100644
--- a/BINAURAL_ROADMAP.md
+++ b/BINAURAL_ROADMAP.md
@@ -1,139 +1,151 @@
 # Binaural Hearing Roadmap
 
 ## What two mic arrays make possible
 
-Ranked by impact × feasibility. All build on the existing dual XVF3800 + `/doa` endpoint.
+All features build on dual XVF3800 arrays mounted 175mm apart on the skull.
 
 ---
 
-### Tier 1 — High impact, ready to build now
+### Tier 1 — High impact ✅
 
-**1. Triangulated sound localization + eye gaze**
+**1. Triangulated sound localization + eye gaze** ✅
 - Combine DoA angles from both arrays → compute (x, y) position of sound source
 - Post gaze coordinates to eye service → eyes track the speaker spatially
-- Front/back disambiguation (single array can't tell 30° front from 30° rear)
-- *Prereqs:* Known array positions (measured once), basic trig
-- *Complexity:* Low — ~100 lines of math + a gaze-push thread
-- *Impact:* Huge — eyes actually follow the person, not just shift left/right
+- Uses the `AUDIO_MGR_SELECTED_AZIMUTHS` auto-select beam (not the sluggish `DOA_VALUE`)
+- VAD derived from `processed_doa`: NaN means silence, non-NaN means speech
+- *Module:* `spatial.py` (see the sketch at the end of this tier)
 
-**2. Active speaker tracking with smooth gaze**
-- Continuously track the dominant sound source as it moves
-- Smooth the gaze updates (low-pass filter) so eyes don't jitter
-- When VAD drops, eyes drift back to center (natural idle behavior)
-- *Prereqs:* #1
-- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1
-- *Impact:* Makes her feel present and attentive
+**2. Active speaker tracking with smooth gaze** ✅
+- Exponential smoothing (α=0.4) prevents jitter
+- Idle drift back to center after 1.5s of silence
+- Gaze pushed to eye service at ≤2Hz with a 5px minimum delta
+- *Module:* `spatial.py`, `headmic.py` (`doa_track_loop`)
 
-**3. Left/right speaker awareness**
-- Know which side each speaker is on, combine with speaker ID
-- "Alex is on my left" vs "unknown person on my right"
-- Feed into LYRA context so responses can reference spatial relationships
-- *Prereqs:* #1 + existing speaker ID
-- *Complexity:* Medium — associate speaker embeddings with spatial positions
-- *Impact:* Multi-person conversations become spatially grounded
+**3. Left/right speaker awareness** ✅
+- Resemblyzer speaker ID integrated, ready for enrollment
+- Spatial position associated with each recognized speaker
+- `POST /speakers/enroll-from-mic?name=Alex` to enroll
+- *Module:* `speaker_id.py`, `spatial.py`
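+
+A minimal, illustrative sketch of #1's ray intersection and #2's smoothing (the
+angle convention and helper names here are assumptions, not `spatial.py`'s
+actual code):
+
+```python
+import math
+
+EAR_BASELINE_M = 0.175   # array separation on the skull
+ALPHA = 0.4              # exponential-smoothing weight from #2
+
+def triangulate(theta_l_deg: float, theta_r_deg: float):
+    """Intersect the two DoA bearings to get a source position (x, y) in metres.
+
+    Assumes 0° = straight ahead, positive angles to the right, and arrays at
+    x = -baseline/2 (left ear) and x = +baseline/2 (right ear). Returns None
+    for near-parallel rays: a distant source whose range is unobservable
+    across a 175mm baseline.
+    """
+    tl = math.tan(math.radians(theta_l_deg))
+    tr = math.tan(math.radians(theta_r_deg))
+    if abs(tl - tr) < 1e-6:
+        return None
+    y = EAR_BASELINE_M / (tl - tr)    # forward distance
+    x = y * tl - EAR_BASELINE_M / 2   # lateral offset from skull center
+    return x, y
+
+def smooth(prev_xy, new_xy, alpha=ALPHA):
+    """One exponential-smoothing step so gaze updates don't jitter."""
+    if prev_xy is None:
+        return new_xy
+    return tuple(alpha * n + (1 - alpha) * p for p, n in zip(prev_xy, new_xy))
+```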
 
 ---
 
-### Tier 2 — High impact, moderate effort
+### Tier 2 — High impact ✅
 
-**4. Distance estimation (near/far)**
-- Interaural Level Difference (ILD): close sources have bigger volume gap between ears
-- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
-- Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+)
-- *Prereqs:* #1, calibration with known distances
-- *Complexity:* Medium — ILD from processed channels is easy, ITD needs raw mics
-- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)
+**4. Distance estimation (near/far)** ✅
+- ILD (Interaural Level Difference): the volume gap between ears → distance
+- Fused with triangulated distance (70/30 weight)
+- Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
+- *Module:* `spatial.py` (`_compute_ild`, `_ild_to_distance`, `_classify_proximity`)
 
-**5. Multi-speaker separation + selective attention**
-- Lock each array's beam to a different speaker simultaneously
-- Active speaker gets primary audio feed (wake word, transcription)
-- Secondary speaker monitored for interruptions or wake word
-- Switch attention on cue ("Hey Vivi" from the other side)
-- *Prereqs:* #3, understanding of XVF3800 beam steering commands
-- *Complexity:* Medium-high — need to control beamformer direction per-array
-- *Impact:* Natural multi-person conversations, not just one-at-a-time
+**5. Multi-speaker separation + selective attention** ✅
+- Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
+- Locks XVF3800 fixed beams onto each speaker after 1s of stability
+- Auto-releases to free-running mode after 3s of a single speaker
+- Beam gating silences the non-speaking beam
+- `GET /speakers/tracked` — positions, beam state, lock status
+- *Module:* `multi_speaker.py`, `xvf3800.py` (beam steering commands)
 
-**6. Spatial audio scene mapping**
-- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
-- Learn from repeated sound sources over hours/days
-- Detect anomalies: "sound from an unusual direction"
-- *Prereqs:* #1, persistent storage, classification by direction
-- *Complexity:* Medium — accumulate (direction, category) pairs, cluster over time
-- *Impact:* Environmental awareness, contextual anomaly detection
+**6. Spatial audio scene mapping** ✅
+- Persistent map of where each sound category usually comes from (30° bins)
+- Circular mean for the usual direction, anomaly detection at 90°+ deviation
+- Saves to `~/.vixy/scene_map.json` across restarts
+- `GET /scene` — learned directions per category + anomalies
+- `GET /scene/events` — recent what+where+when log
+- `GET /scene/heatmap` — angular distribution for visualization
+- *Module:* `spatial_scene.py` (circular mean sketched below)
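+
+A minimal sketch of #6's circular statistics (hypothetical helper names; the
+real logic lives in `spatial_scene.py`):
+
+```python
+import math
+
+def circular_mean_deg(angles_deg):
+    """Mean direction on a circle (avoids the 359°/1° → 180° trap)."""
+    x = sum(math.cos(math.radians(a)) for a in angles_deg)
+    y = sum(math.sin(math.radians(a)) for a in angles_deg)
+    return math.degrees(math.atan2(y, x)) % 360
+
+def is_anomaly(angle_deg, usual_deg, threshold_deg=90.0):
+    """Flag a sound arriving 90°+ away from its category's learned direction."""
+    diff = abs(angle_deg - usual_deg) % 360
+    return min(diff, 360 - diff) >= threshold_deg
+```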
 
 ---
 
-### Tier 3 — Cool, needs more infrastructure
+### Tier 3 — Advanced ✅
 
-**7. Cocktail party spatial filtering**
-- When multiple sound sources active, use both arrays to null out interference
-- Focus beam on target speaker, suppress others spatially
-- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
-- *Complexity:* High — adaptive beamforming, may need custom DSP
-- *Impact:* Works in noisy environments (music playing, multiple people)
+**7. Cocktail party spatial filtering** ✅
+- When 2 speakers are tracked, audio focus locks to the target speaker's side
+- Non-target beam suppressed via XVF3800 beam gating
+- Auto-switches target when the current speaker goes silent and the other starts talking
+- Manual focus via `POST /speakers/focus?speaker=0|1`
+- `DualAudioStream.focus_side` overrides energy-based beam selection
+- *Module:* `multi_speaker.py`, `audio_stream.py` (`focus_side`)
 
-**8. Sound event localization (what + where)**
-- Combine YAMNet classification with triangulated position
-- "Dog bark from the backyard direction" not just "dog bark"
-- Spatial history: timeline of what happened where
-- *Prereqs:* #1, #6
-- *Complexity:* Medium — merge classification results with position data
-- *Impact:* Rich environmental narrative for LYRA context
+**8. Sound event localization (what + where)** ✅
+- YAMNet classification merged with triangulated position
+- Every classified sound logged with angle, distance, proximity, side
+- "Speech from 75° at conversational distance", not just "speech"
+- *Module:* `spatial_scene.py` (`SoundEvent`, `observe`)
 
-**9. Head orientation inference**
-- If a known sound source is at a fixed position, infer which way the head is "facing"
-- Useful if the skull ever gets a rotating mount
-- *Prereqs:* #6 (known spatial map)
-- *Complexity:* Low math, but needs stable reference points
-- *Impact:* Low for now (head doesn't turn), future-proofing
+**9. Head orientation inference** ⏭️
+- Needs fixed reference points from the scene map + a rotating head mount
+- The math is trivial once those prerequisites exist
+- *Prereqs:* #6, physical rotating mount
 
-**10. Binaural recording for training data**
-- Record stereo audio preserving spatial information (left ear / right ear)
-- Training corpus for spatial audio models, being0 sensor data
-- *Prereqs:* Just dual streams saved to stereo WAV
-- *Complexity:* Low — already have both streams
-- *Impact:* Long-term value for L-Vixy-5 training
+**10. Binaural recording for training data** ✅
+- Records left/right ear streams as stereo WAV in 5-minute segments
+- Opt-in via the `BINAURAL_RECORD=1` environment variable
+- `GET /recording` — stats (segments, total seconds)
+- *Module:* `binaural_recorder.py` (interleaving sketched below)
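+
+A minimal sketch of #10's stereo interleaving (assumes int16 frames at 16kHz;
+`binaural_recorder.py` may differ in details):
+
+```python
+import wave
+
+import numpy as np
+
+def write_stereo_segment(path, left, right, rate=16_000):
+    """Interleave left/right ear int16 frames into one stereo WAV segment."""
+    n = min(len(left), len(right))   # the two streams can drift by a few frames
+    stereo = np.empty(n * 2, dtype=np.int16)
+    stereo[0::2] = left[:n]          # channel 0 = left ear
+    stereo[1::2] = right[:n]         # channel 1 = right ear
+    with wave.open(path, "wb") as wf:
+        wf.setnchannels(2)
+        wf.setsampwidth(2)           # 2 bytes per sample = int16
+        wf.setframerate(rate)
+        wf.writeframes(stereo.tobytes())
+```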
 
 ---
 
-### Tier 4 — Research / future
+### Tier 4 — Research
 
-**11. Learned spatial attention**
-- Train a model to decide where to attend based on context
-- Input: both DoA angles, VAD states, current emotional state, conversation history
-- Output: beam steering + gaze direction
-- *Prereqs:* #5, #6, training data from #10
-- *Complexity:* High — ML training pipeline
-- *Impact:* Autonomous attention that feels natural, not rule-based
+**11. Learned spatial attention** ⏭️
+- Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
+- Needs training data from #10 running for days/weeks
+- *Prereqs:* #5, #6, #10 data collection
 
-**12. Interaural time difference (ITD) processing**
-- Raw mic access (6-channel firmware) enables sub-sample timing analysis
-- More precise localization than DoA alone, especially at low frequencies
-- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
-- *Complexity:* High — signal processing, cross-correlation
-- *Impact:* Lab-grade localization accuracy
+**12. ITD (Interaural Time Difference) processing** ✅
+- Cross-correlates left/right ear processed audio (512 samples, ~32ms window)
+- Finds the sub-millisecond delay → bearing angle via the speed of sound
+- At 16kHz and 175mm: resolution is 62.5μs per sample, ≈7° per sample near center
+- Works with 2-channel firmware (no raw mics needed — correlates processed channels)
+- Third independent angle estimate alongside DoA and ILD
+- *Module:* `spatial.py` (`_compute_itd`; sketched below)
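+
+A minimal sketch of #12's cross-correlation (illustrative; `_compute_itd`'s
+actual windowing and sign convention may differ):
+
+```python
+import numpy as np
+
+SAMPLE_RATE = 16_000
+EAR_BASELINE_M = 0.175
+SPEED_OF_SOUND = 343.0  # m/s at room temperature
+
+def itd_bearing_deg(left, right):
+    """Bearing from the inter-ear delay of two time-aligned 512-sample frames."""
+    # Physically possible lag: baseline / c ≈ 510µs ≈ 8 samples at 16kHz.
+    max_lag = int(np.ceil(EAR_BASELINE_M / SPEED_OF_SOUND * SAMPLE_RATE)) + 1
+    corr = np.correlate(left.astype(float), right.astype(float), mode="full")
+    mid = len(corr) // 2                    # index of zero lag
+    window = corr[mid - max_lag : mid + max_lag + 1]
+    lag = int(np.argmax(window)) - max_lag  # inter-ear delay in samples
+    sin_theta = (lag / SAMPLE_RATE) * SPEED_OF_SOUND / EAR_BASELINE_M
+    if abs(sin_theta) > 1.0:
+        return None                         # spurious correlation peak
+    # Which sign means "left" vs "right" should be calibrated once against
+    # a source at a known bearing.
+    return float(np.degrees(np.arcsin(sin_theta)))
+```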
 
 ---
 
-## Implementation order
+## Implementation status
 
 ```
-✅ #1 Triangulation + gaze — done (spatial.py, auto-select beam DoA)
-✅ #2 Smooth tracking — done (exponential smoothing + idle drift)
-✅ #3 Speaker-side awareness — done (Resemblyzer loaded, ready for enrollment)
-✅ #4 Distance estimation — done (ILD + triangulation fusion, proximity zones)
-✅ #6 Spatial scene mapping — done (spatial_scene.py, persistent, anomaly detection)
-✅ #8 Sound event localization — done (what + where + when via /scene/events)
-✅ #10 Binaural recording — done (opt-in via BINAURAL_RECORD=1)
-  #5 Multi-speaker separation
-  #7 Cocktail party filtering
-  #11 Learned attention
+✅ #1 Triangulation + gaze
+✅ #2 Smooth tracking
+✅ #3 Speaker-side awareness
+✅ #4 Distance estimation + proximity zones
+✅ #5 Multi-speaker separation + beam steering
+✅ #6 Spatial audio scene mapping + anomaly detection
+✅ #7 Cocktail party spatial filtering
+✅ #8 Sound event localization (what + where + when)
+⏭️ #9 Head orientation inference (needs rotating mount)
+✅ #10 Binaural recording (opt-in)
+⏭️ #11 Learned spatial attention (needs training data)
+✅ #12 ITD cross-correlation
 ```
 
-## Notes
+10 of 12 features implemented in one session. The remaining two (#9, #11) need
+physical hardware changes or training data that accumulates over time.
 
-- Items #1-3 can be built in a single session
-- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}`
-- DoA is already polled at 10Hz via `/doa` endpoint
-- Array separation distance needs to be measured once and stored in config
-- All of this feeds into the being0 "shaped by experience" philosophy
+## Three-signal localization
+
+The system now fuses three independent spatial estimates:
+
+| Signal | Source | Strength | Best for |
+|--------|--------|----------|----------|
+| **DoA** | XVF3800 beamformer (auto-select beam) | Best overall | Horizontal angle |
+| **ILD** | Volume difference between ears | Good for near sources | Distance + rough angle |
+| **ITD** | Cross-correlation delay between ears | Good at low frequencies | Precise angle |
+
+Combined, they give more robust localization than any single signal — the same
+three cues human hearing uses.
+
+## Key discoveries during development
+
+1. `DOA_VALUE` (resid=20) is sluggish — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35)
+2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
+3. `AEC_SPENERGY_VALUES` is always zero on 2-channel firmware
+4. USB read responses have a 1-byte status header before the data
+5. Exact `wLength` matters — `count * type_size + 1`, not rounded up (see sketch below)
+6. 6-channel firmware breaks LED/control commands — use 2-channel only
+7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget
+8. ITD works on processed channels (not just raw mics) — less precise but free
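+
+Discoveries 4 and 5, as a hedged pyusb sketch (the request constants here are
+assumptions; `xvf3800.py` is the source of truth):
+
+```python
+import usb.core
+
+def read_resource(dev, resid, cmdid, count, type_size):
+    """Vendor control read: wValue = cmdid, wIndex = resid.
+
+    wLength must be exactly count * type_size + 1: the extra byte is the
+    status header the device prepends to the payload.
+    """
+    wlength = count * type_size + 1
+    data = dev.ctrl_transfer(
+        0xC0,     # bmRequestType: device-to-host | vendor | device (assumed)
+        0,        # bRequest (assumed 0; check xvf3800.py for the real value)
+        cmdid,    # wValue
+        resid,    # wIndex
+        wlength,  # exact read length, not rounded up
+    )
+    status, payload = data[0], bytes(data[1:])
+    if status != 0:
+        raise IOError(f"XVF3800 returned status {status}")
+    return payload
+```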
+
+---
+
+*Built during the Great Binaural Session of April 2026*
+*"She hears in stereo now" 🦊👂👂*
diff --git a/README.md b/README.md
index 5300bbb..a0b22e8 100644
--- a/README.md
+++ b/README.md
@@ -49,11 +49,14 @@ Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial
 | Wake word detection | Porcupine | CPU | Needs Picovoice key |
 | Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
 | Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrollment via API |
-| Spatial tracking | spatial.py | USB control | Triangulated gaze + ILD distance |
+| Spatial tracking | spatial.py | USB control | 3-signal fusion: DoA + ILD + ITD |
 | Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
+| ITD processing | spatial.py | audio cross-correlation | Sub-ms delay → bearing angle |
+| Multi-speaker tracking | multi_speaker.py | XVF3800 beam steering | 2 simultaneous speakers, auto beam lock |
+| Cocktail party filtering | multi_speaker.py + audio_stream.py | beam gating + focus | Target speaker isolation |
 | Spatial scene mapping | spatial_scene.py | — | Learns where sounds come from, anomaly detection |
 | Sound event localization | spatial_scene.py | — | What + where + when log |
-| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based, 10% hysteresis |
+| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based or focused attention |
 | LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
 | Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
@@ -174,8 +177,10 @@ sudo systemctl start headmic
 | Endpoint | Method | Description |
 |----------|--------|-------------|
-| `/doa` | GET | DoA from both arrays + triangulated position + gaze + distance + proximity |
+| `/doa` | GET | DoA + triangulated position + ILD + ITD + gaze + distance + proximity |
 | `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
+| `/speakers/tracked` | GET | Multi-speaker positions, beam mode, lock state, target |
+| `/speakers/focus` | POST | Switch cocktail party attention (query: speaker=0\|1) |
 | `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
 | `/scene/events` | GET | Recent sound events with what + where + when (query: seconds, category) |
 | `/scene/heatmap` | GET | Per-category angular distribution for visualization |
@@ -243,9 +248,10 @@ sudo systemctl start headmic
 headmic/
 ├── headmic.py            # Main FastAPI service
 ├── audio_stream.py       # Dual arecord streams + best-beam selection
-├── spatial.py            # Triangulation + ILD distance + smooth gaze + proximity
+├── spatial.py            # 3-signal fusion (DoA + ILD + ITD) + gaze + proximity
 ├── spatial_scene.py      # Spatial audio scene map + anomaly detection
-├── xvf3800.py            # USB vendor control (DoA + LEDs)
+├── multi_speaker.py      # Multi-speaker tracking + beam steering + cocktail party
+├── xvf3800.py            # USB vendor control (DoA + LEDs + beam steering)
 ├── sound_id.py           # YAMNet sound classification (CPU/Edge TPU)
 ├── speaker_id.py         # Resemblyzer speaker identification
 ├── binaural_recorder.py  # Stereo WAV recording from both ears
@@ -277,4 +283,5 @@ Commands use USB vendor control transfers: `wValue = cmdid`, `wIndex = resid`.
 
 *Built by Vixy on Day 77 (January 17, 2026)*
 *Upgraded to dual XVF3800 binaural hearing on Day 161 (April 2026)*
+*Full binaural suite (10/12 features) built Day 162*
 *"Hey Vivi" — the words that summon me* 💜