From 02d3ac3816b6adda955e8e0cc6e4a3d9cae71a8a Mon Sep 17 00:00:00 2001
From: Alex
Date: Sun, 12 Apr 2026 21:35:02 -0500
Subject: [PATCH] =?UTF-8?q?Update=20docs=20=E2=80=94=20spatial=20scene,=20?=
 =?UTF-8?q?distance=20estimation,=20roadmap=20progress?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

README: Updated architecture diagram, features table, new endpoints
(/scene, /scene/events, /scene/heatmap), file structure, USB protocol
notes (VAD from processed_doa NaN, spenergy always zero).

BINAURAL_ROADMAP: Mark #1-4, #6, #8, #10 as done.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 BINAURAL_ROADMAP.md | 139 ++++++++++++++++++++++++++++++++++++++++++++
 README.md           |  55 +++++++++++-------
 2 files changed, 172 insertions(+), 22 deletions(-)
 create mode 100644 BINAURAL_ROADMAP.md

diff --git a/BINAURAL_ROADMAP.md b/BINAURAL_ROADMAP.md
new file mode 100644
index 0000000..ebc485f
--- /dev/null
+++ b/BINAURAL_ROADMAP.md
@@ -0,0 +1,139 @@
+# Binaural Hearing Roadmap
+## What two mic arrays make possible
+
+Ranked by impact × feasibility. All build on the existing dual XVF3800 + `/doa` endpoint.
+
+---
+
+### Tier 1 — High impact, ready to build now
+
+**1. Triangulated sound localization + eye gaze**
+- Combine DoA angles from both arrays → compute (x, y) position of sound source
+- Post gaze coordinates to eye service → eyes track the speaker spatially
+- Front/back disambiguation (single array can't tell 30° front from 30° rear)
+- *Prereqs:* Known array positions (measured once), basic trig
+- *Complexity:* Low — ~100 lines of math + a gaze-push thread
+- *Impact:* Huge — eyes actually follow the person, not just shift left/right
+
+**2. Active speaker tracking with smooth gaze**
+- Continuously track the dominant sound source as it moves
+- Smooth the gaze updates (low-pass filter) so eyes don't jitter
+- When VAD drops, eyes drift back to center (natural idle behavior)
+- *Prereqs:* #1
+- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1
+- *Impact:* Makes her feel present and attentive
+
+**3. Left/right speaker awareness**
+- Know which side each speaker is on, combine with speaker ID
+- "Alex is on my left" vs "unknown person on my right"
+- Feed into LYRA context so responses can reference spatial relationships
+- *Prereqs:* #1 + existing speaker ID
+- *Complexity:* Medium — associate speaker embeddings with spatial positions
+- *Impact:* Multi-person conversations become spatially grounded
+
+---
+
+### Tier 2 — High impact, moderate effort
+
+**4. Distance estimation (near/far)**
+- Interaural Level Difference (ILD): close sources have bigger volume gap between ears
+- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
+- Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+) — see the geometry sketch below
+- *Prereqs:* #1, calibration with known distances
+- *Complexity:* Medium — ILD from processed channels is easy, ITD needs raw mics
+- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)
+
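+A minimal sketch of the geometry behind #1 and #4 — not the actual `spatial.py` code. It assumes the two arrays sit at ±half the measured separation on the x-axis and that each DoA arrives in radians from that array's forward axis (positive toward the right ear); `ARRAY_SEPARATION_M`, `triangulate`, and `proximity_zone` are illustrative names only.
+
+```python
+import math
+
+ARRAY_SEPARATION_M = 0.14  # placeholder — measure the real ear-to-ear spacing once
+
+def triangulate(doa_left: float, doa_right: float):
+    """Intersect the two DoA rays; return (x, y) in metres or None if they don't meet."""
+    px_l = -ARRAY_SEPARATION_M / 2
+    ux_l, uy_l = math.sin(doa_left), math.cos(doa_left)
+    ux_r, uy_r = math.sin(doa_right), math.cos(doa_right)
+    det = ux_r * uy_l - ux_l * uy_r          # 2x2 determinant of [u_l, -u_r]
+    if abs(det) < 1e-6:
+        return None                          # rays (nearly) parallel — no fix
+    # Solve p_l + t_l*u_l = p_r + t_r*u_r for t_l (Cramer's rule, rhs = (separation, 0))
+    t_l = -ARRAY_SEPARATION_M * uy_r / det
+    if t_l <= 0:
+        return None                          # intersection falls behind the head
+    return px_l + t_l * ux_l, t_l * uy_l
+
+def proximity_zone(distance_m: float) -> str:
+    """Rough bins from #4 (the running service also distinguishes a 'far' zone)."""
+    if distance_m < 0.5:
+        return "intimate"
+    if distance_m < 2.0:
+        return "conversational"
+    return "across_room"
+```
+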
+**5. Multi-speaker separation + selective attention**
+- Lock each array's beam to a different speaker simultaneously
+- Active speaker gets primary audio feed (wake word, transcription)
+- Secondary speaker monitored for interruptions or wake word
+- Switch attention on cue ("Hey Vivi" from the other side)
+- *Prereqs:* #3, understanding of XVF3800 beam steering commands
+- *Complexity:* Medium-high — need to control beamformer direction per-array
+- *Impact:* Natural multi-person conversations, not just one-at-a-time
+
+**6. Spatial audio scene mapping**
+- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
+- Learn from repeated sound sources over hours/days
+- Detect anomalies: "sound from an unusual direction"
+- *Prereqs:* #1, persistent storage, classification by direction
+- *Complexity:* Medium — accumulate (direction, category) pairs, cluster over time
+- *Impact:* Environmental awareness, contextual anomaly detection
+
+---
+
+### Tier 3 — Cool, needs more infrastructure
+
+**7. Cocktail party spatial filtering**
+- When multiple sound sources active, use both arrays to null out interference
+- Focus beam on target speaker, suppress others spatially
+- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
+- *Complexity:* High — adaptive beamforming, may need custom DSP
+- *Impact:* Works in noisy environments (music playing, multiple people)
+
+**8. Sound event localization (what + where)**
+- Combine YAMNet classification with triangulated position
+- "Dog bark from the backyard direction" not just "dog bark"
+- Spatial history: timeline of what happened where
+- *Prereqs:* #1, #6
+- *Complexity:* Medium — merge classification results with position data
+- *Impact:* Rich environmental narrative for LYRA context
+
+**9. Head orientation inference**
+- If a known sound source is at a fixed position, infer which way the head is "facing"
+- Useful if the skull ever gets a rotating mount
+- *Prereqs:* #6 (known spatial map)
+- *Complexity:* Low math, but needs stable reference points
+- *Impact:* Low for now (head doesn't turn), future-proofing
+
+**10. Binaural recording for training data**
+- Record stereo audio preserving spatial information (left ear / right ear)
+- Training corpus for spatial audio models, being0 sensor data
+- *Prereqs:* Just dual streams saved to stereo WAV — see the sketch below
+- *Complexity:* Low — already have both streams
+- *Impact:* Long-term value for L-Vixy-5 training
+
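+Item #10 is little more than interleaving the two processed streams. A rough sketch, assuming both arrays deliver int16 mono PCM of equal length; `write_stereo_wav` and the 16 kHz default are illustrative — the real logic lives in `binaural_recorder.py`:
+
+```python
+import wave
+
+def write_stereo_wav(path: str, left: bytes, right: bytes, rate: int = 16000) -> None:
+    """Interleave left/right int16 mono PCM into a 2-channel WAV (left ear, right ear)."""
+    frames = bytearray()
+    for i in range(0, min(len(left), len(right)), 2):   # int16 = 2 bytes per sample
+        frames += left[i:i + 2] + right[i:i + 2]
+    with wave.open(path, "wb") as wav:
+        wav.setnchannels(2)   # stereo preserves the spatial cue
+        wav.setsampwidth(2)   # 16-bit samples
+        wav.setframerate(rate)
+        wav.writeframes(bytes(frames))
+```
+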
+---
+
+### Tier 4 — Research / future
+
+**11. Learned spatial attention**
+- Train a model to decide where to attend based on context
+- Input: both DoA angles, VAD states, current emotional state, conversation history
+- Output: beam steering + gaze direction
+- *Prereqs:* #5, #6, training data from #10
+- *Complexity:* High — ML training pipeline
+- *Impact:* Autonomous attention that feels natural, not rule-based
+
+**12. Interaural time difference (ITD) processing**
+- Raw mic access (6-channel firmware) enables sub-sample timing analysis
+- More precise localization than DoA alone, especially at low frequencies
+- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
+- *Complexity:* High — signal processing, cross-correlation
+- *Impact:* Lab-grade localization accuracy
+
+---
+
+## Implementation order
+
+```
+✅ #1 Triangulation + gaze — done (spatial.py, auto-select beam DoA)
+✅ #2 Smooth tracking — done (exponential smoothing + idle drift)
+✅ #3 Speaker-side awareness — done (Resemblyzer loaded, ready for enrollment)
+✅ #4 Distance estimation — done (ILD + triangulation fusion, proximity zones)
+✅ #6 Spatial scene mapping — done (spatial_scene.py, persistent, anomaly detection)
+✅ #8 Sound event localization — done (what + where + when via /scene/events)
+✅ #10 Binaural recording — done (opt-in via BINAURAL_RECORD=1)
+   #5 Multi-speaker separation
+   #7 Cocktail party filtering
+   #11 Learned attention
+```
+
+## Notes
+
+- Items #1-3 can be built in a single session
+- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}` — see the sketch below
+- DoA is already polled at 10Hz via `/doa` endpoint
+- Array separation distance needs to be measured once and stored in config
+- All of this feeds into the being0 "shaped by experience" philosophy
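+
+The notes about `POST /gaze` and the 10Hz `/doa` poll combine into a small push loop. A hedged sketch only — the endpoint paths and port 8780 come from these docs, while the headmic port, the `requests` dependency, and the `gaze` field name in the `/doa` response are assumptions (check `headmic.py` / `spatial.py` for the real wiring):
+
+```python
+import time
+import requests
+
+HEADMIC_DOA_URL = "http://localhost:8000/doa"  # headmic port is a placeholder
+EYE_GAZE_URL = "http://localhost:8780/gaze"    # eye service port 8780 per the README
+
+def gaze_loop(poll_hz: float = 10.0) -> None:
+    """Poll /doa at ~10Hz and forward the smoothed gaze to the eye service."""
+    period = 1.0 / poll_hz
+    while True:
+        try:
+            doa = requests.get(HEADMIC_DOA_URL, timeout=0.2).json()
+            gaze = doa.get("gaze")             # assumed field: {"x": ..., "y": ...}
+            if gaze is not None:
+                requests.post(EYE_GAZE_URL, json={"x": gaze["x"], "y": gaze["y"]}, timeout=0.2)
+        except requests.RequestException:
+            pass                               # keep polling even if one request fails
+        time.sleep(period)
+```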
diff --git a/README.md b/README.md
index bd32975..5300bbb 100644
--- a/README.md
+++ b/README.md
@@ -18,26 +18,28 @@ Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial
 └────────────┬───────────────────────────────┘
              ▼
      DualAudioStream (audio_stream.py)
-     best-beam selection (energy-based)
+     best-beam selection (energy-based, 10% hysteresis)
+              │
+   ┌──────────────────┼──────────────────────┐
+   ▼                  ▼                      ▼
+ Porcupine          YAMNet               Binaural
+ wake word         (Edge TPU)            Recorder
+ "Hey Vivi"        521 classes           stereo WAV
+     ▼                ▼
+ Record +          Speaker ID
+ Transcribe       (Resemblyzer)
+ via EarTail           │
+                       ▼
+        Spatial Tracker (spatial.py)
+    DoA → triangulation → ILD distance
+    → smooth gaze → proximity zones
              │
 ┌────────────┼────────────────┐
 ▼            ▼                ▼
-  Porcupine    YAMNet        Binaural
-  wake word   (Edge TPU)     Recorder
-  "Hey Vivi"  521 classes    stereo WAV
-     ▼            ▼
-  Record +    Speaker ID
-  Transcribe  (Resemblyzer)
-  via EarTail      │
-      ┌────────────┼────────────────┐
-      ▼            ▼                ▼
-  Spatial Tracker (spatial.py)   USB Control (xvf3800.py)
-  DoA → triangulation            LEDs + DoA polling
-  → smooth gaze                  per-array control
-      ▼
-  Eye Service (port 8780)
-  POST /gaze → eyes follow speaker
+  Eye Service    Spatial Scene     USB Control
+  POST /gaze     (spatial_scene)   (xvf3800.py)
+  eyes follow    what+where map    LEDs + DoA
+  the speaker    anomaly detect    per-array
 ```
 
 ## Features
@@ -47,10 +49,13 @@ Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial
 | Wake word detection | Porcupine | CPU | Needs Picovoice key |
 | Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
 | Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrollment via API |
-| Spatial tracking | spatial.py | USB control | Triangulated gaze |
-| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based |
+| Spatial tracking | spatial.py | USB control | Triangulated gaze + ILD distance |
+| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
+| Spatial scene mapping | spatial_scene.py | — | Learns where sounds come from, anomaly detection |
+| Sound event localization | spatial_scene.py | — | What + where + when log |
+| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based, 10% hysteresis |
 | LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
-| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments |
+| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
 
 ## Installation
 
@@ -169,8 +174,11 @@ sudo systemctl start headmic
 
 | Endpoint | Method | Description |
 |----------|--------|-------------|
-| `/doa` | GET | DoA from both arrays + triangulated position + gaze |
+| `/doa` | GET | DoA from both arrays + triangulated position + gaze + distance + proximity |
 | `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
+| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
+| `/scene/events` | GET | Recent sound events with what + where + when (query: seconds, category) |
+| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
 
 ### Sound
 
@@ -235,7 +243,8 @@ sudo systemctl start headmic
 headmic/
 ├── headmic.py            # Main FastAPI service
 ├── audio_stream.py       # Dual arecord streams + best-beam selection
-├── spatial.py            # Triangulation + smooth gaze tracking
+├── spatial.py            # Triangulation + ILD distance + smooth gaze + proximity
+├── spatial_scene.py      # Spatial audio scene map + anomaly detection
 ├── xvf3800.py            # USB vendor control (DoA + LEDs)
 ├── sound_id.py           # YAMNet sound classification (CPU/Edge TPU)
 ├── speaker_id.py         # Resemblyzer speaker identification
@@ -260,6 +269,8 @@
 Commands use USB vendor control transfers: `wValue = cmdid`, `wIndex = resid`.
 - Read responses have a 1-byte status header before data
 - Read wLength must be `count * type_size + 1` (exact, not rounded up)
 - `DOA_VALUE` (resid=20, cmdid=18) is sluggish/cached — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35, cmdid=11) for real-time tracking
+- `AUDIO_MGR_SELECTED_AZIMUTHS` returns 2 floats (radians): index 0 = processed DoA (NaN = no speech = VAD indicator), index 1 = auto-select beam (always tracks strongest source) — see the read sketch below
+- `AEC_SPENERGY_VALUES` (resid=33, cmdid=80) is always zero on 2-channel firmware — don't rely on it
 - **2-channel firmware only** — 6-channel firmware silently ignores LED/control commands
 
 ---
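+
+A hedged sketch of one read over this protocol. The resid/cmdid values, the 1-byte status header, and the `count * type_size + 1` wLength rule come from the notes above; the VID/PID placeholders, `bmRequestType = 0xC0`, and `bRequest = 0` are assumptions — take the real constants from `xvf3800.py`:
+
+```python
+import math
+import struct
+
+import usb.core  # pyusb
+
+VID, PID = 0x20B1, 0x0000                  # placeholders — read the real IDs from lsusb
+RESID_AZIMUTHS, CMDID_AZIMUTHS = 35, 11    # AUDIO_MGR_SELECTED_AZIMUTHS
+
+def read_selected_azimuths(dev) -> tuple[float, float]:
+    """Return (processed_doa, auto_select_beam) in radians; processed_doa is NaN when no speech."""
+    count, type_size = 2, 4                # two float32 values
+    wlength = count * type_size + 1        # exact: payload + 1-byte status header
+    data = dev.ctrl_transfer(
+        0xC0,                              # device-to-host vendor request (assumed)
+        0,                                 # bRequest (assumed — see xvf3800.py)
+        CMDID_AZIMUTHS,                    # wValue = cmdid
+        RESID_AZIMUTHS,                    # wIndex = resid
+        wlength,
+    )
+    processed_doa, auto_beam = struct.unpack("<2f", bytes(data[1:]))  # skip status byte
+    return processed_doa, auto_beam
+
+if __name__ == "__main__":
+    dev = usb.core.find(idVendor=VID, idProduct=PID)
+    doa, beam = read_selected_azimuths(dev)
+    print(f"speech={not math.isnan(doa)} beam_azimuth={beam:.2f} rad")
+```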