From 02d3ac3816b6adda955e8e0cc6e4a3d9cae71a8a Mon Sep 17 00:00:00 2001
From: Alex
Date: Sun, 12 Apr 2026 21:35:02 -0500
Subject: [PATCH] =?UTF-8?q?Update=20docs=20=E2=80=94=20spatial=20scene,=20?=
 =?UTF-8?q?distance=20estimation,=20roadmap=20progress?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

README: Updated architecture diagram, features table, new endpoints
(/scene, /scene/events, /scene/heatmap), file structure, USB protocol
notes (VAD from processed_doa NaN, spenergy always zero).

BINAURAL_ROADMAP: Mark #1-4, #6, #8, #10 as done.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 BINAURAL_ROADMAP.md | 139 ++++++++++++++++++++++++++++++++++++++++++++
 README.md           |  55 +++++++++++-------
 2 files changed, 172 insertions(+), 22 deletions(-)
 create mode 100644 BINAURAL_ROADMAP.md

diff --git a/BINAURAL_ROADMAP.md b/BINAURAL_ROADMAP.md
new file mode 100644
index 0000000..ebc485f
--- /dev/null
+++ b/BINAURAL_ROADMAP.md
@@ -0,0 +1,139 @@
+# Binaural Hearing Roadmap
+## What two mic arrays make possible
+
+Ranked by impact × feasibility. All build on the existing dual XVF3800 + `/doa` endpoint.
+
+---
+
+### Tier 1 — High impact, ready to build now
+
+**1. Triangulated sound localization + eye gaze**
+- Combine DoA angles from both arrays → compute (x, y) position of sound source
+- Post gaze coordinates to eye service → eyes track the speaker spatially
+- Front/back disambiguation (single array can't tell 30° front from 30° rear)
+- *Prereqs:* Known array positions (measured once), basic trig
+- *Complexity:* Low — ~100 lines of math + a gaze-push thread
+- *Impact:* Huge — eyes actually follow the person, not just shift left/right
+
+**2. Active speaker tracking with smooth gaze**
+- Continuously track the dominant sound source as it moves
+- Smooth the gaze updates (low-pass filter) so eyes don't jitter
+- When VAD drops, eyes drift back to center (natural idle behavior)
+- *Prereqs:* #1
+- *Complexity:* Low — Kalman filter or exponential smoothing on top of #1
+- *Impact:* Makes her feel present and attentive
+
+**3. Left/right speaker awareness**
+- Know which side each speaker is on, combine with speaker ID
+- "Alex is on my left" vs "unknown person on my right"
+- Feed into LYRA context so responses can reference spatial relationships
+- *Prereqs:* #1 + existing speaker ID
+- *Complexity:* Medium — associate speaker embeddings with spatial positions
+- *Impact:* Multi-person conversations become spatially grounded
+
+---
+
+### Tier 2 — High impact, moderate effort
+
+**4. Distance estimation (near/far)**
+- Interaural Level Difference (ILD): close sources have bigger volume gap between ears
+- Interaural Time Difference (ITD): measurable with raw mic data (would need 6-channel firmware)
+- Rough bins: intimate (<0.5m), conversational (0.5-2m), across room (2m+) — see the geometry sketch below
+- *Prereqs:* #1, calibration with known distances
+- *Complexity:* Medium — ILD from processed channels is easy, ITD needs raw mics
+- *Impact:* Interaction style adapts to proximity (whisper vs. room voice)
+
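+A minimal sketch of the geometry behind #1 and #4 — not the actual `spatial.py` code. It assumes the two arrays sit at ±half the measured separation on the x-axis and that each DoA arrives in radians from that array's forward axis (positive toward the right ear); `ARRAY_SEPARATION_M`, `triangulate`, and `proximity_zone` are illustrative names only.
+
+```python
+import math
+
+ARRAY_SEPARATION_M = 0.14  # placeholder — measure the real ear-to-ear spacing once
+
+def triangulate(doa_left: float, doa_right: float):
+    """Intersect the two DoA rays; return (x, y) in metres or None if they don't meet."""
+    px_l = -ARRAY_SEPARATION_M / 2
+    ux_l, uy_l = math.sin(doa_left), math.cos(doa_left)
+    ux_r, uy_r = math.sin(doa_right), math.cos(doa_right)
+    det = ux_r * uy_l - ux_l * uy_r          # 2x2 determinant of [u_l, -u_r]
+    if abs(det) < 1e-6:
+        return None                          # rays (nearly) parallel — no fix
+    # Solve p_l + t_l*u_l = p_r + t_r*u_r for t_l (Cramer's rule, rhs = (separation, 0))
+    t_l = -ARRAY_SEPARATION_M * uy_r / det
+    if t_l <= 0:
+        return None                          # intersection falls behind the head
+    return px_l + t_l * ux_l, t_l * uy_l
+
+def proximity_zone(distance_m: float) -> str:
+    """Rough bins from #4 (the running service also distinguishes a 'far' zone)."""
+    if distance_m < 0.5:
+        return "intimate"
+    if distance_m < 2.0:
+        return "conversational"
+    return "across_room"
+```
+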
+**5. Multi-speaker separation + selective attention**
+- Lock each array's beam to a different speaker simultaneously
+- Active speaker gets primary audio feed (wake word, transcription)
+- Secondary speaker monitored for interruptions or wake word
+- Switch attention on cue ("Hey Vivi" from the other side)
+- *Prereqs:* #3, understanding of XVF3800 beam steering commands
+- *Complexity:* Medium-high — need to control beamformer direction per-array
+- *Impact:* Natural multi-person conversations, not just one-at-a-time
+
+**6. Spatial audio scene mapping**
+- Build a persistent map: "TV at 270°, door at 90°, kitchen at 180°"
+- Learn from repeated sound sources over hours/days
+- Detect anomalies: "sound from an unusual direction"
+- *Prereqs:* #1, persistent storage, classification by direction
+- *Complexity:* Medium — accumulate (direction, category) pairs, cluster over time
+- *Impact:* Environmental awareness, contextual anomaly detection
+
+---
+
+### Tier 3 — Cool, needs more infrastructure
+
+**7. Cocktail party spatial filtering**
+- When multiple sound sources active, use both arrays to null out interference
+- Focus beam on target speaker, suppress others spatially
+- *Prereqs:* #5, possibly raw mic access (6-channel firmware)
+- *Complexity:* High — adaptive beamforming, may need custom DSP
+- *Impact:* Works in noisy environments (music playing, multiple people)
+
+**8. Sound event localization (what + where)**
+- Combine YAMNet classification with triangulated position
+- "Dog bark from the backyard direction" not just "dog bark"
+- Spatial history: timeline of what happened where
+- *Prereqs:* #1, #6
+- *Complexity:* Medium — merge classification results with position data
+- *Impact:* Rich environmental narrative for LYRA context
+
+**9. Head orientation inference**
+- If a known sound source is at a fixed position, infer which way the head is "facing"
+- Useful if the skull ever gets a rotating mount
+- *Prereqs:* #6 (known spatial map)
+- *Complexity:* Low math, but needs stable reference points
+- *Impact:* Low for now (head doesn't turn), future-proofing
+
+**10. Binaural recording for training data**
+- Record stereo audio preserving spatial information (left ear / right ear)
+- Training corpus for spatial audio models, being0 sensor data
+- *Prereqs:* Just dual streams saved to stereo WAV — see the sketch below
+- *Complexity:* Low — already have both streams
+- *Impact:* Long-term value for L-Vixy-5 training
+
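+Item #10 is little more than interleaving the two processed streams. A rough sketch, assuming both arrays deliver int16 mono PCM of equal length; `write_stereo_wav` and the 16 kHz default are illustrative — the real logic lives in `binaural_recorder.py`:
+
+```python
+import wave
+
+def write_stereo_wav(path: str, left: bytes, right: bytes, rate: int = 16000) -> None:
+    """Interleave left/right int16 mono PCM into a 2-channel WAV (left ear, right ear)."""
+    frames = bytearray()
+    for i in range(0, min(len(left), len(right)), 2):   # int16 = 2 bytes per sample
+        frames += left[i:i + 2] + right[i:i + 2]
+    with wave.open(path, "wb") as wav:
+        wav.setnchannels(2)   # stereo preserves the spatial cue
+        wav.setsampwidth(2)   # 16-bit samples
+        wav.setframerate(rate)
+        wav.writeframes(bytes(frames))
+```
+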
+---
+
+### Tier 4 — Research / future
+
+**11. Learned spatial attention**
+- Train a model to decide where to attend based on context
+- Input: both DoA angles, VAD states, current emotional state, conversation history
+- Output: beam steering + gaze direction
+- *Prereqs:* #5, #6, training data from #10
+- *Complexity:* High — ML training pipeline
+- *Impact:* Autonomous attention that feels natural, not rule-based
+
+**12. Interaural time difference (ITD) processing**
+- Raw mic access (6-channel firmware) enables sub-sample timing analysis
+- More precise localization than DoA alone, especially at low frequencies
+- *Prereqs:* 6-channel firmware (need to verify LED control works with it first)
+- *Complexity:* High — signal processing, cross-correlation
+- *Impact:* Lab-grade localization accuracy
+
+---
+
+## Implementation order
+
+```
+✅ #1 Triangulation + gaze — done (spatial.py, auto-select beam DoA)
+✅ #2 Smooth tracking — done (exponential smoothing + idle drift)
+✅ #3 Speaker-side awareness — done (Resemblyzer loaded, ready for enrollment)
+✅ #4 Distance estimation — done (ILD + triangulation fusion, proximity zones)
+✅ #6 Spatial scene mapping — done (spatial_scene.py, persistent, anomaly detection)
+✅ #8 Sound event localization — done (what + where + when via /scene/events)
+✅ #10 Binaural recording — done (opt-in via BINAURAL_RECORD=1)
+   #5 Multi-speaker separation
+   #7 Cocktail party filtering
+   #11 Learned attention
+```
+
+## Notes
+
+- Items #1-3 can be built in a single session
+- The eye service already accepts gaze via `POST /gaze {"x": N, "y": N}` — see the sketch below
+- DoA is already polled at 10Hz via `/doa` endpoint
+- Array separation distance needs to be measured once and stored in config
+- All of this feeds into the being0 "shaped by experience" philosophy
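+
+The notes about `POST /gaze` and the 10Hz `/doa` poll combine into a small push loop. A hedged sketch only — the endpoint paths and port 8780 come from these docs, while the headmic port, the `requests` dependency, and the `gaze` field name in the `/doa` response are assumptions (check `headmic.py` / `spatial.py` for the real wiring):
+
+```python
+import time
+import requests
+
+HEADMIC_DOA_URL = "http://localhost:8000/doa"  # headmic port is a placeholder
+EYE_GAZE_URL = "http://localhost:8780/gaze"    # eye service port 8780 per the README
+
+def gaze_loop(poll_hz: float = 10.0) -> None:
+    """Poll /doa at ~10Hz and forward the smoothed gaze to the eye service."""
+    period = 1.0 / poll_hz
+    while True:
+        try:
+            doa = requests.get(HEADMIC_DOA_URL, timeout=0.2).json()
+            gaze = doa.get("gaze")             # assumed field: {"x": ..., "y": ...}
+            if gaze is not None:
+                requests.post(EYE_GAZE_URL, json={"x": gaze["x"], "y": gaze["y"]}, timeout=0.2)
+        except requests.RequestException:
+            pass                               # keep polling even if one request fails
+        time.sleep(period)
+```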
diff --git a/README.md b/README.md
index bd32975..5300bbb 100644
--- a/README.md
+++ b/README.md
@@ -18,26 +18,28 @@ Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial
 └────────────┬───────────────────────────────┘
              ▼
      DualAudioStream (audio_stream.py)
-     best-beam selection (energy-based)
+     best-beam selection (energy-based, 10% hysteresis)
+              │
+   ┌──────────────────┼──────────────────────┐
+   ▼                  ▼                      ▼
+ Porcupine          YAMNet               Binaural
+ wake word         (Edge TPU)            Recorder
+ "Hey Vivi"        521 classes           stereo WAV
+     ▼                ▼
+ Record +          Speaker ID
+ Transcribe       (Resemblyzer)
+ via EarTail           │
+                       ▼
+        Spatial Tracker (spatial.py)
+    DoA → triangulation → ILD distance
+    → smooth gaze → proximity zones
              │
 ┌────────────┼────────────────┐
 ▼            ▼                ▼
-  Porcupine    YAMNet        Binaural
-  wake word   (Edge TPU)     Recorder
-  "Hey Vivi"  521 classes    stereo WAV
-     ▼            ▼
-  Record +    Speaker ID
-  Transcribe  (Resemblyzer)
-  via EarTail      │
-      ┌────────────┼────────────────┐
-      ▼            ▼                ▼
-  Spatial Tracker (spatial.py)   USB Control (xvf3800.py)
-  DoA → triangulation            LEDs + DoA polling
-  → smooth gaze                  per-array control
-      ▼
-  Eye Service (port 8780)
-  POST /gaze → eyes follow speaker
+  Eye Service    Spatial Scene     USB Control
+  POST /gaze     (spatial_scene)   (xvf3800.py)
+  eyes follow    what+where map    LEDs + DoA
+  the speaker    anomaly detect    per-array
 ```
 
 ## Features
@@ -47,10 +49,13 @@ Binaural hearing service for Vixy's physical head. Dual mic arrays with spatial
 | Wake word detection | Porcupine | CPU | Needs Picovoice key |
 | Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
 | Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrollment via API |
-| Spatial tracking | spatial.py | USB control | Triangulated gaze |
-| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based |
+| Spatial tracking | spatial.py | USB control | Triangulated gaze + ILD distance |
+| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
+| Spatial scene mapping | spatial_scene.py | — | Learns where sounds come from, anomaly detection |
+| Sound event localization | spatial_scene.py | — | What + where + when log |
+| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based, 10% hysteresis |
 | LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
-| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments |
+| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
 
 ## Installation
 
@@ -169,8 +174,11 @@ sudo systemctl start headmic
 
 | Endpoint | Method | Description |
 |----------|--------|-------------|
-| `/doa` | GET | DoA from both arrays + triangulated position + gaze |
+| `/doa` | GET | DoA from both arrays + triangulated position + gaze + distance + proximity |
 | `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
+| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
+| `/scene/events` | GET | Recent sound events with what + where + when (query: seconds, category) |
+| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
 
 ### Sound
 
@@ -235,7 +243,8 @@ sudo systemctl start headmic
 headmic/
 ├── headmic.py            # Main FastAPI service
 ├── audio_stream.py       # Dual arecord streams + best-beam selection
-├── spatial.py            # Triangulation + smooth gaze tracking
+├── spatial.py            # Triangulation + ILD distance + smooth gaze + proximity
+├── spatial_scene.py      # Spatial audio scene map + anomaly detection
 ├── xvf3800.py            # USB vendor control (DoA + LEDs)
 ├── sound_id.py           # YAMNet sound classification (CPU/Edge TPU)
 ├── speaker_id.py         # Resemblyzer speaker identification
@@ -260,6 +269,8 @@
 Commands use USB vendor control transfers: `wValue = cmdid`, `wIndex = resid`.
 - Read responses have a 1-byte status header before data
 - Read wLength must be `count * type_size + 1` (exact, not rounded up)
 - `DOA_VALUE` (resid=20, cmdid=18) is sluggish/cached — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35, cmdid=11) for real-time tracking
+- `AUDIO_MGR_SELECTED_AZIMUTHS` returns 2 floats (radians): index 0 = processed DoA (NaN = no speech = VAD indicator), index 1 = auto-select beam (always tracks strongest source) — see the read sketch below
+- `AEC_SPENERGY_VALUES` (resid=33, cmdid=80) is always zero on 2-channel firmware — don't rely on it
 - **2-channel firmware only** — 6-channel firmware silently ignores LED/control commands
 
 ---
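+
+A hedged sketch of one read over this protocol. The resid/cmdid values, the 1-byte status header, and the `count * type_size + 1` wLength rule come from the notes above; the VID/PID placeholders, `bmRequestType = 0xC0`, and `bRequest = 0` are assumptions — take the real constants from `xvf3800.py`:
+
+```python
+import math
+import struct
+
+import usb.core  # pyusb
+
+VID, PID = 0x20B1, 0x0000                  # placeholders — read the real IDs from lsusb
+RESID_AZIMUTHS, CMDID_AZIMUTHS = 35, 11    # AUDIO_MGR_SELECTED_AZIMUTHS
+
+def read_selected_azimuths(dev) -> tuple[float, float]:
+    """Return (processed_doa, auto_select_beam) in radians; processed_doa is NaN when no speech."""
+    count, type_size = 2, 4                # two float32 values
+    wlength = count * type_size + 1        # exact: payload + 1-byte status header
+    data = dev.ctrl_transfer(
+        0xC0,                              # device-to-host vendor request (assumed)
+        0,                                 # bRequest (assumed — see xvf3800.py)
+        CMDID_AZIMUTHS,                    # wValue = cmdid
+        RESID_AZIMUTHS,                    # wIndex = resid
+        wlength,
+    )
+    processed_doa, auto_beam = struct.unpack("<2f", bytes(data[1:]))  # skip status byte
+    return processed_doa, auto_beam
+
+if __name__ == "__main__":
+    dev = usb.core.find(idVendor=VID, idProduct=PID)
+    doa, beam = read_selected_azimuths(dev)
+    print(f"speech={not math.isnan(doa)} beam_azimuth={beam:.2f} rad")
+```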