Update docs — complete binaural roadmap (10/12 features)
BINAURAL_ROADMAP: full status update with implementation details, three-signal localization table, and key discoveries section.
README: updated features table (ITD, multi-speaker, cocktail party), new API endpoints (/speakers/tracked, /speakers/focus), file structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Binaural Hearing Roadmap

## What two mic arrays make possible

All features built on dual XVF3800 arrays, 175mm apart on the skull.

---
### Tier 1 — High impact ✅

**1. Triangulated sound localization + eye gaze** ✅
- Combine DoA angles from both arrays → compute the (x, y) position of the sound source
- Post gaze coordinates to the eye service → eyes track the speaker spatially
- Uses the `AUDIO_MGR_SELECTED_AZIMUTHS` auto-select beam (not the sluggish `DOA_VALUE`)
- VAD derived from the processed DoA flipping between NaN (silence) and non-NaN (speech)
- *Module:* `spatial.py`
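The two-ray intersection behind feature 1 fits in a few lines. A minimal sketch, with hypothetical names (`triangulate` is not the real `spatial.py` API), assuming each array reports a bearing in degrees with 0° straight ahead and positive angles to the right:

```python
import math

EAR_SPACING_M = 0.175  # distance between the two XVF3800 arrays

def triangulate(left_deg: float, right_deg: float, spacing: float = EAR_SPACING_M):
    """Intersect the two bearing rays to get an (x, y) source position in metres.

    Left array at (-spacing/2, 0), right array at (+spacing/2, 0), y pointing
    forward. Returns None for (near-)parallel rays or intersections behind the head.
    """
    xl, xr = -spacing / 2, spacing / 2
    # A 0° bearing points along +y, so each ray is x = x_array + y * tan(bearing)
    tl = math.tan(math.radians(left_deg))
    tr = math.tan(math.radians(right_deg))
    if abs(tl - tr) < 1e-6:
        return None  # rays parallel: source effectively at infinity
    y = (xr - xl) / (tl - tr)
    if y <= 0:
        return None  # intersection behind the head
    x = xl + y * tl
    return (x, y)
```

With symmetric bearings (left array sees +20°, right array sees -20°) the source lands on the centreline about 24cm ahead.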
**2. Active speaker tracking with smooth gaze** ✅
- Exponential smoothing (α=0.4) prevents jitter
- Idle drift back to center after 1.5s of silence
- Gaze pushed to eye service at ≤2Hz with 5px min delta
- *Module:* `spatial.py`, `headmic.py` (`doa_track_loop`)
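The smoothing and idle-drift behaviour can be sketched as a tiny filter class (illustrative names and units; the real `spatial.py` implementation may differ):

```python
from typing import Optional

class GazeSmoother:
    """α=0.4 exponential filter with drift back to centre after 1.5s of silence."""

    def __init__(self, alpha: float = 0.4, idle_after_s: float = 1.5):
        self.alpha = alpha
        self.idle_after_s = idle_after_s
        self.x = 0.0          # smoothed gaze position, 0 = centre
        self.silent_for = 0.0  # seconds since VAD last reported speech

    def update(self, target_x: Optional[float], dt: float) -> float:
        """Feed a new gaze target (None while VAD reports silence)."""
        if target_x is None:
            self.silent_for += dt
            if self.silent_for >= self.idle_after_s:
                # drift back toward centre during extended silence
                self.x += self.alpha * (0.0 - self.x)
        else:
            self.silent_for = 0.0
            # exponential smoothing: move a fraction α toward the target
            self.x += self.alpha * (target_x - self.x)
        return self.x
```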
**3. Left/right speaker awareness** ✅
- Resemblyzer speaker ID integrated, ready for enrollment
- Spatial position associated with recognized speaker
- `POST /speakers/enroll-from-mic?name=Alex` to enroll
- *Module:* `speaker_id.py`, `spatial.py`
---

### Tier 2 — High impact ✅
**4. Distance estimation (near/far)** ✅
- ILD (Interaural Level Difference): volume gap between ears → distance
- Fused with triangulated distance (70/30 weight)
- Proximity zones: intimate (<0.5m), conversational (0.5-2m), across_room (2-5m), far (>5m)
- *Module:* `spatial.py` (`_compute_ild`, `_ild_to_distance`, `_classify_proximity`)
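The ILD computation and proximity binning might look like this. A sketch with assumed names (the real `_compute_ild`/`_classify_proximity` may differ), and the 70/30 weighting order is an assumption:

```python
import math

def ild_db(left_rms: float, right_rms: float) -> float:
    """Interaural level difference in dB (positive = louder on the left)."""
    return 20.0 * math.log10(max(left_rms, 1e-9) / max(right_rms, 1e-9))

def fuse_distance(triangulated_m: float, ild_m: float) -> float:
    # 70/30 weighting as described above; triangulation assumed dominant
    return 0.7 * triangulated_m + 0.3 * ild_m

def classify_proximity(distance_m: float) -> str:
    """Bin a fused distance estimate into the proximity zones above."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 2.0:
        return "conversational"
    if distance_m < 5.0:
        return "across_room"
    return "far"
```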
**5. Multi-speaker separation + selective attention** ✅
- Tracks up to 2 speakers simultaneously by DoA angle (30°+ separation required)
- Locks XVF3800 fixed beams onto each speaker after 1s of stability
- Auto-releases to free-running mode after 3s of a single speaker
- Beam gating silences the non-speaking beam
- `GET /speakers/tracked` — positions, beam state, lock status
- *Module:* `multi_speaker.py`, `xvf3800.py` (beam steering commands)
**6. Spatial audio scene mapping** ✅
- Persistent map of where each sound category usually comes from (30° bins)
- Circular mean for usual direction, anomaly detection at 90°+ deviation
- Saves to `~/.vixy/scene_map.json` across restarts
- `GET /scene` — learned directions per category + anomalies
- `GET /scene/events` — recent what + where + when log
- `GET /scene/heatmap` — angular distribution for visualization
- *Module:* `spatial_scene.py`
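The circular-mean bookkeeping matters because bearings wrap: 350° and 10° should average to 0°, not 180°. A sketch with hypothetical helpers (not the actual `spatial_scene.py` code):

```python
import math

def circular_mean_deg(angles_deg) -> float:
    """Mean bearing computed on the unit circle, returned in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

def angular_deviation_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two bearings."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def is_anomaly(observed_deg: float, usual_deg: float, threshold: float = 90.0) -> bool:
    # 90°+ deviation from the learned direction counts as an anomaly
    return angular_deviation_deg(observed_deg, usual_deg) >= threshold
```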
---

### Tier 3 — Advanced ✅
**7. Cocktail party spatial filtering** ✅
- When 2 speakers are tracked, audio focus locks to the target speaker's side
- Non-target beam suppressed via XVF3800 beam gating
- Auto-switches target when the current speaker goes silent and the other starts talking
- Manual focus via `POST /speakers/focus?speaker=0|1`
- `DualAudioStream.focus_side` overrides energy-based beam selection
- *Module:* `multi_speaker.py`, `audio_stream.py` (`focus_side`)
**8. Sound event localization (what + where)** ✅
- YAMNet classification merged with triangulated position
- Every classified sound logged with angle, distance, proximity, side
- "Speech from 75° at conversational distance", not just "speech"
- *Module:* `spatial_scene.py` (`SoundEvent`, `observe`)
**9. Head orientation inference** ⏭️
- Needs fixed reference points from the scene map + a rotating head mount
- Math is trivial once the prerequisites exist
- *Prereqs:* #6, physical rotating mount
**10. Binaural recording for training data** ✅
- Records left/right ear streams as stereo WAV in 5-minute segments
- Opt-in via the `BINAURAL_RECORD=1` environment variable
- `GET /recording` — stats (segments, total seconds)
- *Module:* `binaural_recorder.py`
---

### Tier 4 — Research
**11. Learned spatial attention** ⏭️
- Train a model: (DoA, VAD, emotion, history) → beam steering + gaze
- Needs training data from #10 running for days/weeks
- *Prereqs:* #5, #6, #10 data collection
**12. ITD (Interaural Time Difference) processing** ✅
- Cross-correlates left/right ear processed audio (512 samples, ~32ms window)
- Finds the sub-millisecond delay → bearing angle via the speed of sound
- At 16kHz, 175mm: resolution ~62.5μs/sample ≈ 7° per sample
- Works with 2-channel firmware (no raw mics needed — correlates processed channels)
- Third independent angle estimate alongside DoA and ILD
- *Module:* `spatial.py` (`_compute_itd`)
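A brute-force version of the cross-correlation step, illustrative only (the real `_compute_itd` likely uses a vectorised correlation and the 512-sample window described above):

```python
import math

SAMPLE_RATE = 16_000
EAR_SPACING_M = 0.175
SPEED_OF_SOUND = 343.0  # m/s
# Largest physically possible inter-ear lag (~8-9 samples at 16kHz, 175mm)
MAX_LAG = int(EAR_SPACING_M / SPEED_OF_SOUND * SAMPLE_RATE) + 1

def itd_bearing_deg(left, right) -> float:
    """Bearing from the inter-ear lag: 0° straight ahead, positive toward the left ear."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-MAX_LAG, MAX_LAG + 1):
        # score how well `right` matches `left` delayed by `lag` samples
        score = sum(
            left[i - lag] * right[i]
            for i in range(max(0, lag), min(len(right), len(left) + lag))
        )
        if score > best_score:
            best_lag, best_score = lag, score
    delay_s = best_lag / SAMPLE_RATE
    # sin(theta) = delay * c / spacing, clamped in case noise pushes it past ±1
    s = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND / EAR_SPACING_M))
    return math.degrees(math.asin(s))
```

A one-sample lag corresponds to 62.5μs, which is roughly 7° near the centreline, matching the resolution figure above.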
---

## Implementation status
```
✅ #1 Triangulation + gaze
✅ #2 Smooth tracking
✅ #3 Speaker-side awareness
✅ #4 Distance estimation + proximity zones
✅ #5 Multi-speaker separation + beam steering
✅ #6 Spatial audio scene mapping + anomaly detection
✅ #7 Cocktail party spatial filtering
✅ #8 Sound event localization (what + where + when)
⏭️ #9 Head orientation inference (needs rotating mount)
✅ #10 Binaural recording (opt-in)
⏭️ #11 Learned spatial attention (needs training data)
✅ #12 ITD cross-correlation
```
10 of 12 features implemented in one session. The remaining two (#9, #11) need
physical hardware changes or training data that accumulates over time.
## Three-signal localization

The system now fuses three independent spatial estimates:

| Signal | Source | Strength | Best for |
|--------|--------|----------|----------|
| **DoA** | XVF3800 beamformer (auto-select beam) | Best overall | Horizontal angle |
| **ILD** | Volume difference between ears | Good for near sources | Distance + rough angle |
| **ITD** | Cross-correlation delay between ears | Good at low frequencies | Precise angle |

Combined, they give more robust localization than any single signal — the same
three cues human hearing uses.
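One plausible way to combine the three estimates is a weighted circular mean. The weights and function name below are assumptions, not the actual `spatial.py` fusion logic (which may gate on per-signal confidence instead):

```python
import math

# Assumed weights: DoA dominant, ITD next, ILD as a tiebreaker
DEFAULT_WEIGHTS = {"doa": 0.6, "itd": 0.25, "ild": 0.15}

def fuse_bearings(estimates: dict, weights: dict = None) -> float:
    """Weighted circular mean of named bearing estimates, in degrees [0, 360)."""
    w = weights or DEFAULT_WEIGHTS
    s = sum(w[k] * math.sin(math.radians(v)) for k, v in estimates.items())
    c = sum(w[k] * math.cos(math.radians(v)) for k, v in estimates.items())
    return math.degrees(math.atan2(s, c)) % 360.0
```

Averaging on the circle keeps the fusion sane when estimates straddle 0°/360°.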
## Key discoveries during development

1. `DOA_VALUE` (resid=20) is sluggish — use `AUDIO_MGR_SELECTED_AZIMUTHS` (resid=35)
2. Processed DoA (index 0) = NaN when no speech → natural VAD indicator
3. `AEC_SPENERGY_VALUES` is always zero on 2-channel firmware
4. USB read responses have a 1-byte status header before data
5. Exact `wLength` matters — `count * type_size + 1`, not rounded up
6. 6-channel firmware breaks LED/control commands — use 2-channel only
7. Gaze HTTP pushes at >2Hz cause GIL starvation in uvicorn — use threaded fire-and-forget
8. ITD works on processed channels (not just raw mics) — less precise but free
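Discoveries 4 and 5 imply a read pattern like the following. Helper names are hypothetical, `bRequest = 0` is an assumption (check `xvf3800.py` for the real value), and the transfer itself needs pyusb plus real hardware; the `wLength` and parsing logic stand alone:

```python
import struct

def read_wlength(count: int, type_size: int) -> int:
    # Discovery 5: request exactly count * type_size + 1 bytes; the +1 is
    # the status header, and rounding the length up breaks the transfer.
    return count * type_size + 1

def parse_response(raw: bytes, fmt: str):
    # Discovery 4: the first byte is a status code, the payload follows.
    status, payload = raw[0], raw[1:]
    if status != 0:
        raise IOError(f"XVF3800 returned status {status}")
    return struct.unpack(fmt, payload)

def read_floats(dev, resid: int, cmdid: int, count: int):
    """Vendor control read via pyusb (dev is a usb.core.Device).

    bmRequestType 0xC0 = device-to-host vendor request; wValue carries the
    cmdid and wIndex the resid, as the README describes.
    """
    raw = dev.ctrl_transfer(0xC0, 0, cmdid, resid, read_wlength(count, 4))
    return parse_response(bytes(raw), f"<{count}f")
```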
---

*Built during the Great Binaural Session of April 2026*

*"She hears in stereo now" 🦊👂👂*
## README.md

| Wake word detection | Porcupine | CPU | Needs Picovoice key |
| Sound classification | sound_id.py | Coral Edge TPU | 521 classes, ~2ms |
| Speaker identification | speaker_id.py | CPU (Resemblyzer) | Enrollment via API |
| Spatial tracking | spatial.py | USB control | 3-signal fusion: DoA + ILD + ITD |
| Distance estimation | spatial.py | audio energy | Proximity zones (intimate/conversational/across_room/far) |
| ITD processing | spatial.py | audio cross-correlation | Sub-ms delay → bearing angle |
| Multi-speaker tracking | multi_speaker.py | XVF3800 beam steering | 2 simultaneous speakers, auto beam lock |
| Cocktail party filtering | multi_speaker.py + audio_stream.py | beam gating + focus | Target speaker isolation |
| Spatial scene mapping | spatial_scene.py | — | Learns where sounds come from, anomaly detection |
| Sound event localization | spatial_scene.py | — | What + where + when log |
| Best-beam selection | audio_stream.py | 2× XVF3800 | Energy-based or focused attention |
| LED control | xvf3800.py | WS2812 rings | DoA/solid/breath |
| Binaural recording | binaural_recorder.py | 2× XVF3800 | Stereo WAV segments (opt-in) |
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/doa` | GET | DoA + triangulated position + ILD + ITD + gaze + distance + proximity |
| `/devices` | GET | XVF3800 connection status, serials, ALSA devices |
| `/speakers/tracked` | GET | Multi-speaker positions, beam mode, lock state, target |
| `/speakers/focus` | POST | Switch cocktail party attention (query: speaker=0\|1) |
| `/scene` | GET | Learned spatial scene (usual direction per category) + last anomaly |
| `/scene/events` | GET | Recent sound events with what + where + when (query: seconds, category) |
| `/scene/heatmap` | GET | Per-category angular distribution for visualization |
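A minimal stdlib client for the two new endpoints, assuming the service listens on `localhost:8080` (an assumption; adjust for the real deployment):

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed host/port

def focus_url(speaker: int) -> str:
    if speaker not in (0, 1):
        raise ValueError("speaker must be 0 or 1")
    return f"{BASE}/speakers/focus?speaker={speaker}"

def focus_speaker(speaker: int) -> dict:
    """POST /speakers/focus: switch cocktail-party attention to speaker 0 or 1."""
    req = urllib.request.Request(focus_url(speaker), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def tracked_speakers() -> dict:
    """GET /speakers/tracked: positions, beam mode, lock state, target."""
    with urllib.request.urlopen(f"{BASE}/speakers/tracked") as resp:
        return json.load(resp)
```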
headmic/
├── headmic.py            # Main FastAPI service
├── audio_stream.py       # Dual arecord streams + best-beam selection
├── spatial.py            # 3-signal fusion (DoA + ILD + ITD) + gaze + proximity
├── spatial_scene.py      # Spatial audio scene map + anomaly detection
├── multi_speaker.py      # Multi-speaker tracking + beam steering + cocktail party
├── xvf3800.py            # USB vendor control (DoA + LEDs + beam steering)
├── sound_id.py           # YAMNet sound classification (CPU/Edge TPU)
├── speaker_id.py         # Resemblyzer speaker identification
├── binaural_recorder.py  # Stereo WAV recording from both ears
Commands use USB vendor control transfers: `wValue = cmdid`, `wIndex = resid`.

*Built by Vixy on Day 77 (January 17, 2026)*
*Upgraded to dual XVF3800 binaural hearing on Day 161 (April 2026)*
*Full binaural suite (10/12 features) built Day 162*

*"Hey Vivi" — the words that summon me* 💜