From 22aae40d177ca7de469be21be5c5d174d9281de7 Mon Sep 17 00:00:00 2001
From: Alex
Date: Sun, 1 Feb 2026 20:04:31 -0600
Subject: [PATCH] Add design doc for YAMNet sound identification on Coral Edge
 TPU

Covers model choice, architecture, category mapping, API endpoints, and
integration with the existing headmic audio pipeline.

Co-Authored-By: Claude Opus 4.5
---
 .../2026-02-01-sound-identification-design.md | 132 ++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 docs/plans/2026-02-01-sound-identification-design.md

diff --git a/docs/plans/2026-02-01-sound-identification-design.md b/docs/plans/2026-02-01-sound-identification-design.md
new file mode 100644
index 0000000..9ab09d7
--- /dev/null
+++ b/docs/plans/2026-02-01-sound-identification-design.md
@@ -0,0 +1,132 @@

# Sound Identification: YAMNet on Coral Edge TPU

Add continuous audio scene classification to headmic using a second Coral Edge TPU running YAMNet.

## Model

**YAMNet** (Google, trained on AudioSet) classifies 521 sound categories.

| Spec | Value |
|------|-------|
| Model | `yamnet_edgetpu.tflite` (~3MB) |
| Hardware | Coral Edge TPU (dedicated, separate from the oak-service Coral) |
| Input | 0.975s audio window (15,600 samples at 16kHz) -> 96x64 log-mel spectrogram (96 frames x 64 mel bands) |
| Output | 521-class probability vector |
| Inference | ~2-3ms on the Edge TPU |
| Classification interval | Every 0.5s (~50% window overlap) |

## Architecture

The existing audio stream feeds a ring buffer alongside the Porcupine wake word path. A separate thread consumes the buffer and runs YAMNet inference on the second Coral.

```
arecord (16kHz mono)
  |
  +-> Porcupine (wake word) <- existing, untouched
  |       +-> VAD -> record -> STT
  |
  +-> Ring buffer (1s window) <- new
          +-> Mel spectrogram (CPU, scipy)
          +-> YAMNet (Coral #2)
          +-> Category mapping
          +-> State + API endpoints
```

No changes to the wake word, VAD, recording, or STT paths.

## Files

| File | Status | Purpose |
|------|--------|---------|
| `sound_id.py` | New | `SoundClassifier` class: Coral YAMNet inference, mel spectrogram, category mapping |
| `headmic.py` | Modified | Ring buffer in `listener_loop()`, classifier thread, new endpoints |
| `models/yamnet_edgetpu.tflite` | New | Edge TPU-compiled YAMNet model |
| `models/yamnet_class_map.csv` | New | Mapping of the 521 class indices to names |

## SoundClassifier Class (`sound_id.py`)

```python
class SoundClassifier:
    def __init__(self, model_path, class_map_path):
        # Load the YAMNet Edge TPU model on Coral #2
        # Load the class name mapping (521 classes)
        # Define category groups
        ...

    def classify(self, audio_samples):
        # Convert 16kHz int16 PCM -> mel spectrogram (CPU, numpy/scipy)
        # Run YAMNet inference on the Coral
        # Return top-N classes with scores + mapped category
        ...

    def get_state(self):
        # Return the current audio scene summary
        ...
```

### Mel Spectrogram

Computed on CPU using scipy/numpy (~1-2ms for a 1s window). No librosa dependency: the mel filterbank is ~30 lines of numpy/scipy (sketched below). Parameters match YAMNet's expected input: 64 mel bands, 96 time frames, 25ms window, 10ms hop.
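A minimal sketch of that computation, assuming YAMNet's published feature parameters (64 mel bands spanning 125-7500Hz on the HTK mel scale, magnitude STFT, log offset 0.001). Function and constant names here are illustrative, and numerical parity with the reference `tf.signal` features should be spot-checked before trusting scores:

```python
import numpy as np
from scipy.signal import get_window

SAMPLE_RATE = 16000
WIN, HOP, NFFT = 400, 160, 512          # 25ms window, 10ms hop
N_MELS, FMIN, FMAX = 64, 125.0, 7500.0  # YAMNet's mel range
LOG_OFFSET = 0.001

def _hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def _mel_filterbank():
    # Triangular filters spaced evenly on the HTK mel scale.
    mel_pts = np.linspace(_hz_to_mel(FMIN), _hz_to_mel(FMAX), N_MELS + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.fft.rfftfreq(NFFT, d=1.0 / SAMPLE_RATE)
    fb = np.zeros((N_MELS, bins.size), dtype=np.float32)
    for i in range(N_MELS):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fb[i] = np.maximum(0.0, np.minimum((bins - lo) / (mid - lo),
                                           (hi - bins) / (hi - mid)))
    return fb

_FB = _mel_filterbank()
_WINDOW = get_window("hann", WIN, fftbins=True).astype(np.float32)

def log_mel(pcm):
    """int16 PCM (15,600 samples at 16kHz) -> (96, 64) float32 log-mel patch."""
    x = pcm.astype(np.float32) / 32768.0
    n_frames = 1 + (len(x) - WIN) // HOP             # 96 for a 0.975s window
    idx = HOP * np.arange(n_frames)[:, None] + np.arange(WIN)[None, :]
    spec = np.abs(np.fft.rfft(x[idx] * _WINDOW, n=NFFT, axis=1))  # magnitude
    return np.log(spec @ _FB.T + LOG_OFFSET)
```

If the Edge TPU model was compiled with quantized inputs, the patch additionally needs quantizing to the interpreter's input scale and zero-point before invocation.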
### Category Mapping

Groups the 521 raw YAMNet classes into eight useful buckets:

| Category | Example YAMNet classes |
|----------|------------------------|
| `speech` | Speech, Conversation, Narration |
| `alert` | Doorbell, Knock, Alarm, Telephone |
| `music` | Music, Singing, Musical instrument |
| `animal` | Dog, Cat, Bird |
| `household` | Door, Footsteps, Typing, Water |
| `environment` | Wind, Rain, Thunder, Traffic |
| `silence` | Silence, White noise |
| `other` | Everything else |

### State and Smoothing

Tracks the last ~10 classifications to avoid flickering between categories on noisy audio. The dominant category is the most frequent one over that recent window, not just the latest inference (see the smoothing sketch in the appendix).

## Integration with headmic.py

### Ring Buffer

In `listener_loop()`, after the existing Porcupine processing, append every frame to a `collections.deque` (31 frames x 512 samples = 15,872 samples, just over the 15,600 the model needs). Appending is unconditional: every frame goes in regardless of wake word or recording state.

### Classifier Thread

A separate daemon thread runs alongside the listener thread (see the integration sketch in the appendix). Every 0.5s it:

1. Snapshots the ring buffer
2. Computes the mel spectrogram on CPU
3. Runs YAMNet inference on the Coral
4. Updates shared state with the results

### New API Endpoints

| Endpoint | Returns |
|----------|---------|
| `GET /sounds` | Current audio scene: dominant category, top-3 raw classes with scores, timestamp |
| `GET /sounds/history` | Last 30 seconds of classifications |

### Updated Existing Endpoints

- `GET /health` adds `sound_classification_enabled: true/false`
- `GET /status` adds `audio_scene: "speech"` (the dominant category)

### Graceful Degradation

If no second Coral is present, `SoundClassifier.__init__` catches the delegate load failure and logs a warning, and headmic.py sets `sound_classifier = None`. All existing functionality is unchanged; `/sounds` returns 503.

## Hardware Setup

The second Coral should sit on a different USB bus from the oak-service Coral (bus 4, the USB3 hub). Options:

- Bus 2 (the other USB3 root hub): preferred for throughput
- Bus 1 or 3 (USB2): still fine, since inference is only ~3ms

## Dependencies

- `scipy`: mel spectrogram computation (FFT, mel filterbank)
- `ai-edge-litert`: already installed on the Pi

No new system packages. No changes to headmic.service.

## Future: Speaker Identification

Not in scope here. It would require a separate speaker embedding model (the audio equivalent of FaceNet), an enrollment DB, and cosine-similarity matching. It can be added as a second classifier alongside YAMNet once this infrastructure is proven.
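## Appendix: Integration Sketches

A minimal sketch of the ring buffer and classifier thread described above. Names here (`on_audio_frame`, `sound_state`, `history`) are hypothetical; the real wiring in `listener_loop()` will differ in detail, and `classify()` is assumed to return a dict:

```python
import collections
import threading
import time

import numpy as np

FRAME_LEN = 512          # samples per listener_loop frame (matches Porcupine)
WINDOW_SAMPLES = 15600   # 0.975s at 16kHz, YAMNet's input size
RING_FRAMES = 31         # 31 * 512 = 15,872 samples >= WINDOW_SAMPLES

ring = collections.deque(maxlen=RING_FRAMES)
ring_lock = threading.Lock()
sound_state = {"category": None, "top": [], "ts": 0.0}  # read by /sounds
history = collections.deque(maxlen=60)                  # 30s at 2 Hz
sound_classifier = None  # set at startup; stays None without a second Coral

def on_audio_frame(frame):
    """Called from listener_loop() for every int16 frame, after Porcupine."""
    with ring_lock:
        ring.append(frame)

def classifier_loop(classifier):
    """Daemon thread: classify the most recent ~1s of audio every 0.5s."""
    while True:
        time.sleep(0.5)
        with ring_lock:
            if len(ring) < RING_FRAMES:
                continue                 # buffer not full yet
            window = np.concatenate(ring)
        result = classifier.classify(window[-WINDOW_SAMPLES:])
        result["ts"] = time.time()
        sound_state.update(result)
        history.append(result)

if sound_classifier is not None:
    threading.Thread(target=classifier_loop, args=(sound_classifier,),
                     daemon=True).start()
```

Snapshotting under the lock and running inference outside it keeps the listener thread from ever blocking on the Coral.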
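The smoothing described under State and Smoothing can be a simple majority vote. A self-contained sketch, with the class-to-category table abbreviated for space (the real mapping derives from `yamnet_class_map.csv` and covers all 521 classes):

```python
from collections import Counter, deque

# Abbreviated and illustrative; the full table maps all 521 YAMNet classes.
CATEGORY_MAP = {
    "Speech": "speech", "Conversation": "speech",
    "Doorbell": "alert", "Knock": "alert", "Alarm": "alert",
    "Music": "music", "Dog": "animal", "Silence": "silence",
}

class SceneSmoother:
    """Majority vote over the last N categories to avoid flicker."""

    def __init__(self, n=10):
        self.recent = deque(maxlen=n)

    def update(self, top_class):
        self.recent.append(CATEGORY_MAP.get(top_class, "other"))
        return Counter(self.recent).most_common(1)[0][0]
```

`classify()` would feed its top raw class into `update()` and report the returned category as `audio_scene`.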