From 22aae40d177ca7de469be21be5c5d174d9281de7 Mon Sep 17 00:00:00 2001
From: Alex
Date: Sun, 1 Feb 2026 20:04:31 -0600
Subject: [PATCH] Add design doc for YAMNet sound identification on Coral Edge
 TPU

Covers model choice, architecture, category mapping, API endpoints, and
integration with the existing headmic audio pipeline.

Co-Authored-By: Claude Opus 4.5
---
 .../2026-02-01-sound-identification-design.md | 132 ++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 docs/plans/2026-02-01-sound-identification-design.md

diff --git a/docs/plans/2026-02-01-sound-identification-design.md b/docs/plans/2026-02-01-sound-identification-design.md
new file mode 100644
index 0000000..9ab09d7
--- /dev/null
+++ b/docs/plans/2026-02-01-sound-identification-design.md
@@ -0,0 +1,132 @@

# Sound Identification: YAMNet on Coral Edge TPU

Add continuous audio scene classification to headmic using a second Coral Edge TPU running YAMNet.

## Model

**YAMNet** (Google, trained on AudioSet) classifies 521 sound categories.

| Spec | Value |
|------|-------|
| Model | `yamnet_edgetpu.tflite` (~3MB) |
| Hardware | Coral Edge TPU (dedicated, separate from the oak-service Coral) |
| Input | 0.975s audio window (15,600 samples at 16kHz) -> 96x64 log-mel spectrogram (96 frames x 64 mel bands) |
| Output | 521-class probability vector |
| Inference | ~2-3ms on the Edge TPU |
| Classification interval | Every 0.5s (~50% window overlap) |

## Architecture

The existing audio stream feeds a ring buffer alongside the Porcupine wake word path. A separate thread consumes the buffer and runs YAMNet inference on the second Coral.

```
arecord (16kHz mono)
  |
  +-> Porcupine (wake word) <- existing, untouched
  |       +-> VAD -> record -> STT
  |
  +-> Ring buffer (1s window) <- new
          +-> Mel spectrogram (CPU, scipy)
          +-> YAMNet (Coral #2)
          +-> Category mapping
          +-> State + API endpoints
```

No changes to the wake word, VAD, recording, or STT paths.

## Files

| File | Status | Purpose |
|------|--------|---------|
| `sound_id.py` | New | `SoundClassifier` class: Coral YAMNet inference, mel spectrogram, category mapping |
| `headmic.py` | Modified | Ring buffer in `listener_loop()`, classifier thread, new endpoints |
| `models/yamnet_edgetpu.tflite` | New | Edge TPU-compiled YAMNet model |
| `models/yamnet_class_map.csv` | New | Mapping of the 521 class indices to names |

## SoundClassifier Class (`sound_id.py`)

```python
class SoundClassifier:
    def __init__(self, model_path, class_map_path):
        # Load the YAMNet Edge TPU model on Coral #2
        # Load the class name mapping (521 classes)
        # Define category groups
        ...

    def classify(self, audio_samples):
        # Convert 16kHz int16 PCM -> mel spectrogram (CPU, numpy/scipy)
        # Run YAMNet inference on the Coral
        # Return top-N classes with scores + mapped category
        ...

    def get_state(self):
        # Return the current audio scene summary
        ...
```

### Mel Spectrogram

Computed on CPU using scipy/numpy (~1-2ms for a 1s window). No librosa dependency: the mel filterbank is ~30 lines of numpy/scipy (sketched below). Parameters match YAMNet's expected input: 64 mel bands, 96 time frames, 25ms window, 10ms hop.
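A minimal sketch of that computation, assuming YAMNet's published feature parameters (64 mel bands spanning 125-7500Hz on the HTK mel scale, magnitude STFT, log offset 0.001). Function and constant names here are illustrative, and numerical parity with the reference `tf.signal` features should be spot-checked before trusting scores:

```python
import numpy as np
from scipy.signal import get_window

SAMPLE_RATE = 16000
WIN, HOP, NFFT = 400, 160, 512          # 25ms window, 10ms hop
N_MELS, FMIN, FMAX = 64, 125.0, 7500.0  # YAMNet's mel range
LOG_OFFSET = 0.001

def _hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def _mel_filterbank():
    # Triangular filters spaced evenly on the HTK mel scale.
    mel_pts = np.linspace(_hz_to_mel(FMIN), _hz_to_mel(FMAX), N_MELS + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.fft.rfftfreq(NFFT, d=1.0 / SAMPLE_RATE)
    fb = np.zeros((N_MELS, bins.size), dtype=np.float32)
    for i in range(N_MELS):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fb[i] = np.maximum(0.0, np.minimum((bins - lo) / (mid - lo),
                                           (hi - bins) / (hi - mid)))
    return fb

_FB = _mel_filterbank()
_WINDOW = get_window("hann", WIN, fftbins=True).astype(np.float32)

def log_mel(pcm):
    """int16 PCM (15,600 samples at 16kHz) -> (96, 64) float32 log-mel patch."""
    x = pcm.astype(np.float32) / 32768.0
    n_frames = 1 + (len(x) - WIN) // HOP             # 96 for a 0.975s window
    idx = HOP * np.arange(n_frames)[:, None] + np.arange(WIN)[None, :]
    spec = np.abs(np.fft.rfft(x[idx] * _WINDOW, n=NFFT, axis=1))  # magnitude
    return np.log(spec @ _FB.T + LOG_OFFSET)
```

If the Edge TPU model was compiled with quantized inputs, the patch additionally needs quantizing to the interpreter's input scale and zero-point before invocation.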
### Category Mapping

Groups the 521 raw YAMNet classes into eight useful buckets:

| Category | Example YAMNet classes |
|----------|------------------------|
| `speech` | Speech, Conversation, Narration |
| `alert` | Doorbell, Knock, Alarm, Telephone |
| `music` | Music, Singing, Musical instrument |
| `animal` | Dog, Cat, Bird |
| `household` | Door, Footsteps, Typing, Water |
| `environment` | Wind, Rain, Thunder, Traffic |
| `silence` | Silence, White noise |
| `other` | Everything else |

### State and Smoothing

Tracks the last ~10 classifications to avoid flickering between categories on noisy audio. The dominant category is the most frequent one over that recent window, not just the latest inference (see the smoothing sketch in the appendix).

## Integration with headmic.py

### Ring Buffer

In `listener_loop()`, after the existing Porcupine processing, append every frame to a `collections.deque` (31 frames x 512 samples = 15,872 samples, just over the 15,600 the model needs). Appending is unconditional: every frame goes in regardless of wake word or recording state.

### Classifier Thread

A separate daemon thread runs alongside the listener thread (see the integration sketch in the appendix). Every 0.5s it:

1. Snapshots the ring buffer
2. Computes the mel spectrogram on CPU
3. Runs YAMNet inference on the Coral
4. Updates shared state with the results

### New API Endpoints

| Endpoint | Returns |
|----------|---------|
| `GET /sounds` | Current audio scene: dominant category, top-3 raw classes with scores, timestamp |
| `GET /sounds/history` | Last 30 seconds of classifications |

### Updated Existing Endpoints

- `GET /health` adds `sound_classification_enabled: true/false`
- `GET /status` adds `audio_scene: "speech"` (the dominant category)

### Graceful Degradation

If no second Coral is present, `SoundClassifier.__init__` catches the delegate load failure and logs a warning, and headmic.py sets `sound_classifier = None`. All existing functionality is unchanged; `/sounds` returns 503.

## Hardware Setup

The second Coral should sit on a different USB bus from the oak-service Coral (bus 4, the USB3 hub). Options:

- Bus 2 (the other USB3 root hub): preferred for throughput
- Bus 1 or 3 (USB2): still fine, since inference is only ~3ms

## Dependencies

- `scipy`: mel spectrogram computation (FFT, mel filterbank)
- `ai-edge-litert`: already installed on the Pi

No new system packages. No changes to headmic.service.

## Future: Speaker Identification

Not in scope here. It would require a separate speaker embedding model (the audio equivalent of FaceNet), an enrollment DB, and cosine-similarity matching. It can be added as a second classifier alongside YAMNet once this infrastructure is proven.
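## Appendix: Integration Sketches

A minimal sketch of the ring buffer and classifier thread described above. Names here (`on_audio_frame`, `sound_state`, `history`) are hypothetical; the real wiring in `listener_loop()` will differ in detail, and `classify()` is assumed to return a dict:

```python
import collections
import threading
import time

import numpy as np

FRAME_LEN = 512          # samples per listener_loop frame (matches Porcupine)
WINDOW_SAMPLES = 15600   # 0.975s at 16kHz, YAMNet's input size
RING_FRAMES = 31         # 31 * 512 = 15,872 samples >= WINDOW_SAMPLES

ring = collections.deque(maxlen=RING_FRAMES)
ring_lock = threading.Lock()
sound_state = {"category": None, "top": [], "ts": 0.0}  # read by /sounds
history = collections.deque(maxlen=60)                  # 30s at 2 Hz
sound_classifier = None  # set at startup; stays None without a second Coral

def on_audio_frame(frame):
    """Called from listener_loop() for every int16 frame, after Porcupine."""
    with ring_lock:
        ring.append(frame)

def classifier_loop(classifier):
    """Daemon thread: classify the most recent ~1s of audio every 0.5s."""
    while True:
        time.sleep(0.5)
        with ring_lock:
            if len(ring) < RING_FRAMES:
                continue                 # buffer not full yet
            window = np.concatenate(ring)
        result = classifier.classify(window[-WINDOW_SAMPLES:])
        result["ts"] = time.time()
        sound_state.update(result)
        history.append(result)

if sound_classifier is not None:
    threading.Thread(target=classifier_loop, args=(sound_classifier,),
                     daemon=True).start()
```

Snapshotting under the lock and running inference outside it keeps the listener thread from ever blocking on the Coral.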
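The smoothing described under State and Smoothing can be a simple majority vote. A self-contained sketch, with the class-to-category table abbreviated for space (the real mapping derives from `yamnet_class_map.csv` and covers all 521 classes):

```python
from collections import Counter, deque

# Abbreviated and illustrative; the full table maps all 521 YAMNet classes.
CATEGORY_MAP = {
    "Speech": "speech", "Conversation": "speech",
    "Doorbell": "alert", "Knock": "alert", "Alarm": "alert",
    "Music": "music", "Dog": "animal", "Silence": "silence",
}

class SceneSmoother:
    """Majority vote over the last N categories to avoid flicker."""

    def __init__(self, n=10):
        self.recent = deque(maxlen=n)

    def update(self, top_class):
        self.recent.append(CATEGORY_MAP.get(top_class, "other"))
        return Counter(self.recent).most_common(1)[0][0]
```

`classify()` would feed its top raw class into `update()` and report the returned category as `audio_scene`.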