Add design doc for YAMNet sound identification on Coral Edge TPU
Covers model choice, architecture, category mapping, API endpoints, and integration with existing headmic audio pipeline. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/plans/2026-02-01-sound-identification-design.md

# Sound Identification: YAMNet on Coral Edge TPU

Add continuous audio scene classification to headmic using a second Coral Edge TPU running YAMNet.

## Model

**YAMNet** (Google, trained on AudioSet) classifies 521 sound categories.

| Spec | Value |
|------|-------|
| Model | `yamnet_edgetpu.tflite` (~3MB) |
| Hardware | Coral Edge TPU (dedicated, separate from the oak-service Coral) |
| Input | 0.975s audio window (15,600 samples at 16kHz) -> 96x64 mel spectrogram |
| Output | 521-class probability vector |
| Inference | ~2-3ms on Edge TPU |
| Classification interval | Every 0.5s (~50% window overlap) |

## Architecture

The existing audio stream feeds a ring buffer alongside the Porcupine wake word path. A separate thread consumes the buffer and runs YAMNet inference on the second Coral.

```
arecord (16kHz mono)
  |
  +-> Porcupine (wake word)            <- existing, untouched
  |     +-> VAD -> record -> STT
  |
  +-> Ring buffer (1s window)          <- new
        +-> Mel spectrogram (CPU, scipy)
            +-> YAMNet (Coral #2)
                +-> Category mapping
                    +-> State + API endpoints
```

No changes to the wake word, VAD, recording, or STT paths.

## Files

| File | Status | Purpose |
|------|--------|---------|
| `sound_id.py` | New | SoundClassifier class: Coral YAMNet, mel spectrogram, category mapping |
| `headmic.py` | Modified | Ring buffer in listener_loop, classifier thread, new endpoints |
| `models/yamnet_edgetpu.tflite` | New | Edge TPU compiled YAMNet model |
| `models/yamnet_class_map.csv` | New | 521-class name mapping |

## SoundClassifier Class (`sound_id.py`)

```python
class SoundClassifier:
    def __init__(self, model_path, class_map_path):
        # Load YAMNet Edge TPU model on Coral #2
        # Load class name mapping (521 classes)
        # Define category groups
        ...

    def classify(self, audio_samples):
        # Convert 16kHz int16 PCM -> mel spectrogram (CPU, numpy/scipy)
        # Run YAMNet inference on Coral
        # Return top-N classes with scores + mapped category
        ...

    def get_state(self):
        # Return current audio scene summary
        ...
```

### Mel Spectrogram

Computed on CPU using scipy.signal (~1-2ms for a 1s window). No librosa dependency — the mel filterbank is ~30 lines of numpy/scipy. Parameters match YAMNet's expected input: 64 mel bands, 96 time frames, 25ms window, 10ms hop.
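
The front end can be sketched in roughly this shape. This is a hypothetical sketch, not the shipped code: the window, hop, band count, and 125-7500Hz edges follow YAMNet's published front end, but the exact scaling (magnitude vs. power spectrogram, scipy's STFT normalization) would need to be validated against the reference implementation before trusting the scores.

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 16000
WIN = 400            # 25ms window
HOP = 160            # 10ms hop
N_FFT = 512
N_MELS = 64
F_MIN, F_MAX = 125.0, 7500.0

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_filterbank():
    # Triangular filters spaced evenly on the mel scale.
    mel_edges = np.linspace(hz_to_mel(F_MIN), hz_to_mel(F_MAX), N_MELS + 2)
    hz_edges = 700.0 * (np.exp(mel_edges / 1127.0) - 1.0)
    fft_freqs = np.linspace(0, SAMPLE_RATE / 2, N_FFT // 2 + 1)
    fb = np.zeros((N_MELS, len(fft_freqs)))
    for m in range(N_MELS):
        lo, center, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(pcm_int16):
    # int16 PCM -> float, STFT magnitude, mel projection, log compression.
    x = pcm_int16.astype(np.float32) / 32768.0
    _, _, stft = signal.stft(x, fs=SAMPLE_RATE, window="hann",
                             nperseg=WIN, noverlap=WIN - HOP,
                             nfft=N_FFT, boundary=None, padded=False)
    mag = np.abs(stft)                  # (257, frames)
    mel = mel_filterbank() @ mag        # (64, frames)
    return np.log(mel + 0.001).T        # (frames, 64), log-offset as in YAMNet
```

With the 15,600-sample window from the spec table this yields the 96x64 patch YAMNet expects.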
### Category Mapping

Groups 521 raw YAMNet classes into ~8 useful buckets:

| Category | Example YAMNet classes |
|----------|------------------------|
| `speech` | Speech, Conversation, Narration |
| `alert` | Doorbell, Knock, Alarm, Telephone |
| `music` | Music, Singing, Musical instrument |
| `animal` | Dog, Cat, Bird |
| `household` | Door, Footsteps, Typing, Water |
| `environment` | Wind, Rain, Thunder, Traffic |
| `silence` | Silence, White noise |
| `other` | Everything else |

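
A minimal sketch of the bucketing, assuming substring matching on the AudioSet display names (the real code might map by class index instead). Dict order matters: `alert` is checked before `household`, so "Doorbell" hits `alert` rather than matching `household`'s "Door".

```python
# Patterns are illustrative examples from the table above, not the full map.
CATEGORY_PATTERNS = {
    "speech": ["Speech", "Conversation", "Narration"],
    "alert": ["Doorbell", "Knock", "Alarm", "Telephone"],
    "music": ["Music", "Singing", "Musical instrument"],
    "animal": ["Dog", "Cat", "Bird"],
    "household": ["Door", "Footsteps", "Typing", "Water"],
    "environment": ["Wind", "Rain", "Thunder", "Traffic"],
    "silence": ["Silence", "White noise"],
}

def map_category(class_name: str) -> str:
    # First bucket whose pattern appears in the class name wins.
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(p.lower() in class_name.lower() for p in patterns):
            return category
    return "other"
```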
### State and Smoothing

Tracks the last ~10 classifications to avoid flickering between categories on noisy audio. The dominant category is determined by frequency over the recent window, not just the latest inference.
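
The smoothing step is small enough to sketch in full. The class and method names here are illustrative, not from the real code:

```python
from collections import Counter, deque

class SceneSmoother:
    def __init__(self, window: int = 10):
        # Bounded history: old classifications fall off automatically.
        self.recent = deque(maxlen=window)

    def update(self, category: str) -> str:
        self.recent.append(category)
        # Dominant category = most frequent over the window, so a single
        # outlier inference cannot flip the reported scene.
        return Counter(self.recent).most_common(1)[0][0]
```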

## Integration with headmic.py

### Ring Buffer

In `listener_loop()`, after the existing Porcupine processing, append every frame to a `collections.deque` (~1s = 31 frames of 512 samples). This is unconditional: every frame goes in regardless of wake word or recording state.
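
A sketch of that step, with frames assumed to arrive as raw PCM bytes (the function names are illustrative):

```python
from collections import deque

FRAME_LEN = 512                    # samples per frame
RING_FRAMES = 31                   # ~1s of 16kHz audio (31 * 512 = 15,872 samples)
ring = deque(maxlen=RING_FRAMES)   # oldest frames fall off automatically

def on_frame(frame: bytes) -> None:
    # Called for every captured frame in listener_loop(), unconditionally --
    # the buffer fills regardless of wake word or recording state.
    ring.append(frame)

def snapshot() -> bytes:
    # Contiguous copy for the classifier thread; iterating the deque
    # briefly while the listener appends is safe, mutating it is not.
    return b"".join(ring)
```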

### Classifier Thread

Separate daemon thread alongside the listener thread. Every 0.5s:

1. Snapshot the ring buffer
2. Compute the mel spectrogram on CPU
3. Run YAMNet inference on the Coral
4. Update shared state with the results
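
The loop above can be sketched as follows; `snapshot`, `classifier`, and `shared_state` are stand-ins for the real headmic.py objects, not their actual names:

```python
import threading
import time

def classifier_loop(classifier, snapshot, shared_state, interval=0.5):
    while True:
        start = time.monotonic()
        audio = snapshot()                        # 1. copy the ring buffer
        if audio:
            result = classifier.classify(audio)   # 2+3. mel spectrogram + Coral inference
            shared_state.update(result)           # 4. publish for the API endpoints
        # Sleep out the remainder of the 0.5s period.
        time.sleep(max(0.0, interval - (time.monotonic() - start)))

def start_classifier_thread(classifier, snapshot, shared_state):
    t = threading.Thread(target=classifier_loop,
                         args=(classifier, snapshot, shared_state),
                         daemon=True)             # daemon: dies with the process
    t.start()
    return t
```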

### New API Endpoints

| Endpoint | Returns |
|----------|---------|
| `GET /sounds` | Current audio scene: dominant category, top-3 raw classes with scores, timestamp |
| `GET /sounds/history` | Last 30 seconds of classifications |

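A plausible `/sounds` response shape; the field names and example values are illustrative, not finalized:

```json
{
  "category": "speech",
  "top_classes": [
    {"name": "Speech", "score": 0.82},
    {"name": "Conversation", "score": 0.09},
    {"name": "Narration", "score": 0.03}
  ],
  "timestamp": 1767225600.0
}
```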
### Updated Existing Endpoints

- `GET /health` adds `sound_classification_enabled: true/false`
- `GET /status` adds `audio_scene: "speech"` (dominant category)

### Graceful Degradation

If no second Coral is present, `SoundClassifier.__init__` catches the delegate load failure, logs a warning, and headmic.py sets `sound_classifier = None`. All existing functionality is unchanged. `/sounds` returns 503.
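
A sketch of that startup path. The Flask-style route is an assumption about headmic's HTTP stack, and `sound_id` is the module proposed above:

```python
import logging

def init_sound_classifier(model_path, class_map_path):
    try:
        # Import and construction both fail cleanly when the Edge TPU
        # delegate (or the module itself) is unavailable.
        from sound_id import SoundClassifier
        return SoundClassifier(model_path, class_map_path)
    except Exception as exc:
        logging.warning("sound classification disabled: %s", exc)
        return None

# In headmic.py, assuming a Flask-style handler:
#
# @app.route("/sounds")
# def sounds():
#     if sound_classifier is None:
#         return {"error": "sound classification unavailable"}, 503
#     return sound_classifier.get_state()
```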

## Hardware Setup

The second Coral should be on a different USB bus than the oak-service Coral (bus 4, USB3 hub). Options:

- Bus 2 (other USB3 root hub) — preferred for speed
- Bus 1 or 3 (USB2) — still fine, inference is only ~3ms

## Dependencies

- `scipy` — for mel spectrogram computation (FFT, mel filterbank)
- `ai-edge-litert` — already installed on the Pi

No new system packages. No changes to headmic.service.

## Future: Speaker Identification

Not in this scope. Would require a separate speaker embedding model (the audio equivalent of FaceNet), an enrollment DB, and cosine matching. Can be added as a second classifier alongside YAMNet once the infrastructure is proven.