# Sound Identification: YAMNet on Coral Edge TPU

Add continuous audio scene classification to headmic using a second Coral Edge TPU running YAMNet.

## Model

**YAMNet** (Google, trained on AudioSet) classifies 521 sound categories.

| Spec | Value |
|------|-------|
| Model | `yamnet_edgetpu.tflite` (~3MB) |
| Hardware | Coral Edge TPU (dedicated, separate from the oak-service Coral) |
| Input | 0.975s audio window (15,600 samples at 16kHz) -> 96x64 mel spectrogram |
| Output | 521-class probability vector |
| Inference | ~2-3ms on the Edge TPU |
| Classification interval | Every 0.5s (50% window overlap) |

## Architecture

The existing audio stream feeds a ring buffer alongside the Porcupine wake word path. A separate thread consumes the buffer and runs YAMNet inference on the second Coral.

```
arecord (16kHz mono)
  |
  +-> Porcupine (wake word)     <- existing, untouched
  |     |
  |     +-> VAD -> record -> STT
  |
  +-> Ring buffer (1s window)   <- new
        +-> Mel spectrogram (CPU, scipy)
        +-> YAMNet (Coral #2)
        +-> Category mapping
        +-> State + API endpoints
```

No changes to the wake word, VAD, recording, or STT paths.

## Files

| File | Status | Purpose |
|------|--------|---------|
| `sound_id.py` | New | `SoundClassifier` class: Coral YAMNet, mel spectrogram, category mapping |
| `headmic.py` | Modified | Ring buffer in `listener_loop`, classifier thread, new endpoints |
| `models/yamnet_edgetpu.tflite` | New | Edge TPU-compiled YAMNet model |
| `models/yamnet_class_map.csv` | New | Name mapping for the 521 classes |

## SoundClassifier Class (`sound_id.py`)

```python
class SoundClassifier:
    def __init__(self, model_path, class_map_path):
        # Load YAMNet Edge TPU model on Coral #2
        # Load class name mapping (521 classes)
        # Define category groups
        ...

    def classify(self, audio_samples):
        # Convert 16kHz int16 PCM -> mel spectrogram (CPU, numpy/scipy)
        # Run YAMNet inference on Coral
        # Return top-N classes with scores + mapped category
        ...

    def get_state(self):
        # Return current audio scene summary
        ...
```

### Mel Spectrogram

Computed on CPU using scipy.signal (~1-2ms for a 1s window). No librosa dependency: the mel filterbank is ~30 lines of numpy/scipy (sketched together with the classifier thread below). Parameters match YAMNet's expected input: 64 mel bands, 96 time frames, 25ms window, 10ms hop.

### Category Mapping

Groups the 521 raw YAMNet classes into ~8 useful buckets (a sketch follows the classifier-thread steps below):

| Category | Example YAMNet classes |
|----------|------------------------|
| `speech` | Speech, Conversation, Narration |
| `alert` | Doorbell, Knock, Alarm, Telephone |
| `music` | Music, Singing, Musical instrument |
| `animal` | Dog, Cat, Bird |
| `household` | Door, Footsteps, Typing, Water |
| `environment` | Wind, Rain, Thunder, Traffic |
| `silence` | Silence, White noise |
| `other` | Everything else |

### State and Smoothing

Tracks the last ~10 classifications to avoid flickering between categories on noisy audio. The dominant category is determined by frequency over the recent window, not just by the latest inference.

## Integration with headmic.py

### Ring Buffer

In `listener_loop()`, after the existing Porcupine processing, append every frame to a `collections.deque` (~1s = 31 frames of 512 samples). This is unconditional: every frame goes in regardless of wake word or recording state.

### Classifier Thread

A separate daemon thread runs alongside the listener thread. Every 0.5s:

1. Snapshot the ring buffer
2. Compute the mel spectrogram on CPU
3. Run YAMNet inference on the Coral
4. Update shared state with the results
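A minimal sketch of this loop together with the mel front end from the `sound_id.py` section above. The preprocessing constants (periodic Hann window, 64 mel bands spanning 125-7500Hz, log offset 0.001) follow YAMNet's published front end but should be verified against the Edge TPU model's training-time preprocessing; `ring_buffer`, `classifier`, `state`, and `lock` are hypothetical stand-ins for the real objects in headmic.py, and each deque entry is assumed to be a 512-sample int16 numpy array.

```python
import time

import numpy as np
from scipy.signal import get_window

SR = 16000
WIN, HOP, N_FFT = 400, 160, 512     # 25ms window, 10ms hop
N_SAMPLES = 15600                   # 0.975s -> 96 frames

def mel_filterbank(n_mels=64, fmin=125.0, fmax=7500.0):
    """Triangular mel filterbank, shape (N_FFT // 2 + 1, n_mels)."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((N_FFT + 1) * mel_to_hz(mel_pts) / SR).astype(int)
    fb = np.zeros((N_FFT // 2 + 1, n_mels))
    for m in range(n_mels):
        lo, ctr, hi = bins[m], bins[m + 1], bins[m + 2]
        fb[lo:ctr, m] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[ctr:hi, m] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    return fb

MEL_FB = mel_filterbank()
WINDOW = get_window("hann", WIN, fftbins=True)

def log_mel_spectrogram(samples):
    """16kHz int16 PCM (15,600 samples) -> (96, 64) log mel patch.

    This is the front end SoundClassifier.classify() would use internally.
    """
    x = samples.astype(np.float32) / 32768.0
    frames = np.lib.stride_tricks.sliding_window_view(x, WIN)[::HOP] * WINDOW
    mag = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))   # (96, 257)
    return np.log(mag @ MEL_FB + 0.001)                  # (96, 64)

def classifier_loop(ring_buffer, classifier, state, lock):
    """Daemon thread body: classify the last ~1s of audio every 0.5s."""
    while True:
        time.sleep(0.5)
        if len(ring_buffer) < 31:
            continue                    # ring buffer not full yet
        # 1. Snapshot the ring buffer (cheap: 31 frames of 512 samples)
        samples = np.concatenate(list(ring_buffer))[-N_SAMPLES:]
        # 2+3. classify() computes the log mel patch on CPU (as above),
        #      quantizes it, and runs YAMNet on the second Coral
        results = classifier.classify(samples)
        # 4. Publish the results under the shared-state lock
        with lock:
            state["scene"] = results
            state["updated"] = time.time()
```

headmic.py would start this next to the existing listener thread with `threading.Thread(target=classifier_loop, args=(...), daemon=True).start()`.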
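The category mapping and smoothing are equally small. A sketch, with an abbreviated mapping (the real table assigns far more of the 521 names; anything unmatched falls through to `other`):

```python
from collections import Counter, deque

# Abbreviated; the real map covers many more of the 521 class names.
CATEGORY_MAP = {
    "Speech": "speech", "Conversation": "speech", "Narration": "speech",
    "Doorbell": "alert", "Knock": "alert", "Alarm": "alert",
    "Music": "music", "Singing": "music", "Musical instrument": "music",
    "Dog": "animal", "Cat": "animal", "Bird": "animal",
    "Door": "household", "Footsteps": "household", "Typing": "household",
    "Wind": "environment", "Rain": "environment", "Thunder": "environment",
    "Silence": "silence", "White noise": "silence",
}

def to_category(class_name):
    return CATEGORY_MAP.get(class_name, "other")

# Last ~10 categories; the dominant one is the most frequent, which keeps
# the scene label from flickering between buckets on noisy audio.
_recent = deque(maxlen=10)

def dominant_category(top_class):
    _recent.append(to_category(top_class))
    return Counter(_recent).most_common(1)[0][0]
```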
### New API Endpoints

| Endpoint | Returns |
|----------|---------|
| `GET /sounds` | Current audio scene: dominant category, top-3 raw classes with scores, timestamp |
| `GET /sounds/history` | Last 30 seconds of classifications |

### Updated Existing Endpoints

- `GET /health` adds `sound_classification_enabled: true/false`
- `GET /status` adds `audio_scene: "speech"` (dominant category)

### Graceful Degradation

If no second Coral is present, `SoundClassifier.__init__` catches the delegate load failure and logs a warning, and headmic.py sets `sound_classifier = None`. All existing functionality is unchanged; `/sounds` returns 503. (A sketch of this wiring closes the document.)

## Hardware Setup

The second Coral should sit on a different USB bus than the oak-service Coral (bus 4, USB3 hub). Options:

- Bus 2 (the other USB3 root hub): preferred for speed
- Bus 1 or 3 (USB2): still fine, since inference is only ~3ms

## Dependencies

- `scipy`: mel spectrogram computation (FFT, mel filterbank)
- `ai-edge-litert`: already installed on the Pi

No new system packages. No changes to headmic.service.

## Future: Speaker Identification

Not in this scope. It would require a separate speaker embedding model (the audio equivalent of FaceNet), an enrollment DB, and cosine matching. It can be added as a second classifier alongside YAMNet once this infrastructure is proven.
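Finally, a sketch of the graceful-degradation path and the `/sounds` endpoint described above. The `load_delegate` call and its `ValueError` on failure are standard libedgetpu/LiteRT behavior; the `"usb:1"` device option (assuming the oak-service Coral holds `usb:0`) and the Flask framing are assumptions, since headmic.py's actual web framework isn't specified here.

```python
import logging

from ai_edge_litert.interpreter import Interpreter, load_delegate
from flask import Flask, jsonify

log = logging.getLogger("headmic")
app = Flask(__name__)

def load_yamnet(model_path, device="usb:1"):
    """The load that SoundClassifier.__init__ wraps: an Interpreter bound
    to the second Coral, or None when no free Edge TPU is available."""
    try:
        # load_delegate raises ValueError when it can't claim a device
        delegate = load_delegate("libedgetpu.so.1", {"device": device})
    except ValueError:
        log.warning("No second Coral found; sound classification disabled")
        return None
    interpreter = Interpreter(model_path=model_path,
                              experimental_delegates=[delegate])
    interpreter.allocate_tensors()
    return interpreter

sound_classifier = None  # headmic.py leaves this None when the load fails

@app.route("/sounds")
def sounds():
    if sound_classifier is None:
        return jsonify(error="sound classification unavailable"), 503
    return jsonify(sound_classifier.get_state())
```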