# Sound Identification: YAMNet on Coral Edge TPU
Add continuous audio scene classification to headmic using a second Coral Edge TPU running YAMNet.
## Model
YAMNet (Google, trained on AudioSet) classifies 521 sound categories.
| Spec | Value |
|---|---|
| Model | yamnet_edgetpu.tflite (~3MB) |
| Hardware | Coral Edge TPU (dedicated, separate from oak-service Coral) |
| Input | 0.975s audio window (15,600 samples at 16kHz) -> 96x64 mel spectrogram |
| Output | 521-class probability vector |
| Inference | ~2-3ms on Edge TPU |
| Classification interval | Every 0.5s (50% window overlap) |
## Architecture
The existing audio stream feeds a ring buffer alongside the Porcupine wake word path. A separate thread consumes the buffer and runs YAMNet inference on the second Coral.
```
arecord (16kHz mono)
  |
  +-> Porcupine (wake word)            <- existing, untouched
  |     +-> VAD -> record -> STT
  |
  +-> Ring buffer (1s window)          <- new
        +-> Mel spectrogram (CPU, scipy)
        +-> YAMNet (Coral #2)
        +-> Category mapping
        +-> State + API endpoints
```
No changes to the wake word, VAD, recording, or STT paths.
## Files
| File | Status | Purpose |
|---|---|---|
| `sound_id.py` | New | `SoundClassifier` class: Coral YAMNet, mel spectrogram, category mapping |
| `headmic.py` | Modified | Ring buffer in `listener_loop`, classifier thread, new endpoints |
| `models/yamnet_edgetpu.tflite` | New | Edge TPU compiled YAMNet model |
| `models/yamnet_class_map.csv` | New | 521 class name mapping |
## SoundClassifier Class (sound_id.py)

```python
class SoundClassifier:
    def __init__(self, model_path, class_map_path):
        # Load YAMNet Edge TPU model on Coral #2
        # Load class name mapping (521 classes)
        # Define category groups
        ...

    def classify(self, audio_samples):
        # Convert 16kHz int16 PCM -> mel spectrogram (CPU, numpy/scipy)
        # Run YAMNet inference on Coral
        # Return top-N classes with scores + mapped category
        ...

    def get_state(self):
        # Return current audio scene summary
        ...
```
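A minimal sketch of the `__init__` wiring, assuming `ai-edge-litert` exposes the same `Interpreter`/`load_delegate` API as `tflite_runtime` and that the second Coral enumerates as `usb:1` (both worth verifying on the Pi):

```python
import csv

from ai_edge_litert.interpreter import Interpreter, load_delegate

def load_yamnet(model_path, class_map_path):
    # "usb:1" selects the second enumerated Coral so we don't grab
    # the oak-service device; adjust if enumeration order differs.
    delegate = load_delegate("libedgetpu.so.1", {"device": "usb:1"})
    interpreter = Interpreter(model_path=model_path,
                              experimental_delegates=[delegate])
    interpreter.allocate_tensors()
    with open(class_map_path) as f:
        rows = list(csv.reader(f))[1:]      # skip header: index,mid,display_name
        class_names = [row[2] for row in rows]
    return interpreter, class_names
```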
## Mel Spectrogram
Computed on CPU using scipy.signal (~1-2ms for a 1s window). No librosa dependency — the mel filterbank is ~30 lines of numpy/scipy. Parameters match YAMNet's expected input: 64 mel bands, 96 time frames, 25ms window, 10ms hop.
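A sketch of the filterbank approach, assuming YAMNet's published feature parameters (125-7500 Hz mel range, log offset 0.001); details such as magnitude vs. power spectrogram should be checked against the reference feature code:

```python
import numpy as np
from scipy import signal

SR, WIN, HOP, N_FFT, N_MELS = 16000, 400, 160, 512, 64   # 25ms window, 10ms hop
MEL_LO, MEL_HI = 125.0, 7500.0

def _hz_to_mel(hz):
    return 1127.0 * np.log(1.0 + hz / 700.0)

def mel_filterbank():
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(_hz_to_mel(MEL_LO), _hz_to_mel(MEL_HI), N_MELS + 2)
    hz = 700.0 * (np.exp(mels / 1127.0) - 1.0)
    bins = np.floor((N_FFT + 1) * hz / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1), dtype=np.float32)
    for m in range(1, N_MELS + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

def log_mel(pcm_int16, fb):
    x = pcm_int16.astype(np.float32) / 32768.0
    _, _, Z = signal.stft(x, fs=SR, window="hann", nperseg=WIN,
                          noverlap=WIN - HOP, nfft=N_FFT,
                          boundary=None, padded=False)
    mel = fb @ np.abs(Z)              # magnitude spectrogram -> mel bands
    return np.log(mel + 0.001).T      # (96, 64) for a 0.975s / 15,600-sample window
```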
## Category Mapping
Groups the 521 raw YAMNet classes into eight buckets:
| Category | Example YAMNet classes |
|---|---|
| `speech` | Speech, Conversation, Narration |
| `alert` | Doorbell, Knock, Alarm, Telephone |
| `music` | Music, Singing, Musical instrument |
| `animal` | Dog, Cat, Bird |
| `household` | Door, Footsteps, Typing, Water |
| `environment` | Wind, Rain, Thunder, Traffic |
| `silence` | Silence, White noise |
| `other` | Everything else |
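A sketch of the lookup, keyed on YAMNet display names; the keyword lists here are illustrative, not the final grouping:

```python
CATEGORY_MAP = {
    "speech":      {"Speech", "Conversation", "Narration"},
    "alert":       {"Doorbell", "Knock", "Alarm", "Telephone"},
    "music":       {"Music", "Singing", "Musical instrument"},
    "animal":      {"Dog", "Cat", "Bird"},
    "household":   {"Door", "Footsteps", "Typing", "Water"},
    "environment": {"Wind", "Rain", "Thunder", "Traffic"},
    "silence":     {"Silence", "White noise"},
}

def map_category(class_name):
    # Any of the 521 classes not listed above falls through to "other".
    for category, names in CATEGORY_MAP.items():
        if class_name in names:
            return category
    return "other"
```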
## State and Smoothing
Tracks last ~10 classifications to avoid flickering between categories on noisy audio. The dominant category is determined by frequency over the recent window, not just the latest inference.
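A minimal sketch of the majority vote, assuming a window of the last 10 category labels:

```python
from collections import Counter, deque

class SceneSmoother:
    def __init__(self, window=10):
        self.recent = deque(maxlen=window)

    def update(self, category):
        self.recent.append(category)
        # The dominant category is the most frequent label in the window,
        # so one noisy inference can't flip the reported scene.
        return Counter(self.recent).most_common(1)[0][0]
```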
## Integration with headmic.py
### Ring Buffer
In listener_loop(), after existing Porcupine processing, append every frame to a collections.deque (~1s = 31 frames of 512 samples). Unconditional — every frame goes in regardless of wake word or recording state.
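A sketch of the buffer and its hook point; `on_frame` is a hypothetical stand-in for the existing per-frame code path in listener_loop():

```python
from collections import deque

import numpy as np

RING_FRAMES = 31                    # 31 * 512 = 15,872 samples, just over 0.975s
ring = deque(maxlen=RING_FRAMES)    # oldest frames fall off automatically

def on_frame(frame: np.ndarray) -> None:
    # Called for every 512-sample int16 frame, after Porcupine processing.
    ring.append(frame)              # unconditional, regardless of wake word state
```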
### Classifier Thread
A separate daemon thread runs alongside the listener thread. Every 0.5s it does the following (see the sketch after this list):
- Snapshot the ring buffer
- Compute mel spectrogram on CPU
- Run YAMNet inference on Coral
- Update shared state with results
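A sketch of the loop; `sound_classifier` and `scene_state` are assumed names for the shared objects:

```python
import threading
import time

import numpy as np

def classifier_loop(ring, classifier, state, interval=0.5):
    while True:
        frames = list(ring)              # snapshot; add a lock if appends ever race
        if len(frames) == ring.maxlen:   # wait until the buffer holds a full window
            samples = np.concatenate(frames)[-15600:]   # exactly 0.975s at 16kHz
            state.update(classifier.classify(samples))
        time.sleep(interval)

threading.Thread(target=classifier_loop,
                 args=(ring, sound_classifier, scene_state),
                 daemon=True).start()
```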
### New API Endpoints
| Endpoint | Returns |
|---|---|
| `GET /sounds` | Current audio scene: dominant category, top-3 raw classes with scores, timestamp |
| `GET /sounds/history` | Last 30 seconds of classifications |
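An illustrative /sounds payload; the field names here are assumptions about the final shape, not a settled schema:

```python
# Example GET /sounds response body:
{
    "category": "speech",       # dominant smoothed category
    "top": [
        {"class": "Speech", "score": 0.91},
        {"class": "Conversation", "score": 0.05},
        {"class": "Inside, small room", "score": 0.02},
    ],
    "timestamp": 1700000000.5,  # unix time of the latest inference
}
```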
### Updated Existing Endpoints
| Endpoint | Addition |
|---|---|
| `GET /health` | adds `sound_classification_enabled: true/false` |
| `GET /status` | adds `audio_scene: "speech"` (dominant category) |
## Graceful Degradation
If no second Coral is present, the Edge TPU delegate fails to load in SoundClassifier.__init__; the failure is caught and logged as a warning, and headmic.py sets sound_classifier = None. All existing functionality is unchanged, and /sounds returns 503.
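One way to wire the fallback, catching the construction failure in headmic.py (a sketch, not the final error handling):

```python
import logging

from sound_id import SoundClassifier

try:
    sound_classifier = SoundClassifier("models/yamnet_edgetpu.tflite",
                                       "models/yamnet_class_map.csv")
except Exception as exc:        # delegate load fails when the second Coral is absent
    logging.warning("sound classification disabled: %s", exc)
    sound_classifier = None     # /sounds handlers check this and return 503
```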
## Hardware Setup
The second Coral should be on a different USB bus than the oak-service Coral (bus 4, USB3 hub). Options:
- Bus 2 (other USB3 root hub) — preferred for speed
- Bus 1 or 3 (USB2) — still fine, inference is only ~3ms
## Dependencies
- `scipy` — for mel spectrogram computation (FFT, mel filterbank)
- `ai-edge-litert` — already installed on Pi
No new system packages. No changes to headmic.service.
## Future: Speaker Identification
Not in this scope. Would require a separate speaker embedding model (audio equivalent of FaceNet), enrollment DB, and cosine matching. Can be added as a second classifier alongside YAMNet once the infrastructure is proven.