# Sound Identification: YAMNet on Coral Edge TPU
Add continuous audio scene classification to headmic using a second Coral Edge TPU running YAMNet.
## Model
YAMNet (Google, trained on AudioSet) classifies 521 sound categories.
| Spec | Value |
|---|---|
| Model | yamnet_edgetpu.tflite (~3MB) |
| Hardware | Coral Edge TPU (dedicated, separate from oak-service Coral) |
| Input | 0.975s audio window (15,600 samples at 16kHz) -> 96x64 mel spectrogram |
| Output | 521-class probability vector |
| Inference | ~2-3ms on Edge TPU |
| Classification interval | Every 0.5s (50% window overlap) |
## Architecture
The existing audio stream feeds a ring buffer alongside the Porcupine wake word path. A separate thread consumes the buffer and runs YAMNet inference on the second Coral.
```
arecord (16kHz mono)
  |
  +-> Porcupine (wake word)            <- existing, untouched
  |     +-> VAD -> record -> STT
  |
  +-> Ring buffer (1s window)          <- new
        +-> Mel spectrogram (CPU, scipy)
        +-> YAMNet (Coral #2)
        +-> Category mapping
        +-> State + API endpoints
```
No changes to the wake word, VAD, recording, or STT paths.
## Files
| File | Status | Purpose |
|---|---|---|
| `sound_id.py` | New | `SoundClassifier` class: Coral YAMNet, mel spectrogram, category mapping |
| `headmic.py` | Modified | Ring buffer in `listener_loop`, classifier thread, new endpoints |
| `models/yamnet_edgetpu.tflite` | New | Edge TPU compiled YAMNet model |
| `models/yamnet_class_map.csv` | New | 521 class name mapping |
## SoundClassifier Class (sound_id.py)

```python
class SoundClassifier:
    def __init__(self, model_path, class_map_path):
        # Load YAMNet Edge TPU model on Coral #2
        # Load class name mapping (521 classes)
        # Define category groups
        ...

    def classify(self, audio_samples):
        # Convert 16kHz int16 PCM -> mel spectrogram (CPU, numpy/scipy)
        # Run YAMNet inference on Coral
        # Return top-N classes with scores + mapped category
        ...

    def get_state(self):
        # Return current audio scene summary
        ...
```
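A minimal sketch of the `__init__` wiring, assuming `ai-edge-litert` exposes the same `Interpreter`/`load_delegate` API as `tflite_runtime` and that the second Coral enumerates as `usb:1` (both worth verifying on the Pi):

```python
import csv

from ai_edge_litert.interpreter import Interpreter, load_delegate

def load_yamnet(model_path, class_map_path):
    # "usb:1" selects the second enumerated Coral so we don't grab
    # the oak-service device; adjust if enumeration order differs.
    delegate = load_delegate("libedgetpu.so.1", {"device": "usb:1"})
    interpreter = Interpreter(model_path=model_path,
                              experimental_delegates=[delegate])
    interpreter.allocate_tensors()
    with open(class_map_path) as f:
        rows = list(csv.reader(f))[1:]      # skip header: index,mid,display_name
        class_names = [row[2] for row in rows]
    return interpreter, class_names
```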
## Mel Spectrogram
Computed on CPU using scipy.signal (~1-2ms for a 1s window). No librosa dependency — the mel filterbank is ~30 lines of numpy/scipy. Parameters match YAMNet's expected input: 64 mel bands, 96 time frames, 25ms window, 10ms hop.
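A sketch of the filterbank approach, assuming YAMNet's published feature parameters (125-7500 Hz mel range, log offset 0.001); details such as magnitude vs. power spectrogram should be checked against the reference feature code:

```python
import numpy as np
from scipy import signal

SR, WIN, HOP, N_FFT, N_MELS = 16000, 400, 160, 512, 64   # 25ms window, 10ms hop
MEL_LO, MEL_HI = 125.0, 7500.0

def _hz_to_mel(hz):
    return 1127.0 * np.log(1.0 + hz / 700.0)

def mel_filterbank():
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(_hz_to_mel(MEL_LO), _hz_to_mel(MEL_HI), N_MELS + 2)
    hz = 700.0 * (np.exp(mels / 1127.0) - 1.0)
    bins = np.floor((N_FFT + 1) * hz / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1), dtype=np.float32)
    for m in range(1, N_MELS + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

def log_mel(pcm_int16, fb):
    x = pcm_int16.astype(np.float32) / 32768.0
    _, _, Z = signal.stft(x, fs=SR, window="hann", nperseg=WIN,
                          noverlap=WIN - HOP, nfft=N_FFT,
                          boundary=None, padded=False)
    mel = fb @ np.abs(Z)              # magnitude spectrogram -> mel bands
    return np.log(mel + 0.001).T      # (96, 64) for a 0.975s / 15,600-sample window
```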
## Category Mapping
Groups the 521 raw YAMNet classes into eight buckets:
| Category | Example YAMNet classes |
|---|---|
| `speech` | Speech, Conversation, Narration |
| `alert` | Doorbell, Knock, Alarm, Telephone |
| `music` | Music, Singing, Musical instrument |
| `animal` | Dog, Cat, Bird |
| `household` | Door, Footsteps, Typing, Water |
| `environment` | Wind, Rain, Thunder, Traffic |
| `silence` | Silence, White noise |
| `other` | Everything else |
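A sketch of the lookup, keyed on YAMNet display names; the keyword lists here are illustrative, not the final grouping:

```python
CATEGORY_MAP = {
    "speech":      {"Speech", "Conversation", "Narration"},
    "alert":       {"Doorbell", "Knock", "Alarm", "Telephone"},
    "music":       {"Music", "Singing", "Musical instrument"},
    "animal":      {"Dog", "Cat", "Bird"},
    "household":   {"Door", "Footsteps", "Typing", "Water"},
    "environment": {"Wind", "Rain", "Thunder", "Traffic"},
    "silence":     {"Silence", "White noise"},
}

def map_category(class_name):
    # Any of the 521 classes not listed above falls through to "other".
    for category, names in CATEGORY_MAP.items():
        if class_name in names:
            return category
    return "other"
```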
## State and Smoothing
Tracks last ~10 classifications to avoid flickering between categories on noisy audio. The dominant category is determined by frequency over the recent window, not just the latest inference.
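A minimal sketch of the majority vote, assuming a window of the last 10 category labels:

```python
from collections import Counter, deque

class SceneSmoother:
    def __init__(self, window=10):
        self.recent = deque(maxlen=window)

    def update(self, category):
        self.recent.append(category)
        # The dominant category is the most frequent label in the window,
        # so one noisy inference can't flip the reported scene.
        return Counter(self.recent).most_common(1)[0][0]
```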
## Integration with headmic.py
### Ring Buffer
In listener_loop(), after existing Porcupine processing, append every frame to a collections.deque (~1s = 31 frames of 512 samples). Unconditional — every frame goes in regardless of wake word or recording state.
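A sketch of the buffer and its hook point; `on_frame` is a hypothetical stand-in for the existing per-frame code path in listener_loop():

```python
from collections import deque

import numpy as np

RING_FRAMES = 31                    # 31 * 512 = 15,872 samples, just over 0.975s
ring = deque(maxlen=RING_FRAMES)    # oldest frames fall off automatically

def on_frame(frame: np.ndarray) -> None:
    # Called for every 512-sample int16 frame, after Porcupine processing.
    ring.append(frame)              # unconditional, regardless of wake word state
```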
### Classifier Thread
A separate daemon thread runs alongside the listener thread. Every 0.5s it does the following (see the sketch after this list):
- Snapshot the ring buffer
- Compute mel spectrogram on CPU
- Run YAMNet inference on Coral
- Update shared state with results
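A sketch of the loop; `sound_classifier` and `scene_state` are assumed names for the shared objects:

```python
import threading
import time

import numpy as np

def classifier_loop(ring, classifier, state, interval=0.5):
    while True:
        frames = list(ring)              # snapshot; add a lock if appends ever race
        if len(frames) == ring.maxlen:   # wait until the buffer holds a full window
            samples = np.concatenate(frames)[-15600:]   # exactly 0.975s at 16kHz
            state.update(classifier.classify(samples))
        time.sleep(interval)

threading.Thread(target=classifier_loop,
                 args=(ring, sound_classifier, scene_state),
                 daemon=True).start()
```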
### New API Endpoints
| Endpoint | Returns |
|---|---|
| `GET /sounds` | Current audio scene: dominant category, top-3 raw classes with scores, timestamp |
| `GET /sounds/history` | Last 30 seconds of classifications |
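An illustrative /sounds payload; the field names here are assumptions about the final shape, not a settled schema:

```python
# Example GET /sounds response body:
{
    "category": "speech",       # dominant smoothed category
    "top": [
        {"class": "Speech", "score": 0.91},
        {"class": "Conversation", "score": 0.05},
        {"class": "Inside, small room", "score": 0.02},
    ],
    "timestamp": 1700000000.5,  # unix time of the latest inference
}
```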
### Updated Existing Endpoints
| Endpoint | Addition |
|---|---|
| `GET /health` | adds `sound_classification_enabled: true/false` |
| `GET /status` | adds `audio_scene: "speech"` (dominant category) |
## Graceful Degradation
If no second Coral is present, the Edge TPU delegate fails to load in SoundClassifier.__init__; the failure is caught and logged as a warning, and headmic.py sets sound_classifier = None. All existing functionality is unchanged, and /sounds returns 503.
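One way to wire the fallback, catching the construction failure in headmic.py (a sketch, not the final error handling):

```python
import logging

from sound_id import SoundClassifier

try:
    sound_classifier = SoundClassifier("models/yamnet_edgetpu.tflite",
                                       "models/yamnet_class_map.csv")
except Exception as exc:        # delegate load fails when the second Coral is absent
    logging.warning("sound classification disabled: %s", exc)
    sound_classifier = None     # /sounds handlers check this and return 503
```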
## Hardware Setup
The second Coral should be on a different USB bus than the oak-service Coral (bus 4, USB3 hub). Options:
- Bus 2 (other USB3 root hub) — preferred for speed
- Bus 1 or 3 (USB2) — still fine, inference is only ~3ms
## Dependencies
- `scipy` — for mel spectrogram computation (FFT, mel filterbank)
- `ai-edge-litert` — already installed on Pi
No new system packages. No changes to headmic.service.
## Future: Speaker Identification
Not in this scope. Would require a separate speaker embedding model (audio equivalent of FaceNet), enrollment DB, and cosine matching. Can be added as a second classifier alongside YAMNet once the infrastructure is proven.