# Sound Identification: YAMNet on Coral Edge TPU

Add continuous audio scene classification to headmic using a second Coral Edge TPU running YAMNet.

## Model

YAMNet (Google, trained on AudioSet) classifies audio into 521 sound classes.

| Spec | Value |
| --- | --- |
| Model | yamnet_edgetpu.tflite (~3 MB) |
| Hardware | Coral Edge TPU (dedicated, separate from the oak-service Coral) |
| Input | 0.975 s audio window (15,600 samples at 16 kHz) -> 96x64 mel spectrogram |
| Output | 521-class probability vector |
| Inference | ~2-3 ms on the Edge TPU |
| Classification interval | Every 0.5 s (50% window overlap) |

## Architecture

The existing audio stream feeds a ring buffer alongside the Porcupine wake word path. A separate thread consumes the buffer and runs YAMNet inference on the second Coral.

```
arecord (16kHz mono)
  |
  +-> Porcupine (wake word)       <- existing, untouched
  |     +-> VAD -> record -> STT
  |
  +-> Ring buffer (1s window)     <- new
        +-> Mel spectrogram (CPU, scipy)
              +-> YAMNet (Coral #2)
                    +-> Category mapping
                          +-> State + API endpoints
```

No changes to the wake word, VAD, recording, or STT paths.

## Files

| File | Status | Purpose |
| --- | --- | --- |
| sound_id.py | New | SoundClassifier class: Coral YAMNet, mel spectrogram, category mapping |
| headmic.py | Modified | Ring buffer in listener_loop, classifier thread, new endpoints |
| models/yamnet_edgetpu.tflite | New | Edge TPU-compiled YAMNet model |
| models/yamnet_class_map.csv | New | 521-class name mapping |

## SoundClassifier Class (sound_id.py)

```python
class SoundClassifier:
    def __init__(self, model_path, class_map_path):
        # Load YAMNet Edge TPU model on Coral #2
        # Load class name mapping (521 classes)
        # Define category groups
        ...

    def classify(self, audio_samples):
        # Convert 16 kHz int16 PCM -> mel spectrogram (CPU, numpy/scipy)
        # Run YAMNet inference on Coral
        # Return top-N classes with scores + mapped category
        ...

    def get_state(self):
        # Return current audio scene summary
        ...
```
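
Loading the class map is only a few lines. A sketch, assuming the upstream yamnet_class_map.csv layout (index, mid, display_name columns), which should be verified against the shipped file:

```python
import csv

def load_class_map(path):
    # Upstream yamnet_class_map.csv has columns: index, mid, display_name
    with open(path, newline="") as f:
        return [row["display_name"] for row in csv.DictReader(f)]

class_names = load_class_map("models/yamnet_class_map.csv")  # 521 entries
```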

## Mel Spectrogram

Computed on the CPU using scipy.signal (~1-2 ms for a 1 s window). No librosa dependency; the mel filterbank is ~30 lines of numpy/scipy. Parameters match YAMNet's expected input: 64 mel bands, 96 time frames, 25 ms window, 10 ms hop.
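
A minimal sketch of that computation, assuming YAMNet's published front-end (Hann window, 64 mel bands over 125-7500 Hz, log offset 0.001); the exact framing and magnitude scaling should be validated against the Edge TPU model's expected features:

```python
import numpy as np
from scipy import signal

SR = 16000
WIN = 400            # 25 ms window
HOP = 160            # 10 ms hop
N_FFT = 512          # next power of two above WIN
N_MELS = 64
F_MIN, F_MAX = 125.0, 7500.0

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(np.asarray(m) / 1127.0) - 1.0)

def mel_filterbank():
    # Triangular filters on a mel-spaced grid; shape [N_MELS, N_FFT//2 + 1]
    fft_freqs = np.linspace(0.0, SR / 2.0, N_FFT // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(F_MIN), hz_to_mel(F_MAX), N_MELS + 2))
    fb = np.zeros((N_MELS, fft_freqs.size))
    for i in range(N_MELS):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fb[i] = np.clip(np.minimum((fft_freqs - lo) / (ctr - lo),
                                   (hi - fft_freqs) / (hi - ctr)), 0.0, None)
    return fb

FB = mel_filterbank()

def log_mel_patch(pcm_int16):
    # int16 PCM -> [-1, 1] float -> magnitude STFT -> mel energies -> log
    x = pcm_int16.astype(np.float32) / 32768.0
    _, _, Z = signal.stft(x, fs=SR, window="hann", nperseg=WIN,
                          noverlap=WIN - HOP, nfft=N_FFT,
                          boundary=None, padded=False)
    mel = FB @ np.abs(Z)                 # [64, n_frames]
    return np.log(mel + 0.001).T[:96]    # [96, 64] patch for YAMNet
```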

## Category Mapping

Groups the 521 raw YAMNet classes into eight useful buckets:

| Category | Example YAMNet classes |
| --- | --- |
| speech | Speech, Conversation, Narration |
| alert | Doorbell, Knock, Alarm, Telephone |
| music | Music, Singing, Musical instrument |
| animal | Dog, Cat, Bird |
| household | Door, Footsteps, Typing, Water |
| environment | Wind, Rain, Thunder, Traffic |
| silence | Silence, White noise |
| other | Everything else |
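
A possible shape for the mapping, with the caveat that the strings must match the display names in yamnet_class_map.csv exactly (AudioSet names are sometimes compound, e.g. "Walk, footsteps"):

```python
# Illustrative subset; the real sets are built from the 521 class-map names
CATEGORY_GROUPS = {
    "speech": {"Speech", "Conversation", "Narration, monologue"},
    "alert": {"Doorbell", "Knock", "Alarm", "Telephone"},
    "music": {"Music", "Singing", "Musical instrument"},
    "animal": {"Dog", "Cat", "Bird"},
    "household": {"Door", "Walk, footsteps", "Typing", "Water"},
    "environment": {"Wind", "Rain", "Thunder", "Traffic noise, roadway noise"},
    "silence": {"Silence", "White noise"},
}

def map_category(class_name):
    for category, names in CATEGORY_GROUPS.items():
        if class_name in names:
            return category
    return "other"   # everything else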

## State and Smoothing

The classifier tracks the last ~10 classifications to avoid flickering between categories on noisy audio. The dominant category is determined by frequency over the recent window, not just the latest inference.
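
Sketched as a small helper (the name and window size are illustrative):

```python
from collections import Counter, deque

class SceneSmoother:
    """Majority vote over the last N category labels to suppress flicker."""

    def __init__(self, window=10):
        self.recent = deque(maxlen=window)

    def update(self, category):
        self.recent.append(category)
        return Counter(self.recent).most_common(1)[0][0]   # dominant category
```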

## Integration with headmic.py

### Ring Buffer

In listener_loop(), after the existing Porcupine processing, append every frame to a collections.deque (31 frames x 512 samples = 15,872 samples, just over YAMNet's 15,600-sample input). This is unconditional: every frame goes in regardless of wake word or recording state.
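
Roughly, assuming frames arrive as int16 numpy arrays as in the Porcupine path:

```python
import collections
import threading

import numpy as np

RING_FRAMES = 31                     # 31 x 512 = 15,872 samples (~1 s at 16 kHz)
ring = collections.deque(maxlen=RING_FRAMES)
ring_lock = threading.Lock()

# In listener_loop(), after the existing Porcupine call on each 512-sample frame:
#     with ring_lock:
#         ring.append(frame)

def snapshot_ring():
    # Called from the classifier thread; lock so the copy sees a consistent buffer
    with ring_lock:
        if len(ring) < RING_FRAMES:
            return None              # not warmed up yet
        return np.concatenate(list(ring))
```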

### Classifier Thread

A separate daemon thread runs alongside the listener thread. Every 0.5 s it does the following (sketched after the list):

  1. Snapshot the ring buffer
  2. Compute mel spectrogram on CPU
  3. Run YAMNet inference on Coral
  4. Update shared state with results
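
A sketch of that loop, reusing snapshot_ring() and SceneSmoother from the earlier sketches; classifier is the SoundClassifier instance, and the shared-state names are illustrative:

```python
import threading
import time

scene_lock = threading.Lock()
current_scene = {}                   # read by the /sounds handlers

def classifier_loop():
    smoother = SceneSmoother()
    while True:
        samples = snapshot_ring()
        if samples is not None:
            result = classifier.classify(samples)   # mel on CPU + Coral inference
            result["category"] = smoother.update(result["category"])
            with scene_lock:
                current_scene.clear()
                current_scene.update(result)
        time.sleep(0.5)

threading.Thread(target=classifier_loop, daemon=True).start()
```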

### New API Endpoints

| Endpoint | Returns |
| --- | --- |
| GET /sounds | Current audio scene: dominant category, top-3 raw classes with scores, timestamp |
| GET /sounds/history | Last 30 seconds of classifications |
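
A hypothetical /sounds payload, to pin down the shape (field names are illustrative, not final):

```python
# Example GET /sounds response body
{
    "category": "speech",                       # smoothed dominant category
    "top": [["Speech", 0.91],                   # top-3 raw classes with scores
            ["Conversation", 0.04],
            ["Inside, small room", 0.02]],
    "timestamp": 1769997871.0,                  # epoch seconds of last inference
}
```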

### Updated Existing Endpoints

- GET /health adds sound_classification_enabled: true/false
- GET /status adds audio_scene: "speech" (dominant category)

## Graceful Degradation

If no second Coral is present, SoundClassifier.__init__ catches the delegate load failure, logs a warning, and headmic.py sets sound_classifier = None. All existing functionality is unchanged. /sounds returns 503.
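
The guard might look like this, assuming ai-edge-litert keeps the tflite-runtime-style Interpreter/load_delegate API (the exact exception types are worth verifying on the Pi):

```python
import logging

from ai_edge_litert.interpreter import Interpreter, load_delegate

def load_yamnet(model_path):
    # Returns an Edge TPU interpreter, or None when no free Coral is present
    try:
        interpreter = Interpreter(
            model_path=model_path,
            experimental_delegates=[load_delegate("libedgetpu.so.1")])
        interpreter.allocate_tensors()
        return interpreter
    except (ValueError, OSError) as exc:
        logging.warning("Edge TPU delegate unavailable, sound id disabled: %s", exc)
        return None
```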

## Hardware Setup

The second Coral should sit on a different USB bus from the oak-service Coral (bus 4, the USB3 hub). Options:

- Bus 2 (the other USB3 root hub): preferred for speed
- Bus 1 or 3 (USB2): still fine, since inference is only ~3 ms

## Dependencies

- scipy: mel spectrogram computation (FFT, mel filterbank)
- ai-edge-litert: already installed on the Pi

No new system packages. No changes to headmic.service.

## Future: Speaker Identification

Not in scope here. It would require a separate speaker-embedding model (the audio equivalent of FaceNet), an enrollment DB, and cosine matching. It can be added as a second classifier alongside YAMNet once this infrastructure is proven.