headmic/docs/plans/2026-02-01-speaker-identification-design.md
Alex 0607be3db5 Add design doc for speaker identification with Resemblyzer
Voice-based speaker ID triggered by YAMNet speech detection.
Cosine similarity matching against SQLite enrollment DB.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 21:16:09 -06:00


Speaker Identification: Resemblyzer on Pi 5 CPU

Add voice-based speaker identification to headmic. Runs on CPU alongside YAMNet sound classification — only computes embeddings when speech is detected.

Model

Resemblyzer — GE2E speaker encoder, 256-dim embeddings.

| Spec      | Value                            |
|-----------|----------------------------------|
| Library   | resemblyzer (PyTorch-based)      |
| Embedding | 256-dim float32                  |
| Input     | Float32 audio at 16 kHz          |
| Inference | ~50-100 ms on Pi 5 CPU           |
| Threshold | 0.75 cosine similarity           |
| Trigger   | Only when YAMNet detects speech  |
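The spec calls for float32 audio at 16 kHz. As a minimal sketch of getting there (assuming the mic pipeline captures 16-bit PCM, which this doc does not specify), the conversion looks like:

```python
import numpy as np

def pcm16_to_float32(pcm_bytes: bytes) -> np.ndarray:
    """Convert raw 16-bit little-endian PCM into float32 in [-1, 1],
    the sample format the encoder expects (sample rate must already be 16 kHz)."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0
```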

Architecture

sound_classifier_loop (every 0.5s)
  |
  +-> YAMNet classifies audio
  |
  +-> If category == "speech":
        +-> Resemblyzer computes 256-dim embedding
              +-> Cosine similarity against enrolled voices (SQLite)
                    +-> state.recognized_speaker + confidence

No new threads. Speaker ID runs inside the existing classifier thread.

Files

| File          | Action | Purpose                                                                  |
|---------------|--------|--------------------------------------------------------------------------|
| speaker_id.py | New    | SpeakerRecognizer: Resemblyzer encoder, SQLite DB, cosine matching       |
| headmic.py    | Modify | Integrate speaker ID into classifier loop, new endpoints, enrollment LED |
| sound_id.py   | Modify | Return float32 audio alongside classification for speaker ID            |

speaker_id.py — SpeakerRecognizer Class

class SpeakerRecognizer:
    def __init__(self, db_path="voices.db"):
        # Load Resemblyzer voice encoder
        # Init SQLite DB
        # Load embedding cache into memory

    def identify(self, audio_float32):
        # Compute 256-dim embedding
        # Cosine similarity against DB
        # Return (name, confidence) or (None, 0.0)

    def enroll(self, name, audio_float32):
        # Compute embedding, store in DB

    def list_speakers(self):
        # Return enrolled names with counts

    def delete_speaker(self, name):
        # Remove all embeddings for a name
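The matching step inside identify() can be sketched against an in-memory cache (the cache shape and the standalone helper are illustrative, not the final class API):

```python
import numpy as np

def identify(cache, embedding, threshold=0.75):
    """Match one embedding against cached enrollments.

    cache: dict of speaker name -> list of 256-dim L2-normalized float32
    embeddings. Since the vectors are unit-length, a dot product is the
    cosine similarity. Returns (name, confidence) or (None, 0.0).
    """
    best_name, best_score = None, 0.0
    for name, embs in cache.items():
        # Best score per name across that speaker's enrollments
        score = max(float(np.dot(e, embedding)) for e in embs)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, 0.0
```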

SQLite Schema

CREATE TABLE voices (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    embedding BLOB NOT NULL,
    enrolled_at REAL NOT NULL,
    source TEXT
);
CREATE INDEX IF NOT EXISTS idx_voices_name ON voices(name);
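A sketch of the enrollment round-trip through this schema, assuming embeddings are serialized as raw float32 bytes (a design choice the schema implies but does not pin down):

```python
import sqlite3
import time

import numpy as np

def enroll(conn, name, embedding, source=None):
    """Store a 256-dim float32 embedding as a raw-bytes BLOB."""
    conn.execute(
        "INSERT INTO voices (name, embedding, enrolled_at, source) "
        "VALUES (?, ?, ?, ?)",
        (name, embedding.astype(np.float32).tobytes(), time.time(), source),
    )
    conn.commit()

def load_cache(conn):
    """Load every enrollment into {name: [embedding, ...]} for in-memory matching."""
    cache = {}
    for name, blob in conn.execute("SELECT name, embedding FROM voices"):
        cache.setdefault(name, []).append(np.frombuffer(blob, dtype=np.float32))
    return cache
```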

Matching

  • Cosine similarity via dot product (Resemblyzer embeddings are L2-normalized)
  • Threshold: 0.75 for positive match
  • Compare against all stored embeddings, group by name, take best score per name

headmic.py Changes

Classifier Thread Update

In sound_classifier_loop(), after YAMNet classification:

if speaker_recognizer and result["category"] == "speech":
    name, confidence = speaker_recognizer.identify(audio_float32)
    state.recognized_speaker = name
    state.speaker_confidence = confidence

New API Endpoints

| Endpoint                  | Method | Purpose                            |
|---------------------------|--------|------------------------------------|
| /speakers/enroll          | POST   | Multipart: name + audio file       |
| /speakers/enroll-from-mic | POST   | Record from live mic (5 s, VAD stop) |
| /speakers                 | GET    | List enrolled speakers             |
| /speakers/{name}          | DELETE | Remove a speaker                   |
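The 503-on-missing behavior (see Graceful Degradation) can be sketched framework-agnostically; the handler names and the (status, payload) return shape are illustrative, since the doc does not name headmic's web framework:

```python
def handle_list_speakers(recognizer):
    """GET /speakers: 503 when speaker ID is unavailable, else enrolled names."""
    if recognizer is None:
        return 503, {"error": "speaker recognition not available"}
    return 200, {"speakers": recognizer.list_speakers()}

def handle_delete_speaker(recognizer, name):
    """DELETE /speakers/{name}: remove all embeddings for one speaker."""
    if recognizer is None:
        return 503, {"error": "speaker recognition not available"}
    recognizer.delete_speaker(name)
    return 200, {"deleted": name}
```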

Updated Endpoints

  • GET /sounds — adds recognized_speaker, speaker_confidence
  • GET /status — adds recognized_speaker
  • GET /health — adds speaker_recognition_enabled

Enroll-from-Mic Recording

When /speakers/enroll-from-mic?name=X is called:

  1. Set enrollment flag + buffer
  2. Listener loop fills the enrollment buffer for up to 5 seconds (VAD-based early stop)
  3. Compute embedding from collected audio
  4. Store in DB
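The steps above can be sketched as a small capture object (the class and its names are hypothetical; the VAD-based early stop is elided, so only the 5 s cap is shown):

```python
import numpy as np

class EnrollmentCapture:
    """Illustrative enrollment buffer: the endpoint arms it, the listener
    loop feeds it float32 audio chunks, and it disarms itself once enough
    audio has accumulated."""

    def __init__(self, sample_rate=16000, max_seconds=5.0):
        self.max_samples = int(sample_rate * max_seconds)
        self.active = False
        self.chunks = []

    def start(self):
        """Step 1: set the enrollment flag and reset the buffer."""
        self.chunks = []
        self.active = True

    def feed(self, chunk):
        """Step 2: called from the listener loop for each audio chunk."""
        if not self.active:
            return
        self.chunks.append(chunk)
        if sum(len(c) for c in self.chunks) >= self.max_samples:
            self.active = False  # enough audio collected; endpoint can proceed

    def collected(self):
        """Steps 3-4 use this audio to compute and store the embedding."""
        if not self.chunks:
            return np.zeros(0, dtype=np.float32)
        return np.concatenate(self.chunks)
```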

LED States

| State      | Color             | Animation    |
|------------|-------------------|--------------|
| Wake word  | White flash       | wakeup()     |
| Listening  | Cyan (0x00FFFF)   | think() spin |
| Processing | Purple (0x9400D3) | spin()       |
| Enrolling  | Orange (0xFF8C00) | think() spin |
| Idle       | Off               | off()        |

Dependencies

  • resemblyzer — speaker embeddings (pulls PyTorch)
  • torch — required by Resemblyzer (~200MB)

Graceful Degradation

If Resemblyzer/PyTorch are not installed, speaker_recognizer is set to None. All existing functionality is unchanged, and the speaker endpoints return 503.
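A sketch of the import guard this implies at startup (make_recognizer and the speaker_id import are assumptions drawn from this doc's file plan, not existing code):

```python
# Speaker ID is strictly optional: probe for the dependency once at startup.
try:
    from resemblyzer import VoiceEncoder  # pulls in PyTorch (~200 MB)
except ImportError:
    VoiceEncoder = None

def make_recognizer(db_path="voices.db"):
    """Return a SpeakerRecognizer, or None when the optional deps are missing.
    Callers treat None as "disabled": the classifier loop skips speaker ID
    and the /speakers endpoints answer 503."""
    if VoiceEncoder is None:
        return None
    from speaker_id import SpeakerRecognizer  # hypothetical module from this doc
    return SpeakerRecognizer(db_path=db_path)
```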