# Speaker Identification: Resemblyzer on Pi 5 CPU

Add voice-based speaker identification to headmic. It runs on the CPU alongside YAMNet sound classification, and computes embeddings only when speech is detected.

## Model

**Resemblyzer**: a GE2E speaker encoder producing 256-dim embeddings.

| Spec | Value |
|------|-------|
| Library | `resemblyzer` (PyTorch-based) |
| Embedding | 256-dim float32 |
| Input | Float32 audio at 16 kHz |
| Inference | ~50-100 ms on Pi 5 CPU |
| Threshold | 0.75 cosine similarity |
| Trigger | Only when YAMNet detects speech |

## Architecture

```
sound_classifier_loop (every 0.5s)
  |
  +-> YAMNet classifies audio
  |
  +-> If category == "speech":
        +-> Resemblyzer computes 256-dim embedding
        +-> Cosine similarity against enrolled voices (SQLite)
        +-> state.recognized_speaker + confidence
```

No new threads: speaker ID runs inside the existing classifier thread.

## Files

| File | Action | Purpose |
|------|--------|---------|
| `speaker_id.py` | New | `SpeakerRecognizer`: Resemblyzer encoder, SQLite DB, cosine matching |
| `headmic.py` | Modify | Integrate speaker ID into the classifier loop, new endpoints, enrollment LED |
| `sound_id.py` | Modify | Return float32 audio alongside classification for speaker ID |

## speaker_id.py: SpeakerRecognizer Class

```python
class SpeakerRecognizer:
    def __init__(self, db_path="voices.db"):
        # Load the Resemblyzer voice encoder, init the SQLite DB,
        # and load the embedding cache into memory.
        ...

    def identify(self, audio_float32):
        # Compute a 256-dim embedding, run cosine similarity against
        # the DB, and return (name, confidence) or (None, 0.0).
        ...

    def enroll(self, name, audio_float32):
        # Compute an embedding and store it in the DB.
        ...

    def list_speakers(self):
        # Return enrolled names with embedding counts.
        ...

    def delete_speaker(self, name):
        # Remove all embeddings for a name.
        ...
```

### SQLite Schema

```sql
CREATE TABLE voices (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    embedding BLOB NOT NULL,
    enrolled_at REAL NOT NULL,
    source TEXT
);
CREATE INDEX IF NOT EXISTS idx_voices_name ON voices(name);
```

### Matching

- Cosine similarity via dot product (Resemblyzer embeddings are L2-normalized)
- Threshold: 0.75 for a positive match
- Compare against all stored embeddings, group by name, and take the best score per name

## headmic.py Changes

### Classifier Thread Update

In `sound_classifier_loop()`, after YAMNet classification:

```python
if speaker_recognizer and result["category"] == "speech":
    name, confidence = speaker_recognizer.identify(audio_float32)
    state.recognized_speaker = name
    state.speaker_confidence = confidence
```

### New API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/speakers/enroll` | POST | Multipart: `name` + `audio` file |
| `/speakers/enroll-from-mic` | POST | Record from the live mic (5 s, VAD stop) |
| `/speakers` | GET | List enrolled speakers |
| `/speakers/{name}` | DELETE | Remove a speaker |

### Updated Endpoints

- `GET /sounds`: adds `recognized_speaker`, `speaker_confidence`
- `GET /status`: adds `recognized_speaker`
- `GET /health`: adds `speaker_recognition_enabled`

### Enroll-from-Mic Recording

When `/speakers/enroll-from-mic?name=X` is called:

1. Set the enrollment flag and buffer
2. The listener loop fills the enrollment buffer for up to 5 seconds (VAD-based stop)
3. Compute an embedding from the collected audio
4. Store it in the DB

### LED States

| State | Color | Animation |
|-------|-------|-----------|
| Wake word | White flash | `wakeup()` |
| Listening | Cyan (0x00FFFF) | `think()` spin |
| Processing | Purple (0x9400D3) | `spin()` |
| **Enrolling** | **Orange (0xFF8C00)** | **`think()` spin** |
| Idle | Off | `off()` |

## Dependencies

- `resemblyzer`: speaker embeddings (pulls in PyTorch)
- `torch`: required by Resemblyzer (~200 MB)

## Graceful Degradation

If Resemblyzer/PyTorch is not installed, `speaker_recognizer = None`. All existing functionality is unchanged, and the speaker endpoints return 503.
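The matching rules described above (dot-product similarity on L2-normalized embeddings, best score per enrolled name, 0.75 threshold) can be sketched as a pure function. This is an illustrative sketch, not the actual `speaker_id.py` implementation; `match_speaker` and its signature are assumed names, and embeddings are shown as plain float lists for clarity:

```python
MATCH_THRESHOLD = 0.75  # cosine similarity required for a positive match

def match_speaker(embedding, enrolled, threshold=MATCH_THRESHOLD):
    """Return (name, confidence) for the best-matching enrolled speaker.

    embedding: L2-normalized query vector (256-dim in the real system).
    enrolled:  iterable of (name, embedding) pairs loaded from the DB.
    """
    best_per_name = {}
    for name, emb in enrolled:
        # Embeddings are L2-normalized, so the dot product IS the cosine similarity.
        score = sum(a * b for a, b in zip(embedding, emb))
        # Group by name and keep only each speaker's best-scoring embedding.
        if name not in best_per_name or score > best_per_name[name]:
            best_per_name[name] = score
    if not best_per_name:
        return None, 0.0
    name, score = max(best_per_name.items(), key=lambda kv: kv[1])
    return (name, score) if score >= threshold else (None, score)
```

Returning the raw score even on a failed match lets callers report "near misses" (e.g. for logging or tuning the threshold).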
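The SQLite schema stores embeddings as BLOBs. One plausible serialization (an assumption; the real `speaker_id.py` may differ) is raw float32 bytes via the stdlib `array` module, which round-trips cleanly and keeps rows compact. The helper names below are illustrative:

```python
import sqlite3
import time
from array import array

SCHEMA = """
CREATE TABLE IF NOT EXISTS voices (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    embedding BLOB NOT NULL,
    enrolled_at REAL NOT NULL,
    source TEXT
);
CREATE INDEX IF NOT EXISTS idx_voices_name ON voices(name);
"""

def enroll(conn, name, embedding, source="upload"):
    # Serialize the embedding as raw float32 bytes and insert one row per sample.
    blob = array("f", embedding).tobytes()
    conn.execute(
        "INSERT INTO voices (name, embedding, enrolled_at, source) VALUES (?, ?, ?, ?)",
        (name, blob, time.time(), source),
    )
    conn.commit()

def load_cache(conn):
    # Load every enrolled embedding into memory as (name, [float, ...]) pairs,
    # matching the in-memory cache the recognizer keeps for fast matching.
    rows = conn.execute("SELECT name, embedding FROM voices").fetchall()
    return [(name, list(array("f", blob))) for name, blob in rows]

def list_speakers(conn):
    # Enrolled names with how many embedding samples each has.
    rows = conn.execute("SELECT name, COUNT(*) FROM voices GROUP BY name").fetchall()
    return dict(rows)
```

Note that `array("f")` uses native byte order, which is fine for a single-machine DB on the Pi; a portable format would pin endianness explicitly.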
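The graceful-degradation behavior amounts to a guarded import at startup. A minimal sketch, assuming `SpeakerRecognizer` lives in a `speaker_id` module as described above:

```python
# If resemblyzer/torch (or speaker_id itself) is missing, disable speaker ID
# but leave every existing headmic feature untouched.
try:
    from speaker_id import SpeakerRecognizer  # importing this pulls in resemblyzer + torch
    speaker_recognizer = SpeakerRecognizer()
except ImportError:
    speaker_recognizer = None  # /speakers/* endpoints should respond 503 in this state
```

The classifier loop and the `/speakers/*` handlers then only need a `if speaker_recognizer is None` check rather than scattering import guards through the code.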