From 0607be3db51ea1d3833e999c376ce8eb11e6d0cf Mon Sep 17 00:00:00 2001
From: Alex
Date: Sun, 1 Feb 2026 21:16:09 -0600
Subject: [PATCH] Add design doc for speaker identification with Resemblyzer

Voice-based speaker ID triggered by YAMNet speech detection. Cosine
similarity matching against SQLite enrollment DB.

Co-Authored-By: Claude Opus 4.5
---
 ...026-02-01-speaker-identification-design.md | 137 ++++++++++++++++++
 1 file changed, 137 insertions(+)
 create mode 100644 docs/plans/2026-02-01-speaker-identification-design.md

diff --git a/docs/plans/2026-02-01-speaker-identification-design.md b/docs/plans/2026-02-01-speaker-identification-design.md
new file mode 100644
index 0000000..6a8d4fd
--- /dev/null
+++ b/docs/plans/2026-02-01-speaker-identification-design.md
@@ -0,0 +1,137 @@

# Speaker Identification: Resemblyzer on Pi 5 CPU

Add voice-based speaker identification to headmic. It runs on the CPU alongside YAMNet sound classification, and embeddings are computed only when speech is detected.

## Model

**Resemblyzer** — GE2E speaker encoder producing 256-dim embeddings.

| Spec | Value |
|------|-------|
| Library | `resemblyzer` (PyTorch-based) |
| Embedding | 256-dim float32 |
| Input | float32 audio at 16 kHz |
| Inference | ~50-100 ms on Pi 5 CPU |
| Threshold | 0.75 cosine similarity |
| Trigger | Only when YAMNet detects speech |

## Architecture

```
sound_classifier_loop (every 0.5s)
  |
  +-> YAMNet classifies audio
  |
  +-> If category == "speech":
        +-> Resemblyzer computes 256-dim embedding
        +-> Cosine similarity against enrolled voices (SQLite)
        +-> state.recognized_speaker + confidence
```

No new threads. Speaker ID runs inside the existing classifier thread.
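The matching step in the diagram above can be sketched as follows. This is a minimal illustration, not the headmic implementation: the `identify` function and the `(name, embedding)` list shape are hypothetical, and it assumes embeddings are already L2-normalized (as Resemblyzer's are), so a plain dot product gives cosine similarity.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75  # cosine score required for a positive match

def identify(query, enrolled):
    """Match a query embedding against enrolled embeddings.

    query: L2-normalized 256-dim vector.
    enrolled: list of (name, embedding) pairs, embeddings L2-normalized.
    Returns (name, confidence) on a match, else (None, 0.0).
    """
    best_per_name = {}
    for name, emb in enrolled:
        # Dot product equals cosine similarity for unit-length vectors.
        score = float(np.dot(query, emb))
        best_per_name[name] = max(score, best_per_name.get(name, -1.0))
    if not best_per_name:
        return None, 0.0
    name, score = max(best_per_name.items(), key=lambda kv: kv[1])
    if score >= SIMILARITY_THRESHOLD:
        return name, score
    return None, 0.0
```

Grouping by name before thresholding means a speaker with several enrollments is scored by their best-matching sample, which is tolerant of one noisy enrollment.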

## Files

| File | Action | Purpose |
|------|--------|---------|
| `speaker_id.py` | New | SpeakerRecognizer: Resemblyzer encoder, SQLite DB, cosine matching |
| `headmic.py` | Modify | Integrate speaker ID into classifier loop, new endpoints, enrollment LED |
| `sound_id.py` | Modify | Return float32 audio alongside classification for speaker ID |

## speaker_id.py — SpeakerRecognizer Class

```python
class SpeakerRecognizer:
    def __init__(self, db_path="voices.db"):
        """Load the Resemblyzer voice encoder, initialize the SQLite DB,
        and load the embedding cache into memory."""

    def identify(self, audio_float32):
        """Compute a 256-dim embedding, score it against the DB by cosine
        similarity, and return (name, confidence) or (None, 0.0)."""

    def enroll(self, name, audio_float32):
        """Compute an embedding and store it in the DB."""

    def list_speakers(self):
        """Return enrolled names with embedding counts."""

    def delete_speaker(self, name):
        """Remove all embeddings for a name."""
```

### SQLite Schema

```sql
CREATE TABLE IF NOT EXISTS voices (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    embedding BLOB NOT NULL,
    enrolled_at REAL NOT NULL,
    source TEXT
);
CREATE INDEX IF NOT EXISTS idx_voices_name ON voices(name);
```

### Matching

- Cosine similarity via dot product (Resemblyzer embeddings are L2-normalized)
- Threshold: 0.75 for a positive match
- Compare against all stored embeddings, group by name, and take the best score per name

## headmic.py Changes

### Classifier Thread Update

In `sound_classifier_loop()`, after YAMNet classification:

```python
if speaker_recognizer and result["category"] == "speech":
    name, confidence = speaker_recognizer.identify(audio_float32)
    state.recognized_speaker = name
    state.speaker_confidence = confidence
```

### New API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/speakers/enroll` | POST | Multipart: `name` + `audio` file |
| `/speakers/enroll-from-mic` | POST | Record from live mic (5s, VAD stop) |
| `/speakers` | GET | List enrolled speakers |
| `/speakers/{name}` | DELETE | Remove a speaker |

### Updated Endpoints

- `GET /sounds` — adds `recognized_speaker`, `speaker_confidence`
- `GET /status` — adds `recognized_speaker`
- `GET /health` — adds `speaker_recognition_enabled`

### Enroll-from-Mic Recording

When `/speakers/enroll-from-mic?name=X` is called:
1. Set the enrollment flag and buffer
2. Listener loop fills the enrollment buffer until 5 seconds elapse or VAD detects the end of speech
3. Compute an embedding from the collected audio
4. Store it in the DB

### LED States

| State | Color | Animation |
|-------|-------|-----------|
| Wake word | White flash | `wakeup()` |
| Listening | Cyan (0x00FFFF) | `think()` spin |
| Processing | Purple (0x9400D3) | `spin()` |
| **Enrolling** | **Orange (0xFF8C00)** | **`think()` spin** |
| Idle | Off | `off()` |

## Dependencies

- `resemblyzer` — speaker embeddings (pulls PyTorch)
- `torch` — required by Resemblyzer (~200MB)

## Graceful Degradation

If Resemblyzer/PyTorch is not installed, `speaker_recognizer` is set to `None`. All existing functionality is unchanged, and the speaker endpoints return 503.
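
The degradation path can be sketched with a guarded import. This is illustrative only: it assumes `speaker_id` is the new module named in the Files table, and `speakers_endpoint_guard` is a hypothetical helper standing in for the real endpoint handlers.

```python
# Optional dependency: speaker ID degrades to "disabled" instead of crashing.
try:
    from speaker_id import SpeakerRecognizer  # pulls resemblyzer -> torch
    speaker_recognizer = SpeakerRecognizer()
except ImportError:
    speaker_recognizer = None  # endpoints check this flag

def speakers_endpoint_guard():
    """Hypothetical helper: HTTP status to return from speaker endpoints."""
    if speaker_recognizer is None:
        return 503  # Service Unavailable: speaker recognition not installed
    return 200
```

Because the guard catches `ImportError` rather than probing for the packages individually, a missing `torch` (which `resemblyzer` imports internally) disables the feature the same way a missing `resemblyzer` does.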