Add design doc for speaker identification with Resemblyzer
Voice-based speaker ID triggered by YAMNet speech detection. Cosine similarity matching against SQLite enrollment DB.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/plans/2026-02-01-speaker-identification-design.md
# Speaker Identification: Resemblyzer on Pi 5 CPU

Add voice-based speaker identification to headmic. Runs on CPU alongside YAMNet sound classification — only computes embeddings when speech is detected.
## Model

**Resemblyzer** — GE2E speaker encoder, 256-dim embeddings.

| Spec | Value |
|------|-------|
| Library | `resemblyzer` (PyTorch-based) |
| Embedding | 256-dim float32 |
| Input | Float32 audio at 16kHz |
| Inference | ~50-100ms on Pi 5 CPU |
| Threshold | 0.75 cosine similarity |
| Trigger | Only when YAMNet detects speech |
## Architecture

```
sound_classifier_loop (every 0.5s)
  |
  +-> YAMNet classifies audio
  |
  +-> If category == "speech":
        +-> Resemblyzer computes 256-dim embedding
        +-> Cosine similarity against enrolled voices (SQLite)
        +-> state.recognized_speaker + confidence
```

No new threads. Speaker ID runs inside the existing classifier thread.
## Files

| File | Action | Purpose |
|------|--------|---------|
| `speaker_id.py` | New | SpeakerRecognizer: Resemblyzer encoder, SQLite DB, cosine matching |
| `headmic.py` | Modify | Integrate speaker ID into classifier loop, new endpoints, enrollment LED |
| `sound_id.py` | Modify | Return float32 audio alongside classification for speaker ID |
## speaker_id.py — SpeakerRecognizer Class

```python
class SpeakerRecognizer:
    def __init__(self, db_path="voices.db"):
        # Load Resemblyzer voice encoder
        # Init SQLite DB
        # Load embedding cache into memory
        ...

    def identify(self, audio_float32):
        # Compute 256-dim embedding
        # Cosine similarity against DB
        # Return (name, confidence) or (None, 0.0)
        ...

    def enroll(self, name, audio_float32):
        # Compute embedding, store in DB
        ...

    def list_speakers(self):
        # Return enrolled names with counts
        ...

    def delete_speaker(self, name):
        # Remove all embeddings for a name
        ...
```
### SQLite Schema

```sql
CREATE TABLE voices (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    embedding BLOB NOT NULL,
    enrolled_at REAL NOT NULL,
    source TEXT
);
CREATE INDEX IF NOT EXISTS idx_voices_name ON voices(name);
```
### Matching

- Cosine similarity via dot product (Resemblyzer embeddings are L2-normalized)
- Threshold: 0.75 for positive match
- Compare against all stored embeddings, group by name, take best score per name
## headmic.py Changes

### Classifier Thread Update

In `sound_classifier_loop()`, after YAMNet classification:

```python
if speaker_recognizer and result["category"] == "speech":
    name, confidence = speaker_recognizer.identify(audio_float32)
    state.recognized_speaker = name
    state.speaker_confidence = confidence
```
### New API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/speakers/enroll` | POST | Multipart: `name` + `audio` file |
| `/speakers/enroll-from-mic` | POST | Record from live mic (5s, VAD stop) |
| `/speakers` | GET | List enrolled speakers |
| `/speakers/{name}` | DELETE | Remove a speaker |
### Updated Endpoints

- `GET /sounds` — adds `recognized_speaker`, `speaker_confidence`
- `GET /status` — adds `recognized_speaker`
- `GET /health` — adds `speaker_recognition_enabled`
### Enroll-from-Mic Recording

When `/speakers/enroll-from-mic?name=X` is called:

1. Set enrollment flag + buffer
2. Listener loop fills enrollment buffer for 5 seconds (VAD-based stop)
3. Compute embedding from collected audio
4. Store in DB
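Step 2 can be sketched as an accumulate-until-silence loop. The RMS energy gate and its constants are assumptions, since the design does not pin down the VAD method; chunks here are plain lists of float samples standing in for mic buffers:

```python
import math

SAMPLE_RATE = 16000        # matches the 16kHz input spec above
MAX_SECONDS = 5.0          # hard cap from the enrollment design
SILENCE_RMS = 0.01         # assumed energy floor for "no speech"
SILENCE_CHUNKS_TO_STOP = 4 # assumed trailing-silence count for VAD stop

def collect_enrollment_audio(chunks):
    """Accumulate mic chunks until 5 s elapse or trailing silence stops it."""
    buf, silent_run = [], 0
    for chunk in chunks:
        buf.extend(chunk)
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        silent_run = silent_run + 1 if rms < SILENCE_RMS else 0
        if silent_run >= SILENCE_CHUNKS_TO_STOP:
            break  # VAD stop: speaker went quiet
        if len(buf) >= MAX_SECONDS * SAMPLE_RATE:
            break  # time cap reached
    return buf
```

The collected buffer then feeds `enroll(name, audio_float32)` from the class above; whichever condition fires first ends the recording.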
### LED States

| State | Color | Animation |
|-------|-------|-----------|
| Wake word | White flash | `wakeup()` |
| Listening | Cyan (0x00FFFF) | `think()` spin |
| Processing | Purple (0x9400D3) | `spin()` |
| **Enrolling** | **Orange (0xFF8C00)** | **`think()` spin** |
| Idle | Off | `off()` |
## Dependencies

- `resemblyzer` — speaker embeddings (pulls PyTorch)
- `torch` — required by Resemblyzer (~200MB)
## Graceful Degradation

If Resemblyzer/PyTorch is not installed, `speaker_recognizer = None`. All existing functionality is unchanged; speaker endpoints return 503.
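A minimal sketch of that guard using an import probe; `optional_recognizer` and the factory are illustrative names, not the real headmic startup code:

```python
import importlib.util

def optional_recognizer(factory, module="resemblyzer"):
    """Build the recognizer only if its heavy dependency is importable.

    Returns None when the module is absent, in which case the speaker
    endpoints should answer 503 and everything else runs as before."""
    if importlib.util.find_spec(module) is None:
        return None
    return factory()
```

Probing with `find_spec` avoids actually importing PyTorch at check time; the real import cost is only paid inside the factory when the dependency exists.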