Voice-based speaker ID triggered by YAMNet speech detection. Cosine similarity matching against SQLite enrollment DB.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Speaker Identification: Resemblyzer on Pi 5 CPU
Add voice-based speaker identification to headmic. Runs on CPU alongside YAMNet sound classification — only computes embeddings when speech is detected.
Model
Resemblyzer — GE2E speaker encoder, 256-dim embeddings.
| Spec | Value |
|---|---|
| Library | resemblyzer (PyTorch-based) |
| Embedding | 256-dim float32 |
| Input | Float32 audio at 16kHz |
| Inference | ~50-100ms on Pi 5 CPU |
| Threshold | 0.75 cosine similarity |
| Trigger | Only when YAMNet detects speech |
Architecture
sound_classifier_loop (every 0.5s)
  |
  +-> YAMNet classifies audio
  |
  +-> If category == "speech":
        +-> Resemblyzer computes 256-dim embedding
        +-> Cosine similarity against enrolled voices (SQLite)
        +-> state.recognized_speaker + confidence
No new threads. Speaker ID runs inside the existing classifier thread.
Files
| File | Action | Purpose |
|---|---|---|
| `speaker_id.py` | New | `SpeakerRecognizer`: Resemblyzer encoder, SQLite DB, cosine matching |
| `headmic.py` | Modify | Integrate speaker ID into classifier loop, new endpoints, enrollment LED |
| `sound_id.py` | Modify | Return float32 audio alongside classification for speaker ID |
speaker_id.py — SpeakerRecognizer Class
class SpeakerRecognizer:
    def __init__(self, db_path="voices.db"):
        # Load Resemblyzer voice encoder
        # Init SQLite DB
        # Load embedding cache into memory
        ...

    def identify(self, audio_float32):
        # Compute 256-dim embedding
        # Cosine similarity against DB
        # Return (name, confidence) or (None, 0.0)
        ...

    def enroll(self, name, audio_float32):
        # Compute embedding, store in DB
        ...

    def list_speakers(self):
        # Return enrolled names with counts
        ...

    def delete_speaker(self, name):
        # Remove all embeddings for a name
        ...
SQLite Schema
CREATE TABLE voices (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    embedding BLOB NOT NULL,
    enrolled_at REAL NOT NULL,
    source TEXT
);
CREATE INDEX IF NOT EXISTS idx_voices_name ON voices(name);
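A minimal sketch of using this schema from Python with the stdlib `sqlite3` module. Embeddings are stored as raw float32 bytes and rebuilt with NumPy on load; `open_db`, `store_embedding`, and `load_embeddings` are illustrative helpers, not the actual `speaker_id.py` API.

```python
import sqlite3
import time

import numpy as np

SCHEMA = """
CREATE TABLE IF NOT EXISTS voices (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    embedding BLOB NOT NULL,
    enrolled_at REAL NOT NULL,
    source TEXT
);
CREATE INDEX IF NOT EXISTS idx_voices_name ON voices(name);
"""


def open_db(path=":memory:"):
    """Open (or create) the enrollment DB and ensure the schema exists."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def store_embedding(db, name, embedding, source="api"):
    """Serialize a float32 embedding to a BLOB and insert one enrollment row."""
    db.execute(
        "INSERT INTO voices (name, embedding, enrolled_at, source) VALUES (?, ?, ?, ?)",
        (name, embedding.astype(np.float32).tobytes(), time.time(), source),
    )
    db.commit()


def load_embeddings(db):
    """Load all enrollments as (name, ndarray) pairs, e.g. for an in-memory cache."""
    rows = db.execute("SELECT name, embedding FROM voices").fetchall()
    return [(name, np.frombuffer(blob, dtype=np.float32)) for name, blob in rows]
```

Storing the raw bytes keeps rows compact (1 KiB per 256-dim float32 embedding) and avoids any serialization dependency beyond NumPy.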
Matching
- Cosine similarity via dot product (Resemblyzer embeddings are L2-normalized)
- Threshold: 0.75 for positive match
- Compare against all stored embeddings, group by name, take best score per name
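The matching rules above can be sketched as follows. This is a minimal illustration of the design (dot-product cosine on normalized embeddings, best score per name, 0.75 threshold), not the actual `speaker_id.py` implementation.

```python
import numpy as np

THRESHOLD = 0.75  # minimum cosine similarity for a positive match


def identify(query, enrolled):
    """Match a query embedding against (name, embedding) pairs.

    Resemblyzer embeddings are L2-normalized, so cosine similarity
    reduces to a dot product. Returns (name, confidence) on a match,
    else (None, 0.0).
    """
    best = {}  # best score per enrolled name
    for name, emb in enrolled:
        score = float(np.dot(query, emb))
        if score > best.get(name, -1.0):
            best[name] = score
    if not best:
        return None, 0.0
    name, score = max(best.items(), key=lambda kv: kv[1])
    return (name, score) if score >= THRESHOLD else (None, 0.0)
```

Comparing against every stored embedding and keeping the best score per name means multiple enrollments of the same speaker only help: any one good sample is enough to cross the threshold.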
headmic.py Changes
Classifier Thread Update
In sound_classifier_loop(), after YAMNet classification:
if speaker_recognizer and result["category"] == "speech":
    name, confidence = speaker_recognizer.identify(audio_float32)
    state.recognized_speaker = name
    state.speaker_confidence = confidence
New API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| `/speakers/enroll` | POST | Multipart: name + audio file |
| `/speakers/enroll-from-mic` | POST | Record from live mic (5s, VAD stop) |
| `/speakers` | GET | List enrolled speakers |
| `/speakers/{name}` | DELETE | Remove a speaker |
Updated Endpoints
- `GET /sounds` — adds `recognized_speaker`, `speaker_confidence`
- `GET /status` — adds `recognized_speaker`
- `GET /health` — adds `speaker_recognition_enabled`
Enroll-from-Mic Recording
When /speakers/enroll-from-mic?name=X is called:
- Set enrollment flag + buffer
- Listener loop fills enrollment buffer for 5 seconds (VAD-based stop)
- Compute embedding from collected audio
- Store in DB
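The buffer-filling step can be sketched as below. The silence threshold (`SILENCE_RMS`) and stop count (`SILENCE_CHUNKS_TO_STOP`) are assumed values standing in for the real VAD, and `collect_enrollment_audio` is an illustrative helper, not the actual listener-loop code.

```python
import numpy as np

SAMPLE_RATE = 16_000
MAX_SECONDS = 5.0
SILENCE_RMS = 0.01          # assumed: chunks quieter than this count as silence
SILENCE_CHUNKS_TO_STOP = 4  # assumed: stop after this many consecutive quiet chunks


def collect_enrollment_audio(chunks):
    """Fill the enrollment buffer from an iterable of float32 audio chunks.

    Stops at MAX_SECONDS of audio, or earlier when a run of silent
    chunks suggests the speaker has finished (toy RMS-based VAD).
    """
    buffer, silent_run, collected = [], 0, 0
    for chunk in chunks:
        buffer.append(chunk)
        collected += len(chunk)
        rms = float(np.sqrt(np.mean(chunk ** 2)))
        silent_run = silent_run + 1 if rms < SILENCE_RMS else 0
        if collected >= MAX_SECONDS * SAMPLE_RATE:
            break
        if silent_run >= SILENCE_CHUNKS_TO_STOP:
            break
    return np.concatenate(buffer)
```

The early silence stop keeps enrollment snappy when the speaker finishes well before the 5-second cap.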
LED States
| State | Color | Animation |
|---|---|---|
| Wake word | White flash | wakeup() |
| Listening | Cyan (0x00FFFF) | think() spin |
| Processing | Purple (0x9400D3) | spin() |
| Enrolling | Orange (0xFF8C00) | think() spin |
| Idle | Off | off() |
Dependencies
- `resemblyzer` — speaker embeddings (pulls in PyTorch)
- `torch` — required by Resemblyzer (~200MB)
Graceful Degradation
If Resemblyzer/PyTorch is not installed, `speaker_recognizer` is set to `None`. All existing functionality is unchanged, and the speaker endpoints return 503.
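One way to sketch this degradation path: detect the optional dependency without importing it (importing pulls in PyTorch), and gate the speaker endpoints on the recognizer being present. `HAS_RESEMBLYZER` and `speakers_endpoint_guard` are illustrative names, not the actual headmic.py code.

```python
import importlib.util

# Detect the optional dependency without importing it (import pulls in PyTorch)
HAS_RESEMBLYZER = importlib.util.find_spec("resemblyzer") is not None

# Startup code would set this to a SpeakerRecognizer instance only when the
# dependency is available; everything else treats None as "feature disabled".
speaker_recognizer = None


def speakers_endpoint_guard():
    """HTTP status for /speakers/* endpoints: 503 while speaker ID is disabled."""
    return 200 if speaker_recognizer is not None else 503
```

Checking `find_spec` instead of wrapping the import in try/except keeps startup fast on Pis where the ~200MB torch install was skipped.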