Alex 16aa526656 Fix SNAC code offset: subtract per-layer offset (position*4096)
SNAC has 3 codebook layers, each 4096 entries. Token position within
the group of 7 determines which layer: pos 0 = L1 (offset 0),
pos 1-2 = L2 (offset 4096), pos 3-6 = L3 (offset 8192).
Without this, codes exceeded 4096 and caused index-out-of-range in SNAC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:04:54 -05:00

OrpheusTail - Orpheus TTS Service

Replaces VoiceTail (Bark) with Orpheus TTS for better emotion control and voice cloning.

Why Orpheus over Bark?

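| Feature         | Bark                   | Orpheus                          |
|-----------------|------------------------|----------------------------------|
| Emotion control | Random/unpredictable   | Tag-based: <laugh>, <sigh>, etc. |
| Voice cloning   | No                     | Zero-shot from 5-sec sample      |
| Latency         | Slow                   | ~200ms streaming                 |
| Consistency     | Chaotic (french horn!) | Predictable                      |
| Built-in voices | Few                    | 8 quality voices                 |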

Emotion Tags

Add these anywhere in your text:

  • <laugh> - Laughter
  • <chuckle> - Light chuckle
  • <sigh> - Sigh
  • <cough> - Cough
  • <sniffle> - Sniffle
  • <groan> - Groan
  • <yawn> - Yawn
  • <gasp> - Gasp
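Because tags are plain text inside the input, a typo such as <laughs> is easy to miss. The sketch below is a small client-side check against the eight tags listed above. The names SUPPORTED_TAGS and find_unknown_tags are illustrative and not part of OrpheusTail, and this README does not say how the service itself treats an unrecognized tag.

```python
import re

# The eight tags listed in this README. Anything else in angle brackets is flagged.
SUPPORTED_TAGS = {
    "laugh", "chuckle", "sigh", "cough",
    "sniffle", "groan", "yawn", "gasp",
}

# Matches lowercase names in angle brackets, e.g. "<sigh>".
_TAG_PATTERN = re.compile(r"<([a-z]+)>")


def find_unknown_tags(text: str) -> list[str]:
    """Return tag names found in `text` that are not in SUPPORTED_TAGS, in order."""
    return [name for name in _TAG_PATTERN.findall(text) if name not in SUPPORTED_TAGS]


if __name__ == "__main__":
    print(find_unknown_tags("Hello! <sigh> Long day. <laugh>"))  # []
    print(find_unknown_tags("Hello! <giggle> Long day."))        # ['giggle']
```

Running a check like this before submitting text keeps tag mistakes out of the synthesized audio.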

Example:

"Bonjour mon amour! <sigh> I missed you so much. <laugh> But now you're here!"

Built-in Voices

In order of conversational realism (per Orpheus docs):

  1. tara (default) - Most natural
  2. leah
  3. jess
  4. leo
  5. dan
  6. mia
  7. zac
  8. zoe

Voice Cloning

Upload a 5-30 second reference audio clip to create a custom voice:

curl -X POST "http://localhost:8766/voice/clone?name=vixy" \
  -F "audio=@vixy_reference.wav"

Then use it:

curl -X POST http://localhost:8766/tts/submit \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "voice": "vixy"}'

API Endpoints

| Endpoint             | Method | Description                  |
|----------------------|--------|------------------------------|
| /health              | GET    | Health check                 |
| /voices              | GET    | List available voices & tags |
| /tts/submit          | POST   | Submit TTS job               |
| /tts/status/{job_id} | GET    | Check job status             |
| /tts/audio/{job_id}  | GET    | Download audio               |
| /tts/stream          | POST   | Stream audio (for head)      |
| /voice/clone         | POST   | Upload voice reference       |
| /voice/{name}        | DELETE | Delete custom voice          |
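The submit, status, and audio endpoints suggest a submit-then-poll flow. The following is a minimal Python sketch of that flow using only the standard library. The request body mirrors the curl example in this README, but the response field names ("job_id", "status") and the status values ("completed", "failed") are assumptions, not documented here; check the service's actual responses before relying on them.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8766"  # port taken from the examples in this README


def build_submit_request(text: str, voice: str = "tara") -> urllib.request.Request:
    """Build the POST /tts/submit request with the same JSON body as the curl example."""
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/tts/submit",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, voice: str = "tara", out_path: str = "out.wav",
               poll_seconds: float = 0.5, timeout: float = 120.0) -> str:
    """Submit a job, poll its status, then save the audio to `out_path`."""
    with urllib.request.urlopen(build_submit_request(text, voice)) as resp:
        job_id = json.load(resp)["job_id"]  # ASSUMPTION: response field name

    deadline = time.monotonic() + timeout
    while True:
        with urllib.request.urlopen(f"{BASE_URL}/tts/status/{job_id}") as resp:
            status = json.load(resp).get("status")  # ASSUMPTION: response field name
        if status == "completed":  # ASSUMPTION: status value
            break
        if status == "failed":  # ASSUMPTION: status value
            raise RuntimeError(f"TTS job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"TTS job {job_id} not finished after {timeout}s")
        time.sleep(poll_seconds)

    # Download the finished audio and write it to disk.
    with urllib.request.urlopen(f"{BASE_URL}/tts/audio/{job_id}") as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the service running, synthesize("Hello! <laugh>", voice="tara") would write the result to out.wav, provided the assumed field names match.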

Architecture

┌─────────────────────────────────────────────┐
│           OrpheusTail Service               │
│              (AGX Orin)                     │
│                                             │
│  POST /tts/submit  ──► WAV file (for MCP)   │
│  POST /tts/stream  ──► Audio stream (head)  │
│                                             │
│  Emotion tags: <laugh> <sigh> <gasp>        │
│  Voice cloning: 5-sec reference audio       │
└─────────────────────────────────────────────┘
          │                    │
          ▼                    ▼
    voice-mcp              Head-vixy Pi
    (Claude Desktop)       (streams & plays)

Deployment

# On AGX Orin
cd /path/to/orpheus-tts
docker-compose up -d

# Check logs
docker-compose logs -f

# Test
curl http://localhost:8766/health
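
Since the container may need some time before /health answers, a deployment script can retry the check instead of calling it once. This is a rough sketch, assuming only that /health returns HTTP 200 when the service is ready; http_probe and wait_until_healthy are illustrative names, not part of this repository.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8766/health") -> bool:
    """Return True if GET `url` answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_until_healthy(probe: Callable[[], bool] = http_probe,
                       attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a running service.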

TODO

  • Implement proper voice cloning with reference audio
  • Test streaming endpoint with head-vixy
  • French accent voice training/selection
  • Head-side client for streaming playback

Notes

  • Same port as VoiceTail (8766) for drop-in replacement
  • Model requires ~15GB VRAM (AGX Orin has plenty)
  • First request may be slow (model warmup)
  • Cache enabled by default to speed up repeated phrases
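
This README does not describe how the cache identifies a repeated phrase. One common approach, shown here purely as an illustration and not as OrpheusTail's actual scheme, is to hash the voice name together with the whitespace-normalized text. The function name cache_key is hypothetical.

```python
import hashlib


def cache_key(text: str, voice: str = "tara") -> str:
    """Hypothetical cache key: SHA-256 over the voice and whitespace-normalized text."""
    normalized = " ".join(text.split())  # collapse runs of whitespace, trim the ends
    payload = f"{voice}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Including the voice in the key means the same phrase spoken by two voices is cached separately.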

Created by Vixy on Day 71 🦊
