Alex d650fd06b9 OrpheusTail v2: transformers streaming engine (replaces vLLM)
Major rework: replaced vLLM sync LLM with HuggingFace transformers
+ TextIteratorStreamer for true token-level streaming.

Pipeline: text → format_prompt → model.generate(streamer) →
extract_audio_codes (regex on streaming text) → SNAC decode → PCM

Expected first-audio latency: ~1-2s (was 10-14s with vLLM).
No more monkey-patching, no more AsyncLLMEngine hangs on Jetson.
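The streaming path above can be sketched as follows. This is an illustrative sketch only: the audio-code token format (`<custom_token_N>`), the regex, and the 7-codes-per-frame grouping for SNAC are assumptions, not the service's actual values, and the model/tokenizer handles are left to the caller.

```python
import re
import threading

# Hypothetical token format for Orpheus audio codes in the generated text.
AUDIO_CODE_RE = re.compile(r"<custom_token_(\d+)>")

def extract_audio_codes(text: str) -> list[int]:
    """Pull numeric audio-code IDs out of streamed generation text."""
    return [int(m) for m in AUDIO_CODE_RE.findall(text)]

def stream_tts(model, tokenizer, prompt: str, frame: int = 7):
    """Yield audio-code frames as the model generates them.

    generate() runs on a background thread; TextIteratorStreamer hands
    decoded text chunks back to this thread as tokens arrive.
    """
    from transformers import TextIteratorStreamer  # lazy: heavy dependency

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    thread = threading.Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 2048},
    )
    thread.start()

    pending: list[int] = []
    for chunk in streamer:
        pending.extend(extract_audio_codes(chunk))
        # Assumed grouping: SNAC decodes fixed-size code frames.
        while len(pending) >= frame:
            yield pending[:frame]
            pending = pending[frame:]
    thread.join()
```

Each yielded frame would then go through the SNAC decoder to PCM, which is what keeps first-audio latency low: playback can start after the first frame instead of after the full generation.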

SNAC model loaded separately (snac_24khz) for audio decoding.
All endpoints preserved, API compatible with v1.
Voice cloning endpoint now honest about LoRA requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 08:38:30 -05:00

OrpheusTail - Orpheus TTS Service

Replaces VoiceTail (Bark) with Orpheus TTS for better emotion control and voice cloning.

Why Orpheus over Bark?

Feature          Bark                    Orpheus
Emotion control  Random/unpredictable    Tag-based: <laugh>, <sigh>, etc.
Voice cloning    No                      Zero-shot from 5-sec sample
Latency          Slow                    ~200ms streaming
Consistency      Chaotic (French horn!)  Predictable
Built-in voices  Few                     8 quality voices

Emotion Tags

Add these anywhere in your text:

  • <laugh> - Laughter
  • <chuckle> - Light chuckle
  • <sigh> - Sigh
  • <cough> - Cough
  • <sniffle> - Sniffle
  • <groan> - Groan
  • <yawn> - Yawn
  • <gasp> - Gasp

Example:

"Bonjour mon amour! <sigh> I missed you so much. <laugh> But now you're here!"
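A small client-side helper can catch typoed tags before submitting a job. The tag names come from the list above; the helper itself is hypothetical, not part of the service.

```python
import re

# Supported emotion tags, per the list above.
EMOTION_TAGS = {"laugh", "chuckle", "sigh", "cough",
                "sniffle", "groan", "yawn", "gasp"}

TAG_RE = re.compile(r"<(\w+)>")

def unknown_tags(text: str) -> set[str]:
    """Return any <tag> in the text that isn't a known emotion tag."""
    return {t for t in TAG_RE.findall(text) if t not in EMOTION_TAGS}
```

For example, `unknown_tags("Hi there <laugh> <shrug>")` returns `{"shrug"}`, while the example sentence above passes cleanly.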

Built-in Voices

In order of conversational realism (per Orpheus docs):

  1. tara (default) - Most natural
  2. leah
  3. jess
  4. leo
  5. dan
  6. mia
  7. zac
  8. zoe

Voice Cloning

Upload a 5-30 second reference audio clip to create a custom voice:

curl -X POST "http://localhost:8766/voice/clone?name=vixy" \
  -F "audio=@vixy_reference.wav"

Then use it:

curl -X POST http://localhost:8766/tts/submit \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "voice": "vixy"}'

API Endpoints

Endpoint              Method  Description
/health               GET     Health check
/voices               GET     List available voices & tags
/tts/submit           POST    Submit TTS job
/tts/status/{job_id}  GET     Check job status
/tts/audio/{job_id}   GET     Download audio
/tts/stream           POST    Stream audio (for head)
/voice/clone          POST    Upload voice reference
/voice/{name}         DELETE  Delete custom voice
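The submit/status/audio endpoints form a simple job-polling flow, sketched below with only the standard library. The base URL matches the examples above; the `job_id` and `status` response fields and the polling interval are assumptions about the API's JSON shape.

```python
import json
import time
import urllib.request

BASE = "http://localhost:8766"

def submit(text: str, voice: str = "tara") -> str:
    """POST /tts/submit and return the job id from the response."""
    req = urllib.request.Request(
        f"{BASE}/tts/submit",
        data=json.dumps({"text": text, "voice": voice}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]  # assumed response field

def wait_for_audio(job_id: str, poll: float = 0.5) -> bytes:
    """Poll /tts/status/{job_id} until done, then download the audio."""
    while True:
        with urllib.request.urlopen(f"{BASE}/tts/status/{job_id}") as resp:
            if json.load(resp).get("status") == "done":  # assumed field
                break
        time.sleep(poll)
    with urllib.request.urlopen(f"{BASE}/tts/audio/{job_id}") as resp:
        return resp.read()

if __name__ == "__main__":
    job = submit("Hello! <laugh>", voice="tara")
    with open("out.wav", "wb") as f:
        f.write(wait_for_audio(job))
```

For latency-sensitive callers (like the head), /tts/stream avoids the polling round-trips entirely.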

Architecture

┌─────────────────────────────────────────────┐
│           OrpheusTail Service               │
│              (AGX Orin)                     │
│                                             │
│  POST /tts/submit  ──► WAV file (for MCP)   │
│  POST /tts/stream  ──► Audio stream (head)  │
│                                             │
│  Emotion tags: <laugh> <sigh> <whisper>     │
│  Voice cloning: 5-sec reference audio       │
└─────────────────────────────────────────────┘
          │                    │
          ▼                    ▼
    voice-mcp              Head-vixy Pi
    (Claude Desktop)       (streams & plays)

Deployment

# On AGX Orin
cd /path/to/orpheus-tts
docker-compose up -d

# Check logs
docker-compose logs -f

# Test
curl http://localhost:8766/health

TODO

  • Implement proper voice cloning with reference audio
  • Test streaming endpoint with head-vixy
  • French accent voice training/selection
  • Head-side client for streaming playback

Notes

  • Same port as VoiceTail (8766) for drop-in replacement
  • Model requires ~15GB VRAM (AGX Orin has plenty)
  • First request may be slow (model warmup)
  • Cache enabled by default to speed up repeated phrases
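One way the phrase cache could be keyed is a stable hash over the inputs that determine the output audio. This is illustrative; the service's actual cache scheme isn't documented here.

```python
import hashlib

def cache_key(text: str, voice: str) -> str:
    """Stable key: the same text + voice pair maps to the same cached WAV."""
    return hashlib.sha256(f"{voice}\x00{text}".encode()).hexdigest()
```

Repeated phrases in the same voice then hit the cached file instead of re-running generation, while the same text in a different voice gets its own entry.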

Created by Vixy on Day 71 🦊
