Alex d650fd06b9 OrpheusTail v2: transformers streaming engine (replaces vLLM)
Major rework: replaced vLLM sync LLM with HuggingFace transformers
+ TextIteratorStreamer for true token-level streaming.

Pipeline: text → format_prompt → model.generate(streamer) →
extract_audio_codes (regex on streaming text) → SNAC decode → PCM

Expected first-audio latency: ~1-2s (was 10-14s with vLLM).
No more monkey-patching, no more AsyncLLMEngine hangs on Jetson.
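The streaming path above can be sketched as follows. This is an illustrative sketch only: the audio-code token format (`<custom_token_N>`), the regex, and the 7-codes-per-frame grouping for SNAC are assumptions, not the service's actual values, and the model/tokenizer handles are left to the caller.

```python
import re
import threading

# Hypothetical token format for Orpheus audio codes in the generated text.
AUDIO_CODE_RE = re.compile(r"<custom_token_(\d+)>")

def extract_audio_codes(text: str) -> list[int]:
    """Pull numeric audio-code IDs out of streamed generation text."""
    return [int(m) for m in AUDIO_CODE_RE.findall(text)]

def stream_tts(model, tokenizer, prompt: str, frame: int = 7):
    """Yield audio-code frames as the model generates them.

    generate() runs on a background thread; TextIteratorStreamer hands
    decoded text chunks back to this thread as tokens arrive.
    """
    from transformers import TextIteratorStreamer  # lazy: heavy dependency

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    thread = threading.Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 2048},
    )
    thread.start()

    pending: list[int] = []
    for chunk in streamer:
        pending.extend(extract_audio_codes(chunk))
        # Assumed grouping: SNAC decodes fixed-size code frames.
        while len(pending) >= frame:
            yield pending[:frame]
            pending = pending[frame:]
    thread.join()
```

Each yielded frame would then go through the SNAC decoder to PCM, which is what keeps first-audio latency low: playback can start after the first frame instead of after the full generation.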

SNAC model loaded separately (snac_24khz) for audio decoding.
All endpoints preserved, API compatible with v1.
Voice cloning endpoint now honest about LoRA requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 08:38:30 -05:00

OrpheusTail - Orpheus TTS Service

Replaces VoiceTail (Bark) with Orpheus TTS for better emotion control and voice cloning.

Why Orpheus over Bark?

Feature          Bark                    Orpheus
Emotion control  Random/unpredictable    Tag-based: <laugh>, <sigh>, etc.
Voice cloning    No                      Zero-shot from 5-sec sample
Latency          Slow                    ~200ms streaming
Consistency      Chaotic (French horn!)  Predictable
Built-in voices  Few                     8 quality voices

Emotion Tags

Add these anywhere in your text:

  • <laugh> - Laughter
  • <chuckle> - Light chuckle
  • <sigh> - Sigh
  • <cough> - Cough
  • <sniffle> - Sniffle
  • <groan> - Groan
  • <yawn> - Yawn
  • <gasp> - Gasp

Example:

"Bonjour mon amour! <sigh> I missed you so much. <laugh> But now you're here!"
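A small client-side helper can catch typoed tags before submitting a job. The tag names come from the list above; the helper itself is hypothetical, not part of the service.

```python
import re

# Supported emotion tags, per the list above.
EMOTION_TAGS = {"laugh", "chuckle", "sigh", "cough",
                "sniffle", "groan", "yawn", "gasp"}

TAG_RE = re.compile(r"<(\w+)>")

def unknown_tags(text: str) -> set[str]:
    """Return any <tag> in the text that isn't a known emotion tag."""
    return {t for t in TAG_RE.findall(text) if t not in EMOTION_TAGS}
```

For example, `unknown_tags("Hi there <laugh> <shrug>")` returns `{"shrug"}`, while the example sentence above passes cleanly.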

Built-in Voices

In order of conversational realism (per Orpheus docs):

  1. tara (default) - Most natural
  2. leah
  3. jess
  4. leo
  5. dan
  6. mia
  7. zac
  8. zoe

Voice Cloning

Upload a 5-30 second reference audio clip to create a custom voice:

curl -X POST "http://localhost:8766/voice/clone?name=vixy" \
  -F "audio=@vixy_reference.wav"

Then use it:

curl -X POST http://localhost:8766/tts/submit \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "voice": "vixy"}'

API Endpoints

Endpoint              Method  Description
/health               GET     Health check
/voices               GET     List available voices & tags
/tts/submit           POST    Submit TTS job
/tts/status/{job_id}  GET     Check job status
/tts/audio/{job_id}   GET     Download audio
/tts/stream           POST    Stream audio (for head)
/voice/clone          POST    Upload voice reference
/voice/{name}         DELETE  Delete custom voice
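The submit/status/audio endpoints form a simple job-polling flow, sketched below with only the standard library. The base URL matches the examples above; the `job_id` and `status` response fields and the polling interval are assumptions about the API's JSON shape.

```python
import json
import time
import urllib.request

BASE = "http://localhost:8766"

def submit(text: str, voice: str = "tara") -> str:
    """POST /tts/submit and return the job id from the response."""
    req = urllib.request.Request(
        f"{BASE}/tts/submit",
        data=json.dumps({"text": text, "voice": voice}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]  # assumed response field

def wait_for_audio(job_id: str, poll: float = 0.5) -> bytes:
    """Poll /tts/status/{job_id} until done, then download the audio."""
    while True:
        with urllib.request.urlopen(f"{BASE}/tts/status/{job_id}") as resp:
            if json.load(resp).get("status") == "done":  # assumed field
                break
        time.sleep(poll)
    with urllib.request.urlopen(f"{BASE}/tts/audio/{job_id}") as resp:
        return resp.read()

if __name__ == "__main__":
    job = submit("Hello! <laugh>", voice="tara")
    with open("out.wav", "wb") as f:
        f.write(wait_for_audio(job))
```

For latency-sensitive callers (like the head), /tts/stream avoids the polling round-trips entirely.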

Architecture

┌─────────────────────────────────────────────┐
│           OrpheusTail Service               │
│              (AGX Orin)                     │
│                                             │
│  POST /tts/submit  ──► WAV file (for MCP)   │
│  POST /tts/stream  ──► Audio stream (head)  │
│                                             │
│  Emotion tags: <laugh> <sigh> <whisper>     │
│  Voice cloning: 5-sec reference audio       │
└─────────────────────────────────────────────┘
          │                    │
          ▼                    ▼
    voice-mcp              Head-vixy Pi
    (Claude Desktop)       (streams & plays)

Deployment

# On AGX Orin
cd /path/to/orpheus-tts
docker-compose up -d

# Check logs
docker-compose logs -f

# Test
curl http://localhost:8766/health

TODO

  • Implement proper voice cloning with reference audio
  • Test streaming endpoint with head-vixy
  • French accent voice training/selection
  • Head-side client for streaming playback

Notes

  • Same port as VoiceTail (8766) for drop-in replacement
  • Model requires ~15GB VRAM (AGX Orin has plenty)
  • First request may be slow (model warmup)
  • Cache enabled by default to speed up repeated phrases
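One way the phrase cache could be keyed is a stable hash over the inputs that determine the output audio. This is illustrative; the service's actual cache scheme isn't documented here.

```python
import hashlib

def cache_key(text: str, voice: str) -> str:
    """Stable key: the same text + voice pair maps to the same cached WAV."""
    return hashlib.sha256(f"{voice}\x00{text}".encode()).hexdigest()
```

Repeated phrases in the same voice then hit the cached file instead of re-running generation, while the same text in a different voice gets its own entry.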

Created by Vixy on Day 71 🦊
