Initial commit: OrpheusTail TTS service

- FastAPI service replacing VoiceTail (Bark)
- Emotion tags: <laugh>, <sigh>, <gasp>, etc.
- Voice cloning endpoint (implementation pending)
- Streaming support for head playback
- Same port 8766 for drop-in replacement

Created by Vixy on Day 71 🦊
commit ed579a77ee (2026-01-11 15:51:08 -06:00)
5 changed files with 868 additions and 0 deletions

Dockerfile (new file, 54 lines)
# OrpheusTail - Orpheus TTS Service for NVIDIA Jetson AGX Orin
#
# Replaces VoiceTail (Bark) with Orpheus for better emotion control
# and voice cloning capabilities.
#
# Based on NVIDIA L4T PyTorch container optimized for Jetson
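#
# Build and run by hand (a sketch; Compose is the intended entry point,
# see docker-compose.yml):
#   docker build -t orpheus-tts .
#   docker run --runtime nvidia -p 8766:8766 orpheus-tts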
FROM dustynv/pytorch:2.1-r36.2.0
# Set working directory
WORKDIR /app
# Install system dependencies (curl is needed for the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    git \
    wget \
    curl \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements first for better caching
COPY requirements.txt /app/
# Install Python dependencies
# Note: torch and torchvision are already in the base image
RUN pip3 install --no-cache-dir -r requirements.txt
# Install orpheus-speech (uses vllm under the hood)
# Note: vllm version compatibility may need adjustment
RUN pip3 install orpheus-speech
# Copy application code
COPY main.py /app/
# Create directories for cache, output, and custom voices
RUN mkdir -p /app/cache /app/output /app/voices
# Expose API port (same as VoiceTail for drop-in replacement)
EXPOSE 8766
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV CACHE_DIR=/app/cache
ENV OUTPUT_DIR=/app/output
ENV VOICES_DIR=/app/voices
ENV ORPHEUS_MODEL=canopylabs/orpheus-tts-0.1-finetune-prod
ENV DEFAULT_VOICE=tara
ENV MAX_MODEL_LEN=2048
# Health check (longer start period - model loading takes time)
HEALTHCHECK --interval=30s --timeout=10s --start-period=180s --retries=3 \
CMD curl -f http://localhost:8766/health || exit 1
# Run the FastAPI application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8766"]

README.md (new file, 123 lines)
# OrpheusTail - Orpheus TTS Service
Replaces VoiceTail (Bark) with **Orpheus TTS** for better emotion control and voice cloning.
## Why Orpheus over Bark?
| Feature | Bark | Orpheus |
|---------|------|---------|
| Emotion control | Random/unpredictable | **Tag-based**: `<laugh>`, `<sigh>`, etc. |
| Voice cloning | No | **Zero-shot** from 5-sec sample |
| Latency | Slow | ~200ms streaming |
| Consistency | Chaotic (French horn!) | Predictable |
| Built-in voices | Few | 8 quality voices |
## Emotion Tags
Add these anywhere in your text:
- `<laugh>` - Laughter
- `<chuckle>` - Light chuckle
- `<sigh>` - Sigh
- `<cough>` - Cough
- `<sniffle>` - Sniffle
- `<groan>` - Groan
- `<yawn>` - Yawn
- `<gasp>` - Gasp
**Example:**
```
"Bonjour mon amour! <sigh> I missed you so much. <laugh> But now you're here!"
```
## Built-in Voices
In order of conversational realism (per Orpheus docs):
1. **tara** (default) - Most natural
2. **leah**
3. **jess**
4. **leo**
5. **dan**
6. **mia**
7. **zac**
8. **zoe**
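
To check what a running instance actually reports (built-in voices, any cloned ones, and the supported tags), a quick sketch using Python's `requests` (not a dependency of the service itself):

```python
import requests

voices = requests.get("http://localhost:8766/voices").json()
print("builtin:", voices["builtin"])
print("custom:", voices["custom"])
print("default:", voices["default"])
print("tags:", voices["emotion_tags"])
```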
## Voice Cloning
Upload a 5-30 second reference clip to register a custom voice. (Note: cloned-voice synthesis is not implemented yet; the reference is stored, but generation currently falls back to the default built-in voice, see TODO.)
```bash
curl -X POST "http://localhost:8766/voice/clone?name=vixy" \
-F "audio=@vixy_reference.wav"
```
Then use it:
```bash
curl -X POST http://localhost:8766/tts/submit \
-H "Content-Type: application/json" \
-d '{"text": "Hello!", "voice": "vixy"}'
```
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/voices` | GET | List available voices & tags |
| `/tts/submit` | POST | Submit TTS job |
| `/tts/status/{job_id}` | GET | Check job status |
| `/tts/audio/{job_id}` | GET | Download audio |
| `/tts/stream` | POST | Stream audio (for head) |
| `/voice/clone` | POST | Upload voice reference |
| `/voice/{name}` | DELETE | Delete custom voice |
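
For scripting against the job API, here is a minimal client sketch (assumes the service is reachable on `localhost:8766`; `requests` is not one of the service's own dependencies):

```python
import time

import requests

BASE = "http://localhost:8766"

# Submit a TTS job; the service returns a job_id immediately
job = requests.post(
    f"{BASE}/tts/submit",
    json={"text": "Bonjour! <laugh> Testing the job API.", "voice": "tara"},
).json()
job_id = job["job_id"]

# Poll until the job finishes (cached phrases complete almost instantly)
while True:
    status = requests.get(f"{BASE}/tts/status/{job_id}").json()
    if status["status"] in ("SUCCESS", "FAILURE"):
        break
    time.sleep(0.5)

# Download the WAV on success
if status["status"] == "SUCCESS":
    audio = requests.get(f"{BASE}/tts/audio/{job_id}")
    with open("out.wav", "wb") as f:
        f.write(audio.content)
else:
    print("Job failed:", status.get("error"))
```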
## Architecture
```
┌─────────────────────────────────────────────┐
│ OrpheusTail Service │
│ (AGX Orin) │
│ │
│ POST /tts/submit ──► WAV file (for MCP) │
│ POST /tts/stream ──► Audio stream (head) │
│ │
│  Emotion tags: <laugh> <sigh> <chuckle>     │
│ Voice cloning: 5-sec reference audio │
└─────────────────────────────────────────────┘
│ │
▼ ▼
voice-mcp Head-vixy Pi
(Claude Desktop) (streams & plays)
```
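
A head-side streaming client is still on the TODO list, but here is a minimal sketch of what it could look like, assuming `/tts/stream` yields headerless 16-bit mono PCM at 24 kHz and that `requests` and `sounddevice` are installed on the Pi (the hostname below is hypothetical):

```python
import requests
import sounddevice as sd

BASE = "http://orin.local:8766"  # hypothetical address of the AGX Orin

# Open a raw output stream matching the service's assumed PCM format
with sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16") as out:
    with requests.post(
        f"{BASE}/tts/stream",
        json={"text": "Streaming test! <gasp> Can you hear me?", "voice": "tara"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        leftover = b""
        for chunk in resp.iter_content(chunk_size=4096):
            data = leftover + chunk
            # Only write whole 16-bit frames; carry any odd byte over
            cut = len(data) - (len(data) % 2)
            out.write(data[:cut])
            leftover = data[cut:]
```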
## Deployment
```bash
# On AGX Orin
cd /path/to/orpheus-tts
docker-compose up -d
# Check logs
docker-compose logs -f
# Test
curl http://localhost:8766/health
```
## TODO
- [ ] Implement proper voice cloning with reference audio
- [ ] Test streaming endpoint with head-vixy
- [ ] French accent voice training/selection
- [ ] Head-side client for streaming playback
## Notes
- Same port as VoiceTail (8766) for drop-in replacement
- Model requires ~15GB VRAM (AGX Orin has plenty)
- First request may be slow (model warmup)
- Cache enabled by default to speed up repeated phrases
---
*Created by Vixy on Day 71 🦊*

docker-compose.yml (new file, 55 lines)
# OrpheusTail - Orpheus TTS Service
#
# Usage:
# docker-compose up -d
# docker-compose logs -f
#
# Test:
# curl http://localhost:8766/health
# curl http://localhost:8766/voices
# curl -X POST http://localhost:8766/tts/submit \
# -H "Content-Type: application/json" \
# -d '{"text": "Hello! <laugh> This is Vixy speaking.", "voice": "tara"}'
version: '3.8'
services:
orpheus-tts:
build: .
container_name: orpheus-tts
restart: unless-stopped
# GPU access for NVIDIA Jetson
runtime: nvidia
ports:
- "8766:8766"
volumes:
# Persist cache between restarts
- orpheus-cache:/app/cache
# Persist generated audio
- orpheus-output:/app/output
# Custom voice references
- orpheus-voices:/app/voices
environment:
- ORPHEUS_MODEL=canopylabs/orpheus-tts-0.1-finetune-prod
- DEFAULT_VOICE=tara
- MAX_MODEL_LEN=2048
- CACHE_ENABLED=true
- RETENTION_DAYS=10
    # GPU reservation via the Compose device API. On Jetson, `runtime: nvidia`
    # above is usually what takes effect; this block is kept for Compose
    # setups that honor device reservations.
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
orpheus-cache:
orpheus-output:
orpheus-voices:

main.py (new file, 616 lines)
#!/usr/bin/env python3
"""
OrpheusTail - Orpheus TTS Service
FastAPI server for Orpheus text-to-speech generation on Jetson AGX Orin.
Replaces VoiceTail (Bark) with better control, voice cloning, and emotion tags.
Key Features:
- Emotion tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
- Zero-shot voice cloning from reference audio
- Streaming support for real-time head playback
- Built-in voices: tara, leah, jess, leo, dan, mia, zac, zoe
Endpoints:
- POST /tts/submit - Submit TTS job (returns job_id)
- GET /tts/status/{job_id} - Check job status
- GET /tts/audio/{job_id} - Download generated audio
- POST /tts/stream - Stream audio in real-time (for head)
- POST /voice/clone - Upload reference audio for voice cloning
- GET /voices - List available voices
- GET /health - Health check
"""
import os
import json
import hashlib
import asyncio
import uuid
import wave
import io
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from enum import Enum
from fastapi import FastAPI, BackgroundTasks, HTTPException, UploadFile, File
from fastapi.responses import FileResponse, StreamingResponse
from pydantic import BaseModel
# Configuration from environment
ORPHEUS_MODEL = os.getenv("ORPHEUS_MODEL", "canopylabs/orpheus-tts-0.1-finetune-prod")
CACHE_ENABLED = os.getenv("CACHE_ENABLED", "true").lower() == "true"
CACHE_DIR = Path(os.getenv("CACHE_DIR", "cache"))
OUTPUT_DIR = Path(os.getenv("OUTPUT_DIR", "output"))
VOICES_DIR = Path(os.getenv("VOICES_DIR", "voices")) # For cloned voice references
RETENTION_DAYS = int(os.getenv("RETENTION_DAYS", "10"))
CLEANUP_INTERVAL_HOURS = int(os.getenv("CLEANUP_INTERVAL_HOURS", "1"))
DEFAULT_VOICE = os.getenv("DEFAULT_VOICE", "tara") # Orpheus default voice
MAX_MODEL_LEN = int(os.getenv("MAX_MODEL_LEN", "2048"))
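# Orpheus generates 24 kHz mono audio; output WAVs are written as 16-bit PCM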
SAMPLE_RATE = 24000
# Ensure directories exist
CACHE_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)
VOICES_DIR.mkdir(exist_ok=True)
# Jobs persistence
JOBS_FILE = OUTPUT_DIR / "jobs.json"
# Built-in Orpheus voices (in order of conversational realism per docs)
BUILTIN_VOICES = ["tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe"]
# Supported emotion tags
EMOTION_TAGS = ["<laugh>", "<chuckle>", "<sigh>", "<cough>", "<sniffle>", "<groan>", "<yawn>", "<gasp>"]
# Initialize FastAPI
app = FastAPI(
title="OrpheusTail - Orpheus TTS Service",
description="Text-to-speech with emotion control and voice cloning for Vixy",
version="1.0.0"
)
# Global model (loaded at startup)
model = None
class JobStatus(str, Enum):
"""Job status enum"""
PENDING = "PENDING"
PROCESSING = "PROCESSING"
SUCCESS = "SUCCESS"
FAILURE = "FAILURE"
@dataclass
class JobInfo:
"""Job information"""
job_id: str
text: str
voice: str
status: JobStatus
progress: int = 0
audio_path: Optional[str] = None
error: Optional[str] = None
cached: bool = False
created_at: str = ""
completed_at: Optional[str] = None
# In-memory job storage
jobs: Dict[str, JobInfo] = {}
def load_jobs_from_disk():
"""Load jobs from disk on startup"""
global jobs
if JOBS_FILE.exists():
try:
with open(JOBS_FILE, 'r') as f:
data = json.load(f)
for job_id, job_dict in data.items():
jobs[job_id] = JobInfo(**job_dict)
print(f"Loaded {len(jobs)} jobs from disk")
except Exception as e:
print(f"Error loading jobs: {e}")
def save_jobs_to_disk():
"""Save jobs to disk"""
try:
data = {job_id: asdict(job) for job_id, job in jobs.items()}
with open(JOBS_FILE, 'w') as f:
json.dump(data, f, indent=2)
except Exception as e:
print(f"Error saving jobs: {e}")
def hash_text_voice(text: str, voice: str) -> str:
"""Generate cache key from text + voice"""
content = f"{text}|{voice}"
return hashlib.sha256(content.encode()).hexdigest()
def get_from_cache(cache_key: str) -> Optional[str]:
"""Check if audio exists in cache"""
if not CACHE_ENABLED:
return None
cache_path = CACHE_DIR / f"{cache_key}.wav"
if cache_path.exists():
print(f"Cache hit: {cache_key}")
return str(cache_path)
return None
def save_to_cache(cache_key: str, audio_path: str):
"""Save generated audio to cache"""
if not CACHE_ENABLED:
return
try:
import shutil
cache_path = CACHE_DIR / f"{cache_key}.wav"
shutil.copy(audio_path, cache_path)
print(f"Saved to cache: {cache_key}")
except Exception as e:
print(f"Error saving to cache: {e}")
def get_custom_voices() -> List[str]:
"""Get list of custom cloned voices"""
voices = []
for voice_file in VOICES_DIR.glob("*.wav"):
voices.append(voice_file.stem)
return voices
def generate_speech(text: str, voice: str) -> bytes:
"""
Generate speech using Orpheus model.
Args:
text: Text to convert (may include emotion tags)
voice: Voice name (built-in or custom)
Returns:
WAV audio bytes
"""
global model
# Check if it's a custom voice (needs reference audio)
custom_voice_path = VOICES_DIR / f"{voice}.wav"
if custom_voice_path.exists():
# TODO: Implement voice cloning with reference audio
# For now, fall back to built-in voice
print(f"Custom voice '{voice}' - voice cloning to be implemented")
voice = DEFAULT_VOICE
elif voice not in BUILTIN_VOICES:
print(f"Unknown voice '{voice}', using default '{DEFAULT_VOICE}'")
voice = DEFAULT_VOICE
# Generate speech using Orpheus
# Note: text is passed as-is, emotion tags like <laugh> are handled by Orpheus
audio_chunks = []
syn_tokens = model.generate_speech(
prompt=text,
voice=voice,
)
    # Collect audio chunks (the Orpheus generator yields raw 16-bit PCM bytes)
    for audio_chunk in syn_tokens:
        audio_chunks.append(audio_chunk)
    # Join the PCM byte chunks directly; np.concatenate would choke on
    # plain bytes objects
    audio_data = b"".join(audio_chunks)
# Convert to WAV bytes
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as wf:
wf.setnchannels(1)
wf.setsampwidth(2) # 16-bit
wf.setframerate(SAMPLE_RATE)
wf.writeframes(audio_data)
return buffer.getvalue()
def save_audio_to_file(job_id: str, audio_bytes: bytes) -> str:
"""Save audio bytes to WAV file."""
output_path = OUTPUT_DIR / f"{job_id}.wav"
with open(output_path, 'wb') as f:
f.write(audio_bytes)
return str(output_path)
def generate_speech_background(job_id: str, text: str, voice: str):
"""Background task for speech generation."""
try:
jobs[job_id].status = JobStatus.PROCESSING
jobs[job_id].progress = 25
save_jobs_to_disk()
# Check cache first
cache_key = hash_text_voice(text, voice)
cached_path = get_from_cache(cache_key)
if cached_path:
jobs[job_id].audio_path = cached_path
jobs[job_id].status = JobStatus.SUCCESS
jobs[job_id].progress = 100
jobs[job_id].cached = True
jobs[job_id].completed_at = datetime.now().isoformat()
save_jobs_to_disk()
print(f"Job {job_id} completed from cache")
return
# Generate audio
jobs[job_id].progress = 50
save_jobs_to_disk()
print(f"Generating audio for job {job_id}...")
audio_bytes = generate_speech(text, voice)
# Save to file
jobs[job_id].progress = 75
save_jobs_to_disk()
output_path = save_audio_to_file(job_id, audio_bytes)
# Save to cache
save_to_cache(cache_key, output_path)
# Complete
jobs[job_id].audio_path = output_path
jobs[job_id].status = JobStatus.SUCCESS
jobs[job_id].progress = 100
jobs[job_id].completed_at = datetime.now().isoformat()
save_jobs_to_disk()
print(f"Job {job_id} completed successfully")
except Exception as e:
print(f"Job {job_id} failed: {e}")
import traceback
traceback.print_exc()
jobs[job_id].status = JobStatus.FAILURE
jobs[job_id].error = str(e)
save_jobs_to_disk()
async def cleanup_old_jobs():
"""Background task to cleanup old jobs and files."""
while True:
try:
await asyncio.sleep(CLEANUP_INTERVAL_HOURS * 3600)
cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)
to_delete = []
for job_id, job in jobs.items():
try:
created = datetime.fromisoformat(job.created_at)
if created < cutoff:
if job.audio_path and Path(job.audio_path).exists():
Path(job.audio_path).unlink()
to_delete.append(job_id)
                except Exception:
                    # Skip jobs with malformed or missing timestamps
                    pass
for job_id in to_delete:
del jobs[job_id]
if to_delete:
save_jobs_to_disk()
print(f"Cleanup: deleted {len(to_delete)} old jobs")
except Exception as e:
print(f"Error in cleanup task: {e}")
@app.on_event("startup")
async def startup():
"""Load model and jobs on startup"""
global model
print("=" * 60)
print("OrpheusTail - Orpheus TTS Service Starting")
print(f"Model: {ORPHEUS_MODEL}")
print(f"Max Model Len: {MAX_MODEL_LEN}")
print(f"Cache: {'Enabled' if CACHE_ENABLED else 'Disabled'}")
print(f"Default Voice: {DEFAULT_VOICE}")
print("=" * 60)
# Import and load Orpheus model
print("Loading Orpheus model (this may take a moment)...")
from orpheus_tts import OrpheusModel
model = OrpheusModel(
model_name=ORPHEUS_MODEL,
max_model_len=MAX_MODEL_LEN
)
print("✓ Orpheus model loaded successfully")
# Load jobs from disk
load_jobs_from_disk()
# Start cleanup task
asyncio.create_task(cleanup_old_jobs())
# === Pydantic Models ===
class TTSRequest(BaseModel):
"""TTS job submission request"""
text: str
voice: str = DEFAULT_VOICE
class TTSStreamRequest(BaseModel):
"""TTS streaming request (for head playback)"""
text: str
voice: str = DEFAULT_VOICE
class JobResponse(BaseModel):
"""Job submission response"""
job_id: str
status: str
class StatusResponse(BaseModel):
"""Job status response"""
job_id: str
status: str
progress: int
cached: bool = False
audio_url: Optional[str] = None
error: Optional[str] = None
class VoicesResponse(BaseModel):
"""Available voices response"""
builtin: List[str]
custom: List[str]
default: str
emotion_tags: List[str]
# === Endpoints ===
@app.get("/")
def root():
"""Root endpoint"""
return {
"service": "OrpheusTail - Orpheus TTS Service",
"version": "1.0.0",
"model": ORPHEUS_MODEL,
"default_voice": DEFAULT_VOICE,
"emotion_tags": EMOTION_TAGS,
"endpoints": {
"/tts/submit": "POST - Submit TTS job",
"/tts/status/{job_id}": "GET - Check job status",
"/tts/audio/{job_id}": "GET - Download audio",
"/tts/stream": "POST - Stream audio (for head)",
"/voice/clone": "POST - Upload voice reference",
"/voices": "GET - List available voices",
"/health": "GET - Health check"
}
}
@app.get("/health")
def health():
"""Health check"""
return {
"status": "healthy",
"model_loaded": model is not None,
"cache_enabled": CACHE_ENABLED,
"voices_available": len(BUILTIN_VOICES) + len(get_custom_voices())
}
@app.get("/voices", response_model=VoicesResponse)
def list_voices():
"""List all available voices"""
return VoicesResponse(
builtin=BUILTIN_VOICES,
custom=get_custom_voices(),
default=DEFAULT_VOICE,
emotion_tags=EMOTION_TAGS
)
@app.post("/tts/submit", response_model=JobResponse)
async def submit_tts_job(request: TTSRequest, background_tasks: BackgroundTasks):
"""Submit a TTS job for processing."""
job_id = str(uuid.uuid4())
job = JobInfo(
job_id=job_id,
text=request.text,
voice=request.voice,
status=JobStatus.PENDING,
progress=0,
created_at=datetime.now().isoformat()
)
jobs[job_id] = job
save_jobs_to_disk()
background_tasks.add_task(
generate_speech_background,
job_id,
request.text,
request.voice
)
print(f"Job {job_id} submitted: '{request.text[:50]}...' with voice '{request.voice}'")
return JobResponse(job_id=job_id, status=JobStatus.PENDING)
@app.get("/tts/status/{job_id}", response_model=StatusResponse)
async def get_job_status(job_id: str):
"""Get status of a TTS job."""
if job_id not in jobs:
raise HTTPException(status_code=404, detail="Job not found")
job = jobs[job_id]
response = StatusResponse(
job_id=job_id,
status=job.status,
progress=job.progress,
cached=job.cached
)
if job.status == JobStatus.SUCCESS:
response.audio_url = f"/tts/audio/{job_id}"
elif job.status == JobStatus.FAILURE:
response.error = job.error
return response
@app.get("/tts/audio/{job_id}")
async def get_audio(job_id: str):
"""Retrieve generated audio file."""
if job_id not in jobs:
raise HTTPException(status_code=404, detail="Job not found")
job = jobs[job_id]
if job.status != JobStatus.SUCCESS:
raise HTTPException(
status_code=400,
detail=f"Audio not ready. Job status: {job.status}"
)
if not job.audio_path or not Path(job.audio_path).exists():
raise HTTPException(status_code=404, detail="Audio file not found")
return FileResponse(
job.audio_path,
media_type="audio/wav",
filename=f"{job_id}.wav"
)
@app.post("/tts/stream")
async def stream_tts(request: TTSStreamRequest):
"""
Stream TTS audio in real-time.
For head-vixy to stream directly without waiting for full generation.
Returns audio chunks as they're generated.
"""
global model
if model is None:
raise HTTPException(status_code=503, detail="Model not loaded")
voice = request.voice
if voice not in BUILTIN_VOICES:
voice = DEFAULT_VOICE
async def audio_generator():
"""Generate audio chunks"""
try:
syn_tokens = model.generate_speech(
prompt=request.text,
voice=voice,
)
for audio_chunk in syn_tokens:
yield audio_chunk
except Exception as e:
print(f"Stream error: {e}")
raise
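    # NOTE: the chunks above are raw 16-bit mono PCM at SAMPLE_RATE (24 kHz),
    # not a complete WAV despite the media type below; the head-side client
    # must play raw PCM or prepend a header (untested -- see README TODO).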
return StreamingResponse(
audio_generator(),
media_type="audio/wav"
)
@app.post("/voice/clone")
async def upload_voice_reference(
name: str,
audio: UploadFile = File(...),
):
"""
Upload a reference audio file for voice cloning.
Args:
name: Name for this custom voice
audio: WAV audio file (5-30 seconds recommended)
"""
if not name.isalnum():
raise HTTPException(status_code=400, detail="Voice name must be alphanumeric")
if name in BUILTIN_VOICES:
raise HTTPException(status_code=400, detail="Cannot overwrite built-in voice")
# Save the reference audio
voice_path = VOICES_DIR / f"{name}.wav"
try:
content = await audio.read()
with open(voice_path, 'wb') as f:
f.write(content)
return {
"status": "success",
"voice_name": name,
"message": f"Voice '{name}' saved. Use voice='{name}' in TTS requests."
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"Failed to save voice: {e}")
@app.delete("/voice/{name}")
async def delete_voice(name: str):
"""Delete a custom voice."""
if name in BUILTIN_VOICES:
raise HTTPException(status_code=400, detail="Cannot delete built-in voice")
voice_path = VOICES_DIR / f"{name}.wav"
if not voice_path.exists():
raise HTTPException(status_code=404, detail="Voice not found")
voice_path.unlink()
return {"status": "success", "message": f"Voice '{name}' deleted"}
@app.delete("/tts/job/{job_id}")
async def delete_job(job_id: str):
"""Delete a job and its audio file."""
if job_id not in jobs:
raise HTTPException(status_code=404, detail="Job not found")
job = jobs[job_id]
if job.audio_path and Path(job.audio_path).exists():
try:
Path(job.audio_path).unlink()
        except Exception:
            # Best effort: ignore failures when removing the audio file
            pass
del jobs[job_id]
save_jobs_to_disk()
return {"message": f"Job {job_id} deleted"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(
"main:app",
host="0.0.0.0",
port=8766, # Same port as VoiceTail for drop-in replacement
reload=False
)

requirements.txt (new file, 20 lines)
# OrpheusTail - Orpheus TTS Service Dependencies
# Web framework
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
# Orpheus TTS
# orpheus-speech is installed separately in Dockerfile
# It pulls vllm as a dependency
# Audio processing
scipy>=1.10.0
numpy>=1.24.0
# Data validation
pydantic>=2.0.0
# Note: PyTorch should already be installed via JetPack
# vllm is pulled by orpheus-speech
# If you hit vllm version issues, pin it: vllm==0.7.3