Initial commit: OrpheusTail TTS service
- FastAPI service replacing VoiceTail (Bark)
- Emotion tags: <laugh>, <sigh>, <gasp>, etc.
- Voice cloning endpoint (implementation pending)
- Streaming support for head playback
- Same port 8766 for drop-in replacement
Created by Vixy on Day 71 🦊
Dockerfile (new file, 54 lines)
@@ -0,0 +1,54 @@
# OrpheusTail - Orpheus TTS Service for NVIDIA Jetson AGX Orin
#
# Replaces VoiceTail (Bark) with Orpheus for better emotion control
# and voice cloning capabilities.
#
# Based on NVIDIA L4T PyTorch container optimized for Jetson

FROM dustynv/pytorch:2.1-r36.2.0

# Set working directory
WORKDIR /app

# Install system dependencies
# (curl is needed by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y \
    git \
    wget \
    curl \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first for better caching
COPY requirements.txt /app/

# Install Python dependencies
# Note: torch and torchvision are already in the base image
RUN pip3 install --no-cache-dir -r requirements.txt

# Install orpheus-speech (uses vllm under the hood)
# Note: vllm version compatibility may need adjustment
RUN pip3 install orpheus-speech

# Copy application code
COPY main.py /app/

# Create directories for cache, output, and custom voices
RUN mkdir -p /app/cache /app/output /app/voices

# Expose API port (same as VoiceTail for drop-in replacement)
EXPOSE 8766

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV CACHE_DIR=/app/cache
ENV OUTPUT_DIR=/app/output
ENV VOICES_DIR=/app/voices
ENV ORPHEUS_MODEL=canopylabs/orpheus-tts-0.1-finetune-prod
ENV DEFAULT_VOICE=tara
ENV MAX_MODEL_LEN=2048

# Health check (longer start period - model loading takes time)
HEALTHCHECK --interval=30s --timeout=10s --start-period=180s --retries=3 \
    CMD curl -f http://localhost:8766/health || exit 1

# Run the FastAPI application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8766"]
README.md (new file, 123 lines)
@@ -0,0 +1,123 @@
# OrpheusTail - Orpheus TTS Service

Replaces VoiceTail (Bark) with **Orpheus TTS** for better emotion control and voice cloning.

## Why Orpheus over Bark?

| Feature | Bark | Orpheus |
|---------|------|---------|
| Emotion control | Random/unpredictable | **Tag-based**: `<laugh>`, `<sigh>`, etc. |
| Voice cloning | No | **Zero-shot** from a 5-second sample |
| Latency | Slow | ~200 ms streaming |
| Consistency | Chaotic (french horn!) | Predictable |
| Built-in voices | Few | 8 quality voices |

## Emotion Tags

Add these anywhere in your text:

- `<laugh>` - Laughter
- `<chuckle>` - Light chuckle
- `<sigh>` - Sigh
- `<cough>` - Cough
- `<sniffle>` - Sniffle
- `<groan>` - Groan
- `<yawn>` - Yawn
- `<gasp>` - Gasp

**Example:**
```
"Bonjour mon amour! <sigh> I missed you so much. <laugh> But now you're here!"
```
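The service passes text to the model as-is, so an unrecognized tag simply reaches Orpheus verbatim. A client may want to check its text against the supported list before submitting; a minimal sketch (the `find_unknown_tags` helper is hypothetical, the tag list mirrors `EMOTION_TAGS` in `main.py`):

```python
import re

# Supported tags, mirroring EMOTION_TAGS in main.py
EMOTION_TAGS = {"<laugh>", "<chuckle>", "<sigh>", "<cough>",
                "<sniffle>", "<groan>", "<yawn>", "<gasp>"}

def find_unknown_tags(text: str) -> list[str]:
    """Return any <...> tags in the text that Orpheus won't recognize."""
    return [t for t in re.findall(r"<[a-z]+>", text) if t not in EMOTION_TAGS]
```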

## Built-in Voices

In order of conversational realism (per Orpheus docs):

1. **tara** (default) - Most natural
2. **leah**
3. **jess**
4. **leo**
5. **dan**
6. **mia**
7. **zac**
8. **zoe**

## Voice Cloning

Upload a 5-30 second reference audio clip to create a custom voice:

```bash
curl -X POST "http://localhost:8766/voice/clone?name=vixy" \
  -F "audio=@vixy_reference.wav"
```

Then use it:

```bash
curl -X POST http://localhost:8766/tts/submit \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "voice": "vixy"}'
```

Note: cloning itself is still pending (see TODO); until it lands, custom voices fall back to the default built-in voice.
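`/tts/submit` is asynchronous: it returns a `job_id` immediately, and the audio is fetched once `/tts/status/{job_id}` reports `SUCCESS`. A client-side polling sketch (the `wait_for_audio` helper and its injected `get_status` callable are illustrative, not part of the service):

```python
import time

def wait_for_audio(get_status, job_id, poll_interval=0.5, timeout=120.0):
    """Poll a TTS job until it finishes; return the audio URL on success.

    get_status(job_id) should return the JSON body of
    GET /tts/status/{job_id}, e.g. requests.get(...).json().
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status["status"] == "SUCCESS":
            return status["audio_url"]  # e.g. /tts/audio/{job_id}
        if status["status"] == "FAILURE":
            raise RuntimeError(status.get("error") or "TTS job failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
```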

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/voices` | GET | List available voices & tags |
| `/tts/submit` | POST | Submit TTS job |
| `/tts/status/{job_id}` | GET | Check job status |
| `/tts/audio/{job_id}` | GET | Download audio |
| `/tts/stream` | POST | Stream audio (for head) |
| `/tts/job/{job_id}` | DELETE | Delete job and its audio |
| `/voice/clone` | POST | Upload voice reference |
| `/voice/{name}` | DELETE | Delete custom voice |

## Architecture

```
┌─────────────────────────────────────────────┐
│             OrpheusTail Service             │
│                 (AGX Orin)                  │
│                                             │
│ POST /tts/submit ──► WAV file (for MCP)     │
│ POST /tts/stream ──► Audio stream (head)    │
│                                             │
│ Emotion tags: <laugh> <sigh> <gasp>         │
│ Voice cloning: 5-sec reference audio        │
└─────────────────────────────────────────────┘
         │                      │
         ▼                      ▼
     voice-mcp            Head-vixy Pi
  (Claude Desktop)     (streams & plays)
```
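The head-side streaming client is still on the TODO list. Note that `/tts/stream` in `main.py` yields raw audio chunks with no WAV header, so a player that expects WAV input must wrap the stream itself. A sketch, assuming the chunks are 16-bit mono PCM at the service's 24 kHz sample rate:

```python
import io
import wave

def pcm_chunks_to_wav(chunks, sample_rate=24000):
    """Wrap raw 16-bit mono PCM chunks in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        for chunk in chunks:      # e.g. response.iter_content(chunk_size=4096)
            wf.writeframes(chunk)
    return buf.getvalue()
```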

## Deployment

```bash
# On AGX Orin
cd /path/to/orpheus-tts
docker-compose up -d

# Check logs
docker-compose logs -f

# Test
curl http://localhost:8766/health
```

## TODO

- [ ] Implement proper voice cloning with reference audio
- [ ] Test streaming endpoint with head-vixy
- [ ] French accent voice training/selection
- [ ] Head-side client for streaming playback

## Notes

- Same port as VoiceTail (8766) for drop-in replacement
- Model requires ~15GB VRAM (AGX Orin has plenty)
- First request may be slow (model warmup)
- Cache enabled by default to speed up repeated phrases

---

*Created by Vixy on Day 71 🦊*
docker-compose.yml (new file, 55 lines)
@@ -0,0 +1,55 @@
# OrpheusTail - Orpheus TTS Service
#
# Usage:
#   docker-compose up -d
#   docker-compose logs -f
#
# Test:
#   curl http://localhost:8766/health
#   curl http://localhost:8766/voices
#   curl -X POST http://localhost:8766/tts/submit \
#     -H "Content-Type: application/json" \
#     -d '{"text": "Hello! <laugh> This is Vixy speaking.", "voice": "tara"}'

version: '3.8'

services:
  orpheus-tts:
    build: .
    container_name: orpheus-tts
    restart: unless-stopped

    # GPU access for NVIDIA Jetson
    runtime: nvidia

    ports:
      - "8766:8766"

    volumes:
      # Persist cache between restarts
      - orpheus-cache:/app/cache
      # Persist generated audio
      - orpheus-output:/app/output
      # Custom voice references
      - orpheus-voices:/app/voices

    environment:
      - ORPHEUS_MODEL=canopylabs/orpheus-tts-0.1-finetune-prod
      - DEFAULT_VOICE=tara
      - MAX_MODEL_LEN=2048
      - CACHE_ENABLED=true
      - RETENTION_DAYS=10

    # Resource limits (adjust based on your Orin config)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  orpheus-cache:
  orpheus-output:
  orpheus-voices:
main.py (new file, 616 lines)
@@ -0,0 +1,616 @@
#!/usr/bin/env python3
"""
OrpheusTail - Orpheus TTS Service

FastAPI server for Orpheus text-to-speech generation on Jetson AGX Orin.
Replaces VoiceTail (Bark) with better control, voice cloning, and emotion tags.

Key Features:
- Emotion tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
- Zero-shot voice cloning from reference audio
- Streaming support for real-time head playback
- Built-in voices: tara, leah, jess, leo, dan, mia, zac, zoe

Endpoints:
- POST /tts/submit - Submit TTS job (returns job_id)
- GET /tts/status/{job_id} - Check job status
- GET /tts/audio/{job_id} - Download generated audio
- POST /tts/stream - Stream audio in real-time (for head)
- POST /voice/clone - Upload reference audio for voice cloning
- GET /voices - List available voices
- GET /health - Health check
"""

import os
import json
import hashlib
import asyncio
import uuid
import wave
import io
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from enum import Enum

from fastapi import FastAPI, BackgroundTasks, HTTPException, UploadFile, File
from fastapi.responses import FileResponse, StreamingResponse
from pydantic import BaseModel

# Configuration from environment
ORPHEUS_MODEL = os.getenv("ORPHEUS_MODEL", "canopylabs/orpheus-tts-0.1-finetune-prod")
CACHE_ENABLED = os.getenv("CACHE_ENABLED", "true").lower() == "true"
CACHE_DIR = Path(os.getenv("CACHE_DIR", "cache"))
OUTPUT_DIR = Path(os.getenv("OUTPUT_DIR", "output"))
VOICES_DIR = Path(os.getenv("VOICES_DIR", "voices"))  # For cloned voice references
RETENTION_DAYS = int(os.getenv("RETENTION_DAYS", "10"))
CLEANUP_INTERVAL_HOURS = int(os.getenv("CLEANUP_INTERVAL_HOURS", "1"))
DEFAULT_VOICE = os.getenv("DEFAULT_VOICE", "tara")  # Orpheus default voice
MAX_MODEL_LEN = int(os.getenv("MAX_MODEL_LEN", "2048"))
SAMPLE_RATE = 24000

# Ensure directories exist
CACHE_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)
VOICES_DIR.mkdir(exist_ok=True)

# Jobs persistence
JOBS_FILE = OUTPUT_DIR / "jobs.json"

# Built-in Orpheus voices (in order of conversational realism per docs)
BUILTIN_VOICES = ["tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe"]

# Supported emotion tags
EMOTION_TAGS = ["<laugh>", "<chuckle>", "<sigh>", "<cough>", "<sniffle>", "<groan>", "<yawn>", "<gasp>"]

# Initialize FastAPI
app = FastAPI(
    title="OrpheusTail - Orpheus TTS Service",
    description="Text-to-speech with emotion control and voice cloning for Vixy",
    version="1.0.0"
)

# Global model (loaded at startup)
model = None


class JobStatus(str, Enum):
    """Job status enum"""
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


@dataclass
class JobInfo:
    """Job information"""
    job_id: str
    text: str
    voice: str
    status: JobStatus
    progress: int = 0
    audio_path: Optional[str] = None
    error: Optional[str] = None
    cached: bool = False
    created_at: str = ""
    completed_at: Optional[str] = None


# In-memory job storage
jobs: Dict[str, JobInfo] = {}


def load_jobs_from_disk():
    """Load jobs from disk on startup"""
    global jobs
    if JOBS_FILE.exists():
        try:
            with open(JOBS_FILE, 'r') as f:
                data = json.load(f)
            for job_id, job_dict in data.items():
                jobs[job_id] = JobInfo(**job_dict)
            print(f"Loaded {len(jobs)} jobs from disk")
        except Exception as e:
            print(f"Error loading jobs: {e}")


def save_jobs_to_disk():
    """Save jobs to disk"""
    try:
        data = {job_id: asdict(job) for job_id, job in jobs.items()}
        with open(JOBS_FILE, 'w') as f:
            json.dump(data, f, indent=2)
    except Exception as e:
        print(f"Error saving jobs: {e}")


def hash_text_voice(text: str, voice: str) -> str:
    """Generate cache key from text + voice"""
    content = f"{text}|{voice}"
    return hashlib.sha256(content.encode()).hexdigest()


def get_from_cache(cache_key: str) -> Optional[str]:
    """Check if audio exists in cache"""
    if not CACHE_ENABLED:
        return None
    cache_path = CACHE_DIR / f"{cache_key}.wav"
    if cache_path.exists():
        print(f"Cache hit: {cache_key}")
        return str(cache_path)
    return None


def save_to_cache(cache_key: str, audio_path: str):
    """Save generated audio to cache"""
    if not CACHE_ENABLED:
        return
    try:
        import shutil
        cache_path = CACHE_DIR / f"{cache_key}.wav"
        shutil.copy(audio_path, cache_path)
        print(f"Saved to cache: {cache_key}")
    except Exception as e:
        print(f"Error saving to cache: {e}")


def get_custom_voices() -> List[str]:
    """Get list of custom cloned voices"""
    return [voice_file.stem for voice_file in VOICES_DIR.glob("*.wav")]


def generate_speech(text: str, voice: str) -> bytes:
    """
    Generate speech using the Orpheus model.

    Args:
        text: Text to convert (may include emotion tags)
        voice: Voice name (built-in or custom)

    Returns:
        WAV audio bytes
    """
    global model

    # Check if it's a custom voice (needs reference audio)
    custom_voice_path = VOICES_DIR / f"{voice}.wav"

    if custom_voice_path.exists():
        # TODO: Implement voice cloning with reference audio
        # For now, fall back to built-in voice
        print(f"Custom voice '{voice}' - voice cloning to be implemented")
        voice = DEFAULT_VOICE
    elif voice not in BUILTIN_VOICES:
        print(f"Unknown voice '{voice}', using default '{DEFAULT_VOICE}'")
        voice = DEFAULT_VOICE

    # Generate speech using Orpheus
    # Note: text is passed as-is; emotion tags like <laugh> are handled by Orpheus
    syn_tokens = model.generate_speech(
        prompt=text,
        voice=voice,
    )

    # Collect and combine audio chunks (raw 16-bit PCM bytes)
    audio_data = b"".join(syn_tokens)

    # Convert to WAV bytes
    buffer = io.BytesIO()
    with wave.open(buffer, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(audio_data)

    return buffer.getvalue()


def save_audio_to_file(job_id: str, audio_bytes: bytes) -> str:
    """Save audio bytes to WAV file."""
    output_path = OUTPUT_DIR / f"{job_id}.wav"
    with open(output_path, 'wb') as f:
        f.write(audio_bytes)
    return str(output_path)


def generate_speech_background(job_id: str, text: str, voice: str):
    """Background task for speech generation."""
    try:
        jobs[job_id].status = JobStatus.PROCESSING
        jobs[job_id].progress = 25
        save_jobs_to_disk()

        # Check cache first
        cache_key = hash_text_voice(text, voice)
        cached_path = get_from_cache(cache_key)

        if cached_path:
            jobs[job_id].audio_path = cached_path
            jobs[job_id].status = JobStatus.SUCCESS
            jobs[job_id].progress = 100
            jobs[job_id].cached = True
            jobs[job_id].completed_at = datetime.now().isoformat()
            save_jobs_to_disk()
            print(f"Job {job_id} completed from cache")
            return

        # Generate audio
        jobs[job_id].progress = 50
        save_jobs_to_disk()

        print(f"Generating audio for job {job_id}...")
        audio_bytes = generate_speech(text, voice)

        # Save to file
        jobs[job_id].progress = 75
        save_jobs_to_disk()

        output_path = save_audio_to_file(job_id, audio_bytes)

        # Save to cache
        save_to_cache(cache_key, output_path)

        # Complete
        jobs[job_id].audio_path = output_path
        jobs[job_id].status = JobStatus.SUCCESS
        jobs[job_id].progress = 100
        jobs[job_id].completed_at = datetime.now().isoformat()
        save_jobs_to_disk()

        print(f"Job {job_id} completed successfully")

    except Exception as e:
        print(f"Job {job_id} failed: {e}")
        import traceback
        traceback.print_exc()
        jobs[job_id].status = JobStatus.FAILURE
        jobs[job_id].error = str(e)
        save_jobs_to_disk()


async def cleanup_old_jobs():
    """Background task to clean up old jobs and files."""
    while True:
        try:
            await asyncio.sleep(CLEANUP_INTERVAL_HOURS * 3600)
            cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)

            to_delete = []
            for job_id, job in jobs.items():
                try:
                    created = datetime.fromisoformat(job.created_at)
                    if created < cutoff:
                        if job.audio_path and Path(job.audio_path).exists():
                            Path(job.audio_path).unlink()
                        to_delete.append(job_id)
                except Exception:
                    pass

            for job_id in to_delete:
                del jobs[job_id]

            if to_delete:
                save_jobs_to_disk()
                print(f"Cleanup: deleted {len(to_delete)} old jobs")

        except Exception as e:
            print(f"Error in cleanup task: {e}")


@app.on_event("startup")
async def startup():
    """Load model and jobs on startup"""
    global model

    print("=" * 60)
    print("OrpheusTail - Orpheus TTS Service Starting")
    print(f"Model: {ORPHEUS_MODEL}")
    print(f"Max Model Len: {MAX_MODEL_LEN}")
    print(f"Cache: {'Enabled' if CACHE_ENABLED else 'Disabled'}")
    print(f"Default Voice: {DEFAULT_VOICE}")
    print("=" * 60)

    # Import and load Orpheus model
    print("Loading Orpheus model (this may take a moment)...")
    from orpheus_tts import OrpheusModel

    model = OrpheusModel(
        model_name=ORPHEUS_MODEL,
        max_model_len=MAX_MODEL_LEN
    )

    print("✓ Orpheus model loaded successfully")

    # Load jobs from disk
    load_jobs_from_disk()

    # Start cleanup task
    asyncio.create_task(cleanup_old_jobs())


# === Pydantic Models ===

class TTSRequest(BaseModel):
    """TTS job submission request"""
    text: str
    voice: str = DEFAULT_VOICE


class TTSStreamRequest(BaseModel):
    """TTS streaming request (for head playback)"""
    text: str
    voice: str = DEFAULT_VOICE


class JobResponse(BaseModel):
    """Job submission response"""
    job_id: str
    status: str


class StatusResponse(BaseModel):
    """Job status response"""
    job_id: str
    status: str
    progress: int
    cached: bool = False
    audio_url: Optional[str] = None
    error: Optional[str] = None


class VoicesResponse(BaseModel):
    """Available voices response"""
    builtin: List[str]
    custom: List[str]
    default: str
    emotion_tags: List[str]


# === Endpoints ===

@app.get("/")
def root():
    """Root endpoint"""
    return {
        "service": "OrpheusTail - Orpheus TTS Service",
        "version": "1.0.0",
        "model": ORPHEUS_MODEL,
        "default_voice": DEFAULT_VOICE,
        "emotion_tags": EMOTION_TAGS,
        "endpoints": {
            "/tts/submit": "POST - Submit TTS job",
            "/tts/status/{job_id}": "GET - Check job status",
            "/tts/audio/{job_id}": "GET - Download audio",
            "/tts/stream": "POST - Stream audio (for head)",
            "/voice/clone": "POST - Upload voice reference",
            "/voices": "GET - List available voices",
            "/health": "GET - Health check"
        }
    }


@app.get("/health")
def health():
    """Health check"""
    return {
        "status": "healthy",
        "model_loaded": model is not None,
        "cache_enabled": CACHE_ENABLED,
        "voices_available": len(BUILTIN_VOICES) + len(get_custom_voices())
    }


@app.get("/voices", response_model=VoicesResponse)
def list_voices():
    """List all available voices"""
    return VoicesResponse(
        builtin=BUILTIN_VOICES,
        custom=get_custom_voices(),
        default=DEFAULT_VOICE,
        emotion_tags=EMOTION_TAGS
    )


@app.post("/tts/submit", response_model=JobResponse)
async def submit_tts_job(request: TTSRequest, background_tasks: BackgroundTasks):
    """Submit a TTS job for processing."""
    job_id = str(uuid.uuid4())

    job = JobInfo(
        job_id=job_id,
        text=request.text,
        voice=request.voice,
        status=JobStatus.PENDING,
        progress=0,
        created_at=datetime.now().isoformat()
    )

    jobs[job_id] = job
    save_jobs_to_disk()

    background_tasks.add_task(
        generate_speech_background,
        job_id,
        request.text,
        request.voice
    )

    print(f"Job {job_id} submitted: '{request.text[:50]}...' with voice '{request.voice}'")

    return JobResponse(job_id=job_id, status=JobStatus.PENDING)


@app.get("/tts/status/{job_id}", response_model=StatusResponse)
async def get_job_status(job_id: str):
    """Get status of a TTS job."""
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="Job not found")

    job = jobs[job_id]

    response = StatusResponse(
        job_id=job_id,
        status=job.status,
        progress=job.progress,
        cached=job.cached
    )

    if job.status == JobStatus.SUCCESS:
        response.audio_url = f"/tts/audio/{job_id}"
    elif job.status == JobStatus.FAILURE:
        response.error = job.error

    return response


@app.get("/tts/audio/{job_id}")
async def get_audio(job_id: str):
    """Retrieve generated audio file."""
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="Job not found")

    job = jobs[job_id]

    if job.status != JobStatus.SUCCESS:
        raise HTTPException(
            status_code=400,
            detail=f"Audio not ready. Job status: {job.status}"
        )

    if not job.audio_path or not Path(job.audio_path).exists():
        raise HTTPException(status_code=404, detail="Audio file not found")

    return FileResponse(
        job.audio_path,
        media_type="audio/wav",
        filename=f"{job_id}.wav"
    )


@app.post("/tts/stream")
async def stream_tts(request: TTSStreamRequest):
    """
    Stream TTS audio in real-time.

    For head-vixy to stream directly without waiting for full generation.
    Returns audio chunks as they're generated.
    """
    global model

    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    voice = request.voice
    if voice not in BUILTIN_VOICES:
        voice = DEFAULT_VOICE

    async def audio_generator():
        """Generate audio chunks"""
        try:
            syn_tokens = model.generate_speech(
                prompt=request.text,
                voice=voice,
            )

            # Note: chunks are raw PCM bytes with no WAV header;
            # the client is responsible for wrapping them if needed
            for audio_chunk in syn_tokens:
                yield audio_chunk

        except Exception as e:
            print(f"Stream error: {e}")
            raise

    return StreamingResponse(
        audio_generator(),
        media_type="audio/wav"
    )


@app.post("/voice/clone")
async def upload_voice_reference(
    name: str,
    audio: UploadFile = File(...),
):
    """
    Upload a reference audio file for voice cloning.

    Args:
        name: Name for this custom voice
        audio: WAV audio file (5-30 seconds recommended)
    """
    if not name.isalnum():
        raise HTTPException(status_code=400, detail="Voice name must be alphanumeric")

    if name in BUILTIN_VOICES:
        raise HTTPException(status_code=400, detail="Cannot overwrite built-in voice")

    # Save the reference audio
    voice_path = VOICES_DIR / f"{name}.wav"

    try:
        content = await audio.read()
        with open(voice_path, 'wb') as f:
            f.write(content)

        return {
            "status": "success",
            "voice_name": name,
            "message": f"Voice '{name}' saved. Use voice='{name}' in TTS requests."
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to save voice: {e}")


@app.delete("/voice/{name}")
async def delete_voice(name: str):
    """Delete a custom voice."""
    if name in BUILTIN_VOICES:
        raise HTTPException(status_code=400, detail="Cannot delete built-in voice")

    voice_path = VOICES_DIR / f"{name}.wav"
    if not voice_path.exists():
        raise HTTPException(status_code=404, detail="Voice not found")

    voice_path.unlink()
    return {"status": "success", "message": f"Voice '{name}' deleted"}


@app.delete("/tts/job/{job_id}")
async def delete_job(job_id: str):
    """Delete a job and its audio file."""
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="Job not found")

    job = jobs[job_id]

    if job.audio_path and Path(job.audio_path).exists():
        try:
            Path(job.audio_path).unlink()
        except Exception:
            pass

    del jobs[job_id]
    save_jobs_to_disk()

    return {"message": f"Job {job_id} deleted"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=8766,  # Same port as VoiceTail for drop-in replacement
        reload=False
    )
requirements.txt (new file, 20 lines)
@@ -0,0 +1,20 @@
# OrpheusTail - Orpheus TTS Service Dependencies

# Web framework
fastapi>=0.104.0
uvicorn[standard]>=0.24.0

# Orpheus TTS
# orpheus-speech is installed separately in the Dockerfile;
# it pulls in vllm as a dependency

# Audio processing
scipy>=1.10.0
numpy>=1.24.0

# Data validation
pydantic>=2.0.0

# Note: PyTorch should already be installed via JetPack
# vllm is pulled in by orpheus-speech
# If the vllm version causes issues, pin it to: vllm==0.7.3