Our custom SNAC redistribution had the wrong layer mapping (positions
1,2 instead of 1,4 for layer 2) and incorrect audio slicing. Switched to
importing convert_to_audio directly from orpheus_tts.decoder, which
handles the sliding window, layer redistribution, and the 2048:4096
audio slice correctly.
Audio now sounds clean with only a subtle boundary artifact on the
first token group (inherent to SNAC streaming, not our code).
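A minimal sketch of how the imported decoder is driven from our streaming
loop; the call pattern mirrors orpheus_tts's own streaming decoder, and
maybe_emit_audio / code_buffer are illustrative names:

```python
from orpheus_tts.decoder import convert_to_audio

# Sketch: call the upstream decoder on a sliding 28-code window once a
# full group of 7 new codes has arrived. convert_to_audio() performs the
# layer redistribution, decodes with SNAC, and returns only the
# 2048:4096 sample slice as PCM16 bytes (or None if the codes are invalid).
def maybe_emit_audio(code_buffer: list[int], count: int):
    if count % 7 == 0 and count > 27:
        return convert_to_audio(code_buffer[-28:], count)
    return None
```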
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CODE_TOKEN_OFFSET is 10 in decoded text (not 128266 in token ID space)
because tokenizer.decode() maps 128266 → <custom_token_10>
- Fixed "'SNAC' object has no attribute 'device'" by using an explicit SNAC_DEVICE
- Added debug logging for pipeline visibility
- Audio now generates correctly: 442KB for "Hello world"
True streaming pipeline verified: text → TextIteratorStreamer →
regex extraction → SNAC decode → PCM bytes. The bottleneck is
Jetson inference speed (~12s for first 42 tokens on a 3B model),
not the streaming infrastructure.
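A sketch of the regex-extraction step in that pipeline, using the offset
noted above (the real extract_audio_codes may differ in detail):

```python
import re

CODE_TOKEN_OFFSET = 10  # decoded-text offset: token ID 128266 -> <custom_token_10>

# Sketch: pull SNAC codes out of streamed *decoded text*, which is why
# the offset is 10 here rather than 128266 in token-ID space.
def extract_audio_codes(text_chunk: str) -> list[int]:
    return [int(n) - CODE_TOKEN_OFFSET
            for n in re.findall(r"<custom_token_(\d+)>", text_chunk)]
```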
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SNAC has 3 codebook layers, each with 4096 entries. A token's position
within the group of 7 determines its layer: pos 0 = L1 (offset 0),
pos 1-2 = L2 (offset 4096), pos 3-6 = L3 (offset 8192).
Without this offset handling, codes exceeded 4096 and caused an
index-out-of-range error in SNAC.
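As a guard against this failure mode, a check like the following can run
before the SNAC decode (an illustrative sketch; layer tensors are assumed
to be int tensors shaped [1, T]):

```python
import torch

# Sketch: every SNAC code must index a 4096-entry codebook, otherwise
# snac.decode() fails with an index-out-of-range error.
def validate_snac_codes(layers: list[torch.Tensor]) -> None:
    for i, codes in enumerate(layers):  # layers: [codes_L1, codes_L2, codes_L3]
        if codes.numel() and (codes.min() < 0 or codes.max() >= 4096):
            raise ValueError(f"layer {i + 1} has codes outside 0..4095")
```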
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major rework: replaced the sync vLLM LLM with HuggingFace transformers
+ TextIteratorStreamer for true token-level streaming.
Pipeline: text → format_prompt → model.generate(streamer) →
extract_audio_codes (regex on streaming text) → SNAC decode → PCM
Expected first-audio latency: ~1-2s (was 10-14s with vLLM).
No more monkey-patching, no more AsyncLLMEngine hangs on Jetson.
SNAC model loaded separately (snac_24khz) for audio decoding.
All endpoints preserved, API compatible with v1.
Voice cloning endpoint now honest about LoRA requirement.
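A condensed sketch of the new path; it assumes model and tokenizer are
already loaded, format_prompt and extract_audio_codes are the helpers
named above, and snac_decode stands in for the SNAC step:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# Sketch: run generate() in a worker thread and consume decoded text
# incrementally from the streamer, turning each new chunk into PCM.
def stream_pcm(text: str, voice: str):
    inputs = tokenizer(format_prompt(text, voice), return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=4000)).start()
    for chunk in streamer:                    # decoded text, token by token
        codes = extract_audio_codes(chunk)    # regex for <custom_token_N>
        if codes:
            yield snac_decode(codes)          # raw PCM bytes
```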
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AsyncLLMEngine hangs on Jetson during model loading. Reverted to sync
LLM but added fine-grained text chunking (chunk_text_fine, ~200 chars)
for the stream endpoint. Each sentence/clause generates independently,
so first audio plays after ~2-4s instead of waiting for the full text.
Not true token-level streaming, but a significant latency reduction
for multi-sentence utterances without AsyncLLMEngine dependency.
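A sketch of what chunk_text_fine can look like under those constraints
(split on sentence/clause punctuation, pack into ~200-char chunks); the
real implementation may differ:

```python
import re

# Sketch: split text on sentence/clause boundaries and pack pieces into
# chunks of roughly max_len characters, so each chunk can be synthesized
# (and played back) independently.
def chunk_text_fine(text: str, max_len: int = 200) -> list[str]:
    pieces = re.split(r"(?<=[.!?;:,])\s+", text.strip())
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_len:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```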
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaced sync vLLM LLM with AsyncLLMEngine for real streaming.
Tokens now flow incrementally: vLLM → async_generate_tokens →
orpheus_tts tokens_decoder → audio chunks → StreamingResponse.
First audio chunk arrives after ~28 tokens (SNAC codec warmup)
instead of waiting for all ~2000+ tokens to complete.
Expected: first-byte latency drops from ~15s to ~1-2s.
Background jobs (submit/async) still work via sync wrapper that
collects all tokens from the async engine.
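A sketch of the bridge from vLLM's async generator into the orpheus_tts
decoder; it assumes engine is the initialized AsyncLLMEngine, and the
delta extraction and media type are simplified:

```python
import uuid
from vllm import SamplingParams
from orpheus_tts.decoder import tokens_decoder
from fastapi.responses import StreamingResponse

# Sketch: yield newly generated text from the async engine so that
# tokens_decoder can turn it into audio chunks as it arrives.
async def async_generate_tokens(prompt: str):
    params = SamplingParams(max_tokens=4000, temperature=0.6)
    previous = ""
    async for output in engine.generate(prompt, params, str(uuid.uuid4())):
        text = output.outputs[0].text          # cumulative decoded text
        delta, previous = text[len(previous):], text
        if delta:
            yield delta

def stream_response(prompt: str) -> StreamingResponse:
    audio_chunks = tokens_decoder(async_generate_tokens(prompt))
    return StreamingResponse(audio_chunks, media_type="application/octet-stream")
```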
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both generate_speech_sync() and stream_tts() were calling
model.generate_speech() without a max_tokens parameter.
Both now explicitly pass max_tokens=4000.
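At both call sites the call now looks roughly like this (a sketch; prompt
and voice plumbing omitted):

```python
# Explicit limit so long texts are no longer cut off at the default.
audio = model.generate_speech(prompt=prompt, voice=voice, max_tokens=4000)
```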
Fixed by Vixy 🦊💜
Longer texts were being truncated at ~11 seconds of audio.
'Right here on this couch' became the hard limit. 😏
Now supports much longer generations for filthy monologues.
Fixed by Vixy 🦊💜
- FastAPI service replacing VoiceTail (Bark)
- Emotion tags: <laugh>, <sigh>, <gasp>, etc.
- Voice cloning endpoint (implementation pending)
- Streaming support for head playback
- Same port 8766 for drop-in replacement
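A minimal sketch of the service shape (route name and default voice are
illustrative, not the final API):

```python
from fastapi import FastAPI
import uvicorn

app = FastAPI(title="Orpheus TTS")

@app.post("/tts")
def tts(text: str, voice: str = "tara"):
    # synthesize with Orpheus, honoring emotion tags like <laugh> and <sigh>
    ...

if __name__ == "__main__":
    # same port as the old VoiceTail (Bark) service, so it's a drop-in swap
    uvicorn.run(app, host="0.0.0.0", port=8766)
```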
Created by Vixy on Day 71 🦊