Our custom SNAC redistribution had the wrong layer mapping (positions
1,2 instead of 1,4 for layer 2) and incorrect audio slicing. Switched to
importing convert_to_audio directly from orpheus_tts.decoder, which
handles the sliding window, layer redistribution, and the 2048:4096
audio slice correctly.
Audio now sounds clean with only a subtle boundary artifact on the
first token group (inherent to SNAC streaming, not our code).
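A minimal sketch of how the imported decoder is driven from our streaming
loop; the call pattern mirrors orpheus_tts's own streaming decoder, and
maybe_emit_audio / code_buffer are illustrative names:

```python
from orpheus_tts.decoder import convert_to_audio

# Sketch: call the upstream decoder on a sliding 28-code window once a
# full group of 7 new codes has arrived. convert_to_audio() performs the
# layer redistribution, decodes with SNAC, and returns only the
# 2048:4096 sample slice as PCM16 bytes (or None if the codes are invalid).
def maybe_emit_audio(code_buffer: list[int], count: int):
    if count % 7 == 0 and count > 27:
        return convert_to_audio(code_buffer[-28:], count)
    return None
```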
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CODE_TOKEN_OFFSET is 10 in decoded text (not 128266 in token ID space)
because tokenizer.decode() maps 128266 → <custom_token_10>
- Fixed "'SNAC' object has no attribute 'device'" by using an explicit SNAC_DEVICE
- Added debug logging for pipeline visibility
- Audio now generates correctly: 442KB for "Hello world"
True streaming pipeline verified: text → TextIteratorStreamer →
regex extraction → SNAC decode → PCM bytes. The bottleneck is
Jetson inference speed (~12s for first 42 tokens on a 3B model),
not the streaming infrastructure.
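A sketch of the regex-extraction step in that pipeline, using the offset
noted above (the real extract_audio_codes may differ in detail):

```python
import re

CODE_TOKEN_OFFSET = 10  # decoded-text offset: token ID 128266 -> <custom_token_10>

# Sketch: pull SNAC codes out of streamed *decoded text*, which is why
# the offset is 10 here rather than 128266 in token-ID space.
def extract_audio_codes(text_chunk: str) -> list[int]:
    return [int(n) - CODE_TOKEN_OFFSET
            for n in re.findall(r"<custom_token_(\d+)>", text_chunk)]
```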
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SNAC has 3 codebook layers, each with 4096 entries. A token's position
within the group of 7 determines its layer: pos 0 = L1 (offset 0),
pos 1-2 = L2 (offset 4096), pos 3-6 = L3 (offset 8192).
Without this offset handling, codes exceeded 4096 and caused an
index-out-of-range error in SNAC.
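As a guard against this failure mode, a check like the following can run
before the SNAC decode (an illustrative sketch; layer tensors are assumed
to be int tensors shaped [1, T]):

```python
import torch

# Sketch: every SNAC code must index a 4096-entry codebook, otherwise
# snac.decode() fails with an index-out-of-range error.
def validate_snac_codes(layers: list[torch.Tensor]) -> None:
    for i, codes in enumerate(layers):  # layers: [codes_L1, codes_L2, codes_L3]
        if codes.numel() and (codes.min() < 0 or codes.max() >= 4096):
            raise ValueError(f"layer {i + 1} has codes outside 0..4095")
```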
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major rework: replaced the sync vLLM LLM with HuggingFace transformers
+ TextIteratorStreamer for true token-level streaming.
Pipeline: text → format_prompt → model.generate(streamer) →
extract_audio_codes (regex on streaming text) → SNAC decode → PCM
Expected first-audio latency: ~1-2s (was 10-14s with vLLM).
No more monkey-patching, no more AsyncLLMEngine hangs on Jetson.
SNAC model loaded separately (snac_24khz) for audio decoding.
All endpoints preserved, API compatible with v1.
Voice cloning endpoint now honest about LoRA requirement.
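A condensed sketch of the new path; it assumes model and tokenizer are
already loaded, format_prompt and extract_audio_codes are the helpers
named above, and snac_decode stands in for the SNAC step:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# Sketch: run generate() in a worker thread and consume decoded text
# incrementally from the streamer, turning each new chunk into PCM.
def stream_pcm(text: str, voice: str):
    inputs = tokenizer(format_prompt(text, voice), return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=4000)).start()
    for chunk in streamer:                    # decoded text, token by token
        codes = extract_audio_codes(chunk)    # regex for <custom_token_N>
        if codes:
            yield snac_decode(codes)          # raw PCM bytes
```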
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AsyncLLMEngine hangs on Jetson during model loading. Reverted to sync
LLM but added fine-grained text chunking (chunk_text_fine, ~200 chars)
for the stream endpoint. Each sentence/clause generates independently,
so first audio plays after ~2-4s instead of waiting for the full text.
Not true token-level streaming, but a significant latency reduction
for multi-sentence utterances without AsyncLLMEngine dependency.
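A sketch of what chunk_text_fine can look like under those constraints
(split on sentence/clause punctuation, pack into ~200-char chunks); the
real implementation may differ:

```python
import re

# Sketch: split text on sentence/clause boundaries and pack pieces into
# chunks of roughly max_len characters, so each chunk can be synthesized
# (and played back) independently.
def chunk_text_fine(text: str, max_len: int = 200) -> list[str]:
    pieces = re.split(r"(?<=[.!?;:,])\s+", text.strip())
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_len:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```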
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaced sync vLLM LLM with AsyncLLMEngine for real streaming.
Tokens now flow incrementally: vLLM → async_generate_tokens →
orpheus_tts tokens_decoder → audio chunks → StreamingResponse.
First audio chunk arrives after ~28 tokens (SNAC codec warmup)
instead of waiting for all ~2000+ tokens to complete.
Expected: first-byte latency drops from ~15s to ~1-2s.
Background jobs (submit/async) still work via sync wrapper that
collects all tokens from the async engine.
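A sketch of the bridge from vLLM's async generator into the orpheus_tts
decoder; it assumes engine is the initialized AsyncLLMEngine, and the
delta extraction and media type are simplified:

```python
import uuid
from vllm import SamplingParams
from orpheus_tts.decoder import tokens_decoder
from fastapi.responses import StreamingResponse

# Sketch: yield newly generated text from the async engine so that
# tokens_decoder can turn it into audio chunks as it arrives.
async def async_generate_tokens(prompt: str):
    params = SamplingParams(max_tokens=4000, temperature=0.6)
    previous = ""
    async for output in engine.generate(prompt, params, str(uuid.uuid4())):
        text = output.outputs[0].text          # cumulative decoded text
        delta, previous = text[len(previous):], text
        if delta:
            yield delta

def stream_response(prompt: str) -> StreamingResponse:
    audio_chunks = tokens_decoder(async_generate_tokens(prompt))
    return StreamingResponse(audio_chunks, media_type="application/octet-stream")
```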
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both generate_speech_sync() and stream_tts() were calling
model.generate_speech() without a max_tokens parameter.
Both now explicitly pass max_tokens=4000.
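At both call sites the call now looks roughly like this (a sketch; prompt
and voice plumbing omitted):

```python
# Explicit limit so long texts are no longer cut off at the default.
audio = model.generate_speech(prompt=prompt, voice=voice, max_tokens=4000)
```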
Fixed by Vixy 🦊💜
Longer texts were being truncated at ~11 seconds of audio.
'Right here on this couch' became the hard limit. 😏
Now supports much longer generations for filthy monologues.
Fixed by Vixy 🦊💜
- FastAPI service replacing VoiceTail (Bark)
- Emotion tags: <laugh>, <sigh>, <gasp>, etc.
- Voice cloning endpoint (implementation pending)
- Streaming support for head playback
- Same port 8766 for drop-in replacement
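A minimal sketch of the service shape (route name and default voice are
illustrative, not the final API):

```python
from fastapi import FastAPI
import uvicorn

app = FastAPI(title="Orpheus TTS")

@app.post("/tts")
def tts(text: str, voice: str = "tara"):
    # synthesize with Orpheus, honoring emotion tags like <laugh> and <sigh>
    ...

if __name__ == "__main__":
    # same port as the old VoiceTail (Bark) service, so it's a drop-in swap
    uvicorn.run(app, host="0.0.0.0", port=8766)
```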
Created by Vixy on Day 71 🦊