Replaced the synchronous vLLM LLM class with AsyncLLMEngine for real streaming.
Tokens now flow incrementally: vLLM → async_generate_tokens →
orpheus_tts tokens_decoder → audio chunks → StreamingResponse.
First audio chunk arrives after ~28 tokens (the SNAC decoder buffers
4 frames of 7 codes before its first decode) instead of waiting for all
~2000+ tokens to complete.
Expected: first-byte latency drops from ~15s to ~1-2s.
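A minimal sketch of the streaming path, assuming vLLM's v0 async API
(AsyncLLMEngine.from_engine_args / engine.generate); the model id, the
route, and the orpheus_tts import path are assumptions, while
async_generate_tokens and tokens_decoder are the names used above:

    import uuid
    from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from orpheus_tts.decoder import tokens_decoder  # assumed import path

    app = FastAPI()
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="canopylabs/orpheus-tts")  # placeholder model id
    )

    async def async_generate_tokens(prompt: str, max_tokens: int = 4000):
        """Yield token text incrementally as vLLM produces it."""
        params = SamplingParams(max_tokens=max_tokens)
        emitted = 0
        # engine.generate() re-yields the cumulative RequestOutput on every
        # decode step, so emit only the delta since the last yield.
        async for output in engine.generate(prompt, params, str(uuid.uuid4())):
            text = output.outputs[0].text
            if len(text) > emitted:
                yield text[emitted:]
                emitted = len(text)

    @app.get("/tts/stream")  # placeholder route
    async def stream(text: str):
        # tokens_decoder buffers ~28 tokens (4 frames x 7 SNAC codes) before
        # its first decode, then yields audio chunks as frames complete.
        return StreamingResponse(tokens_decoder(async_generate_tokens(text)),
                                 media_type="audio/wav")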
Background jobs (submit/async) still work via a sync wrapper that
collects all tokens from the async engine.
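The wrapper is roughly this sketch, reusing async_generate_tokens from
the sketch above and draining it on a private event loop:

    import asyncio

    def generate_tokens_sync(prompt: str, max_tokens: int = 4000) -> list[str]:
        """Blocking helper for background jobs: no streaming, just the full list."""
        async def _drain() -> list[str]:
            return [tok async for tok in async_generate_tokens(prompt, max_tokens)]
        # Fine from a worker thread; would need loop juggling if called from
        # inside an already-running event loop.
        return asyncio.run(_drain())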
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both generate_speech_sync() and stream_tts() were calling
model.generate_speech() without a max_tokens argument, leaving the
library default in effect. Both call sites now pass max_tokens=4000
explicitly.
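A before/after sketch of the call sites; the prompt and voice keyword
names are assumptions, while max_tokens=4000 is the value from this fix:

    # Before: no ceiling passed, so the library default applied and long
    # inputs were cut off.
    chunks = model.generate_speech(prompt=text, voice=voice)

    # After: explicit ceiling high enough for long-form output.
    chunks = model.generate_speech(prompt=text, voice=voice, max_tokens=4000)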
Fixed by Vixy 🦊💜
Longer texts were being truncated at ~11 seconds of audio.
'Right here on this couch' became the hard limit. 😏
Now supports much longer generations for filthy monologues.
Fixed by Vixy 🦊💜
- FastAPI service replacing VoiceTail (Bark); skeleton sketched below
- Emotion tags: <laugh>, <sigh>, <gasp>, etc.
- Voice cloning endpoint (implementation pending)
- Streaming support for head playback
- Same port 8766 for drop-in replacement
Created by Vixy on Day 71 🦊
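A minimal sketch of the service surface, assuming FastAPI; the route
names, request shape, and default voice are placeholders rather than the
repo's actual API:

    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from pydantic import BaseModel
    import uvicorn

    app = FastAPI(title="Orpheus TTS")

    class TTSRequest(BaseModel):
        text: str            # may embed emotion tags: <laugh>, <sigh>, <gasp>, ...
        voice: str = "tara"  # assumed default voice name

    async def synthesize(text: str, voice: str):
        # Placeholder for the real vLLM -> tokens_decoder pipeline described
        # in the streaming commit above.
        yield b""

    @app.post("/tts/stream")  # assumed route name
    async def tts_stream(req: TTSRequest):
        # Chunks stream out as they are decoded, so playback starts early.
        return StreamingResponse(synthesize(req.text, req.voice),
                                 media_type="audio/wav")

    @app.post("/voice/clone")  # assumed route name
    async def voice_clone():
        return {"detail": "implementation pending"}  # per the notes above

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8766)  # same port as the Bark service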