Major rework: replace vLLM sync LLM with HF transformers + TextIteratorStreamer

Replaces the vLLM synchronous LLM with HuggingFace transformers and
TextIteratorStreamer for true token-level streaming.

Pipeline: text → format_prompt → model.generate(streamer) →
extract_audio_codes (regex on streaming text) → SNAC decode → PCM

- Expected first-audio latency: ~1-2 s (was 10-14 s with vLLM).
- No more monkey-patching; no more AsyncLLMEngine hangs on Jetson.
- SNAC model (snac_24khz) loaded separately for audio decoding.
- All endpoints preserved; API compatible with v1.
- Voice cloning endpoint is now honest about the LoRA requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
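A minimal sketch of the streaming extraction step named in the pipeline above. The `extract_audio_codes` name comes from this commit, but the audio-token text format (`<custom_token_N>`) and the chunk-buffering logic are assumptions for illustration; the real model's token syntax may differ. The key detail is buffering a possibly partial token across chunk boundaries so codes can be yielded as soon as they complete:

```python
import re
from typing import Iterable, Iterator

# Hypothetical audio-token format; the actual model's tokens may differ.
AUDIO_CODE_RE = re.compile(r"<custom_token_(\d+)>")

def extract_audio_codes(chunks: Iterable[str]) -> Iterator[int]:
    """Scan streaming text chunks (e.g. from TextIteratorStreamer) and
    yield each audio code id as soon as its complete token appears."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        last_end = 0
        for m in AUDIO_CODE_RE.finditer(buf):
            yield int(m.group(1))
            last_end = m.end()
        # Keep only the unconsumed tail: a token may be split mid-chunk.
        buf = buf[last_end:]

# A token split across two chunks is still recovered intact:
stream = ["hello <custom_to", "ken_42><custom_token_7", ">"]
print(list(extract_audio_codes(stream)))  # [42, 7]
```

Yielding codes incrementally like this is what lets SNAC decoding start before generation finishes, which is where the first-audio latency win comes from.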