AI Pipeline

🚧 Coming soon. Architecture deep-dive lands with the first backend release.

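WaveKat Voice streams audio through four stages: VAD, ASR, LLM, and TTS. The stages overlap at every boundary, so each one starts working on its predecessor's partial output instead of waiting for it to finish. The target is a p95 end-to-end latency under 400 ms: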

caller audio ──▶ VAD ──▶ ASR ──▶ LLM ──▶ TTS ──▶ caller audio
                                 │
                                 └─▶ tool calls (booking, lookup, transfer)
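
Until the deep-dive lands, the overlap in the diagram can be illustrated with a minimal sketch. It is not the WaveKat implementation: the `Frame` enum, `spawn_stage`, `run_pipeline`, the amplitude threshold, and the string-based stand-ins for ASR, LLM, and TTS are all hypothetical names and placeholders chosen for illustration. The sketch assumes one thread per stage connected by standard-library channels, and it leaves out the tool-call branch.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical frame type for illustration; the real Frame model is not published yet.
#[derive(Debug, Clone, PartialEq)]
pub enum Frame {
    Audio(Vec<i16>),    // raw PCM chunk from the caller
    Speech(Vec<i16>),   // chunk the VAD stand-in kept as speech
    Transcript(String), // ASR stand-in output
    Reply(String),      // LLM stand-in output
    Synth(Vec<i16>),    // TTS stand-in output
}

/// Runs `step` on its own thread and forwards each result as soon as it exists,
/// so the next stage can start before this one has seen all of its input.
fn spawn_stage<F>(input: mpsc::Receiver<Frame>, step: F) -> mpsc::Receiver<Frame>
where
    F: Fn(Frame) -> Option<Frame> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for frame in input {
            if let Some(out) = step(frame) {
                if tx.send(out).is_err() {
                    break; // downstream stage is gone
                }
            }
        }
        // `tx` is dropped here, which closes the downstream channel.
    });
    rx
}

/// Wires VAD -> ASR -> LLM -> TTS with placeholder logic and collects the output.
pub fn run_pipeline(chunks: Vec<Vec<i16>>) -> Vec<Frame> {
    let (mic_tx, mic_rx) = mpsc::channel();

    // VAD stand-in: keep chunks whose peak amplitude exceeds a made-up threshold.
    let vad = spawn_stage(mic_rx, |f| match f {
        Frame::Audio(pcm) if pcm.iter().any(|s| s.unsigned_abs() > 500) => {
            Some(Frame::Speech(pcm))
        }
        _ => None,
    });
    // ASR stand-in: describe the chunk instead of transcribing it.
    let asr = spawn_stage(vad, |f| match f {
        Frame::Speech(pcm) => Some(Frame::Transcript(format!("<{} samples>", pcm.len()))),
        _ => None,
    });
    // LLM stand-in: echo the transcript back as a reply.
    let llm = spawn_stage(asr, |f| match f {
        Frame::Transcript(text) => Some(Frame::Reply(format!("reply to {}", text))),
        _ => None,
    });
    // TTS stand-in: emit one silent sample per byte of reply text.
    let tts = spawn_stage(llm, |f| match f {
        Frame::Reply(text) => Some(Frame::Synth(vec![0; text.len()])),
        _ => None,
    });

    for chunk in chunks {
        mic_tx.send(Frame::Audio(chunk)).expect("VAD stage hung up");
    }
    drop(mic_tx); // signal end of caller audio so every stage shuts down in turn

    tts.into_iter().collect()
}
```

Calling `run_pipeline` with one silent chunk and two loud chunks yields two `Frame::Synth` values, because the VAD stand-in drops the silent chunk before it reaches later stages. Unbounded channels keep the sketch short; a real pipeline would more likely use bounded or async channels to apply backpressure.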

Planned topics:

The Frame model (a Rust port of Pipecat’s streaming Frame concept)
VAD backend choices (WebRTC, Silero, TEN-VAD, FireRedVAD)
Turn detection vs. VAD: why they’re different
ASR backends: local (Whisper.cpp, SenseVoice) vs cloud (Deepgram, OpenAI Realtime)
LLM provider abstraction (OpenAI-compatible, llama.cpp, Ollama, Anthropic)
TTS backends and voice cloning
Barge-in and interruption handling
Latency budget breakdown