AI Pipeline
🚧 Coming soon. Architecture deep-dive lands with the first backend release.
WaveKat Voice streams audio through four stages (VAD, ASR, LLM, TTS), overlapping work at every stage boundary, and targets a p95 end-to-end latency under 400 ms:
caller audio ──▶ VAD ──▶ ASR ──▶ LLM ──▶ TTS ──▶ caller audio
                                  │
                                  └─▶ tool calls (booking, lookup, transfer)
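As a rough illustration of what "overlap at every boundary" can look like, the sketch below runs each stage on its own thread and connects them with channels, so a downstream stage can start on the first item while upstream stages are still working on later ones. It is not the WaveKat implementation: `spawn_stage`, `run_pipeline`, the amplitude threshold, and the string-producing stand-ins for ASR, LLM, and TTS are all hypothetical, and a real backend may well use async tasks rather than OS threads.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: runs `f` on its own thread, reading items from `rx`
/// and forwarding every `Some` result to the returned receiver.
fn spawn_stage<I, O, F>(rx: mpsc::Receiver<I>, mut f: F) -> mpsc::Receiver<O>
where
    I: Send + 'static,
    O: Send + 'static,
    F: FnMut(I) -> Option<O> + Send + 'static,
{
    let (tx, out) = mpsc::channel();
    thread::spawn(move || {
        // The loop ends once the upstream sender has been dropped.
        for item in rx {
            if let Some(result) = f(item) {
                if tx.send(result).is_err() {
                    break; // downstream hung up
                }
            }
        }
    });
    out
}

/// Feeds PCM chunks through four stand-in stages and collects the output.
pub fn run_pipeline(chunks: Vec<Vec<i16>>) -> Vec<String> {
    let (mic_tx, mic_rx) = mpsc::channel::<Vec<i16>>();

    // VAD stand-in: keep only chunks whose peak amplitude exceeds an
    // arbitrary threshold (an assumption, not a real VAD).
    let speech = spawn_stage(mic_rx, |chunk: Vec<i16>| {
        let peak = chunk.iter().map(|s| s.unsigned_abs()).max().unwrap_or(0);
        if peak > 500 { Some(chunk) } else { None }
    });
    // ASR stand-in: "transcribes" a chunk into a description of its length.
    let text = spawn_stage(speech, |chunk: Vec<i16>| {
        Some(format!("heard {} samples", chunk.len()))
    });
    // LLM stand-in: wraps the transcript in a reply.
    let reply = spawn_stage(text, |t: String| Some(format!("reply to: {}", t)));
    // TTS stand-in: wraps the reply in a fake audio marker.
    let audio = spawn_stage(reply, |r: String| Some(format!("audio<{}>", r)));

    for chunk in chunks {
        mic_tx.send(chunk).expect("VAD stage stopped unexpectedly");
    }
    drop(mic_tx); // closing the input lets every stage drain and exit

    audio.into_iter().collect()
}
```

Calling `run_pipeline` with a mix of quiet and loud chunks returns one output string per chunk that passed the VAD stand-in, in input order, because each stage is a single thread consuming a FIFO channel.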
Planned topics:
- The Frame model (a Rust port of Pipecat’s streaming Frame concept)
- VAD backend choices (WebRTC, Silero, TEN-VAD, FireRedVAD)
- Turn detection vs VAD — why they’re different
- ASR backends: local (Whisper.cpp, SenseVoice) vs cloud (Deepgram, OpenAI Realtime)
- LLM provider abstraction (OpenAI-compatible, llama.cpp, Ollama, Anthropic)
- TTS backends and voice cloning
- Barge-in and interruption handling
- Latency budget breakdown
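Until the deep-dive is published, the following sketch gives a feel for how a Frame type and barge-in handling might fit together. Everything here is an assumption made for illustration: the `Frame` variants, the `Playback` type, and its methods are hypothetical and are not taken from WaveKat or from Pipecat's API. The idea shown is only that an interruption travels through the pipeline as a frame, and that the playback side reacts by discarding any synthesized audio it has not yet played.

```rust
use std::collections::VecDeque;

/// Hypothetical frame type: one enum covering every kind of data that
/// flows between stages.
#[derive(Debug, Clone, PartialEq)]
pub enum Frame {
    AudioIn(Vec<i16>),  // caller PCM
    Transcript(String), // ASR output
    LlmText(String),    // LLM output
    AudioOut(Vec<i16>), // synthesized PCM waiting to be played
    Interrupt,          // caller started speaking over the agent
}

/// Hypothetical playback buffer sitting after the TTS stage.
pub struct Playback {
    queue: VecDeque<Vec<i16>>,
}

impl Playback {
    pub fn new() -> Self {
        Playback { queue: VecDeque::new() }
    }

    /// Handles one frame and returns how many queued chunks were discarded.
    pub fn handle(&mut self, frame: Frame) -> usize {
        match frame {
            Frame::AudioOut(pcm) => {
                self.queue.push_back(pcm);
                0
            }
            Frame::Interrupt => {
                // Barge-in: drop everything not yet played.
                let dropped = self.queue.len();
                self.queue.clear();
                dropped
            }
            // Other frame kinds are not this stage's concern.
            _ => 0,
        }
    }

    /// Returns the next chunk to play, if any.
    pub fn next_chunk(&mut self) -> Option<Vec<i16>> {
        self.queue.pop_front()
    }
}
```

Modelling the interruption as a frame, rather than as a side channel, keeps it ordered relative to the audio around it, which is one plausible reason to route it through the same stream.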