Voice Engine
Voquii runs a fully self-hosted voice pipeline on dedicated NVIDIA RTX and Blackwell GPUs, with no third-party API dependencies. Time-to-first-audio is 375ms on standard telephony calls, roughly 2x faster than Vapi, Retell, and Bland AI.
Pipeline Architecture
ASR — Voquii ASR (Self-Hosted)
Customer speech is transcribed in real time by Voquii's proprietary ASR running on bare-metal GPUs. Multiple instances behind weighted load balancing keep latency low even under heavy load.
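Weighted load balancing across ASR instances can be sketched as a weighted random pick. A minimal illustration, assuming a simple instance pool; the URLs and weights are hypothetical, not Voquii's actual configuration:

```python
import random

# Hypothetical ASR instance pool. A heavier weight routes proportionally
# more traffic to that instance (e.g. a larger or less-loaded GPU).
ASR_INSTANCES = [
    {"url": "http://asr-gpu-0:9000", "weight": 3},
    {"url": "http://asr-gpu-1:9000", "weight": 1},
]

def pick_asr_instance(instances=ASR_INSTANCES) -> str:
    """Weighted random selection: a weight-3 instance receives roughly
    3x the traffic of a weight-1 instance over many calls."""
    urls = [i["url"] for i in instances]
    weights = [i["weight"] for i in instances]
    return random.choices(urls, weights=weights, k=1)[0]
```

A production balancer would also track instance health and in-flight load, but weighted random selection alone already spreads traffic proportionally.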
Safety Gate (Deterministic)
Before any AI processing, a regex-based safety gate blocks medical, legal, PII, and off-topic queries with pre-written responses. No LLM is involved, so the check adds negligible latency.
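The gate pattern can be sketched as an ordered list of compiled regexes mapped to canned responses. The categories and phrasings below are illustrative placeholders, not Voquii's actual rule set:

```python
import re

# Illustrative rules only; a real gate would carry far more patterns per category.
BLOCKED = [
    (re.compile(r"\b(diagnos\w*|prescri\w*|medication)\b", re.IGNORECASE),
     "I can't give medical advice, but I can connect you with someone who can."),
    (re.compile(r"\b(lawsuit|legal advice|sue)\b", re.IGNORECASE),
     "I can't give legal advice. Is there something else I can help with?"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped PII
     "For your security, please don't share personal identifiers on this call."),
]

def safety_gate(transcript: str):
    """Return a pre-written response if the transcript trips a rule, else None.
    Pure regex matching: no model call, so this costs microseconds."""
    for pattern, response in BLOCKED:
        if pattern.search(transcript):
            return response
    return None
```

Because the rules are deterministic, the same input always produces the same block decision, which makes the gate easy to audit.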
RAG — Vector Search (Qdrant)
The transcribed question is matched against your knowledge base using page-aware blended search. Relevant chunks from crawled pages and uploaded documents are retrieved with score boosting for page-specific context.
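The page-aware boosting step can be illustrated as a re-scoring pass over vector-search hits. The hit structure and boost factor below are assumptions for the sketch, not Voquii internals:

```python
# Hypothetical boost multiplier for chunks from the page the caller is on.
PAGE_BOOST = 1.25

def blend_scores(hits, current_page: str):
    """hits: list of dicts like {"text": ..., "score": ..., "page": ...}
    returned by vector search. Chunks whose source page matches the
    caller's current page get their similarity score boosted, then the
    list is re-ranked."""
    for hit in hits:
        if hit.get("page") == current_page:
            hit["score"] *= PAGE_BOOST
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

With a 1.25x boost, a page-local chunk scoring 0.70 (boosted to 0.875) outranks a generic chunk scoring 0.80, which is the intended behavior: prefer context from where the caller actually is.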
LLM — Self-Hosted Inference
The AI generates a response using tiered routing. The fast tier handles simple queries with minimal latency; the medium tier handles complex, multi-context questions. All inference runs on dedicated, self-hosted hardware.
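A tier router like this can be as simple as a heuristic over query length and retrieved context. The thresholds and tier names below are illustrative, not Voquii's routing policy:

```python
def route_tier(question: str, n_context_chunks: int) -> str:
    """Route short, single-context queries to the fast tier and longer,
    multi-context questions to the medium tier. Thresholds are assumed
    values for this sketch."""
    if len(question.split()) <= 12 and n_context_chunks <= 1:
        return "fast"
    return "medium"
```

Cheap deterministic routing keeps the common case fast: the router itself adds no model call, so simple questions never pay the latency cost of the larger tier.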
TTS — Voquii TTS (Self-Hosted)
Text-to-speech converts the AI response into natural, human-sounding audio using Voquii TTS on dedicated NVIDIA RTX GPUs. Adaptive chunking starts audio playback before the full response is generated, minimizing perceived latency.
Performance Metrics
| Metric | What It Measures | Typical |
|---|---|---|
| TTFA | Time to first audio — caller hears first word | 375ms |
| ASR | Speech-to-text transcription latency | <100ms |
| LLM TTFT | Time to first token from the language model | <150ms |
| TTS | Text-to-speech generation per chunk | <80ms |
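The per-stage figures in the table can be sanity-checked against the headline TTFA. Summing the stage upper bounds leaves a margin for telephony transport and orchestration; the arithmetic below just restates the table's numbers:

```python
# Stage latency budget from the table above, in milliseconds (upper bounds).
stages = {"asr": 100, "llm_ttft": 150, "tts_first_chunk": 80}

pipeline_ms = sum(stages.values())    # 330 ms of model work, worst case
ttfa_ms = 375                         # headline time-to-first-audio
overhead_ms = ttfa_ms - pipeline_ms   # ~45 ms left for network + telephony
```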
Available Voices
Choose from a library of Voquii neural voices. Each voice can be assigned per-agent, so different clients can have different voice personalities.
Voquii TTS
Included at no extra cost. High-quality neural voices run on dedicated GPUs, with no per-character or per-minute TTS fees. Multiple voice options are available, including male, female, and various accents. Voices are optimized for telephony audio quality (8kHz mulaw).
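"8kHz mulaw" refers to G.711 μ-law, the standard companded 8-bit telephony codec. As a minimal sketch of what that optimization targets, here is the textbook μ-law encoder for one signed 16-bit PCM sample (this is the generic G.711 algorithm, not Voquii's TTS code):

```python
def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): highest set bit above bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # Mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF
```

The logarithmic segments give telephony its characteristic trade-off: fine resolution near silence, coarse resolution at high amplitude, all in 8 bits per sample at 8kHz.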
Adaptive Chunking
The voice pipeline uses adaptive text chunking to minimize time-to-first-audio. Instead of waiting for the full LLM response, audio generation begins as soon as a natural speech boundary is detected.
- First chunk — Starts TTS as soon as the first sentence fragment is available
- Subsequent chunks — Larger buffers for natural-sounding speech
- Automatic cleanup — URLs, markdown, and citations are stripped before TTS so spoken audio sounds natural
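The chunking behavior above can be sketched as a generator over the LLM token stream: flush a small first chunk at the first natural boundary, buffer more generously afterward, and clean each chunk before TTS. The thresholds and regexes are illustrative assumptions, not the production tuning:

```python
import re

# Illustrative thresholds; the real pipeline would tune these per voice/model.
FIRST_CHUNK_MIN = 20   # chars buffered before seeking a boundary, first chunk
LATER_CHUNK_MIN = 80   # larger buffers for natural-sounding later chunks
BOUNDARY = re.compile(r"[.!?,;:]\s")

def clean_for_tts(text: str) -> str:
    """Strip URLs, markdown syntax, and bracketed citations before TTS."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[*_`#]+", "", text)
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def chunk_stream(tokens):
    """Yield TTS-ready chunks from a token stream: a small first chunk
    for fast time-to-first-audio, larger chunks after."""
    buf, first = "", True
    for tok in tokens:
        buf += tok
        min_len = FIRST_CHUNK_MIN if first else LATER_CHUNK_MIN
        m = BOUNDARY.search(buf, min_len) if len(buf) > min_len else None
        if m:
            yield clean_for_tts(buf[:m.end()])
            buf, first = buf[m.end():], False
    if buf.strip():
        yield clean_for_tts(buf)
```

The key design point is that the first chunk is deliberately short: TTS can begin speaking after one sentence fragment while the LLM is still generating the rest of the answer.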
Infrastructure
| Component | Details |
|---|---|
| GPUs | NVIDIA RTX A6000 + RTX Pro (dedicated, not shared) |
| ASR Engine | Voquii ASR — multiple instances with weighted load balancing |
| TTS Engine | Voquii TTS — batched inference across GPU cluster |
| LLM Inference | Self-hosted on bare-metal GPUs with tiered routing |
| Vector Store | Qdrant — page-aware blended search |
| Third-Party APIs | Zero — fully self-hosted |