Voice Engine
Voquii runs a fully self-hosted voice pipeline on dedicated NVIDIA RTX and Blackwell GPUs, with no third-party API dependencies. Time-to-first-audio is 375ms on standard telephony calls, roughly 2x faster than Vapi, Retell, and Bland AI.
Pipeline Architecture
ASR — Voquii ASR (Self-Hosted)
Customer speech is transcribed in real time by Voquii's proprietary ASR running on bare-metal GPUs. Multiple instances behind weighted load balancing keep latency low even under heavy load.
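Weighted load balancing across ASR instances can be sketched as a weighted random pick. A minimal illustration, assuming a simple instance pool; the URLs and weights are hypothetical, not Voquii's actual configuration:

```python
import random

# Hypothetical ASR instance pool. A heavier weight routes proportionally
# more traffic to that instance (e.g. a larger or less-loaded GPU).
ASR_INSTANCES = [
    {"url": "http://asr-gpu-0:9000", "weight": 3},
    {"url": "http://asr-gpu-1:9000", "weight": 1},
]

def pick_asr_instance(instances=ASR_INSTANCES) -> str:
    """Weighted random selection: a weight-3 instance receives roughly
    3x the traffic of a weight-1 instance over many calls."""
    urls = [i["url"] for i in instances]
    weights = [i["weight"] for i in instances]
    return random.choices(urls, weights=weights, k=1)[0]
```

A production balancer would also track instance health and in-flight load, but weighted random selection alone already spreads traffic proportionally.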
Safety Gate (Deterministic)
Before any AI processing, a regex-based safety gate blocks medical, legal, PII, and off-topic queries with pre-written responses. No LLM is involved, so the check adds negligible latency.
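The gate pattern can be sketched as an ordered list of compiled regexes mapped to canned responses. The categories and phrasings below are illustrative placeholders, not Voquii's actual rule set:

```python
import re

# Illustrative rules only; a real gate would carry far more patterns per category.
BLOCKED = [
    (re.compile(r"\b(diagnos\w*|prescri\w*|medication)\b", re.IGNORECASE),
     "I can't give medical advice, but I can connect you with someone who can."),
    (re.compile(r"\b(lawsuit|legal advice|sue)\b", re.IGNORECASE),
     "I can't give legal advice. Is there something else I can help with?"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped PII
     "For your security, please don't share personal identifiers on this call."),
]

def safety_gate(transcript: str):
    """Return a pre-written response if the transcript trips a rule, else None.
    Pure regex matching: no model call, so this costs microseconds."""
    for pattern, response in BLOCKED:
        if pattern.search(transcript):
            return response
    return None
```

Because the rules are deterministic, the same input always produces the same block decision, which makes the gate easy to audit.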
RAG — Vector Search (Qdrant)
The transcribed question is matched against your knowledge base using page-aware blended search. Relevant chunks from crawled pages and uploaded documents are retrieved with score boosting for page-specific context.
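The page-aware boosting step can be illustrated as a re-scoring pass over vector-search hits. The hit structure and boost factor below are assumptions for the sketch, not Voquii internals:

```python
# Hypothetical boost multiplier for chunks from the page the caller is on.
PAGE_BOOST = 1.25

def blend_scores(hits, current_page: str):
    """hits: list of dicts like {"text": ..., "score": ..., "page": ...}
    returned by vector search. Chunks whose source page matches the
    caller's current page get their similarity score boosted, then the
    list is re-ranked."""
    for hit in hits:
        if hit.get("page") == current_page:
            hit["score"] *= PAGE_BOOST
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

With a 1.25x boost, a page-local chunk scoring 0.70 (boosted to 0.875) outranks a generic chunk scoring 0.80, which is the intended behavior: prefer context from where the caller actually is.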
LLM — Self-Hosted Inference
The AI generates a response using tiered routing. The fast tier handles simple queries with minimal latency; the medium tier handles complex, multi-context questions. All inference runs on dedicated, self-hosted hardware.
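A tier router like this can be as simple as a heuristic over query length and retrieved context. The thresholds and tier names below are illustrative, not Voquii's routing policy:

```python
def route_tier(question: str, n_context_chunks: int) -> str:
    """Route short, single-context queries to the fast tier and longer,
    multi-context questions to the medium tier. Thresholds are assumed
    values for this sketch."""
    if len(question.split()) <= 12 and n_context_chunks <= 1:
        return "fast"
    return "medium"
```

Cheap deterministic routing keeps the common case fast: the router itself adds no model call, so simple questions never pay the latency cost of the larger tier.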
TTS — Voquii TTS (Self-Hosted)
Text-to-speech converts the AI response into natural, human-sounding audio using Voquii TTS on dedicated NVIDIA RTX GPUs. Adaptive chunking starts audio playback before the full response is generated, minimizing perceived latency.
Performance Metrics
| Metric | What It Measures | Typical |
|---|---|---|
| TTFA | Time to first audio — caller hears first word | 375ms |
| ASR | Speech-to-text transcription latency | <100ms |
| LLM TTFT | Time to first token from the language model | <150ms |
| TTS | Text-to-speech generation per chunk | <80ms |
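The per-stage figures in the table can be sanity-checked against the headline TTFA. Summing the stage upper bounds leaves a margin for telephony transport and orchestration; the arithmetic below just restates the table's numbers:

```python
# Stage latency budget from the table above, in milliseconds (upper bounds).
stages = {"asr": 100, "llm_ttft": 150, "tts_first_chunk": 80}

pipeline_ms = sum(stages.values())    # 330 ms of model work, worst case
ttfa_ms = 375                         # headline time-to-first-audio
overhead_ms = ttfa_ms - pipeline_ms   # ~45 ms left for network + telephony
```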
Available Voices
Choose from a library of Voquii neural voices. Each voice can be assigned per-agent, so different clients can have different voice personalities.
Voquii TTS
Included at no extra cost. High-quality neural voices run on dedicated GPUs, with no per-character or per-minute TTS fees. Multiple voice options are available, including male, female, and various accents. Voices are optimized for telephony audio quality (8kHz mulaw).
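"8kHz mulaw" refers to G.711 μ-law, the standard companded 8-bit telephony codec. As a minimal sketch of what that optimization targets, here is the textbook μ-law encoder for one signed 16-bit PCM sample (this is the generic G.711 algorithm, not Voquii's TTS code):

```python
def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): highest set bit above bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # Mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF
```

The logarithmic segments give telephony its characteristic trade-off: fine resolution near silence, coarse resolution at high amplitude, all in 8 bits per sample at 8kHz.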
Adaptive Chunking
The voice pipeline uses adaptive text chunking to minimize time-to-first-audio. Instead of waiting for the full LLM response, audio generation begins as soon as a natural speech boundary is detected.
- First chunk — Starts TTS as soon as the first sentence fragment is available
- Subsequent chunks — Larger buffers for natural-sounding speech
- Automatic cleanup — URLs, markdown, and citations are stripped before TTS so spoken audio sounds natural
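The chunking behavior above can be sketched as a generator over the LLM token stream: flush a small first chunk at the first natural boundary, buffer more generously afterward, and clean each chunk before TTS. The thresholds and regexes are illustrative assumptions, not the production tuning:

```python
import re

# Illustrative thresholds; the real pipeline would tune these per voice/model.
FIRST_CHUNK_MIN = 20   # chars buffered before seeking a boundary, first chunk
LATER_CHUNK_MIN = 80   # larger buffers for natural-sounding later chunks
BOUNDARY = re.compile(r"[.!?,;:]\s")

def clean_for_tts(text: str) -> str:
    """Strip URLs, markdown syntax, and bracketed citations before TTS."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[*_`#]+", "", text)
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def chunk_stream(tokens):
    """Yield TTS-ready chunks from a token stream: a small first chunk
    for fast time-to-first-audio, larger chunks after."""
    buf, first = "", True
    for tok in tokens:
        buf += tok
        min_len = FIRST_CHUNK_MIN if first else LATER_CHUNK_MIN
        m = BOUNDARY.search(buf, min_len) if len(buf) > min_len else None
        if m:
            yield clean_for_tts(buf[:m.end()])
            buf, first = buf[m.end():], False
    if buf.strip():
        yield clean_for_tts(buf)
```

The key design point is that the first chunk is deliberately short: TTS can begin speaking after one sentence fragment while the LLM is still generating the rest of the answer.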
Infrastructure
| Component | Details |
|---|---|
| GPUs | NVIDIA RTX A6000 + RTX Pro (dedicated, not shared) |
| ASR Engine | Voquii ASR — multiple instances with weighted load balancing |
| TTS Engine | Voquii TTS — batched inference across GPU cluster |
| LLM Inference | Self-hosted on bare-metal GPUs with tiered routing |
| Vector Store | Qdrant — page-aware blended search |
| Third-Party APIs | Zero — fully self-hosted |