Voice Engine

Voquii runs a fully self-hosted voice pipeline on dedicated NVIDIA RTX and Blackwell GPUs, with no third-party API dependencies. Typical time-to-first-audio on standard telephony calls is 375ms, roughly 2x faster than Vapi, Retell, and Bland AI.

Pipeline Architecture

1. ASR — Voquii ASR (Self-Hosted)

Customer speech is transcribed in real time by Voquii's proprietary ASR running on bare-metal GPUs. Multiple instances with weighted load balancing keep latency low even under heavy load.
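Weighted load balancing across ASR instances can be sketched as a weighted random pick, where heavier weights receive proportionally more streams. The instance URLs and weights below are illustrative, not Voquii's actual topology:

```python
import random

# Hypothetical ASR instances with weights (names and weights are
# illustrative placeholders, not Voquii's real deployment).
ASR_INSTANCES = [
    {"url": "http://asr-1.internal:8000", "weight": 3},  # larger GPU
    {"url": "http://asr-2.internal:8000", "weight": 2},
    {"url": "http://asr-3.internal:8000", "weight": 1},
]

def pick_asr_instance(instances=ASR_INSTANCES) -> str:
    """Weighted random selection: an instance with weight 3 receives
    roughly three times as many streams as one with weight 1."""
    urls = [i["url"] for i in instances]
    weights = [i["weight"] for i in instances]
    return random.choices(urls, weights=weights, k=1)[0]
```

Production balancers typically also factor in live health checks and current stream counts; the static weights here stand in for that signal.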

2. Safety Gate (Deterministic)

Before any AI processing, a regex-based safety gate blocks medical, legal, PII, and off-topic queries with pre-written responses. No LLM involved — zero latency overhead.
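A deterministic regex gate of this kind can be sketched as an ordered list of pattern/response pairs, checked before the transcript ever reaches a model. The categories and canned responses below are illustrative; the real gate's rules are not public:

```python
import re

# Illustrative rules only — Voquii's actual patterns and responses differ.
SAFETY_RULES = [
    (re.compile(r"\b(diagnos\w*|prescri\w*|dosage)\b", re.I),
     "I can't give medical advice, but I can connect you with the team."),
    (re.compile(r"\b(lawsuit|sue|legal advice)\b", re.I),
     "I can't give legal advice, but I can take a message for you."),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped PII
     "For your security, please don't share personal numbers on this call."),
]

def safety_gate(transcript: str):
    """Return a pre-written response if the transcript trips a rule,
    else None. Pure regex — deterministic, no model call, no added latency."""
    for pattern, response in SAFETY_RULES:
        if pattern.search(transcript):
            return response
    return None
```

Because the gate never calls a model, its cost is microseconds of pattern matching, which is why it adds no measurable latency to the pipeline.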

3. RAG — Vector Search (Qdrant)

The transcribed question is matched against your knowledge base using page-aware blended search. Relevant chunks from crawled pages and uploaded documents are retrieved with score boosting for page-specific context.
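Score boosting for page-specific context can be sketched as a re-ranking pass over retrieved hits: chunks whose source page matches the page the caller is asking from get a fixed bonus before the final sort. The `page_boost` value and hit fields below are assumptions for illustration, not Voquii's tuned parameters:

```python
def blend_scores(hits, current_page: str, page_boost: float = 0.15):
    """Re-rank retrieved chunks, boosting those whose `source_page`
    matches the caller's current page. The 0.15 boost is illustrative."""
    ranked = []
    for hit in hits:
        score = hit["score"]
        if hit.get("source_page") == current_page:
            score += page_boost  # page-aware bonus
        ranked.append({**hit, "blended_score": score})
    return sorted(ranked, key=lambda h: h["blended_score"], reverse=True)
```

In practice the raw `score` would come from Qdrant's vector similarity search, and the boost lets a slightly weaker match from the relevant page outrank a generic match from elsewhere in the knowledge base.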

4. LLM — Self-Hosted Inference

The AI generates a response using tiered routing. Fast tier handles simple queries instantly; medium tier handles complex, multi-context questions. All inference runs on proprietary hardware.
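Tiered routing can be sketched as a cheap heuristic that runs before inference: short questions with little retrieved context go to the fast tier, anything longer or multi-context goes to the medium tier. The thresholds here are illustrative, not Voquii's actual routing logic:

```python
def route_tier(question: str, context_chunks: list) -> str:
    """Heuristic tier router (illustrative thresholds): simple,
    single-context questions take the fast path; complex or
    multi-context questions take the medium path."""
    if len(context_chunks) <= 1 and len(question.split()) <= 12:
        return "fast"
    return "medium"
```

The point of routing before inference is that the decision itself must be near-free; a classifier model could make better decisions but would eat into the latency budget it is meant to protect.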

5. TTS — Voquii TTS (Self-Hosted)

Text-to-speech converts the AI response into natural, human-sounding audio using Voquii TTS on dedicated NVIDIA RTX GPUs. Adaptive chunking starts audio playback before the full response is generated, minimizing perceived latency.

Performance Metrics

| Metric | What It Measures | Typical |
| --- | --- | --- |
| TTFA | Time to first audio — caller hears the first word | 375ms |
| ASR | Speech-to-text transcription latency | <100ms |
| LLM TTFT | Time to first token from the language model | <150ms |
| TTS | Text-to-speech generation per chunk | <80ms |
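As a back-of-envelope check, the headline TTFA roughly decomposes into the table's per-stage figures. The remainder attributed to transport and telephony framing is our inference, not a published number:

```python
# Latency budget using the table's typical per-stage figures (ms).
BUDGET_MS = {"asr": 100, "llm_ttft": 150, "tts_first_chunk": 80}

pipeline_ms = sum(BUDGET_MS.values())  # 330ms of model latency
ttfa_ms = 375                          # headline time-to-first-audio
# The ~45ms remainder (network transport, buffering, telephony framing)
# is an assumption on our part, not a documented figure.
overhead_ms = ttfa_ms - pipeline_ms
```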

Available Voices

Choose from a library of Voquii neural voices. Each voice can be assigned per-agent, so different clients can have different voice personalities.

Voquii TTS

Included — No Extra Cost

High-quality neural voices running on proprietary GPUs. No per-character or per-minute TTS fees. Multiple voice options including male, female, and various accents. Voices are optimized for telephony audio quality (8kHz mulaw).
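Per-agent voice assignment can be sketched as a simple mapping from agent ID to voice settings, with a default fallback. The agent IDs, voice names, and style labels below are placeholders, not Voquii's actual catalogue:

```python
# Hypothetical per-agent voice mapping — agent IDs and voice names
# are placeholders, not Voquii's real voice catalogue.
AGENT_VOICES = {
    "plumbing-co": {"voice": "ava", "style": "friendly"},
    "wealth-mgmt": {"voice": "noah", "style": "calm-professional"},
}

DEFAULT_VOICE = {"voice": "ava", "style": "friendly"}

def voice_for_agent(agent_id: str) -> dict:
    """Resolve an agent's voice, falling back to the default when
    no explicit mapping exists."""
    return AGENT_VOICES.get(agent_id, DEFAULT_VOICE)
```

This is how one deployment can serve a friendly voice for a service business and a calm, professional one for a financial client from the same pipeline.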

Tip: Pick a voice that matches your client's brand tone. A friendly, conversational voice works well for service businesses. A calm, professional voice suits financial or legal services.

Adaptive Chunking

The voice pipeline uses adaptive text chunking to minimize time-to-first-audio. Instead of waiting for the full LLM response, audio generation begins as soon as a natural speech boundary is detected.

  • First chunk — Starts TTS as soon as the first sentence fragment is available
  • Subsequent chunks — Larger buffers for natural-sounding speech
  • Automatic cleanup — URLs, markdown, and citations are stripped before TTS so spoken audio sounds natural
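The three behaviors above can be sketched together: a small first chunk released at the first natural boundary, larger buffers afterward, and a cleanup pass that strips URLs, markdown, and citations before TTS. The chunk-size thresholds and regex patterns are illustrative, not Voquii's tuned values:

```python
import re

def clean_for_tts(text: str) -> str:
    """Strip URLs, markdown symbols, and [1]-style citations so the
    spoken audio sounds natural (patterns are illustrative)."""
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"[*_`#]+", "", text)       # markdown emphasis/headers
    text = re.sub(r"\[\d+\]", "", text)       # bracketed citations
    return re.sub(r"\s{2,}", " ", text).strip()

def chunk_stream(tokens, first_chunk_chars=40, later_chunk_chars=120):
    """Yield TTS-ready chunks from an LLM token stream: a small first
    chunk at the first speech boundary, then larger buffers.
    Thresholds are illustrative, not Voquii's tuned values."""
    buf, limit = "", first_chunk_chars
    for tok in tokens:
        buf += tok
        # Release a chunk once the buffer is big enough AND ends at
        # a natural speech boundary (sentence or clause end).
        if len(buf) >= limit and buf.rstrip().endswith((".", "!", "?", ",")):
            yield clean_for_tts(buf)
            buf, limit = "", later_chunk_chars
    if buf.strip():
        yield clean_for_tts(buf)  # flush whatever remains
```

The asymmetry is deliberate: the first chunk optimizes time-to-first-audio, while later chunks trade a little buffering latency for smoother prosody.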

Infrastructure

| Component | Details |
| --- | --- |
| GPUs | NVIDIA RTX A6000 + RTX Pro (dedicated, not shared) |
| ASR Engine | Voquii ASR — multiple instances with weighted load balancing |
| TTS Engine | Voquii TTS — batched inference across the GPU cluster |
| LLM Inference | Self-hosted on bare-metal GPUs with tiered routing |
| Vector Store | Qdrant — page-aware blended search |
| Third-Party APIs | Zero — fully self-hosted |

Next Steps