How to Build an AI Voice Assistant with Deepgram and Cartesia

May 08, 2026

Building a voice assistant that feels "human" requires near-instantaneous response times. By combining Deepgram (for listening) and Cartesia (for speaking), you can build an assistant that holds a conversation with sub-second latency.

Real-Time Speech-to-Text with Deepgram

Deepgram's "Nova-2" model is optimized for speed and accuracy. It transcribes streaming audio in real time, so your application can start processing the user's intent before they finish speaking. This early, incremental transcription is key to removing the "awkward pause" common in older voice assistants.
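To make the streaming idea concrete, here is a minimal sketch of the two pieces that surround the audio socket: building the connection URL with interim results enabled, and pulling the transcript out of each incoming message. The helper names (build_listen_url, parse_result), the 16 kHz linear16 audio settings, and the exact message fields handled are assumptions for illustration; check them against Deepgram's current streaming documentation before relying on them. Opening the WebSocket and sending microphone audio are left out.

```python
import json
from urllib.parse import urlencode

# Deepgram's live-transcription WebSocket endpoint.
DEEPGRAM_LIVE_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(model="nova-2", sample_rate=16000):
    """Build the streaming URL with interim results switched on."""
    params = {
        "model": model,
        "encoding": "linear16",      # assumption: raw 16-bit PCM input
        "sample_rate": sample_rate,  # assumption: 16 kHz microphone audio
        "interim_results": "true",   # ask for partial transcripts mid-utterance
    }
    return f"{DEEPGRAM_LIVE_URL}?{urlencode(params)}"


def parse_result(raw_message):
    """Return (transcript, is_final) for a transcript message, else None."""
    message = json.loads(raw_message)
    if message.get("type") != "Results":
        return None  # metadata and other event types carry no transcript
    alternatives = message.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return None
    transcript = alternatives[0].get("transcript", "")
    if not transcript:
        return None  # silence produces empty transcripts; skip them
    return transcript, bool(message.get("is_final", False))


if __name__ == "__main__":
    print(build_listen_url())
    # A hand-written sample message, not captured from a real session.
    sample = json.dumps({
        "type": "Results",
        "is_final": False,
        "channel": {"alternatives": [{"transcript": "turn on the"}]},
    })
    print(parse_result(sample))
```

An application could pass interim transcripts (is_final is False) to intent detection right away and treat the final transcript as the confirmed text, which is one way to begin work before the user stops speaking.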

Expressive Speech Synthesis with Cartesia

As your LLM generates a response, Cartesia's "Sonic" model can turn it into audio word by word. Because Sonic is built for low latency, the assistant can start speaking almost immediately, and its expressive, human-like prosody keeps the conversation sounding natural rather than robotic.
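One way to feed a text-to-speech engine while the LLM is still writing is to cut the token stream into speakable chunks. The sketch below uses sentence-level chunking, which is coarser than the word-by-word streaming described above but easy to reason about. It does not call Cartesia's API: synthesize is a hypothetical callback standing in for whatever function sends text to the speech model, and the sentence-splitting rule is a simplifying assumption.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(token_stream):
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next token
    if buffer.strip():
        yield buffer.strip()  # flush what remains when the stream ends


def speak(token_stream, synthesize):
    """Send each sentence to `synthesize`, a hypothetical TTS callback."""
    for sentence in sentence_chunks(token_stream):
        synthesize(sentence)


if __name__ == "__main__":
    # Simulated LLM output arriving a few characters at a time.
    tokens = ["Sure", ". ", "The light", " is", " on", ".", " Anything", " else?"]
    speak(iter(tokens), print)
```

With this approach the first sentence can be sent for synthesis while later ones are still being generated. The simple regular expression will also split after abbreviations such as "Dr.", so a production system would need a more careful boundary rule.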