AI Voice Agent

VoiceFlow

Voice receptionist that speaks naturally, understands context, and guides customers to a booking

Source code: available on purchase

industries configurable

Live: real-time voice response

Zero: install required

The Problem

Why this needed to be built

Hiring a front-desk receptionist costs $2,500–$3,500/month. IVR trees ('press 1 for...') frustrate customers and kill conversion. At the same time, most chatbots feel robotic — customers want to speak, not type. VoiceFlow is a browser-based voice interface that explores the middle ground: no app install, no phone line required, just open a page and have a natural conversation. The current version runs entirely in the browser; phone line integration is the planned next step.

The Solution

How it works

01 Browser-native voice interface via the Web Speech API: no download, works on any device with a mic and a browser that supports speech recognition

02 Groq + Llama 3.3 70B for sub-400ms text generation, fast enough for natural conversation

03 ElevenLabs voice synthesis for human-sounding replies (not robotic TTS)

04 Multi-turn context: the agent remembers what was said earlier in the conversation

05 Booking intent detection: automatically guides callers through scheduling

06 Configurable industry templates (salon, dental, hotel, fitness) that go live in minutes

Tech Stack

Built with

Node.js, Express, Groq API, Llama 3.3 70B, ElevenLabs TTS, Web Speech API, AudioContext API, JavaScript

Architecture

System design

Browser (Web Speech API)
     │
     ▼ Real-time transcript
Express API Server
  ├── Session Manager
  │   (conversation history,
  │    industry config, user state)
  └── Groq — Llama 3.3 70B
      (intent detection + reply gen)
     │
     ▼ Response text
ElevenLabs TTS
(natural voice synthesis)
     │
     ▼ Audio stream
Browser AudioContext
  ├── Playback (real voice)
  └── On booking intent:
      ├── Slot confirmation flow
      └── Lead capture (name, phone)
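
The Session Manager box in the diagram can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: the function names, the session shape, and the cap on stored messages are all assumptions made for illustration. It keeps per-session history and turns it into the role/content message list used by OpenAI-style chat completion APIs, a format Groq's chat endpoint also accepts.

```javascript
// Sketch only: names, the session shape, and the history cap are assumptions.
function createSessionManager(maxMessages = 20) {
  const sessions = new Map(); // sessionId -> { industry, history, lead }

  // Return the session for this id, creating it on first use.
  function getSession(sessionId, industry = "salon") {
    if (!sessions.has(sessionId)) {
      sessions.set(sessionId, {
        industry,
        history: [],
        lead: { name: null, phone: null },
      });
    }
    return sessions.get(sessionId);
  }

  // Record one turn; role is "user" or "assistant".
  function addTurn(sessionId, role, content) {
    const session = getSession(sessionId);
    session.history.push({ role, content });
    // Drop the oldest messages so the prompt stays small.
    if (session.history.length > maxMessages) {
      session.history.splice(0, session.history.length - maxMessages);
    }
  }

  // Build the message list to send to the model for the next reply.
  function buildMessages(sessionId, systemPrompt) {
    const session = getSession(sessionId);
    return [{ role: "system", content: systemPrompt }, ...session.history];
  }

  return { getSession, addTurn, buildMessages };
}
```

Capping the history is one possible way to keep requests short; whether the real system trims, summarizes, or keeps everything is not stated here.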

Key Decisions & Lessons

What I learned

Groq is the reason real-time voice is possible

OpenAI's API averages 800–1,200ms for a full response from GPT-4o. For a voice conversation, anything over 500ms feels like a frozen call. Groq's LPU inference consistently delivers Llama 3.3 70B responses in under 300ms. Switching from OpenAI to Groq was the single change that made the product feel natural rather than broken.

Web Speech API silence detection is unpredictable

The browser's built-in silence detection ends the transcript after 2–5 seconds of quiet, but the threshold varies across Chrome, Edge, and Safari. The fix was a custom end-of-utterance detector: start a 1.5s timer on every `onresult` event, reset it if new words arrive, and fire the API call when the timer expires. This gives consistent behavior across browsers.
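
The timer-reset idea described above might look something like this. It is a sketch under assumptions: `createUtteranceDetector`, `attachToRecognition`, and the injectable `setTimer`/`clearTimer` options are hypothetical names, and the wiring assumes a standard `SpeechRecognition` instance with interim results enabled.

```javascript
// Sketch only: helper names and the options object are assumptions.
function createUtteranceDetector(onUtteranceEnd, options = {}) {
  const {
    silenceMs = 1500, // the 1.5s window from the text
    setTimer = setTimeout,
    clearTimer = clearTimeout,
  } = options;
  let timerId = null;
  let latestTranscript = "";

  // Call on every recognition result: remember the text, restart the timer.
  function handleResult(transcript) {
    latestTranscript = transcript;
    if (timerId !== null) clearTimer(timerId);
    timerId = setTimer(() => {
      timerId = null;
      const finalText = latestTranscript.trim();
      latestTranscript = "";
      if (finalText) onUtteranceEnd(finalText); // e.g. fire the API call here
    }, silenceMs);
  }

  // Stop waiting without firing, e.g. when the user closes the page.
  function cancel() {
    if (timerId !== null) clearTimer(timerId);
    timerId = null;
    latestTranscript = "";
  }

  return { handleResult, cancel };
}

// Browser wiring (assumes `recognition` is a SpeechRecognition instance).
function attachToRecognition(recognition, detector) {
  recognition.continuous = true;
  recognition.interimResults = true;
  recognition.onresult = (event) => {
    const transcript = Array.from(event.results)
      .map((result) => result[0].transcript)
      .join(" ");
    detector.handleResult(transcript);
  };
}
```

Because `event.results` accumulates over a recognition session, a real integration would probably stop or restart recognition after each utterance so earlier turns are not sent again.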

Prompting for voice: shorter is always better

A well-structured LLM response with headers and bullet points sounds terrible when spoken aloud. The system prompt had to forbid markdown, limit responses to two sentences at most, and instruct the model to end with a question to keep the conversation going. Treating the LLM output as a speech script rather than text output improved naturalness dramatically.
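
As a rough illustration of those three prompt rules, a system prompt builder could look like the sketch below. The wording of the prompt, the function names, and the `stripMarkdownForSpeech` fallback are assumptions rather than the project's actual prompt; the fallback is an extra safety net added for illustration, in case the model ignores the instructions.

```javascript
// Sketch only: prompt wording and function names are assumptions.
function buildVoiceSystemPrompt(businessName, industry) {
  return [
    `You are the voice receptionist for ${businessName}, a ${industry} business.`,
    "Your replies are read aloud by a text-to-speech engine.",
    "Never use markdown, headers, bullet points, or emoji.",
    "Reply in at most 2 short sentences.",
    "End every reply with a question that keeps the conversation going.",
  ].join(" ");
}

// Fallback: remove common markdown marks before the text reaches TTS.
function stripMarkdownForSpeech(reply) {
  return reply
    .replace(/^\s*#{1,6}\s+/gm, "") // heading markers
    .replace(/^\s*[-*•]\s+/gm, "")  // bullet markers
    .replace(/[*_`]/g, "")          // emphasis and inline-code marks
    .replace(/\s+/g, " ")           // collapse line breaks and extra spaces
    .trim();
}
```

Keeping the rules as separate sentences makes it easy to adjust one constraint, such as the sentence limit, without rewriting the whole prompt.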

What I'd do differently

Two things. First, replace the request/response pattern with WebSockets for true streaming — pipe ElevenLabs audio chunks directly to the browser as they're generated instead of waiting for the full audio file. This would cut perceived latency from ~800ms to under 200ms; the architecture is already designed for it. Second, and more importantly: add phone line integration via Twilio or Vapi. The current version is browser-only, which is great for demos but limits real-world deployment. The core AI pipeline (Groq + ElevenLabs) is already fast enough for telephony; it just needs a SIP/PSTN layer on top. That's the planned next step.
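
The streaming change comes down to forwarding each audio chunk the moment it arrives instead of collecting the whole file first. The sketch below shows only that core loop, under assumptions: `audioChunks` stands in for a streaming TTS response (any async iterable), `send` stands in for something like a WebSocket's send method, and `fakeTtsStream` is a hypothetical stand-in source for trying it out.

```javascript
// Sketch only: `audioChunks` and `send` are stand-ins, not real API objects.
async function forwardAudioChunks(audioChunks, send) {
  let forwarded = 0;
  for await (const chunk of audioChunks) {
    send(chunk); // forward immediately instead of buffering the full file
    forwarded += 1;
  }
  return forwarded; // number of chunks sent
}

// Stand-in source that yields three fake chunks and logs when each is made.
async function* fakeTtsStream(log) {
  for (const chunk of ["chunk-1", "chunk-2", "chunk-3"]) {
    log.push(`generated ${chunk}`);
    yield chunk;
  }
}
```

With this shape, the browser can begin playback once the first chunk lands, which is where the perceived latency gain would come from.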
