AI How-To | May 22, 2026 | 6 min read

How to Build a Voice AI Booking Agent With ElevenLabs and n8n

A 60-day production report: 7 restaurants, 200 bookings a month, native Italian voice, and an €87/month stack built on ElevenLabs and n8n.

Reeve Yew

Restaurant operators miss roughly 1 in 4 inbound phone calls during peak service hours, according to the Toast 2024 Restaurant Technology Report. Those calls are direct, zero-commission revenue. A voice AI booking agent fixes this at the phone layer. With ElevenLabs and n8n, you can answer every call, capture every reservation, and run the whole system for under €100 a month.

That is not vendor marketing. Developer Andrea Sisofo documented 60 days of live production data from seven Italian restaurants: 200 bookings a month, native Italian voice, an €87/month stack, and a full record of every failure mode with the fix applied. This post walks through that stack, those failures, and how to decide if the same build fits your context.

What Is a Voice AI Booking Agent and How Does It Work?

A voice AI booking agent is a phone-answering pipeline. It combines a speech-to-text layer, an LLM reasoning core, and text-to-speech output to complete a structured task, in this case reservation capture, without human intervention.

The key difference from a chatbot is latency sensitivity. Text chat tolerates a two-second pause; a phone call does not. On the phone, a two-second gap feels like a dropped line to most callers, and they hang up. That constraint shapes every architectural choice you make downstream.

ElevenLabs Conversational AI handles the voice layer with streaming TTS, a prebuilt or cloned voice, and built-in interruption handling. Its WebSocket transport is designed to keep round-trip latency under 800 ms in the EU region when the stack is correctly pinned to a regional endpoint.

Understanding the latency budget is the first step. Every other decision (region choice, data store design, filler phrase strategy) flows from that single number.

What Does the Full Stack Look Like in Production?

The stack has four layers. ElevenLabs Conversational AI covers voice: streaming STT, LLM inference, and TTS in one managed service. n8n handles orchestration: webhook routing and slot validation. A data store, Google Sheets or Airtable, receives the confirmed-reservation write from n8n. A SIP bridge, Twilio or Vonage, connects the real phone network to the WebSocket pipeline.

As of May 2026, Twilio Voice and ElevenLabs maintain a documented integration path via WebSocket media streams, with official sample code in the ElevenLabs developer docs. Bridge setup is roughly 40 lines of configuration, not a custom build. That is a meaningful reduction compared to twelve months ago.

n8n is the right orchestration choice at this scale. It is self-hostable, with no per-execution pricing when you run it yourself, and the Starter cloud plan covers 2,500 executions a month. That is enough for 200 bookings with room for test runs. The visual canvas lets a non-engineer inspect exactly what the agent did on any given call.

For a wider view of how agents connect to external tools and data, the Model Context Protocol: How MCP Connects AI to Your Tools guide explains the connectivity standard that sits beneath orchestration layers like this one.

How Do You Handle Native-Language Conversations at Production Quality?

Language config in ElevenLabs is a project-level setting. Set the agent language to Italian. The STT model then uses a language-specific acoustic model that handles regional accents and restaurant-domain vocabulary better than a multilingual fallback.

As of May 2026, ElevenLabs Conversational AI supports 31 languages with dedicated STT acoustic models, up from 12 at launch in late 2024. Italian, Spanish, French, German, and Japanese all have full support. For operators building outside English-speaking markets, that coverage removes the main blocker.

Prompt engineering for a non-English agent has one firm rule: write the system prompt in the target language. Do not translate at runtime. Mid-call translation adds latency and degrades intent classification on edge cases like party-size phrasing, for example 'siamo in quattro' versus 'quattro persone'.

For slot-filling, define four required slots explicitly: name, date, time, and party size. Let the LLM re-ask in natural Italian if a slot is missing or ambiguous. Do not route to an error branch on the first failed parse. Most callers self-correct on the second ask.
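
The slot-filling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the deployment's actual code: the `Reservation` class, the helper names, and the Italian re-ask phrases are all assumptions made for the example. In the real stack the LLM would phrase the re-ask itself.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical re-ask phrases, one per required slot (illustrative wording only).
REASK_PROMPTS = {
    "name": "A che nome devo segnare la prenotazione?",
    "date": "Per quale giorno desidera prenotare?",
    "time": "A che ora preferisce?",
    "party_size": "Per quante persone?",
}


@dataclass
class Reservation:
    """The four required slots; None means the slot is not captured yet."""
    name: Optional[str] = None
    date: Optional[str] = None        # ISO date, e.g. "2026-05-22"
    time: Optional[str] = None        # 24-hour "HH:MM"
    party_size: Optional[int] = None


def missing_slots(res: Reservation) -> list[str]:
    """Return the names of required slots that are still empty, in declaration order."""
    return [f.name for f in fields(res) if getattr(res, f.name) in (None, "")]


def next_prompt(res: Reservation) -> Optional[str]:
    """Return the re-ask phrase for the first missing slot, or None when all are filled."""
    missing = missing_slots(res)
    return REASK_PROMPTS[missing[0]] if missing else None
```

The point of the design is that a missing slot produces another question rather than an error branch, so the caller gets the second ask on which most of them self-correct.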

What Does It Actually Cost to Run This at Scale?

The reported spend at 200 bookings a month across 7 locations is approximately €87/month total. That covers ElevenLabs character and minute costs, n8n cloud or a self-hosted VPS, and SIP trunk fees. The original case study by Andrea Sisofo breaks each line item down.

Cost scaling is roughly linear with call minutes, not bookings. Average call length in the Italian deployment was 90 seconds. That means 200 bookings equals about 300 billed voice minutes. Failed or abandoned calls still consume billed time, so design your no-answer fallback to exit quickly.
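
The arithmetic behind that estimate is easy to check. The sketch below assumes, as a simplification, that failed or abandoned calls run as long as the average call; the function name and the `non_booking_calls` parameter are hypothetical and only serve the illustration.

```python
def billed_minutes(bookings: int,
                   avg_call_seconds: float = 90.0,
                   non_booking_calls: int = 0) -> float:
    """Estimate billed voice minutes for a month.

    Every answered call consumes billed time, whether or not it ends in a
    booking. Assumption: non-booking calls last as long as the average call.
    """
    total_calls = bookings + non_booking_calls
    return total_calls * avg_call_seconds / 60.0


print(billed_minutes(200))                        # 300.0
print(billed_minutes(200, non_booking_calls=40))  # 360.0
```

The second call shows why a fast exit on abandoned calls matters: 40 extra calls that never produce a booking add 60 billed minutes.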

Comparison: a part-time reservationist in Italy costs roughly €8 to €10 per hour. At those rates, €87 buys about 9 to 11 hours of staffed phone cover, so a single weekend of shifts costs about as much as a full month of this stack.

At 500 bookings, call minutes rise 2.5-fold (from about 300 to about 750 at the same 90-second average), adding roughly €30 to €40 to the monthly bill. The fixed costs, n8n and SIP trunk base fees, barely move. If you are watching API spend across multiple automation layers, the post on how to reduce AI API costs without changing your code covers token-level optimizations that apply here too.

What Broke and How Was It Fixed?

Three failure modes surfaced in the first 30 days. Each one is documented with its root cause and fix.

Latency spikes. Cross-region WebSocket routing caused P95 latency to exceed 1.5 seconds on roughly 12% of calls. Fix: pin the ElevenLabs agent endpoint to EU-West and use an EU-based Twilio number. P95 dropped to under 800 ms after the change.

Slot hallucination. The LLM occasionally confirmed a booking time that was not in the allowed schedule. Fix: add a deterministic validation step in n8n that checks the captured time against an allowed-hours list before the agent speaks the confirmation.

Caller hang-up on silence. When the Sheets API write took more than 1.2 seconds, the agent went silent. Callers hung up. Fix: insert a filler phrase ('Un momento, sto controllando...') as a non-blocking n8n step while the write completes.
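
The filler-phrase fix can be pictured in Python rather than as an n8n workflow. This is a sketch under stated assumptions: `write` and `speak` are hypothetical async callables standing in for the data-store write and the agent's speech output. Neither is an ElevenLabs or n8n API.

```python
import asyncio

FILLER = "Un momento, sto controllando..."


async def confirm_with_filler(write, speak):
    """Start the slow write in the background, speak the filler, then await the write.

    `write` is a zero-argument async callable; `speak` is an async callable
    that takes the phrase to say. Both are placeholders for illustration.
    """
    write_task = asyncio.create_task(write())  # the write runs without blocking speech
    await speak(FILLER)                        # the caller hears something, not silence
    return await write_task                    # proceed only once the write has finished


if __name__ == "__main__":
    async def demo_write():
        await asyncio.sleep(1.5)               # simulate a slow data-store call
        return "saved"

    async def demo_speak(text):
        print("agent says:", text)

    print(asyncio.run(confirm_with_filler(demo_write, demo_speak)))
```

The write result is still awaited before anything else happens, so the confirmation is never spoken ahead of a completed write.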

These are not AI failures. They are latency, data-write timing, and slot-validation problems. Any production engineer can instrument and fix them with execution logs.
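
As an illustration of the slot-validation fix, here is a minimal deterministic check in Python. The seating windows and names are hypothetical; in the deployment described here the equivalent logic lives in an n8n validation step, and the real allowed-hours list comes from each restaurant.

```python
from datetime import time

# Hypothetical seating windows (inclusive); replace with the restaurant's real schedule.
ALLOWED_WINDOWS = [
    (time(12, 0), time(14, 30)),   # lunch
    (time(19, 0), time(22, 30)),   # dinner
]


def is_allowed(captured: str, windows=ALLOWED_WINDOWS) -> bool:
    """Return True only if the captured "HH:MM" string parses and falls in a window."""
    try:
        t = time.fromisoformat(captured)
    except ValueError:
        return False               # unparseable or impossible times are rejected
    return any(start <= t <= end for start, end in windows)


print(is_allowed("20:30"))  # True
print(is_allowed("16:00"))  # False
```

Because the check is plain comparison logic with no model in the loop, a time outside the schedule cannot be confirmed no matter what the LLM produced.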

How Do You Test a Voice Agent Before Going Live?

Testing a voice AI booking agent runs in three stages.

Script-based regression tests: write 20 to 30 conversation transcripts covering happy paths, wrong dates, party-size edge cases, and mid-call interruptions. Replay them via the ElevenLabs Conversational AI test console. Flag any failed slot captures or unexpected confirmation phrases before touching a live number.

Latency benchmarking: use n8n's execution log timestamps to measure STT-to-LLM and LLM-to-TTS gaps independently. Target under 600 ms combined before going live. The n8n webhook documentation shows where to find execution timing in the run history panel.

Soft launch with call forwarding: route overflow calls, not the primary line, to the agent for the first two weeks. Compare booking capture rate and error logs before cutting over the main number. This gives real caller behavior data without risking primary reservations.
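
For the latency benchmarking stage, a small helper can turn exported timing samples into a pass/fail signal. This is a sketch: the function names are hypothetical, the samples would come from your own execution logs, and applying the 600 ms target at P95 is an assumption, since the target above does not name a statistic.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def passes_budget(latencies_ms, budget_ms=600, pct=95):
    """True if the chosen percentile of combined latency is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Hypothetical combined STT-to-LLM plus LLM-to-TTS gaps, in milliseconds.
samples = [100 + 50 * i for i in range(20)]
print(percentile(samples, 95))   # 1000
print(passes_budget(samples))    # False
```

Tracking a high percentile rather than the mean keeps the slow tail visible, which is where callers hang up.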

If you run other automated workflows alongside this build, the AI automation workflows with n8n guide covers workflow patterns that pair well with a voice agent pipeline.

When Should You Use This Stack vs. a Simpler or More Expensive Alternative?

This stack fits a narrow, high-repetition phone workflow: single task, structured output, fewer than 500 calls a month per location. Reservations, appointment booking, and order status checks are the right use cases.

Wrong fit: complex upsell conversations where the script branches unpredictably; calls that need live POS system access, because the Sheets write pattern breaks under real-time inventory demands; and languages where ElevenLabs STT coverage is limited. Check the supported language list before committing.

Alternatives worth knowing: Retell AI and Bland AI both offer managed voice agent platforms with built-in telephony. Per-minute cost is higher, but integration overhead is lower. If your team has no n8n experience, a managed platform may be the right entry point.

The Italian deployment is proof that a low-budget voice AI booking agent handles production traffic without a dedicated engineering team. For comparable load data across different voice stacks, the production voice AI stack benchmarks post has side-by-side numbers worth reviewing before you commit.

Start narrow: one language, one task, one location. Ship it, read the execution logs, then scale.

FAQ

How do I build a voice AI agent that answers phone calls?

The core stack has three parts: a voice AI platform (ElevenLabs Conversational AI handles speech-to-text, LLM reasoning, and text-to-speech in one API), an orchestration layer (n8n receives the structured output via webhook and routes it to your data store), and a SIP or PSTN bridge (Twilio or Vonage) that connects a real phone number to the voice platform via WebSocket media streams. The ElevenLabs developer docs include sample WebSocket integration code. Plan for roughly one to two weeks of testing before going live, focusing on latency and slot-validation edge cases.

What does it cost to run a voice AI booking agent with ElevenLabs?

At low volume (around 200 bookings per month with an average call length of 90 seconds), the total stack cost is in the €80 to €100 per month range. That covers ElevenLabs Conversational AI billed by voice minutes, n8n on a self-hosted VPS or starter cloud plan, and a SIP trunk for PSTN access. Costs scale roughly linearly with call minutes, not just completed bookings, because failed and abandoned calls still consume billed time. At 500 bookings per month the stack typically stays under €200 per month, well below the cost of a part-time human reservationist.

Can a voice AI agent handle phone calls in Italian or other non-English languages?

Yes. As of May 2026, ElevenLabs Conversational AI supports 31 languages with dedicated speech-to-text acoustic models, including Italian, Spanish, French, German, and Japanese. The key requirement is writing the agent system prompt in the target language rather than relying on runtime translation, because mid-call translation adds latency and degrades intent classification on colloquial or regional phrasing. Cloning a native-speaker voice for the TTS output significantly improves caller trust and reduces hang-up rates compared to a generic synthesized voice.

What is the real-world latency of ElevenLabs Conversational AI on phone calls?

In a documented 60-day production deployment, the combined STT-to-LLM-to-TTS round trip stayed under 800 milliseconds at P95 once the ElevenLabs agent endpoint was pinned to the EU-West region and the SIP trunk was also EU-based. Cross-region routing (EU caller, US-routed WebSocket) pushed P95 latency above 1.5 seconds, which callers perceived as a dropped call or dead air. The fix is region-matching: use an EU ElevenLabs endpoint with an EU phone number and an EU Twilio Voice region. Target under 600 ms combined before going live.

What breaks when you deploy a voice AI phone agent in production?

Three failure modes appear most often. First, latency spikes from cross-region WebSocket routing cause callers to hang up. Fix: pin the voice endpoint to the same region as your phone number. Second, LLM slot hallucination, where the agent confirms a booking time outside your allowed schedule. Fix: add a deterministic validation node in your orchestration layer before the agent speaks the confirmation. Third, data-write silence, where a slow API call (to Google Sheets, Airtable, etc.) leaves the agent quiet for over a second. Fix: insert a filler phrase as a non-blocking step while the write completes.

Is n8n a good choice for building a voice AI agent workflow?

n8n works well for this use case at low to mid volume (under roughly 1,000 calls per month per location). Its advantages are self-hosted deployment with no per-execution fees, a visual canvas that lets non-engineers inspect what the agent did on any call, and built-in webhook nodes that connect directly to ElevenLabs output without custom code. The limitation is that each n8n node execution adds a small latency hop (typically 50 to 150 ms), which matters in real-time voice pipelines. For high-volume or ultra-low-latency needs, a custom FastAPI or Node.js orchestration layer is faster.

How do I test a voice AI phone agent before it goes live?

Use three layers of testing. First, script-based regression: write 20 to 30 canonical conversation transcripts covering the happy path, wrong-date inputs, large party sizes, and mid-sentence interruptions, then replay them through the ElevenLabs Conversational AI test console. Second, latency benchmarking: use your orchestration layer's execution logs to measure STT-to-LLM and LLM-to-TTS gaps independently and target under 600 ms combined. Third, a soft launch with call forwarding: route overflow calls (not your primary line) to the agent for two weeks, then compare booking capture rate and error logs before cutting over the main number.