Safa Global · Voice AI

Which voice layer should we build on?

A 2026 vendor decision for the holding company, spanning Aquiii, Safa Health, and support operations.

Prepared 2026-05-25

Executive audio briefing. Two hosts, about 22 minutes.

The decision

Standardize on ElevenLabs as the holding-wide default. It is the only 2026 vendor that bundles top-tier TTS, accurate STT (Scribe v2), best-in-class voice cloning, and native agent, telephony, and WhatsApp deployment in one ecosystem. Breadth, not raw quality, is why it wins as the default.

Deviate in two cases: an Azure + Deepgram pipeline for Safa Health (HIPAA), and Cartesia + Deepgram + Retell for high-volume telephony. Self-host (Kokoro 82M) only when a workload's sustained text volume exceeds 10M characters per month.

01 The 2026 landscape verdict

ElevenLabs is no longer the outright quality leader, but it is still the most complete platform.

Quality (Artificial Analysis Speech Arena): Inworld TTS 1.5 Max is #1 (Elo ~1236), Google Gemini 3.1 Flash TTS is close behind, and ElevenLabs Eleven v3 sits around #4. The gap is narrow, but v3 is too slow for real-time use.
Latency: Cartesia Sonic 3 leads at 40ms time-to-first-audio (TTFA). ElevenLabs Flash v2.5 (~75ms) is competitive; Deepgram Aura-2 trails at ~120-150ms.
Price (per 1M characters): Google Cloud TTS $4 WaveNet / $16 Neural2 / $30 Chirp3 / $160 Studio; Azure Neural ~$14 (HD ~$92); Amazon Polly $4 / $16 / $30 / $100; Deepgram Aura-2 $30; Cartesia Sonic $50; self-hosted Kokoro ~$0.70.
Where ElevenLabs still wins decisively: voice cloning (IVC + PVC), voice-library breadth, expressive v3 audio tags, and an integrated agent + dubbing + STT stack.
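To make the price line concrete, the sketch below turns the per-1M-character list prices quoted above into a monthly bill at a given volume. It is illustrative only: the function name is hypothetical, Amazon Polly is left out because its four tiers are not labeled here, and real invoices depend on plan tiers and discounts this briefing does not cover.

```python
# Per-1M-character list prices quoted in the landscape section (USD).
PRICE_PER_1M_CHARS = {
    "Google WaveNet": 4.00,
    "Google Neural2": 16.00,
    "Google Chirp3": 30.00,
    "Google Studio": 160.00,
    "Azure Neural": 14.00,
    "Azure Neural HD": 92.00,
    "Deepgram Aura-2": 30.00,
    "Cartesia Sonic": 50.00,
    "Kokoro (self-hosted)": 0.70,
}


def monthly_tts_cost(chars_per_month: int) -> dict[str, float]:
    """Return the estimated monthly spend per option, cheapest first."""
    costs = {
        name: round(price * chars_per_month / 1_000_000, 2)
        for name, price in PRICE_PER_1M_CHARS.items()
    }
    return dict(sorted(costs.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    # 10M characters per month is the self-hosting trigger used in this briefing.
    for name, usd in monthly_tts_cost(10_000_000).items():
        print(f"{name:<22} ${usd:>10,.2f}")
```

At 10M characters per month the hosted options differ by a few hundred to a couple of thousand dollars, which is small next to integration cost.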

Reading: for a brand-voice, multi-product holding company, ecosystem breadth beats a few Elo points. Standardizing cuts integration cost across ventures.

02 Per-venture stacks

A · Aquiii / RiiiRiii brand mascot voice

Consumer, Spanish-first Mexico, expressive character, WhatsApp ordering.

Layer	Pick	Why
TTS	ElevenLabs Eleven v3	70+ languages incl. es-MX, deepest emotion via audio tags, cloning to lock the exact mascot voice
STT	ElevenLabs Scribe v2	~4% WER, 90+ languages, conversational
Orchestration	ElevenLabs Agents	Native WhatsApp inbound/outbound (messages + calls)
Biggest risk	Latency. v3 trades speed for quality (1-2s to respond). Use Flash v2.5 for real-time turns; reserve v3 for scripted/expressive content.
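The routing rule in the risk row can be sketched as a latency-budget check. This is a minimal illustration, not ElevenLabs' API: the class and function names are hypothetical, the model labels are the product names used in this briefing rather than API model IDs, and the latency figures are the approximate ones quoted here (about 75ms for Flash v2.5, up to 2s for v3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsModel:
    name: str
    latency_ms: int  # approximate figure quoted in this briefing
    expressive: bool


# Product names as used in this document, not API model identifiers.
FLASH = TtsModel("Flash v2.5", latency_ms=75, expressive=False)
V3 = TtsModel("Eleven v3", latency_ms=2000, expressive=True)


def pick_tts_model(latency_budget_ms: int) -> TtsModel:
    """Prefer the expressive model whenever its latency fits the budget."""
    if V3.latency_ms <= latency_budget_ms:
        return V3
    return FLASH


if __name__ == "__main__":
    print(pick_tts_model(300).name)     # a live conversational turn
    print(pick_tts_model(10_000).name)  # a pre-rendered scripted clip
```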

B · Safa Health / Together

Chronic-disease care, ES/EN/AR, handles patient health info (PHI).

Layer	Pick	Why
TTS	Azure Neural	140+ languages incl. Arabic, enterprise HIPAA. Deepgram TTS is English-only; ElevenLabs cloning is not BAA-covered
STT	Deepgram Nova-3 Multilingual	High accuracy on noisy audio, auto language detect, BAA available for PHI
Orchestration	Chained pipeline (Pipecat / LiveKit) to a HIPAA-eligible LLM (Azure OpenAI)	OpenAI Realtime audio is NOT HIPAA-eligible as of May 2026
Biggest risk	Loss of real-time speed. Compliance forces a chained pipeline, so expect 1.2-1.8s end-to-end and less natural prosody.

C · Aquiii support agents

Inbound/outbound phone, real-time, cost-sensitive at scale.

Layer	Pick	Why
TTS	Cartesia Sonic 3	Latency leader (40ms TTFA), $50 / 1M char
STT	Deepgram Nova-3	Sub-300ms, $0.0048 / min
Orchestration	Retell AI	Built for telephony: SIP, inbound routing, warm human handoff
Biggest risk	PSTN audio degradation. 8kHz G.711 phone audio hurts both STT accuracy and TTS naturalness vs web audio.

03 Cost arc · start, scale, own

START: ElevenLabs Starter at ~$6/mo, or Agents at ~$0.08/min. Validate product-market fit with zero infra risk.
SCALE: ~$8k/mo at 100k agent-minutes. Costs climb fastest on premium multilingual models (5-10k char/request caps multiply API overhead) and regenerations from long-form quality drift.
OWN: At sustained >10M char/mo, self-host Kokoro 82M on one A100 (~$749/mo, ~3.6B char/mo, 50+ concurrent streams). The real cost of owning is people: a minimal ML team is ~$270k-550k/yr plus ~$40-100k/yr maintenance. Do not self-host to "save money" until volume justifies the team.
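The arithmetic behind the arc can be checked with a short sketch. It uses only figures from this section plus one reference API price (Cartesia's $50 per 1M characters, taken from the landscape section). It assumes the self-hosted GPU is a flat monthly cost with no per-character charge inside its stated capacity. The function names are hypothetical.

```python
GPU_MONTHLY_USD = 749.0  # one A100, from the OWN line
TEAM_YEARLY_USD = (270_000.0, 550_000.0)  # minimal ML team, low and high
MAINTENANCE_YEARLY_USD = (40_000.0, 100_000.0)  # maintenance, low and high


def agent_monthly_cost(minutes: float, usd_per_minute: float = 0.08) -> float:
    """Hosted agent spend at the per-minute rate quoted on the START line."""
    return minutes * usd_per_minute


def fixed_monthly_cost(include_team: bool, high_estimate: bool = False) -> float:
    """Monthly cost of owning: GPU alone, or GPU plus team and maintenance."""
    cost = GPU_MONTHLY_USD
    if include_team:
        i = 1 if high_estimate else 0
        cost += (TEAM_YEARLY_USD[i] + MAINTENANCE_YEARLY_USD[i]) / 12
    return cost


def breakeven_chars_per_month(api_usd_per_1m_chars: float,
                              fixed_monthly_usd: float) -> float:
    """Characters per month at which API spend equals the fixed owning cost."""
    return fixed_monthly_usd / api_usd_per_1m_chars * 1_000_000


if __name__ == "__main__":
    api_price = 50.0  # reference point only; substitute the real contract rate
    print(f"100k agent-minutes: ${agent_monthly_cost(100_000):,.0f}/mo")
    for label, fixed in [
        ("GPU only", fixed_monthly_cost(include_team=False)),
        ("GPU + team (low)", fixed_monthly_cost(include_team=True)),
        ("GPU + team (high)", fixed_monthly_cost(True, high_estimate=True)),
    ]:
        chars = breakeven_chars_per_month(api_price, fixed)
        print(f"{label:<18} break-even at {chars / 1e6:,.0f}M char/mo")
```

Against a $50 per 1M character API, the GPU alone breaks even near 15M char/mo, while adding even the low-end team estimate pushes break-even above 500M char/mo. That gap is why the team, not the hardware, decides when owning pays off.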

04 HIPAA reality

The load-bearing constraint for Safa Health.

ElevenLabs CAN process PHI, conditionally

Only with ALL of: Enterprise tier + signed BAA + Zero Retention Mode ON + LLM restricted to an approved allowlist (Gemini/Claude) or your own API keys.

Two landmines

Zero Retention Mode disables inbound WhatsApp and request-stitching/history. You cannot have both HIPAA-mode and WhatsApp on the same agent.

Voice cloning is not confirmed BAA-covered and is listed as not zero-retention-eligible. Treat a cloned voice + PHI as non-compliant until legal confirms.

OpenAI Realtime: audio modality is NOT HIPAA-eligible. Any OpenAI-based health voice must chain HIPAA-safe STT + TTS around a text LLM.

Implication: keep RiiiRiii (cloning + WhatsApp) and Safa Health (HIPAA) on separate stacks. Do not force one vendor config to serve both.
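The conditions in this section can be written down as a pre-deployment check. The sketch below is an illustration of the rules as stated here, not a compliance tool and not part of any vendor SDK: `AgentConfig` and `phi_blockers` are hypothetical names, and the allowlist holds only the two model families named above.

```python
from dataclasses import dataclass

APPROVED_LLMS = {"gemini", "claude"}  # allowlist as named in this section


@dataclass
class AgentConfig:
    """Hypothetical description of one ElevenLabs agent deployment."""
    enterprise_tier: bool
    baa_signed: bool
    zero_retention: bool
    llm: str
    own_api_keys: bool = False
    inbound_whatsapp: bool = False
    cloned_voice: bool = False


def phi_blockers(cfg: AgentConfig) -> list[str]:
    """Return the reasons this agent must not handle PHI (empty if none)."""
    blockers = []
    if not cfg.enterprise_tier:
        blockers.append("Enterprise tier required")
    if not cfg.baa_signed:
        blockers.append("signed BAA required")
    if not cfg.zero_retention:
        blockers.append("Zero Retention Mode must be on")
    if cfg.llm.lower() not in APPROVED_LLMS and not cfg.own_api_keys:
        blockers.append("LLM must be on the allowlist or use own API keys")
    if cfg.inbound_whatsapp:
        blockers.append("inbound WhatsApp is unavailable under Zero Retention Mode")
    if cfg.cloned_voice:
        blockers.append("voice cloning is not confirmed BAA-covered")
    return blockers


if __name__ == "__main__":
    riiiriii = AgentConfig(True, True, False, "gemini",
                           inbound_whatsapp=True, cloned_voice=True)
    for reason in phi_blockers(riiiriii):
        print("blocked:", reason)
```

A RiiiRiii-style configuration fails on three counts, which is the practical argument for keeping the two stacks separate.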

05 Recommended next actions

Pilot RiiiRiii on ElevenLabs: clone the mascot voice (PVC), build a WhatsApp ordering agent on Flash v2.5, reserve v3 for scripted content.
Spike the Safa Health pipeline on Azure + Deepgram with a redacted-text LLM hop; open the BAA conversation with both vendors early (Enterprise-gated).
Benchmark a telephony POC (Cartesia + Deepgram + Retell) against ElevenLabs Agents on a real Aquiii support flow; decide on measured latency and cost, not the spec sheet.
Instrument volume per workload so the 10M char/mo self-hosting trigger is data-driven.
Revisit in ~2 quarters: Inworld and Gemini TTS are closing fast on quality and could change the default.
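For the volume-instrumentation action, one way to make the 10M char/mo trigger data-driven is to require that a workload stay above the threshold for several consecutive months. The sketch below assumes a three-month window, which is an illustrative choice rather than a figure from this briefing. The function name is hypothetical.

```python
def self_host_trigger(monthly_chars: list[int],
                      threshold: int = 10_000_000,
                      sustained_months: int = 3) -> bool:
    """True when the most recent `sustained_months` months all exceed `threshold`.

    `monthly_chars` is one workload's character count per month, oldest first.
    """
    if len(monthly_chars) < sustained_months:
        return False  # not enough history to call the volume sustained
    recent = monthly_chars[-sustained_months:]
    return all(volume > threshold for volume in recent)


if __name__ == "__main__":
    spike = [4_000_000, 12_000_000, 6_000_000, 7_000_000]
    steady = [9_000_000, 11_000_000, 12_500_000, 14_000_000]
    print(self_host_trigger(spike))   # a single spike does not qualify
    print(self_host_trigger(steady))  # three straight months above 10M
```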