SAFASTUDIOS

Safa Global · Voice AI

Which voice layer should we build on?

A 2026 vendor decision for the holding company, spanning Aquiii, Safa Health, and support operations.

Prepared 2026-05-25


Executive audio briefing. Two hosts, about 22 minutes.



The decision

Standardize on ElevenLabs as the holding-wide default. It is the only 2026 vendor that bundles top-tier TTS, accurate STT (Scribe v2), best-in-class voice cloning, and native agent, telephony, and WhatsApp deployment in one ecosystem. Breadth, not raw quality, is why it wins as the default.

Deviate twice: an Azure + Deepgram pipeline for Safa Health (HIPAA), and Cartesia + Deepgram + Retell for high-volume telephony. Self-host (Kokoro 82M) only when a workload's text volume stays above 10M characters per month on a sustained basis.
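The routing rule above can be written down as a small decision function. This is a minimal sketch, not the holding company's tooling: the `Workload` fields, the three-month window used to interpret the sustained-volume condition, and the precedence (PHI first, then self-hosting, then telephony) are all assumptions made for illustration.

```python
from dataclasses import dataclass

SELF_HOST_CHARS_PER_MONTH = 10_000_000  # self-hosting trigger from the decision text


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one voice workload."""
    name: str
    handles_phi: bool = False
    high_volume_telephony: bool = False
    monthly_chars: tuple[int, ...] = ()  # observed TTS characters, oldest month first


def sustained_over_threshold(monthly_chars: tuple[int, ...], months: int = 3) -> bool:
    """Assumed test for sustained volume: the last `months` months all exceed the trigger."""
    recent = monthly_chars[-months:]
    return len(recent) == months and all(c > SELF_HOST_CHARS_PER_MONTH for c in recent)


def pick_stack(w: Workload) -> str:
    """Return the stack named in the decision for this workload."""
    if w.handles_phi:  # compliance comes first (assumed precedence)
        return "Azure + Deepgram"
    if sustained_over_threshold(w.monthly_chars):
        return "Self-host (Kokoro 82M)"
    if w.high_volume_telephony:
        return "Cartesia + Deepgram + Retell"
    return "ElevenLabs"  # holding-wide default
```

A single month above 10M characters does not trip the self-hosting branch here; only an unbroken run does, which keeps a seasonal spike from triggering an infrastructure project.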

01 The 2026 landscape verdict

ElevenLabs is no longer the outright quality leader, but it is still the most complete platform.

Reading: for a brand-voice, multi-product holding company, ecosystem breadth beats a few Elo points. Standardizing cuts integration cost across ventures.

02 Per-venture stacks

A · Aquiii / RiiiRiii brand mascot voice

Consumer, Spanish-first Mexico, expressive character, WhatsApp ordering.

| Layer | Pick | Why |
| --- | --- | --- |
| TTS | ElevenLabs Eleven v3 | 70+ languages incl. es-MX, deepest emotion via audio tags, cloning to lock the exact mascot voice |
| STT | ElevenLabs Scribe v2 | ~4% WER, 90+ languages, conversational |
| Orchestration | ElevenLabs Agents | Native WhatsApp inbound/outbound (messages + calls) |

Biggest risk: Latency. v3 trades speed for quality (1-2s). Use Flash v2.5 for real-time turns; reserve v3 for scripted/expressive content.
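The split between Flash v2.5 for live turns and v3 for scripted content can be enforced in one routing helper. The sketch below is illustrative only: the model ID strings are assumed values, the `route_tts` name is hypothetical, and stripping square-bracket audio tags before sending text to Flash is an assumption about how to keep v3-style markup from being read aloud.

```python
import re

FLASH_MODEL = "eleven_flash_v2_5"  # assumed model ID for Flash v2.5
V3_MODEL = "eleven_v3"             # assumed model ID for Eleven v3

# Matches v3-style audio tags such as "[excited]" plus any trailing whitespace.
AUDIO_TAG = re.compile(r"\[[^\[\]]+\]\s*")


def route_tts(text: str, live: bool) -> tuple[str, str]:
    """Pick a model per turn and return (model_id, text_to_send)."""
    if live:
        # Real-time turn: favor latency and drop expressive tags.
        return FLASH_MODEL, AUDIO_TAG.sub("", text).strip()
    # Scripted or pre-rendered content: keep tags for the expressive model.
    return V3_MODEL, text


print(route_tts("[excited] Hola, soy RiiiRiii!", live=True))
```

Keeping the choice in one function makes it easy to log which model served each turn, which is the data the latency risk needs.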

B · Safa Health / Together

Chronic-disease care, ES/EN/AR, handles patient health info (PHI).

| Layer | Pick | Why |
| --- | --- | --- |
| TTS | Azure Neural | 140+ languages incl. Arabic, enterprise HIPAA. Deepgram TTS is English-only; ElevenLabs cloning is not confirmed BAA-covered |
| STT | Deepgram Nova-3 Multilingual | High accuracy on noisy audio, auto language detect, BAA available for PHI |
| Orchestration | Chained pipeline (Pipecat / LiveKit) to a HIPAA-eligible LLM (Azure OpenAI) | OpenAI Realtime audio is NOT HIPAA-eligible as of May 2026 |

Biggest risk: Loss of real-time speed. Compliance forces a chained pipeline, so expect 1.2-1.8s end-to-end and less natural prosody.
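One way to hold the chained pipeline to the expected 1.2-1.8s range is to time each hop and classify every turn against it. This is a sketch under stated assumptions: the four-stage breakdown (STT, LLM, time to first TTS audio, transport) and the `TurnTiming` type are hypothetical, and the sample values are placeholders, not measurements.

```python
from dataclasses import dataclass

EXPECTED_LOW_S = 1.2   # lower end of the expected end-to-end range
EXPECTED_HIGH_S = 1.8  # upper end of the expected end-to-end range


@dataclass(frozen=True)
class TurnTiming:
    """Hypothetical per-turn timings, in seconds, for a chained voice pipeline."""
    stt_s: float
    llm_s: float
    tts_first_audio_s: float
    transport_s: float = 0.0

    @property
    def total_s(self) -> float:
        # Stages run in sequence in a chained pipeline, so latencies add up.
        return self.stt_s + self.llm_s + self.tts_first_audio_s + self.transport_s


def classify(turn: TurnTiming) -> str:
    """Compare one turn against the expected end-to-end range."""
    if turn.total_s < EXPECTED_LOW_S:
        return "faster than expected"
    if turn.total_s <= EXPECTED_HIGH_S:
        return "within expected range"
    return "over budget"


# Placeholder numbers for illustration only.
print(classify(TurnTiming(stt_s=0.3, llm_s=0.7, tts_first_audio_s=0.3, transport_s=0.1)))
```

Recording the stages separately shows which hop to tune when a turn lands over budget.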

C · Aquiii support agents

Inbound/outbound phone, real-time, cost-sensitive at scale.

| Layer | Pick | Why |
| --- | --- | --- |
| TTS | Cartesia Sonic 3 | Latency leader (40ms TTFA), $50 / 1M char |
| STT | Deepgram Nova-3 | Sub-300ms, $0.0048 / min |
| Orchestration | Retell AI | Built for telephony: SIP, inbound routing, warm human handoff |

Biggest risk: PSTN audio degradation. 8kHz G.711 phone audio hurts both STT accuracy and TTS naturalness vs web audio.
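The two unit prices quoted for this stack are enough for a rough monthly estimate of the speech layers. The sketch below uses only those two figures; the function name and the example volumes are hypothetical, and Retell orchestration and carrier fees are left out because no price for them is given here.

```python
TTS_USD_PER_MILLION_CHARS = 50.0  # Cartesia Sonic 3 price quoted above
STT_USD_PER_MINUTE = 0.0048       # Deepgram Nova-3 price quoted above


def monthly_speech_cost_usd(tts_chars: int, stt_minutes: float) -> dict[str, float]:
    """Estimate monthly TTS + STT spend; excludes orchestration and telephony fees."""
    tts = tts_chars / 1_000_000 * TTS_USD_PER_MILLION_CHARS
    stt = stt_minutes * STT_USD_PER_MINUTE
    return {
        "tts": round(tts, 2),
        "stt": round(stt, 2),
        "total": round(tts + stt, 2),
    }


# Hypothetical volume: 10M synthesized characters and 100,000 transcribed minutes.
print(monthly_speech_cost_usd(10_000_000, 100_000))
# → {'tts': 500.0, 'stt': 480.0, 'total': 980.0}
```

At these example volumes the two layers cost about the same, so trimming either reply length or call duration moves the bill by a similar amount.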

03 Cost arc · start, scale, own

04 HIPAA reality

The load-bearing constraint for Safa Health.

ElevenLabs CAN process PHI, conditionally

Only with ALL of: Enterprise tier + signed BAA + Zero Retention Mode ON + LLM restricted to an approved allowlist (Gemini/Claude) or your own API keys.

Two landmines

Zero Retention Mode disables inbound WhatsApp and request-stitching/history. You cannot have both HIPAA-mode and WhatsApp on the same agent.

Voice cloning is not confirmed BAA-covered and is listed as not zero-retention-eligible. Treat a cloned voice + PHI as non-compliant until legal confirms.

OpenAI Realtime: audio modality is NOT HIPAA-eligible. Any OpenAI-based health voice must chain HIPAA-safe STT + TTS around a text LLM.

Implication: keep RiiiRiii (cloning + WhatsApp) and Safa Health (HIPAA) on separate stacks. Do not force one vendor config to serve both.
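The conditions and landmines in this section can be checked mechanically before an agent goes live. What follows is a sketch only: `AgentConfig` and its fields are hypothetical and do not mirror any vendor's API, and the allowlist is reduced to the two model families named above. It is no substitute for legal review.

```python
from dataclasses import dataclass

APPROVED_LLM_FAMILIES = {"gemini", "claude"}  # families named in this section


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical summary of one ElevenLabs agent's settings."""
    handles_phi: bool
    enterprise_tier: bool = False
    baa_signed: bool = False
    zero_retention: bool = False
    llm_family: str = ""
    own_api_keys: bool = False
    inbound_whatsapp: bool = False
    cloned_voice: bool = False


def config_problems(cfg: AgentConfig) -> list[str]:
    """Return every rule from this section that the config breaks."""
    problems = []
    if cfg.zero_retention and cfg.inbound_whatsapp:
        problems.append("Zero Retention Mode disables inbound WhatsApp")
    if not cfg.handles_phi:
        return problems  # the remaining rules apply only to PHI workloads
    if not cfg.enterprise_tier:
        problems.append("PHI requires Enterprise tier")
    if not cfg.baa_signed:
        problems.append("PHI requires a signed BAA")
    if not cfg.zero_retention:
        problems.append("PHI requires Zero Retention Mode ON")
    if not (cfg.own_api_keys or cfg.llm_family.lower() in APPROVED_LLM_FAMILIES):
        problems.append("PHI requires an allowlisted LLM or your own API keys")
    if cfg.cloned_voice:
        problems.append("Cloned voice + PHI is non-compliant until legal confirms")
    return problems
```

Running this against a RiiiRiii-style config (cloned voice, inbound WhatsApp) with `handles_phi=True` returns several problems at once, which is the practical argument for separate stacks.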

05 Recommended next actions

  1. Pilot RiiiRiii on ElevenLabs: clone the mascot voice (PVC), build a WhatsApp ordering agent on Flash v2.5, reserve v3 for scripted content.
  2. Spike the Safa Health pipeline on Azure + Deepgram with a redacted-text LLM hop; open the BAA conversation with both vendors early (Enterprise-gated).
  3. Benchmark a telephony POC (Cartesia + Deepgram + Retell) against ElevenLabs Agents on a real Aquiii support flow; decide on measured latency and cost, not the spec sheet.
  4. Instrument volume per workload so the 10M char/mo self-hosting trigger is data-driven.
  5. Revisit in ~2 quarters: Inworld and Gemini TTS are closing fast on quality and could change the default.
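For action 4, per-workload volume can be captured with a counter keyed by workload and calendar month. This is a minimal in-memory sketch: the `VolumeMeter` class is hypothetical, a real deployment would persist the counts, and `len(text)` counts Unicode code points, which may differ from how a vendor bills characters.

```python
from collections import defaultdict
from datetime import date

SELF_HOST_TRIGGER = 10_000_000  # characters per month, from the decision


class VolumeMeter:
    """Hypothetical in-memory counter of TTS characters per workload per month."""

    def __init__(self) -> None:
        self._chars: defaultdict[tuple[str, str], int] = defaultdict(int)

    def record(self, workload: str, text: str, on: date) -> None:
        # Key by (workload, "YYYY-MM") so months sort chronologically as strings.
        self._chars[(workload, on.strftime("%Y-%m"))] += len(text)

    def monthly(self, workload: str) -> dict[str, int]:
        """Month -> character count for one workload, oldest month first."""
        return {m: n for (w, m), n in sorted(self._chars.items()) if w == workload}

    def months_over(self, workload: str, threshold: int = SELF_HOST_TRIGGER) -> list[str]:
        """Months in which the workload exceeded the threshold."""
        return [m for m, n in self.monthly(workload).items() if n > threshold]


meter = VolumeMeter()
meter.record("support", "hello", date(2026, 5, 1))
meter.record("support", "hola", date(2026, 5, 20))
print(meter.monthly("support"))  # → {'2026-05': 9}
```

Calling `record` at the point where text is handed to the TTS client keeps the count tied to what is actually synthesized rather than what the LLM generated.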