Test call uses your browser microphone and speakers. The agent will speak first.
Live conversation
Outbound call
Twilio dials the number from your configured caller ID. The customer's phone rings,
they pick up, and the agent runs the chosen scenario.
Live transcript
Per-turn latency
# | ASR final | LLM TTFT | TTS TTFA | Answer | Reply dur.
Answer = time from when you stopped speaking to first audio out (the perceived response latency). Reply dur. = how long the bot's full multi-sentence reply lasts.
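Both derived columns fall straight out of per-turn timestamps. A minimal sketch of the arithmetic, using hypothetical field names that do not come from the kit's API:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    # All timestamps in seconds on one shared monotonic clock (names are illustrative).
    user_speech_end: float   # caller stopped speaking (ASR endpoint)
    asr_final: float         # final ASR transcript received
    llm_first_token: float   # first LLM token (TTFT)
    tts_first_audio: float   # first TTS audio chunk (TTFA)
    bot_audio_end: float     # last TTS audio chunk played out

def answer_latency(t: Turn) -> float:
    """Perceived response latency: caller stops speaking -> first audio out."""
    return t.tts_first_audio - t.user_speech_end

def reply_duration(t: Turn) -> float:
    """How long the bot's full multi-sentence reply lasts."""
    return t.bot_audio_end - t.tts_first_audio

turn = Turn(0.0, 0.35, 0.80, 1.15, 4.90)
print(f"Answer: {answer_latency(turn):.2f}s, Reply dur.: {reply_duration(turn):.2f}s")
```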
One-click pipeline. Maverick generates phonetically balanced prompts, ElevenLabs synthesises audio in your chosen voice, the system audits every clip, builds NeMo manifests, and queues a training job. You watch the progress here.
1. Source voice
Probing /api/studio/voices…
2. Dataset config
3. Run
The first step is fast (LLM prompts ~30 s). Audio synthesis runs concurrently against ElevenLabs (4-8 requests in flight). 500 utterances ≈ 8-15 min wall-clock.
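The concurrency is just a semaphore capping in-flight requests. A minimal sketch of that pattern; the endpoint path, xi-api-key header, and model_id reflect ElevenLabs' public REST API but should be checked against current docs, and the API key / voice ID are placeholders:

```python
import asyncio, os, pathlib
import httpx  # pip install httpx

API_KEY = os.environ["ELEVEN_API_KEY"]     # placeholder env var
VOICE_ID = os.environ["ELEVEN_VOICE_ID"]   # the source voice picked in step 1
OUT = pathlib.Path("clips"); OUT.mkdir(exist_ok=True)

async def synth_one(client, sem, idx, text):
    async with sem:                         # cap in-flight requests, like the kit's 4-8
        r = await client.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
            headers={"xi-api-key": API_KEY},
            json={"text": text, "model_id": "eleven_multilingual_v2"},
            timeout=120,
        )
        r.raise_for_status()
        # Default response is MP3; the real pipeline then audits clips and
        # builds WAVs + NeMo manifests.
        (OUT / f"utt_{idx:04d}.mp3").write_bytes(r.content)

async def main(prompts):
    sem = asyncio.Semaphore(6)              # 4-8 in flight
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(synth_one(client, sem, i, p)
                               for i, p in enumerate(prompts)))

asyncio.run(main(["Καλησπέρα σας.", "Πόσο είναι το ελάχιστο ποντάρισμα;"]))
```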
Active & recent jobs
No jobs yet. Click "Start dataset generation" above.
How to make the model 1000% accurate
Training a custom voice for Riva is fundamentally a data-quality problem, not a compute problem. A clean 30 minutes outperforms a sloppy 5 hours. Read this whole page before recording — most issues are unrecoverable after capture.
1. The room
Closet method (best): sit inside a clothes closet. The hanging clothes absorb echo; it's the best free studio you have.
Soft surfaces beat hard ones. Carpets, curtains, sofa, bed — all absorb. Glass, tile, drywall — all reflect.
Avoid bathrooms, kitchens, empty rooms. Reverb leaks into the recording and becomes a permanent artifact in the cloned voice.
Kill all noise sources: HVAC off, fridge off if possible, phones on silent, dogs out. Even a faint fan-hum is learnt by the model.
Same room every session. Recording two sessions in different rooms produces audible "voice changes" in the synth. Pick one space, commit to it.
2. The microphone
Required class: USB condenser, not the built-in laptop mic. Built-in mics record room more than voice.
Recommended: RØDE NT-USB+ (€200), Shure MV7 (€250), Blue Yeti X (€170). All are fine.
Pop filter — non-negotiable. €10. Without one, every "P" and "B" peaks and clips. Clipping is unrecoverable.
Distance: 15-20 cm with the pop filter between you and the capsule. Same distance every session.
Cardioid polar pattern. Most USB condensers default to this; check your mic's settings.
3. Your voice
Same time of day. Voice changes through the day (morning gravelly, evening tired). Pick one slot, stick to it. Mid-morning is generally most consistent.
Hydrate. Water before and during. Avoid dairy (creates phlegm). Avoid coffee right before (dries out vocal cords).
Sit, don't stand. More consistent posture and breath support.
30-min sessions max. Take 10-min breaks. Voice fatigue produces inconsistent training data — a worse problem than less data.
Read in your normal speaking voice, not a "presenter" voice. The bot will sound like the voice you train. Demo acceptance favours a conversational tone.
Pace yourself. Don't rush. Let punctuation breathe. Comma → small pause. Period → full pause. Question → rising pitch.
4. Dataset size — find your minimum
Stage | Net audio | Reading effort | Quality
Smoke test | 30 min | ~1 h | Recognisably you, robotic on hard sounds
Demo-ready | 1.5 h | ~3 h over a couple of days | Acceptable for prospect demo
Production | 5 h | ~10 h over a week | What commercial voice agents ship with
Audiobook-tier | 20+ h | ~50 h over a month | Indistinguishable in casual listening
Don't commit to 5 hours upfront. Record 30 min, train, listen. Decide if quality is acceptable; if not, record more. The kit's "Record" tab shows your running net-audio total per stage.
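The running net-audio total is just the summed duration of your saved WAVs. A minimal stand-alone sketch for checking a folder yourself (the path is illustrative):

```python
import pathlib
import wave

def net_audio_hours(folder: str) -> float:
    """Sum the durations of every WAV file in a recording folder."""
    total_s = 0.0
    for path in pathlib.Path(folder).expanduser().glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total_s += w.getnframes() / w.getframerate()
    return total_s / 3600

# Adjust to whatever destination folder you chose in the Record tab.
print(f"{net_audio_hours('~/Desktop/voice'):.2f} h net audio")
```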
5. Phonetic coverage matters more than raw hours
5 hours of "the cat sat on the mat" repeated produces a model that knows only those phonemes. Your prompt list must cover:
Every vowel in stressed and unstressed positions.
Every consonant in initial, medial, and final positions of words.
All sentence intonations — declarative, interrogative, exclamatory. Without question prompts you'll get a model that can't ask questions.
Numbers, dates, and currency amounts spelled out, at least 50 examples of each.
The included stage_0_starter.txt is a curated 50-prompt set covering common Greek casino phonetics. For longer datasets, use Mozilla Common Voice Greek's prompt list — it's already phonetically balanced by construction.
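A rough way to sanity-check a prompt list against these requirements before recording is to count how often each letter appears word-initially, medially, and finally, and how the sentence types are distributed. A crude character-level sketch (a hypothetical stand-in for a real phonemizer, reading the included stage_0_starter.txt):

```python
from collections import Counter
import re

def coverage_report(prompts):
    """Crude coverage check: letter-by-position counts and sentence-type mix.
    Character-level only; a real check would run a phonemizer instead."""
    positions = Counter()
    sentence_types = Counter()
    for line in prompts:
        line = line.strip()
        if not line:
            continue
        if line.endswith((";", "?")):        # the Greek question mark is ';'
            sentence_types["interrogative"] += 1
        elif line.endswith("!"):
            sentence_types["exclamatory"] += 1
        else:
            sentence_types["declarative"] += 1
        for word in re.findall(r"\w+", line.lower()):
            for i, ch in enumerate(word):
                pos = "initial" if i == 0 else "final" if i == len(word) - 1 else "medial"
                positions[(ch, pos)] += 1
    return positions, sentence_types

prompts = open("stage_0_starter.txt", encoding="utf-8").read().splitlines()
positions, types = coverage_report(prompts)
print(types)
rare = [key for key, n in positions.items() if n < 3]
print(f"{len(rare)} letter/position pairs appear fewer than 3 times")
```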
6. Pre-flight checklist
Quiet room with soft surfaces ✓
Mic at fixed 15-20 cm distance with pop filter ✓
Mic level peaks at ~70-90% on loudest words (no clipping) ✓
Test recording one prompt → listen back → no echo, no hum, no breath blast ✓
Same time-of-day slot scheduled for all sessions ✓
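If you want a second opinion on the mic-level item beyond the on-screen meter, a small sketch that reports the peak level of a test recording and flags samples at or near digital full scale (16-bit PCM assumed; the filename is illustrative):

```python
import wave
import numpy as np

def check_levels(path, clip_threshold=0.99):
    """Report peak level and count samples at or near digital full scale."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    x = np.abs(x.astype(np.float32)) / 32768.0
    peak, clipped = float(x.max()), int((x >= clip_threshold).sum())
    print(f"peak at {peak * 100:.0f}% of full scale, {clipped} near-clipped samples")
    return peak, clipped

check_levels("test_prompt.wav")   # aim for ~70-90% peaks and zero near-clipped samples
```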
7. Common pitfalls
Mouth noises (clicks, smacks) — swallow before each prompt. Lip-smacks are the hardest artifact to remove.
Inconsistent distance — recorded level falls off roughly with the square of the distance (doubling from 15 cm to 30 cm costs about 6 dB). Mark a fixed spot.
Re-recording dirty takes — never keep a take with a mistake "to fix later". Press redo, do it clean. The model learns whatever is in the data.
Reading in a robot voice — your speaking voice is what the bot will sound like. Speak naturally as if telling a friend.
Recording when sick/congested — your voice is different. Skip that day, resume when healthy.
8. After recording — what we do
Quality control pass: trim silence, detect clipping, check loudness consistency, listen to a random 5% of clips.
Forced alignment with the Montreal Forced Aligner — produces the phoneme-level timestamps NeMo needs.
FastPitch training on a separate L40S/A100 GPU — 18-24 hours.
HiFi-GAN fine-tune from NVIDIA's universal vocoder — 3-6 hours.
Convert to Riva via nemo2riva + riva-build.
Deploy as a new riva-speech-greek service on the Nebius box (separate gRPC port from Magpie).
You test; if not good enough, record more, retrain.
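The "you test" step can be a short script against the new service's gRPC port using the nvidia-riva-client package. A minimal sketch; the host, port, voice name, and language code below are assumptions, so substitute whatever the riva-speech-greek deployment actually exposes:

```python
import wave
import riva.client  # pip install nvidia-riva-client

# Host, port, and voice name are placeholders for the riva-speech-greek service.
auth = riva.client.Auth(uri="nebius-host:50052", use_ssl=False)
tts = riva.client.SpeechSynthesisService(auth)

resp = tts.synthesize(
    text="Καλησπέρα, πώς μπορώ να σας εξυπηρετήσω;",
    voice_name="Greek-Custom.Female-1",   # assumed voice name from riva-build
    language_code="el-GR",
    sample_rate_hz=44100,
)

with wave.open("probe.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)          # LINEAR_PCM, 16-bit
    out.setframerate(44100)
    out.writeframes(resp.audio)
print("wrote probe.wav — listen, then decide whether to record more and retrain")
```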
9. The 1000%-accuracy reality check
Voice cloning quality plateaus around what the dataset supports. With our pipeline:
5 h clean dataset + good phonetic balance + clean room: ~95% intelligibility, mostly natural prosody. Sounds like you.
20 h with a voice actor and studio recording: ~99% intelligibility, indistinguishable in casual listening.
"Indistinguishable from real human in adversarial testing" requires either hundreds of hours or model architectures (e.g. VALL-E-X scale) that don't exist self-hosted yet.
"1000% accurate" doesn't exist. But "indistinguishable for a 2-min sales call" is achievable with 5 hours and discipline.
Record
Workflow
Click "Generate with LLM" (right panel) → pick language → wait ~5 s for prompts to load
Click "Choose folder" and pick a destination on your Mac (e.g. ~/Desktop/voice/). Required before saving.
Press SPACE or click Record, read the prompt, SPACE again to stop
Press ENTER or click "Save & Next" to write the WAV and advance
List of recorded datasets ready for training. Drop a folder produced by the recorder into the ~/riva-stack/datasets/ bind mount on Nebius and it appears here.
Stub — full dataset management UI lands in v2 (after the demo). For now: ship the recorded folder to me out-of-band and I run training on a separate GPU pod.
Training jobs
Submit a dataset for training, monitor progress, see live logs.
Stub — needs a separate GPU training pod with a queue + WebSocket log relay. v2 build. For demo, training runs on an off-board GPU pod and I report results back manually.
Trained models
List of completed voice models, deploy controls, A/B comparison playground.
Stub — v2. Currently models are deployed manually as riva-speech-greek docker-compose services after training completes.