Test call uses your browser microphone and speakers. The agent will speak first.
Live conversation
Outbound call
Twilio dials the number from your configured caller ID. The customer's phone rings,
they pick up, and the agent runs the chosen scenario.
Live transcript
Per-turn latency
# | ASR final | LLM TTFT | TTS TTFA | Answer | Reply dur.
Answer = time from when you stopped speaking to first audio out (the perceived response latency). Reply dur. = how long the bot's full multi-sentence reply lasts.
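Both derived columns fall straight out of per-turn timestamps. A minimal sketch of the arithmetic, using hypothetical field names that do not come from the kit's API:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    # All timestamps in seconds on one shared monotonic clock (names are illustrative).
    user_speech_end: float   # caller stopped speaking (ASR endpoint)
    asr_final: float         # final ASR transcript received
    llm_first_token: float   # first LLM token (TTFT)
    tts_first_audio: float   # first TTS audio chunk (TTFA)
    bot_audio_end: float     # last TTS audio chunk played out

def answer_latency(t: Turn) -> float:
    """Perceived response latency: caller stops speaking -> first audio out."""
    return t.tts_first_audio - t.user_speech_end

def reply_duration(t: Turn) -> float:
    """How long the bot's full multi-sentence reply lasts."""
    return t.bot_audio_end - t.tts_first_audio

turn = Turn(0.0, 0.35, 0.80, 1.15, 4.90)
print(f"Answer: {answer_latency(turn):.2f}s, Reply dur.: {reply_duration(turn):.2f}s")
```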
One-click pipeline. Maverick generates phonetically balanced prompts, ElevenLabs synthesises audio in your chosen voice, the system audits every clip, builds NeMo manifests, and queues a training job. You watch the progress here.
1. Source voice
Probing /api/studio/voices…
2. Dataset config
3. Run
The first step is fast (LLM prompts ~30 s). Audio synthesis runs concurrently against ElevenLabs (4-8 requests in flight). 500 utterances ≈ 8-15 min wall-clock.
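The concurrency is just a semaphore capping in-flight requests. A minimal sketch of that pattern; the endpoint path, xi-api-key header, and model_id reflect ElevenLabs' public REST API but should be checked against current docs, and the API key / voice ID are placeholders:

```python
import asyncio, os, pathlib
import httpx  # pip install httpx

API_KEY = os.environ["ELEVEN_API_KEY"]     # placeholder env var
VOICE_ID = os.environ["ELEVEN_VOICE_ID"]   # the source voice picked in step 1
OUT = pathlib.Path("clips"); OUT.mkdir(exist_ok=True)

async def synth_one(client, sem, idx, text):
    async with sem:                         # cap in-flight requests, like the kit's 4-8
        r = await client.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
            headers={"xi-api-key": API_KEY},
            json={"text": text, "model_id": "eleven_multilingual_v2"},
            timeout=120,
        )
        r.raise_for_status()
        # Default response is MP3; the real pipeline then audits clips and
        # builds WAVs + NeMo manifests.
        (OUT / f"utt_{idx:04d}.mp3").write_bytes(r.content)

async def main(prompts):
    sem = asyncio.Semaphore(6)              # 4-8 in flight
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(synth_one(client, sem, i, p)
                               for i, p in enumerate(prompts)))

asyncio.run(main(["Καλησπέρα σας.", "Πόσο είναι το ελάχιστο ποντάρισμα;"]))
```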
Active & recent jobs
No jobs yet. Click "Start dataset generation" above.
How to make the model 1000% accurate
Training a custom voice for Riva is fundamentally a data-quality problem, not a compute problem. A clean 30 minutes outperforms a sloppy 5 hours. Read this whole page before recording — most issues are unrecoverable after capture.
1. The room
Closet method (best): sit inside a clothes closet. The hanging clothes absorb echo; it's the best free studio you have.
Soft surfaces beat hard ones. Carpets, curtains, sofa, bed — all absorb. Glass, tile, drywall — all reflect.
Avoid bathrooms, kitchens, empty rooms. Reverb leaks into the recording and becomes a permanent artifact in the cloned voice.
Kill all noise sources: HVAC off, fridge off if possible, phones on silent, dogs out. Even a faint fan-hum is learnt by the model.
Same room every session. Recording two sessions in different rooms produces audible "voice changes" in the synth. Pick one space, commit to it.
2. The microphone
Required class: USB condenser, not the built-in laptop mic. Built-in mics record room more than voice.
Recommended: RØDE NT-USB+ (€200), Shure MV7 (€250), Blue Yeti X (€170). All are fine.
Pop filter — non-negotiable. €10. Without one, every "P" and "B" peaks and clips. Clipping is unrecoverable.
Distance: 15-20 cm with the pop filter between you and the capsule. Same distance every session.
Cardioid polar pattern. Most USB condensers default to this; check your mic's settings.
3. Your voice
Same time of day. Voice changes through the day (morning gravelly, evening tired). Pick one slot, stick to it. Mid-morning is generally most consistent.
Hydrate. Water before and during. Avoid dairy (creates phlegm). Avoid coffee right before (dries out vocal cords).
Sit, don't stand. More consistent posture and breath support.
30-min sessions max. Take 10-min breaks. Voice fatigue produces inconsistent training data — a worse problem than less data.
Read in your normal speaking voice, not a "presenter" voice. The bot will sound like the voice you train. Demo acceptance favours a conversational tone.
Pace yourself. Don't rush. Let punctuation breathe. Comma → small pause. Period → full pause. Question → rising pitch.
4. Dataset size — find your minimum
Stage | Net audio | Reading effort | Quality
Smoke test | 30 min | ~1 h | Recognisably you, robotic on hard sounds
Demo-ready | 1.5 h | ~3 h over a couple of days | Acceptable for prospect demo
Production | 5 h | ~10 h over a week | What commercial voice agents ship with
Audiobook-tier | 20+ h | ~50 h over a month | Indistinguishable in casual listening
Don't commit to 5 hours upfront. Record 30 min, train, listen. Decide if quality is acceptable; if not, record more. The kit's "Record" tab shows your running net-audio total per stage.
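The running net-audio total is just the summed duration of your saved WAVs. A minimal stand-alone sketch for checking a folder yourself (the path is illustrative):

```python
import pathlib
import wave

def net_audio_hours(folder: str) -> float:
    """Sum the durations of every WAV file in a recording folder."""
    total_s = 0.0
    for path in pathlib.Path(folder).expanduser().glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total_s += w.getnframes() / w.getframerate()
    return total_s / 3600

# Adjust to whatever destination folder you chose in the Record tab.
print(f"{net_audio_hours('~/Desktop/voice'):.2f} h net audio")
```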
5. Phonetic coverage matters more than raw hours
5 hours of "the cat sat on the mat" repeated produces a model that knows only those phonemes. Your prompt list must cover:
Every vowel in stressed and unstressed positions.
Every consonant in initial, medial, and final positions of words.
All sentence intonations — declarative, interrogative, exclamatory. Without question prompts you'll get a model that can't ask questions.
Numbers, dates, and currency amounts spelled out, at least 50 examples of each.
The included stage_0_starter.txt is a curated 50-prompt set covering common Greek casino phonetics. For longer datasets, use Mozilla Common Voice Greek's prompt list — it's already phonetically balanced by construction.
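A rough way to sanity-check a prompt list against these requirements before recording is to count how often each letter appears word-initially, medially, and finally, and how the sentence types are distributed. A crude character-level sketch (a hypothetical stand-in for a real phonemizer, reading the included stage_0_starter.txt):

```python
from collections import Counter
import re

def coverage_report(prompts):
    """Crude coverage check: letter-by-position counts and sentence-type mix.
    Character-level only; a real check would run a phonemizer instead."""
    positions = Counter()
    sentence_types = Counter()
    for line in prompts:
        line = line.strip()
        if not line:
            continue
        if line.endswith((";", "?")):        # the Greek question mark is ';'
            sentence_types["interrogative"] += 1
        elif line.endswith("!"):
            sentence_types["exclamatory"] += 1
        else:
            sentence_types["declarative"] += 1
        for word in re.findall(r"\w+", line.lower()):
            for i, ch in enumerate(word):
                pos = "initial" if i == 0 else "final" if i == len(word) - 1 else "medial"
                positions[(ch, pos)] += 1
    return positions, sentence_types

prompts = open("stage_0_starter.txt", encoding="utf-8").read().splitlines()
positions, types = coverage_report(prompts)
print(types)
rare = [key for key, n in positions.items() if n < 3]
print(f"{len(rare)} letter/position pairs appear fewer than 3 times")
```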
6. Pre-flight checklist
Quiet room with soft surfaces ✓
Mic at fixed 15-20 cm distance with pop filter ✓
Mic level peaks at ~70-90% on loudest words (no clipping) ✓
Test recording one prompt → listen back → no echo, no hum, no breath blast ✓
Same time-of-day slot scheduled for all sessions ✓
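If you want a second opinion on the mic-level item beyond the on-screen meter, a small sketch that reports the peak level of a test recording and flags samples at or near digital full scale (16-bit PCM assumed; the filename is illustrative):

```python
import wave
import numpy as np

def check_levels(path, clip_threshold=0.99):
    """Report peak level and count samples at or near digital full scale."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    x = np.abs(x.astype(np.float32)) / 32768.0
    peak, clipped = float(x.max()), int((x >= clip_threshold).sum())
    print(f"peak at {peak * 100:.0f}% of full scale, {clipped} near-clipped samples")
    return peak, clipped

check_levels("test_prompt.wav")   # aim for ~70-90% peaks and zero near-clipped samples
```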
7. Common pitfalls
Mouth noises (clicks, smacks) — swallow before each prompt. Lip-smacks are the hardest artifact to remove.
Inconsistent distance — recorded level falls off roughly with the square of the distance (doubling from 15 cm to 30 cm costs about 6 dB). Mark a fixed spot.
Re-recording dirty takes — never keep a take with a mistake "to fix later". Press redo, do it clean. The model learns whatever is in the data.
Reading in a robot voice — your speaking voice is what the bot will sound like. Speak naturally as if telling a friend.
Recording when sick/congested — your voice is different. Skip that day, resume when healthy.
8. After recording — what we do
Quality control pass: trim silence, detect clipping, check loudness consistency, listen to a random 5% of clips.
Forced alignment with the Montreal Forced Aligner — produces the phoneme-level timestamps NeMo needs.
FastPitch training on a separate L40S/A100 GPU — 18-24 hours.
HiFi-GAN fine-tune from NVIDIA's universal vocoder — 3-6 hours.
Convert to Riva via nemo2riva + riva-build.
Deploy as a new riva-speech-greek service on the Nebius box (separate gRPC port from Magpie).
You test; if not good enough, record more, retrain.
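The "you test" step can be a short script against the new service's gRPC port using the nvidia-riva-client package. A minimal sketch; the host, port, voice name, and language code below are assumptions, so substitute whatever the riva-speech-greek deployment actually exposes:

```python
import wave
import riva.client  # pip install nvidia-riva-client

# Host, port, and voice name are placeholders for the riva-speech-greek service.
auth = riva.client.Auth(uri="nebius-host:50052", use_ssl=False)
tts = riva.client.SpeechSynthesisService(auth)

resp = tts.synthesize(
    text="Καλησπέρα, πώς μπορώ να σας εξυπηρετήσω;",
    voice_name="Greek-Custom.Female-1",   # assumed voice name from riva-build
    language_code="el-GR",
    sample_rate_hz=44100,
)

with wave.open("probe.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)          # LINEAR_PCM, 16-bit
    out.setframerate(44100)
    out.writeframes(resp.audio)
print("wrote probe.wav — listen, then decide whether to record more and retrain")
```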
9. The 1000%-accuracy reality check
Voice cloning quality plateaus around what the dataset supports. With our pipeline:
5 h clean dataset + good phonetic balance + clean room: ~95% intelligibility, mostly natural prosody. Sounds like you.
20 h with a voice actor and studio recording: ~99% intelligibility, indistinguishable in casual listening.
"Indistinguishable from real human in adversarial testing" requires either hundreds of hours or model architectures (e.g. VALL-E-X scale) that don't exist self-hosted yet.
"1000% accurate" doesn't exist. But "indistinguishable for a 2-min sales call" is achievable with 5 hours and discipline.
Record
Workflow
Click "Generate with LLM" (right panel) → pick language → wait ~5 s for prompts to load
Click "Choose folder" and pick a destination on your Mac (e.g. ~/Desktop/voice/). Required before saving.
Press SPACE or click Record, read the prompt, SPACE again to stop
Press ENTER or click "Save & Next" to write the WAV and advance
List of recorded datasets ready for training. Drop a folder produced by the recorder into the ~/riva-stack/datasets/ bind mount on Nebius and it appears here.
Stub — full dataset management UI lands in v2 (after the demo). For now: ship the recorded folder to me out-of-band and I run training on a separate GPU pod.
Training jobs
Submit a dataset for training, monitor progress, see live logs.
Stub — needs a separate GPU training pod with a queue + WebSocket log relay. v2 build. For demo, training runs on an off-board GPU pod and I report results back manually.
Trained models
List of completed voice models, deploy controls, A/B comparison playground.
Stub — v2. Currently models are deployed manually as riva-speech-greek docker-compose services after training completes.