Your agent can hear them. Now it can grade them.
Chivox MCP turns raw speech into a dense, agent-ready payload — phoneme scores, stress, tone, fluency, audio quality — all in one MCP call, for any LLM. The listening layer under every voice-native agent you’re about to ship.
60+ phonemes scored for accuracy, stress and intonation — not a single opaque number.
{
"overall": 48,
"pron": { "accuracy": 44, "integrity": 90, "fluency": 72, "rhythm": 65 },
"fluency": { "pause": 2, "speed": 118 },
"audio_quality": { "snr": 19.2, "clip": 0 },
"details": [
{
"word": "think",
"score": 48, "dp_type": "mispron",
"start": 2400, "end": 2910,
"liaison": "none",
"phonemes": [
{ "ipa": "θ", "score": 35, "dp_type": "mispron" },
{ "ipa": "ɪ", "score": 88, "dp_type": "normal" }
],
"phoneme_error": { "expected": "/θ/", "actual": "/s/" }
}
]
}

“I noticed you pronounced think as sink. Place your tongue between your teeth for the /θ/ sound. Try: ‘Thirty thirsty thinkers thought…’”
The listening layer, as four MCP tools
Twenty years of pronunciation-assessment R&D, exposed as a structured payload your LLM can reason over. Drop into LangChain, LlamaIndex, the OpenAI Agents SDK or any custom loop — skip the months of DSP work.
Score a learner’s speech
Stream mic audio or post a file. Get overall / accuracy / integrity / fluency / rhythm scores, plus word and phoneme-level diagnostics.
Mandarin & English, natively
Tones, pinyin, neutral tone, erhua, tone sandhi for Chinese. Stress, rhythm, CEFR-aligned scoring for English. One flag switches between them.
Score free-flow dialogue
Open-ended AI-talk evaluation returns 5-dimensional scores on fluency, content, grammar, accuracy and rhythm — ready for the next LLM turn.
Personalize the next practice
Feed the JSON straight to GPT / Claude / Gemini. Use the shipped prompt-skill to generate targeted drills for weak phonemes or tones.
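That loop can be sketched in a few lines of Python. Field names follow the sample payload shown above; the score threshold and the prompt wording are our own illustrative choices, not part of the Chivox API:

```python
def weak_phonemes(payload: dict, threshold: int = 60) -> list[dict]:
    """Collect per-phoneme rows scored below the threshold."""
    return [
        {"word": w["word"], "ipa": p["ipa"], "score": p["score"]}
        for w in payload.get("details", [])
        for p in w.get("phonemes", [])
        if p["score"] < threshold
    ]

def drill_prompt(payload: dict) -> str:
    """Render an LLM prompt asking for drills on the weakest phonemes."""
    rows = [f'/{w["ipa"]}/ in "{w["word"]}" scored {w["score"]}'
            for w in weak_phonemes(payload)]
    return ("The learner's weakest phonemes:\n" + "\n".join(rows)
            + "\nWrite three short practice sentences targeting them.")

# Trimmed version of the payload shown above
sample = {"details": [{"word": "think", "phonemes": [
    {"ipa": "θ", "score": 35}, {"ipa": "ɪ", "score": 88}]}]}
print(drill_prompt(sample))
```

The resulting string goes straight into the next chat turn; the model never has to guess which sound failed.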
Production-ready in 3 steps
Watch it run. Paste config → server connects → your LLM calls a tool and gets structured scores back.
Add one block to your MCP config
Paste the snippet into Cursor, Claude Desktop, or your custom agent — pick a tab on the right.
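The config block has the standard MCP shape. The server key, package name, and env-var name below are placeholders, not the real values — use the exact snippet from the tab that matches your client:

```json
{
  "mcpServers": {
    "chivox": {
      "command": "npx",
      "args": ["-y", "chivox-mcp"],
      "env": { "CHIVOX_API_KEY": "<your-key>" }
    }
  }
}
```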
Call a tool from your LLM
Hand your model the audio. It gets back nested JSON: pron sub-scores, fluency + WPM, audio SNR, and details[] with ms ranges, stress, liaison and per-phoneme rows.
API reference

Mandarin is where the payload proves itself.
If it resolves tonal sandhi, it resolves anything.
English handles the long tail of L2 learners. Mandarin is the pressure test — four tones, sandhi, erhua, retroflex, the acoustic edge cases that kill generic STT. Both ship as pron.* / details[] on the same payload contract, with tone objects and per-phoneme windows for zh, stress and CEFR alignment for en. One integration, two acoustically opposite languages.
你好,今天天气……
Score a heritage speaker mid-sentence as they flip between languages — “I told her 我下周回家 and she was thrilled.” Returns separate en / zh sub-scores plus a blended fluency index. Same payload contract, two languages interleaved.
Built for what developers actually ship
Tutors, coaches, companions, QA tooling — pick the scenario that’s yours and see how the agent loop looks in practice.
The only MCP that feeds LLMs phoneme-level Mandarin
Tone objects, sandhi resolution and per-phoneme windows returned in the same payload shape every other language ships. Your agent reasons over 睡觉 vs. 水饺 at the acoustic layer, not the transcript — signal a Whisper-stack integration simply can’t surface.
Score candidate speech, not just transcripts
Screen English fluency, pronunciation confidence and rhythm at scale. Your LLM reasons over numbers, not vibes — explainable rubrics every HR team will trust.
Agent training & call-script compliance
Evaluate standard-phrase delivery, articulation, pacing and keyword hits for call-center reps. Flag exactly which second drifted off-script and auto-generate coaching drills.
Voice-gated NPCs and pronunciation-powered gameplay
Players unlock spells, dialogues or levels by saying the phrase correctly. Get a pass/fail plus the exact phoneme that missed, at <300 ms p95 — fast enough for real-time game loops.
Speech scoring driven by research
The engine behind Chivox MCP is 20 years of R&D in pronunciation assessment. Here’s how it holds up in production.
Scores align with certified human expert rubrics at 95%+ correlation. Validated by national standardized speaking tests in 100+ cities.
Same payload. Your agent. Your production loop.
Drop Chivox MCP into Cursor, Claude Desktop, or any agent SDK. One npx and you’re reading the same JSON you just saw above.
Starter key free · spend caps · low-balance alerts · zero audio retention
Let’s build your voice agent together.
Tell us what you’re building. We’ll reply within one business day with pilot credits, pricing, or a deployment plan — whichever you need first.
- Enterprise pricing & self-hosted deployments: volume tiers, VPC install, SLAs, and on-prem engines for regulated buyers.
- Missing a language or dialect? We train new acoustic models on request. Send us your target accent.
- Pilot credits for evaluation teams: free benchmark run on your own audio, with a side-by-side report.