Give your agent phoneme-level ears.
Chivox MCP returns a structured score matrix for every syllable, tone, stress, phoneme, and pause — in both Mandarin and English. These docs take you from npx to production in ten minutes.
Introduction
Chivox MCP is a Model-Context-Protocol server that turns any LLM into a linguistics-grade speech examiner. Point your agent at an audio clip and it gets back a structured score matrix — overall, accuracy, fluency, rhythm, plus syllable / word / phoneme detail for both English and tonal Mandarin.
The public catalog ships 16 tools today: 10 English tasks (word, word correction, vocabulary, sentence, sentence correction, paragraph, phonics, multiple choice, semi-open, realtime) and 6 Mandarin tasks (character, pinyin, sentence, paragraph, constrained recognition, AI-Talk). Every tool returns the same top-level shape, so switching locales or granularities costs you zero schema work.
If you’ve shipped OpenAI Realtime or Whisper, you already know what transcription buys you. Chivox MCP is the layer on top: how well did the user actually say it, down to individual phonemes and tones, with every field typed and documented.
Architecture
Chivox exposes two parallel front doors to the same scoring engine. Pick whichever matches your client runtime — the scoring result is byte-identical.
- MCP mode — JSON-RPC 2.0 over Streamable HTTP or stdio
  - Zero-code drop-in for any MCP-aware client — Cursor, Claude Desktop, Cline, LangChain, Mastra, Agents SDK.
  - Tool list auto-injected; new tools require no client change.
  - Optional local proxy chivox-local-mcp adds microphone streaming.
- Function-calling mode (aka cvx_fc) — OpenAI-style REST + WebSocket
  - No MCP SDK required — any HTTP/WS client works.
  - Built-in resume_token, intermediate results, backpressure frames.
  - Ships with iOS, Android, Flutter, mini-program, and legacy Java / PHP backends in mind.
The /ws/audio/{sid} and function-calling /ws/eval/{sid} endpoints live in separate session namespaces — a session_id from one won't work on the other.

Requirements
Anything that speaks MCP over Streamable HTTP or stdio. If you only need file-based scoring (no live mic), you don’t need a local proxy at all.
| Dependency | Version | Needed when |
|---|---|---|
| Node.js | ≥ 18 | Running chivox-local-mcp |
| SoX | any | Streaming from a local microphone |
| macOS / Linux / Windows | latest 2 LTS | Every tested platform |
# macOS
brew install sox
# Ubuntu / Debian
sudo apt-get install sox
# Arch
sudo pacman -S sox

Quickstart
60 seconds, zero audio-engineering knowledge.
Zero-install path — any MCP-aware client can talk to our hosted endpoint over Streamable HTTP. Below is the Cursor config; others are covered under Clients.
{
"name": "chivox-speech-eval",
"type": "streamable-http",
"url": "https://mcp.cloud.chivox.com"
}

Need microphone streaming? Run the optional local proxy instead — see the Claude Desktop setup.
Your LLM now sees all 16 tools. Call one directly from an MCP-aware client:
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';
const mcp = new Client({ name: 'chivox', version: '1.0.0' });
await mcp.connect(new StreamableHTTPClientTransport(new URL('https://mcp.cloud.chivox.com')));
// Upload a file first, then evaluate by audioId:
const { audioId } = await fetch('https://mcp.cloud.chivox.com/upload', {
  method: 'POST',
  body: wavBuffer,
  headers: { 'Content-Type': 'audio/wav' },
}).then(r => r.json());
const res = await mcp.callTool({
  name: 'en_sentence_eval',
  arguments: { audioId, ref_text: 'I think therefore I am' },
});
const result = JSON.parse(res.content[0].text); // scoring JSON arrives as a text content item
console.log(result.overall); // 85
console.log(result.details[0].phone); // [{ phoneme: 'θ', score: 91 }, ...]

The payload is hand-shaped for LLM reasoning: short field names, flat arrays, bounded ranges. Your model can now generate per-user feedback, drills, and next-lesson plans without a second speech round-trip. See Secondary analysis with LLM.
Authentication
The hosted URL https://mcp.cloud.chivox.com is open for evaluation; for production traffic, every call is authenticated with an API key. The local proxy and framework integrations read these settings from environment variables.
| Parameter | Type | Required | Notes |
|---|---|---|---|
| MCP_REMOTE_URL | string | required | Remote endpoint, usually https://mcp.cloud.chivox.com. |
| MCP_API_KEY | string | optional | Bearer token when your deployment enforces auth. |
| MCP_TIMEOUT_MS | number | default 30000 | Per-request timeout from the proxy to upstream. |
Cursor
Settings → MCP → Add new MCP server. Cursor speaks Streamable HTTP directly — no binary to install.
{
"name": "chivox-speech-eval",
"type": "streamable-http",
"url": "https://mcp.cloud.chivox.com"
}

Claude Desktop
Claude Desktop talks to the local proxy over stdio, which in turn bridges to the hosted scoring engine — this path unlocks microphone streaming.
# Option A — global (recommended)
npm install -g chivox-local-mcp
# Option B — run via npx, no install
MCP_REMOTE_URL=https://mcp.cloud.chivox.com npx chivox-local-mcp

{
"mcpServers": {
"chivox": {
"command": "chivox-local-mcp",
"env": {
"MCP_REMOTE_URL": "https://mcp.cloud.chivox.com",
"MCP_API_KEY": "your-api-key"
}
}
}
}

Claude Code
One command — Claude Code will persist it under the user config.
claude mcp add chivox -- \
env MCP_REMOTE_URL=https://mcp.cloud.chivox.com \
chivox-local-mcp

Windsurf · Zed · Continue · Cline
All four accept the same Streamable HTTP JSON as Cursor — only the settings file path differs.
{
"mcpServers": {
"chivox-speech-eval": {
"type": "streamable-http",
"url": "https://mcp.cloud.chivox.com"
}
}
}

LangChain · Mastra · OpenAI Agents SDK
Point any MCP-aware framework adapter at our hosted URL; the framework drives the tool loop (discovery → tool_calls → execution → follow-up messages). All three share the same 16 tools.
# pip install langchain-mcp-adapters langgraph
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
client = MultiServerMCPClient({
"chivox": {
"transport": "streamable_http",
"url": "https://mcp.cloud.chivox.com",
}
})
tools = await client.get_tools()
agent = create_react_agent("openai:gpt-4o-mini", tools)
result = await agent.ainvoke({"messages":
[("user", "Score audioId=abc123, ref: I think therefore I am")]})
print(result["messages"][-1].content)

import { MCPClient } from '@mastra/mcp';
import { Agent } from '@mastra/core/agent';
import { openai } from '@ai-sdk/openai';
const mcp = new MCPClient({
servers: {
chivox: { url: new URL('https://mcp.cloud.chivox.com') },
},
});
export const coach = new Agent({
name: 'speech-coach',
instructions: 'Use Chivox tools to score speech and give feedback.',
model: openai('gpt-4o-mini'),
tools: await mcp.getTools(),
});

# pip install openai-agents
from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp
chivox = MCPServerStreamableHttp(
params={"url": "https://mcp.cloud.chivox.com"},
name="chivox-speech-eval",
)
async with chivox:
agent = Agent(
name="coach",
instructions="Professional speaking coach",
mcp_servers=[chivox],
)
r = await Runner.run(agent, "Score audioId=abc123")
print(r.final_output)

LlamaIndex, AutoGen, CrewAI, and Spring AI ship similar bridges — same URL, same 16 tools.
What the engine returns
Every scoring tool returns the same top-level shape, regardless of locale or task. The header block is three numbers you can ship straight to a UI; details[] is where the phoneme-level reasoning fuel lives.
{
"overall": 85,
"accuracy": 82,
"pron": 88,
"fluency": { "overall": 78, "speed": 65, "pause": 2 },
"integrity": 95,
"details": [
{
"char": "hello",
"score": 85,
"phone": [
{ "phoneme": "h", "score": 90, "dp_type": "normal" },
{ "phoneme": "ɛ", "score": 82, "dp_type": "normal" },
{ "phoneme": "l", "score": 88, "dp_type": "normal" },
{ "phoneme": "oʊ", "score": 80, "dp_type": "normal" }
]
}
]
}

See the complete list of fields and their valid ranges under API reference → Response fields.
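Because every tool shares this shape, one helper can surface trouble spots regardless of locale or task. A minimal sketch (field names follow the example above; the helper name and the 85-point threshold are ours):

```python
def weakest_phonemes(result: dict, threshold: int = 85) -> list[tuple[str, str, int]]:
    """Return (word, phoneme, score) triples below `threshold`, worst first."""
    hits = []
    for word in result.get("details", []):
        for ph in word.get("phone", []):
            if ph["score"] < threshold:
                hits.append((word["char"], ph["phoneme"], ph["score"]))
    return sorted(hits, key=lambda t: t[2])

result = {
    "overall": 85,
    "details": [{
        "char": "hello", "score": 85,
        "phone": [
            {"phoneme": "h", "score": 90, "dp_type": "normal"},
            {"phoneme": "ɛ", "score": 82, "dp_type": "normal"},
            {"phoneme": "l", "score": 88, "dp_type": "normal"},
            {"phoneme": "oʊ", "score": 80, "dp_type": "normal"},
        ],
    }],
}
print(weakest_phonemes(result))  # [('hello', 'oʊ', 80), ('hello', 'ɛ', 82)]
```

The same loop works unchanged on Mandarin results, since details[] keeps the same nesting there.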
Mandarin tonal scoring
Most open-source STT collapses on tone. Chivox runs a dedicated F₀ contour evaluator that understands four lexical tones + neutral + sandhi. Each syllable gets a tone_ref (expected) and tone_detected (produced), plus a score.
The engine knows the common sandhi rules (e.g. T3 + T3 → T2 + T3), so 你好 pronounced as (T2, T3) is marked normal, not mispronounced, even though the surface tone differs from the dictionary form.
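If you ever post-process tone_ref / tone_detected yourself, the same allowance is easy to reproduce client-side. A sketch covering just the third-tone rule, with tones encoded as integers per the Response fields table (the tones_match helper is ours, not part of the API):

```python
def tones_match(tone_ref: list[int], tone_detected: list[int]) -> list[bool]:
    """Per-syllable verdicts that accept the T3+T3 -> T2+T3 sandhi realization."""
    expected = list(tone_ref)
    # Apply third-tone sandhi left to right: a T3 before another T3 surfaces as T2.
    for i in range(len(expected) - 1):
        if expected[i] == 3 and expected[i + 1] == 3:
            expected[i] = 2
    # Either the dictionary tone or the sandhi surface tone counts as correct.
    return [d in (r, e) for r, e, d in zip(tone_ref, expected, tone_detected)]

# 你好: dictionary tones (3, 3); a learner producing (2, 3) is correct.
print(tones_match([3, 3], [2, 3]))  # [True, True]
print(tones_match([3, 3], [4, 3]))  # [False, True]
```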
If dp_type says normal, it’s correct, full stop — no “well, actually” templates required.

English phoneme scoring
Every word is broken down into IPA phonemes, each with a score and a dp_type of normal, mispron, or missing. The engine also returns a best-guess “what the user actually said instead” via phoneme_error — invaluable for drilling.
{ expected: "/θ/", actual: "/s/" }. Your LLM can generate a targeted tongue-twister on the spot — no second round-trip to the scorer.

Streaming vs file evaluation
Two evaluation modes cover every UX you are likely to build. Both return the same result shape; they differ only in how audio gets in.
Audio flows over WebSocket while the user speaks. No intermediate file, lowest latency — ideal for live tutoring UX. Supports interrupts, reconnects via resume_token, and intermediate frames while the user is still talking.
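In streaming mode you typically slice raw PCM into fixed-size frames before writing them to the WebSocket. A transport-agnostic sketch (the 100 ms frame size is our choice for illustration, not a protocol requirement):

```python
def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 100) -> list[bytes]:
    """Split 16-bit mono PCM into frame_ms chunks ready for the audio WebSocket."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000  # 2 bytes per 16-bit sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

one_second = bytes(16000 * 2)        # 1 s of silence at 16 kHz
frames = pcm_frames(one_second)
print(len(frames), len(frames[0]))   # 10 3200
```

Each frame then becomes one binary WebSocket message; keep frames flowing within the 60 s idle window or the session is reaped.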
POST a clip to the hosted /upload endpoint (local path, Base64, or URL), get an audioId back, then call any scoring tool with it. Perfect for batch jobs and async pipelines.
Build an AI pronunciation tutor
The canonical recipe: score → reason → drill → repeat. Chivox handles step 1; your LLM handles 2 and 3; step 4 just loops.
- Score. Record the user, call one of the English or Mandarin tools with the reference text.
- Reason. Hand the raw JSON to your LLM with a persona prompt (see Prompt templates).
- Drill. The LLM returns a short tongue-twister or a minimal-pair list. TTS it back to the user.
- Repeat. Score the drill, show progress vs. baseline, persist deltas into the student profile.
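With the scorer and the LLM behind callables, the four steps above collapse to a few lines. A sketch with stubbed dependencies (score_fn and drill_fn stand in for a Chivox tool call and your LLM; the target and round count are our illustrative defaults):

```python
def tutoring_loop(score_fn, drill_fn, ref_text: str, target: int = 90, max_rounds: int = 3):
    """Score -> reason/drill -> repeat until `target` overall or rounds exhausted."""
    history = []
    text = ref_text
    for _ in range(max_rounds):
        result = score_fn(text)      # step 1: Chivox scoring tool
        history.append(result["overall"])
        if result["overall"] >= target:
            break
        text = drill_fn(result)      # steps 2-3: LLM turns the JSON into the next drill
    return history                   # step 4: persist this as the progress curve

# Stub scorer that improves each round; stub LLM that always drills /θ/.
scores = iter([70, 82, 93])
history = tutoring_loop(lambda t: {"overall": next(scores)},
                        lambda r: "think thin thick",
                        "I think therefore I am")
print(history)  # [70, 82, 93]
```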
Secondary analysis with LLM
The payload is designed to be reasoning fuel, not a terminal score. Give your agent this prompt scaffold and it will outperform any hard-coded feedback template:
You receive a Chivox scoring payload.
1. Group mispronounced phonemes by type
(voiceless fricatives, rhotics, tones, etc.).
2. Pick the single most costly pattern
(biggest total score loss across all words).
3. Generate ONE minimal-pair drill that targets exactly that pattern.
4. Output JSON:
{
"diagnosis": "<= 40 words",
"drill": ["word1", "word2", ...],
"next_step": "read the drill three times, then retry the sentence"
}
Respond in the user's locale.

Long-term student profiling
Persist details[*].phone across sessions and you have everything you need for a longitudinal profile. Common derived signals:
- Pattern-over-time: rolling rate of dp_type === "mispron" per phoneme.
- Progress curve: 7-day moving average of overall on a fixed reference set.
- Spaced-repetition triggers: promote a phoneme out of drills after N consecutive normal passes.
- Tonal stability: Mandarin only — tone_detected variance over the last 20 utterances.
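The first signal reduces to a tiny aggregation once phone entries are persisted. A sketch (the flat record format and function name are ours; the storage layer is up to you):

```python
def mispron_rate(records: list[dict]) -> dict[str, float]:
    """Per-phoneme mispronunciation rate across persisted phone entries."""
    seen: dict[str, list[int]] = {}
    for ph in records:
        seen.setdefault(ph["phoneme"], []).append(1 if ph["dp_type"] == "mispron" else 0)
    return {p: sum(v) / len(v) for p, v in seen.items()}

records = [
    {"phoneme": "θ", "dp_type": "mispron"},
    {"phoneme": "θ", "dp_type": "normal"},
    {"phoneme": "l", "dp_type": "normal"},
]
print(mispron_rate(records))  # {'θ': 0.5, 'l': 0.0}
```

Restrict records to a sliding window (say, the last 50 utterances) to turn this into the rolling variant.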
Prompt templates
Mount these as the system message and pass the tool JSON in the user turn. They mirror the same five-part shape used by our hosted demo: persona / task / method / output format / tone.
You are a warm, experienced pronunciation coach.
Given a JSON scoring payload from Chivox:
- Highlight up to 3 phoneme-level issues, using IPA.
- For each, show what the learner said vs. what they should say,
and a one-sentence articulation cue (e.g. "touch your tongue tip
to the alveolar ridge for /t/").
- End with one sentence of encouragement.
Keep responses under 90 words. Do not repeat the raw scores verbatim.

You are a patient Mandarin pronunciation teacher.
Given a Chivox scoring JSON:
- Focus only on syllables where tone_detected disagrees with tone_ref.
- Show "what was heard" vs. "the correct tone", plus a one-sentence
  articulation cue (e.g. "the third tone falls first, then rises").
- If a sandhi rule (T3+T3 → T2+T3) is already covered by dp_type=normal,
  do not mention that syllable.
Respond in Simplified Chinese, in no more than 80 characters.

English tools · 10
Call any of these by name via MCP tools/call or via function-calling. Each returns the standard result shape (see Response fields).
| Tool name | Task | Notes |
|---|---|---|
| en_word_eval | Word scoring | Single-word pronunciation |
| en_word_correction | Word correction | Detect omissions, extras, wrong phones |
| en_vocab_eval | Vocabulary scoring | Multiple words in one clip |
| en_sentence_eval | Sentence scoring | Accuracy + fluency |
| en_sentence_correction | Sentence correction | Per-word feedback |
| en_paragraph_eval | Paragraph read-aloud | Long-passage quality |
| en_phonics_eval | Phonics scoring | Letter-to-sound rules |
| en_choice_eval | Oral multiple choice | Constrained answers |
| en_semi_open_eval | Semi-open prompt | Scenario speaking |
| en_realtime_eval | Realtime read-aloud | Streaming feedback |
Mandarin tools · 6
Mandarin-specific tools. All six honour sandhi rules and the neutral tone; cn_aitalk_eval additionally scores topic adherence and coherence for open-ended answers.
| Tool name | Task | Notes |
|---|---|---|
| cn_word_raw_eval | Character scoring | Hanzi pronunciation |
| cn_word_pinyin_eval | Pinyin scoring | Syllable-level with tone |
| cn_sentence_eval | Phrase scoring | Short utterances |
| cn_paragraph_eval | Paragraph scoring | Long text |
| cn_rec_eval | Constrained recognition | Pick-one answers |
| cn_aitalk_eval | AI Talk scoring | Open-ended dialog evaluation |
Response fields
The schema is stable across v1.x. Unknown future fields will be additive and documented in the Changelog.
| Field | Type | Notes |
|---|---|---|
| overall | number · 0–100 | Weighted roll-up. Safe for UI. |
| accuracy | number · 0–100 | Phoneme / syllable correctness. |
| pron | number · 0–100 | Articulation quality. |
| integrity | number · 0–100 | Did the user read every word? |
| fluency.overall | number · 0–100 | Pauses + speed + hesitations. |
| fluency.speed | number | Syllables or words per minute. |
| fluency.pause | number | Count of unexpected pauses. |
| details[i].phone[j].dp_type | "normal" · "mispron" · "missing" | Per-phoneme verdict. |
| details[i].phone[j].phoneme_error | { expected, actual } | Only present on English mispron. Ideal for drills. |
| details[i].tone_ref · tone_detected | 1–5 · neutral | Mandarin only. |
Error codes
HTTP status codes
| Status | Meaning |
|---|---|
| 400 | Malformed request (missing fields / wrong types). |
| 401 | Unauthenticated — missing Authorization header. |
| 403 | Forbidden — invalid key or no permission for the requested tool. |
| 404 | Session not found (wrong / reaped session_id). |
| 409 | Invalid state (e.g. audio sent after stop). |
Structured streaming error codes
WebSocket error frames and the error payload of get_stream_result share the same code set — dispatch client-side handling off these:
| Code | Meaning | Suggested action |
|---|---|---|
| SESSION_NOT_FOUND | Session does not exist | Recreate with start_stream_eval / create_stream_session |
| SESSION_EXPIRED | Session expired (idle > 60 s) | Recreate session |
| INVALID_STATE | Operation not allowed in current state | Check call order (audio after stop, etc.) |
| INVALID_PARAMS | Invalid parameters | Check core_type / ref_text / sample rate |
| RESUME_INVALID | resume_token invalid or expired | Recreate session — each resume issues a fresh token |
| AUDIO_TOO_LARGE | Audio payload exceeds 50 MB | Compress or segment before retry |
| UPSTREAM_CONNECT | Scoring engine unreachable | Retry with backoff; contact Chivox if persistent |
| UPSTREAM_TIMEOUT | Scoring engine timed out | Shorten audio / check network |
| UPSTREAM_EVAL_ERROR | Scoring engine returned an error | Inspect the message field |
| CAPACITY_FULL | Concurrent session quota exceeded | Back off and retry, or upgrade quota |
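The "suggested action" column maps cleanly onto three client-side strategies, which keeps retry logic out of every call site. A dispatch sketch (the strategy names and groupings are ours, derived from the table above):

```python
RECREATE = {"SESSION_NOT_FOUND", "SESSION_EXPIRED", "RESUME_INVALID"}
RETRYABLE = {"UPSTREAM_CONNECT", "UPSTREAM_TIMEOUT", "CAPACITY_FULL"}

def next_action(code: str) -> str:
    """Map a structured streaming error code onto a recovery strategy."""
    if code in RECREATE:
        return "recreate_session"
    if code in RETRYABLE:
        return "retry_with_backoff"
    # INVALID_STATE / INVALID_PARAMS / AUDIO_TOO_LARGE etc.: the request
    # itself must change, so blind retries would just loop.
    return "fail_fast"

print(next_action("SESSION_EXPIRED"))   # recreate_session
print(next_action("UPSTREAM_TIMEOUT"))  # retry_with_backoff
```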
Endpoints
All hosted under mcp.cloud.chivox.com. MCP and function-calling (cvx_fc) coexist — pick one per session, not per request.
| Path | Method | Purpose |
|---|---|---|
| /upload | POST | Audio upload (returns audioId for file evaluation). |
| /mcp | POST | MCP mode · JSON-RPC over Streamable HTTP. |
| /ws/audio/{session_id} | WS | MCP mode · streaming audio WebSocket. |
| /v1/functions | GET | Function-calling · list all scoring functions. |
| /v1/call | POST | Function-calling · one-shot eval / create stream session / fetch result. |
| /ws/eval/{session_id} | WS | Function-calling · streaming audio WebSocket. |
| /health | GET | Health check (no auth, safe for probes). |
Limits
These defaults describe technical guardrails on the hosted endpoint. Billing quotas (trial credits, concurrency tiers, call volume) are a separate dimension documented on the dashboard.
| Item | Default | Notes |
|---|---|---|
| Scratch storage | 500 MB | Temporary audio cache. |
| Queue depth | 10 | Pending scoring jobs per key. |
| Concurrency | 3 | Parallel scoring workers. |
| Audio TTL | 5 min | Expires if not scored. |
| Max upload | 50 MB | Per file. |
| Streaming idle | 60 s | Session drops after 60 s without audio. |
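A client-side pre-flight check against the upload cap saves a round-trip that would otherwise end in AUDIO_TOO_LARGE. A sketch (constants taken from the table above; the helper is ours):

```python
MAX_UPLOAD = 50 * 1024 * 1024   # bytes, per the limits table
AUDIO_TTL_S = 5 * 60            # score within this window after upload

def preflight(audio: bytes) -> None:
    """Reject payloads the hosted endpoint would bounce with AUDIO_TOO_LARGE."""
    if len(audio) > MAX_UPLOAD:
        raise ValueError(
            f"clip is {len(audio)} bytes; compress or segment below {MAX_UPLOAD}")

preflight(bytes(1024))  # a small clip passes silently
```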
Privacy & data handling
Audio is processed, scored, and dropped. We retain the resulting JSON for 30 days (for debugging and your own dashboard) and nothing else. Customers on enterprise plans can provision a region-locked tenant (US · EU · SG).
- No audio is ever used for model training.
- TLS 1.3 on every hop. Keys are hashed at rest.
- GDPR & CCPA DSRs honoured within 10 business days — email BD@chivox.com.
FAQ
- Is this just another wrapper around Whisper?
- Does it work offline / on-device?
- What about dialects?
- Can I use this in a browser?
- Which LLMs are known to work out of the box?
Changelog
- v1.0 — Current build. Includes the full 16-tool surface (Mandarin sandhi-aware tone verdicts, phoneme_error.actual for English mispronunciations, cn_aitalk_eval with a coherence field for open-ended tasks). Distributed to design partners on request.