Docs

Give your agent phoneme-level ears.

Chivox MCP returns a structured score matrix for every syllable, tone, stress, phoneme, and pause — in both Mandarin and English. These docs take you from npx to production in ten minutes.

/get-started

Introduction

Chivox MCP is a Model Context Protocol (MCP) server that turns any LLM into a linguistics-grade speech examiner. Point your agent at an audio clip and it gets back a structured score matrix — overall, accuracy, fluency, rhythm, plus syllable / word / phoneme detail for both English and tonal Mandarin.

The public catalog ships 16 tools today: 10 English tasks (word, word correction, vocabulary, sentence, sentence correction, paragraph, phonics, multiple choice, semi-open, realtime) and 6 Mandarin tasks (character, pinyin, sentence, paragraph, constrained recognition, AI-Talk). Every tool returns the same top-level shape, so switching locales or granularities costs you zero schema work.

If you’ve shipped OpenAI Realtime or Whisper, you already know what transcription buys you. Chivox MCP is the layer on top: how well did the user actually say it, down to individual phonemes and tones, with every field typed and documented.

What this isn’t
Not another STT. Not a wrapper around Whisper. Chivox runs a dedicated pronunciation-scoring engine, trained on exam-grade reference audio and battle-tested as the backbone of national-scale Chinese English exams for over a decade. MCP is just the doorway.

Architecture

Chivox exposes two parallel front doors to the same scoring engine. Pick whichever matches your client runtime — the scoring result is byte-identical.

MCP mode · recommended
  • JSON-RPC 2.0 over Streamable HTTP or stdio
  • Zero-code drop-in for any MCP-aware client — Cursor, Claude Desktop, Cline, LangChain, Mastra, Agents SDK.
  • Tool list auto-injected; new tools require no client change.
  • Optional local proxy chivox-local-mcp adds microphone streaming.
Function-calling mode · fallback
  • OpenAI-style REST + WebSocket (aka cvx_fc).
  • No MCP SDK required — any HTTP/WS client works.
  • Built-in resume_token, intermediate results, backpressure frames.
  • Designed with iOS, Android, Flutter, mini-program, and legacy Java / PHP backends in mind.
Pick exactly one per session
The MCP /ws/audio/{sid} and function-calling /ws/eval/{sid} endpoints live in separate session namespaces — a session_id from one won’t work on the other.

Requirements

Anything that speaks MCP over Streamable HTTP or stdio. If you only need file-based scoring (no live mic), you don’t need a local proxy at all.

| Dependency | Version | Needed when |
| --- | --- | --- |
| Node.js | ≥ 18 | Running chivox-local-mcp |
| SoX | any | Streaming from a local microphone |
| macOS / Linux / Windows | latest 2 LTS | Every tested platform |
bash·Install SoX (only for streaming capture)
# macOS
brew install sox

# Ubuntu / Debian
sudo apt-get install sox

# Arch
sudo pacman -S sox

Quickstart

60 seconds, zero audio-engineering knowledge.

1 · Wire up the server

Zero-install path — any MCP-aware client can talk to our hosted endpoint over Streamable HTTP. Below is the Cursor config; others are covered under Clients.

json·Cursor · Settings → MCP
{
  "name": "chivox-speech-eval",
  "type": "streamable-http",
  "url": "https://mcp.cloud.chivox.com"
}

Need microphone streaming? Run the optional local proxy instead — see the Claude Desktop setup.

2 · Call a scoring tool

Your LLM now sees all 16 tools. Call one directly from an MCP-aware client:

ts·score.ts · raw MCP SDK
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';

const mcp = new Client({ name: 'chivox', version: '1.0.0' });
await mcp.connect(new StreamableHTTPClientTransport(new URL('https://mcp.cloud.chivox.com')));

// Upload a file first, then evaluate by audioId:
const { audioId } = await fetch('https://mcp.cloud.chivox.com/upload', {
  method: 'POST',
  body: wavBuffer,
  headers: { 'Content-Type': 'audio/wav' },
}).then(r => r.json());

const res = await mcp.callTool({
  name: 'en_sentence_eval',
  arguments: { audioId, ref_text: 'I think therefore I am' },
});
const result = JSON.parse((res.content as { text: string }[])[0].text);

console.log(result.overall);              // 85
console.log(result.details[0].phone);     // [{ phoneme: 'θ', score: 91 }, ...]

3 · Feed the JSON back to your LLM

The payload is hand-shaped for LLM reasoning: short field names, flat arrays, bounded ranges. Your model can now generate per-user feedback, drills, and next-lesson plans without a second speech round-trip. See Secondary analysis with LLM.
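For instance, a small reducer (a hypothetical helper, assuming the result shape shown under Core concepts) can compress the payload before it enters the prompt:

```typescript
// Minimal sketch: condense a scoring payload into a prompt-sized summary
// before the LLM turn. Field names follow the documented result shape;
// the helper itself is illustrative, not a Chivox API.
type Phone = { phoneme: string; score: number; dp_type: string };

function summarizeForLlm(r: { overall: number; details: { char: string; phone: Phone[] }[] }): string {
  const issues = r.details.flatMap(d =>
    d.phone
      .filter(p => p.dp_type !== 'normal')           // keep only problem phonemes
      .map(p => `${d.char}: /${p.phoneme}/ ${p.dp_type} (score ${p.score})`)
  );
  return `overall=${r.overall}; issues: ${issues.join('; ') || 'none'}`;
}
```

The condensed string keeps the token budget small while preserving every signal the model needs for feedback.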

Authentication

The hosted URL https://mcp.cloud.chivox.com is open for evaluation; production traffic must authenticate every call with an API key. The local proxy and framework integrations read the key from environment variables.

| Parameter | Type | Required | Notes |
| --- | --- | --- | --- |
| MCP_REMOTE_URL | string | required | Remote endpoint, usually https://mcp.cloud.chivox.com. |
| MCP_API_KEY | string | optional | Bearer token when your deployment enforces auth. |
| MCP_TIMEOUT_MS | number | default 30000 | Per-request timeout for the proxy to upstream. |
Key rotation
Keys can be rotated at any time from the dashboard. Expect a few seconds of propagation to the edge. Old keys continue to work for a 5-minute grace window.

/clients

Cursor

Settings → MCP → Add new MCP server. Cursor speaks Streamable HTTP directly — no binary to install.

json·~/.cursor/mcp.json
{
  "name": "chivox-speech-eval",
  "type": "streamable-http",
  "url": "https://mcp.cloud.chivox.com"
}
No microphone over HTTP
Cursor’s MCP transport doesn’t expose mic capture. If you need live streaming assessment, wire up Claude Desktop with the local proxy instead.

Claude Desktop

Claude Desktop talks to the local proxy over stdio, which in turn bridges to the hosted scoring engine — this path unlocks microphone streaming.

bash·Install the local proxy
# Option A — global (recommended)
npm install -g chivox-local-mcp

# Option B — run via npx, no install
MCP_REMOTE_URL=https://mcp.cloud.chivox.com npx chivox-local-mcp
json·~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "chivox": {
      "command": "chivox-local-mcp",
      "env": {
        "MCP_REMOTE_URL": "https://mcp.cloud.chivox.com",
        "MCP_API_KEY": "your-api-key"
      }
    }
  }
}

Claude Code

One command — Claude Code will persist it under the user config.

bash·Claude Code CLI
claude mcp add chivox \
  -e MCP_REMOTE_URL=https://mcp.cloud.chivox.com \
  -- chivox-local-mcp

Windsurf · Zed · Continue · Cline

All four accept the same Streamable HTTP JSON as Cursor — only the settings file path differs.

json·mcp.json
{
  "mcpServers": {
    "chivox-speech-eval": {
      "type": "streamable-http",
      "url": "https://mcp.cloud.chivox.com"
    }
  }
}

LangChain · Mastra · OpenAI Agents SDK

Point any MCP-aware framework adapter at our hosted URL; the framework drives the tool loop (discovery → tool_calls → execution → follow-up messages). All three share the same 16 tools.

python·LangGraph · langchain-mcp-adapters
# pip install langchain-mcp-adapters langgraph
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

client = MultiServerMCPClient({
    "chivox": {
        "transport": "streamable_http",
        "url": "https://mcp.cloud.chivox.com",
    }
})
tools = await client.get_tools()

agent = create_react_agent("openai:gpt-4o-mini", tools)
result = await agent.ainvoke({"messages":
    [("user", "Score audioId=abc123, ref: I think therefore I am")]})
print(result["messages"][-1].content)
typescript·Mastra
import { MCPClient } from '@mastra/mcp';
import { Agent } from '@mastra/core/agent';
import { openai } from '@ai-sdk/openai';

const mcp = new MCPClient({
  servers: {
    chivox: { url: new URL('https://mcp.cloud.chivox.com') },
  },
});

export const coach = new Agent({
  name: 'speech-coach',
  instructions: 'Use Chivox tools to score speech and give feedback.',
  model: openai('gpt-4o-mini'),
  tools: await mcp.getTools(),
});
python·OpenAI Agents SDK
# pip install openai-agents
from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp

chivox = MCPServerStreamableHttp(
    params={"url": "https://mcp.cloud.chivox.com"},
    name="chivox-speech-eval",
)

async with chivox:
    agent = Agent(
        name="coach",
        instructions="Professional speaking coach",
        mcp_servers=[chivox],
    )
    r = await Runner.run(agent, "Score audioId=abc123")
    print(r.final_output)

LlamaIndex, AutoGen, CrewAI, and Spring AI ship similar bridges — same URL, same 16 tools.


/core-concepts

What the engine returns

Every scoring tool returns the same top-level shape, regardless of locale or task. The header block is three numbers you can ship straight to a UI; details[] is where the phoneme-level reasoning fuel lives.

json·result.json · shape at a glance
{
  "overall":   85,
  "accuracy":  82,
  "pron":      88,
  "fluency":   { "overall": 78, "speed": 65, "pause": 2 },
  "integrity": 95,
  "details": [
    {
      "char":  "hello",
      "score": 85,
      "phone": [
        { "phoneme": "h",  "score": 90, "dp_type": "normal"  },
        { "phoneme": "ɛ",  "score": 82, "dp_type": "normal"  },
        { "phoneme": "l",  "score": 88, "dp_type": "normal"  },
        { "phoneme": "oʊ", "score": 80, "dp_type": "normal"  }
      ]
    }
  ]
}

See the complete list of fields and their valid ranges under API reference → Response fields.
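As a sketch, the shape above maps onto TypeScript types like these (derived from the example payload; illustrative only, not an official SDK export):

```typescript
// The documented result shape as TypeScript types, plus a tiny runtime guard.
type DpType = 'normal' | 'mispron' | 'missing';

interface PhoneScore {
  phoneme: string;
  score: number;          // 0–100
  dp_type: DpType;
}

interface WordDetail {
  char: string;           // word (English) or character (Mandarin)
  score: number;
  phone: PhoneScore[];
}

interface EvalResult {
  overall: number;        // 0–100, safe to ship straight to a UI
  accuracy: number;
  pron: number;
  integrity: number;
  fluency: { overall: number; speed: number; pause: number };
  details: WordDetail[];
}

// Cheap structural check before trusting a parsed payload.
function isEvalResult(x: unknown): x is EvalResult {
  const r = x as EvalResult;
  return typeof r?.overall === 'number' && Array.isArray(r?.details);
}
```

Because every tool shares this shape, one set of types covers all 16 tools.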

Mandarin tonal scoring

Most open-source STT collapses on tone. Chivox runs a dedicated F₀ contour evaluator that understands four lexical tones + neutral + sandhi. Each syllable gets a tone_ref (expected) and tone_detected (produced), plus a score.

The engine knows the common sandhi rules — e.g. T3 + T3 → T2 + T3 — so “你好” pronounced as (T2, T3) is marked normal, not mispronounced, even though the surface tone differs from the dictionary form.

Why this matters for agents
Sandhi-aware verdicts mean your LLM never has to second-guess a legitimate surface change. If dp_type says normal, it’s correct, full stop — no “well actually” templates required.
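The rule itself is simple enough to mirror client-side. A minimal sketch, assuming tone_ref and tone_detected arrive as arrays of tone numbers — the engine already applies this server-side, so this is purely illustrative:

```typescript
// Mirrors the one sandhi rule named above (T3 + T3 → T2 + T3): a detected
// surface T2 is legitimate when this and the next reference tone are both T3.
function sandhiOk(toneRef: number[], toneDetected: number[]): boolean {
  return toneRef.every((ref, i) => {
    const got = toneDetected[i];
    if (got === ref) return true;                      // exact match
    return ref === 3 && got === 2 && toneRef[i + 1] === 3;  // third-tone sandhi
  });
}
```

In practice you should trust dp_type rather than re-deriving verdicts; the sketch only shows why (T2, T3) against a dictionary (T3, T3) is not an error.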

English phoneme scoring

Every word is broken down into IPA phonemes, each with a score and a dp_type of normal, mispron, or missing. The engine also returns a best-guess “what the user actually said instead” via phoneme_error — invaluable for drilling.

Actually useful example
A user reading “think” as /sɪŋk/ comes back as { expected: "/θ/", actual: "/s/" }. Your LLM can generate a targeted tongue-twister on the spot — no second round-trip to the scorer.
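A sketch of that drill generation, where MINIMAL_PAIRS is a hypothetical lookup you would curate yourself; only the { expected, actual } shape comes from the documented phoneme_error field:

```typescript
// Map a confusion pair (expected phoneme vs. what was actually said) to a
// curated minimal-pair drill. The table contents are illustrative.
const MINIMAL_PAIRS: Record<string, string[]> = {
  '/θ/|/s/': ['think/sink', 'thick/sick', 'path/pass'],
};

function drillFor(err: { expected: string; actual: string }): string[] {
  return MINIMAL_PAIRS[`${err.expected}|${err.actual}`] ?? [];
}
```

An LLM can generate pairs on the fly instead; a static table just makes the common confusions deterministic.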

Streaming vs file evaluation

Two evaluation modes cover every UX you are likely to build. Both return the same result shape; they differ only in how audio gets in.

Streaming microphone

Audio flows over WebSocket while the user speaks. No intermediate file, lowest latency — ideal for live tutoring UX. Supports interrupts, reconnects via resume_token, and intermediate frames while the user is still talking.

File upload

POST a clip to the hosted /upload endpoint (local path, Base64, or URL), get an audioId back, then call any scoring tool with it. Perfect for batch jobs and async pipelines.


/guides

Build an AI pronunciation tutor

The canonical recipe: score → reason → drill → repeat. Chivox handles step 1; your LLM handles 2 and 3; step 4 just loops.

  1. Score. Record the user, call one of the English or Mandarin tools with the reference text.
  2. Reason. Hand the raw JSON to your LLM with a persona prompt (see Prompt templates).
  3. Drill. The LLM returns a short tongue-twister or a minimal-pair list. TTS it back to the user.
  4. Repeat. Score the drill, show progress vs. baseline, persist deltas into the student profile.
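The four steps above can be sketched as a loop skeleton. All four callbacks are placeholders you implement against your own audio, LLM, and TTS stack — none of the names below are Chivox APIs:

```typescript
// Skeleton of the score → reason → drill → repeat loop.
interface TutorDeps {
  score(audio: Uint8Array, refText: string): Promise<{ overall: number }>;  // step 1
  reason(payload: unknown): Promise<{ drill: string[] }>;                   // step 2
  speak(text: string): Promise<void>;                                      // step 3 (TTS)
  record(): Promise<Uint8Array>;
}

async function tutorLoop(deps: TutorDeps, refText: string, target = 90): Promise<number> {
  let audio = await deps.record();
  let result = await deps.score(audio, refText);
  while (result.overall < target) {
    const plan = await deps.reason(result);       // LLM turns JSON into a drill
    await deps.speak(plan.drill.join(', '));      // play the drill back
    audio = await deps.record();
    result = await deps.score(audio, refText);    // re-score, loop until target
  }
  return result.overall;
}
```

Persisting each iteration's result is what feeds the long-term profiling described under Guides.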
Want to see it live?
The full loop is implemented on the interactive demo; crack open the devtools network tab and you'll see every call unredacted.

Secondary analysis with LLM

The payload is designed to be reasoning fuel, not a terminal score. Give your agent this prompt scaffold and it will outperform any hard-coded feedback template:

markdown·prompt.md
You receive a Chivox scoring payload.

1. Group mispronounced phonemes by type
   (voiceless fricatives, rhotics, tones, etc.).
2. Pick the single most costly pattern
   (biggest total score loss across all words).
3. Generate ONE minimal-pair drill that targets exactly that pattern.
4. Output JSON:
   {
     "diagnosis": "<= 40 words",
     "drill":     ["word1", "word2", ...],
     "next_step": "read the drill three times, then retry the sentence"
   }

Respond in the user's locale.
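Wiring the scaffold into a chat request is mechanical. A minimal sketch using the common OpenAI-style message shape, where PROMPT abbreviates the scaffold above:

```typescript
// Mount the scaffold as the system message; pass the raw tool JSON plus the
// user's locale in the user turn.
const PROMPT = `You receive a Chivox scoring payload. Group mispronounced phonemes, pick the costliest pattern, return one minimal-pair drill as JSON. Respond in the user's locale.`;

function buildMessages(payload: unknown, locale: string) {
  return [
    { role: 'system' as const, content: PROMPT },
    { role: 'user' as const, content: `locale=${locale}\n${JSON.stringify(payload)}` },
  ];
}
```

Keeping the payload in the user turn (not the system prompt) lets you reuse one cached system message across every student.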

Long-term student profiling

Persist details[*].phone across sessions and you have everything you need for a longitudinal profile. Common derived signals:

  • Pattern-over-time: rolling rate of dp_type === "mispron" per phoneme.
  • Progress curve: 7-day moving average of overall on a fixed reference set.
  • Spaced-repetition triggers: promote a phoneme out of drills after N consecutive normal passes.
  • Tonal stability: Mandarin only — tone_detected variance over the last 20 utterances.
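The first signal above can be sketched as a pure function over persisted phone observations. Field names follow the documented payload; storage and windowing are up to you:

```typescript
// Rolling mispron rate per phoneme across persisted details[*].phone entries.
interface PhoneObs { phoneme: string; dp_type: 'normal' | 'mispron' | 'missing' }

function mispronRate(history: PhoneObs[]): Map<string, number> {
  const seen = new Map<string, { total: number; bad: number }>();
  for (const p of history) {
    const s = seen.get(p.phoneme) ?? { total: 0, bad: 0 };
    s.total++;
    if (p.dp_type === 'mispron') s.bad++;
    seen.set(p.phoneme, s);
  }
  // Convert counts to a per-phoneme error rate in [0, 1].
  return new Map([...seen].map(([k, s]) => [k, s.bad / s.total]));
}
```

Feed the last N sessions in and you get the per-phoneme trend line the spaced-repetition trigger needs.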

Prompt templates

Mount these as the system message and pass the tool JSON in the user turn. They mirror the same five-part shape used by our hosted demo: persona / task / method / output format / tone.

markdown·system · pronunciation coach (English)
You are a warm, experienced pronunciation coach.

Given a JSON scoring payload from Chivox:
- Highlight up to 3 phoneme-level issues, using IPA.
- For each, show what the learner said vs. what they should say,
  and a one-sentence articulation cue (e.g. "touch your tongue tip
  to the alveolar ridge for /t/").
- End with one sentence of encouragement.

Keep responses under 90 words. Do not repeat the raw scores verbatim.
markdown·system · Mandarin tone coach
你是一位耐心的普通话口语老师。

给定一段 Chivox 打分 JSON:
- 只关注 tone_detected 与 tone_ref 不一致的音节。
- 提供"实际听到" vs "正确声调",附 1 句发音提示
  (例如:"第三声要先降后升")。
- 若触发变调规则 (T3+T3 → T2+T3) 已由 dp_type=normal 覆盖,
  则不要提该音节。

用简体中文回答,不超过 80 字。

/api-reference

English tools · 10

Call any of these by name via MCP tools/call or via function-calling. Each returns the standard result shape (see Response fields).

| Tool name | Task | Notes |
| --- | --- | --- |
| en_word_eval | Word scoring | Single-word pronunciation |
| en_word_correction | Word correction | Detect omissions, extras, wrong phones |
| en_vocab_eval | Vocabulary scoring | Multiple words in one clip |
| en_sentence_eval | Sentence scoring | Accuracy + fluency |
| en_sentence_correction | Sentence correction | Per-word feedback |
| en_paragraph_eval | Paragraph read-aloud | Long-passage quality |
| en_phonics_eval | Phonics scoring | Letter-to-sound rules |
| en_choice_eval | Oral multiple choice | Constrained answers |
| en_semi_open_eval | Semi-open prompt | Scenario speaking |
| en_realtime_eval | Realtime read-aloud | Streaming feedback |

Mandarin tools · 6

Mandarin-specific tools. All six honour sandhi rules and the neutral tone; cn_aitalk_eval additionally scores topic adherence and coherence for open-ended answers.

| Tool name | Task | Notes |
| --- | --- | --- |
| cn_word_raw_eval | Character scoring | Hanzi pronunciation |
| cn_word_pinyin_eval | Pinyin scoring | Syllable-level with tone |
| cn_sentence_eval | Phrase scoring | Short utterances |
| cn_paragraph_eval | Paragraph scoring | Long text |
| cn_rec_eval | Constrained recognition | Pick-one answers |
| cn_aitalk_eval | AI Talk scoring | Open-ended dialog evaluation |

Response fields

The schema is stable across v1.x. Unknown future fields will be additive and documented in the Changelog.

| Field | Type | Notes |
| --- | --- | --- |
| overall | number · 0–100 | Weighted roll-up. Safe for UI. |
| accuracy | number · 0–100 | Phoneme / syllable correctness. |
| pron | number · 0–100 | Articulation quality. |
| integrity | number · 0–100 | Did the user read every word? |
| fluency.overall | number · 0–100 | Pauses + speed + hesitations. |
| fluency.speed | number | Syllables or words per minute. |
| fluency.pause | number | Count of unexpected pauses. |
| details[i].phone[j].dp_type | "normal" · "mispron" · "missing" | Per-phoneme verdict. |
| details[i].phone[j].phoneme_error | { expected, actual } | Only present on English mispron. Ideal for drills. |
| details[i].tone_ref · tone_detected | 1–5 · neutral | Mandarin only. |

Error codes

HTTP status codes

| Status | Meaning |
| --- | --- |
| 400 | Malformed request (missing fields / wrong types). |
| 401 | Unauthenticated — missing Authorization header. |
| 403 | Forbidden — invalid key or no permission for the requested tool. |
| 404 | Session not found (wrong / reaped session_id). |
| 409 | Invalid state (e.g. audio sent after stop). |

Structured streaming error codes

WebSocket error frames and the error payload of get_stream_result share the same code set — dispatch client-side handling off these:

| Code | Meaning | Suggested action |
| --- | --- | --- |
| SESSION_NOT_FOUND | Session does not exist | Recreate with start_stream_eval / create_stream_session |
| SESSION_EXPIRED | Session expired (idle > 60 s) | Recreate session |
| INVALID_STATE | Operation not allowed in current state | Check call order (audio after stop, etc.) |
| INVALID_PARAMS | Invalid parameters | Check core_type / ref_text / sample rate |
| RESUME_INVALID | resume_token invalid or expired | Recreate session — each resume issues a fresh token |
| AUDIO_TOO_LARGE | Audio payload exceeds 50 MB | Compress or segment before retry |
| UPSTREAM_CONNECT | Scoring engine unreachable | Retry with backoff; contact Chivox if persistent |
| UPSTREAM_TIMEOUT | Scoring engine timed out | Shorten audio / check network |
| UPSTREAM_EVAL_ERROR | Scoring engine returned an error | Inspect the message field |
| CAPACITY_FULL | Concurrent session quota exceeded | Back off and retry, or upgrade quota |
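As a sketch, client-side dispatch over this code set might bucket the suggested actions like so. The three bucket names are our own, not part of the protocol:

```typescript
// Bucket each documented error code into a client-side handling strategy,
// following the "suggested action" column above.
type Handling = 'recreate-session' | 'retry-backoff' | 'fail-fast';

function handlingFor(code: string): Handling {
  switch (code) {
    case 'SESSION_NOT_FOUND':
    case 'SESSION_EXPIRED':
    case 'RESUME_INVALID':
      return 'recreate-session';
    case 'UPSTREAM_CONNECT':
    case 'UPSTREAM_TIMEOUT':
    case 'CAPACITY_FULL':
      return 'retry-backoff';
    default: // INVALID_STATE, INVALID_PARAMS, AUDIO_TOO_LARGE, UPSTREAM_EVAL_ERROR
      return 'fail-fast';
  }
}
```

Because WebSocket error frames and get_stream_result share the code set, one dispatcher covers both paths.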

/operations

Endpoints

All hosted under mcp.cloud.chivox.com. MCP and function-calling (cvx_fc) coexist — pick one per session, not per request.

| Path | Method | Purpose |
| --- | --- | --- |
| /upload | POST | Audio upload (returns audioId for file evaluation). |
| /mcp | POST | MCP mode · JSON-RPC over Streamable HTTP. |
| /ws/audio/{session_id} | WS | MCP mode · streaming audio WebSocket. |
| /v1/functions | GET | Function-calling · list all scoring functions. |
| /v1/call | POST | Function-calling · one-shot eval / create stream session / fetch result. |
| /ws/eval/{session_id} | WS | Function-calling · streaming audio WebSocket. |
| /health | GET | Health check (no auth, safe for probes). |

Limits

These defaults describe technical guardrails on the hosted endpoint. Billing quotas (trial credits, concurrency tiers, call volume) are a separate dimension documented on the dashboard.

| Item | Default | Notes |
| --- | --- | --- |
| Scratch storage | 500 MB | Temporary audio cache. |
| Queue depth | 10 | Pending scoring jobs per key. |
| Concurrency | 3 | Parallel scoring workers. |
| Audio TTL | 5 min | Expires if not scored. |
| Max upload | 50 MB | Per file. |
| Streaming idle | 60 s | Session drops after 60 s without audio. |

Privacy & data handling

Audio is processed, scored, and dropped. We retain the resulting JSON for 30 days (for debugging and your own dashboard) and nothing else. Customers on enterprise plans can provision a region-locked tenant (US · EU · SG).

  • No audio is ever used for model training.
  • TLS 1.3 on every hop. Keys are hashed at rest.
  • GDPR & CCPA DSRs honoured within 10 business days — email BD@chivox.com.

FAQ

Is this just another wrapper around Whisper?
No. Whisper transcribes; Chivox scores. The engine is trained on tens of millions of exam-graded samples and has been the evaluation backbone for national-scale Chinese English exams for over a decade.
Does it work offline / on-device?
The MCP server needs an outbound call to our scoring engine. For air-gapped use, talk to us — we ship an on-prem container for enterprise customers.
What about dialects?
Mandarin scoring targets standard Pǔtōnghuà. English supports en-US, en-GB, and en-AU rubrics; select via locale parameters on the relevant tools.
Can I use this in a browser?
For quick demos, yes — but hosted traffic should always flow through a trusted backend so your API key isn’t exposed. The browser can upload to your backend, which forwards to /upload.
Which LLMs are known to work out of the box?
Any model that supports OpenAI-style function calling: GPT-4o / 5.x, Claude Sonnet / Opus, Gemini, DeepSeek, GLM, Kimi, Doubao, Qwen. Tool schemas are forwarded verbatim — no per-vendor adapters needed.

Changelog

Pre-launch
The service is in invite-only access and ships from local builds — there is no public version cadence yet. Once we cut a public release, versioned entries will land here.
  • v1.0 — Current build. Includes the full 16-tool surface (Mandarin sandhi-aware tone verdicts, phoneme_error.actual for English mispronunciations, cn_aitalk_eval with a coherence field for open-ended tasks). Distributed to design partners on request.