Give your agent phoneme-level ears.
Chivox MCP returns a structured score matrix for every syllable, tone, stress, phoneme, and pause — in both Mandarin and English. These docs take you from npx to production in ten minutes.
Introduction
Chivox MCP is a Model-Context-Protocol server that turns any LLM into a linguistics-grade speech examiner. Point your agent at an audio clip and it gets back a structured score matrix — overall, accuracy, fluency, rhythm, plus syllable / word / phoneme detail for both English and tonal Mandarin.
The public catalog ships 16 tools today: 10 English tasks (word, word correction, vocabulary, sentence, sentence correction, paragraph, phonics, multiple choice, semi-open, realtime) and 6 Mandarin tasks (character, pinyin, sentence, paragraph, constrained recognition, AI-Talk). Every tool returns the same top-level shape, so switching locales or granularities costs you zero schema work.
If you’ve shipped OpenAI Realtime or Whisper, you already know what transcription buys you. Chivox MCP is the layer on top: how well did the user actually say it, down to individual phonemes and tones, with every field typed and documented.
Architecture
Chivox exposes two parallel front doors to the same scoring engine. Pick whichever matches your client runtime — the scoring result is byte-identical.
- MCP mode — JSON-RPC 2.0 over Streamable HTTP or stdio
  - Zero-code drop-in for any MCP-aware client — Cursor, Claude Desktop, Cline, LangChain, Mastra, Agents SDK.
  - Tool list auto-injected; new tools require no client change.
  - Optional local proxy chivox-local-mcp adds microphone streaming.
- Function-calling mode (aka cvx_fc) — OpenAI-style REST + WebSocket
  - No MCP SDK required — any HTTP/WS client works.
  - Built-in resume_token, intermediate results, backpressure frames.
  - Ships with iOS, Android, Flutter, mini-program, and legacy Java / PHP backends in mind.
The /ws/audio/{sid} and function-calling /ws/eval/{sid} endpoints live in separate session namespaces — a session_id from one won't work on the other.

Requirements
Anything that speaks MCP over Streamable HTTP or stdio. If you only need file-based scoring (no live mic), you don’t need a local proxy at all.
| Dependency | Version | Needed when |
|---|---|---|
| Node.js | ≥ 18 | Running chivox-local-mcp |
| SoX | any | Streaming from a local microphone |
| macOS / Linux / Windows | latest 2 LTS | Every tested platform |
# macOS
brew install sox
# Ubuntu / Debian
sudo apt-get install sox
# Arch
sudo pacman -S sox

Quickstart
60 seconds, zero audio-engineering knowledge.
Zero-install path — any MCP-aware client can talk to our hosted endpoint over Streamable HTTP. Below is the Cursor config; others are covered under Clients.
{
"name": "chivox-speech-eval",
"type": "streamable-http",
"url": "https://mcp.cloud.chivox.com"
}

Need microphone streaming? Run the optional local proxy instead — see the Claude Desktop setup.
Your LLM now sees all 16 tools. Call one directly from an MCP-aware client:
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';
const mcp = new Client({ name: 'chivox', version: '1.0.0' });
await mcp.connect(new StreamableHTTPClientTransport(new URL('https://mcp.cloud.chivox.com')));
// Upload a file first, then evaluate by audioId:
const { audioId } = await fetch('https://mcp.cloud.chivox.com/upload', {
  method: 'POST',
  body: wavBuffer,
  headers: { 'Content-Type': 'audio/wav' },
}).then(r => r.json());
const res = await mcp.callTool({
  name: 'en_sentence_eval',
  arguments: { audioId, ref_text: 'I think therefore I am' },
});
const result = JSON.parse(res.content[0].text); // scoring JSON arrives as a text content item
console.log(result.overall); // 85
console.log(result.details[0].phone); // [{ phoneme: 'θ', score: 91 }, ...]

The payload is hand-shaped for LLM reasoning: short field names, flat arrays, bounded ranges. Your model can now generate per-user feedback, drills, and next-lesson plans without a second speech round-trip. See Secondary analysis with LLM.
Authentication
The hosted URL https://mcp.cloud.chivox.com is open for evaluation; for production traffic, every call is authenticated with an API key. The local proxy and framework integrations read these settings from environment variables.
| Parameter | Type | Required | Notes |
|---|---|---|---|
| MCP_REMOTE_URL | string | required | Remote endpoint, usually https://mcp.cloud.chivox.com. |
| MCP_API_KEY | string | optional | Bearer token when your deployment enforces auth. |
| MCP_TIMEOUT_MS | number | default 30000 | Per-request timeout from the proxy to upstream. |
Cursor
Settings → MCP → Add new MCP server. Cursor speaks Streamable HTTP directly — no binary to install.
{
"name": "chivox-speech-eval",
"type": "streamable-http",
"url": "https://mcp.cloud.chivox.com"
}

Claude Desktop
Claude Desktop talks to the local proxy over stdio, which in turn bridges to the hosted scoring engine — this path unlocks microphone streaming.
# Option A — global (recommended)
npm install -g chivox-local-mcp
# Option B — run via npx, no install
MCP_REMOTE_URL=https://mcp.cloud.chivox.com npx chivox-local-mcp

{
"mcpServers": {
"chivox": {
"command": "chivox-local-mcp",
"env": {
"MCP_REMOTE_URL": "https://mcp.cloud.chivox.com",
"MCP_API_KEY": "your-api-key"
}
}
}
}

Claude Code
One command — Claude Code will persist it under the user config.
claude mcp add chivox -- \
env MCP_REMOTE_URL=https://mcp.cloud.chivox.com \
chivox-local-mcp

Windsurf · Zed · Continue · Cline
All four accept the same Streamable HTTP JSON as Cursor — only the settings file path differs.
{
"mcpServers": {
"chivox-speech-eval": {
"type": "streamable-http",
"url": "https://mcp.cloud.chivox.com"
}
}
}

LangChain · Mastra · OpenAI Agents SDK
Point any MCP-aware framework adapter at our hosted URL; the framework drives the tool loop (discovery → tool_calls → execution → follow-up messages). All three share the same 16 tools.
# pip install langchain-mcp-adapters langgraph
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
client = MultiServerMCPClient({
"chivox": {
"transport": "streamable_http",
"url": "https://mcp.cloud.chivox.com",
}
})
tools = await client.get_tools()
agent = create_react_agent("openai:gpt-4o-mini", tools)
result = await agent.ainvoke({"messages":
[("user", "Score audioId=abc123, ref: I think therefore I am")]})
print(result["messages"][-1].content)

import { MCPClient } from '@mastra/mcp';
import { Agent } from '@mastra/core/agent';
import { openai } from '@ai-sdk/openai';
const mcp = new MCPClient({
servers: {
chivox: { url: new URL('https://mcp.cloud.chivox.com') },
},
});
export const coach = new Agent({
name: 'speech-coach',
instructions: 'Use Chivox tools to score speech and give feedback.',
model: openai('gpt-4o-mini'),
tools: await mcp.getTools(),
});

# pip install openai-agents
from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp
chivox = MCPServerStreamableHttp(
params={"url": "https://mcp.cloud.chivox.com"},
name="chivox-speech-eval",
)
async with chivox:
agent = Agent(
name="coach",
instructions="Professional speaking coach",
mcp_servers=[chivox],
)
r = await Runner.run(agent, "Score audioId=abc123")
print(r.final_output)

LlamaIndex, AutoGen, CrewAI, and Spring AI ship similar bridges — same URL, same 16 tools.
What the engine returns
Every scoring tool returns the same top-level shape, regardless of locale or task. The header block is three numbers you can ship straight to a UI; details[] is where the phoneme-level reasoning fuel lives.
{
"overall": 85,
"accuracy": 82,
"pron": 88,
"fluency": { "overall": 78, "speed": 65, "pause": 2 },
"integrity": 95,
"details": [
{
"char": "hello",
"score": 85,
"phone": [
{ "phoneme": "h", "score": 90, "dp_type": "normal" },
{ "phoneme": "ɛ", "score": 82, "dp_type": "normal" },
{ "phoneme": "l", "score": 88, "dp_type": "normal" },
{ "phoneme": "oʊ", "score": 80, "dp_type": "normal" }
]
}
]
}

See the complete list of fields and their valid ranges under API reference → Response fields.
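Because every tool shares this shape, one helper can surface trouble spots regardless of locale or task. A minimal sketch (field names follow the example above; the helper name and the 85-point threshold are ours):

```python
def weakest_phonemes(result: dict, threshold: int = 85) -> list[tuple[str, str, int]]:
    """Return (word, phoneme, score) triples below `threshold`, worst first."""
    hits = []
    for word in result.get("details", []):
        for ph in word.get("phone", []):
            if ph["score"] < threshold:
                hits.append((word["char"], ph["phoneme"], ph["score"]))
    return sorted(hits, key=lambda t: t[2])

result = {
    "overall": 85,
    "details": [{
        "char": "hello", "score": 85,
        "phone": [
            {"phoneme": "h", "score": 90, "dp_type": "normal"},
            {"phoneme": "ɛ", "score": 82, "dp_type": "normal"},
            {"phoneme": "l", "score": 88, "dp_type": "normal"},
            {"phoneme": "oʊ", "score": 80, "dp_type": "normal"},
        ],
    }],
}
print(weakest_phonemes(result))  # [('hello', 'oʊ', 80), ('hello', 'ɛ', 82)]
```

The same loop works unchanged on Mandarin results, since details[] keeps the same nesting there.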
Mandarin tonal scoring
Most open-source STT collapses on tone. Chivox runs a dedicated F₀ contour evaluator that understands four lexical tones + neutral + sandhi. Each syllable gets a tone_ref (expected) and tone_detected (produced), plus a score.
The engine knows the common sandhi rules (e.g. T3 + T3 → T2 + T3), so 你好 pronounced as (T2, T3) is marked normal, not mispronounced, even though the surface tone differs from the dictionary form.
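If you ever post-process tone_ref / tone_detected yourself, the same allowance is easy to reproduce client-side. A sketch covering just the third-tone rule, with tones encoded as integers per the Response fields table (the tones_match helper is ours, not part of the API):

```python
def tones_match(tone_ref: list[int], tone_detected: list[int]) -> list[bool]:
    """Per-syllable verdicts that accept the T3+T3 -> T2+T3 sandhi realization."""
    expected = list(tone_ref)
    # Apply third-tone sandhi left to right: a T3 before another T3 surfaces as T2.
    for i in range(len(expected) - 1):
        if expected[i] == 3 and expected[i + 1] == 3:
            expected[i] = 2
    # Either the dictionary tone or the sandhi surface tone counts as correct.
    return [d in (r, e) for r, e, d in zip(tone_ref, expected, tone_detected)]

# 你好: dictionary tones (3, 3); a learner producing (2, 3) is correct.
print(tones_match([3, 3], [2, 3]))  # [True, True]
print(tones_match([3, 3], [4, 3]))  # [False, True]
```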
If dp_type says normal, it’s correct, full stop — no “well, actually” templates required.

English phoneme scoring
Every word is broken down into IPA phonemes, each with a score and a dp_type of normal, mispron, or missing. The engine also returns a best-guess “what the user actually said instead” via phoneme_error — invaluable for drilling.
{ expected: "/θ/", actual: "/s/" }. Your LLM can generate a targeted tongue-twister on the spot — no second round-trip to the scorer.

Streaming vs file evaluation
Two evaluation modes cover every UX you are likely to build. Both return the same result shape; they differ only in how audio gets in.
Audio flows over WebSocket while the user speaks. No intermediate file, lowest latency — ideal for live tutoring UX. Supports interrupts, reconnects via resume_token, and intermediate frames while the user is still talking.
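In streaming mode you typically slice raw PCM into fixed-size frames before writing them to the WebSocket. A transport-agnostic sketch (the 100 ms frame size is our choice for illustration, not a protocol requirement):

```python
def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 100) -> list[bytes]:
    """Split 16-bit mono PCM into frame_ms chunks ready for the audio WebSocket."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000  # 2 bytes per 16-bit sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]

one_second = bytes(16000 * 2)        # 1 s of silence at 16 kHz
frames = pcm_frames(one_second)
print(len(frames), len(frames[0]))   # 10 3200
```

Each frame then becomes one binary WebSocket message; keep frames flowing within the 60 s idle window or the session is reaped.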
POST a clip to the hosted /upload endpoint (local path, Base64, or URL), get an audioId back, then call any scoring tool with it. Perfect for batch jobs and async pipelines.
Build an AI pronunciation tutor
The canonical recipe: score → reason → drill → repeat. Chivox handles step 1; your LLM handles 2 and 3; step 4 just loops.
- Score. Record the user, call one of the English or Mandarin tools with the reference text.
- Reason. Hand the raw JSON to your LLM with a persona prompt (see Prompt templates).
- Drill. The LLM returns a short tongue-twister or a minimal-pair list. TTS it back to the user.
- Repeat. Score the drill, show progress vs. baseline, persist deltas into the student profile.
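With the scorer and the LLM behind callables, the four steps above collapse to a few lines. A sketch with stubbed dependencies (score_fn and drill_fn stand in for a Chivox tool call and your LLM; the target and round count are our illustrative defaults):

```python
def tutoring_loop(score_fn, drill_fn, ref_text: str, target: int = 90, max_rounds: int = 3):
    """Score -> reason/drill -> repeat until `target` overall or rounds exhausted."""
    history = []
    text = ref_text
    for _ in range(max_rounds):
        result = score_fn(text)      # step 1: Chivox scoring tool
        history.append(result["overall"])
        if result["overall"] >= target:
            break
        text = drill_fn(result)      # steps 2-3: LLM turns the JSON into the next drill
    return history                   # step 4: persist this as the progress curve

# Stub scorer that improves each round; stub LLM that always drills /θ/.
scores = iter([70, 82, 93])
history = tutoring_loop(lambda t: {"overall": next(scores)},
                        lambda r: "think thin thick",
                        "I think therefore I am")
print(history)  # [70, 82, 93]
```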
Secondary analysis with LLM
The payload is designed to be reasoning fuel, not a terminal score. Give your agent this prompt scaffold and it will outperform any hard-coded feedback template:
You receive a Chivox scoring payload.
1. Group mispronounced phonemes by type
(voiceless fricatives, rhotics, tones, etc.).
2. Pick the single most costly pattern
(biggest total score loss across all words).
3. Generate ONE minimal-pair drill that targets exactly that pattern.
4. Output JSON:
{
"diagnosis": "<= 40 words",
"drill": ["word1", "word2", ...],
"next_step": "read the drill three times, then retry the sentence"
}
Respond in the user's locale.

Long-term student profiling
Persist details[*].phone across sessions and you have everything you need for a longitudinal profile. Common derived signals:
- Pattern-over-time: rolling rate of dp_type === "mispron" per phoneme.
- Progress curve: 7-day moving average of overall on a fixed reference set.
- Spaced-repetition triggers: promote a phoneme out of drills after N consecutive normal passes.
- Tonal stability: Mandarin only — tone_detected variance over the last 20 utterances.
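The first signal reduces to a tiny aggregation once phone entries are persisted. A sketch (the flat record format and function name are ours; the storage layer is up to you):

```python
def mispron_rate(records: list[dict]) -> dict[str, float]:
    """Per-phoneme mispronunciation rate across persisted phone entries."""
    seen: dict[str, list[int]] = {}
    for ph in records:
        seen.setdefault(ph["phoneme"], []).append(1 if ph["dp_type"] == "mispron" else 0)
    return {p: sum(v) / len(v) for p, v in seen.items()}

records = [
    {"phoneme": "θ", "dp_type": "mispron"},
    {"phoneme": "θ", "dp_type": "normal"},
    {"phoneme": "l", "dp_type": "normal"},
]
print(mispron_rate(records))  # {'θ': 0.5, 'l': 0.0}
```

Restrict records to a sliding window (say, the last 50 utterances) to turn this into the rolling variant.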
Prompt templates
Mount these as the system message and pass the tool JSON in the user turn. They mirror the same five-part shape used by our hosted demo: persona / task / method / output format / tone.
You are a warm, experienced pronunciation coach.
Given a JSON scoring payload from Chivox:
- Highlight up to 3 phoneme-level issues, using IPA.
- For each, show what the learner said vs. what they should say,
and a one-sentence articulation cue (e.g. "touch your tongue tip
to the alveolar ridge for /t/").
- End with one sentence of encouragement.
Keep responses under 90 words. Do not repeat the raw scores verbatim.

You are a patient Mandarin pronunciation teacher.
Given a Chivox scoring JSON:
- Focus only on syllables where tone_detected disagrees with tone_ref.
- Show "what was heard" vs. "the correct tone", plus a one-sentence
  articulation cue (e.g. "the third tone falls first, then rises").
- If a sandhi rule (T3+T3 → T2+T3) is already covered by dp_type=normal,
  do not mention that syllable.
Respond in Simplified Chinese, in no more than 80 characters.

English tools · 10
Call any of these by name via MCP tools/call or via function-calling. Each returns the standard result shape (see Response fields).
| Tool name | Task | Notes |
|---|---|---|
| en_word_eval | Word scoring | Single-word pronunciation |
| en_word_correction | Word correction | Detect omissions, extras, wrong phones |
| en_vocab_eval | Vocabulary scoring | Multiple words in one clip |
| en_sentence_eval | Sentence scoring | Accuracy + fluency |
| en_sentence_correction | Sentence correction | Per-word feedback |
| en_paragraph_eval | Paragraph read-aloud | Long-passage quality |
| en_phonics_eval | Phonics scoring | Letter-to-sound rules |
| en_choice_eval | Oral multiple choice | Constrained answers |
| en_semi_open_eval | Semi-open prompt | Scenario speaking |
| en_realtime_eval | Realtime read-aloud | Streaming feedback |
Mandarin tools · 6
Mandarin-specific tools. All six honour sandhi rules and the neutral tone; cn_aitalk_eval additionally scores topic adherence and coherence for open-ended answers.
| Tool name | Task | Notes |
|---|---|---|
| cn_word_raw_eval | Character scoring | Hanzi pronunciation |
| cn_word_pinyin_eval | Pinyin scoring | Syllable-level with tone |
| cn_sentence_eval | Phrase scoring | Short utterances |
| cn_paragraph_eval | Paragraph scoring | Long text |
| cn_rec_eval | Constrained recognition | Pick-one answers |
| cn_aitalk_eval | AI Talk scoring | Open-ended dialog evaluation |
Response fields
The schema is stable across v1.x. Unknown future fields will be additive and documented in the Changelog.
| Field | Type | Notes |
|---|---|---|
| overall | number · 0–100 | Weighted roll-up. Safe for UI. |
| accuracy | number · 0–100 | Phoneme / syllable correctness. |
| pron | number · 0–100 | Articulation quality. |
| integrity | number · 0–100 | Did the user read every word? |
| fluency.overall | number · 0–100 | Pauses + speed + hesitations. |
| fluency.speed | number | Syllables or words per minute. |
| fluency.pause | number | Count of unexpected pauses. |
| details[i].phone[j].dp_type | "normal" · "mispron" · "missing" | Per-phoneme verdict. |
| details[i].phone[j].phoneme_error | { expected, actual } | Only present on English mispron. Ideal for drills. |
| details[i].tone_ref · tone_detected | 1–5 · neutral | Mandarin only. |
Error codes
HTTP status codes
| Status | Meaning |
|---|---|
| 400 | Malformed request (missing fields / wrong types). |
| 401 | Unauthenticated — missing Authorization header. |
| 403 | Forbidden — invalid key or no permission for the requested tool. |
| 404 | Session not found (wrong / reaped session_id). |
| 409 | Invalid state (e.g. audio sent after stop). |
Structured streaming error codes
WebSocket error frames and the error payload of get_stream_result share the same code set — dispatch client-side handling off these:
| Code | Meaning | Suggested action |
|---|---|---|
| SESSION_NOT_FOUND | Session does not exist | Recreate with start_stream_eval / create_stream_session |
| SESSION_EXPIRED | Session expired (idle > 60 s) | Recreate session |
| INVALID_STATE | Operation not allowed in current state | Check call order (audio after stop, etc.) |
| INVALID_PARAMS | Invalid parameters | Check core_type / ref_text / sample rate |
| RESUME_INVALID | resume_token invalid or expired | Recreate session — each resume issues a fresh token |
| AUDIO_TOO_LARGE | Audio payload exceeds 50 MB | Compress or segment before retry |
| UPSTREAM_CONNECT | Scoring engine unreachable | Retry with backoff; contact Chivox if persistent |
| UPSTREAM_TIMEOUT | Scoring engine timed out | Shorten audio / check network |
| UPSTREAM_EVAL_ERROR | Scoring engine returned an error | Inspect the message field |
| CAPACITY_FULL | Concurrent session quota exceeded | Back off and retry, or upgrade quota |
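The "suggested action" column maps cleanly onto three client-side strategies, which keeps retry logic out of every call site. A dispatch sketch (the strategy names and groupings are ours, derived from the table above):

```python
RECREATE = {"SESSION_NOT_FOUND", "SESSION_EXPIRED", "RESUME_INVALID"}
RETRYABLE = {"UPSTREAM_CONNECT", "UPSTREAM_TIMEOUT", "CAPACITY_FULL"}

def next_action(code: str) -> str:
    """Map a structured streaming error code onto a recovery strategy."""
    if code in RECREATE:
        return "recreate_session"
    if code in RETRYABLE:
        return "retry_with_backoff"
    # INVALID_STATE / INVALID_PARAMS / AUDIO_TOO_LARGE etc.: the request
    # itself must change, so blind retries would just loop.
    return "fail_fast"

print(next_action("SESSION_EXPIRED"))   # recreate_session
print(next_action("UPSTREAM_TIMEOUT"))  # retry_with_backoff
```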
Endpoints
All hosted under mcp.cloud.chivox.com. MCP and function-calling (cvx_fc) coexist — pick one per session, not per request.
| Path | Method | Purpose |
|---|---|---|
| /upload | POST | Audio upload (returns audioId for file evaluation). |
| /mcp | POST | MCP mode · JSON-RPC over Streamable HTTP. |
| /ws/audio/{session_id} | WS | MCP mode · streaming audio WebSocket. |
| /v1/functions | GET | Function-calling · list all scoring functions. |
| /v1/call | POST | Function-calling · one-shot eval / create stream session / fetch result. |
| /ws/eval/{session_id} | WS | Function-calling · streaming audio WebSocket. |
| /health | GET | Health check (no auth, safe for probes). |
Limits
These defaults describe technical guardrails on the hosted endpoint. Billing quotas (trial credits, concurrency tiers, call volume) are a separate dimension documented on the dashboard.
| Item | Default | Notes |
|---|---|---|
| Scratch storage | 500 MB | Temporary audio cache. |
| Queue depth | 10 | Pending scoring jobs per key. |
| Concurrency | 3 | Parallel scoring workers. |
| Audio TTL | 5 min | Expires if not scored. |
| Max upload | 50 MB | Per file. |
| Streaming idle | 60 s | Session drops after 60 s without audio. |
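A client-side pre-flight check against the upload cap saves a round-trip that would otherwise end in AUDIO_TOO_LARGE. A sketch (constants taken from the table above; the helper is ours):

```python
MAX_UPLOAD = 50 * 1024 * 1024   # bytes, per the limits table
AUDIO_TTL_S = 5 * 60            # score within this window after upload

def preflight(audio: bytes) -> None:
    """Reject payloads the hosted endpoint would bounce with AUDIO_TOO_LARGE."""
    if len(audio) > MAX_UPLOAD:
        raise ValueError(
            f"clip is {len(audio)} bytes; compress or segment below {MAX_UPLOAD}")

preflight(bytes(1024))  # a small clip passes silently
```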
Privacy & data handling
Audio is processed, scored, and dropped. We retain the resulting JSON for 30 days (for debugging and your own dashboard) and nothing else. Customers on enterprise plans can provision a region-locked tenant (US · EU · SG).
- No audio is ever used for model training.
- TLS 1.3 on every hop. Keys are hashed at rest.
- GDPR & CCPA DSRs honoured within 10 business days — email BD@chivox.com.
FAQ
- Is this just another wrapper around Whisper?
- Does it work offline / on-device?
- What about dialects?
- Can I use this in a browser?
- Which LLMs are known to work out of the box?
Changelog
- v1.0 — Current build. Includes the full 16-tool surface (Mandarin sandhi-aware tone verdicts, phoneme_error.actual for English mispronunciations, cn_aitalk_eval with a coherence field for open-ended tasks). Distributed to design partners on request.