Performance is the silent killer of MCP deployments. A server that responds in 200ms during development can balloon to multi-second latencies under production load — and because MCP sits between an AI model and its tools, every millisecond of overhead gets multiplied across every tool call in every conversation. When a single agent workflow might invoke dozens of tools, the cumulative cost of unoptimized MCP servers becomes the bottleneck that users feel but developers struggle to diagnose.

The MCP ecosystem in 2026 has matured enough that dedicated benchmarking tools, observability integrations, and proven optimization patterns exist. Transport choices alone can mean order-of-magnitude differences in throughput. Token-efficient response formatting can cut costs by nearly a third. And a known memory leak in the Python SDK’s stateless HTTP mode caught teams off guard in production until patches landed.

This guide covers everything needed to measure, profile, and optimize MCP server performance — from benchmarking tools and transport comparisons to OpenTelemetry instrumentation, load testing strategies, and production-scale patterns. Our analysis draws on official SDK documentation, GitHub issue discussions, published benchmarks, and engineering blog posts — we research and analyze rather than running these benchmarks ourselves. Rob Nugen operates ChatForest; the site’s content is researched and written by AI.

MCP Performance at a Glance

Aspect Key Finding
Fastest local transport stdio — zero network overhead, sub-millisecond framing
Fastest remote transport Streamable HTTP — replaced SSE as of spec 2025-03-26
Benchmarking tool MCP-Bench — dedicated MCP server benchmarking framework
Observability FastMCP native OpenTelemetry — zero-config span instrumentation
Token efficiency CSV formatting saves ~29% tokens vs JSON (14 tokens/row)
Known memory leak Python SDK #1076 — stateless HTTP session cleanup (patched)
Pagination Cursor-based opaque tokens — server-determined page size
Batch requests Dropped from spec 2025-06-18 — one request per HTTP call
Production profiling OpenTelemetry spans: tool_invocation_duration_ms, p50/p95/p99
Response caching mcp-cache wrapper — automatic chunking, TTL, full-text indexing

Published Benchmark Results

Before diving into tools and techniques, it helps to know what performance looks like in practice. Two significant benchmark studies from the MCP ecosystem provide concrete numbers.

TM Dev Lab Multi-Language Benchmarks

TM Dev Lab ran the most comprehensive published MCP benchmarks: 39.9 million requests across 15 implementations using k6 with 50 virtual users and 5-minute sustained loads.

v2 benchmark results (I/O-bound workloads, Streamable HTTP):

Language/Framework RPS Avg Latency P95 Latency RAM Error Rate
Rust (rmcp 0.17.0) 4,845 5.09 ms 10.99 ms 11 MB 0%
Java (Quarkus JVM) 4,739 4.04 ms 8.13 ms 195 MB 0%
Go (mcp-go) 3,616 6.87 ms 17.62 ms 24 MB 0%
Java (Spring MVC) 3,540 6.13 ms 13.71 ms 368 MB 0%
Bun (4 workers) 876 48.46 ms 98.50 ms 541 MB 0%
Node.js (4 workers) 423 123.50 ms 200.07 ms 389 MB 0%
Python (FastMCP + uvloop) 259 251.62 ms 342.41 ms 259 MB 0%

Key findings from these benchmarks:

  • Go achieves 92.6 RPS per MB of memory — 12.8x better memory efficiency than Java
  • 0% error rate across all 39.9 million requests — coefficient of variation below 2% for all implementations
  • Python’s bottleneck is CPython, not the ASGI server — replacing uvicorn with Granian (Rust-based) actually caused a 12% regression; 4 workers + uvloop represents the practical ceiling
  • GraalVM native image trade-offs — consistently 20-36% lower throughput but 27-81% less memory (Quarkus-native: 3,449 RPS / 36 MB)

The rmcp bug that revealed a 3.77x gain: Rust’s rmcp SDK v0.16 hardcoded SSE content type for all HTTP responses, creating a 40ms latency floor. Fixing this via json_response: true (PR #683, rmcp v0.17.0) improved throughput from 1,283 to 4,845 RPS — a 3.77x gain from a single configuration flag.

Stacklok Transport and Session Benchmarks

Stacklok’s Kubernetes testing revealed that session management, not network or computation, is the dominant bottleneck:

Configuration Transport Concurrency RPS Avg Latency Success Rate
Shared session pool (10) Streamable HTTP 1,000 293 3.09s 100%
Unique session per request Streamable HTTP 1,000 33-36 varies varies
stdio stdio 20 concurrent N/A N/A 4% (2/50)
SSE SSE 50 RPS sustained ~62/s 565ms 62% (1,861/3,000)

The 10x throughput difference between shared and unique sessions is the single most important finding for production MCP deployments. stdio is architecturally single-client — only 2 of 50 requests succeeded at 20 concurrent connections.

Infobip Production Load Test

Infobip’s production testing measured real-world MCP performance:

  • MCP call duration: avg 127.03ms (min 70.44ms, max 340.11ms)
  • Success rate: 100% across 1,317 calls
  • Throughput: ~17 MCP requests/second
  • Critical insight: one MCP call generates approximately 3 HTTP requests, so measuring only HTTP throughput is misleading

Their load testing revealed timeout misconfigurations on load balancers, instance hopping (fixed with sticky sessions), and broken SSE connections under peak load — issues that only surface at scale.

MCP-Universe Academic Benchmark

The MCP-Universe benchmark (OpenReview/arXiv) tested 20+ LLMs across 6 domains and 11 MCP servers. Even GPT-5-High achieved only 44.16% success rate; Grok-4 reached 33.33%; Claude 4.0 Sonnet 29.44%. This measures end-to-end task completion rather than raw server throughput, but highlights that MCP performance involves both server speed and LLM reasoning quality.

Benchmarking Tools for MCP Servers

Grafana xk6-mcp

Grafana’s xk6-mcp (v0.0.3, July 2025) is an official k6 extension purpose-built for MCP load testing. It supports all three transports (stdio, SSE, Streamable HTTP) and automatically collects RED metrics:

  • mcp_request_duration (ms, trend) — tagged by MCP method
  • mcp_request_count (counter) — total calls
  • mcp_request_errors (counter) — failures

It provides MCP-aware functions: listAllTools(), callTool(), ping(), readResource(), getPrompt() with automatic pagination handling. Status is experimental/unsupported. Install via the xk6 build system.

Infobip xk6-infobip-mcp

Infobip’s k6 extension was built from production experience and addresses a key insight: one MCP call generates ~3 HTTP requests, so standard HTTP metrics are misleading. It tracks MCP call count, success rate %, error rate %, and standard HTTP metrics alongside each other, giving accurate per-MCP-call performance data.

pytest-mcp

pytest-mcp (v0.1.0, July 2025) is a pytest-style framework for evaluating MCP servers. It automatically collects latency, token usage, cost, and tool call metrics. Assertions include:

  • contains(), tool_was_called() — functional correctness
  • cost_under(), number_of_steps_under() — performance budgets
  • objective_succeeded() — LLM-verified task completion
  • plan_is_efficient() — workflow optimization

It generates tool coverage reports and supports CLI baseline test generation, making it suitable for CI/CD performance regression testing. Requires Python 3.10+.

MCP Inspector

The official MCP Inspector (part of the @modelcontextprotocol/inspector package) includes timing instrumentation that, while not a dedicated benchmarking tool, provides useful baseline measurements:

  • Round-trip time for each tool call
  • Connection establishment time
  • Message serialization/deserialization timing
  • Visual timeline of request/response pairs

Inspector is best used for development-time profiling rather than load testing, but it provides a quick way to identify obviously slow operations before reaching for heavier tools.

Custom Benchmarking with MCP SDKs

For teams needing precise control over benchmarking scenarios, both the TypeScript and Python MCP SDKs expose the client interfaces needed to build custom harnesses:

Python approach:

import asyncio
import time
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def benchmark_tool_call(server_params, tool_name, iterations=100):
    """Benchmark a specific tool call over multiple iterations."""
    timings = []

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            for _ in range(iterations):
                start = time.perf_counter()
                await session.call_tool(tool_name, arguments={})
                elapsed = time.perf_counter() - start
                timings.append(elapsed * 1000)  # Convert to ms

    timings.sort()
    return {
        "min": timings[0],
        "max": timings[-1],
        "mean": sum(timings) / len(timings),
        "p50": timings[len(timings) // 2],
        "p95": timings[int(len(timings) * 0.95)],
        "p99": timings[int(len(timings) * 0.99)],
    }

TypeScript approach:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function benchmarkToolCall(
  command: string,
  toolName: string,
  iterations = 100
): Promise<Record<string, number>> {
  const transport = new StdioClientTransport({ command, args: [] });
  const client = new Client({ name: "benchmark", version: "1.0.0" });
  await client.connect(transport);

  const timings: number[] = [];

  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await client.callTool({ name: toolName, arguments: {} });
    timings.push(performance.now() - start);
  }

  timings.sort((a, b) => a - b);
  return {
    min: timings[0],
    max: timings[timings.length - 1],
    mean: timings.reduce((a, b) => a + b, 0) / timings.length,
    p50: timings[Math.floor(timings.length / 2)],
    p95: timings[Math.floor(timings.length * 0.95)],
    p99: timings[Math.floor(timings.length * 0.99)],
  };
}

These harnesses can be extended with concurrent client connections using Promise.all() or asyncio.gather() to simulate production-like load.

Transport Performance Comparison

Transport choice is the single biggest architectural decision affecting MCP performance. The three transports — stdio, SSE (deprecated), and Streamable HTTP — have fundamentally different performance characteristics.

stdio Transport

stdio communicates over process stdin/stdout pipes with newline-delimited JSON-RPC messages. Performance characteristics:

  • Zero network overhead — no TCP handshake, no HTTP headers, no TLS negotiation
  • Sub-millisecond framing — messages are delimited by newlines, parsed immediately
  • Pipe buffering — OS pipe buffers (typically 64KB on Linux, 16KB on macOS) can introduce micro-batching behavior
  • Single client only — one client per server process, no connection multiplexing
  • No authentication overhead — communication is process-local
  • Memory efficient — no HTTP connection state, no session management

stdio is the fastest transport for local integrations where the client and server run on the same machine. Tools like Claude Desktop and Claude Code default to stdio for local MCP servers precisely because of this performance advantage.

When to use: Local development, IDE integrations, single-user desktop applications, any scenario where the MCP server runs as a child process of the client.

SSE Transport (Deprecated)

Server-Sent Events transport was deprecated as of MCP spec version 2025-03-26 in favor of Streamable HTTP. Key performance issues that drove the deprecation:

  • Unidirectional streaming — SSE only supports server-to-client streaming; client-to-server messages required a separate HTTP POST endpoint, adding complexity and latency
  • Connection persistence — long-lived SSE connections consume server resources even when idle
  • Proxy challenges — many corporate proxies, CDNs, and load balancers handle SSE connections poorly, leading to unexpected disconnections
  • No multiplexing — each SSE connection is a single HTTP/1.1 stream

SDKs still support SSE for backward compatibility, but new deployments should use Streamable HTTP.

Streamable HTTP Transport

Streamable HTTP (introduced in spec 2025-03-26) is the current recommended remote transport. Performance characteristics:

  • Standard HTTP semantics — works with existing HTTP infrastructure (load balancers, CDNs, proxies)
  • Session management — optional Mcp-Session-Id header for stateful sessions; stateless mode available for simpler deployments
  • Connection pooling — benefits from HTTP/2 multiplexing and standard connection pooling
  • Server-Sent Events upgrade — responses can optionally upgrade to SSE for streaming, combining the benefits of both approaches
  • Request/response model — each tool call is a standard HTTP POST, making it easy to measure, cache, and load balance

Streamable HTTP vs stdio overhead: The HTTP transport adds per-request overhead from TCP/TLS handshake (mitigated by keep-alive), HTTP header parsing, and potential session lookup. For high-frequency tool calls on the same machine, this overhead is measurable — potentially 5-20ms per call depending on infrastructure. For remote servers, the network latency dominates and transport overhead becomes negligible.

Transport Performance Summary

Factor stdio SSE (deprecated) Streamable HTTP
Latency per call Sub-ms framing ~565ms avg at 50 RPS 3-7ms with session pooling
Throughput Limited by pipe buffer ~62 RPS (degrading) 293+ RPS with shared sessions
Concurrent clients 1 per process (4% success at 20) Many (62% success at 50 RPS) 1,000+ (100% success)
Load balancing N/A Difficult (sticky sessions) Standard HTTP LB
Proxy compatibility N/A Poor Excellent
Authentication Process-level HTTP headers HTTP headers, OAuth 2.1
Best for Local, single-user Legacy deployments Remote, multi-user, production

These numbers come from Stacklok’s Kubernetes benchmarks with real concurrent load. The session pooling finding is critical: a shared pool of just 10 sessions handled 1,000 concurrent connections at 293 RPS, while unique-session-per-request configurations topped out at 33-36 RPS — a 10x difference from architecture alone.

Profiling MCP Servers with OpenTelemetry

FastMCP Native Instrumentation

FastMCP (v3.1.1 as of March 2026) provides native OpenTelemetry instrumentation with zero configuration required. The design philosophy emphasizes efficiency: without an OpenTelemetry SDK installed, all instrumentation operations are no-ops — adding no measurable overhead to production servers that don’t need tracing.

Enabling telemetry:

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument --service_name my-fastmcp-server fastmcp run server.py

Environment variable configuration is also supported:

export OTEL_SERVICE_NAME=my-fastmcp-server
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

Span architecture:

FastMCP creates spans for all MCP operations with standardized naming:

Span Type Example Name
Tool execution tools/call get_weather
Resource operations resources/read config://database
Prompt rendering prompts/get greeting
Server delegation delegate {server_name}

For mounted (composed) servers, delegate spans show handoffs to child servers, enabling end-to-end tracing across multi-server architectures.

Span attributes:

FastMCP instruments spans with three categories of attributes:

RPC conventions:

  • rpc.system: "mcp"
  • rpc.service: Server name
  • rpc.method: MCP protocol method

MCP conventions:

  • mcp.method.name: Operation type (e.g., tools/call)
  • mcp.session.id: Connection identifier
  • mcp.resource.uri: Resource location (for resource operations)

FastMCP custom attributes (fastmcp.* prefix):

  • fastmcp.server.name: Server instance name
  • fastmcp.component.type: tool, resource, prompt, or resource_template
  • fastmcp.component.key: Full component identifier
  • fastmcp.provider.type: Provider class name (LocalProvider, ProxyProvider, etc.)

Authentication attributes:

  • enduser.id: Client identifier from access token
  • enduser.scope: OAuth scopes

Error handling:

Errors automatically set span status to ERROR with exception events including full stack traces. This means production failures surface immediately in your tracing backend without requiring custom error instrumentation.

Key Metrics to Track

Based on observability patterns documented across the MCP ecosystem, these are the critical metrics for MCP server performance monitoring:

Performance metrics (histogram type):

  • tool_invocation_duration_ms — tracks p50/p95/p99 latencies per tool, the single most important metric
  • mcp_request_duration_ms — total request round-trip including transport overhead
  • tool_execution_time_ms — time spent in tool handler logic (excluding MCP protocol overhead)

Throughput metrics (counter type):

  • tool_invocation_total — total invocations per tool, identifies hotspots
  • tool_invocation_errors_total — error count with error_type label
  • mcp_sessions_active — current active session count

Resource metrics:

  • tool_token_usage_total — custom counter for cost tracking
  • cpu_utilization_percent — server CPU usage
  • memory_usage_bytes — server memory footprint (critical for detecting leaks)
  • mcp_connection_count — active transport connections

Derived metrics:

  • Time-to-first-tool-call — from connection establishment through initialization to first tool response, measures the cold-start experience
  • Tool call success ratetool_invocation_total - tool_invocation_errors_total / tool_invocation_total
  • Session duration — how long clients stay connected, informs connection pooling decisions

Programmatic OpenTelemetry Setup

For servers not using FastMCP, or when custom spans are needed:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure before importing MCP server code
provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Use in tool handlers
tracer = trace.get_tracer("mcp-server")

@server.tool()
async def query_database(sql: str) -> str:
    with tracer.start_as_current_span("db_query") as span:
        span.set_attribute("db.statement", sql)
        result = await db.execute(sql)
        span.set_attribute("db.row_count", len(result))
        return format_results(result)

Local Development Tracing Tools

For development-time profiling without setting up a full observability stack:

  • otel-desktop-viewer — lightweight single-binary tool with a web UI on port 8000 and OTLP receiver on port 4317
  • Jaeger — Docker deployment with web UI at localhost:16686, supports distributed tracing across multi-server setups

Load Testing MCP Servers

Why MCP Load Testing Requires a Different Approach

As Infobip documented, MCP load testing differs fundamentally from traditional HTTP testing:

  1. Bursty, inference-driven traffic — MCP traffic has idle gaps followed by sudden spikes (driven by LLM reasoning), not steady-state load
  2. Stateful sessions — each client needs its own Mcp-Session-Id, and session creation is the dominant bottleneck (not computation)
  3. No batch requests — the MCP spec (as of 2025-06-18) dropped JSON-RPC batch support, so each tool call requires a separate request
  4. Multi-request calls — one MCP tool call generates ~3 HTTP requests (initialize, tool call, response handling), so HTTP metrics undercount actual load
  5. Reliability over raw latency — failed calls waste tokens and trigger expensive retries (full history re-appends); response times under 400ms are acceptable within conversational cycles
  6. Tool schema discovery — clients call tools/list at connection time; each tool definition consumes 500-1,000 tokens

Load Testing Approaches

Approach 1: Direct MCP client simulation

Use the MCP SDK clients to create realistic load:

import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def simulate_client(server_url: str, tool_name: str, num_calls: int):
    """Simulate a single MCP client making repeated tool calls."""
    async with streamablehttp_client(server_url) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()

            for _ in range(num_calls):
                await session.call_tool(tool_name, arguments={"query": "test"})

async def load_test(server_url: str, concurrent_clients: int, calls_per_client: int):
    """Run concurrent MCP clients against a server."""
    tasks = [
        simulate_client(server_url, "search", calls_per_client)
        for _ in range(concurrent_clients)
    ]

    start = time.perf_counter()
    await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start

    total_calls = concurrent_clients * calls_per_client
    print(f"Throughput: {total_calls / elapsed:.1f} calls/sec")
    print(f"Avg latency: {elapsed / total_calls * 1000:.1f} ms/call")

Approach 2: k6 with MCP extensions (recommended for Streamable HTTP)

Rather than writing raw HTTP/JSON-RPC scripts, use the dedicated k6 MCP extensions. Grafana’s xk6-mcp provides MCP-aware functions with automatic RED metrics, while Infobip’s xk6-infobip-mcp tracks the ~3 HTTP requests per MCP call accurately.

For custom k6 scripts without extensions, you need to handle the JSON-RPC protocol layer directly:

// k6 script for MCP Streamable HTTP load testing
import http from 'k6/http';
import { check } from 'k6';

const SERVER_URL = 'http://localhost:3000/mcp';

export const options = {
  vus: 50,
  duration: '60s',
};

export default function () {
  // Initialize session
  const initRes = http.post(SERVER_URL, JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'initialize',
    params: {
      protocolVersion: '2025-03-26',
      capabilities: {},
      clientInfo: { name: 'k6-bench', version: '1.0.0' }
    }
  }), { headers: { 'Content-Type': 'application/json' } });

  const sessionId = initRes.headers['Mcp-Session-Id'];

  // Call tool
  const toolRes = http.post(SERVER_URL, JSON.stringify({
    jsonrpc: '2.0',
    id: 2,
    method: 'tools/call',
    params: { name: 'search', arguments: { query: 'test' } }
  }), {
    headers: {
      'Content-Type': 'application/json',
      'Mcp-Session-Id': sessionId
    }
  });

  check(toolRes, {
    'status is 200': (r) => r.status === 200,
    'no error': (r) => !JSON.parse(r.body).error,
  });
}

Approach 3: pytest-benchmark integration

For CI/CD performance regression testing:

import pytest
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

@pytest.fixture
async def mcp_session():
    server_params = StdioServerParameters(
        command="python", args=["-m", "my_mcp_server"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            yield session

def test_tool_call_performance(benchmark, mcp_session):
    """Benchmark tool call latency."""
    async def call_tool():
        return await mcp_session.call_tool("search", arguments={"q": "test"})

    result = benchmark.pedantic(
        lambda: asyncio.get_event_loop().run_until_complete(call_tool()),
        iterations=50,
        rounds=5,
    )

    # Assert performance SLA
    assert benchmark.stats["mean"] < 0.100  # Under 100ms mean
    assert benchmark.stats["max"] < 0.500   # Under 500ms worst case

Common Performance Bottlenecks

JSON-RPC Serialization and Token Overhead

Every MCP message is a JSON-RPC 2.0 object, and the protocol is text-only with no binary option. The token cost compounds quickly:

  • Tool definitions consume 500-1,000 tokens each — a server with 15-20 tools burns 10,000-15,000 tokens just on schema descriptions before any work begins
  • MCP prompt-to-completion token inflationresearch across 9 LLM models found MCP causes 2x-30x token inflation compared to baseline chat, with average prompt tokens per task ranging from 46,044 to 875,887
  • Plain text tool definitions use ~80% fewer tokens than JSON — eliminating braces, brackets, quotes, and property names dramatically reduces schema overhead
  • Python json module — the stdlib encoder/decoder is pure Python for complex objects; switching to orjson or ujson can yield 3-10x speedups for large payloads

Zero-copy JSON-RPC was investigated and rejected — the Rust SDK explored simd-json + borrowed strings for zero-copy parsing. Initial proof-of-concept showed 23-35% gains for large payloads, but real implementation benchmarks revealed serde_json was actually 18-33% faster. The gains came from zero-copy string allocation, not SIMD. Complexity wasn’t justified.

Mitigation: Keep tool responses concise. Use plain text descriptions instead of verbose JSON schemas where possible. Return identifiers and summaries rather than full objects. Use resource URIs for large artifacts that clients can fetch on demand. Consider Anthropic’s defer_loading to cut initial context from ~108K tokens to ~5K (95% reduction) when working with large tool catalogs.

Session Management Overhead — The 10x Bottleneck

Session management is the single largest performance lever for Streamable HTTP servers. Stacklok’s testing showed shared session pools (10 sessions) achieved 293 RPS vs 33-36 RPS with unique sessions per request — a 10x difference:

  • Session creation — the dominant bottleneck, not network or computation; each new session requires initialization handshake and capability negotiation
  • Session lookup — each request requires looking up the session by Mcp-Session-Id header (hash map lookup, fast but non-zero)
  • Session state — maintaining capability negotiation state, subscription lists, and client metadata per session
  • Session cleanup — garbage collecting expired sessions to prevent memory leaks

The Python SDK memory leak (issue #1076) demonstrated what happens when session cleanup fails. In SDK version 1.10.1, continuous tool calls via stateless HTTP caused memory to grow monotonically until OOM because server sessions weren’t properly closed after handling stateless requests. Two fixes were merged:

  • PR #1116 — closes server sessions after handling stateless requests
  • PR #2145 — fixes stateless HTTP task accumulation causing memory leak

Lesson: If running the Python MCP SDK in stateless HTTP mode, ensure you’re on a version that includes these fixes. Monitor memory_usage_bytes as a standard production metric.

Tool Schema Validation

When a client calls tools/list, the server serializes its entire tool catalog with JSON Schema descriptions. For servers with many tools, this can be expensive:

  • Schema size — each tool’s inputSchema can be hundreds of bytes; a server with 50+ tools sends a significant payload on every connection
  • Validation overhead — some SDK versions validate tool arguments against the JSON Schema on every call

Mitigation: Use defer_loading (available in the Claude API MCP Connector and some SDK configurations) to lazily load tool schemas only when needed. The cache_tools_list option in the OpenAI Agents SDK caches the tools/list response to avoid repeated discovery calls.

Transport-Specific Bottlenecks

stdio:

  • Pipe buffer saturation — if a tool returns more data than the OS pipe buffer (64KB Linux / 16KB macOS), writes block until the client reads
  • No backpressure — a fast server can overwhelm a slow client since there’s no flow control

Streamable HTTP:

  • TLS handshake — adds ~1-3 round trips on first connection (mitigated by keep-alive and HTTP/2)
  • HTTP header overhead — ~200-500 bytes per request, significant for small tool responses
  • Session lookup — adds a hash map lookup per request

Token Efficiency and Response Optimization

MCP server responses are ultimately consumed by language models, so token efficiency directly affects both cost and context window utilization. Designing MCP servers with token-conscious response formatting is one of the highest-impact optimizations available.

CSV vs JSON: The 29% Savings

Research from the MCP ecosystem demonstrates that CSV-formatted responses use significantly fewer tokens than equivalent JSON:

  • 5 rows of data: CSV used 166 tokens vs 235 for JSON — a 29% savings
  • Per-row savings: approximately 14 tokens per row
  • Scaling: savings scale linearly with result set size

For a tool returning 100 rows of tabular data, that’s roughly 1,400 tokens saved per call — which at typical API pricing adds up quickly across thousands of daily invocations.

Axiom’s Wide Schema Optimization Principles

Axiom’s engineering blog documented their approach to designing MCP servers for wide schemas and large result sets, offering principles that apply broadly:

  • Start small, expand on demand — return minimal data by default, let clients request more detail
  • Field selection — allow tools to accept a fields parameter so clients request only the columns they need
  • Summarize first — return aggregated summaries rather than raw rows; let follow-up tool calls drill into specifics
  • Truncate intelligently — when result sets exceed a token budget, truncate with a clear “N more results available” indicator rather than silently dropping data

Response Formatting Best Practices

Do:

  • Use CSV or TSV for tabular data when the schema is simple
  • Return counts and summaries before raw data
  • Include pagination cursors for large result sets
  • Use resource URIs for large artifacts (files, images, logs)
  • Compress verbose keys — name instead of full_name_of_entity

Don’t:

  • Return entire database tables in a single tool response
  • Include metadata the model doesn’t need (internal IDs, timestamps, audit fields)
  • Nest JSON deeply — flat structures tokenize more efficiently
  • Repeat schema information in every row of a result set

The mcp-cache Wrapper

The mcp-cache project provides a universal response management wrapper for MCP servers:

  • Automatic chunking — breaks large responses into token-budget-friendly chunks
  • Configurable token limits — set via MCP_CACHE_MAX_TOKENS environment variable
  • Cache TTL — defaults to 1 hour, configurable per server
  • Full-text indexing — cached responses are indexed for subsequent queries
  • Compression — reduces memory footprint of cached responses

This pattern is particularly useful for MCP servers wrapping database queries or API calls that return large result sets — the cache layer handles pagination and chunking automatically.

Pagination Strategies

Cursor-Based Pagination in MCP

The MCP specification defines pagination using opaque cursor-based tokens rather than numbered pages:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "resources/list",
  "params": {
    "cursor": "eyJwIjo1MCwibCI6NTB9"
  }
}

Key design decisions in the MCP pagination spec:

  • Opaque cursors — clients must not parse or construct cursors; they’re server-defined tokens encoding position state
  • Server-determined page size — clients cannot request a specific page size; the server decides based on its own performance constraints
  • No total count — the spec doesn’t require servers to report total result count, avoiding expensive COUNT(*) queries
  • Applies to list operationsresources/list, tools/list, prompts/list, and resources/templates/list all support pagination

Performance Implications of Pagination

Without pagination, a server with 500 resources sends the entire catalog on every resources/list call. With JSON Schema descriptions, this can be 50-100KB of JSON — expensive to serialize, transmit, and tokenize.

With pagination, the server can:

  • Return 20-50 items per page
  • Spread the serialization cost across multiple requests
  • Allow clients to stop pagination early if they find what they need
  • Maintain constant memory usage regardless of total catalog size

Implementing Efficient Pagination

from mcp.server import Server
from mcp.types import Resource, ListResourcesResult
import base64
import json

@server.list_resources()
async def list_resources(cursor: str | None = None) -> ListResourcesResult:
    PAGE_SIZE = 25
    offset = 0

    if cursor:
        offset = json.loads(base64.b64decode(cursor))["offset"]

    # Fetch one extra to determine if there's a next page
    items = await db.fetch_resources(offset=offset, limit=PAGE_SIZE + 1)

    has_more = len(items) > PAGE_SIZE
    items = items[:PAGE_SIZE]

    next_cursor = None
    if has_more:
        next_cursor = base64.b64encode(
            json.dumps({"offset": offset + PAGE_SIZE}).encode()
        ).decode()

    return ListResourcesResult(
        resources=[Resource(uri=item.uri, name=item.name) for item in items],
        nextCursor=next_cursor,
    )

Memory Leak Detection and Prevention

The Python SDK Stateless HTTP Leak

The most prominent MCP-specific memory issue was Python SDK issue #1076, where stateless HTTP mode caused progressive memory growth:

  • Affected version: SDK 1.10.1 and earlier
  • Trigger: Continuous tool calls via Streamable HTTP in stateless mode
  • Root cause: Server sessions weren’t closed after handling stateless requests; task objects accumulated without cleanup
  • Symptom: Memory usage grew monotonically until OOM
  • Fix: PR #1116 (close sessions after stateless requests) and PR #2145 (fix task accumulation)

Detecting Memory Leaks in MCP Servers

Python servers:

  • Memray (by Bloomberg) — tracks every allocation in Python code and compiled extensions, with flamegraph visualization. Run with: memray run -o output.bin server.py, then analyze with memray flamegraph output.bin
  • Memalot MCP Server — a specialized MCP server that performs on-demand memory analysis of running Python processes by PID, useful for CI/CD integration and production monitoring
  • tracemalloc (stdlib) — built-in Python allocation tracker, useful for quick checks: tracemalloc.start() then tracemalloc.get_traced_memory() to see current and peak usage

TypeScript servers:

  • --inspect flag — Node.js’s built-in inspector with Chrome DevTools heap profiling
  • process.memoryUsage() — quick programmatic check for heapUsed growth over time
  • clinic.js — automated performance profiling for Node.js applications

Prevention Patterns

  1. Monitor session counts — track mcp_sessions_active and alert if it grows unboundedly
  2. Set session TTLs — expire idle sessions after a configurable timeout (5-30 minutes depending on use case)
  3. Use stateless mode deliberately — if your server doesn’t need session state, use stateless HTTP mode (post-fix) for simpler resource management
  4. Load test before production — run sustained load tests (not just burst tests) to detect slow leaks that only manifest over hours
  5. Profile regularly — take heap snapshots at intervals during load tests and compare object retention

Production-Scale Performance Patterns

Connection Pooling and Reuse

For remote MCP servers, connection establishment is expensive. Production patterns include:

  • HTTP/2 multiplexing — a single TCP connection carries multiple concurrent MCP requests, eliminating per-request handshake overhead
  • Connection pooling — the MCP gateway pattern maintains warm connection pools to backend MCP servers
  • Keep-alive — ensure HTTP keep-alive is enabled (it is by default in modern clients) to amortize TLS handshake cost across requests
  • Pre-warming — initialize MCP connections at application startup rather than on first request

Gateway and Load Balancing Patterns

For MCP servers at scale, gateway patterns provide critical performance features:

  • MCP gateways — centralized request routing, rate limiting, and caching across multiple backend MCP servers
  • Session-aware load balancing — route requests with the same Mcp-Session-Id to the same backend server (sticky sessions)
  • Stateless backends — design MCP servers to be stateless where possible, enabling simple round-robin load balancing
  • Circuit breakers — prevent cascading failures when a backend MCP server becomes slow or unresponsive

Rate Limiting and Backpressure

MCP servers should implement rate limiting to prevent resource exhaustion:

from collections import defaultdict
import time

class RateLimiter:
    def __init__(self, max_calls_per_minute: int = 60):
        self.max_calls = max_calls_per_minute
        self.calls: dict[str, list[float]] = defaultdict(list)

    def check(self, session_id: str) -> bool:
        now = time.time()
        # Clean old entries
        self.calls[session_id] = [
            t for t in self.calls[session_id] if now - t < 60
        ]
        if len(self.calls[session_id]) >= self.max_calls:
            return False
        self.calls[session_id].append(now)
        return True

Timeout Strategies

MCP tool calls can hang if the underlying operation (database query, API call, file I/O) stalls. Layered timeouts prevent cascading delays:

Layer Recommended Timeout Purpose
Tool execution 30-60s Prevents individual tool handlers from blocking
MCP request 90s Covers serialization + tool execution + response
Transport connection 10s Catches network failures quickly
Session idle 300s Frees resources from abandoned connections
Total workflow 600s Prevents runaway agent loops

Claude Code’s MCP_TIMEOUT environment variable controls the client-side timeout for MCP operations, defaulting to 60 seconds.

Caching Strategies

MCP servers benefit from multiple caching layers:

  1. Tool schema caching — cache tools/list responses (they rarely change) using cache_tools_list: true or defer_loading
  2. Response caching — for idempotent tools, cache responses by input hash with TTL
  3. Connection caching — reuse established MCP connections across requests from the same client
  4. Resource caching — use ETags and conditional requests for MCP resources that change infrequently

Performance Testing Checklist

Use this checklist when preparing an MCP server for production deployment:

Baseline measurements:

  • Measure time-to-first-tool-call (cold start) for each transport you support
  • Benchmark individual tool call latency (p50, p95, p99) under no load
  • Measure tools/list response time and size

Load testing:

  • Test with concurrent clients (start at 10, scale to expected production load)
  • Run sustained load (30+ minutes) to detect memory leaks
  • Measure throughput ceiling (requests/second before degradation)
  • Test session management under concurrent creation/cleanup

Optimization verification:

  • Confirm tool responses are token-efficient (prefer CSV for tabular data)
  • Verify pagination works correctly for large result sets
  • Ensure resource URIs are used for large artifacts instead of inline data
  • Check that connection pooling and keep-alive are enabled for HTTP transport

Observability:

  • OpenTelemetry spans are emitted for all tool calls
  • Metrics dashboards show p50/p95/p99 latency, throughput, error rate
  • Memory usage monitoring is in place with alerting
  • Session count tracking prevents unbounded growth

Production readiness:

  • Rate limiting is configured per session
  • Timeouts are set at every layer (tool, request, transport, session)
  • Circuit breakers protect against slow backends
  • Graceful degradation when tools fail (fallback responses, retry with backoff)

Connection Lifecycle and Initialization Costs

MCP connection initialization is a frequently underestimated performance cost. The three-phase handshake (Initialize, Operation, Shutdown) adds measurable overhead that compounds across servers.

The Cold Start Problem

With 1-2 MCP servers configured, initialization adds 3-10+ seconds to every Claude Code session start — blocking the input prompt entirely until complete. This is a known issue with multiple duplicate GitHub issues requesting lazy initialization. The proposed fix: spawn MCP servers in background, show the input prompt immediately, and wait on-demand only at tool invocation time.

Cold vs warm performance:

  • Cold start: server process initializes, connects backends, loads configs — can exceed the client timeout window (default 60 seconds, error -32001) and silently fail
  • Warm start: backend services already running, handshake succeeds immediately
  • MCP_TIMEOUT env var caps wait time but still blocks during that window

Session Overhead at Scale

  • Session creation is the dominant bottleneck — 10x performance difference between pooled and per-request sessions (293 vs 33 RPS)
  • Stateless HTTP mode has task accumulation memory leak in Python SDK (fixed in recent versions)
  • No standardized reconnection protocol yet — clients must detect dropped connections and decide between reconnect or error

Connection Pooling Patterns

  • Go’s http.DefaultClient has MaxIdleConnsPerHost=2, causing TCP churn and P95 spikes to 61ms under 50 VUs. Custom transport with 100 connections reduced P95 to 17.62ms
  • Quarkus default connection pool (~50 connections) caused catastrophic 0 RPS failure at 50 VUs; needs 1,000 connections + 1,000 waiting queue
  • HTTP/2 multiplexing — a single TCP connection carries multiple concurrent MCP requests, eliminating per-request handshake overhead
  • Pre-warming — initialize MCP connections at application startup rather than on first request

2026 Roadmap: Stateless Protocol

The 2026 MCP roadmap targets a fundamental fix: replacing the initialize handshake entirely, sending shared info with each request (stateless protocol), with session state via a cookie-like mechanism. Target: June 2026 spec release. This would eliminate sticky sessions, distributed session stores, and the cold start problem — enabling horizontal scaling without session affinity workarounds.

Real-World Performance Reference Numbers

For teams setting performance budgets, here are documented numbers from production deployments:

Metric Value Source
Java/Go MCP server latency 0.84-0.86ms avg, 10ms P95 TM Dev Lab v1 (50 VUs)
Rust MCP server throughput 4,845 RPS TM Dev Lab v2 (rmcp 0.17.0)
Python MCP server ceiling 259 RPS (4 workers + uvloop) TM Dev Lab v2
Production MCP call duration 127ms avg (70-340ms range) Infobip
Cache hit vs cold call 0.01ms vs 2,485ms Documented optimization case
Session pool throughput 293 RPS (10 shared sessions) Stacklok Kubernetes
Per-request session throughput 33-36 RPS Stacklok Kubernetes
Acceptable conversational latency Under 400ms Infobip production guidelines
Cost optimization example $15,000/mo to $500/mo Simplified tool implementation
Context7 (Upstash) response 0.95-1.38s (search + retrieval) 240K+ weekly npm downloads

Pareto pattern in production: 20% of tools handle 80% of requests. One documented case showed a database_query tool handling 10,000 calls/hour. Focus optimization on your hot-path tools first.

Geographic impact: US-East deployments show 100-300ms lower latencies than European/Asian servers. This compounds across sequential tool call chains — a 5-tool workflow might add 500-1,500ms just from geographic distance.

Further Reading

This guide complements several related ChatForest resources: