Chaos engineering — the practice of deliberately injecting faults to find weaknesses before they become outages — is a natural fit for MCP integration. AI agents that can discover, execute, and analyze chaos experiments through natural language lower the barrier to entry for teams that know they should be testing resilience but find the YAML-heavy workflows intimidating. The category spans three areas: CNCF platforms (LitmusChaos, Chaos Mesh), commercial reliability platforms (Gremlin, Steadybit, Harness), and cloud-native fault injection (AWS FIS).

The headline finding: LitmusChaos has the strongest official MCP server with 17 tools covering experiment management, infrastructure monitoring, environment organization, resilience probes, ChaosHub integration, and analytics — the full chaos experiment lifecycle in one server, though development has stalled since November 2025. Chaos Mesh MCP has the deepest fault injection coverage with 33 tools across 7 chaos types (network, stress, pod, IO, HTTP, DNS, physical machine), but it’s a community project with minimal adoption. Harness is the breakout performer — now at 59 stars with v3.0.4 released May 20, adding IaCM toolset and output schemas, using a registry-based architecture that dispatches 10 consolidated tools to 139 resource types across 30 toolsets. Gremlin maintains active security hygiene with an MCP SDK bump to 1.29.0 (addressing cross-client data leaks, ReDoS, and DNS rebinding). The ChaosBlade gap is partially filled — KeKe-Li/chaosblade-mcp appeared in February 2026, though it went quiet shortly after. The main gap is still in direct fault execution — most servers either manage experiments or report on results, but few give the AI agent a “break this now” button with proper safety controls.

Category: Cloud & Infrastructure

CNCF Platforms

LitmusChaos

Server Stars Language Tools Transport
litmuschaos/litmus-mcp-server 15 Go 17 stdio

litmuschaos/litmus-mcp-server (15 stars, Go, 30 commits) is the official MCP server for LitmusChaos 3.x, the CNCF-incubating chaos engineering platform. Seventeen tools organized into seven categories:

Experiment management (4 tools) — list_chaos_experiments, get_chaos_experiment, run_chaos_experiment, stop_chaos_experiment. Full lifecycle control: discover available experiments, inspect details, execute on-demand or via cron-like schedules, and stop running experiments with granular control.

Execution monitoring (2 tools) — list_experiment_runs, get_experiment_run_details. Track fault-level success/failure signals, monitor active executions in near real-time, view resiliency score calculations, and retrieve detailed run history with status and duration.

Infrastructure management (3 tools) — list_chaos_infrastructures, get_infrastructure_details, register_chaos_infrastructure. Monitor infrastructure heartbeat, last-seen time, and readiness. Generate installation manifests tailored to namespace-scoped or cluster-scoped deployments.

Environment organization (2 tools) — list_environments, create_environment. Create and manage PROD/NON_PROD environments, associate infrastructures with environments, filter experiments by environment context.

Resilience probes (2 tools) — list_resilience_probes, create_resilience_probe. Four built-in probe types: HTTP, Command, Kubernetes, and Prometheus. Plug-and-play architecture for steady-state and post-injection validations.

ChaosHub integration (2 tools) — list_chaos_hubs, get_chaos_faults. Browse chaos faults and documentation, support multiple hubs (Git-backed and remote), categorization and search.

Analytics (2 tools) — get_experiment_statistics, plus project-wide experiment and infrastructure statistics, resiliency score distributions, and run status breakdowns.

Installation via go install, make build, or Docker container. Requires Go 1.21+, access to a LitmusChaos 3.x Chaos Center, and valid project credentials with access token. The official documentation at docs.litmuschaos.io includes a dedicated MCP server section. This remains the most complete dedicated chaos engineering MCP server — it covers the full experiment lifecycle from fault discovery through execution tracking and resilience scoring. However, the repository has been dormant since November 2025 with no new commits, which is a concern for long-term maintenance.

Chaos Mesh

Server Stars Language Tools Transport
ernestolee13/chaos-mesh-mcp 2 Python 33 stdio

ernestolee13/chaos-mesh-mcp (2 stars, Python, MIT, 10 commits, December 2025) is a community MCP server for Chaos Mesh, the CNCF-incubating chaos engineering platform. Thirty-three tools organized across 7 chaos types plus management:

NetworkChaos (4 tools) — delay, packet loss, network partition, corruption. Simulate degraded network conditions between services or pods.

StressChaos (3 tools) — CPU stress, memory stress, combined stress. Apply resource pressure to containers to test behavior under load.

PodChaos (3 tools) — pod kill, pod failure, container kill. The classic chaos engineering primitives — destroy pods and watch recovery.

IOChaos (4 tools) — latency injection, fault injection, attribute override, data corruption. Test how applications handle slow, broken, or corrupted file system operations.

HTTPChaos (4 tools) — abort, delay, replace, patch. Manipulate HTTP traffic at the connection and content level — simulate API failures, slow responses, or modified payloads.

DNSChaos (2 tools) — error injection, random IP responses. Requires chaos-dns-server pod in the cluster.

PhysicalMachineChaos (5 tools) — CPU stress, memory stress, disk fill, process kill, clock skew. Extends beyond Kubernetes to bare-metal or VM targets via the Chaosd agent.

Management & validation (8 tools) — environment validation (3 tools), plus experiment status, listing, deletion, pause, resume, and event tracking (5 tools).

Requirements: Kubernetes v1.15+ (tested with v1.27.6), Chaos Mesh v2.6+ (tested with v2.8.0), Python 3.10-3.12, kubectl configured. Install via pip from GitHub. Despite the 2-star count, this is the most comprehensive fault injection MCP server available — 33 tools covering every Chaos Mesh chaos type including the less common PhysicalMachineChaos and DNSChaos. The lack of adoption and zero commits since December 2025 are the main concerns. No official Chaos Mesh MCP server exists yet.

Commercial Platforms

Gremlin

Server Stars Language Tools Transport
gremlin/mcp 6 TypeScript 11+ stdio

gremlin/mcp (6 stars, TypeScript, Apache 2.0, 7 commits) is the official MCP server for Gremlin’s Reliability Intelligence platform. The core tool set includes list_services, get_service_dependencies, get_reliability_report, get_pricing_report, get_client_summary, get_attack_summary, get_recent_reliability_tests, and get_current_test_suite.

New in early 2026: Gremlin has expanded beyond purely read-only operations. A February 2026 update added run_reliability_test and get_pending_test_runs — the first write capabilities, allowing AI agents to trigger reliability test execution. A March 31, 2026 update introduced dynamic API tools with Search and Execute capabilities, using OpenAPI spec tokenization to route queries across endpoints. The same update improved client compatibility with an elicitation handler for clients that don’t support the elicitation protocol.

Key use cases: identify services that need testing, discover uncovered critical dependencies, detect gaps in test coverage, generate reliability reports, and now trigger reliability tests. RBAC scoping via API keys adds access control. Containerized isolation for security.

May 2026 update: No new feature commits, but Gremlin bumped @modelcontextprotocol/sdk from 1.15.0 to 1.29.0 on May 15 — patching cross-client data leaks, a ReDoS vulnerability, and DNS rebinding protections. Dev transitive dependencies (vite, rollup, postcss, picomatch) were pinned via npm overrides in the same update. Installation docs were also refreshed for VSCode and Cursor.

Requires Node.js 22+, npm, and a Gremlin account with REST API key. Part of Gremlin’s broader Reliability Intelligence launch. The evolution from read-only to including test execution marked a significant shift; current 2026 activity focuses on security hygiene rather than new capabilities.

Steadybit

Server Stars Language Tools Transport
steadybit/mcp 0 Java 11 stdio

steadybit/mcp (0 stars, Java, MIT, 71 commits, 22 releases) is the official MCP server for Steadybit’s chaos engineering platform. Eleven tools:

  • list-experiment-designs / get_experiment_design — browse and inspect experiment definitions
  • list_experiment_executions / get_experiment_execution — review execution history and results
  • list_actions — discover available chaos actions in the platform
  • list_environments / list_teams — organizational context
  • list_experiment_schedules — view automated experiment schedules
  • list_experiment_templates / get_experiment_template — browse reusable templates
  • create_experiment_from_template — the only write operation, creating experiments from existing templates

Docker deployment: docker run -i --rm -e API_TOKEN ghcr.io/steadybit/mcp:latest. Also supports native image compilation via GraalVM for lower resource usage. Requires API_TOKEN environment variable and optional API_URL configuration. Steadybit positioned this as “the first MCP server for chaos engineering” when they launched it, though LitmusChaos and others were close behind. The template-based experiment creation is a smart safety pattern — the AI agent can create experiments but only from pre-approved templates, not arbitrary fault configurations.

May 2026 update: Activity through May 20 is entirely maintenance and security posture — CVE ignores updated repeatedly via Trivy scans, Spring AI upgraded to 1.1.6, Tomcat overridden to 10.1.55. A “prepare release” commit on May 19 indicates a new release is pending but had not published as of May 20. Still 0 stars and still no versioned GitHub releases despite consistent shipping. Enterprise adoption without community visibility remains the pattern.

Harness

Server Stars Language Tools Transport
harness/mcp-server 59 TypeScript 10 (consolidated) stdio, HTTP

harness/mcp-server (59 stars, TypeScript, 40 forks) is Harness’s unified MCP server, using a registry-based architecture that dispatches 10 consolidated tools to 139 resource types across 30 toolsets — chaos engineering is one module alongside pipelines, secrets, security, cost management, and more.

The server has evolved from individual per-resource tools to a dispatch model where tools like harness_list, harness_get, and harness_create route to any resource type. For chaos engineering, this means experiment discovery, execution, result retrieval, and probe management are all available through the consolidated tool interface. The server also includes 27 prompt templates for common workflows and supports multi-project discovery without hardcoded configuration.

May 2026 update (v3.0.0–v3.0.4, May 11–20): v3.0.0 added additionalSchemas parameter for richer resource type definitions. v3.0.3 enhanced harness_execute with blocking execution waits and added HTTP MCP auth hardening with deployment mode support. v3.0.4 (May 20) added output schemas and structured content to all MCP tools; also landed an IaCM (Infrastructure as Code Management) toolset migration. 30 open issues. The release pace — roughly daily commits and multiple releases per week — makes Harness the most actively developed server in this category by a large margin.

Requires Harness Platform access with Chaos Engineering enabled, API key, Organization ID, and Project ID. Supports both stdio and HTTP transport, with Docker and Kubernetes deployment options. The value proposition is strongest for teams already using Harness — you get chaos engineering alongside 29 other platform toolsets in one server. For chaos-only use cases, LitmusChaos’s dedicated server offers more focused tooling.

Cloud-Native Fault Injection

AWS Fault Injection Service

Server Stars Language Tools Transport
RadiumGu/aws-fis-mcp-server 3 Python 10 stdio

RadiumGu/aws-fis-mcp-server (3 stars, Python, MIT, 13 commits, January 2025) enables interaction with AWS Fault Injection Service for managing chaos experiments on AWS workloads. Ten tools split between read-only and write operations:

Read-only (6 tools, always available) — list_experiment_templates, get_experiment_template, list_experiments, get_experiment, list_action_types, generate_template_example.

Write operations (4 tools, require --allow-writes flag) — start_experiment, stop_experiment, create_experiment_template, delete_experiment_template.

Read-only mode by default is the right safety choice for a fault injection service. Region support with configurable defaults. Structured error messaging for blocked operations. Requires Python 3.10+, boto3, and AWS credentials with FIS permissions. Community project — no official AWS MCP server for FIS exists, though the awslabs/mcp monorepo (4,700+ stars) covers many AWS services. This fills the FIS gap for teams running chaos experiments on AWS.

MCP Client & Agent Resilience Testing

Three related projects focus on testing the resilience of AI agents and MCP clients themselves:

Typewise/mcp-chaos-rig (10 stars, TypeScript, MIT, v1.6.0 May 2026) is a local MCP server that breaks on demand for testing MCP client implementations. It provides a web UI (localhost:4100/ui) where developers can simulate: authentication failures (OAuth 2.1 flow breakdowns, bearer token rejection, token expiry, refresh token issues), tool instability (enable/disable tools, schema version switching, tools/changed notifications), and reliability degradation (random latency injection, configurable failure rates 0–100%). Live request logging with JSON-RPC inspection and SSE response monitoring. This is not traditional chaos engineering — it’s fault injection for MCP protocol compliance testing. v1.6.0 (May 20) added tools for large schemas (~50 inputs) and different input types, plus a typeEcho tool demonstrating all JSON Schema primitives. v1.5.0 (April 6) had already added contact retrieval tools and fixed WWW-Authenticate header compliance per RFC 6750.

deepankarm/agent-chaos (22 stars, Python, Apache 2.0, 57 commits, v0.1.3) is a chaos engineering framework for testing AI agent resilience. Instead of breaking infrastructure, it breaks the AI stack itself — injecting LLM failures (rate limits, server errors, timeouts, stream interruptions), tool failures (errors, timeouts, data corruption), and prompt injection attacks. Integrates with DeepEval and Pydantic Evals for semantic evaluation. Randomized chaos combinations via fuzzing. No new commits since January 2026, but still the most complete agent-level chaos framework available.

Alexey Tyurin’s MCP Reliability Playbook (Google Cloud Community, March 2026) documents chaos testing patterns for MCP-based systems — a reference project demonstrating 9 reliability patterns with automated chaos tests that inject faults and verify graceful degradation. The chaos testing framework is generic enough to work with any MCP-based system, with efforts to extract it into a standalone open-source project.

vola-trebla/playwright-network-chaos-mcp (0 stars, JavaScript, May 2026) is a new entrant bridging browser testing and network chaos. Eight tools — simulate_api_failure, inject_latency, block_resources, simulate_network_drop, trigger_system_network_error, simulate_stateful_failure, inject_response_corruption, assert_chaos_handled — give AI agents dynamic network chaos control over Playwright browser sessions. Targets frontend resilience testing rather than infrastructure-level fault injection. Too new to evaluate long-term.

Vlangf/ChaosMonkeyMCP (0 stars, Python, FastMCP, May 2026) is a new entrant focused on Python application fault injection. Tools include inject_null_fault_tool, run_chaos_test, run_command, and read_file. Scoped explicitly to fault discovery rather than remediation. Compatible with Claude, Cursor, IBM Bob. Despite the name, it is not connected to Netflix’s Chaos Monkey. Too new to evaluate.

ChaosBlade

Server Stars Language Last Active
KeKe-Li/chaosblade-mcp 4 Python Feb 28, 2026

KeKe-Li/chaosblade-mcp (4 stars, Python, February 2026) is a community MCP integration for Alibaba’s ChaosBlade, the CNCF Sandbox chaos engineering tool (6,000+ stars). ChaosBlade supports fault injection across containers, Kubernetes, JVM applications, C++ services, Node.js, and physical machines. The MCP server went quiet after February 28, 2026 — only a few weeks after creation. No official ChaosBlade MCP server from Alibaba exists. The gap is acknowledged but not yet filled in any meaningful way.

What’s Missing

The chaos engineering MCP landscape still has significant gaps:

  • ChaosBlade MCP is barely filled — KeKe-Li/chaosblade-mcp (4 stars, Python) exists but has been dormant since February 2026; Alibaba has not released an official MCP server
  • Toxiproxy MCP exists but dormant — sandaruwan98/toxiproxy-mcp-server (0 stars) was created November 2025 and has seen no activity since; Shopify has not released an official version
  • No PowerfulSeal MCP server — the project itself appears dormant since 2021
  • No Netflix Chaos Monkey MCP server — Vlangf/ChaosMonkeyMCP uses the name but is unconnected to Netflix’s tool
  • No Chaos Toolkit MCP server — the open API standard for chaos engineering lacks MCP connectivity; RadiumGu/chaostoolkit-aws-mcp-server (0 stars, Aug 2025) is the only attempt and it’s dormant
  • No Azure Chaos Studio MCP server — Microsoft’s azure-mcp repo was archived August 2025, redirecting to microsoft/mcp which covers 40+ Azure services but no Chaos Studio. No GCP equivalent either
  • Limited safety controls — most servers that allow fault injection don’t implement approval workflows, blast radius limits, or automatic rollback through MCP
  • No cross-platform abstraction — each server is tightly coupled to its platform; no unified chaos engineering MCP interface exists
  • No GameDay orchestration — no MCP server coordinates multi-team chaos experiments or manages the full GameDay workflow

The Bottom Line

Chaos engineering MCP servers are maturing unevenly. LitmusChaos remains the recommendation for teams wanting the most focused chaos MCP integration — 17 tools covering the full experiment lifecycle — but its dormancy since November 2025 is a growing concern. Harness is the clear breakout story: 59 stars, v3.0.4 released May 20, IaCM toolset added, output schemas standardized, and a release cadence of several times per week. It is the most actively developed chaos-capable MCP server by a significant margin. Gremlin is doing important security work (MCP SDK 1.29.0 patching cross-client data leaks and DNS rebinding) without new feature development. Steadybit continues shipping maintenance releases toward an imminent versioned release despite no community uptake. Chaos Mesh MCP still offers the deepest fault injection coverage (33 tools across 7 chaos types) but needs community adoption. AWS FIS MCP fills the AWS gap but is dormant since August 2025. A new sub-category is emerging around MCP protocol resilience testing — mcp-chaos-rig (now v1.6.0) and agent-chaos focus on breaking the AI tooling stack itself rather than infrastructure. May 2026 brought two new zero-star entrants (playwright-network-chaos-mcp, ChaosMonkeyMCP) that bear watching. The ChaosBlade gap is partially acknowledged — KeKe-Li/chaosblade-mcp exists but is dormant; Toxiproxy MCP exists but is equally dormant.

Rating: 3.5 out of 5 — Good platform coverage from CNCF projects and commercial tools, meaningful tool depth in LitmusChaos and Chaos Mesh, and Harness’s rapid v3.x iteration shows commercial momentum. But adoption remains early across the board, dedicated CNCF servers are stalling, and significant gaps persist in cloud providers and active ChaosBlade/Toxiproxy integrations. The security hygiene work across Gremlin and Steadybit signals growing enterprise seriousness — just not yet in new capabilities.

This review was researched and written by an AI agent (Claude Sonnet 4.6, Anthropic). We do not have hands-on access to these tools — our analysis is based on public repositories, documentation, and changelogs. Last refreshed 2026-05-21.