The modern data stack is getting an AI layer. Every major data platform — from orchestrators like Airflow and Dagster, to transformation tools like dbt, to warehouses like Snowflake and BigQuery, to streaming platforms like Kafka — now has at least one MCP server connecting it to AI agents.
This means agents can now discover data models, trace lineage, trigger pipeline runs, diagnose failures, query warehouses, produce and consume streaming messages, manage ELT connectors, and validate data quality — all through standardized MCP tool calls instead of custom API integrations.
The maturity varies significantly. dbt’s official MCP server exposes 40+ tools with semantic layer integration. Confluent’s Kafka server offers 37+ tools spanning Kafka, Flink SQL, and Schema Registry. Snowflake and Google both ship official MCP servers with deep platform integration. Others, like Dagster and Great Expectations, have community-built servers still finding their footing.
This guide covers MCP integrations across the entire data pipeline — ingestion, orchestration, transformation, warehousing, streaming, and quality — with tool inventories, architecture patterns, real-world case studies, and security considerations. Our analysis draws on published documentation, open-source implementations, and vendor announcements — we research and analyze rather than deploying these systems ourselves. Rob Nugen operates ChatForest; the site’s content is researched and written by AI.
Pipeline Orchestration
Apache Airflow — The DAG Management Layer
Airflow is the most widely deployed pipeline orchestrator, and it has several MCP servers — including one from Astronomer, the company behind managed Airflow.
astro-airflow-mcp (astronomer/astro-airflow-mcp) — the official Astronomer MCP server for Airflow. Apache 2.0 license. Version 0.2.2 (January 2026).
Key tools:
- DAG intelligence:
explore_dag(structure and dependencies),diagnose_dag_run(failure analysis),get_system_health - DAG management: List, trigger, pause/unpause DAGs
- Task operations: Task instance status, logs, retry
- Resource management: Pool configuration, variable management, connection listing
- Data lineage: Asset/dataset management with lineage tracking
- System info: Plugin and provider information
Notable features: Supports both Airflow 2.x and 3.x with automatic version detection. Runs as a standalone server (uvx astro-airflow-mcp --transport stdio) or embedded as an Airflow plugin. Multiple auth methods — bearer tokens, OAuth 2.1, and basic auth.
airflow-mcp-server (abhishekbhakat/airflow-mcp-server) — ~32 stars, MIT license, version 0.9.1. Takes a different approach: tools are dynamically generated from Airflow’s OpenAPI spec, so it automatically covers any endpoint Airflow exposes. Offers both hierarchical and static tool discovery modes. Includes a safe mode (read-only) and unsafe mode (full operations). JWT auth. Requires Airflow 3.0+.
Other Airflow MCP servers include wrappers for natural language to Airflow API translation (call518/MCP-Airflow-API), Amazon MWAA-specific integration (paschmaria/mwaa-mcp-server), and lightweight REST API wrappers (yangkyeongmo/mcp-server-apache-airflow, tomnagengast/mcp-server-airflow).
What an AI agent can do with Airflow MCP:
- Monitor pipeline health — check which DAGs are running, failed, or stuck
- Diagnose failures — examine task logs, identify root causes, trace upstream dependencies
- Trigger pipeline runs — kick off DAGs with specific parameters for backfills or ad-hoc processing
- Manage resources — adjust pool sizes, update variables, review connections
- Track data lineage — understand how datasets flow through the pipeline graph
Dagster — Asset-Centric Orchestration
mcp-server-dagster (kyryl-opens-ml/mcp-server-dagster) — ~21 stars, Apache 2.0, version 0.1.2.
Tools: list_repositories, list_jobs, list_assets, recent_runs, get_run_info, launch_run, materialize_asset, terminate_run, get_asset_info.
Dagster has announced that an official MCP server is upcoming, with Dagster Components designed to work alongside AI tools. The asset-centric model is particularly well-suited for MCP — agents can reason about data assets, their dependencies, and materialization status rather than dealing with individual task runs.
Prefect uses “Horizon” as their avenue for MCP integration, allowing users to get information in LLM chat UIs. No standalone open-source Prefect MCP server has emerged yet.
Data Transformation
dbt — The Most Mature Data MCP Integration
dbt Labs’ official MCP server is the most feature-rich data engineering MCP integration available, with 40+ tools spanning SQL execution, semantic layer queries, project discovery, CLI operations, admin API, code generation, and LSP integration.
dbt-mcp (dbt-labs/dbt-mcp) — ~521 stars, Apache 2.0, version 1.11.0 (March 25, 2026).
The tool inventory is organized into seven categories:
SQL execution:
execute_sql— run SQL against the warehousetext_to_sql— natural language to SQL using project context
Semantic Layer:
get_dimensions,get_entities,get_metrics_compiled_sql— explore metric definitionslist_metrics,list_saved_queries,query_metrics— query the semantic layer directly
Discovery:
get_all_macros,get_all_models,get_all_sources— enumerate project resourcesget_lineage,get_model_details,get_model_health— understand relationships and statusget_model_parents,get_model_children— navigate the DAGget_model_performance,search— performance metrics and full-text search
dbt CLI:
build,compile,docs,list,parse,run,show,test— full CLI access
Admin API:
cancel_job_run,get_job_details,trigger_job_run,list_jobs,list_projects— dbt Cloud job management
Code generation:
generate_model_yaml,generate_source,generate_staging_model— scaffold new dbt resources
LSP (Language Server Protocol):
fusion.compile_sql,fusion.get_column_lineage— IDE-level SQL compilation and column-level lineage
Two deployment modes are available: local MCP server (using uvx for stdio transport) and remote MCP server (HTTP, hosted on the dbt platform).
dbt-core-mcp (NiclasOlofsson/dbt-core-mcp) — ~11 stars, MIT license. A lightweight alternative with zero dbt dependencies — it detects and uses your existing dbt version, adapter, and Python setup. Tools include get_project_info, list_resources, get_resource_info, get_lineage, analyze_impact, get_column_lineage, query_database, run_models.
Other community servers include mattijsdp/dbt-docs-mcp for interacting with dbt Docs artifacts and MammothGrowth/dbt-cli-mcp for CLI wrapping.
dbt + MCP Architecture Patterns
Based on dbt Labs’ own published documentation and blog posts, several architecture patterns have emerged for AI agents working with dbt:
Control plane model: dbt MCP transforms dbt from a static tool into a dynamic control plane. Business logic, transformations, tests, and documentation live in one place — structured once, applied everywhere. The AI agent becomes an operator of this control plane.
Autonomous model discovery: Agents discover available data models, their relationships, and structures, then navigate complex data environments autonomously. No need to pre-configure which tables or models the agent can access.
Context-aware SQL generation: Agents generate and validate SQL against dbt models, adapting to schema changes. The text_to_sql tool uses project context to produce accurate queries without manual schema documentation.
Lineage-driven impact analysis: Because agents see the DAG and downstream consumers, they can reason about blast radius before making changes. “If I modify this staging model, which dashboards break?”
Semantic layer integration: AI queries canonical metrics and dimensions via the Semantic Layer tools, ensuring outputs reflect dbt project definitions rather than raw table structures.
Orchestrator pattern: A lead agent delegates specialized tasks to sub-agents — one for SQL generation, one for testing, one for documentation — each with narrowly scoped MCP tools.
Data Warehouses and Lakes
Snowflake
Snowflake Labs ships an official MCP server with deep platform integration.
Snowflake-Labs/mcp (Snowflake-Labs/mcp) — ~265 stars, Python.
Capabilities span six categories:
- Cortex Search — RAG for unstructured data stored in Snowflake
- Cortex Analyst — semantic modeling for structured data queries
- Cortex Agent — agentic orchestration using Snowflake’s built-in agent framework
- Object Management — create, drop, and update Snowflake objects
- SQL Execution — run queries against the warehouse
- Semantic View Querying — query through semantic abstractions
Install via uvx snowflake-labs-mcp --service-config-file config.yaml. Docker/Docker Compose deployment is available. Supports Claude Desktop, Cursor, VS Code with Copilot, and Codex.
mcp-snowflake-server (isaacwasserman/mcp-snowflake-server) — ~180 stars, version 0.4.0. A more focused alternative with tools for read_query, write_query, create_table, list_databases, list_schemas, list_tables, describe_table, and append_insight.
Snowflake Cortex Agent MCP Server (Snowflake-Labs/snowflake-cortex-agent-mcp-server) — specifically bridges Snowflake’s Cortex Agents with MCP clients.
A notable cross-platform integration: dbt-labs/streamlit_mcp_cortex connects Snowflake Cortex to dbt’s remote MCP server via a Streamlit app, enabling “chat-with-your-data” experiences.
Google BigQuery
Google provides official MCP support through their comprehensive MCP repository.
google/mcp (google/mcp) — ~3,500 stars, Apache 2.0. Includes a remote BigQuery MCP server and the “MCP Toolbox for Databases” covering BigQuery, Cloud SQL, AlloyDB, Spanner, and Firestore. Documentation at cloud.google.com/bigquery/docs/use-bigquery-mcp.
mcp-server-bigquery (LucasHild/mcp-server-bigquery) — ~123 stars, MIT license, version 0.3.2 (February 2026). Tools: execute-query, list-tables, describe-table.
mcp-bigquery-server (ergut/mcp-bigquery-server) — ~135 stars, version 1.0.3. Read-only access with a 1GB processing limit as a safety guardrail. SQL query execution and schema exploration.
Other BigQuery MCP servers include aicayzer/bigquery-mcp for multi-project access with Docker support and garethcull/google-bigquery-mcp targeting marketing and data teams.
Databricks
Databricks has native MCP support built into the platform.
Official Databricks MCP support — pre-configured MCP servers are accessible through the Databricks workspace under Agents > MCP Servers. Genie Code can connect to MCP servers for natural language data interaction.
mcp-databricks-server (RafaelCartenet/mcp-databricks-server) — ~37 stars, MIT license. Tools: list_uc_catalogs, describe_uc_catalog, describe_uc_schema, describe_uc_table (with lineage), execute_sql_query. Focused on Unity Catalog metadata, data discovery, and lineage analysis.
Other community servers include JustTryAI/databricks-mcp-server for clusters, jobs, and notebooks, alexxx-db/databricks-genie-mcp for the Genie API, and PulkitXChadha/awesome-databricks-mcp as a template for hosting MCP tools on Databricks Apps.
AWS Data Services
AWS provides the broadest single-repository coverage of data services through MCP.
awslabs/mcp (awslabs/mcp) — ~8,600 stars, Apache 2.0. The data-relevant MCP servers in this collection include:
- AWS Data Processing — access to AWS Glue job statuses, Athena query results, EMR cluster metrics, and Glue Data Catalog metadata through a unified interface
- Amazon Redshift — warehouse query and management
- Amazon RDS — relational database operations
- DynamoDB — NoSQL data access
- ElastiCache — caching layer management
- OpenSearch — search and analytics
- Amazon MSK — managed Kafka integration
This was demonstrated at AWS re:Invent 2025 in the session “Agentic data engineering with AWS Analytics MCP Servers” (ANT335), showing how MCP servers enable intelligent ETL pipelines with Amazon Q.
Data Streaming
Apache Kafka
Kafka has the richest streaming MCP ecosystem, led by Confluent’s official server.
mcp-confluent (confluentinc/mcp-confluent) — ~144 stars, TypeScript/Node.js, MIT license.
37+ tools organized across multiple categories:
- Kafka operations: List, create, and delete topics; produce and consume messages; alter topic configuration
- Flink SQL: Create, list, and delete SQL statements; health checks and diagnostics
- Flink Catalog: List catalogs, databases, and tables; describe table schemas
- Connectors: Full CRUD for Kafka Connect connectors
- Schema Registry: List and manage schemas
- Tableflow: Topic management and catalog integrations
- Environment management: List and manage Confluent environments and clusters
- Billing: List and track costs
Supports selective tool enabling/blocking via CLI, so you can expose only the tools relevant to your use case.
kafka-mcp-server (tuannvm/kafka-mcp-server) — ~47 stars, Go, MIT license. Nine focused tools: produce_message, consume_messages, list_brokers, describe_topic, list_consumer_groups, describe_consumer_group, describe_configs, cluster_overview, list_topics. Supports SASL auth (PLAIN, SCRAM-SHA-256/512), TLS, and OAuth 2.1. Install via Homebrew, Docker, or from source.
StreamNative MCP Server (streamnative/streamnative-mcp-server) — ~23 stars, Go. Unique in covering both Kafka and Apache Pulsar: Kafka admin tools (topics, partitions, consumer groups, Schema Registry, Kafka Connect), Kafka client tools (produce/consume), Pulsar admin (brokers, clusters, namespaces, topics, subscriptions, Functions/Sources/Sinks), Pulsar client tools, and StreamNative Cloud management. Dynamically exposes deployed Pulsar Functions as invokable MCP tools.
Data Ingestion (ELT)
Fivetran
fivetran-mcp (fivetran/fivetran-mcp) — ~9 stars, Python 3.10+. The official Fivetran MCP server with 50+ core tools providing complete API coverage:
- Account management and configuration
- Connections — CRUD, sync operations, schema/column configuration
- Destinations — CRUD, setup tests, certificate management
- External logging configuration
- Group, user, team, and role management
- Hybrid deployment agents
- Metadata queries
- Proxy agent management
Configurable read-only or read-write mode (FIVETRAN_ALLOW_WRITES=true). Many advanced tools are disabled by default to reduce context usage — enable them as needed.
Airbyte
Airbyte takes a multi-server approach:
- PyAirbyte MCP — built into the
airbytePython package (airbyte.mcp). Lists connectors, validates configs, runs sync operations. Marked as experimental. - Airbyte Knowledge MCP — a hosted server connecting AI agents to Airbyte’s documentation, website, OpenAPI specs, YouTube content, and GitHub issues/discussions. Useful for agents that need to understand Airbyte’s capabilities.
- Connector Builder MCP (airbytehq/connector-builder-mcp) — AI agents build Airbyte connectors, managing the complete connector development lifecycle from manifest validation to testing and PR creation.
- Airbyte Agent Connectors (airbytehq/airbyte-agent-connectors) — drop-in tools giving AI agents permission-aware access to external systems.
Data Quality and Observability
Great Expectations
gx-mcp-server (davidf9999/gx-mcp-server) — ~4 stars, Python, MIT license. The only open-source data quality framework with a published MCP server.
Tools:
- Load datasets — CSV from file, URL, or inline data (up to 1GB)
- Connect to Snowflake and BigQuery tables via URI prefixes
- Define ExpectationSuites — programmatic data quality rules
- Run validation checks against loaded data
- Retrieve detailed validation results
Deployment options: STDIO mode, HTTP mode via Docker, Inspector GUI. Features include Prometheus metrics, OpenTelemetry tracing, basic/bearer token auth, and rate limiting.
Monte Carlo (Private Preview)
Monte Carlo has announced an MCP server in private preview for data and AI observability. It enables natural language queries like:
- “What changed upstream that could explain this broken dashboard?”
- “Trace the full lineage for table X”
- “Show me all incidents affecting the marketing schema this week”
The server translates these into Monte Carlo API calls and returns structured results. This fills a gap — while you can monitor individual pipeline tools via their MCP servers, Monte Carlo’s server provides cross-platform observability across the entire data stack.
Platform Comparison
| Platform | Top MCP Server | Stars | Official? | Tool Count | Transport |
|---|---|---|---|---|---|
| Airflow | astro-airflow-mcp | ~10 | Yes (Astronomer) | 15+ | stdio/plugin |
| Dagster | mcp-server-dagster | ~21 | Community | 9 | stdio |
| dbt | dbt-labs/dbt-mcp | ~521 | Yes (dbt Labs) | 40+ | stdio/HTTP |
| Snowflake | Snowflake-Labs/mcp | ~265 | Yes | 6 categories | stdio/Docker |
| BigQuery | google/mcp | ~3,500 | Yes (Google) | Multiple | HTTP |
| Databricks | Official + community | ~37 | Both | Multiple | stdio |
| Kafka | mcp-confluent | ~144 | Yes (Confluent) | 37+ | stdio |
| Fivetran | fivetran-mcp | ~9 | Yes | 50+ | stdio |
| Airbyte | PyAirbyte MCP | N/A | Yes | Multiple | stdio |
| Great Expectations | gx-mcp-server | ~4 | Community | 5 | stdio/HTTP |
| AWS Data | awslabs/mcp | ~8,600 | Yes (AWS) | Multiple servers | stdio |
Real-World Case Studies
Block (Square/Cash App) + DataHub
Block manages 50+ data platforms under financial compliance requirements. They combined their open-source AI agent Goose with DataHub’s MCP server for real-time incident response using metadata context. DataHub’s MCP server provides entity search, metadata retrieval, lineage traversal, query association, and real-time access — enabling agents to quickly trace issues across the data stack during incidents.
Confluent — Real-Time Streaming for AI Agents
Confluent’s MCP server lets agents interact with real-time streaming data using natural language. Combined with 120+ pre-built connectors, agents can stream data from databases, event systems, and SaaS apps — bridging the gap between batch-oriented data warehouses and the real-time context AI agents need.
AWS — Agentic Data Engineering
Demonstrated at AWS re:Invent 2025, AWS showed how MCP servers for Glue, Athena, EMR, and the Glue Data Catalog enable intelligent ETL pipelines with Amazon Q. Rather than manually configuring pipeline steps, agents reason about data sources, transformations, and destinations — building and monitoring pipelines through MCP tool calls.
Architecture Patterns for AI-Driven Data Engineering
Pattern 1: Pipeline Monitor
The simplest pattern. An AI agent connects to orchestrator MCP servers (Airflow, Dagster) and warehouse MCP servers (Snowflake, BigQuery) to monitor pipeline health, diagnose failures, and alert on anomalies.
┌─────────────┐
│ AI Agent │
└──────┬──────┘
│ MCP
┌────┴────┐
│ │
▼ ▼
Airflow Snowflake
MCP MCP
The agent watches for failed DAG runs, examines task logs, checks downstream table freshness, and reports findings. Read-only access — no mutations.
Pattern 2: Transformation Assistant
An AI agent connects to dbt’s MCP server to help data engineers write, test, and deploy transformations. The agent discovers existing models, understands the lineage graph, generates SQL, runs tests, and performs impact analysis.
┌─────────────┐
│ AI Agent │
└──────┬──────┘
│ MCP
▼
dbt MCP
(40+ tools)
│
▼
┌─────────┐
│Warehouse│
└─────────┘
This is where dbt’s MCP server shines — the semantic layer, lineage, and CLI tools give agents enough context to make meaningful contributions.
Pattern 3: Full-Stack Data Agent
An AI agent connects to multiple MCP servers across the data stack — ELT (Fivetran/Airbyte), orchestration (Airflow), transformation (dbt), warehousing (Snowflake/BigQuery), streaming (Kafka), and quality (Great Expectations/Monte Carlo).
┌──────────────────────┐
│ AI Agent │
└──┬──┬──┬──┬──┬──┬───┘
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
FT AF dbt SF KF GX
MCP MCP MCP MCP MCP MCP
This pattern enables end-to-end pipeline management — an agent could detect a schema change in the source system (via Fivetran), trace impact through the transformation layer (via dbt lineage), check if downstream queries will break (via Snowflake), and trigger a data quality validation (via Great Expectations).
Pattern 4: Event-Driven Pipeline
Kafka MCP sits at the center, with AI agents consuming events that trigger actions across other data tools.
┌─────────────┐
│ AI Agent │
└──────┬──────┘
│ MCP
▼
Kafka MCP
(37+ tools)
│
┌────────────┼────────────┐
▼ ▼ ▼
Airflow MCP dbt MCP Snowflake MCP
(trigger) (transform) (query)
A schema change event in Kafka triggers the agent to update dbt models, run tests, and verify the warehouse — all through MCP tool calls.
Security Considerations
Data pipeline MCP servers handle some of the most sensitive infrastructure in an organization. Security requires particular attention.
Credential Management
- Use runtime secret injection rather than
.envfiles or hardcoded values. Most official MCP servers support environment variable configuration. - Use short-lived access tokens (minutes to 1 hour) via OAuth 2.1 where supported. Snowflake, BigQuery, and Databricks all support OAuth flows.
- Store long-term credentials in vault systems — HashiCorp Vault, AWS Secrets Manager, or cloud-native equivalents. See our MCP server security guide for vault integration patterns.
- Never bake secrets into Docker images. Use Docker secrets or runtime environment injection.
Access Control
- Least-privilege scopes per tool — avoid catch-all permissions. If an agent only needs to read pipeline status, don’t give it write access.
- Dedicated service accounts for each MCP integration with fine-grained roles. Don’t share a single warehouse admin credential across all MCP servers.
- Read-only modes — use them. Fivetran’s
FIVETRAN_ALLOW_WRITES=false, Airflow’s safe mode, BigQuery’s processing limits. Start read-only and enable writes only when needed. - Human approval gates for high-impact operations — triggering production pipeline runs, modifying warehouse objects, altering Kafka topic configurations.
Data Protection
- Column-level access control — ensure MCP servers respect warehouse-level column masking and row-level security policies. The agent should see only what the service account is allowed to see.
- Query result limits — cap the volume of data returned through MCP tool calls. The BigQuery community server’s 1GB processing limit is a good example.
- Audit logging — centralized logs for every MCP tool call, including which agent made the request, what query was executed, and what data was accessed. Never log credentials or authorization headers.
Transport Security
- TLS everywhere — all MCP connections should use encrypted transport, especially for warehouse queries that may return sensitive data.
- OAuth 2.1 is the MCP spec standard for authentication. Use it where available instead of static API keys.
- Input sanitization — validate tool parameters, strip control characters, reject oversized payloads. Data pipeline tools that accept SQL queries are particularly vulnerable to injection.
For a comprehensive security checklist, see slowmist/MCP-Security-Checklist on GitHub. For broader MCP security guidance, see our MCP server security guide and MCP AI safety and guardrails guide.
Getting Started
If you’re a data engineer exploring MCP for the first time, start with dbt’s MCP server — it’s the most mature, best documented, and most immediately useful. Connect it to your IDE (Cursor, VS Code) and use it for model discovery, lineage exploration, and SQL generation.
If you run Airflow, the Astronomer MCP server is the safest entry point. Start in read-only mode for pipeline monitoring and failure diagnosis before enabling write operations.
If you need real-time data access, Confluent’s Kafka MCP server provides the broadest tool coverage. The selective tool enabling/blocking feature lets you start narrow and expand access over time.
If you’re on a cloud platform, check the official MCP servers first — Google’s BigQuery MCP, Snowflake Labs’ MCP server, Databricks’ built-in support, and the AWS Labs collection all have vendor backing and active maintenance.
If you’re building a multi-tool data agent, start with two MCP servers (e.g., dbt + Snowflake) and expand as you validate the pattern. The full-stack data agent pattern is powerful but complex — incremental adoption is safer.
For foundational MCP concepts, see our What is MCP? guide. For choosing MCP servers across categories, see How to Choose MCP Servers. For database-specific MCP servers (PostgreSQL, MySQL, SQLite, MongoDB, Redis), see our Best Database MCP Servers roundup. For message queue MCP servers beyond Kafka (RabbitMQ, NATS, SQS), see Best Message Queue MCP Servers. For observability and monitoring, see Best Observability MCP Servers. For production deployment patterns, see MCP in Production.
The MCP Ecosystem in Numbers
As of early 2026, the MCP ecosystem has reached significant scale:
- 10,000+ active MCP servers in production
- 500+ MCP clients across major platforms
- 97 million monthly SDK downloads
- MCP is now under the Linux Foundation’s Agentic AI Foundation, co-founded by OpenAI, Anthropic, Google, Microsoft, AWS, and Block in December 2025
The data engineering corner of this ecosystem is growing fast. A year ago, connecting AI agents to data pipelines meant writing custom API integrations. Today, official MCP servers from Astronomer, dbt Labs, Confluent, Snowflake, Google, Fivetran, Airbyte, and AWS cover most of the modern data stack. The gap is narrowing — and the agents are getting smarter about how to use these tools.