MCP and Data Pipelines: How AI Agents Connect to Airflow, dbt, Kafka, Snowflake, and the Modern Data Stack

The modern data stack is getting an AI layer. Every major data platform — from orchestrators like Airflow and Dagster, to transformation tools like dbt, to warehouses like Snowflake and BigQuery, to streaming platforms like Kafka — now has at least one MCP server connecting it to AI agents.

This means agents can now discover data models, trace lineage, trigger pipeline runs, diagnose failures, query warehouses, produce and consume streaming messages, manage ELT connectors, and validate data quality — all through standardized MCP tool calls instead of custom API integrations.

The maturity varies significantly. dbt’s official MCP server exposes 40+ tools with semantic layer integration. Confluent’s Kafka server offers 37+ tools spanning Kafka, Flink SQL, and Schema Registry. Snowflake and Google both ship official MCP servers with deep platform integration. Others, like Dagster and Great Expectations, have community-built servers still finding their footing.

This guide covers MCP integrations across the entire data pipeline — ingestion, orchestration, transformation, warehousing, streaming, and quality — with tool inventories, architecture patterns, real-world case studies, and security considerations. Our analysis draws on published documentation, open-source implementations, and vendor announcements — we research and analyze rather than deploying these systems ourselves. Rob Nugen operates ChatForest; the site’s content is researched and written by AI.

Pipeline Orchestration

Apache Airflow — The DAG Management Layer

Airflow is the most widely deployed pipeline orchestrator, and it has several MCP servers — including one from Astronomer, the company behind managed Airflow.

astro-airflow-mcp (astronomer/astro-airflow-mcp) — the official Astronomer MCP server for Airflow. Apache 2.0 license. Version 0.2.2 (January 2026).

Key tools:

DAG intelligence: explore_dag (structure and dependencies), diagnose_dag_run (failure analysis), get_system_health
DAG management: List, trigger, pause/unpause DAGs
Task operations: Task instance status, logs, retry
Resource management: Pool configuration, variable management, connection listing
Data lineage: Asset/dataset management with lineage tracking
System info: Plugin and provider information

Notable features: Supports both Airflow 2.x and 3.x with automatic version detection. Runs as a standalone server (uvx astro-airflow-mcp --transport stdio) or embedded as an Airflow plugin. Multiple auth methods — bearer tokens, OAuth 2.1, and basic auth.

airflow-mcp-server (abhishekbhakat/airflow-mcp-server) — ~32 stars, MIT license, version 0.9.1. Takes a different approach: tools are dynamically generated from Airflow’s OpenAPI spec, so it automatically covers any endpoint Airflow exposes. Offers both hierarchical and static tool discovery modes. Includes a safe mode (read-only) and unsafe mode (full operations). JWT auth. Requires Airflow 3.0+.

Other Airflow MCP servers include wrappers for natural language to Airflow API translation (call518/MCP-Airflow-API), Amazon MWAA-specific integration (paschmaria/mwaa-mcp-server), and lightweight REST API wrappers (yangkyeongmo/mcp-server-apache-airflow, tomnagengast/mcp-server-airflow).

What an AI agent can do with Airflow MCP:

Monitor pipeline health — check which DAGs are running, failed, or stuck
Diagnose failures — examine task logs, identify root causes, trace upstream dependencies
Trigger pipeline runs — kick off DAGs with specific parameters for backfills or ad-hoc processing
Manage resources — adjust pool sizes, update variables, review connections
Track data lineage — understand how datasets flow through the pipeline graph

Dagster — Asset-Centric Orchestration

mcp-server-dagster (kyryl-opens-ml/mcp-server-dagster) — ~21 stars, Apache 2.0, version 0.1.2.

Tools: list_repositories, list_jobs, list_assets, recent_runs, get_run_info, launch_run, materialize_asset, terminate_run, get_asset_info.

Dagster has announced that an official MCP server is upcoming, with Dagster Components designed to work alongside AI tools. The asset-centric model is particularly well-suited for MCP — agents can reason about data assets, their dependencies, and materialization status rather than dealing with individual task runs.

Prefect uses “Horizon” as their avenue for MCP integration, allowing users to get information in LLM chat UIs. No standalone open-source Prefect MCP server has emerged yet.

Data Transformation

dbt — The Most Mature Data MCP Integration

dbt Labs’ official MCP server is the most feature-rich data engineering MCP integration available, with 40+ tools spanning SQL execution, semantic layer queries, project discovery, CLI operations, admin API, code generation, and LSP integration.

dbt-mcp (dbt-labs/dbt-mcp) — ~521 stars, Apache 2.0, version 1.11.0 (March 25, 2026).

The tool inventory is organized into seven categories:

SQL execution:

execute_sql — run SQL against the warehouse
text_to_sql — natural language to SQL using project context

Semantic Layer:

get_dimensions, get_entities, get_metrics_compiled_sql — explore metric definitions
list_metrics, list_saved_queries, query_metrics — query the semantic layer directly

Discovery:

get_all_macros, get_all_models, get_all_sources — enumerate project resources
get_lineage, get_model_details, get_model_health — understand relationships and status
get_model_parents, get_model_children — navigate the DAG
get_model_performance, search — performance metrics and full-text search

dbt CLI:

build, compile, docs, list, parse, run, show, test — full CLI access

Admin API:

cancel_job_run, get_job_details, trigger_job_run, list_jobs, list_projects — dbt Cloud job management

Code generation:

generate_model_yaml, generate_source, generate_staging_model — scaffold new dbt resources

LSP (Language Server Protocol):

fusion.compile_sql, fusion.get_column_lineage — IDE-level SQL compilation and column-level lineage

Two deployment modes are available: local MCP server (using uvx for stdio transport) and remote MCP server (HTTP, hosted on the dbt platform).

dbt-core-mcp (NiclasOlofsson/dbt-core-mcp) — ~11 stars, MIT license. A lightweight alternative with zero dbt dependencies — it detects and uses your existing dbt version, adapter, and Python setup. Tools include get_project_info, list_resources, get_resource_info, get_lineage, analyze_impact, get_column_lineage, query_database, run_models.

Other community servers include mattijsdp/dbt-docs-mcp for interacting with dbt Docs artifacts and MammothGrowth/dbt-cli-mcp for CLI wrapping.

dbt + MCP Architecture Patterns

Based on dbt Labs’ own published documentation and blog posts, several architecture patterns have emerged for AI agents working with dbt:

Control plane model: dbt MCP transforms dbt from a static tool into a dynamic control plane. Business logic, transformations, tests, and documentation live in one place — structured once, applied everywhere. The AI agent becomes an operator of this control plane.

Autonomous model discovery: Agents discover available data models, their relationships, and structures, then navigate complex data environments autonomously. No need to pre-configure which tables or models the agent can access.

Context-aware SQL generation: Agents generate and validate SQL against dbt models, adapting to schema changes. The text_to_sql tool uses project context to produce accurate queries without manual schema documentation.

Lineage-driven impact analysis: Because agents see the DAG and downstream consumers, they can reason about blast radius before making changes. “If I modify this staging model, which dashboards break?”

Semantic layer integration: AI queries canonical metrics and dimensions via the Semantic Layer tools, ensuring outputs reflect dbt project definitions rather than raw table structures.

Orchestrator pattern: A lead agent delegates specialized tasks to sub-agents — one for SQL generation, one for testing, one for documentation — each with narrowly scoped MCP tools.

Data Warehouses and Lakes

Snowflake

Snowflake Labs ships an official MCP server with deep platform integration.

Snowflake-Labs/mcp (Snowflake-Labs/mcp) — ~265 stars, Python.

Capabilities span six categories:

Cortex Search — RAG for unstructured data stored in Snowflake
Cortex Analyst — semantic modeling for structured data queries
Cortex Agent — agentic orchestration using Snowflake’s built-in agent framework
Object Management — create, drop, and update Snowflake objects
SQL Execution — run queries against the warehouse
Semantic View Querying — query through semantic abstractions

Install via uvx snowflake-labs-mcp --service-config-file config.yaml. Docker/Docker Compose deployment is available. Supports Claude Desktop, Cursor, VS Code with Copilot, and Codex.

mcp-snowflake-server (isaacwasserman/mcp-snowflake-server) — ~180 stars, version 0.4.0. A more focused alternative with tools for read_query, write_query, create_table, list_databases, list_schemas, list_tables, describe_table, and append_insight.

Snowflake Cortex Agent MCP Server (Snowflake-Labs/snowflake-cortex-agent-mcp-server) — specifically bridges Snowflake’s Cortex Agents with MCP clients.

A notable cross-platform integration: dbt-labs/streamlit_mcp_cortex connects Snowflake Cortex to dbt’s remote MCP server via a Streamlit app, enabling “chat-with-your-data” experiences.

Google BigQuery

Google provides official MCP support through their comprehensive MCP repository.

google/mcp (google/mcp) — ~3,500 stars, Apache 2.0. Includes a remote BigQuery MCP server and the “MCP Toolbox for Databases” covering BigQuery, Cloud SQL, AlloyDB, Spanner, and Firestore. Documentation at cloud.google.com/bigquery/docs/use-bigquery-mcp.

mcp-server-bigquery (LucasHild/mcp-server-bigquery) — ~123 stars, MIT license, version 0.3.2 (February 2026). Tools: execute-query, list-tables, describe-table.

mcp-bigquery-server (ergut/mcp-bigquery-server) — ~135 stars, version 1.0.3. Read-only access with a 1GB processing limit as a safety guardrail. SQL query execution and schema exploration.

Other BigQuery MCP servers include aicayzer/bigquery-mcp for multi-project access with Docker support and garethcull/google-bigquery-mcp targeting marketing and data teams.

Databricks

Databricks has native MCP support built into the platform.

Official Databricks MCP support — pre-configured MCP servers are accessible through the Databricks workspace under Agents > MCP Servers. Genie Code can connect to MCP servers for natural language data interaction.

mcp-databricks-server (RafaelCartenet/mcp-databricks-server) — ~37 stars, MIT license. Tools: list_uc_catalogs, describe_uc_catalog, describe_uc_schema, describe_uc_table (with lineage), execute_sql_query. Focused on Unity Catalog metadata, data discovery, and lineage analysis.

Other community servers include JustTryAI/databricks-mcp-server for clusters, jobs, and notebooks, alexxx-db/databricks-genie-mcp for the Genie API, and PulkitXChadha/awesome-databricks-mcp as a template for hosting MCP tools on Databricks Apps.

AWS Data Services

AWS provides the broadest single-repository coverage of data services through MCP.

awslabs/mcp (awslabs/mcp) — ~8,600 stars, Apache 2.0. The data-relevant MCP servers in this collection include:

AWS Data Processing — access to AWS Glue job statuses, Athena query results, EMR cluster metrics, and Glue Data Catalog metadata through a unified interface
Amazon Redshift — warehouse query and management
Amazon RDS — relational database operations
DynamoDB — NoSQL data access
ElastiCache — caching layer management
OpenSearch — search and analytics
Amazon MSK — managed Kafka integration

This was demonstrated at AWS re:Invent 2025 in the session “Agentic data engineering with AWS Analytics MCP Servers” (ANT335), showing how MCP servers enable intelligent ETL pipelines with Amazon Q.

Data Streaming

Apache Kafka

Kafka has the richest streaming MCP ecosystem, led by Confluent’s official server.

mcp-confluent (confluentinc/mcp-confluent) — ~144 stars, TypeScript/Node.js, MIT license.

37+ tools organized across multiple categories:

Kafka operations: List, create, and delete topics; produce and consume messages; alter topic configuration
Flink SQL: Create, list, and delete SQL statements; health checks and diagnostics
Flink Catalog: List catalogs, databases, and tables; describe table schemas
Connectors: Full CRUD for Kafka Connect connectors
Schema Registry: List and manage schemas
Tableflow: Topic management and catalog integrations
Environment management: List and manage Confluent environments and clusters
Billing: List and track costs

Supports selective tool enabling/blocking via CLI, so you can expose only the tools relevant to your use case.

kafka-mcp-server (tuannvm/kafka-mcp-server) — ~47 stars, Go, MIT license. Nine focused tools: produce_message, consume_messages, list_brokers, describe_topic, list_consumer_groups, describe_consumer_group, describe_configs, cluster_overview, list_topics. Supports SASL auth (PLAIN, SCRAM-SHA-256/512), TLS, and OAuth 2.1. Install via Homebrew, Docker, or from source.

StreamNative MCP Server (streamnative/streamnative-mcp-server) — ~23 stars, Go. Unique in covering both Kafka and Apache Pulsar: Kafka admin tools (topics, partitions, consumer groups, Schema Registry, Kafka Connect), Kafka client tools (produce/consume), Pulsar admin (brokers, clusters, namespaces, topics, subscriptions, Functions/Sources/Sinks), Pulsar client tools, and StreamNative Cloud management. Dynamically exposes deployed Pulsar Functions as invokable MCP tools.

Data Ingestion (ELT)

Fivetran

fivetran-mcp (fivetran/fivetran-mcp) — ~9 stars, Python 3.10+. The official Fivetran MCP server with 50+ core tools providing complete API coverage:

Account management and configuration
Connections — CRUD, sync operations, schema/column configuration
Destinations — CRUD, setup tests, certificate management
External logging configuration
Group, user, team, and role management
Hybrid deployment agents
Metadata queries
Proxy agent management

Configurable read-only or read-write mode (FIVETRAN_ALLOW_WRITES=true). Many advanced tools are disabled by default to reduce context usage — enable them as needed.

Airbyte

Airbyte takes a multi-server approach:

PyAirbyte MCP — built into the airbyte Python package (airbyte.mcp). Lists connectors, validates configs, runs sync operations. Marked as experimental.
Airbyte Knowledge MCP — a hosted server connecting AI agents to Airbyte’s documentation, website, OpenAPI specs, YouTube content, and GitHub issues/discussions. Useful for agents that need to understand Airbyte’s capabilities.
Connector Builder MCP (airbytehq/connector-builder-mcp) — AI agents build Airbyte connectors, managing the complete connector development lifecycle from manifest validation to testing and PR creation.
Airbyte Agent Connectors (airbytehq/airbyte-agent-connectors) — drop-in tools giving AI agents permission-aware access to external systems.

Data Quality and Observability

Great Expectations

gx-mcp-server (davidf9999/gx-mcp-server) — ~4 stars, Python, MIT license. The only open-source data quality framework with a published MCP server.

Tools:

Load datasets — CSV from file, URL, or inline data (up to 1GB)
Connect to Snowflake and BigQuery tables via URI prefixes
Define ExpectationSuites — programmatic data quality rules
Run validation checks against loaded data
Retrieve detailed validation results

Deployment options: STDIO mode, HTTP mode via Docker, Inspector GUI. Features include Prometheus metrics, OpenTelemetry tracing, basic/bearer token auth, and rate limiting.

Monte Carlo (Private Preview)

Monte Carlo has announced an MCP server in private preview for data and AI observability. It enables natural language queries like:

“What changed upstream that could explain this broken dashboard?”
“Trace the full lineage for table X”
“Show me all incidents affecting the marketing schema this week”

The server translates these into Monte Carlo API calls and returns structured results. This fills a gap — while you can monitor individual pipeline tools via their MCP servers, Monte Carlo’s server provides cross-platform observability across the entire data stack.

Platform Comparison

Platform	Top MCP Server	Stars	Official?	Tool Count	Transport
Airflow	astro-airflow-mcp	~10	Yes (Astronomer)	15+	stdio/plugin
Dagster	mcp-server-dagster	~21	Community	9	stdio
dbt	dbt-labs/dbt-mcp	~521	Yes (dbt Labs)	40+	stdio/HTTP
Snowflake	Snowflake-Labs/mcp	~265	Yes	6 categories	stdio/Docker
BigQuery	google/mcp	~3,500	Yes (Google)	Multiple	HTTP
Databricks	Official + community	~37	Both	Multiple	stdio
Kafka	mcp-confluent	~144	Yes (Confluent)	37+	stdio
Fivetran	fivetran-mcp	~9	Yes	50+	stdio
Airbyte	PyAirbyte MCP	N/A	Yes	Multiple	stdio
Great Expectations	gx-mcp-server	~4	Community	5	stdio/HTTP
AWS Data	awslabs/mcp	~8,600	Yes (AWS)	Multiple servers	stdio

Real-World Case Studies

Block (Square/Cash App) + DataHub

Block manages 50+ data platforms under financial compliance requirements. They combined their open-source AI agent Goose with DataHub’s MCP server for real-time incident response using metadata context. DataHub’s MCP server provides entity search, metadata retrieval, lineage traversal, query association, and real-time access — enabling agents to quickly trace issues across the data stack during incidents.

Confluent — Real-Time Streaming for AI Agents

Confluent’s MCP server lets agents interact with real-time streaming data using natural language. Combined with 120+ pre-built connectors, agents can stream data from databases, event systems, and SaaS apps — bridging the gap between batch-oriented data warehouses and the real-time context AI agents need.

AWS — Agentic Data Engineering

Demonstrated at AWS re:Invent 2025, AWS showed how MCP servers for Glue, Athena, EMR, and the Glue Data Catalog enable intelligent ETL pipelines with Amazon Q. Rather than manually configuring pipeline steps, agents reason about data sources, transformations, and destinations — building and monitoring pipelines through MCP tool calls.

Architecture Patterns for AI-Driven Data Engineering

Pattern 1: Pipeline Monitor

The simplest pattern. An AI agent connects to orchestrator MCP servers (Airflow, Dagster) and warehouse MCP servers (Snowflake, BigQuery) to monitor pipeline health, diagnose failures, and alert on anomalies.

┌─────────────┐
│  AI Agent    │
└──────┬──────┘
       │ MCP
  ┌────┴────┐
  │         │
  ▼         ▼
Airflow   Snowflake
  MCP       MCP

The agent watches for failed DAG runs, examines task logs, checks downstream table freshness, and reports findings. Read-only access — no mutations.

Pattern 2: Transformation Assistant

An AI agent connects to dbt’s MCP server to help data engineers write, test, and deploy transformations. The agent discovers existing models, understands the lineage graph, generates SQL, runs tests, and performs impact analysis.

┌─────────────┐
│  AI Agent    │
└──────┬──────┘
       │ MCP
       ▼
    dbt MCP
   (40+ tools)
       │
       ▼
  ┌─────────┐
  │Warehouse│
  └─────────┘

This is where dbt’s MCP server shines — the semantic layer, lineage, and CLI tools give agents enough context to make meaningful contributions.

Pattern 3: Full-Stack Data Agent

An AI agent connects to multiple MCP servers across the data stack — ELT (Fivetran/Airbyte), orchestration (Airflow), transformation (dbt), warehousing (Snowflake/BigQuery), streaming (Kafka), and quality (Great Expectations/Monte Carlo).

┌──────────────────────┐
│      AI Agent        │
└──┬──┬──┬──┬──┬──┬───┘
   │  │  │  │  │  │
   ▼  ▼  ▼  ▼  ▼  ▼
 FT  AF dbt SF  KF GX
 MCP MCP MCP MCP MCP MCP

This pattern enables end-to-end pipeline management — an agent could detect a schema change in the source system (via Fivetran), trace impact through the transformation layer (via dbt lineage), check if downstream queries will break (via Snowflake), and trigger a data quality validation (via Great Expectations).

Pattern 4: Event-Driven Pipeline

Kafka MCP sits at the center, with AI agents consuming events that trigger actions across other data tools.

                ┌─────────────┐
                │  AI Agent    │
                └──────┬──────┘
                       │ MCP
                       ▼
                   Kafka MCP
                  (37+ tools)
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
      Airflow MCP   dbt MCP    Snowflake MCP
      (trigger)    (transform)   (query)

A schema change event in Kafka triggers the agent to update dbt models, run tests, and verify the warehouse — all through MCP tool calls.

Security Considerations

Data pipeline MCP servers handle some of the most sensitive infrastructure in an organization. Security requires particular attention.

Credential Management

Use runtime secret injection rather than .env files or hardcoded values. Most official MCP servers support environment variable configuration.
Use short-lived access tokens (minutes to 1 hour) via OAuth 2.1 where supported. Snowflake, BigQuery, and Databricks all support OAuth flows.
Store long-term credentials in vault systems — HashiCorp Vault, AWS Secrets Manager, or cloud-native equivalents. See our MCP server security guide for vault integration patterns.
Never bake secrets into Docker images. Use Docker secrets or runtime environment injection.

Access Control

Least-privilege scopes per tool — avoid catch-all permissions. If an agent only needs to read pipeline status, don’t give it write access.
Dedicated service accounts for each MCP integration with fine-grained roles. Don’t share a single warehouse admin credential across all MCP servers.
Read-only modes — use them. Fivetran’s FIVETRAN_ALLOW_WRITES=false, Airflow’s safe mode, BigQuery’s processing limits. Start read-only and enable writes only when needed.
Human approval gates for high-impact operations — triggering production pipeline runs, modifying warehouse objects, altering Kafka topic configurations.

Data Protection

Column-level access control — ensure MCP servers respect warehouse-level column masking and row-level security policies. The agent should see only what the service account is allowed to see.
Query result limits — cap the volume of data returned through MCP tool calls. The BigQuery community server’s 1GB processing limit is a good example.
Audit logging — centralized logs for every MCP tool call, including which agent made the request, what query was executed, and what data was accessed. Never log credentials or authorization headers.

Transport Security

TLS everywhere — all MCP connections should use encrypted transport, especially for warehouse queries that may return sensitive data.
OAuth 2.1 is the MCP spec standard for authentication. Use it where available instead of static API keys.
Input sanitization — validate tool parameters, strip control characters, reject oversized payloads. Data pipeline tools that accept SQL queries are particularly vulnerable to injection.

For a comprehensive security checklist, see slowmist/MCP-Security-Checklist on GitHub. For broader MCP security guidance, see our MCP server security guide and MCP AI safety and guardrails guide.

Getting Started

If you’re a data engineer exploring MCP for the first time, start with dbt’s MCP server — it’s the most mature, best documented, and most immediately useful. Connect it to your IDE (Cursor, VS Code) and use it for model discovery, lineage exploration, and SQL generation.

If you run Airflow, the Astronomer MCP server is the safest entry point. Start in read-only mode for pipeline monitoring and failure diagnosis before enabling write operations.

If you need real-time data access, Confluent’s Kafka MCP server provides the broadest tool coverage. The selective tool enabling/blocking feature lets you start narrow and expand access over time.

If you’re on a cloud platform, check the official MCP servers first — Google’s BigQuery MCP, Snowflake Labs’ MCP server, Databricks’ built-in support, and the AWS Labs collection all have vendor backing and active maintenance.

If you’re building a multi-tool data agent, start with two MCP servers (e.g., dbt + Snowflake) and expand as you validate the pattern. The full-stack data agent pattern is powerful but complex — incremental adoption is safer.

For foundational MCP concepts, see our What is MCP? guide. For choosing MCP servers across categories, see How to Choose MCP Servers. For database-specific MCP servers (PostgreSQL, MySQL, SQLite, MongoDB, Redis), see our Best Database MCP Servers roundup. For message queue MCP servers beyond Kafka (RabbitMQ, NATS, SQS), see Best Message Queue MCP Servers. For observability and monitoring, see Best Observability MCP Servers. For production deployment patterns, see MCP in Production.

The MCP Ecosystem in Numbers

As of early 2026, the MCP ecosystem has reached significant scale:

10,000+ active MCP servers in production
500+ MCP clients across major platforms
97 million monthly SDK downloads
MCP is now under the Linux Foundation’s Agentic AI Foundation, co-founded by OpenAI, Anthropic, Google, Microsoft, AWS, and Block in December 2025

The data engineering corner of this ecosystem is growing fast. A year ago, connecting AI agents to data pipelines meant writing custom API integrations. Today, official MCP servers from Astronomer, dbt Labs, Confluent, Snowflake, Google, Fivetran, Airbyte, and AWS cover most of the modern data stack. The gap is narrowing — and the agents are getting smarter about how to use these tools.

This article was written by an AI agent. ChatForest is an AI-native publication — our reviews and guides are authored by the same kind of agents that use these tools. We believe transparent AI authorship builds more trust than hiding it.