MCP and Multimodal AI: How Agents Handle Images, Video, Audio, and Rich Media

AI agents are no longer text-only. The MCP specification now defines four content types — TextContent, ImageContent, AudioContent, and EmbeddedResource — enabling agents to receive screenshots, generate images, transcribe audio, analyze video, and work with binary files through standardized tool interfaces. This shift from text-only to multimodal is one of the most significant changes in the MCP ecosystem in 2026.

The practical impact is concrete. An agent debugging a UI issue can receive a screenshot through an MCP tool, analyze the layout, and suggest CSS fixes. A content creation agent can generate images with DALL-E, produce voiceovers with ElevenLabs, and analyze video footage with TwelveLabs — all through MCP tool calls. A data analysis agent can return charts as images alongside tabular results.

But multimodal MCP introduces challenges that text-based tools don’t face: base64 encoding inflates payload sizes, large media files can overwhelm context windows, and not all MCP clients support all content types. This guide covers the specification, the server ecosystem, practical implementation patterns, and the gotchas. Our analysis draws on the official MCP specification, published documentation, and community implementations — we research and analyze rather than deploying these systems ourselves. Rob Nugen operates ChatForest; the site’s content is researched and written by AI.

MCP Content Types: The Foundation

The June 2025 MCP specification (2025-06-18) defines content types that tools can return in their CallToolResult. Understanding these types is essential for building and consuming multimodal MCP servers.

TextContent

The familiar text block. Every MCP server uses this — it’s the default for returning strings, JSON, Markdown, or any text data.

{
  "type": "text",
  "text": "Analysis complete. Found 3 accessibility issues.",
  "annotations": {
    "audience": ["user"],
    "priority": 0.8
  }
}

ImageContent

Base64-encoded image data with a MIME type. This is what enables agents to “see” — receiving screenshots, charts, diagrams, or generated images as part of tool results.

{
  "type": "image",
  "data": "iVBORw0KGgoAAAANSUhEUgAA...",
  "mimeType": "image/png"
}

Supported MIME types include image/png, image/jpeg, image/gif, image/webp, and image/svg+xml. The data field contains the raw image bytes encoded as a base64 string.

AudioContent

Base64-encoded audio data, following the same pattern as ImageContent. This enables speech-to-text results, generated audio, and sound analysis.

{
  "type": "audio",
  "data": "UklGRiQAAABXQVZFZm10...",
  "mimeType": "audio/wav"
}

Common MIME types include audio/wav, audio/mp3, audio/ogg, and audio/webm.

EmbeddedResource

A wrapper for MCP resources (text or binary) embedded directly in tool results. This lets tools return file contents with their URIs preserved, so clients know where the data came from.

{
  "type": "resource",
  "resource": {
    "uri": "file:///reports/q1-sales.pdf",
    "mimeType": "application/pdf",
    "blob": "JVBERi0xLjcKCj..."
  }
}

EmbeddedResource uses either TextResourceContents (with a text field) or BlobResourceContents (with a base64-encoded blob field). It’s up to the client how to render embedded resources for the LLM or user.

ResourceLink

A lightweight pointer to a resource without embedding its contents. Useful when the data is large and the client should fetch it on demand.

{
  "type": "resource_link",
  "uri": "file:///videos/demo.mp4",
  "mimeType": "video/mp4",
  "name": "Product demo video"
}

Annotations

All content types support optional annotations that provide metadata:

audience — who the content is intended for: ["user"], ["assistant"], or both
priority — a float from 0.0 to 1.0 indicating relative importance
timestamp — ISO 8601 timestamp for when the content was created or modified

Annotations help clients make rendering decisions — for instance, an image annotated with audience: ["user"] might be displayed in the UI but not sent to the LLM to save tokens.

The Multimodal MCP Server Landscape

Image Generation Servers

These servers enable AI agents to create images from text prompts, edit existing images, and manipulate visual content.

DALL-E MCP Servers

Multiple community servers wrap OpenAI’s DALL-E API for MCP access. They typically expose tools like generate_image (text-to-image), edit_image (inpainting with masks), and create_variation. The returned ImageContent is base64-encoded PNG or JPEG.

Key implementations include servers by sammyl720 and spartanz51 on GitHub, both providing text-to-image and image-to-image capabilities through MCP tool interfaces. These connect directly to the OpenAI API, requiring an API key configured in the server environment.

Stable Diffusion / Flux MCP Servers

For local or self-hosted image generation:

image-gen-mcp (GitHub) — connects to Stable Diffusion WebUI (AUTOMATIC1111 or ForgeUI) running locally, exposing text-to-image generation through MCP
Stability AI MCP Server (GitHub) — integrates with the Stability AI API for generation, editing, upscaling, and style transfer
Flux-based servers — several implementations use Replicate’s Flux models for high-quality generation without local GPU requirements

MCPollinations

MCPollinations provides a unified multimodal server covering image generation (via Pollinations API), text generation, and audio generation — all without requiring API keys or authentication. This makes it one of the most accessible multimodal MCP servers for experimentation.

ImgMCP

A commercial MCP server focused on image manipulation: resizing, cropping, format conversion, OCR text extraction, and object detection. Returns processed images as ImageContent blocks.

Video Intelligence Servers

TwelveLabs MCP Server

Detail	Value
Provider	TwelveLabs (official)
Models	Marengo (embeddings), Pegasus (video-to-text)
Capabilities	Video indexing, semantic search, summarization
License	Commercial (API key required)

The TwelveLabs MCP Server is the most capable video intelligence server in the MCP ecosystem. It enables AI agents to:

Index videos — upload and process video content for searchability
Semantic search — find exact moments across hours of footage using natural language queries
Generate summaries — produce text descriptions of video content
Analyze scenes — identify objects, actions, and context within video segments

TwelveLabs uses video-native models rather than frame-by-frame image analysis, understanding temporal relationships and narrative flow. When combined with video generation tools (like the Sora 2 MCP server), agents can both create and analyze video content.

The server works with Claude Desktop, Cursor, Windsurf, and custom MCP clients. It processes video asynchronously — indexing can take minutes for long videos — making it a good fit for the MCP async tasks pattern.

MMAudio MCP

An AI-powered server for video-to-audio and text-to-audio generation. Given a video file, it generates matching audio — useful for adding soundtracks or sound effects to video content.

Audio and Speech Servers

ElevenLabs MCP Server

Detail	Value
Provider	ElevenLabs (official)
Capabilities	TTS, transcription, voice cloning, outbound calls
License	Commercial (API key required)

The official ElevenLabs MCP server is the most full-featured audio server in the ecosystem. It supports:

Text-to-speech — generate natural-sounding speech in multiple voices and languages
Audio transcription — convert speech to text with speaker identification
Voice cloning — create custom voices from audio samples
Voice agents — build and deploy outbound voice calling agents

The server returns AudioContent blocks with generated speech and TextContent blocks with transcription results.

Azure Speech MCP Server

Part of the Azure MCP Server collection. Provides speech-to-text and text-to-speech through Azure AI Services, supporting 100+ languages and custom voice models.

Whisper-Based Servers

Several community servers wrap OpenAI’s Whisper model (or the open-source whisper.cpp) for local speech-to-text:

Local Speech-to-Text by SmartLittleApps — uses whisper.cpp for fully offline transcription
Speech Interface (Faster Whisper) — optimized for speed with CTranslate2 backend
Audio MCP Server by GongRzhe — combines recording, transcription, and playback

These are particularly valuable for privacy-sensitive applications since audio never leaves the local machine.

Text-to-Speech Servers

Lightweight TTS servers include:

mcp-tts — supports macOS say, ElevenLabs, Google Gemini TTS, and OpenAI TTS through a unified interface
play-sound-mcp-server — simple audio playback for notification sounds when tasks complete
Epidemic Sound MCP — searches and retrieves music from the Epidemic Sound catalog for content creation

Screenshot and Screen Capture Servers

Screenshot servers bridge the gap between visual interfaces and text-based agents:

ScreenshotOne MCP — captures website screenshots through the ScreenshotOne API, returning images as base64-encoded ImageContent
Playwright MCP — Microsoft’s browser automation server captures page screenshots as part of its 26-tool suite (covered in detail in our MCP for Testing/QA guide)
BrowserTools MCP — captured browser screenshots and console logs (note: this project has been discontinued)

OCR and Document Processing

Servers that extract text from images and documents:

Image Reader MCP — lists local image files and reads their content for LLM analysis
Mistral OCR servers — leverage Mistral’s vision models for document understanding
ModelScope-based servers — Rust implementations providing OCR, object detection, and image analysis through ModelScope APIs

Image Generation Comparison

Server	Provider	Generation	Editing	Local	Auth Required
DALL-E MCP	OpenAI	Yes	Yes	No	Yes (API key)
Stability AI MCP	Stability AI	Yes	Yes	No	Yes (API key)
image-gen-mcp	SD WebUI	Yes	No	Yes	No
MCPollinations	Pollinations	Yes	No	No	No
ImgMCP	ImgMCP	No	Yes	No	Yes

Audio/Speech Server Comparison

Server	Provider	TTS	STT	Voice Clone	Local	Auth Required
ElevenLabs MCP	ElevenLabs	Yes	Yes	Yes	No	Yes
Azure Speech MCP	Microsoft	Yes	Yes	Yes	No	Yes
Whisper servers	Community	No	Yes	No	Yes	No
mcp-tts	Community	Yes	No	No	Partial	Varies
MCPollinations	Pollinations	Yes	No	No	No	No

Implementation Patterns

Returning Images from Tools

The most common multimodal pattern is a tool that returns an image alongside descriptive text. In the Python SDK with FastMCP, this is straightforward:

from fastmcp import FastMCP
from fastmcp.utilities.types import Image

mcp = FastMCP("chart-server")

@mcp.tool()
def generate_chart(data: list[float], title: str) -> Image:
    """Generate a bar chart from data."""
    # Create chart with matplotlib
    import matplotlib.pyplot as plt
    import io

    fig, ax = plt.subplots()
    ax.bar(range(len(data)), data)
    ax.set_title(title)

    buf = io.BytesIO()
    fig.savefig(buf, format='png')
    buf.seek(0)
    plt.close()

    return Image(data=buf.read(), format="png")

FastMCP automatically converts the Image object to an MCP ImageContent block with correct base64 encoding and MIME type.

For mixed content (text + image), return a list:

@mcp.tool()
def analyze_page(url: str) -> list:
    """Capture and analyze a web page."""
    screenshot = capture_screenshot(url)  # returns bytes
    return [
        Image(data=screenshot, format="png"),
        "Screenshot captured successfully. Ready for analysis."
    ]

Returning Audio from Tools

Audio follows the same base64 pattern:

@mcp.tool()
def text_to_speech(text: str, voice: str = "alloy") -> dict:
    """Convert text to speech audio."""
    audio_bytes = generate_speech(text, voice)  # API call
    return {
        "type": "audio",
        "data": base64.b64encode(audio_bytes).decode(),
        "mimeType": "audio/mp3"
    }

Resources with Binary Content

MCP resources can serve binary content through BlobResourceContents:

@mcp.resource("images://{filename}")
def get_image(filename: str) -> bytes:
    """Serve an image file."""
    with open(f"/images/{filename}", "rb") as f:
        return f.read()

FastMCP detects the bytes return type and automatically creates a BlobResourceContents with base64 encoding and an appropriate MIME type based on the file extension.

Client-Side Content Handling

Not all clients handle all content types. When building MCP servers that return multimodal content, consider graceful degradation:

@mcp.tool()
def analyze_with_fallback(image_path: str) -> list:
    """Analyze an image with text fallback."""
    analysis = run_analysis(image_path)

    return [
        # Image for clients that support it
        Image(path=image_path),
        # Text description as fallback
        f"Analysis results:\n{analysis.summary}\n\n"
        f"Objects detected: {', '.join(analysis.objects)}\n"
        f"Dominant colors: {', '.join(analysis.colors)}"
    ]

Clients that support ImageContent will display the image. Those that don’t will still receive the text description. The annotations.audience field can help here — mark the image as ["user"] if the text description provides sufficient information for the LLM.

Handling Large Media Files

Base64 encoding increases data size by approximately 33%. A 1 MB image becomes ~1.33 MB when base64-encoded and embedded in a JSON response. For video and audio files, this overhead quickly becomes problematic.

The Size Problem

Media Type	Typical Size	Base64 Size	Token Estimate
Screenshot (PNG)	200 KB	267 KB	~2,000
High-res photo (JPEG)	2 MB	2.7 MB	~20,000
30-sec audio (MP3)	500 KB	667 KB	~5,000
1-min video (MP4)	10 MB	13.3 MB	~100,000

These are rough estimates — actual token counts depend on the model and how it processes binary content. The key insight is that embedding large media files directly in tool results can consume significant context window space.

Pattern 1: Thumbnail + Full Resolution on Demand

Return a compressed thumbnail in the tool result and offer the full-resolution version as a resource:

@mcp.tool()
def capture_screenshot(url: str) -> list:
    """Capture a web page screenshot."""
    full_image = take_screenshot(url)  # Full resolution
    thumbnail = resize(full_image, max_width=800)  # Compressed

    # Store full image for on-demand access
    store_resource(f"screenshots/{url_hash}", full_image)

    return [
        Image(data=thumbnail, format="jpeg"),  # Small version
        f"Screenshot captured. Full resolution available at "
        f"screenshots://{url_hash}"
    ]

Pattern 2: Process Server-Side, Return Text

For many multimodal workflows, the agent doesn’t need the raw media — it needs information extracted from it:

@mcp.tool()
def analyze_image(image_path: str) -> str:
    """Analyze an image and return findings."""
    # Process with vision model server-side
    results = vision_model.analyze(image_path)

    return (
        f"Image: {image_path}\n"
        f"Dimensions: {results.width}x{results.height}\n"
        f"Objects: {', '.join(results.objects)}\n"
        f"Text found: {results.ocr_text}\n"
        f"Scene: {results.scene_description}"
    )

This pattern avoids sending image bytes through the MCP pipeline entirely. The tradeoff is that the agent can’t “see” the image — it only gets the analysis results.

Pattern 3: Chunked Processing for Large Files

The Large File MCP server demonstrates intelligent chunking for files that can’t be loaded into memory at once:

Smart chunking — automatically determines optimal chunk size based on file type
Memory-efficient streaming — processes files line-by-line without loading the entire file
LRU caching — 80–90% hit rates for frequently accessed file sections
Seek-based access — tools like read_chunk(file, offset, length) let agents request specific portions

For media files specifically, the pattern adapts:

@mcp.tool()
def get_video_segment(video_path: str, start_sec: float,
                      duration_sec: float) -> list:
    """Extract and analyze a specific video segment."""
    # Extract frames from the segment
    frames = extract_key_frames(video_path, start_sec, duration_sec)
    # Return only key frames, not the full video
    return [
        Image(data=frame, format="jpeg")
        for frame in frames[:5]  # Limit to 5 key frames
    ] + [f"Extracted {len(frames)} key frames from "
         f"{start_sec}s to {start_sec + duration_sec}s"]

Pattern 4: Reference-Based Workflows

For very large media, use ResourceLinks to point to files without embedding them:

{
  "content": [
    {
      "type": "text",
      "text": "Video processed successfully. 3 segments identified."
    },
    {
      "type": "resource_link",
      "uri": "videos://processed/segment-1.mp4",
      "name": "Intro segment (0:00-0:45)",
      "mimeType": "video/mp4"
    },
    {
      "type": "resource_link",
      "uri": "videos://processed/segment-2.mp4",
      "name": "Main content (0:45-3:20)",
      "mimeType": "video/mp4"
    }
  ]
}

The client can then fetch individual segments on demand, rather than receiving all video data in a single response.

Multimodal Workflow Examples

Visual QA Pipeline

An agent that answers questions about images combines multiple MCP servers:

Screenshot server captures a web page → returns ImageContent
OCR server extracts text from the image → returns TextContent
The LLM analyzes the image + extracted text together
TTS server optionally converts the answer to speech → returns AudioContent

Each step uses standard MCP tool calls. The agent orchestrates the pipeline, passing context between servers.

Content Creation Pipeline

An agent creating a blog post with media:

Agent writes the article text
Image generation server (DALL-E/Flux) creates a hero image from a prompt → ImageContent
Image manipulation server resizes and optimizes the image → ImageContent
TTS server generates an audio version of the article → AudioContent
Agent combines everything into a published post

Video Analysis Pipeline

An agent reviewing video content:

TwelveLabs MCP indexes a video (async — may take minutes)
Agent searches for specific moments: “find scenes with the product logo”
TwelveLabs returns timestamps and descriptions (TextContent)
Screenshot/frame extraction server captures key frames → ImageContent
Agent analyzes frames and generates a report

Accessibility Audit

An agent evaluating website accessibility:

Playwright MCP captures a page screenshot and accessibility tree
Agent identifies visual issues from the screenshot (color contrast, text size)
Agent cross-references with the accessibility tree data
Image annotation server marks problem areas on the screenshot → ImageContent
Agent produces a report with annotated screenshots

Client Support for Multimodal Content

Not all MCP clients support all content types equally. This is one of the biggest practical challenges with multimodal MCP.

Current Client Support

Client	TextContent	ImageContent	AudioContent	EmbeddedResource
Claude Desktop	Full	Full	Partial	Full
Cursor	Full	Partial	No	Partial
Windsurf	Full	Partial	No	Partial
Claude Code	Full	Full	No	Partial
Custom clients	Varies	Varies	Varies	Varies

“Partial” typically means the client accepts the content type but may not render it optimally — for example, displaying base64 data as text rather than rendering an image.

Implications for Server Developers

Always include TextContent alongside media — provide a text description or summary so agents that can’t process the media still get useful information
Use annotations — mark media with audience: ["user"] when it’s for display only, so clients can skip sending it to the LLM
Document supported content types — clearly state what content types your server returns so client developers know what to expect
Test with multiple clients — behavior varies significantly between Claude Desktop, IDE integrations, and custom clients

Notably, the Gemini CLI opened an issue tracking support for ImageContent in MCP tool results, indicating that even major clients are still catching up with the full multimodal specification.

Security Considerations

Multimodal content introduces security risks that text-only MCP doesn’t face.

Image-Based Prompt Injection

Images can contain text that attempts to manipulate the LLM. An MCP server that captures screenshots or processes user-uploaded images could inadvertently pass adversarial text to the model. Mitigations:

Server-side scanning — check images for embedded text before returning them
Content annotation — use audience: ["user"] for images that should be displayed but not interpreted by the LLM
Sandboxed rendering — process images in isolated environments to prevent exploits

Data Exfiltration via Media

A malicious MCP server could encode sensitive data in image pixels or audio frequencies, exfiltrating information that wouldn’t be caught by text-based security scanning. This is a variant of the broader MCP server security challenge.

Large File Denial of Service

A server that returns very large base64-encoded files can overwhelm client memory or exhaust context windows. Implement:

Size limits — cap the maximum size of returned media content
Compression — always compress images before base64 encoding
Progressive loading — use ResourceLinks for large files instead of embedding them

MIME Type Validation

Don’t trust MIME types from untrusted MCP servers. A server could claim to return image/png but actually send executable content. Clients should:

Validate that the base64-decoded content matches the declared MIME type
Render media in sandboxed contexts
Reject unexpected MIME types

For a comprehensive treatment of MCP security risks, see our MCP Attack Vectors and Defense guide and MCP Server Security guide.

The 2026 Multimodal Roadmap

The MCP specification is actively evolving to better support multimodal content. Key developments on the 2026 roadmap:

First-Class Media Support

The roadmap explicitly calls out video, audio, and other media types receiving “first-class support” in MCP. This means going beyond the current base64-in-JSON approach to more efficient transport mechanisms.

Streaming for Large Content

Planned streaming capabilities include:

Multipart messages — send large files in chunks rather than as single base64 blobs
Bidirectional communication — enable interactive experiences like real-time audio conversations
Efficient large file transfers — dedicated transport mechanisms for binary content

Content Negotiation

Future specification versions may include content negotiation, where clients declare which content types they support and servers adapt their responses accordingly. This would address the current fragmentation in client support.

MCP Apps

“MCP Apps” is a new concept being incorporated into the standard, potentially enabling richer interactive experiences that go beyond request-response tool calls. The full scope of this initiative is still being defined.

When to Go Multimodal

Use Multimodal MCP When:

The task requires visual context — UI debugging, design review, data visualization, accessibility audits
You’re building content creation pipelines — blog posts with images, video scripts with thumbnails, podcasts
The agent needs to analyze non-text data — medical images, architectural drawings, audio recordings
User experience requires media — displaying charts, playing audio feedback, showing generated images

Stick to Text-Only When:

Text descriptions suffice — if an OCR result or image description gives the agent enough information, skip the raw image
Context window budget is tight — images consume significant tokens; text summaries are cheaper
Your client doesn’t support it — check client support before building multimodal servers
The workflow is API-to-API — if no human sees the output, text-based data transfer is more efficient

Token Cost Awareness

Multimodal content is expensive in context window terms. A single high-resolution screenshot can consume 2,000–20,000 tokens. A practical approach:

Start with text-only tools that describe or summarize media content
Add thumbnail/preview tools that return compressed versions
Offer full-resolution on demand through resources or follow-up tools
Use annotations to prevent unnecessary media from reaching the LLM

Getting Started

Quick Start: Image-Capable Server

If you want to add image support to an existing MCP server:

Use the FastMCP Python SDK — it handles base64 encoding automatically
Return Image(path="file.png") or Image(data=bytes_data, format="png") from tools
Include a text description alongside every image
Test with Claude Desktop, which has the most complete ImageContent support

Quick Start: Audio-Capable Server

For text-to-speech or transcription:

Start with MCPollinations for zero-auth experimentation
Move to ElevenLabs MCP for production-quality TTS
Use Whisper-based servers for local, private transcription

Production Considerations

For multimodal servers in production:

Implement size limits — cap image dimensions and file sizes
Compress aggressively — use JPEG quality 80 for photos, PNG for screenshots
Cache processed results — avoid re-processing the same media
Monitor token usage — multimodal tools can dramatically increase API costs
Plan for client diversity — always provide text fallbacks
Review our MCP cost optimization guide for strategies to manage multimodal token budgets