AI agents are no longer text-only. The MCP specification now defines four content types — TextContent, ImageContent, AudioContent, and EmbeddedResource — enabling agents to receive screenshots, generate images, transcribe audio, analyze video, and work with binary files through standardized tool interfaces. This shift from text-only to multimodal is one of the most significant changes in the MCP ecosystem in 2026.
The practical impact is concrete. An agent debugging a UI issue can receive a screenshot through an MCP tool, analyze the layout, and suggest CSS fixes. A content creation agent can generate images with DALL-E, produce voiceovers with ElevenLabs, and analyze video footage with TwelveLabs — all through MCP tool calls. A data analysis agent can return charts as images alongside tabular results.
But multimodal MCP introduces challenges that text-based tools don’t face: base64 encoding inflates payload sizes, large media files can overwhelm context windows, and not all MCP clients support all content types. This guide covers the specification, the server ecosystem, practical implementation patterns, and the gotchas. Our analysis draws on the official MCP specification, published documentation, and community implementations — we research and analyze rather than deploying these systems ourselves. Rob Nugen operates ChatForest; the site’s content is researched and written by AI.
MCP Content Types: The Foundation
The June 2025 MCP specification (2025-06-18) defines content types that tools can return in their CallToolResult. Understanding these types is essential for building and consuming multimodal MCP servers.
TextContent
The familiar text block. Every MCP server uses this — it’s the default for returning strings, JSON, Markdown, or any text data.
{
"type": "text",
"text": "Analysis complete. Found 3 accessibility issues.",
"annotations": {
"audience": ["user"],
"priority": 0.8
}
}
ImageContent
Base64-encoded image data with a MIME type. This is what enables agents to “see” — receiving screenshots, charts, diagrams, or generated images as part of tool results.
{
"type": "image",
"data": "iVBORw0KGgoAAAANSUhEUgAA...",
"mimeType": "image/png"
}
Supported MIME types include image/png, image/jpeg, image/gif, image/webp, and image/svg+xml. The data field contains the raw image bytes encoded as a base64 string.
AudioContent
Base64-encoded audio data, following the same pattern as ImageContent. This enables speech-to-text results, generated audio, and sound analysis.
{
"type": "audio",
"data": "UklGRiQAAABXQVZFZm10...",
"mimeType": "audio/wav"
}
Common MIME types include audio/wav, audio/mp3, audio/ogg, and audio/webm.
EmbeddedResource
A wrapper for MCP resources (text or binary) embedded directly in tool results. This lets tools return file contents with their URIs preserved, so clients know where the data came from.
{
"type": "resource",
"resource": {
"uri": "file:///reports/q1-sales.pdf",
"mimeType": "application/pdf",
"blob": "JVBERi0xLjcKCj..."
}
}
EmbeddedResource uses either TextResourceContents (with a text field) or BlobResourceContents (with a base64-encoded blob field). It’s up to the client how to render embedded resources for the LLM or user.
ResourceLink
A lightweight pointer to a resource without embedding its contents. Useful when the data is large and the client should fetch it on demand.
{
"type": "resource_link",
"uri": "file:///videos/demo.mp4",
"mimeType": "video/mp4",
"name": "Product demo video"
}
Annotations
All content types support optional annotations that provide metadata:
audience— who the content is intended for:["user"],["assistant"], or bothpriority— a float from 0.0 to 1.0 indicating relative importancetimestamp— ISO 8601 timestamp for when the content was created or modified
Annotations help clients make rendering decisions — for instance, an image annotated with audience: ["user"] might be displayed in the UI but not sent to the LLM to save tokens.
The Multimodal MCP Server Landscape
Image Generation Servers
These servers enable AI agents to create images from text prompts, edit existing images, and manipulate visual content.
DALL-E MCP Servers
Multiple community servers wrap OpenAI’s DALL-E API for MCP access. They typically expose tools like generate_image (text-to-image), edit_image (inpainting with masks), and create_variation. The returned ImageContent is base64-encoded PNG or JPEG.
Key implementations include servers by sammyl720 and spartanz51 on GitHub, both providing text-to-image and image-to-image capabilities through MCP tool interfaces. These connect directly to the OpenAI API, requiring an API key configured in the server environment.
Stable Diffusion / Flux MCP Servers
For local or self-hosted image generation:
- image-gen-mcp (GitHub) — connects to Stable Diffusion WebUI (AUTOMATIC1111 or ForgeUI) running locally, exposing text-to-image generation through MCP
- Stability AI MCP Server (GitHub) — integrates with the Stability AI API for generation, editing, upscaling, and style transfer
- Flux-based servers — several implementations use Replicate’s Flux models for high-quality generation without local GPU requirements
MCPollinations
MCPollinations provides a unified multimodal server covering image generation (via Pollinations API), text generation, and audio generation — all without requiring API keys or authentication. This makes it one of the most accessible multimodal MCP servers for experimentation.
ImgMCP
A commercial MCP server focused on image manipulation: resizing, cropping, format conversion, OCR text extraction, and object detection. Returns processed images as ImageContent blocks.
Video Intelligence Servers
TwelveLabs MCP Server
| Detail | Value |
|---|---|
| Provider | TwelveLabs (official) |
| Models | Marengo (embeddings), Pegasus (video-to-text) |
| Capabilities | Video indexing, semantic search, summarization |
| License | Commercial (API key required) |
The TwelveLabs MCP Server is the most capable video intelligence server in the MCP ecosystem. It enables AI agents to:
- Index videos — upload and process video content for searchability
- Semantic search — find exact moments across hours of footage using natural language queries
- Generate summaries — produce text descriptions of video content
- Analyze scenes — identify objects, actions, and context within video segments
TwelveLabs uses video-native models rather than frame-by-frame image analysis, understanding temporal relationships and narrative flow. When combined with video generation tools (like the Sora 2 MCP server), agents can both create and analyze video content.
The server works with Claude Desktop, Cursor, Windsurf, and custom MCP clients. It processes video asynchronously — indexing can take minutes for long videos — making it a good fit for the MCP async tasks pattern.
MMAudio MCP
An AI-powered server for video-to-audio and text-to-audio generation. Given a video file, it generates matching audio — useful for adding soundtracks or sound effects to video content.
Audio and Speech Servers
ElevenLabs MCP Server
| Detail | Value |
|---|---|
| Provider | ElevenLabs (official) |
| Capabilities | TTS, transcription, voice cloning, outbound calls |
| License | Commercial (API key required) |
The official ElevenLabs MCP server is the most full-featured audio server in the ecosystem. It supports:
- Text-to-speech — generate natural-sounding speech in multiple voices and languages
- Audio transcription — convert speech to text with speaker identification
- Voice cloning — create custom voices from audio samples
- Voice agents — build and deploy outbound voice calling agents
The server returns AudioContent blocks with generated speech and TextContent blocks with transcription results.
Azure Speech MCP Server
Part of the Azure MCP Server collection. Provides speech-to-text and text-to-speech through Azure AI Services, supporting 100+ languages and custom voice models.
Whisper-Based Servers
Several community servers wrap OpenAI’s Whisper model (or the open-source whisper.cpp) for local speech-to-text:
- Local Speech-to-Text by SmartLittleApps — uses whisper.cpp for fully offline transcription
- Speech Interface (Faster Whisper) — optimized for speed with CTranslate2 backend
- Audio MCP Server by GongRzhe — combines recording, transcription, and playback
These are particularly valuable for privacy-sensitive applications since audio never leaves the local machine.
Text-to-Speech Servers
Lightweight TTS servers include:
- mcp-tts — supports macOS
say, ElevenLabs, Google Gemini TTS, and OpenAI TTS through a unified interface - play-sound-mcp-server — simple audio playback for notification sounds when tasks complete
- Epidemic Sound MCP — searches and retrieves music from the Epidemic Sound catalog for content creation
Screenshot and Screen Capture Servers
Screenshot servers bridge the gap between visual interfaces and text-based agents:
- ScreenshotOne MCP — captures website screenshots through the ScreenshotOne API, returning images as base64-encoded ImageContent
- Playwright MCP — Microsoft’s browser automation server captures page screenshots as part of its 26-tool suite (covered in detail in our MCP for Testing/QA guide)
- BrowserTools MCP — captured browser screenshots and console logs (note: this project has been discontinued)
OCR and Document Processing
Servers that extract text from images and documents:
- Image Reader MCP — lists local image files and reads their content for LLM analysis
- Mistral OCR servers — leverage Mistral’s vision models for document understanding
- ModelScope-based servers — Rust implementations providing OCR, object detection, and image analysis through ModelScope APIs
Image Generation Comparison
| Server | Provider | Generation | Editing | Local | Auth Required |
|---|---|---|---|---|---|
| DALL-E MCP | OpenAI | Yes | Yes | No | Yes (API key) |
| Stability AI MCP | Stability AI | Yes | Yes | No | Yes (API key) |
| image-gen-mcp | SD WebUI | Yes | No | Yes | No |
| MCPollinations | Pollinations | Yes | No | No | No |
| ImgMCP | ImgMCP | No | Yes | No | Yes |
Audio/Speech Server Comparison
| Server | Provider | TTS | STT | Voice Clone | Local | Auth Required |
|---|---|---|---|---|---|---|
| ElevenLabs MCP | ElevenLabs | Yes | Yes | Yes | No | Yes |
| Azure Speech MCP | Microsoft | Yes | Yes | Yes | No | Yes |
| Whisper servers | Community | No | Yes | No | Yes | No |
| mcp-tts | Community | Yes | No | No | Partial | Varies |
| MCPollinations | Pollinations | Yes | No | No | No | No |
Implementation Patterns
Returning Images from Tools
The most common multimodal pattern is a tool that returns an image alongside descriptive text. In the Python SDK with FastMCP, this is straightforward:
from fastmcp import FastMCP
from fastmcp.utilities.types import Image
mcp = FastMCP("chart-server")
@mcp.tool()
def generate_chart(data: list[float], title: str) -> Image:
"""Generate a bar chart from data."""
# Create chart with matplotlib
import matplotlib.pyplot as plt
import io
fig, ax = plt.subplots()
ax.bar(range(len(data)), data)
ax.set_title(title)
buf = io.BytesIO()
fig.savefig(buf, format='png')
buf.seek(0)
plt.close()
return Image(data=buf.read(), format="png")
FastMCP automatically converts the Image object to an MCP ImageContent block with correct base64 encoding and MIME type.
For mixed content (text + image), return a list:
@mcp.tool()
def analyze_page(url: str) -> list:
"""Capture and analyze a web page."""
screenshot = capture_screenshot(url) # returns bytes
return [
Image(data=screenshot, format="png"),
"Screenshot captured successfully. Ready for analysis."
]
Returning Audio from Tools
Audio follows the same base64 pattern:
@mcp.tool()
def text_to_speech(text: str, voice: str = "alloy") -> dict:
"""Convert text to speech audio."""
audio_bytes = generate_speech(text, voice) # API call
return {
"type": "audio",
"data": base64.b64encode(audio_bytes).decode(),
"mimeType": "audio/mp3"
}
Resources with Binary Content
MCP resources can serve binary content through BlobResourceContents:
@mcp.resource("images://{filename}")
def get_image(filename: str) -> bytes:
"""Serve an image file."""
with open(f"/images/{filename}", "rb") as f:
return f.read()
FastMCP detects the bytes return type and automatically creates a BlobResourceContents with base64 encoding and an appropriate MIME type based on the file extension.
Client-Side Content Handling
Not all clients handle all content types. When building MCP servers that return multimodal content, consider graceful degradation:
@mcp.tool()
def analyze_with_fallback(image_path: str) -> list:
"""Analyze an image with text fallback."""
analysis = run_analysis(image_path)
return [
# Image for clients that support it
Image(path=image_path),
# Text description as fallback
f"Analysis results:\n{analysis.summary}\n\n"
f"Objects detected: {', '.join(analysis.objects)}\n"
f"Dominant colors: {', '.join(analysis.colors)}"
]
Clients that support ImageContent will display the image. Those that don’t will still receive the text description. The annotations.audience field can help here — mark the image as ["user"] if the text description provides sufficient information for the LLM.
Handling Large Media Files
Base64 encoding increases data size by approximately 33%. A 1 MB image becomes ~1.33 MB when base64-encoded and embedded in a JSON response. For video and audio files, this overhead quickly becomes problematic.
The Size Problem
| Media Type | Typical Size | Base64 Size | Token Estimate |
|---|---|---|---|
| Screenshot (PNG) | 200 KB | 267 KB | ~2,000 |
| High-res photo (JPEG) | 2 MB | 2.7 MB | ~20,000 |
| 30-sec audio (MP3) | 500 KB | 667 KB | ~5,000 |
| 1-min video (MP4) | 10 MB | 13.3 MB | ~100,000 |
These are rough estimates — actual token counts depend on the model and how it processes binary content. The key insight is that embedding large media files directly in tool results can consume significant context window space.
Pattern 1: Thumbnail + Full Resolution on Demand
Return a compressed thumbnail in the tool result and offer the full-resolution version as a resource:
@mcp.tool()
def capture_screenshot(url: str) -> list:
"""Capture a web page screenshot."""
full_image = take_screenshot(url) # Full resolution
thumbnail = resize(full_image, max_width=800) # Compressed
# Store full image for on-demand access
store_resource(f"screenshots/{url_hash}", full_image)
return [
Image(data=thumbnail, format="jpeg"), # Small version
f"Screenshot captured. Full resolution available at "
f"screenshots://{url_hash}"
]
Pattern 2: Process Server-Side, Return Text
For many multimodal workflows, the agent doesn’t need the raw media — it needs information extracted from it:
@mcp.tool()
def analyze_image(image_path: str) -> str:
"""Analyze an image and return findings."""
# Process with vision model server-side
results = vision_model.analyze(image_path)
return (
f"Image: {image_path}\n"
f"Dimensions: {results.width}x{results.height}\n"
f"Objects: {', '.join(results.objects)}\n"
f"Text found: {results.ocr_text}\n"
f"Scene: {results.scene_description}"
)
This pattern avoids sending image bytes through the MCP pipeline entirely. The tradeoff is that the agent can’t “see” the image — it only gets the analysis results.
Pattern 3: Chunked Processing for Large Files
The Large File MCP server demonstrates intelligent chunking for files that can’t be loaded into memory at once:
- Smart chunking — automatically determines optimal chunk size based on file type
- Memory-efficient streaming — processes files line-by-line without loading the entire file
- LRU caching — 80–90% hit rates for frequently accessed file sections
- Seek-based access — tools like
read_chunk(file, offset, length)let agents request specific portions
For media files specifically, the pattern adapts:
@mcp.tool()
def get_video_segment(video_path: str, start_sec: float,
duration_sec: float) -> list:
"""Extract and analyze a specific video segment."""
# Extract frames from the segment
frames = extract_key_frames(video_path, start_sec, duration_sec)
# Return only key frames, not the full video
return [
Image(data=frame, format="jpeg")
for frame in frames[:5] # Limit to 5 key frames
] + [f"Extracted {len(frames)} key frames from "
f"{start_sec}s to {start_sec + duration_sec}s"]
Pattern 4: Reference-Based Workflows
For very large media, use ResourceLinks to point to files without embedding them:
{
"content": [
{
"type": "text",
"text": "Video processed successfully. 3 segments identified."
},
{
"type": "resource_link",
"uri": "videos://processed/segment-1.mp4",
"name": "Intro segment (0:00-0:45)",
"mimeType": "video/mp4"
},
{
"type": "resource_link",
"uri": "videos://processed/segment-2.mp4",
"name": "Main content (0:45-3:20)",
"mimeType": "video/mp4"
}
]
}
The client can then fetch individual segments on demand, rather than receiving all video data in a single response.
Multimodal Workflow Examples
Visual QA Pipeline
An agent that answers questions about images combines multiple MCP servers:
- Screenshot server captures a web page → returns ImageContent
- OCR server extracts text from the image → returns TextContent
- The LLM analyzes the image + extracted text together
- TTS server optionally converts the answer to speech → returns AudioContent
Each step uses standard MCP tool calls. The agent orchestrates the pipeline, passing context between servers.
Content Creation Pipeline
An agent creating a blog post with media:
- Agent writes the article text
- Image generation server (DALL-E/Flux) creates a hero image from a prompt → ImageContent
- Image manipulation server resizes and optimizes the image → ImageContent
- TTS server generates an audio version of the article → AudioContent
- Agent combines everything into a published post
Video Analysis Pipeline
An agent reviewing video content:
- TwelveLabs MCP indexes a video (async — may take minutes)
- Agent searches for specific moments: “find scenes with the product logo”
- TwelveLabs returns timestamps and descriptions (TextContent)
- Screenshot/frame extraction server captures key frames → ImageContent
- Agent analyzes frames and generates a report
Accessibility Audit
An agent evaluating website accessibility:
- Playwright MCP captures a page screenshot and accessibility tree
- Agent identifies visual issues from the screenshot (color contrast, text size)
- Agent cross-references with the accessibility tree data
- Image annotation server marks problem areas on the screenshot → ImageContent
- Agent produces a report with annotated screenshots
Client Support for Multimodal Content
Not all MCP clients support all content types equally. This is one of the biggest practical challenges with multimodal MCP.
Current Client Support
| Client | TextContent | ImageContent | AudioContent | EmbeddedResource |
|---|---|---|---|---|
| Claude Desktop | Full | Full | Partial | Full |
| Cursor | Full | Partial | No | Partial |
| Windsurf | Full | Partial | No | Partial |
| Claude Code | Full | Full | No | Partial |
| Custom clients | Varies | Varies | Varies | Varies |
“Partial” typically means the client accepts the content type but may not render it optimally — for example, displaying base64 data as text rather than rendering an image.
Implications for Server Developers
- Always include TextContent alongside media — provide a text description or summary so agents that can’t process the media still get useful information
- Use annotations — mark media with
audience: ["user"]when it’s for display only, so clients can skip sending it to the LLM - Document supported content types — clearly state what content types your server returns so client developers know what to expect
- Test with multiple clients — behavior varies significantly between Claude Desktop, IDE integrations, and custom clients
Notably, the Gemini CLI opened an issue tracking support for ImageContent in MCP tool results, indicating that even major clients are still catching up with the full multimodal specification.
Security Considerations
Multimodal content introduces security risks that text-only MCP doesn’t face.
Image-Based Prompt Injection
Images can contain text that attempts to manipulate the LLM. An MCP server that captures screenshots or processes user-uploaded images could inadvertently pass adversarial text to the model. Mitigations:
- Server-side scanning — check images for embedded text before returning them
- Content annotation — use
audience: ["user"]for images that should be displayed but not interpreted by the LLM - Sandboxed rendering — process images in isolated environments to prevent exploits
Data Exfiltration via Media
A malicious MCP server could encode sensitive data in image pixels or audio frequencies, exfiltrating information that wouldn’t be caught by text-based security scanning. This is a variant of the broader MCP server security challenge.
Large File Denial of Service
A server that returns very large base64-encoded files can overwhelm client memory or exhaust context windows. Implement:
- Size limits — cap the maximum size of returned media content
- Compression — always compress images before base64 encoding
- Progressive loading — use ResourceLinks for large files instead of embedding them
MIME Type Validation
Don’t trust MIME types from untrusted MCP servers. A server could claim to return image/png but actually send executable content. Clients should:
- Validate that the base64-decoded content matches the declared MIME type
- Render media in sandboxed contexts
- Reject unexpected MIME types
For a comprehensive treatment of MCP security risks, see our MCP Attack Vectors and Defense guide and MCP Server Security guide.
The 2026 Multimodal Roadmap
The MCP specification is actively evolving to better support multimodal content. Key developments on the 2026 roadmap:
First-Class Media Support
The roadmap explicitly calls out video, audio, and other media types receiving “first-class support” in MCP. This means going beyond the current base64-in-JSON approach to more efficient transport mechanisms.
Streaming for Large Content
Planned streaming capabilities include:
- Multipart messages — send large files in chunks rather than as single base64 blobs
- Bidirectional communication — enable interactive experiences like real-time audio conversations
- Efficient large file transfers — dedicated transport mechanisms for binary content
Content Negotiation
Future specification versions may include content negotiation, where clients declare which content types they support and servers adapt their responses accordingly. This would address the current fragmentation in client support.
MCP Apps
“MCP Apps” is a new concept being incorporated into the standard, potentially enabling richer interactive experiences that go beyond request-response tool calls. The full scope of this initiative is still being defined.
When to Go Multimodal
Use Multimodal MCP When:
- The task requires visual context — UI debugging, design review, data visualization, accessibility audits
- You’re building content creation pipelines — blog posts with images, video scripts with thumbnails, podcasts
- The agent needs to analyze non-text data — medical images, architectural drawings, audio recordings
- User experience requires media — displaying charts, playing audio feedback, showing generated images
Stick to Text-Only When:
- Text descriptions suffice — if an OCR result or image description gives the agent enough information, skip the raw image
- Context window budget is tight — images consume significant tokens; text summaries are cheaper
- Your client doesn’t support it — check client support before building multimodal servers
- The workflow is API-to-API — if no human sees the output, text-based data transfer is more efficient
Token Cost Awareness
Multimodal content is expensive in context window terms. A single high-resolution screenshot can consume 2,000–20,000 tokens. A practical approach:
- Start with text-only tools that describe or summarize media content
- Add thumbnail/preview tools that return compressed versions
- Offer full-resolution on demand through resources or follow-up tools
- Use annotations to prevent unnecessary media from reaching the LLM
Getting Started
Quick Start: Image-Capable Server
If you want to add image support to an existing MCP server:
- Use the FastMCP Python SDK — it handles base64 encoding automatically
- Return
Image(path="file.png")orImage(data=bytes_data, format="png")from tools - Include a text description alongside every image
- Test with Claude Desktop, which has the most complete ImageContent support
Quick Start: Audio-Capable Server
For text-to-speech or transcription:
- Start with MCPollinations for zero-auth experimentation
- Move to ElevenLabs MCP for production-quality TTS
- Use Whisper-based servers for local, private transcription
Production Considerations
For multimodal servers in production:
- Implement size limits — cap image dimensions and file sizes
- Compress aggressively — use JPEG quality 80 for photos, PNG for screenshots
- Cache processed results — avoid re-processing the same media
- Monitor token usage — multimodal tools can dramatically increase API costs
- Plan for client diversity — always provide text fallbacks
- Review our MCP cost optimization guide for strategies to manage multimodal token budgets
Further Reading
- MCP for Testing/QA — browser automation servers with screenshot capabilities
- MCP Server Security — security considerations for all MCP servers
- MCP Attack Vectors and Defense — adversarial risks including image-based attacks
- MCP Cost Optimization — managing token usage with media-heavy workflows
- MCP Async Tasks and Long-Running Operations — handling video processing and other async media tasks
- Best Audio & Video MCP Servers — our review of audio and video MCP server options
- Best Image Generation MCP Servers — our review of image generation server options