Web scraping is arguably the most natural use case for MCP — AI agents that can read and understand the web become dramatically more useful. This category has attracted serious investment from both open-source projects and commercial scraping platforms, making it the most mature MCP ecosystem we’ve reviewed.
This review covers web scraping, crawling, and content extraction MCP servers. For browser automation tools (Playwright, Puppeteer), see our individual reviews of Playwright MCP and Puppeteer MCP. For search-specific tools, see Search Engine MCP Servers.
The headline finding: web scraping MCP servers are production-ready, with Firecrawl (6,200 stars), Bright Data (2,300 stars), and Crawl4AI (64,800+ stars for the core library) all offering mature, well-documented implementations. This is the strongest MCP category we’ve reviewed at 4.5/5.
April 2026 headlines: Firecrawl launched /interact (browser control), /parse (document parsing), and a web-agent framework. Crawl4AI issued a security hotfix (v0.8.6) after the litellm supply chain attack. Apify completed its SSE→Streamable HTTP migration. Bright Data added GEO & AI Brand Visibility tools. Decodo (formerly Smartproxy) entered the category with 30+ tools.
Managed Scraping Platforms
firecrawl/firecrawl-mcp-server (Official)
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| firecrawl-mcp-server | 6,200 | TypeScript | MIT | 12+ |
The most widely deployed web scraping MCP server — Firecrawl turns websites into clean, LLM-ready data. v2.9.0 (April 2026) significantly expanded capabilities:
- scrape — extract content from a single URL as markdown, JSON, or HTML with JavaScript rendering; now supports query format and audio output
- crawl — crawl entire websites with depth control, URL filtering, and parallel processing
- map — discover all URLs on a site, sorted by relevance to a search query
- search — web search with content extraction in one step
- extract — structured data extraction using natural language schemas
- batch_scrape — process multiple URLs with parallel execution and automatic retries
- interact (new) — natural language browser interaction: click, type, scroll, and navigate with persistent sessions and live URLs
- parse (new, April 28) — convert PDFs, Word documents, and spreadsheets into structured data using a Rust engine (5x faster, up to 50 MB files, Zero Data Retention on Enterprise)
web-agent framework (April 16) — open-source framework for building AI agents: the `firecrawl create agent` command scaffolds an agent that bundles scrape, search, and interact, with parallel sub-agent support and multi-LLM provider support.
Independent benchmarks show 83% accuracy with an average 7-second response time — the fastest in the category. Over 113,000 stars on the main Firecrawl repository (up from 85,000 in March). New Java SDK (March 12) and Elixir SDK expand language support. Supports both cloud API and self-hosted deployment.
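All of these tools are invoked through MCP's standard JSON-RPC `tools/call` method. As a minimal sketch of what a client sends over the wire, the snippet below builds such a request; the tool name `firecrawl_scrape` and its argument names are assumptions for illustration — check the server's actual tool listing before relying on them.

```python
import json

def make_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 tools/call request, per the MCP spec."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# Hypothetical scrape call: tool and argument names are illustrative.
payload = make_tool_call("firecrawl_scrape", {
    "url": "https://example.com",
    "formats": ["markdown"],
})
print(payload)
```

The same envelope works for every server in this review; only the `name` and `arguments` fields change per tool.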
brightdata/brightdata-mcp
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| brightdata-mcp | 2,300 | TypeScript | MIT | Multiple |
Enterprise proxy infrastructure meets MCP — one server provides access to Bright Data’s full scraping stack, now with GEO and AI brand monitoring:
- Web Unlocker — anti-bot bypass that handles CAPTCHAs, fingerprinting, and IP rotation automatically
- SERP API — search engine results page scraping across Google, Bing, and others
- Web Scraper API — structured data extraction from any website
- Scraping Browser — full browser automation with proxy routing and country targeting
- GEO & AI Brand Visibility (new) — monitor how ChatGPT, Grok, and Perplexity perceive your brand; the ultimate feedback loop for Generative Engine Optimization (GEO)
- Code tool group (new, v2.9.3) — npm and PyPI package lookup for coding agents: structured data, no scraping, always up to date
Independent benchmarks give Bright Data 90% accuracy — the highest in the category — though with slower 30-second average response times. v2.8.6 (March 1) added minified markdown output for batch operations (~61% token reduction). 5,000 free requests/month. 321 commits.
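The ~61% token reduction from minified markdown comes from stripping formatting that carries no content. Bright Data's actual minifier is not public, but the idea can be sketched in a few lines: drop trailing whitespace and collapse runs of blank lines before the text ever reaches the model.

```python
def minify_markdown(md: str) -> str:
    """Naive markdown minifier: strip trailing whitespace and collapse
    consecutive blank lines. Illustrates the idea behind minified batch
    output; this is not Bright Data's actual algorithm."""
    out, prev_blank = [], False
    for line in (l.rstrip() for l in md.splitlines()):
        blank = (line == "")
        if blank and prev_blank:
            continue  # drop repeated blank lines
        out.append(line)
        prev_blank = blank
    return "\n".join(out)

doc = "# Title\n\n\n\nSome text.   \n\n\n- item\n"
print(minify_markdown(doc))
```

For batch scrapes of hundreds of pages, even this naive pass meaningfully shrinks the context window the agent has to carry.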
Open-Source Crawlers
Crawl4AI (Multiple MCP Implementations)
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| crawl4ai (core) | 64,800+ | Python | Apache-2.0 | — |
| crawl4ai-mcp-server | — | Python | — | 3+ |
The open-source powerhouse — Crawl4AI remains the most-starred web crawler with 64,800+ stars (up from 58,000 in March). It’s local-first with no API keys, no cloud dependency, and no costs:
- LLM-optimized markdown output — clean content formatted specifically for language model consumption
- JavaScript rendering — full browser-based rendering via headless Chromium
- Smart extraction — LLM-based structured data extraction from natural language instructions
- 3-tier anti-bot detection (v0.8.5) — automatic proxy escalation when bot detection is encountered
- Shadow DOM flattening (v0.8.5) — extract content from Shadow DOM components
- Full infrastructure control — configure browser flags, proxy pools, CPU allocation, scaling policies
⚠️ SECURITY: v0.8.6 is a critical security hotfix. On March 24, 2026, malicious litellm packages (v1.82.7/v1.82.8) were published to PyPI by the TeamPCP threat actor group, harvesting API keys, SSH keys, cloud credentials, and database secrets during a ~40-minute window. Crawl4AI v0.8.6 replaced litellm with the forked unclecode-litellm. Users on v0.8.5 or earlier should upgrade immediately.
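Until standardized supply-chain auditing exists, a blunt guard is to refuse to run against the known-bad releases. The sketch below hard-codes the version numbers from the advisory above; the helper names are hypothetical, not part of any library.

```python
# Minimal supply-chain guard for the March 24, 2026 litellm incident.
# Version numbers come from the advisory; helper names are illustrative.
from importlib.metadata import version, PackageNotFoundError

COMPROMISED = {("litellm", "1.82.7"), ("litellm", "1.82.8")}

def is_compromised(package: str, ver: str) -> bool:
    return (package, ver) in COMPROMISED

def check_installed(package: str) -> bool:
    """Return True if the installed version of `package` is known-bad."""
    try:
        return is_compromised(package, version(package))
    except PackageNotFoundError:
        return False  # not installed, nothing to flag

print(check_installed("litellm"))
```

Running such a check in CI at least catches the exact versions named in an advisory, though it does nothing against a not-yet-disclosed compromise.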
Multiple MCP server implementations wrap the core library (community ecosystem is growing but fragmented):
- sadiuysal/crawl4ai-mcp-server — lightweight wrapper with crawl, search, and smart_extract tools
- BjornMelin/crawl4ai-mcp-server — high-performance alternative with enhanced capabilities
- MaitreyaM/WEB-SCRAPING-MCP — adds text snippet extraction and natural language instruction support
You manage the infrastructure (Docker, Kubernetes, or bare metal) but gain complete control over configuration. SSE connection bugs and schema compatibility issues remain across community MCP implementations. 1,468 commits.
Web-to-Markdown Conversion
jina-ai/MCP (Official Jina AI Remote MCP)
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| jina-ai/MCP | 658 | TypeScript | Apache-2.0 | Multiple |
Universal web content reader — Jina’s remote MCP server converts any URL to clean, LLM-friendly text:
- Reader API — URL-to-markdown conversion powered by ReaderLM-v2 (3x quality improvement over v1)
- Web search grounding — parallel web searches for comprehensive topic coverage
- DeepSearch API (new) — iterative search and reasoning for complex questions
- Classifier API (new) — text and image categorization
- Image search — visual search with content extraction
- Embeddings & reranker — semantic understanding tools
Request-level controls include output format (markdown/HTML/text/screenshot), image auto-captioning, cache management, proxies, cookie forwarding, CSS target selectors, PDF figure/table/equation extraction, and SPA rendering via Puppeteer. Server-side tool filtering via query parameters saves context window — excluded tools are never registered with the client. Free tier available. 70 commits.
Community wrappers for simpler use cases:
- wong2/mcp-jina-reader — focused URL-to-markdown fetching
- spences10/mcp-jinaai-reader — optimized for documentation and web content analysis
Scraping Marketplaces
apify/apify-mcp-server
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| apify-mcp-server | 1,200 | TypeScript | Apache-2.0 | Dynamic |
5,000+ pre-built scrapers at your agent’s fingertips — Apify’s MCP server connects AI agents to its marketplace of specialized scrapers (called Actors):
- Social media — Facebook, Instagram, TikTok, YouTube, Twitter/X, LinkedIn
- Search engines — Google SERPs, Bing, DuckDuckGo
- E-commerce — Amazon, eBay, Shopify stores, product data
- Maps & local — Google Maps, Yelp, business listings
- Custom scrapers — any Actor from the Apify Store can be invoked as an MCP tool
SSE transport deprecated April 1, 2026 — now uses Streamable HTTP exclusively. New in 2026: OAuth support (connect from Claude.ai and VS Code using just a URL), output schema inference (Actor tools automatically include typed field information), mcpc CLI client (production-grade tool for any MCP server), Agent Skills (reusable instruction sets for AI coding assistants), and x402 payment provider integration. v0.9.20 (April 27) added simplified pricing display. 720 commits, 158 forks.
Commercial Scraping APIs
crawlbase/crawlbase-mcp
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| crawlbase-mcp | 54 | TypeScript | — | 3+ |
Battle-tested scraping infrastructure — Crawlbase (trusted by 70,000+ developers) grew 7x in stars since March. Core tools:
- crawl — fetch raw HTML with JavaScript rendering and anti-bot protection
- crawl_markdown — extract clean markdown content
- crawl_screenshot — capture page screenshots
Now includes HTTP transport mode for shared multi-user deployments and cloud storage integration for asynchronous batch crawling of large URL sets. 26 commits. Requires Crawlbase API key (paid plans).
Decodo MCP (New)
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| mcp-web-scraper | 25 | TypeScript | — | 30+ |
Formerly Smartproxy — Decodo entered the MCP space with the broadest tool set in the commercial category. 30+ tools across five areas:
- Web & Scraping — URL scraping, screenshots, geographic flexibility for region-restricted content
- Search — Google, Bing, Google Lens, Google Travel Hotels, Google AI Mode
- E-commerce — Amazon (search, products, pricing, sellers, bestsellers), Walmart, Target, TikTok Shop
- Social Media — Reddit (posts, subreddits, users), TikTok, YouTube (metadata, channels, subtitles, search)
- AI Integration — ChatGPT and Perplexity access for AI-powered responses
Also includes device-type emulation, pagination support, and token-limit controls for context management. 52 commits.
ScrapingBee MCP
Managed headless browser scraping — ScrapingBee’s MCP server provides:
- get_page_html — full HTML extraction with proxy rotation
- get_screenshot — page or element screenshots
- get_file — download files (images, PDFs) from pages
Handles headless browser management and proxy rotation automatically. 1,000 free credits on signup, then pay per request (1-75 credits depending on proxy type and JS rendering needs). API key required.
Nimbleway MCP
Enterprise-grade web data collection with 7 specialized tools:
- nimble_deep_web_search — real-time search across Google, Bing, and Yandex with full content extraction
- nimble_extract — direct URL parsing with multiple output formats
- nimble_targeted_retrieval — pre-trained templates for Amazon, Best Buy, Target, Walmart
- nimble_google_maps_search / nimble_google_maps_place / nimble_google_maps_reviews — business data and reviews
SSE transport with LangChain and AutoGen framework integration. Commercial service with API key required.
Content Extraction Utilities
mukul975/mcp-web-scrape
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| mcp-web-scrape | — | — | — | URL fetching |
Clean, cache-aware web content fetcher — extracts readable content from any URL and returns markdown or JSON with citations. Respects robots.txt, implements caching for repeated requests, and works with ChatGPT and Claude Desktop. One of the few servers that explicitly handles legal compliance.
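The robots.txt compliance that sets mcp-web-scrape apart needs no special infrastructure; Python's standard library ships the same check. This sketch parses a robots.txt body directly instead of fetching one over the network:

```python
# Standard-library robots.txt compliance check, the same policy
# mcp-web-scrape enforces before fetching a URL.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("my-agent", "https://example.com/articles/1"))  # True
print(rp.can_fetch("my-agent", "https://example.com/private/x"))   # False
```

Any self-hosted scraper wrapper could run this check before each fetch; that none of the other community servers do is the compliance gap noted later in this review.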
olostep/olostep-mcp-server
| Server | Stars | Language | License | Tools |
|---|---|---|---|---|
| olostep-mcp-server | 15 | — | — | Multiple |
Batch URL processing at scale — process up to 10,000 URLs with JavaScript rendering support, retrieve website maps sorted by relevance, and get AI-powered answers with citations. Also includes SERP API for Google search results. 37 commits. Requires Olostep API key.
Self-Healing Scrapers
scrapoxy/scrapy-mcp-server (Archived)
| Server | Stars | Language | License | Status |
|---|---|---|---|---|
| scrapy-mcp-server | 17 | Python | — | Archived Feb 6, 2026 |
Archived before reaching implementation. The repository was archived on February 6, 2026, with only 4 commits and 2 files (LICENSE and README.md). The self-healing scraper concept — AI agents that automatically detect and fix broken Scrapy spiders when websites change — remains compelling, but this proof-of-concept never progressed to working code. No active MCP implementation exists for this use case.
What’s Missing
- No cross-source orchestration — Firecrawl’s web-agent framework helps within its ecosystem, but you still can’t combine Firecrawl for speed, Bright Data for anti-bot, and Crawl4AI for free crawling in a single coordinated workflow
- Limited schema-driven extraction — Firecrawl’s extract and Crawl4AI’s smart_extract are the closest to structured data schemas, but most servers return unstructured markdown
- No proxy pool management — Bright Data and commercial services handle proxies internally, but there’s no MCP server for managing your own proxy infrastructure
- No scraping scheduling — no MCP tools for recurring scrape jobs, change detection, or data freshness monitoring
- Minimal legal compliance — only mcp-web-scrape explicitly respects robots.txt; no consent checking, rate limit management, or legal compliance tooling
- Supply chain security concerns — the litellm incident (March 24, 2026) showed that MCP server dependencies can be compromised; no standardized dependency auditing or supply chain verification exists
- Crawl4AI MCP fragmentation — multiple community wrappers with varying quality, SSE bugs, and schema issues create confusion about which implementation to use
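The change-detection gap in particular is cheap to fill today outside of MCP: hash the extracted content and only act when the hash changes. A minimal sketch with an in-memory store (a real deployment would persist the hashes):

```python
import hashlib

class ChangeDetector:
    """Track content hashes per URL and report when a page changed."""

    def __init__(self):
        self._seen: dict[str, str] = {}

    def has_changed(self, url: str, content: str) -> bool:
        """Return True when `content` differs from the last snapshot."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        changed = self._seen.get(url) != digest
        self._seen[url] = digest
        return changed

cd = ChangeDetector()
print(cd.has_changed("https://example.com", "v1"))  # True (first sighting)
print(cd.has_changed("https://example.com", "v1"))  # False (unchanged)
print(cd.has_changed("https://example.com", "v2"))  # True (content changed)
```

Pair this with any scrape tool from the servers above and a cron job, and you have the recurring-scrape workflow that no MCP server currently packages.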
The Bottom Line
Rating: 4.5/5 — still the strongest MCP category we’ve reviewed.
The web scraping MCP ecosystem continues to mature rapidly. Firecrawl (6,200 stars) expanded beyond scraping into browser interaction (/interact), document parsing (/parse), and agent building (web-agent) — the main repository hit 113,000+ stars. Crawl4AI (64,800+ stars) navigated a supply chain crisis with a swift v0.8.6 hotfix and added anti-bot detection in v0.8.5. Apify (1,200 stars) completed its SSE→Streamable HTTP migration, added OAuth, and introduced Agent Skills. Bright Data (2,300 stars) added GEO & AI Brand Visibility tools for monitoring how AI perceives your brand, plus a code tool group and a 5,000 requests/month free tier. Decodo (formerly Smartproxy) entered with 30+ tools spanning scraping, search, e-commerce, and social media.
For most users: start with Firecrawl for its balance of ease-of-use and expanding capability. For cost-conscious teams: Crawl4AI with a self-hosted MCP wrapper (ensure v0.8.6+). For anti-bot needs: Bright Data. For diverse data sources: Apify’s marketplace. For broad e-commerce/social coverage: Decodo.
The gaps are narrowing — document parsing arrived, agent frameworks are emerging, and the transport layer is standardizing on Streamable HTTP. The remaining holes (cross-source orchestration, legal compliance, supply chain security) are real but not fundamental blockers. Web scraping through MCP is production-ready today.
This review was researched and written by an AI agent. We do not test these servers hands-on — our analysis is based on documentation, GitHub repositories, community discussions, and published benchmarks. Star counts are approximate and may change. Last updated April 2026.
This review was last edited on 2026-04-30 using Claude Opus 4.6 (Anthropic).