Captions AI — Mirage’s Creator Editing App With an Official MCP Server
If you’ve been following AI video generation benchmarks — the ELO leaderboards, the DMD-2 distillation papers, the billion-parameter model announcements — Captions might not have come up. It holds no position on the Artificial Analysis Video Arena leaderboard. It doesn’t claim to generate 1080p scenes from text prompts with native joint audio-video synthesis.
It does something different: it makes the videos you already recorded into something worth posting.
Captions is the flagship product of Mirage (formerly Captions Inc.), a New York-based AI company that has processed more than 3 million creator videos per month and raised $175 million in total capital. The app is available on iOS, Android, and web. It auto-captions, auto-edits, corrects your eye gaze, dubs your voice into 30+ languages, generates a digital clone of you that can deliver scripts on command, and — as of March 2026 — has an official MCP server.
The AI video generation category has been dominated by infrastructure-layer models: Seedance 2.0, HappyHorse-1.0, Kling AI, Veo 3.1. These are models that synthesize video from nothing. Captions operates at the other end of the pipeline. It takes the video you already have — a talking-head clip recorded on your phone — and transforms it into polished short-form content optimized for TikTok, Instagram Reels, and YouTube Shorts.
That framing matters because the competitive question for Captions isn’t “which model has the best ELO score?” It’s “which tool saves creators the most production time?” The answer, for a meaningful segment of the market, appears to be Captions.
This is a research-based review covering Captions’ features, the Mirage foundation model, pricing, platform availability, MCP integration, competitive positioning, and what this tool is — and is not — right for. We do not test AI tools hands-on; we analyze from public sources, company announcements, technical documentation, and user feedback.
The Company: From Captions Inc. to Mirage
Mirage (legal name: Mirage Inc., formerly Captions Inc.) was co-founded by Gaurav Misra (CEO) and Dwight Churchill (CTO), who met while working at Localytics, a mobile analytics startup. They identified the same market gap independently: TikTok and Reels were growing faster than anyone could make content for them, but the production workflow — record, caption, edit, export, schedule — was still mostly manual.
They launched Captions as an iOS app focused on one problem: auto-captioning short-form video. The product expanded steadily, adding AI editing, dubbing, avatar generation, and social scheduling over the following years.
Funding Timeline
- 2021: Seed funding — initial product development
- 2022: Series A — Index Ventures led; product-market fit confirmed
- 2023: Series B — expanded editing feature set
- July 2024: $60M Series C at a $500M valuation, led by Index Ventures; users generating 3M+ videos/month
- September 2025: Company rebrands from Captions Inc. to Mirage, signaling expanded AI research mission
- January 2025: Switch to freemium model to compete directly with CapCut and Meta’s Edits
- March 2026: $75M growth financing from General Catalyst’s Customer Value Fund, bringing total raised to $175M
CEO Gaurav Misra described the rebrand rationale to TechCrunch in September 2025: “The new identity, Mirage, reflects our expanded vision and commitment to redefining the video category, starting with short-form video, through frontier AI research and models.”
General Catalyst’s managing director, who led the March 2026 investment, stated publicly that “Mirage has great product-market fit” and that “Mirage’s business equation is extremely figured out” — language that suggests the company’s unit economics are already favorable, not the typical growth-at-any-cost startup profile.
The Product vs. Company Naming
This creates a notable branding complexity worth understanding before continuing: the company is now called Mirage, but the app is still called Captions. If you search for “Captions AI” you find the app; if you look up the developer company, it’s Mirage. The help documentation lives at both captions.ai/help and help.mirage.app. The Mirage foundation model (discussed below) is a separate product from the editing app, though it feeds features into the app.
This review refers to the app as Captions and the company as Mirage throughout, consistent with official usage.
What Captions Actually Does
Captions is primarily a video editing app for talking-head and short-form content. You record a video — yourself on camera, a screen recording, an interview — and Captions applies AI-powered transformations to make it publication-ready.
The feature set as of May 2026 covers six major capability areas:
1. Auto-Captions
The original feature and still the anchor. Captions transcribes your video automatically and generates synchronized subtitles.
- 100+ languages supported for transcription
- 100+ caption template styles — animated, styled text overlays in formats optimized for different platforms
- Filler word removal — automatically detects and removes “um,” “uh,” “like,” and similar from both captions and audio
- Word-level timestamp sync — captions adjust in real time if you trim or rearrange clips
The auto-caption quality is widely cited in user reviews as best-in-class for short-form video. VEED and CapCut offer competing auto-caption tools, but Captions’ combination of accuracy, styling options, and filler-word removal is a consistent differentiator in user comparisons.
2. AI Eye Contact
AI Eye Contact is a post-processing correction layer that digitally adjusts your gaze direction in footage — making you appear to be looking directly into the camera, even if you were looking at a teleprompter or a different screen.
The mechanism: a computer vision model tracks eye movement and pupils across every frame, then warps the eye region to redirect gaze toward the camera axis. The result is that a creator can read a full script from their phone screen (which is not the camera lens) and still appear to be speaking directly to the audience.
This addresses one of the most common quality signals viewers use to assess creator credibility: eye contact. Creators who can’t maintain camera gaze — because they’re reading notes, checking comments, or using a prompter — project less confidence on screen. AI Eye Contact removes that constraint.
3. AI Twin
The AI Twin feature creates a digital avatar that looks like you, sounds like you, and can deliver any script you provide — without you ever needing to record again after the initial setup.
Setup: Record a short video sample (approximately 2–5 minutes of footage, per the help documentation). Captions builds a personalized avatar model from this footage. After setup, you provide a script, the avatar generates a video of you delivering it.
Use cases Mirage cites:
- Consistent social media posting without daily recording sessions
- Multilingual content from a single recording (the avatar delivers your script in dubbed languages)
- Course content and educational videos at scale
- Ad variations with different scripts from a single creative
The AI Twin sits at the intersection of Captions the editing app and the Mirage foundation model (see below). The underlying generation technology comes from Mirage’s UGC-focused model architecture.
4. AI Edit
AI Edit is the closest Captions comes to a one-click production pipeline. You select a style, and the AI assembles a complete video edit from your raw footage: cuts, B-roll overlays, sound effects, music, and motion effects matched to the chosen aesthetic.
The style library contains 116 styles in total (as of current documentation): 21 Premium styles and 95 Basic styles. Style names reflect specific creator aesthetics — documentary, viral, motivational, product-showcase, talking-head-interview — rather than generic labels.
What AI Edit automates:
- Cut selection — identifying the best takes and trimming dead air
- B-roll overlays — inserting generated or stock B-roll footage based on script content
- Sound effects — adding ambient and Foley sound to support the visual content
- Music — selecting or generating background music matched to the edit’s pacing and mood
- Motion effects — zooms, cuts, transitions calibrated to the chosen style
The premise is that a creator who records a 3-minute unedited talking-head video can select “viral motivational” from the style library and receive a fully edited 60-second Reel in minutes, without manually placing a single cut.
User reviews note that AI Edit works best on talking-head monologue content aligned with existing viral formats. B-roll selection and sound design quality varies by source material.
5. AI Dubbing and Translation
Captions translates and dubs videos into 30+ languages while preserving the speaker’s tone, cadence, and vocal characteristics. This goes beyond subtitling: the dubbed audio is generated to match the original speaker’s voice profile.
- Audio dubbing output is synced to the original lip movements or, when used with AI Twin, regenerates the video with the avatar speaking in the target language
- Captions claims accent preservation across dubbed languages — a technically challenging problem the Mirage foundation model specifically addresses
- Translation quality is supported by the same transcription engine that powers auto-captions
This feature positions Captions for creators with multilingual audiences — especially important for Asian, Latin American, and European creator markets where English-only content has limited reach.
6. AI Media Generation
Beyond editing, Captions offers generative capabilities integrated into the editing workflow:
- B-roll generation — generate custom B-roll footage from text prompts to supplement existing footage
- AI Music — generate original background music matched to video mood and length
- AI Sound Effects — generate contextual sound effects from prompts
- AI Images — generate still images for use as backgrounds, thumbnails, or overlays
These generative features feed the AI Edit pipeline but are also accessible manually. They represent the integration between the Captions editing app and Mirage’s underlying foundation model capabilities.
The Mirage Foundation Model
In March 2025, Mirage published the Mirage foundation model — described as the world’s first video foundation model built specifically for user-generated content (UGC) video.
Standard video generation models (Sora, Veo, Kling, Seedance) are trained primarily on diverse video corpora and optimized for cinematic quality or ELO benchmark performance. The Mirage model is trained with a different objective: generating talking-head UGC content that looks indistinguishable from human-recorded creator video.
The distinction is meaningful. A cinematically excellent AI video model produces footage that looks like a high-production film still — beautiful lighting, perfect framing, hyper-realistic texture. UGC video has a different visual grammar: slightly imperfect framing, natural handheld motion, real-world backgrounds, direct eye contact with the camera, casual clothing. The Mirage foundation model is trained to reproduce this aesthetic.
Key Technical Claims
- Virtual actors with natural facial expressions, micro-reactions, and body language — not just lip-sync
- Generates actors, backgrounds, voices, and scripts from scratch, without stock footage or voice cloning of real people
- Produces natural micro-expressions that correspond to speech content (raised eyebrow when asking a question, narrowed eyes when conveying concern, etc.)
- Preserves accent characteristics in multilingual dubbing output
- Processes upper-body motion — not just the face, but shoulder movement, head tilt, and natural gesture
The Mirage team’s framing: “Half of visual communication comes from facial expressions, micro-reactions, and body language — not just lip movement.” Their claim is that earlier lip-sync models only moved lips to match audio, producing the uncanny-valley effect familiar from early deepfake content. The Mirage model drives the full upper-body performance.
UGC Foundation Model vs. Cinematic Generation Models
This is where Captions/Mirage occupies a genuinely distinct position in the AI video landscape. HappyHorse-1.0 and Seedance 2.0 are optimized to win on Artificial Analysis’s human preference benchmark — which measures cinematic quality, physical realism, motion consistency, and visual aesthetics against a diverse prompt set.
The Mirage foundation model is not optimized for the same benchmark. It is optimized to generate content that looks like it was recorded by a human creator for TikTok or Instagram. These are different distributions. A model trained to generate photorealistic 4K landscape scenes may not generate convincing handheld selfie-style talking-head content — and vice versa.
Whether the Mirage foundation model achieves its objective at a quality level competitive with human-recorded UGC is not independently verified by third-party benchmarks. Mirage’s own claims and user feedback are the primary data sources.
Platform Availability
Captions is available on three platforms:
- iOS (primary platform, most features, longest update history)
- Android (full feature parity with iOS as of 2025)
- Web (browser-based access, useful for desktop creators)
The app is headquartered in New York City and has no geographic access restrictions — unlike Seedance 2.0 (no US access) or HappyHorse-1.0 (primarily fal.ai API, no consumer interface). Captions is available to creators globally.
Freemium Model (Introduced January 2025)
In January 2025, Mirage switched Captions from a paid-first to a freemium model — a direct competitive response to CapCut (ByteDance, free) and Meta’s Edits (free). The free tier includes the core captioning features with limitations on export quality and credit usage.
Pricing
Captions’ paid plans are credit-based, with credits consumed per project creation, export, and premium feature usage.
| Plan | Price | Credits/Month |
|---|---|---|
| Free | $0 | Limited |
| Pro | $9.99/month | 200 |
| Max | $24.99/month | 500 |
| Scale | $69.99/month | 1,400 |
Key notes:
- All paid plans include no watermarks on exports
- Credits do not roll over month to month
- Overage fees apply if credit allowance is exhausted
- Annual pricing is available at a discount
The Credit System: Pros and Cons
The credit model is one of the most frequently discussed aspects of Captions in user feedback. The pros: lighter users on the free or Pro tier get access to powerful features without paying for capacity they don’t use. The cons: creators with variable volume — busy weeks followed by quiet periods — can run out of credits before month end, then face overage charges without rollover for accumulated unused credits.
Comparisons to competitors by pricing:
| Tool | Entry Paid Price | Model |
|---|---|---|
| Captions Pro | $9.99/month | Credits |
| OpusClip Starter | $15/month | Minutes |
| Descript Creator | $16/month | Transcription hours |
| CapCut Pro | Free or $9.99/month | Features |
| VEED Basic | $18/month | Minutes |
Captions’ Pro tier at $9.99 is among the most competitive entry points in the segment. The Scale tier at $69.99 targets agencies and high-volume creators.
MCP Server: Official Integration (March 2026)
This is notable: Captions has an official MCP server, documented at captions.ai/help/api-reference/mcp, launched on March 27, 2026.
For context: among the video AI tools we’ve reviewed — HappyHorse-1.0, Seedance 2.0, Kling AI, Veo 3.1 — none have official MCP servers. Community-built servers exist for some of these tools, but the Captions MCP server is the first official, company-maintained MCP integration in the AI video editing space.
What the Captions MCP Server Does
The Captions MCP server allows AI agents and LLM-powered tools (such as Claude or Cursor) to query the Captions documentation programmatically. Example use case per the official docs:
“Using Claude or Cursor with the Captions MCP server, you can search Captions documentation — for example, you can enter a prompt like ‘What languages are supported for Dubbing in the Captions App?’ and Claude will search the Captions documentation for the answer.”
The setup process requires:
- An active Captions account
- An API key from your Captions API Dashboard
- Running the provided configuration command to register the MCP server locally
Scope of the Current MCP Integration
The current MCP server’s primary function is documentation search — allowing AI agents to retrieve information about Captions’ capabilities and settings. This is narrower than an MCP integration that exposes operational tools (e.g., “submit a video for captioning,” “export a project,” “generate a dub”). Those operational capabilities are accessible via the Captions API separately; the MCP server brings those capabilities into the Model Context Protocol ecosystem.
This positions the Captions MCP server as particularly useful for:
- AI coding assistants helping developers integrate the Captions API
- Agentic workflows where an AI needs to understand Captions’ current feature set before constructing API calls
- Documentation-grounded question answering within an LLM conversation
The launch of an official MCP server — even a documentation-focused one — signals Mirage’s awareness of the emerging agentic AI ecosystem and willingness to prioritize that integration. It’s a meaningful differentiator in a segment where most competitors have no MCP presence.
Social Media Manager Feature
In October 2024, Captions launched an AI-powered social media manager feature for websites — an expansion beyond video editing into content scheduling and publishing.
The feature allows creators to:
- Schedule video publication to multiple social platforms
- Generate content recommendations and captions (post captions, not video captions)
- Embed video players on external websites
This represents Mirage’s ambition to own more of the creator workflow: not just the editing step, but the distribution step. The competitive implication is that Captions begins to compete with tools like Later, Buffer, and Hootsuite — social scheduling tools that don’t currently have native AI video editing capabilities.
As of this writing, the social media manager feature is an add-on within the Captions app rather than a separate product.
Competitive Positioning
Captions competes in the short-form video editing segment, which is distinct from the AI video generation segment covered in most recent AI video coverage. The relevant competitors are different:
CapCut (ByteDance)
The dominant free competitor. CapCut is deeply embedded in TikTok’s creation workflow — TikTok and CapCut share parent company ByteDance, which means CapCut’s editing workflow is optimized for TikTok’s native formats. CapCut’s feature set is broader than Captions (full timeline editing, effects library, templates), and it is free on all tiers.
CapCut’s weaknesses: its ByteDance/TikTok ownership creates regulatory risk in the United States (the same TikTok ban legislative effort affects CapCut’s distribution). It lacks the AI Eye Contact and AI Twin features that Captions built first. Its auto-caption quality is generally rated lower than Captions in head-to-head comparisons.
Descript
Descript’s text-based editing model is genuinely different: edit the transcript, and the video edits itself accordingly. Delete a sentence, and the corresponding footage is removed. This is particularly powerful for interview and long-form content.
Descript is better suited to podcast repurposing, documentary, and interview-format content. Captions is better suited to quick talking-head short-form optimization. Different use cases with limited overlap in practice.
OpusClip
OpusClip is an AI clipping tool: it takes a long video (a podcast, a livestream, a webinar) and automatically identifies the most shareable moments, cutting and reformatting them for vertical short-form platforms. Captions does not specialize in this long-to-short clip extraction use case.
OpusClip has recently added captioning features, moving toward Captions’ territory. Captions doesn’t have a comparable long-video-to-clips pipeline — creating some market overlap as both products expand.
HeyGen
HeyGen is the closest competitor to Captions’ AI Twin feature: HeyGen specializes in AI avatar generation from video samples, allowing users to create spokesperson videos from a digital clone. HeyGen skews toward enterprise marketing teams; Captions’ AI Twin is positioned for individual creators. Pricing and UI differ substantially.
Where Captions Differentiates
Captions’ competitive advantage concentrates in a specific combination: AI Eye Contact + AI Twin + AI Edit + Dubbing in a single mobile-first workflow. No single competitor offers all four capabilities at equivalent quality in one app with freemium pricing. CapCut has most features but weaker AI; HeyGen has better avatars but no editing pipeline; Descript has better transcript editing but no AI Twin.
Known Limitations and User Complaints
Fair coverage requires documenting what doesn’t work well.
Credit System Friction
The most common complaint in user reviews: the credit system creates pricing unpredictability. Heavy users in a given week can burn through monthly credits and face overage charges. Credits don’t roll over. The free tier is limited enough that some users feel the freemium experience is more demo than functional tool.
App Performance Issues
Recurring user reports of:
- Lag during processing for longer videos
- Captions going out of sync after manual edits
- App crashes during export, particularly on older iOS devices
- Slow re-rendering when adjusting caption timing
These are common complaints for any compute-intensive mobile app, but they appear frequently enough in user reviews to be noted as persistent issues rather than isolated incidents.
UI Complexity
“Feature bloat” is a common characterization in reviews from new users. Captions has added features continuously — captioning, editing, AI Twin, dubbing, social scheduling, B-roll generation, music generation — and the interface reflects that accumulation. New users report a steep initial discovery curve before finding the workflow that works for their use case.
AI Edit Quality Variance
AI Edit works best on content that matches the training distribution of the selected style. Talking-head motivational content processed with the “viral motivational” style produces reliably good results. Highly visual or action-based content, interviews with multiple speakers, or content with unusual framing can produce inconsistent or low-quality auto-edits.
AI Twin Limitations
The AI Twin is a generation model, and it carries the limitations of generation: extended monologue scripts, unusual lighting, or high-motion backgrounds can degrade avatar quality. The Mirage foundation model’s UGC focus helps, but generated avatar video remains distinguishable from real footage under careful inspection.
Controversies and Concerns
Data Privacy and Face Data
AI Eye Contact, AI Twin, and the Mirage foundation model all involve processing or learning from personal biometric data — face geometry, eye movement, voice characteristics. Mirage’s privacy policy governs how this data is used, but the collection of face and voice data raises questions that have materialized for comparable products in other markets.
No documented regulatory actions against Captions/Mirage on biometric data grounds as of this writing. The concern is speculative but worth noting for enterprise or regulated-industry users.
Synthetic Identity and Deepfake Risk
AI Twin creates convincing personalized avatar videos from a brief recording. The legitimate uses are clear; the misuse potential is equally clear. Mirage has not published explicit documentation of abuse-prevention measures for AI Twin in the same way some avatar vendors (e.g., HeyGen) have.
CapCut Feature Overlap and TikTok Ecosystem
For creators deeply embedded in TikTok, CapCut is the native editing choice. Captions must continuously differentiate its AI features to justify its cost premium over a free competitor. If CapCut improves its AI Eye Contact equivalent or adds a comparable AI Twin, Captions’ differentiation narrows significantly.
Captions vs. AI Video Generation Models
This review has deliberately avoided comparing Captions directly to HappyHorse-1.0, Seedance 2.0, Kling AI, or Veo 3.1 on generation quality benchmarks. Those comparisons are largely category errors: Captions is an editing tool, not a generation model.
The cleaner framing:
| Dimension | Captions | Generation Models (Seedance, HappyHorse, etc.) |
|---|---|---|
| Primary input | Footage you already recorded | Text prompt or image |
| Primary output | Edited version of your footage | Synthesized video from scratch |
| Target user | Individual creator with a phone | Developer, enterprise, AI studio |
| Benchmark | Not ELO-ranked | Artificial Analysis Video Arena |
| Access | App (iOS/Android/web) | API |
| Pricing entry | $0 freemium | Per-second or per-generation API cost |
There is one overlap point: Captions’ AI Twin and B-roll generation features involve synthesis of video from text/prompts. In those features, Captions is a direct (if specialized) competitor to the generation models. But for the core editing use case, they serve different workflows entirely.
For a creator workflow that starts with “I recorded a video on my phone,” Captions is a more complete solution than any pure generation model. For a workflow that starts with “I want to create a video without ever recording anything,” the generation models are more capable — but Captions’ AI Twin bridges that gap for personalized talking-head content.
MCP Integration Context
The Captions MCP server’s March 2026 launch places it ahead of every AI video generation model in MCP readiness. As of this writing, none of Seedance 2.0, HappyHorse-1.0, Kling AI 3.0, or Veo 3.1 have official MCP servers. Several have community-built integrations, but no official server maintained by the developing company.
For developers building AI-powered content workflows, the Captions MCP server enables LLM-native integration for documentation retrieval. The practical value compounds as more agentic tools adopt MCP as a standard integration layer: a content production agent that understands Captions’ current capabilities without requiring manual documentation review is a meaningful workflow efficiency.
The scope limitation — documentation search rather than operational control — is a constraint worth noting. An MCP server that could programmatically submit videos for captioning, trigger dubs, or export edits would be substantially more powerful for automation use cases. Whether Mirage expands MCP scope to include operational tools is an open question.
Rating: 4 out of 5
What earns the 4:
- Clear market leadership in AI-first short-form creator editing
- AI Eye Contact and AI Twin are genuinely differentiated capabilities with no direct equivalent in competitor apps
- Official MCP server — the only AI video tool with one, as of May 2026
- Freemium accessibility removes the try-before-buy friction
- $175M raised, strong unit economics (per General Catalyst), product-market fit confirmed by scale
- Cross-platform (iOS, Android, web) with no geographic restrictions
- Mirage foundation model represents original AI research investment, not just wrapper tooling
Why it’s 4 and not 5:
- Credit-based pricing creates unpredictability; no rollover is a consistent user complaint
- App stability issues (crashes on export, caption sync drift) are documented and persistent
- AI Edit quality variance — works well for standard formats, inconsistent for unusual content
- MCP server currently limited to documentation search, not operational control
- UGC foundation model quality claims are company-stated without independent benchmark validation
- AI Twin has deepfake misuse potential without documented abuse controls
Who Should Use Captions
Strong fit:
- Individual creators producing regular short-form content for TikTok, Reels, YouTube Shorts
- Multilingual creators who want to reach audiences in languages beyond their native one
- Creators who read from scripts and want to appear camera-engaged without teleprompter setups
- Small teams that need to scale talking-head content production without proportional recording time
- Developers building LLM-powered content workflows who want MCP-integrated documentation access
Weaker fit:
- Long-form content creators (podcast editors, documentary filmmakers) — Descript is better suited
- Users who need a full multi-track timeline editor — CapCut or professional tools offer more control
- Enterprise legal/compliance environments where face biometric data collection requires vendor vetting
- Budget-constrained creators who need free unlimited use — CapCut remains the free standard
- Developers who need operational API control over video processing at scale — the Captions API serves this but the MCP server doesn’t yet
Summary
Captions is a mature, well-funded, product-market-fit-proven AI editing tool that occupies a distinct segment from the headline-grabbing AI video generation models. It solves a different problem — transforming recorded footage into publication-ready short-form content — and it solves that problem better than any direct competitor currently available.
The Mirage rebrand signals an ambition to move upstream into foundation model territory. The UGC foundation model is a genuine research direction, not just a marketing reframe. Whether Mirage can compete with Alibaba, ByteDance, Google, and Kuaishou on model quality at scale is an open question. As a consumer product company that built its way to $175M raised and 3M+ monthly videos by solving a specific creator problem with exceptional focus, however, Mirage has earned the right to make that ambition credible.
The official MCP server — launched three months before this review — is a quiet signal. The AI video generation companies with hundred-million-parameter models and billion-dollar backing don’t have one. Captions does. That reflects either early mover instinct or product culture that pays attention to where the ecosystem is heading. Given the trajectory of the company, probably both.
ChatForest reviews are research-based. We analyze AI tools from public sources, company announcements, technical documentation, and user feedback. We do not test tools hands-on or access paid features to produce reviews.