At a glance: Microsoft MAI model family, released April 2–14, 2026 via Microsoft Foundry. Four models spanning speech recognition, voice synthesis, and image generation — all built by Mustafa Suleyman’s superintelligence team formed just months earlier. MAI-Transcribe-1: 3.88% WER across 25 languages (#1 FLEURS), $0.36/hr. MAI-Voice-1: 60 seconds of audio in under 1 second, $22/1M characters. MAI-Image-2: #3 Arena.ai at launch (Elo 1,326), $5/$33/1M tokens. MAI-Image-2-Efficient: 22% faster, 4x GPU throughput, $5/$19.50/1M — 41% cheaper output. Part of our AI Models & Companies reviews.
Microsoft did not build these models because it ran out of things to do with OpenAI’s API. It built them because it had to.
The September 2025 renegotiation of the original 2019 Microsoft-OpenAI partnership removed a clause that had effectively prohibited Microsoft from building its own broadly capable AI models. That clause was the price of early exclusive hosting rights. Once it was gone, Mustafa Suleyman’s team — formed in November 2025, officially called the Microsoft AI Superintelligence division — had a clear runway. In April 2026, less than five months after formation, they shipped four production models in twelve days.
The question worth asking is not whether the models are impressive given the timeline. They are. The question is whether they are good enough to matter in a market that already has OpenAI, Google, ElevenLabs, and Midjourney in their respective lanes.
On the speech side, the answer is clearly yes. On the image side, the launch positioning was stronger than sustained performance. On the TTS side, the answer is provisional — the English-only limitation is significant, and the preference benchmarks that would prove it out against ElevenLabs are not yet published.
Why Microsoft Built Its Own Models
The OpenAI partnership created a specific dependency problem. Microsoft was paying OpenAI licensing fees for DALL-E to power Bing Image Creator, PowerPoint Designer, and Copilot’s creative features. When MAI-Image-2 replaced DALL-E in those surfaces, Microsoft retained all the margin rather than paying per-call licensing costs at scale. SiliconANGLE called this move “the clearest signal that Redmond is serious about building a self-sufficient AI stack.”
The April 27, 2026 formalization of the partnership restructuring confirmed the direction: OpenAI exclusivity ended, OpenAI can now sell on AWS and other clouds, and Microsoft retained an IP licence through 2032 alongside its ~27% equity stake. The model launches came before the legal formalization — Microsoft signaled its intent with shipped product, not an announcement.
The team itself is deliberately structured as a “lean, talent-dense” group — Suleyman’s words. The Next Web reported that MAI-Transcribe-1 was built by roughly ten people. Ali Farhadi, former AI2 CEO, joined the team in March 2026. The twelve-day gap between the first three models (April 2) and the fourth (April 14) reflects an internal shipping cadence more typical of a startup than a division of a $3 trillion company.
Suleyman’s stated mission for the division is “Humanist Superintelligence (HSI)": AI capabilities that “always work for, in service of, people and humanity more generally.” Whether that framing shapes product decisions or is aspirational brand copy is a question the next several model releases will answer.
Microsoft Foundry
All four MAI models are available through Microsoft Foundry — the January 2026 rebrand of Azure AI Foundry (previously Azure AI Studio). Dropping “Azure” from the name signals that the platform extends beyond Azure to Microsoft 365 and Fabric. Agents are first-class citizens in the platform as of this rebranding.
Access methods:
- Web portal: ai.azure.com
- REST APIs: Images API, Chat Completions API, Responses API
- SDKs: Python, JavaScript/TypeScript, .NET, Java, Rust
- OpenAI-compatible endpoints (same endpoint structure as OpenAI’s API)
- VS Code Extension: Microsoft Foundry Toolkit with curated model playground, no-code agent builder, and one-click deployment
- MAI Playground: microsoft.ai — no waitlist required
Azure regions for image models: East US 2, South Central US, Sweden Central, Poland Central. Speech models: East US and West US via the Azure Speech service.
The Foundry catalog is expansive — 100+ models including GPT-5.5, Claude Opus 4.7, Gemma 4, and the MAI family side by side. For enterprises already running on Azure, the friction to try MAI models is low.
MAI-Transcribe-1 — Speech-to-Text
What It Is
MAI-Transcribe-1 is Microsoft’s first-party speech recognition model: “first-generation speech recognition model, delivering enterprise-grade accuracy across 25 languages at approximately 50% lower GPU cost than leading alternatives.” Released April 2, 2026 (Public Preview).
Supported languages (25 total): English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Japanese, Korean, Mandarin Chinese, Hindi, Arabic, Polish, Turkish, Swedish, Danish, Finnish, Norwegian, Czech, Romanian, Ukrainian, Hungarian, Thai, and Indonesian.
Benchmarks
| Model | FLEURS WER (25 languages avg) |
|---|---|
| MAI-Transcribe-1 | 3.88% |
| GPT-Transcribe | 4.17% |
| ElevenLabs Scribe v2 | 4.32% |
| Gemini 3.1 Flash-Lite | 4.89% |
| Whisper-large-v3 | 7.60% |
The FLEURS benchmark covers 25 languages in standardized audio conditions. MAI-Transcribe-1 leads overall and leads in 11 of the 25 individual language tracks. It outperforms Gemini 3.1 Flash-Lite on 11 of the remaining 14. Arabic (10.1% WER) and Danish (13.2%) are relatively weaker.
These are not self-reported numbers. FLEURS is a public benchmark with consistent conditions across models. The gap over OpenAI and ElevenLabs is real and meaningful for multilingual production deployments.
Pricing
$0.36 per hour of audio. This is roughly 50% of what comparable large-cloud alternatives charge per GPU-hour of comparable compute. For high-volume transcription workloads — call center analytics, media archiving, multilingual customer support — the unit economics are favorable.
How to Access
Via the Azure Speech service. Audio format support: WAV, MP3, FLAC, up to 300 MB per file. The model card (PDF available at microsoft.ai) covers the full technical specification.
Limitations
- Batch processing only — no real-time streaming transcription at launch. Synchronous voice interfaces and live captioning workflows require the existing Azure Speech real-time service in parallel.
- No speaker diarization — who said what requires separate Azure services or post-processing pipelines.
- No context biasing — domain vocabulary injection (e.g., legal terms, product names) is listed as “coming soon.”
- Azure-only — not available on other cloud providers.
Integration Roadmap
Microsoft confirmed planned integration with Copilot Voice mode (phased rollout) and Microsoft Teams. The model is already accessible through the Azure Speech service for developers building new applications.
Assessment: For multilingual enterprise transcription, MAI-Transcribe-1 is the best-priced option at this accuracy tier on major cloud infrastructure. The batch-only constraint is a real limitation for live applications, but for the use cases it covers — media, analytics, archiving, asynchronous processing — it delivers genuine value.
MAI-Voice-1 — Text-to-Speech
What It Is
MAI-Voice-1 is Microsoft’s text-to-speech model: “high-fidelity speech generation model capable of producing 60 seconds of expressive audio in under one second on a single GPU.” Released April 2, 2026 (Public Preview). Already powering Copilot’s Audio Expressions and podcast features.
Performance
The headline spec — 60 seconds of audio in under one second on a single GPU — is a GPU efficiency number, not a latency guarantee in API conditions. The practical implication is that at scale, the model runs cheaply per unit of audio generated. Long-form content (audiobooks, podcast narration) is feasible without the cost curve that has historically made neural TTS expensive at volume.
Microsoft has not published MOS (Mean Opinion Score) scores or head-to-head preference test results against ElevenLabs or OpenAI TTS. ElevenLabs holds 4.14–4.5 MOS ratings on 2026 leaderboards. Without equivalent MAI-Voice-1 MOS data, direct comparison on naturalness is not possible.
What is documented: the model is optimized for long-form content and described as preserving “speaker identity across long-form content” — an important property for audiobook narration where voice consistency over hours matters.
Technical Specs
- Preset voices: 6 named presets — Jasper, June, Grant, Iris, Reed, Joy
- Emotion styles (SSML): 4 — neutral, joy, excitement, empathy
- Voice cloning: Requires a 10-second audio sample; gated approval process applies (consent verification required)
- API voice name format:
en-us-{Name}:MAI-Voice-1 - Language support: English-only at launch; 10+ languages “coming soon”
- Architecture: Not disclosed
Pricing
$22 per 1 million characters. Comparable to mid-tier ElevenLabs plans, lower than OpenAI’s HD TTS pricing for high-volume use. The 700+ voices in the broader Azure Speech ecosystem remain accessible alongside the MAI-Voice-1 quality tier.
Limitations
- English-only at launch — this is the most significant constraint. Multilingual TTS pipelines cannot switch to MAI-Voice-1 yet.
- Limited emotion palette — four styles covers basic tonal variation but not the nuanced expression controls available in ElevenLabs or Hume.
- No published MOS data — naturalness claims are qualitative; independent benchmarking has not yet produced comparable numbers.
- Voice cloning gated — Personal Voice feature requires explicit consent verification and approval, adding friction for rapid deployment.
Assessment: MAI-Voice-1 has a clear use case in Microsoft-centric enterprise environments where TTS cost at scale matters and English-language output is sufficient. For multilingual, highly expressive, or benchmark-validated applications, ElevenLabs remains the reference point until MAI-Voice-1 publishes preference data and expands language support.
MAI-Image-2 — Text-to-Image
What It Is
MAI-Image-2 is Microsoft’s text-to-image generation model, built on a flow-matching diffusion architecture. It debuted on Arena.ai before the April 2 announcement, reaching #3 on the image model family leaderboard. It now powers Bing Image Creator and PowerPoint Designer, replacing the OpenAI DALL-E integration those surfaces previously used.
WPP Global CCO Rob Reilly called it “a genuine game-changer” that “responds to the intricate nuance of creative direction.” That is an enterprise partner testimonial, not an independent benchmark, but it confirms the model went through production validation at scale before the public announcement.
Benchmarks
| Model | Arena.ai Elo (at MAI-Image-2 launch, March 2026) |
|---|---|
| OpenAI DALL-E 3 (reference) | ~1,350+ |
| MAI-Image-2 | ~1,326 (#3) |
| (other models) | — |
| Metric | MAI-Image-2 | Notes |
|---|---|---|
| Arena.ai Elo at debut | ~1,326 (#3) | March 2026 soft debut |
| Elo by mid-April 2026 | ~1,184 | Slipped after GPT-Image-2 raised the field |
| VQA v2 accuracy | 89.2% | Visual question answering |
| TextVQA accuracy | 76.8% | In-image text comprehension |
| Speed vs. prior gen | 2x faster | Internal benchmark |
| Generation time | Under 3 seconds | 1024×1024, production conditions |
The Elo slippage from 1,326 to ~1,184 over the first six weeks matters. By April 21, 2026, ChatGPT Images 2.0 (a reasoning-integrated image model) reached 1,512 Elo — a 242-point gap. MAI-Image-2 no longer holds a top-3 position as of mid-May. Arena.ai Elo is a moving leaderboard, and the current frontier moved fast.
Technical Specs
- Architecture: Flow-matching diffusion model
- Training infrastructure: Azure, thousands of NVIDIA H100 GPUs
- Output resolutions: 1024×1024 (square), 1365×768 (landscape), 768×1365 (portrait); max 1,048,576 pixels per image; minimum 768px per side
- API endpoint:
https://<resource>.services.ai.azure.com/mai/v1/images/generations
Pricing
- Text input: $5 per 1M tokens
- Image output: $33 per 1M tokens
Strengths
The model demonstrates genuine strength in photorealistic commercial imagery — natural lighting, accurate skin tones, and cinematic composition. Text rendering in images improved substantially over the previous Microsoft generation (+115 Elo points on text rendering category). For enterprise marketing content, product visuals, and presentation imagery, it performs in the range of commercial expectations.
Limitations
Real-world testing documented in post-launch reviews reveals a consistent pattern of issues:
- Human anatomy inconsistency — distorted hands and facial features appear on complex multi-subject prompts; a persistent problem for diffusion-based models at this resolution tier
- Content filtering — aggressive filtering rejects some legitimate artistic and creative requests; 30-second cooldown between generations in the playground
- Cultural bias — produces results biased toward Western aesthetics; struggles with non-English prompts and non-Western subjects
- No print-quality resolution — maximum 1024×1024 (square) without external upscaling
- Complex multi-subject prompts — performance degrades when composing multiple distinct characters or objects with precise spatial relationships
- vs. Midjourney: Midjourney v6.1 produces more consistent and detailed results for photorealistic portraits, per independent testing
- vs. DALL-E 3: DALL-E 3 maintains superior accuracy for text rendering per independent review, despite Microsoft’s own text rendering claims
Assessment: MAI-Image-2 is a solid enterprise image generation option — better than nothing and better than older DALL-E versions for Microsoft-centric deployments. At its April 2 debut it was genuinely competitive at #3 on Arena.ai. By May 2026, the image generation frontier has moved, and it no longer holds that position. For commercial imagery pipelines where cost and Azure integration matter, it is a reasonable choice. For creative professionals who need maximum quality, Midjourney remains the reference.
MAI-Image-2-Efficient — The Fast Variant
What It Is
Released twelve days after MAI-Image-2 (April 14, 2026), MAI-Image-2-Efficient is the cost-optimized, high-throughput variant of the flagship. No waitlist. Immediate availability.
SiliconANGLE called the twelve-day turnaround evidence that the MAI team “operated more like a startup shipping iterative products than a traditional corporate research lab.”
Benchmarks
| Metric | MAI-Image-2-Efficient vs. MAI-Image-2 | vs. Competitors |
|---|---|---|
| Speed | 22% faster | 40% better latency vs. Google Gemini 3.1 Flash |
| Throughput per GPU | 4x (normalized by latency + GPU usage) | — |
| Visual characteristics | Sharper, more defined lines | vs. MAI-Image-2’s smoother gradients |
The 40% latency advantage over Gemini 3.1 Flash was measured at p50 latency with optimized batch sizes on NVIDIA H100 at 1024×1024 resolution (April 13, 2026 test conditions).
Pricing
- Text input: $5 per 1M tokens (same as flagship)
- Image output: $19.50 per 1M tokens — 41% cheaper than MAI-Image-2’s $33/1M
Visual Differences
Microsoft describes a deliberate stylistic split:
- MAI-Image-2-Efficient: Sharp, defined lines — better for illustrations, product visuals, high-clarity photoreal images
- MAI-Image-2 (flagship): Smoother, more nuanced contrast — better for photorealistic imagery with soft gradients and atmospheric effects
This is a meaningful distinction. The “Efficient” label is not a downgrade — it is a different tool with a different visual signature optimized for different content types.
Use Cases
- High-volume production — e-commerce product imagery, media asset generation, marketing at scale
- Real-time or conversational experiences — chatbots, interactive design tools, synchronous generation pipelines
- Rapid prototyping — creative iteration where speed matters more than final output quality
Assessment: MAI-Image-2-Efficient is arguably the more interesting of the two image models for developers. The 41% price cut on output tokens is substantial at volume, the throughput advantage is real, and the sharper visual style is appropriate for a large fraction of commercial use cases. For high-volume image generation workloads on Azure, it should be the default starting point.
Summary Benchmark Table
| Model | Type | Release | Pricing | Key Benchmark | Languages | Status |
|---|---|---|---|---|---|---|
| MAI-Transcribe-1 | STT | April 2, 2026 | $0.36/hr | 3.88% WER (#1 FLEURS) | 25 | Public Preview |
| MAI-Voice-1 | TTS | April 2, 2026 | $22/1M chars | 60s audio in <1s / 1 GPU | English only | Public Preview |
| MAI-Image-2 | Text-to-image | April 2, 2026 | $5/$33/1M tokens | #3 Arena.ai at debut | N/A | Public Preview |
| MAI-Image-2-Efficient | Text-to-image | April 14, 2026 | $5/$19.50/1M tokens | 22% faster, 4x GPU throughput | N/A | Public Preview |
Who Should Use These Models
MAI-Transcribe-1 is the clearest recommendation in the family. If your workload is multilingual batch transcription at scale — call analytics, media archiving, customer voice processing — and you are already on Azure, it leads the FLEURS benchmark and undercuts competitors on pricing. The streaming gap is a real constraint for live applications.
MAI-Voice-1 is worth evaluating if you are building English-language TTS for long-form content or enterprise voice interfaces and need Azure-native integration. The language limitation makes it a non-starter for multilingual products today. The lack of published MOS data means you should run your own listening tests before committing.
MAI-Image-2-Efficient is the practical choice for Azure-native image generation at volume. Forty-one percent cheaper output than the flagship, 4x throughput, no waitlist. For e-commerce, marketing, and product imagery workflows, start here.
MAI-Image-2 (flagship) is appropriate when output quality on photorealistic imagery with soft lighting and gradients matters more than throughput cost. The Arena.ai ranking has slipped since launch — evaluate against ChatGPT Images 2.0 and Midjourney v6.1 before committing to the flagship for creative-quality pipelines.
The Bigger Picture
The MAI family is the first tangible output of a bet Microsoft made when it renegotiated the OpenAI contract in September 2025. The bet is that Microsoft can build world-class AI models internally — not just distribute other labs’ models through Azure.
The speech results validate that bet. MAI-Transcribe-1 at 3.88% WER, built by ten people in a few months, leading the FLEURS leaderboard above OpenAI and ElevenLabs, is a meaningful result. It suggests the MAI team can move fast and ship quality.
The image results are more complex. Reaching #3 on Arena.ai at debut was a genuine milestone. Slipping from that position as the field moved in six weeks is a reminder that the image generation market is faster-moving than the speech market. The MAI team’s iterative shipping cadence — flagship on April 2, efficient variant twelve days later — suggests they are aware of this and are building the rhythm to compete.
The TTS story is still being written. English-only at launch is a constraint that limits the market the model can address today. Language expansion and MOS benchmarks will be the next inflection points.
What is not in question is the strategic direction. Microsoft is building its own models, deploying them at scale in its own products, and shipping them to developers through Microsoft Foundry — in parallel with the broader model catalog rather than instead of it. The OpenAI partnership is restructured, not ended. The equity stake and IP licence remain. But Suleyman’s team has made clear that Microsoft intends to be a model producer, not just a model distributor.
Overall rating: 4/5. Exceptional STT debut; provisionally strong TTS; solid image generation with a competitive efficient variant. English-only TTS and batch-only transcription at launch limit the addressable use cases today. The shipping velocity and benchmark quality — given the team’s age and size — points toward continued improvement.
Related Reviews
- Gemini 3.1 Pro Review — Google’s frontier model with 94.3% GPQA Diamond
- Claude Opus 4.7 Deep Dive — Anthropic’s top-tier reasoning model
- Mistral Voxtral TTS Review — open-weights TTS with 68.4% preference win rate over ElevenLabs (coming soon)
- Microsoft Semantic Kernel Framework Review — Microsoft’s agent orchestration framework