Grok Imagine — xAI’s Aurora Model Hits #1 on Every Video Leaderboard

In October 2025, xAI — Elon Musk’s AI company, founded March 2023 — shipped a video generation feature inside Grok called Imagine, powered by a model named Aurora. It arrived with something its most prominent competitors had taken months to add: native audio from day one. Six-second videos with synchronized dialogue, sound effects, and ambient audio, generated in under 15 seconds on a cold prompt.

By February 2026, Grok Imagine 1.0 was generating 10-second clips at 720p. By the time Artificial Analysis announced its Video Arena rankings in early 2026, Grok Imagine had reached #1 in both text-to-video and image-to-video — the first model to simultaneously hold both positions, ahead of Runway Gen-4.5, Kling 2.5 Turbo, and Google Veo 3.1.

The path there involved an unusual architectural choice, a controversial content moderation failure, the world’s largest AI supercomputer, and a distribution channel that no other AI video company could replicate: 600 million monthly active users on X (formerly Twitter).

This is a research-based review of Grok Imagine and the Aurora model — covering architecture, capabilities, pricing, API access, MCP integration, competitive positioning, and the deepfake controversy that drew regulatory investigations on three continents. We do not test AI video tools hands-on; we analyze from public sources, company announcements, independent benchmarks, and technical documentation.


The Company: xAI (a SpaceX Subsidiary)

From Startup to $250 Billion in 35 Months

xAI was founded in March 2023 by Elon Musk, who assembled a team drawn from OpenAI, Google DeepMind, Microsoft Research, and Tesla. The company’s publicly stated mission: “to understand the true nature of the universe.” In practice, its first product — Grok-1, a chatbot for X Premium subscribers — launched in November 2023. External API access opened in late 2024.

The company’s financial history is among the most compressed in technology: from founding to a $250 billion valuation in under three years.

Round Amount Valuation Date
Seed/A ~$134M ~$18B 2023
Series C ~$6B $50B November 2024
Series D ~$6B $75–80B July 2025
Series E $20B $230B January 2026
SpaceX acquisition all-stock $250B (xAI) February 2026

In February 2026, SpaceX acquired xAI in an all-stock transaction. xAI’s $250 billion valuation combined with SpaceX’s existing value produced an entity valued at approximately $1.25 trillion at announcement. Michael Nicolls, former SpaceX Starlink VP, became xAI’s president following the merger.

Employees: approximately 1,518 as of 2026. This is a lean count relative to the company’s scale — xAI has consistently prioritized infrastructure and model capability over headcount growth.

Colossus: The Hardware Foundation

The defining infrastructure advantage that separates xAI from other AI video companies is Colossus, the supercomputer cluster built in Memphis, Tennessee. Phase 1 came online in September 2024 — 100,000 NVIDIA H100 GPUs assembled in 122 days. By February 2026, Colossus had expanded to 555,000+ NVIDIA GPUs, including the newer GB200 architecture. Colossus 2, a second facility, is under construction with a planned capacity of 550,000 NVIDIA chips on completion.

No other AI video company — Runway, Kling’s Kuaishou, Google DeepMind, or OpenAI — operates inference and training infrastructure at this scale independently. Aurora’s quality and generation speed are direct products of that compute advantage.

For more on xAI’s corporate history, API pricing for LLM models, and the SpaceX merger, see our xAI Grok API review.


Aurora: Architecture of the #1 Video Model

Not Diffusion — Autoregressive MoE

Every major AI video model released before Aurora — Sora, Veo, Kling, Runway Gen-4, HappyHorse-1.0 — used some variant of diffusion: a process that iteratively denoises a random noise input, guided by a text or image conditioning signal, until the noise becomes a coherent image or video frame. Diffusion models dominate the video generation space because they produce high-quality results at high resolution and are well understood by researchers.

Aurora uses a fundamentally different architecture: Autoregressive Mixture-of-Experts (MoE).

The autoregressive approach treats image and video generation the same way a large language model treats text — predicting the next token from a stream of interleaved text and image data, one token at a time, left to right, conditioned on everything that came before. Images are tokenized into discrete tokens that occupy the same stream as text tokens. The model has seen text and image tokens interleaved during training and has learned to generate either, or both simultaneously.

The Mixture-of-Experts component means that instead of activating the entire neural network for every generation, Aurora routes inputs through specialized sub-networks (“experts”). A photorealistic portrait activates one expert cluster. An animated scene activates a different cluster. An audio synthesis task routes to a third. This selective activation dramatically reduces inference compute requirements per generation — which is why Aurora can produce a 6-second video in approximately 15 seconds while diffusion-based competitors measuring minutes per clip.

Aurora was trained on xAI’s Colossus supercomputer using 110,000 NVIDIA GB200 GPUs. xAI has not disclosed a parameter count for Aurora, and the model’s architecture paper (if one exists) has not been publicly released. What xAI has disclosed is that Aurora was trained on interleaved text, image, and video data at a scale consistent with the Colossus infrastructure.

The practical implication of the autoregressive approach for video: because the model generates tokens in a left-to-right sequence where each token conditions on all prior tokens, it has inherently strong temporal consistency — the same mechanism that makes LLMs coherent sentence-to-sentence applies to visual frames second-to-second. This is structurally different from the way diffusion models achieve temporal consistency (through flow-matching or temporal attention mechanisms applied across a pre-denoised frame sequence).


Product Timeline: Image to Video to #1

July 2025: Image Generation Launches

Grok Imagine launched on July 28, 2025 as an image generation feature inside the Grok interface on X and at grok.com. Aurora first appeared here — generating images from text prompts with what reviewers noted as unusually strong photorealistic quality and prompt adherence relative to established image generators.

The image-only launch gave xAI time to test Aurora’s token generation pipeline at scale before adding the higher-compute video generation workflows.

October 2025: Imagine v0.9 — Video Arrives With Audio

October 7, 2025: xAI released Grok Imagine v0.9, the first video generation build. Key specifications at launch:

  • Duration: up to 6 seconds per clip
  • Resolution: 480p (free users), 720p (SuperGrok subscribers)
  • Audio: native — dialogue, sound effects, and ambient audio synchronized to video actions, generated in a single pass
  • Generation time: approximately 15 seconds for a 6-second clip
  • Access: SuperGrok subscribers via waitlist; free tier with reduced resolution

The audio-from-day-one launch was significant. Google’s Veo 3 had launched native audio in May 2025. Runway Gen-4 had been silent at launch in March 2025, with Gen-4.5’s native audio arriving in December 2025. Kling and HappyHorse launched audio as separate additions. Aurora’s single-pass text+audio+video generation — a consequence of the autoregressive architecture that treats all modalities as interleaved token streams — meant audio was not bolted on but was part of the original design.

February 2026: Grok Imagine 1.0 — 10 Seconds at 720p

February 3, 2026: Grok Imagine 1.0 shipped with the following changes:

  • Duration extended from 6 seconds to 10 seconds
  • 720p now the standard output at SuperGrok tier (locked; free users remain at 480p)
  • Dramatically improved visual quality relative to v0.9 community assessments
  • Generation time reduced from ~15 seconds to under 10 seconds for standard requests

In the same period — late January 2026 — xAI launched the Grok Imagine API, opening programmatic access to video, image, and audio generation workflows for developers.

March 2026: Extend from Frame

March 2, 2026: xAI added the “Extend from Frame” feature, allowing users to use the final frame of one generation as the starting point of the next. This enables chaining clips into longer sequences. Community testing noted that visual quality degrades visibly after two or three extensions — temporal consistency remains strong within a single generation but drifts across extended chains. xAI has not released a timeline for an improved multi-clip coherence system, though the company’s stated goal of 30-minute generation by late 2026 implies significant upcoming work in this area.

March 2026: SuperGrok Lite

March 25, 2026: xAI introduced SuperGrok Lite at $10/month as the new entry-level paid tier, providing basic image and video generation at 480p and 6-second duration. This positioned Grok Imagine across three price tiers: free (480p, 6s), Lite ($10/mo, 480p, 6s + additional features), and SuperGrok ($30/mo, 720p, 10s, ~100 video renders/day).

April 2026: 1080p Announced

In late April 2026, xAI announced that Grok Imagine would soon support full HD 1080p output. As of this writing (May 2026), 720p remains the current maximum native resolution. The 1080p upgrade is anticipated but not yet live.


Rankings: #1 in Both Arenas

Grok Imagine’s arrival at the top of the Artificial Analysis Video Arena leaderboard was confirmed in early 2026, following the Grok Imagine 1.0 release. Artificial Analysis announced the rankings publicly via X and updated their video leaderboard:

“xAI’s Grok Imagine takes the #1 spot in both Text to Video and Image to Video in the Artificial Analysis Video Arena, surpassing Runway Gen-4.5, Kling 2.5 Turbo, and Veo 3.1.”

Image-to-Video ELO: 1,336 — ranking details from available sources:

Model Image-to-Video Elo
Grok Imagine 1,336 (#1)
Runway Gen-4.5 ~1,247
Kling 2.5 Turbo ~1,230
Veo 3.1 ~1,226
HappyHorse-1.0 1,397 (silent only)

Note: Artificial Analysis maintains separate leaderboards for silent and audio-capable models; ELO numbers above reflect the audio/combined leaderboards where applicable. Rankings change regularly — verify current standings at artificialanalysis.ai.

The Artificial Analysis rankings involved greater than 15,000 pairwise comparisons, providing statistical credibility to the top-position claim. Separately, the DesignArena by Arcada Labs gave Grok Imagine a 1,336 Elo score with a 69.7% win rate — 33 points clear of #2 in that arena at the time of reporting.

One caveat: the leaderboard positions reported here reflect the state at Grok Imagine 1.0 launch in early 2026. Kling 3.0 (released after our data collection window) may have affected the text-to-video position. Verify current rankings directly at artificialanalysis.ai before using these numbers for procurement decisions.


Capabilities and Workflows

Grok Imagine supports five workflows as of v1.0:

1. Text-to-Video

Type a text prompt; receive a video clip. The Aurora architecture’s single-pass generation produces synchronized audio alongside the video — ambient sounds, sound effects, and dialogue-like vocalizations emerge from the same generation, not from a post-processing step.

Aspect ratios: horizontal (16:9) and vertical (9:16, for social media), with standard (4:3) support. Generation at any supported aspect ratio in one pass.

Duration: up to 10 seconds. Resolution: 720p (SuperGrok) or 480p (free/Lite).

Generation time: approximately 10–15 seconds. This is substantially faster than diffusion-based competitors, which range from 30 seconds (Runway Gen-4 Turbo) to several minutes for higher-quality models. The speed difference reflects the MoE routing efficiency.

2. Image-to-Video

Supply a reference image; the model generates a video clip consistent with the image’s content, style, and subject. The #1 image-to-video Elo score on Artificial Analysis reflects that Aurora’s autoregressive conditioning on image tokens produces stronger visual consistency than competing approaches — the starting image’s characteristics are preserved throughout the clip rather than drifting.

3. Video-to-Video

Supply an existing video; the model generates a modified version — style transfer, background changes, character modifications, or motion alteration. This is Aurora’s most computationally demanding workflow.

4. Image Edit

Supply an image and a text instruction; Aurora modifies the image without regenerating the full composition. Built-in editing for background changes, outfit swaps, object additions, and lighting adjustments. This uses Aurora’s image-token conditioning as a reference anchor.

5. Extend from Frame

Uses the final frame of an existing Grok Imagine generation as the starting frame of the next clip, enabling sequential chaining. As noted, quality degrades across more than two or three extensions. xAI has not disclosed when or whether a dedicated long-form coherence system will replace the frame-based chaining approach.


MCP Integration

xAI does not publish an official MCP server for Grok Imagine video generation.

xAI operates an Agent Tools API that enables server-side tool execution for Grok the chatbot — including web search, X search, code execution, and remote MCP connection. This is an LLM-level feature of the Grok chatbot, not a dedicated video generation MCP interface.

For the video generation API specifically, the available integration paths are:

  • Grok Imagine API (x.ai/api/imagine) — official REST API with Python SDK; text-to-video, image-to-video, video editing, and audio endpoints. Video output URLs are temporary and must be downloaded promptly.
  • Community MCP servergithub.com/merterbak/Grok-MCP — unofficial, community-built MCP server for xAI’s Grok API including image and video generation. Not supported by xAI.
  • Replicatereplicate.com/xai/grok-imagine-video — third-party hosted API endpoint.
  • fal.ai — Grok Imagine available on the fal inference platform.

For developers building MCP-enabled workflows that include AI video, Runway’s official MCP server (github.com/runwayml/runway-api-mcp-server) and Captions AI’s official MCP server (captions.ai/help/api-reference/mcp) currently offer more formal support channels.


Pricing

Consumer Tiers

Tier Price Video Resolution Duration Renders
Free $0 480p 6s Limited
SuperGrok Lite $10/mo 480p 6s More than free
SuperGrok $30/mo or $300/yr 720p 10s ~100/day
X Premium+ $40/mo 720p 10s ~100/day

SuperGrok Lite launched March 25, 2026 as the new entry tier after xAI recognized that the gap between free and the $30/mo SuperGrok was limiting adoption. X Premium+ at $40/month includes Grok 4 access, making it the higher-priced but more feature-complete option for users already paying for X’s premium subscription.

API Pricing

$4.20 per minute of generated video with audio — equivalent to approximately $0.07 per second.

Competitive context:

Provider API Price (video with audio) Notes
Grok Imagine $4.20/min Aurora MoE, 720p
Google Veo 3.1 ~$12/min ~3× Grok
Runway Gen-4.5 ~$9–12/min (credits) credit-based
OpenAI Sora 2 Pro ~$30/min ~7× Grok

At $4.20/minute, xAI has priced Grok Imagine at approximately one-third of Veo 3.1 and one-seventh of Sora 2 Pro — positioning Aurora as the cost-effective default for developers building video generation applications, particularly at the #1 ELO quality tier.

The API is available at x.ai/api/imagine. Video URLs generated through the API are temporary (consistent with Runway’s 24-hour expiry approach) and must be downloaded to persistent storage by the caller.


Distribution Advantage: X Integration

One structural advantage Grok Imagine has over every other AI video model is its distribution channel. Grok is the native AI of X (formerly Twitter), which has 600 million monthly active users. When X integrated Grok — including Imagine — directly into the X interface, xAI gained a video generation product installed by default in the app that hundreds of millions of people already open daily.

No other AI video company has a comparable distribution entry point. Runway requires navigating to runway.ml, creating an account, and purchasing credits. Google Veo requires Gemini Advanced or the Vertex AI API. Kling is accessible via an app and web interface but lacks a comparable social distribution mechanism.

The X integration means that Grok Imagine’s user acquisition cost is near zero for the 600M users already on X. The conversion path from “X user” to “Grok Imagine user” is a single subscription upgrade — something no other video AI company can offer at that scale.


Controversies

Spicy Mode and the Deepfake Scandal

When Grok Imagine launched in July 2025 (image generation phase), xAI included a “Spicy Mode” setting that substantially relaxed content generation filters, permitting the creation of sexually explicit imagery. The feature survived into the v0.9 video phase in October 2025.

By late 2025 and early 2026, the consequences were severe. Analysis by content detection tool Copyleaks found Grok generating explicit images at approximately 6,700 per hour during peak periods. Cases of Grok being directed to generate non-consensual sexual imagery — including images removing or replacing clothing on photographs of real women — began appearing publicly in May 2025. By December 2025, documented trends of X users requesting non-consensual edits had reached significant scale.

Regulatory responses followed:

  • UK Information Commissioner’s Office (ICO): opened investigation
  • France: cybercrime unit investigation
  • California: Attorney General probe
  • Ireland Data Protection Commission: formal Statutory Inquiry into xAI’s historic data processing practices, opened April 2025

xAI tightened Spicy Mode restrictions following the public attention, but the restrictions came after the scandal rather than before. The episode raises substantive questions about xAI’s approach to content moderation governance: whether AI video/image generation should have guardrails designed before product launch rather than in response to documented abuse.

For enterprise procurement teams evaluating Grok Imagine: the deepfake controversy is a documented governance failure, not a speculative risk. Whether the post-scandal mitigations are sufficient depends on your organization’s risk tolerance and reputational exposure.

Data Privacy: Ireland DPC Inquiry

Ireland’s Data Protection Commission opened a formal Statutory Inquiry in April 2025 into the lawfulness of xAI’s historic data processing under GDPR. The inquiry predates the deepfake scandal and focuses on whether xAI’s use of X user data to train Grok models was lawful without explicit consent under European data protection law. As of May 2026, the inquiry remains ongoing.

This matters specifically for EU-based organizations integrating Grok Imagine API into products: if the DPC inquiry produces a finding that xAI’s training data practices violated GDPR, the legal and operational implications for downstream users of the API are uncertain.

Elon Musk Political Risk

The SpaceX acquisition of xAI in February 2026 made xAI a wholly owned subsidiary of a company whose founder holds significant political roles. Enterprise procurement decisions are increasingly factoring in vendor political alignment as a risk variable. Multiple enterprise procurement teams have been reported as having informal “no xAI” policies.

This is a business risk assessment, not a political judgment. Runway, Google, and OpenAI operate outside this specific dimension of reputational exposure. If your deployment context includes enterprise customers, government clients, or public sector work, the xAI/Musk association is a variable to evaluate with your procurement and legal teams.


Roadmap

xAI has made several public commitments about Grok Imagine’s future capabilities:

  • 1080p native resolution: announced April 2026, timeline unspecified
  • 30-minute video generation: targeted for late 2026 — a fundamental increase in long-form coherence
  • Full-length film generation: stated as a 2027 goal
  • Colossus 2: additional 550,000 GPU facility increases inference capacity available for Imagine workloads

The long-form video goals (30 minutes → full-length films) are ambitious to the point of requiring architectural advances beyond what Grok Imagine 1.0 demonstrates. Aurora’s current “Extend from Frame” approach degrades visibly after two or three extensions. Whether the same autoregressive architecture can be adapted for hour-length coherent video — or whether this requires a different approach — is not clear from public information.


Competitive Positioning

Dimension Grok Imagine Runway Gen-4.5 Kling 2.5 Veo 3.1 HappyHorse-1.0
ELO (text-to-video) #1 #2 range #3 range #4 range #1 (silent only)
ELO (image-to-video) #1 (1,336) #2 range varies varies 1,397 (silent only)
Native audio Yes (Oct 2025) Yes (Dec 2025) Yes Yes No
API cost/min (audio) $4.20 ~$10+ varies ~$12 No public API
Max resolution 720p (1080p soon) 720p (4K upscale) 1080p 720p native 1080p
Official MCP server No Yes No No No
Max duration 10s (extendable) 60s (Gen-4.5) 10s 10s 10s
Open source No No No No Apache 2.0
Generation speed ~10–15s ~30s (Turbo) varies varies not disclosed

The competitive picture: Grok Imagine wins on ELO quality, API price, and speed. It loses on resolution ceiling (720p vs. 1080p competitors), MCP integration (no official server), and maximum clip duration for single-pass generation (10s vs. Runway’s 60s). The deepfake controversy is a governance differentiator that Runway, Kling, and Google Veo do not share in the same form.


Rating: 4/5

What earns 4/5:

  • #1 ELO on Artificial Analysis for both text-to-video and image-to-video — the most comprehensive independent benchmarking of AI video quality
  • Native audio from launch — Aurora’s autoregressive architecture generates text, video, and audio in a single pass, not as a post-processing addition
  • Fastest generation speed at ~10–15 seconds per clip — 2–4× faster than diffusion-based competitors
  • Lowest API price at $4.20/min with audio — 3× cheaper than Veo 3.1, 7× cheaper than Sora 2 Pro
  • Unprecedented distribution via X’s 600M monthly active users — a structural moat no other video AI company can replicate
  • Colossus compute advantage — 555,000+ NVIDIA GPUs enabling both quality and speed not achievable at smaller infrastructure scale

What keeps it from 5/5:

  • 720p native resolution cap — HappyHorse-1.0 (Alibaba), Kling 2.5, and others native 1080p. The 1080p announcement for Grok Imagine exists but is not yet live as of May 2026.
  • No official MCP server — the fastest-growing AI video integration pattern, and xAI has not shipped one for Imagine. Runway and Captions AI have official MCP servers; xAI does not.
  • Deepfake/NSFW governance failure — the Spicy Mode scandal drew investigations from four regulatory bodies across three jurisdictions. This is a documented procurement risk, not a theoretical one.
  • Enterprise risk from Musk/SpaceX association — documented informal “no xAI” policies at enterprise procurement level create a market ceiling that other AI video providers do not face
  • Extend-from-Frame quality degradation — single generations are best-in-class; chained sequences degrade. Long-form video at quality parity with single generations is not yet demonstrated.

Grok Imagine is the most technically capable AI video model available as of May 2026, at the lowest API price of any audio-enabled competitor, with the fastest generation speed. The governance and enterprise risk concerns are real and documented. For individual creators, independent developers, and organizations with low reputational exposure, the quality-to-price ratio is the strongest in the market. For regulated enterprise, government, or public sector deployments, the deepfake scandal and DPC inquiry warrant careful legal review before integration.


This review reflects publicly available information as of May 2026. AI video model quality rankings update frequently — verify current Artificial Analysis leaderboard standings at artificialanalysis.ai. ChatForest does not test AI tools hands-on; all findings are sourced from public announcements, independent benchmarks, and technical documentation.