At a glance: Alibaba Qwen3.7-Max, released May 19, 2026. Reasoning model. 1M-token context. $2.50/$7.50 per million tokens. Anthropic API protocol compatible. GPQA Diamond: 92.4. HLE: 41.4. SWE-Verified: 80.4. Artificial Analysis Intelligence Index: 56.6 (#1 Chinese model, #5 globally). Part of our AI Models & Companies reviews.
The gap between Chinese frontier AI and Western frontier AI just got harder to measure.
Qwen3.7-Max, announced at the 2026 Alibaba Cloud Summit on May 20, edges Claude Opus-4.6 on GPQA Diamond (92.4 vs 91.3), edges it on Humanity’s Last Exam (41.4 vs 40.0), and is statistically tied on SWE-Verified (80.4 vs 80.8). It beats Opus-4.6 Max on Apex Math Reasoning by 29%. On the Artificial Analysis Intelligence Index — the aggregate benchmark that tracks overall model capability — Qwen3.7-Max scores 56.6, placing fifth globally and becoming the highest-ranked Chinese model, surpassing Gemini 3.5 Flash (55.3) in the process.
It also costs $2.50/$7.50 per million tokens and plugs directly into existing Claude Code deployments without an SDK change.
Whether those benchmark margins hold under independent evaluation, and whether “highest-ranked Chinese model” matters to you, depends on what you’re building. But the number worth sitting with is this: for the first time, a Chinese AI lab is producing a model where the head-to-head benchmark case is genuinely competitive, not just competitive within a regional bracket.
Release Context
Qwen3.7-Max is the flagship in Alibaba’s Qwen3.7 generation. The model first appeared on the LM Arena text leaderboard around May 14, 2026, with API access opening May 19 through Alibaba Cloud and third-party API aggregators. The formal announcement came at the 2026 Alibaba Cloud Summit on May 20.
Qwen generation comparison:
| Model | AA Index | Context | Multimodal | Weights Available |
|---|---|---|---|---|
| Qwen3.6 Max Preview | 51.8 | 256K | No | No (Max tier) |
| Qwen3.7-Max | 56.6 | 1M | No | No (Max tier) |
The context window expansion from 256K to 1M tokens is the most structural hardware-visible change. Gains on the AA Intelligence Index (+4.8 points, a significant single-generation jump) are concentrated in scientific reasoning (+9.2pp on HLE), autonomous execution (+6.9pp on Terminal-Bench Hard), and logical reasoning (+9.7pp on CritPt).
The Qwen3.6 open-weight series (Qwen3.6-27B, Qwen3.6-35B-A3B under Apache 2.0) had established Alibaba as a serious open-weight competitor. Qwen3.7-Max continues the proprietary frontier API-only pattern for the Max tier — the open-weight community is working with the Qwen3.6 series for now.
Architecture
Reasoning Model Design
Qwen3.7-Max is a chain-of-thought reasoning model — it thinks before it answers. Every response goes through a structured internal deliberation phase before producing output, trading latency and token cost for accuracy on hard problems. This is the same architecture category as OpenAI o3 and DeepSeek R2: stronger on complex multi-step tasks, slower and more expensive on tasks where the chain-of-thought step is unnecessary overhead.
Specific parameter counts for the Max tier are not publicly disclosed by Alibaba. The Qwen3.7 generation uses MoE-lineage architecture consistent with the Qwen3.6 series (where open models like Qwen3.6-35B-A3B use Gated Delta Networks combined with sparse Mixture-of-Experts). Assuming the same design principle, a large total parameter pool with a small active-per-token fraction is the most likely configuration.
Context specifications:
- Input context: 1,000,000 tokens
- Maximum output: 65,536 tokens
- Prompt caching: supported
The 65,536-token output limit is worth noting explicitly. For long-running autonomous tasks generating substantial output — code repositories, extended reasoning traces, large documents — you will hit the output ceiling before the input ceiling. This is standard across the reasoning model category.
Anthropic API Protocol Compatibility
The builder-facing story that matters most for existing tooling: Qwen3.7-Max supports the Anthropic API protocol natively. Developers can configure Claude Code, OpenClaw, and any Anthropic SDK-based tool to use Qwen3.7-Max as the backend model without SDK modifications.
This is not a wrapper or compatibility shim — Alibaba has implemented the protocol directly. The implication for teams already invested in Claude Code workflows: Qwen3.7-Max is a viable cost/performance tradeoff option that doesn’t require rebuilding tooling.
Benchmarks
Aggregate: Artificial Analysis Intelligence Index
Qwen3.7-Max scores 56.6 on the Artificial Analysis Intelligence Index (as of May 2026), placing it:
- 5th globally across all models
- 1st among Chinese models (DeepSeek V4-Pro Max trails)
- Ahead of Gemini 3.5 Flash (55.3)
- 4.8 points above predecessor Qwen3.6 Max Preview (51.8)
The frontier tier (Claude Opus-4.6, GPT-5.5) sits at 58-62 on this index. Qwen3.7-Max is in the second tier, above the fast/cheap models and within measurement range of the frontier.
Head-to-Head Against Claude Opus-4.6
| Benchmark | Qwen3.7-Max | Claude Opus-4.6 | Delta |
|---|---|---|---|
| GPQA Diamond | 92.4 | 91.3 | +1.1 |
| HLE (Humanity’s Last Exam) | 41.4 | 40.0 | +1.4 |
| Apex Math Reasoning | 44.5 | 34.5 | +10.0 |
| SWE-Verified | 80.4 | 80.8 | −0.4 (tied) |
The Apex Math Reasoning margin is the most striking: 44.5 vs 34.5, a gap that exceeds the difference between most tier-adjacent models. On scientific and mathematical reasoning specifically, Qwen3.7-Max is ahead of the current Anthropic flagship.
The SWE-Verified tie at 80.4/80.8 (within margin of error) means that for software engineering agentic tasks, Qwen3.7-Max and Opus-4.6 Max are functionally equivalent in benchmark terms.
Agentic: Terminal-Bench 2.0
Terminal-Bench 2.0 score: 69.7 — measuring autonomous long-horizon terminal task completion. Qwen3.7-Max gained 6.9 percentage points on this benchmark from the prior generation, which is consistent with the architectural focus on sustained autonomous execution.
The 35-Hour Chip Optimization Demo
Alibaba’s most striking capability demonstration for Qwen3.7-Max doesn’t involve a leaderboard. It involves a domestic Chinese chip.
The demo: Qwen3.7-Max was given access to an isolated server running a Pingtouge Zhenwu M890 processor (T-Head ZW-M890 PPU) — a hardware architecture the model had never encountered during training. Its task: autonomously optimize Triton operators for this chip.
Over 35 continuous hours, the model:
- Executed 1,158 distinct tool calls
- Performed 432 kernel evaluations
- Diagnosed compilation failures independently
- Restructured operators (Split-K prefix KV-cache, PyTorch pre-allocation replacing cudaMalloc synchronization, single-thread-block multi-query processing)
- Achieved a 10.0x geometric mean speedup on Triton operator performance
The result was working, optimized software for a chip it had never trained on, produced without human intervention.
Why this matters beyond the demo: Alibaba is demonstrating that a frontier-class LLM can autonomously write the optimized software stack for a domestically produced AI accelerator. This directly addresses the software bottleneck in Chinese AI infrastructure — a layer where dependence on NVIDIA’s CUDA ecosystem has been a constraint. If Qwen3.7-Max can produce this for Pingtouge chips, the same capability applies to any domestic Chinese chip entering production.
The caveat to apply: This is Alibaba’s own reported result. The 10x figure is not third-party reproduced. Independent verification will take months. Treat this as a directional signal rather than a verified specification.
Pricing
| Tier | Price per M tokens |
|---|---|
| Input | $2.50 |
| Output | $7.50 |
| Prompt cache read | Reduced (standard caching) |
At $2.50/$7.50, Qwen3.7-Max sits at a price point that is:
- Equal to Cohere Command A+ on input ($2.50)
- 25% cheaper on output than Command A+ ($7.50 vs $10.00)
- Substantially below Claude Opus-4.6 tier pricing
- Above fast-tier models like Gemini 3.5 Flash ($0.075/$0.30) — this is a frontier reasoning model price, not a flash model price
For teams running high-volume agentic workloads where quality matters and Anthropic API tooling is already in place, the cost-to-benchmark-quality ratio is favorable.
Builder Implications
If you’re running Claude Code and want to reduce costs without rebuilding tooling: Qwen3.7-Max’s Anthropic API protocol support means you can swap the backend and benchmark against Opus-4.6 on your actual workloads. SWE-Verified parity suggests the real-world gap may be small for software engineering tasks.
If you’re building agentic workflows with 1M+ token contexts: Qwen3.7-Max is one of the few models in this tier with a genuine 1M-token context window. The 65,536-token output limit is the constraint to design around.
If your work involves mathematical reasoning or scientific analysis: The Apex Math Reasoning gap (+10.0 over Opus-4.6) is large enough to be practically meaningful. Qwen3.7-Max is the strongest publicly available option on that specific benchmark.
If you need open weights for on-premise deployment: Qwen3.7-Max is API-only. The Qwen3.6 series (Apache 2.0) is the open-weight option from Alibaba’s current generation.
If you need multimodal input (vision, audio, documents): Qwen3.7-Max is text-only. Consider Cohere Command A+ for document/spreadsheet vision or Gemini Omni Flash for video/conversational editing.
What It Isn’t
Not open weight at launch. The Qwen3.6 series established Alibaba as a major open-weight contributor under Apache 2.0. Qwen3.7-Max continues the proprietary API-only pattern for the flagship tier. Researchers and enterprises needing deployable weights will wait.
Not multimodal. Qwen3.7-Max processes text in and text out. The Qwen3.7 generation may include vision-capable variants; the Max flagship is text-only.
Not independently verified (for the chip demo). The 35-hour, 10x optimization result is a first-party demonstration. It is directionally significant but should be treated as a benchmark claim pending external reproduction.
Not cheap for high-volume tasks. At $2.50/$7.50, Qwen3.7-Max is priced for quality-sensitive frontier workloads. If you need volume at low cost, flash-tier alternatives are an order of magnitude cheaper.
Verdict
Qwen3.7-Max is the most competitive Chinese AI model released to date, and the first where the benchmark case against Western frontier models is genuinely credible rather than regional.
The combination of near-Opus-4.6 performance on GPQA Diamond and HLE, SWE-Verified statistical parity, Apex Math Reasoning leadership, Anthropic API protocol compatibility, and $2.50/$7.50 pricing makes this the strongest cost/performance option in the reasoning model tier for teams already building on Anthropic tooling.
The gaps are real: text-only, no open weights, and a 35-hour flagship demo that is self-reported. But for software engineering and mathematical reasoning workloads running through Claude Code-compatible infrastructure, Qwen3.7-Max is worth testing against your actual tasks.
Rating: 4/5. Strong benchmarks, strong price, strong builder compatibility. Held back one point by text-only limitation, no open weights, and the self-reported nature of the headline demo.
ChatForest reviews are research-based. We do not test AI models directly on our own hardware. Benchmark figures are sourced from Artificial Analysis, Alibaba’s announcement materials, and independent analyses. Verify specifications before making infrastructure decisions.