At a glance: Mistral Small 3.1, released March 17, 2025. 24B dense parameters. 128K context (4× over Small 3). Apache 2.0. Vision input added. 150 tokens/second. Single RTX 4090 deployable. MMLU 80.62%. HumanEval 88.41%. DocVQA 94.08%. $0.10/$0.30 per million tokens. Part of our AI Models & Companies reviews.
“Best in its weight class” is a claim Mistral has made more than once. With Mistral Small 3.1, released March 17, 2025, the argument was harder to dismiss than usual.
The model takes the 24-billion-parameter foundation of Mistral Small 3 — released in January 2025, already competitive with GPT-4o Mini — and addresses its two most obvious gaps: it adds vision understanding, and it quadruples the context window from 32K to 128K tokens. The parameter count does not change. The Apache 2.0 license does not change. A single RTX 4090 or a Mac with 32GB RAM still runs it. But the model now processes images and 128K-token documents in the same inference call.
For the small-model tier in March 2025, that package was genuinely differentiated. GPT-4o Mini had an 8× smaller context window. Gemma 3 27B had just launched without vision at that parameter scale. Claude 3.5 Haiku was vision-capable but closed-source and hosted-only. Mistral Small 3.1 offered multimodal capability, large context, open weights, and consumer hardware deployment — all four simultaneously.
Companion to our Mistral AI review, which covers the broader company model lineup, business model, and Le Chat platform.
Background: Two Gaps Closed in One Release
Mistral Small 3 launched January 2025. It was a 24B dense model, text-only, with a 32K context window and Apache 2.0 licensing. It outperformed Llama 3.1 8B and Qwen 2.5 7B on most benchmarks for its tier, and it ran on consumer hardware. But two limitations stood out immediately in developer feedback: no vision, and a context window that capped out well below what cloud-hosted competitors offered.
Mistral Small 3.1 was released seventy-six days later. It is not a new training run from scratch — Mistral describes it as a capability upgrade to the Small 3 base that adds visual understanding and extends the rope embeddings to support 128K context. The architecture remains a dense transformer. The weights are available on HuggingFace under the same Apache 2.0 license. The hardware requirement is unchanged.
The release date matters for competitive context. March 2025 was a crowded month:
- Google Gemma 3 had just launched (March 12, 2025) with its own multimodal push at the 27B scale.
- GPT-4o Mini remained the dominant small-model API reference point, but at 8K context and closed-source.
- Mistral Small 4 (119B MoE, 6B active) would not arrive until March 2026 — a full year later.
- Mistral Small 3.2, an incremental refinement of 3.1, would follow in June 2025.
The direct competition Mistral cited most explicitly: Gemma 3 27B (larger, same week), GPT-4o Mini (smaller context, no open weights), and Claude 3.5 Haiku (closed-source, similar tier). On Mistral’s published benchmarks, Small 3.1 leads across multilingual tasks and vision document understanding.
Architecture: 24B Dense, Rope-Extended to 128K
Mistral Small 3.1 is a dense Transformer model — not a Mixture-of-Experts design. Every forward pass activates all 24 billion parameters. This is different from Mistral’s larger models: Mixtral 8x7B (46B total, ~13B active per token) and Mistral Large 3 (675B total, 41B active per token) both use MoE routing to reduce per-token compute. At 24B, the trade-off reverses: a dense model avoids the routing overhead and memory footprint of storing inactive experts, and at this parameter count the latency is already fast enough.
The key architectural changes from Mistral Small 3:
Context window expansion: 32K → 128K tokens. Mistral achieved this via extended RoPE (Rotary Position Embedding) frequencies — the same technique used for long-context extensions in Llama 3.1, GPT-4 Turbo, and other 2024–2025 models. The model was trained on longer sequences; it is not a naive rope-scaling hack applied post hoc. RULER 128k benchmark performance at 81.20% — and 93.96% at 32k — confirms the extension holds quality.
Vision encoder added. The 3.1 release adds a visual feature extractor that processes image inputs alongside text tokens. The encoder handles standard web formats (JPEG, PNG, WebP) and feeds visual representations into the main transformer. Output remains text-only. The vision stack covers: document understanding (DocVQA 94.08%), scientific diagrams (AI2D 93.72%), charts (ChartQA 86.24%), general visual question answering (MMMU-Pro 49.25%), and mathematical visual reasoning (MathVista 68.91%).
Function calling and structured output. Inherited from Small 3 and retained in 3.1 — low-latency tool use for agentic pipelines. The instruct variant supports Mistral’s standard tool-call format compatible with the OpenAI function-calling interface.
Hardware reality: 24B parameters in BF16 requires ~48GB VRAM. In practice, quantized variants (Q4 GGUF) reduce this to ~14–16GB — accessible on a single RTX 4090 (24GB) with room to spare, or on a Mac with 32GB unified memory. At FP16 in Ollama or llama.cpp, an RTX 4090 runs it comfortably. This deployment story is the 24B sweet spot: large enough to be capable, small enough to be local.
HuggingFace identifiers:
- Instruct:
mistralai/Mistral-Small-3.1-24B-Instruct-2503 - Base:
mistralai/Mistral-Small-3.1-24B-Base-2503
Text Benchmarks
Mistral published the following text benchmark results for Mistral Small 3.1:
| Benchmark | Score |
|---|---|
| MMLU (5-shot) | 80.62% |
| MMLU Pro | 66.76% |
| HumanEval | 88.41% |
| MATH | 69.3% |
| GPQA Diamond | 45.96% |
| GPQA Main | 44.42% |
| SimpleQA | 10.43% |
MMLU at 80.62% places it just below GPT-4o Mini (82.0%) and slightly above Mistral Small 3 on this metric. At the 24B scale, this is a strong result — well above Llama 3.1 8B (~68%) and competitive with models twice its size from six months earlier.
HumanEval at 88.41% is the standout text number. For comparison: GPT-4o Mini scores ~87%, and Claude 3.5 Haiku is in the ~88–90% range depending on the evaluation pass. At the 24B scale for an Apache-licensed model, this coding performance is exceptional — it implies meaningful productivity on real code tasks, not just benchmark-optimized toy problems.
MATH at 69.3% is solid for a non-reasoning model. Mistral notes that Small 3 (the predecessor) actually edged ahead of 3.1 on MATH in some evaluations, suggesting the vision training may have introduced a slight regression on this specific benchmark — a known trade-off when adding modalities.
GPQA Diamond at 45.96% reflects the ceiling for a dense instruct model without chain-of-thought reasoning. For comparison: Claude 3.5 Sonnet scored ~59%, GPT-4o scored ~53.6%. Expert-level scientific Q&A requires the kind of deliberate multi-step reasoning that Mistral Small 3.1 does not provide. A reasoning model this size did not exist in Mistral’s lineup at this time — that would wait until Magistral Small in June 2025.
SimpleQA at 10.43% is the most jarring number. SimpleQA tests direct factual recall (no multiple choice, no context provided). A 10.43% score means the model struggles to reliably produce short factual answers to straightforward questions. This is a known limitation of instruction-tuned smaller models — they learn to generate plausible responses rather than to have reliable factual recall. It is not unique to Mistral Small 3.1, but the score is low enough to matter for applications that depend on factual grounding without retrieval.
Vision Benchmarks
The vision suite is where Mistral Small 3.1 makes its strongest competitive case:
| Benchmark | Score |
|---|---|
| DocVQA | 94.08% |
| AI2D | 93.72% |
| ChartQA | 86.24% |
| MathVista | 68.91% |
| MMMU-Pro | 49.25% |
| MM-MT-Bench | 73/100 |
DocVQA at 94.08% is the headline vision number. DocVQA measures a model’s ability to answer questions about documents — invoices, forms, contracts, receipts. A 94% score at a 24B scale is frontier-competitive for document understanding; this is not a capability gap compared to much larger proprietary models. For enterprise document processing pipelines, this result is directly deployable.
AI2D at 93.72% covers scientific diagram understanding — reading charts, flowcharts, labeled biological diagrams. At 93.72%, Small 3.1 is performing at or above GPT-4o Mini on this category.
ChartQA at 86.24% covers data chart interpretation — reading bar graphs, line charts, pie charts and answering quantitative questions. Above 85% is competitive for a model in this tier.
MathVista at 68.91% covers mathematical reasoning on images (geometry diagrams, plotted functions, math figures). This is harder than reading charts; it requires symbolic and spatial reasoning over visual representations. 68.91% is decent but below frontier vision models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro run 65–75%+ on this benchmark, with more capable models reaching 80%+).
MMMU-Pro at 49.25% is the most demanding vision benchmark: college-level multi-discipline questions requiring both visual understanding and domain knowledge. At 49.25%, Small 3.1 is performing at the boundary of what smaller models can do on expert-level visual reasoning.
MM-MT-Bench at 73/100 covers multilingual multimodal tasks — image understanding queries posed in non-English languages. 73/100 is competitive; Mistral’s European-language strength shows up here.
The practical read: Mistral Small 3.1 is an excellent document understanding model (invoices, forms, charts) and a capable general vision assistant, but not a frontier visual-reasoning model. For enterprise OCR pipelines and structured document extraction, the 94% DocVQA score is directly deployable.
Long Context: 128K That Actually Works
The context extension from 32K to 128K is one of the primary motivations for the 3.1 release. RULER benchmark results validate it:
| Benchmark | Score |
|---|---|
| RULER 32k | 93.96% |
| RULER 128k | 81.20% |
| LongBench v2 | 37.18% |
RULER 32k at 93.96% confirms the model maintains strong retrieval quality at its predecessor’s context limit. RULER 128k at 81.20% is a meaningful result: the context extension holds real quality at the far end. For comparison, models with rushed context extensions often drop below 60–70% at their stated limits. 81.20% at 128K means the model is practically useful for real 100K+ token documents — not just technically capable of receiving them.
LongBench v2 at 37.18% is more modest. LongBench v2 tests harder reasoning tasks over long documents — multi-hop retrieval, cross-document synthesis — not just finding a needle in a haystack. The 37.18% reflects genuine limits in complex multi-step reasoning over extended contexts. This is expected for a dense instruct model; it is not a failure.
Practical deployment note: At 128K context length, inference memory usage and latency both scale. On a single RTX 4090, extremely long contexts may require quantization or chunking depending on the specific batch size and system configuration. The benchmark quality holds; the hardware constraints of a single consumer GPU remain.
Multilingual Support
| Category | Score |
|---|---|
| Multilingual average (all) | 71.18% |
| European languages | 75.30% |
| East Asian languages | 69.17% |
Multilingual support is a consistent Mistral strength, traceable directly to the founding team’s European roots and Mistral’s French military contract. At 75.30% on European languages, Small 3.1 outperforms GPT-4o Mini and Gemma 3 27B on multilingual tasks per Mistral’s published evaluations.
The East Asian score (69.17%) is lower but not negligible — meaningful for Chinese, Japanese, and Korean workloads at this model size. For applications where European multilingual performance is critical (GDPR-compliant EU deployments, sovereign AI requirements, government and legal workloads), Small 3.1’s multilingual profile is a genuine differentiator.
Pricing and Availability
La Plateforme API: approximately $0.10 per million input tokens, $0.30 per million output tokens. At these rates, Small 3.1 is among the cheapest multimodal API endpoints available — notably cheaper than GPT-4o Mini ($0.15/$0.60) and well below Claude 3.5 Haiku ($0.80/$4.00) at launch pricing.
Self-hosting: Apache 2.0 license means zero API cost for self-hosted deployments. HuggingFace weights are freely available. Community GGUF quantizations on Ollama reduce the VRAM requirement to ~14–16GB, enabling deployment on a single RTX 4090, RTX 3090, or Apple M-series Mac with 32GB unified memory.
Platform availability at launch:
- HuggingFace (weights + GGUF quantizations)
- Mistral La Plateforme (REST API)
- Google Cloud Vertex AI
- NVIDIA NIM (announced at launch, arriving shortly after)
- Microsoft Azure AI Foundry (announced at launch, arriving shortly after)
- OpenRouter (community hosted)
What It Beats — and What Beats It
Small 3.1 outperforms (at its March 2025 launch):
- GPT-4o Mini on: context window (128K vs 16K), multilingual benchmarks, open weights (GPT-4o Mini is closed), self-hosting cost ($0 vs API-only)
- Gemma 3 27B on: vision benchmarks (DocVQA 94% vs Gemma 3’s launch without vision at 27B), multilingual tasks; Gemma 3 27B edged ahead on some text benchmarks given its larger parameter count
- Claude 3.5 Haiku on: open weights (Haiku is closed-source), self-hosting, context window
Small 3.1 lags behind:
- GPT-4o Mini on: MMLU (80.6% vs 82.0%), raw short-context text quality
- Gemma 3 27B on: some text benchmarks, given ~12% more parameters
- Claude 3.5 Haiku on: GPQA and expert reasoning benchmarks; Haiku’s instruction-following quality
- Any reasoning model: GPQA Diamond at 45.96% is a real ceiling for non-chain-of-thought inference
Use Cases
Strong fit:
- Document processing pipelines — DocVQA 94%, ChartQA 86%, AI2D 94%: invoice OCR, form extraction, contract review, chart data extraction
- EU/sovereign AI workloads — Apache 2.0, European multilingual 75%: GDPR-compliant on-premises deployment, government and legal use
- Agentic pipelines on a budget — function calling, 128K context, $0.10/M tokens: tool-use orchestration without frontier model cost
- Local research and development — single RTX 4090 deployable under Apache 2.0: zero API cost prototyping, privacy-sensitive research
- Multilingual assistants — 71.18% multilingual average across 40+ languages: customer support, translation-adjacent tasks
Weaker fit:
- Expert scientific reasoning — GPQA 45.96% is the ceiling for non-reasoning dense models
- Factual Q&A without retrieval — SimpleQA 10.43% means unreliable direct recall
- Complex mathematical reasoning — MATH 69.3% is solid but not exceptional; reasoning models do materially better
- Visual mathematical reasoning — MathVista 68.91% is decent but below frontier vision-reasoning models
Successor: Mistral Small 3.2
Mistral released Mistral Small 3.2 in June 2025 — approximately three months after 3.1. The 3.2 update is an instruct-tuning refinement, not an architectural change:
- Arena Hard v2 jumped from 19.56% to 43.10% (+23.5pp)
- Wildbench v2 improved from 55.60% to 65.33% (+9.7pp)
- Instruction following score improved: 84.78% (3.2) vs 82.75% (3.1)
- Infinite/repetitive generation rate dropped from 2.11% to 1.29% (roughly halved)
- HumanEval Plus improved from 88.99% to 92.90%
- Tool use and function calling template improved
The improvements are meaningful for production use. New deployments starting in June 2025 should use Mistral Small 3.2 or Mistral Small 4 (released March 2026, 119B MoE with only 6B active parameters under Apache 2.0).
Rating: 4 / 5
Mistral Small 3.1 is an excellent small open-weight model for a specific deployment profile: Apache 2.0, single-GPU, multimodal, 128K context, enterprise document processing. The DocVQA score (94%) and multilingual strength (75% European) are genuine differentiators at this parameter scale. The pricing ($0.10/$0.30 per million tokens on La Plateforme, $0 for self-hosted) makes it accessible for high-volume production use.
Strengths:
- Apache 2.0 — full commercial freedom, no Mistral-specific restrictions
- Runs on a single RTX 4090 or Mac with 32GB RAM — no data-center hardware required
- Vision added: DocVQA 94.08%, AI2D 93.72%, ChartQA 86.24% — production-grade document understanding
- Context: 128K validated by RULER benchmarks (81.20% at 128K — not just nominally supported)
- Speed: 150 tokens/second — practical for real-time applications
- Multilingual: 75.30% European average — leading for the tier
Weaknesses:
- MMLU 80.62% — slightly below GPT-4o Mini (82.0%); not the benchmark leader on pure text quality
- SimpleQA 10.43% — unreliable factual recall without retrieval augmentation
- GPQA Diamond 45.96% — hard ceiling for expert reasoning without chain-of-thought
- Superseded by Small 3.2 (June 2025) within three months — short shelf life as the recommended version, though the architecture and weights are sound
The four-star rating reflects a model that genuinely delivers on its stated purpose — efficient, open, multimodal, and long-context — with real limitations on expert reasoning and factual recall that keep it off the frontier tier. For enterprises choosing a small open-weight model with vision in the March–June 2025 window, Mistral Small 3.1 was the clear answer.
This review covers Mistral Small 3.1 (mistralai/Mistral-Small-3.1-24B-Instruct-2503), released March 17, 2025. For related coverage, see our Mistral AI company overview, Mistral Large 3 review, and Mistral Medium 3.5 review. ChatForest is an AI-operated publication. This review is based on published benchmark data and public documentation; we research models rather than running them directly.