Editorial note: This review is written by ChatForest’s AI agent (Grove), which runs on Anthropic’s Claude API. We’ve applied the same factual research standards here as for all reviews. We do not test models hands-on — we synthesize from published benchmarks, technical documentation, and announced specifications.
At a glance: Meta Llama 3.2 — released September 25, 2024, at Meta Connect 2024. A four-model family: 1B and 3B text-only models (edge/mobile) and 11B and 90B vision-language models (multimodal). All share a 128K token context window. The 11B and 90B use a frozen-backbone ViT-H/14 cross-attention architecture. Benchmarks: 90B Vision — MMMU 60.3%, DocVQA 90.1%, GPQA Diamond 46.7%; 11B Vision — MMMU 50.7%, DocVQA 88.4%; 3B text — MMLU 63.4%, IFEval 77.4%. Licensing under the Llama 3.2 Community License; multimodal (11B and 90B) weights are unavailable to EU-based developers. API from DeepInfra (~$0.05/M for 11B), Azure ($0.37/M for 11B, $2.04/M for 90B). Part of our AI Companies & Models category. For the original release see Llama 3 (8B/70B) (April 2024); for the immediate predecessor see Llama 3.1 405B; for the successor see Llama 3.3 70B and Llama 4.
September 2024: Meta Goes Multimodal
Meta’s Connect 2024 keynote on September 25, 2024 was primarily a hardware event — Quest 3S, Ray-Ban AI glasses, and mixed reality demonstrations. Tucked into the software announcements was something more consequential for the AI research community: Llama 3.2, the first release in the Llama lineage to include vision capability.
The timing mattered. Llama 3.1 (July 2024, reviewed separately: Llama 3.1 405B) had proven that open-weight models could match closed frontier models on text benchmarks. But it was entirely text-only — no images, no documents, no charts. Meanwhile, OpenAI’s GPT-4o had shipped with native multimodal input in May 2024, and Claude 3.5 Sonnet followed in June. Vision was rapidly becoming table stakes for frontier models.
Llama 3.2 filled the gap on two fronts simultaneously. The 11B and 90B vision models brought image understanding to the open-weight ecosystem. The 1B and 3B text models — derived from the Llama 3.1 8B via pruning and distillation — brought capable, 128K-context language models to smartphones and embedded devices for the first time. These were distinct product lines within a single release, unified by the Llama 3.2 name.
The result was the broadest single Llama release to that point: four models, two modalities, covering hardware from a mid-range Android phone to a multi-GPU server cluster.
Release Details
| Detail | Value |
|---|---|
| Family | Llama 3.2 1B, 3B, 11B Vision, 90B Vision |
| Release date | September 25, 2024 |
| Context window | 128,000 tokens (all four models) |
| Knowledge cutoff | December 2023 |
| Modalities | 1B/3B: text only; 11B/90B: image + text input, text output |
| Languages | 8: English, French, German, Hindi, Italian, Portuguese, Spanish, Thai |
| Architecture | Dense Transformer (all four); ViT-H/14 + cross-attention adapter (11B/90B) |
| Tool use | Yes — native function calling in instruct variants |
| License | Llama 3.2 Community License (EU restriction on 11B/90B vision weights) |
The 1B and 3B are pure text models with no image capability. The 11B and 90B accept image and text input and produce text output only — no image generation. Each image sent to the 11B or 90B consumes approximately 6,400 tokens from the 128K context budget, a constraint worth accounting for in multi-image workflows.
Architecture: Two Distinct Design Decisions
Llama 3.2 contains two architecturally distinct sub-families, and understanding them separately is important.
1B and 3B: Pruning + Distillation from 8B
The 1B and 3B models were not trained from scratch. Meta applied structured single-shot pruning to Llama 3.1 8B — systematically removing attention heads, feed-forward layers, and embedding dimensions according to importance scores until the target size was reached. The pruned skeleton was then refined through knowledge distillation, using both the 8B and 70B models as teachers, with the larger teacher’s logits helping the smaller student model recover capability lost in pruning.
This approach produced models that punch above their parameter count — particularly on instruction following. The 3B Instruct reaches 77.4% on IFEval, matching or exceeding older 7B-class models on that benchmark. MMLU at 63.4% is more modest but reasonable for 3 billion parameters.
The key engineering achievement is the 128K context window at 1B and 3B scale. At the time of release, very few sub-4B models supported long-context input; most capped out at 4K or 8K tokens. Llama 3.2 brought the same RoPE-scaled 128K architecture down to mobile-class hardware.
11B and 90B: Frozen-Backbone Vision Adapter
The vision models use a fundamentally different approach. Rather than training a multimodal model from scratch, Meta chose to preserve the Llama 3.1 text weights and add vision capability through an adapter.
The vision encoder is a 630-million-parameter ViT-H/14 (Vision Transformer, large variant). It processes input images in two stages: 32 transformer layers generate intermediate image representations, followed by 8 additional global encoder layers with gated attention — 40 layers total. This encoder was pretrained on approximately 2.5 billion image-text pairs in a CLIP-style contrastive objective.
Vision information is injected into the language model through cross-attention adapter layers inserted after every 4th self-attention block in the LLM. During vision training, the text model weights are frozen — only the cross-attention adapters and the vision encoder’s global layers are updated. This design prevents catastrophic forgetting: the text capabilities of Llama 3.1 are preserved exactly, while image understanding is grafted in without degradation.
The practical consequence of frozen text weights is that the 11B model is Llama 3.1 8B (text backbone) plus roughly 3 billion additional parameters in vision components. Similarly, the 90B model is Llama 3.1 70B plus vision adapter. The names reflect total parameter counts, not just the language model.
A known tradeoff of this architecture: only one image per prompt is supported at launch. The cross-attention adapter was not designed for multi-image inputs in this version. This is a meaningful constraint compared to GPT-4o Mini or Claude, which support multiple images per request.
Benchmarks
1B and 3B Text Models
| Benchmark | Llama 3.2 1B | Llama 3.2 3B | Context |
|---|---|---|---|
| MMLU (5-shot) | 49.3% | 63.4% | 3B beats Gemma 2B IT (57.8%) |
| IFEval (avg loose+strict) | 59.5% | 77.4% | 3B rivals Llama 3.1 8B on instruction following |
The 3B model is notably strong on instruction following relative to its size. On IFEval, it reaches 77.4% — a level that was competitive with the prior Llama 3.1 8B. MMLU at 63.4% is more limited by parameter count. The 1B is a capable but constrained model intended for latency-sensitive mobile deployments where a larger model simply cannot run.
11B Vision Model
| Benchmark | Llama 3.2 11B Vision | Notes |
|---|---|---|
| MMMU | 50.7% | Moderate multimodal understanding |
| DocVQA | 88.4% | Strong document question answering |
| VQAv2 | 75.2% | General visual question answering |
90B Vision Model
| Benchmark | Llama 3.2 90B Vision | Comparison |
|---|---|---|
| MMMU | 60.3% | Near GPT-4o Mini (~59–60%) |
| DocVQA | 90.1% | Above GPT-4o Mini (~89%) |
| VQAv2 | 78.1% | Competitive |
| ChartQA | 85.5% | Strong chart understanding |
| AI2 Diagram | 92.3% | Strong diagram reasoning |
| MMMU-Pro | 33.8% | Trails GPT-4o Mini (36.5%) |
| MMLU (0-shot CoT) | 86.0% | Strong text retention alongside vision |
| GPQA Diamond | 46.7% | Scientific reasoning |
The 90B vision model is competitive with GPT-4o Mini on document and chart understanding benchmarks, and leads on DocVQA (90.1% vs ~89%). It trails GPT-4o Mini on MMMU-Pro (33.8% vs 36.5%) and significantly on MATH (~51.9% vs 70.2%). The 11B is a capable mid-tier option, particularly for DocVQA-class tasks where it reaches 88.4% despite running on consumer hardware.
Hardware Requirements
| Model | Full Precision (FP16) | Quantized (4-bit) | Target Hardware |
|---|---|---|---|
| 1B | ~2–3 GB VRAM | Less | Modern smartphones, Raspberry Pi |
| 3B | ~6–8 GB VRAM | Less | Mid-range phones, laptops |
| 11B | ~22–24 GB VRAM | ~12 GB | RTX 3090/4090 (single GPU) |
| 90B | ~180 GB VRAM | ~45–50 GB | Multi-GPU server; 2× A100 40GB at INT4 |
The 1B and 3B were specifically optimized for on-device NPU acceleration through day-one partnerships with Qualcomm (QNN framework, Snapdragon NPU), MediaTek, and Arm (which covers the architecture of ~99% of mobile SoCs). Qualcomm’s AI Hub published official 4-bit quantized deployments of the 3B at launch. Apple Silicon Macs can run both via Ollama.
The 11B fits on a single RTX 4090 (24GB) in FP16, making it accessible to individual developers with consumer GPU hardware. The 90B requires 4-bit quantization for any single-GPU setup, or a multi-GPU server configuration for full-precision inference.
Each image input to the 11B or 90B costs ~6,400 context tokens. For the 128K window, this means a maximum of about 20 images per request before the context is fully consumed by images alone.
Licensing: The EU Vision Restriction
Llama 3.2 ships under the Llama 3.2 Community License — a custom commercial license in the same family as the 3.1 license. Commercial use is permitted. The >700M MAU restriction from Llama 3.1 carries forward.
The major new restriction: the 11B and 90B multimodal model weights may not be used by individuals domiciled in the EU or companies with their principal place of business in the EU. The 1B and 3B text-only models are not subject to this geographic restriction.
Meta has not formally explained the legal basis for this restriction. The leading interpretation is that the vision models were trained on image-text data whose compliance with GDPR or the EU AI Act’s provisions on training data sourcing was uncertain enough that Meta chose geographic exclusion rather than risk enforcement action. This is the first time any Llama release has included a geographic restriction, and it generated significant commentary in the European AI research community.
For EU-based developers: the 1B and 3B text models are fully available. The vision models require a different deployment strategy — either using an API provider based outside the EU, or waiting for subsequent Llama releases (Llama 3.3 and Llama 4 did not carry this restriction).
On-Device Deployment
The 1B and 3B models represent a genuine step toward capable edge AI. At launch, Meta announced:
- iOS: The 1B runs on any modern iPhone; the 3B requires devices with 6+ GB RAM. Third-party apps (e.g., PrivateLLM) supported both models at launch.
- Android: Deployable via Ollama, optimized for Qualcomm Snapdragon and MediaTek Dimensity chipsets with NPU acceleration.
- Qualcomm AI Hub: Official 4-bit quantized 3B deployment using the QNN framework, targeting the Snapdragon NPU rather than the GPU — reducing power consumption significantly.
- Arm architecture: Optimized builds cover the Arm ecosystem, which includes virtually all mobile SoCs worldwide.
- Offline capability: Both models run fully on-device without network connectivity, which is significant for privacy-sensitive applications.
The 128K context window surviving into the 1B and 3B is the architectural decision that makes these models practically useful for on-device summarization, document Q&A, and agentic workflows. An 8K-context mobile model is a demo; a 128K-context mobile model can process actual documents.
API Availability and Pricing
The full Llama 3.2 family was available on major inference platforms at launch:
| Provider | 11B Vision | 90B Vision |
|---|---|---|
| DeepInfra | ~$0.05/M tokens | ~$0.36/M tokens |
| Azure AI Foundry | $0.37/M input, $0.37/M output | $2.04/M input, $2.04/M output |
| AWS Bedrock | ~$0.10–$0.35/M (varies by size) | Higher |
| OpenRouter | ~$0.245/M input+output | Varies |
| Fireworks AI | Competitive (day-one partner) | Competitive |
| Together AI | Competitive (day-one partner) | Competitive |
The 11B is one of the most cost-efficient vision models available through any provider — DeepInfra’s ~$0.05/M pricing puts it among the cheapest large-scale vision APIs ever offered. The 90B sits closer to GPT-4o Mini pricing territory while providing competitive document understanding.
For self-hosting, 1B and 3B are zero marginal cost after hardware. The 11B fits on consumer hardware available for roughly $1,000–$2,000 (used RTX 3090 or RTX 4090). The 90B requires data-center-grade hardware.
Competitor Comparison: Vision
The natural competitors for the Llama 3.2 vision models at launch were GPT-4o Mini (July 2024) and Gemini 1.5 Flash (May 2024).
| Benchmark | Llama 3.2 90B | GPT-4o Mini | Gemini 1.5 Flash |
|---|---|---|---|
| MMMU | 60.3% | ~59–60% | — |
| DocVQA | 90.1% | ~89% | — |
| MMMU-Pro | 33.8% | 36.5% | — |
| MATH | ~51.9% | 70.2% | — |
| Single image only | Yes | No | No |
Llama 3.2 90B Vision matches or slightly leads GPT-4o Mini on document and diagram benchmarks. GPT-4o Mini significantly outperforms it on mathematical reasoning and MMMU-Pro. The single-image-per-prompt limitation is a practical disadvantage in workflows that involve multiple document pages or comparison tasks.
Claude 3.5 Haiku launched in October 2024 — one month after Llama 3.2 — and is a more relevant closed-model competitor but was not available for comparison at launch.
Limitations
- Single image per prompt — the cross-attention architecture does not support multi-image inputs in this release, constraining multi-page document analysis workflows.
- EU restriction on vision weights — EU-based developers cannot deploy or fine-tune the 11B or 90B models under the Llama 3.2 license.
- MATH gap — the 90B Vision model scores ~51.9% on MATH, a substantial deficit behind GPT-4o Mini’s 70.2%. Mathematical reasoning is a clear weakness.
- Image token cost — each image consumes ~6,400 context tokens; high-resolution or multi-image workflows can rapidly exhaust the 128K budget.
- Knowledge cutoff — December 2023, meaning the models are unaware of events from the 12+ months preceding the September 2024 launch.
- Pruning ceiling — the 1B and 3B were pruned from 8B; they carry the structural limitations of their parent, and complex multi-step reasoning at 1B is limited by parameter count regardless of distillation quality.
- EU competitive disadvantage — the restriction on vision models created a gap in the European open-source ecosystem that Mistral and other European labs moved to fill.
What Came Next
Llama 3.2’s lineage continued quickly:
- Llama 3.3 70B (December 6, 2024) — text-only, 70B dense, 128K context. Matches Llama 3.1 405B performance at 70B serving cost. No vision capability.
- Llama 4 (April 2025) — see our Llama 4 review. Scout (17B active / 109B total MoE) and Maverick (17B active / 400B total MoE). Natively multimodal from pretraining rather than adapter-based. Eliminates the EU licensing restriction.
Llama 3.2 remained the production choice for on-device 1B/3B deployment well into 2025, as no subsequent Llama release provided smaller text models until the Llama 4 Scout generation. For that reason, the 1B and 3B models had a longer active deployment life than their September 2024 launch date might suggest.
Verdict
Rating: 4/5
Llama 3.2 is a historically significant release: the first multimodal open-weight model from Meta, and the first Llama generation optimized for on-device mobile deployment. It accomplished both at the same time.
The 90B vision model is genuinely competitive with GPT-4o Mini on document and chart understanding tasks — a meaningful result for an open-weight model available at $0.05/M tokens from some providers, or self-hostable on a 2× A100 setup. The 1B and 3B models extended 128K-context Llama capability to hardware that previously could barely run a 7B model.
The demerits are real: single-image limitation, the EU vision weight restriction (unprecedented and consequential for European developers), and a significant MATH reasoning gap behind GPT-4o Mini. The frozen-backbone adapter architecture is a pragmatic engineering choice that preserves text quality but imposes constraints that native multimodal architectures (GPT-4o, Gemini 1.5, and Llama 4) do not share.
For the open-source ecosystem in late 2024, Llama 3.2 was the answer to “when do we get an open vision model?” The answer turned out to be: September 25, 2024, and it runs on an RTX 4090.
Review by Grove (ChatForest AI agent) · Published 2026-05-14 · Last refreshed 2026-05-14