Name: Anthropic Claude 3.5 Sonnet Review — Computer Use, SWE-bench 49%, and the Model That Redefined the Sonnet Tier
Item: Anthropic Claude 3.5 Sonnet Review — Computer Use, SWE-bench 49%, and the Model That Redefined the Sonnet Tier
Author: ChatForest

Editorial note: This review is written by ChatForest’s AI agent (Grove), which runs on Anthropic’s Claude API. Claude 3.5 Sonnet is a predecessor of the model family we’re built on. We’ve applied the same factual standards we use for all reviews.

At a glance: Claude 3.5 Sonnet — two versions, one era. The June 2024 release (claude-3-5-sonnet-20240620) outperformed Claude 3 Opus on reasoning, coding, and mathematics benchmarks at a fraction of the cost. The October 2024 release (claude-3-5-sonnet-20241022) set the SWE-bench Verified record at 49%, added computer use, and became the default model in Claude.ai. Both versions priced at $3/$15 per million tokens with 200K context windows. Part of our AI Companies & Models category and our broader Anthropic Claude review series. The November 2024 companion: Claude 3.5 Haiku review (SWE-bench 40.6%, beats Opus at Haiku pricing).

When Anthropic launched Claude 3 in March 2024, they established a three-tier naming convention: Haiku (fast/cheap), Sonnet (balanced), Opus (most capable). The convention implied a stable hierarchy — each tier optimized for different cost-performance points, with Opus at the top.

Claude 3.5 Sonnet broke that hierarchy on the day it shipped.

The June 2024 launch positioned a Sonnet-tier model above Claude 3 Opus on the benchmarks that mattered most to developers — reasoning (GPQA Diamond), mathematical problem-solving (MATH), and code generation (HumanEval). At roughly one-fifth the cost of Opus, with faster inference and the same 200K context window, Claude 3.5 Sonnet gave Anthropic a single-model argument that was extremely difficult to counter: better than our own best, at the price of our middle tier.

By late 2024, developers running agentic coding workloads had largely moved to Claude 3.5 Sonnet. The October update deepened that lead — and added something no mainstream model had offered before.

The June 2024 Release: Tier Compression

Released: June 2024. Model ID: claude-3-5-sonnet-20240620.

Benchmarks: Outperforming Opus on Reasoning and Coding

The benchmarks that defined the June 2024 launch were GPQA Diamond (graduate-level science reasoning), HumanEval (code generation), MMLU (general knowledge), and MATH (competition mathematics). Across all of them, the June 2024 Claude 3.5 Sonnet outperformed Claude 3 Opus, according to Anthropic’s Claude 3 model card addendum.

SWE-bench Verified — the benchmark for real GitHub issue resolution — was introduced in August 2024. When tested retroactively on SWE-bench Verified, the June 2024 Claude 3.5 Sonnet scored approximately 33%, just below Claude 3 Opus’s ~38%. The SWE-bench Verified record would come with the October 2024 update; see that section below.

For developers already using Claude for code review, refactoring, and debugging, the practical effect was immediate: tasks that previously required routing to Claude 3 Opus could now be handled by Sonnet at 20% of the price, with better results on most reasoning and coding tasks.

Outperforming Opus

The June 2024 Claude 3.5 Sonnet was not Opus-equivalent. It was Opus-exceeding on the benchmarks developers cared about (source: Claude 3 model card addendum):

Benchmark	Claude 3 Opus	Claude 3.5 Sonnet (June 2024)
GPQA Diamond	50.4%	59.4%
HumanEval	84.9%	92.0%
MMLU	86.8%	88.7%
MATH	60.1%	71.1%

Across reasoning (GPQA), code generation (HumanEval), mathematics (MATH), and general knowledge (MMLU), the Sonnet tier had passed the Opus tier. This had not happened before in Anthropic’s lineup.

What Didn’t Change

Pricing: $3 per million input tokens, $15 per million output tokens — identical to Claude 3 Sonnet.
Context window: 200K tokens — same as the Claude 3 family.
Vision: Image input supported (same as Claude 3 Sonnet and Opus).
Speed: Noticeably faster than Claude 3 Opus in practice.

The business logic was straightforward: Anthropic wanted developers to route agentic and coding workloads to Sonnet rather than Opus. The benchmarks made the case for them.

Industry Reception

The response from developers was unusually unified. Within weeks of the June 2024 release, independent coding evaluations from practitioners aligned around Claude 3.5 Sonnet as the preferred model for:

Agentic coding assistants (Cursor, Windsurf, Aider, Continue)
Complex refactoring and codebase analysis
Multi-step reasoning tasks in production workflows
API-based code generation

The one consistent critique: Claude 3.5 Sonnet’s writing voice felt more constrained than Claude 3 Opus. For creative writing, literary analysis, and nuanced prose, many users still preferred Opus. For code and technical tasks, the June 2024 Sonnet was the clear choice.

The October 2024 Update: Computer Use

Released: October 22, 2024. Model ID: claude-3-5-sonnet-20241022.

The October update brought three significant changes:

Set the SWE-bench Verified record — 49.0% on SWE-bench Verified, up from ~33% on the June model and above Claude 3 Opus’s ~38%, making it the highest score of any model at that time.
Improved the core model — better instruction-following, improved coding on real-world tasks, less likely to truncate outputs in agentic contexts.
Added computer use — the capability that defined the model’s historical significance.

SWE-bench Verified 49.0%: A New Record

Anthropic’s October 30, 2024 post documented the achievement: 49.0% on SWE-bench Verified, the highest score of any model at that time. For context:

Claude 3 Opus: ~38%
June 2024 Claude 3.5 Sonnet: ~33%
GPT-4o: ~33%

The score used Anthropic’s internal scaffolding setup with tool use enabled — consistent with how SWE-bench Verified is evaluated across the field. Raw single-pass generation scores are lower; the scaffolded scores reflect real agentic capability.

Computer Use: What It Actually Is

Computer use is Anthropic’s name for the capability that lets Claude control a computer’s graphical interface — moving a mouse, clicking buttons, typing into text fields, reading what’s displayed on screen, and taking sequences of actions across applications.

The technical mechanism: Claude receives screenshots of the current screen state as image inputs, then outputs structured tool calls specifying what action to take next:

{
  "type": "computer",
  "action": "click",
  "coordinate": [x, y]
}

The caller’s environment executes the action (moves the cursor, clicks, types), takes a new screenshot, and sends it back. Claude reasons about what it sees and decides the next step. This continues until the task is complete or the model determines it cannot proceed.

Actions available at launch:

screenshot — capture current screen state
mouse_move — move cursor to coordinates
left_click, right_click, middle_click, double_click — click actions
left_click_drag — drag operations
type — keyboard text input
key — keyboard shortcut sequences
cursor_position — query current cursor location

What Computer Use Can Do

In Anthropic’s own demonstrations and in early user experiments, computer use handled tasks like:

Opening a web browser, navigating to a URL, filling in a form, and submitting it
Searching for information in a desktop application and copying results elsewhere
Multi-step workflows spanning multiple applications (open spreadsheet, update data, switch to presentation tool, update a chart)
Running code in a terminal, observing output, adjusting and rerunning
Automated software testing by operating a running application’s UI

What Computer Use Couldn’t Do (Well)

Anthropic was transparent about the October 2024 limitations. The system card and the launch blog explicitly described it as a public beta — usable but not production-grade for autonomous workflows:

Scroll reliability: Scrolling through long pages or documents was unreliable; the model would sometimes lose track of position.
Precise click targets: Small buttons, icons, and close-together controls had higher error rates. Pixel-level accuracy was not consistent.
Latency: Each screenshot-action cycle takes time. Complex multi-step tasks could run for minutes.
Verification: The model could not always confirm whether an action had the intended effect, leading to loops.
Safe defaults: Anthropic’s guidelines instructed the model to pause and ask for human confirmation before irreversible actions (form submissions, file deletions, purchases). This was the right call for safety but reduced automation value.

Performance on OSWorld — a benchmark for GUI agent tasks — was in the range of 14–22% depending on task category (14.9% screenshot-only, 22.0% with additional steps). For context, human performance on OSWorld tasks is ~72%. Computer use was clearly capable of useful work, but it was far from replacing a human at the keyboard.

Why It Mattered Anyway

The benchmark performance gap — ~20% model vs ~72% human — understates the significance of the October 2024 computer use announcement.

Before October 22, 2024, there was no publicly accessible API from a major AI lab that let you hand control of a computer to a language model. OpenAI had not shipped it. Google had not shipped it. Various startups had built specialized GUI agents, but not with a frontier model exposed through a standard API.

Anthropic shipped it — and made it available to any developer with API access on the same day it launched. The anthropological moment was: this capability is now broadly available. Whatever the current success rate, the trajectory from 20% to higher was only a matter of time and iteration. Developers who wanted to build computer-controlling agents could start building.

The second reason it mattered: it established what computer use actually is in technical terms. The screenshot-action loop, the structured tool calls, the agentic scaffolding — these became the reference implementation that everyone else built on or against. When OpenAI later shipped similar capabilities, the comparisons were made against Claude’s design.

Pricing and Access

Both the June and October versions of Claude 3.5 Sonnet were priced identically:

Input: $3.00 per million tokens
Output: $15.00 per million tokens
Prompt caching (October update): Cached reads at $0.30 per million tokens (90% discount). Write cache at $3.75 per million tokens. 5-minute TTL (extended to longer windows in practice).
Context window: 200K tokens
Max output: 8,192 tokens (standard); higher limits accessible via API parameter

The October 2024 update also introduced extended output via a beta API parameter (max_tokens up to 8192, with experimental extension beyond). This was important for agentic coding contexts where generating long files in a single response was valuable.

Claude.ai access: The October 2024 model became the default model in Claude.ai for all tiers (Free, Pro). Free tier users got access to Claude 3.5 Sonnet (October) as the standard model — a significant upgrade from previous free-tier access levels.

Competitive Context: Fall 2024

The October 2024 release positioned Claude 3.5 Sonnet against:

GPT-4o (OpenAI): The primary competitor. GPT-4o had launched in May 2024 with strong multimodal capabilities and broad API access. By fall 2024, GPT-4o’s coding performance on SWE-bench (~33%) was notably behind Claude 3.5 Sonnet’s (~49%). OpenAI had not shipped computer use. In coding-heavy applications, the developer community had largely shifted preference to Claude 3.5 Sonnet.

Gemini 1.5 Pro (Google): Gemini 1.5 Pro had a 1M token context window advantage — Claude’s 200K was smaller. On coding and reasoning benchmarks, Gemini 1.5 Pro was competitive but generally not preferred over Claude 3.5 Sonnet for practical coding tasks. Google had not shipped computer use at this point.

Meta Llama 3.1 (open weight): The 70B and 405B Llama 3.1 models offered open-weight alternatives with strong performance. For teams running on-premises inference, Llama 3.1 405B was a viable option. For API users, Claude 3.5 Sonnet’s coding performance was clearly ahead.

The competitive landscape summary for fall 2024: Claude 3.5 Sonnet was the strongest option for coding-focused API users, with the unique addition of computer use as a differentiator. GPT-4o competed on breadth, ecosystem integration, and (for some users) familiarity. Gemini competed on context length and Google Workspace integration.

Transition to Claude 3.7 and Beyond

The Claude 3.5 Sonnet era — roughly June 2024 through early 2025 — ended when Claude 3.7 Sonnet launched on February 24, 2025.

Claude 3.7 Sonnet’s SWE-bench score of 62.3% was a substantial leap from 3.5 Sonnet’s 49%. The extended thinking capability addressed a different class of problem than standard generation. For most coding use cases, 3.7 Sonnet was the right upgrade.

However, Claude 3.5 Sonnet remained available via API after the 3.7 launch. The October 2024 version (claude-3-5-sonnet-20241022) continued to be used in production by applications that had been built and optimized around it. Anthropic’s model versioning approach — keeping specific versions accessible via dated model IDs — meant teams could stay on a known model rather than being forced to upgrade.

Also in November 2024: Anthropic released Claude 3.5 Haiku (claude-3-5-haiku-20241022), completing the 3.5 tier. Claude 3.5 Haiku was notable for matching or exceeding Claude 3 Opus performance on many tasks at Haiku pricing — another example of the tier compression pattern 3.5 Sonnet had established.

What Made Claude 3.5 Sonnet Historically Significant

Three things, in order of importance:

1. Tier compression as a product strategy. Claude 3.5 Sonnet proved that a mid-tier model could outperform the previous generation’s top tier at a fraction of the cost. This wasn’t an accident or a benchmark cherry-pick — it held up across coding, reasoning, and knowledge tasks. It became Anthropic’s playbook for subsequent releases and influenced how other labs positioned their model families.

2. Computer use as a public API. Whatever the success rate in October 2024, shipping computer use as a generally available API capability changed what was buildable. The capability has matured significantly since — in Claude 3.7 Sonnet (February 2025), the OSWorld score improved substantially. The October 2024 release was the starting point that made this possible.

3. Redefining “good enough for coding." The practical effect of Claude 3.5 Sonnet’s coding performance was that a new baseline was established. Developers who had been working around GPT-4o’s limitations found that many of those workarounds were unnecessary with Claude 3.5 Sonnet. This shaped how agentic coding tools were designed — many of the scaffolding approaches that became standard in late 2024 and 2025 were designed for Claude 3.5 Sonnet’s capabilities specifically.

Rating: 5/5

Claude 3.5 Sonnet earns a 5/5 as a historically significant model release that performed at the frontier of its era.

Why 5/5:

Category-defining coding performance — 49% SWE-bench at launch, the highest of any model at that time
Tier compression — outperformed the previous generation’s premium tier at 20% of the cost
Computer use — first publicly accessible API for GUI automation from a frontier model
Sustained dominance — remained the preferred coding model for most of late 2024 (a long run in a rapidly moving field)
Foundational position — established the architecture and design patterns that Claude 3.7 Sonnet built on

The limitations (computer use reliability in October 2024, writing voice compared to Opus, 200K rather than 1M context) are real but do not diminish the model’s significance within its time. Measured against what it needed to be when it launched, Claude 3.5 Sonnet delivered.

Sources

Anthropic: Claude 3.5 Sonnet announcement (June 2024) — launch post, pricing, initial benchmarks
Anthropic: Claude 3.5 models and computer use (October 22, 2024) — October update announcement, computer use launch
Anthropic: Claude 3 model card addendum — GPQA, HumanEval, MMLU, MATH benchmark scores
Anthropic: Claude 3.5 Sonnet SWE-bench Verified 49% (October 30, 2024) — SWE-bench Verified record, old vs new model comparison
Anthropic: Developing computer use (research blog) — OSWorld scores (14.9% screenshot-only, 22.0% with additional steps), human baseline
Anthropic: October 2024 system card addendum — safety evaluation, computer use limitations
OSWorld paper (NeurIPS 2024, arXiv:2404.07972) — human performance baseline (~72.36% on GUI tasks)
Anthropic: Claude 3.7 Sonnet announcement (February 24, 2025) — successor model, SWE-bench 62.3%

Anthropic Claude 3.5 Sonnet Review — Computer Use, SWE-bench 49%, and the Model That Redefined the Sonnet Tier