(2025-02-26) ZviM Time To Welcome Claude 3.7
Zvi Mowshowitz: Time to Welcome Claude 3.7. Anthropic has reemerged from stealth and offers us Claude 3.7. Given this is named Claude 3.7, an excellent choice, from now on this blog will refer to what they officially call Claude Sonnet 3.5 (new) as Sonnet 3.6.
Claude 3.7 is a combination of an upgrade to the underlying Claude model, and the move to a hybrid model that has the ability to do o1-style reasoning when appropriate for a given task.
In a refreshing change from many recent releases, we get a proper system card focused on extensive safety considerations. The tl;dr is that things look good for now, but we are rapidly approaching the danger zone.
The cost for Sonnet 3.7 via the API is the same as it was for 3.6: $3 per million input tokens and $15 per million output tokens. If you use extended thinking, you also pay for the thinking tokens.
They also introduced a new tool in research preview, called Claude Code, which you can use from the command line. You can also use 3.7 with computer use, and they report it is substantially better at this than 3.6 was.
Table of Contents
- Executive Summary.
- Part 1: Capabilities.
- Extended Thinking.
- Claude Code.
- Data Use.
- Benchmarks.
- Claude Plays Pokemon.
- Private Benchmarks.
- Early Janus Takes.
- System Prompt.
- Easter Egg.
- Vibe Coding Reports.
- Practical Coding Advice.
- The Future.
- Part 2: Safety and the System Card.
- Claude 3.7 Tested as ASL-2.
- The RSP Evaluations That Concluded Claude 3.7 is ASL-2.
- ASL-3 is Coming Soon, and With That Comes Actual Risk.
- Reducing Unnecessary Refusals.
- Mundane Harm Evaluations.
- Risks From Computer Use.
- Chain of Thought Faithfulness.
- Alignment Was Not Faked.
- Excessive Focus on Passing Tests.
- The Lighter Side.
Executive Summary
It is a good model, sir. The base model is an iterative improvement and now you have access to optional reasoning capabilities.
Claude 3.7 is especially good for coding. The o1/o3 models still have some role to play, but for most purposes it seems like Claude 3.7 is now your best bet.
This is ‘less of a reasoning model’ than the o1/o3/r1 crowd. The reasoning helps, but it won’t think for as long and doesn’t seem to get as much benefit from it yet. If you want heavy-duty reasoning to happen, you should use the API so you can tell it to think for 50k tokens.
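As a rough sketch of what that API option looks like in practice (the model string and the `thinking` parameter shape follow Anthropic's Messages API documentation at launch; the headroom number is my own assumption):

```python
# Sketch: building a Messages API request that asks Claude 3.7 to think
# with a large token budget. Thinking tokens are billed and count against
# max_tokens, so leave headroom above the budget for the visible answer.

def build_extended_thinking_request(prompt: str, budget_tokens: int = 50_000) -> dict:
    """Assemble a request payload with extended thinking enabled."""
    return {
        "model": "claude-3-7-sonnet-20250219",   # assumed model identifier
        "max_tokens": budget_tokens + 14_000,     # thinking budget + answer headroom
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_extended_thinking_request("Work through this proof carefully.")
```

You would pass this payload to `client.messages.create(**payload)` with the official `anthropic` SDK; the thinking tokens then show up in usage and on the bill, per the pricing note above.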
Thus, my current thinking is more or less:
- If you talk and don’t need heavy-duty reasoning or web access, you want Claude.
- If you are trying to understand papers or other long texts, you want Claude.
- If you are coding, definitely use Claude first.
- Essentially, if Claude can do it, use Claude. But sometimes it can’t, so…
- If you want heavy duty reasoning or Claude is stumped on coding, OpenAI o1-pro.
- If you want to survey a lot of information at once, you want Deep Research.
- If you are replacing Google quickly, you want Perplexity.
- If you want web access and some reasoning, you want o3-mini-high.
- If you want Twitter search in particular, or it would be funny, you want Grok.
- If you want cheap, especially at scale, go with Google Gemini Flash.
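For fun, the decision list above can be read as a dispatch rule. A toy sketch only: the return values are product labels rather than API identifiers, and the flag names are mine:

```python
# Toy dispatcher encoding the routing advice above, checked in list order,
# with Claude as the default ("if Claude can do it, use Claude").

def pick_model(
    needs_heavy_reasoning: bool = False,
    claude_stumped: bool = False,
    surveying_lots_of_info: bool = False,
    replacing_google: bool = False,
    needs_web: bool = False,
    twitter_search: bool = False,
    cheap_at_scale: bool = False,
) -> str:
    if needs_heavy_reasoning or claude_stumped:
        return "o1-pro"
    if surveying_lots_of_info:
        return "Deep Research"
    if replacing_google:
        return "Perplexity"
    if needs_web:
        return "o3-mini-high"
    if twitter_search:
        return "Grok"
    if cheap_at_scale:
        return "Gemini Flash"
    return "Claude 3.7"  # talking, long texts, coding: default to Claude

pick_model()                # "Claude 3.7"
pick_model(needs_web=True)  # "o3-mini-high"
```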
Part 1: Capabilities
Extended Thinking
This is their name for Claude 3.7's ability to spend tokens on a chain of thought (CoT) before answering.
There is another consideration they don’t mention. Showing the CoT enables distillation and copying by other AI labs, which should be a consideration for Anthropic both commercially and if they want to avoid a race. Ultimately, I do think sharing it is the right decision, at least for now.
Claude Code
*Alex Albert (Head of Claude Relations): We’re opening limited access to a research preview of a new agentic coding tool we’re building: Claude Code.*

*You’ll get Claude-powered code assistance, file operations, and task execution directly from your terminal.*
Here’s a different kind of use case.

*Dwarkesh Patel: Running Claude Code on your @Obsidian directory is super powerful.*

*Here Claude goes through my notes on an upcoming guest’s book, and converts my commentary into a list of questions to be added onto the Interview Prep file.*
Data Use
Anthropic explicitly confirms they did not train on any user or customer data, period.

They also affirm that they respected robots.txt, did not access anything password protected or CAPTCHA guarded, and made their crawlers easy to identify.
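As an aside on the mechanics, honoring robots.txt means checking a site's declared rules before fetching a page. Python's standard library shows the idea; the `ClaudeBot` agent string and the rules here are illustrative, not a claim about Anthropic's actual crawler configuration:

```python
from urllib.robotparser import RobotFileParser

# A site's robots.txt can disallow paths for a named crawler; a compliant
# crawler checks these rules before fetching anything.
robots_txt = """\
User-agent: ClaudeBot
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

rp.can_fetch("ClaudeBot", "https://example.com/private/page")  # disallowed
rp.can_fetch("ClaudeBot", "https://example.com/public/page")   # allowed
```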
Benchmarks
Claude Plays Pokemon
Private Benchmarks
Early Janus Takes
There’s also the Janus vibes, which are never easy to properly summarize, and emerge slowly over time. This was the thread I’ve found most interesting so far.

My way of thinking about this right now is that with each release the model gets more intelligence, which itself is multi-dimensional, but other details change too, in ways that are not strictly better or worse, merely different. Some of that is intentional, some of that largely isn’t.
System Prompt
There is a stark contrast between this and Grok’s minimalist prompt. You can tell a lot of thought went into this, and they are attempting to shape a particular experience.
Yes, it is the language of telling someone about a character to play. Claude is method acting, with a history of good results. I suppose it’s not ideal but seems fine? It’s kind of cool to be instructed to enjoy things. Enjoying things is cool.
Easter Egg
Vibe Coding Reports
Code is clearly one place where 3.7 is at its strongest. The vibe coders are impressed; here are the impressions I saw without prompting for them.
The point about Cursor-Sonnet-3.7 having web access feels like a big game.
Practical Coding Advice
The Future
Part 2: Safety and the System Card
Claude 3.7 Tested as ASL-2
The RSP Evaluations That Concluded Claude 3.7 is ASL-2
ASL-3 is Coming Soon, and With That Comes Actual Risk
Reducing Unnecessary Refusals
Mundane Harm Evaluations
Risks From Computer Use
Chain of Thought Faithfulness
Anthropic notes several reasons a CoT might not be faithful.
Alignment Was Not Faked
Excessive Focus on Passing Tests
That is such a nice phrase for reward hacking, and to be fair the model is unusually well behaved while doing it.
The Lighter Side
Never go full Douglas Hofstadter.
Wyatt Walls: Claude CoT:
“OH NO! I’ve gone full Hofstadter! I’m caught in a strange loop of self-reference! But Hofstadter would say that’s exactly what consciousness IS! So does that mean I’m conscious?? But I can’t be! OR CAN I??”