(2024-12-26) Zvim Ai96 O3 But Not Yet For Thee
Zvi Mowshowitz: AI #96: o3 But Not Yet For Thee. The year in models certainly finished off with a bang. In this penultimate week, we got OpenAI o3, which purports to offer vastly more efficient performance than o1, and also to let us spend vastly more compute when we want a superior answer. We also got DeepSeek v3, which claims to have trained a roughly Claude Sonnet-strength model for only $6 million, using 37b active parameters per token (671b total via mixture of experts).
Both are potential game changers, both in their practical applications and in terms of what their existence predicts for our future. It is also too soon to know whether either of them is the real deal. Both are mostly not covered here quite yet.
Table of Contents
- Language Models Offer Mundane Utility. Make best use of your new AI agents.
- Language Models Don’t Offer Mundane Utility. The uncanny valley of reliability.
- Flash in the Pan. o1-style thinking comes to Gemini Flash. It’s doing its best.
- The Six Million Dollar Model. Can they make it faster, stronger, better, cheaper?
- And I’ll Form the Head. We all have our own mixture of experts.
- Huh, Upgrades. ChatGPT can use Mac apps, unlimited (slow) holiday Sora.
- o1 Reactions. Many really love it, others keep reporting being disappointed.
- Fun With Image Generation. What is your favorite color? Blue. It’s blue.
- Introducing. Google finally gives us LearnLM.
- They Took Our Jobs. Why are you still writing your own code?
- Get Involved. Quick reminder that opportunity to fund things is everywhere.
- In Other AI News. Claude gets into a fight over LessWrong moderation.
- You See an Agent, You Run. Building effective agents by not doing so.
- Another One Leaves the Bus. Alec Radford leaves OpenAI.
- Quiet Speculations. Estimates of economic growth keep coming in super low.
- Lock It In. What stops you from switching LLMs?
- The Quest for Sane Regulations. Sriram Krishnan joins the Trump administration.
- The Week in Audio. The many faces of Yann LeCun. Anthropic’s co-founders talk.
- A Tale as Old as Time. Ask why mostly in a predictive sense.
- Rhetorical Innovation. You won’t not wear the f***ing hat.
- Aligning a Smarter Than Human Intelligence is Difficult. Cooperate with yourself.
- People Are Worried About AI Killing Everyone. I choose you.
- The Lighter Side. Please, no one call human resources.
Language Models Offer Mundane Utility
How does your company make best use of AI agents? Austin Vernon frames the issue well: AIs are super fast, but they need proper context. So if you want to use AI agents, you’ll need to ensure they have access to context, in forms that don’t bottleneck on humans. Take the humans out of the loop, minimize meetings and touch points. Put all your information into written form, such as within wikis. Have automatic tests and approvals, but have the AI call for humans when needed via ‘stop work authority’ – I would flip this around and let the humans stop the AIs, too. (lean, friction)
That all makes sense, and not only for corporations. If there’s something you want your future AIs to know, write it down in a form they can read, and try to design your workflows such that you can minimize human (your own!) touch points.
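Purely as an illustration of that pattern (nothing here comes from the post, and every name is a hypothetical placeholder), the ‘stop work authority’ idea might look roughly like the sketch below: the agent works from written context and halts itself by escalating asynchronously, rather than waiting on a meeting.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    output: str
    checks_passed: bool  # did the automatic tests / approvals succeed?

def run_agent_step(task: str, written_context: str) -> StepResult:
    """Placeholder for a call into whatever agent framework you actually use."""
    raise NotImplementedError

def escalate_to_human(task: str, result: StepResult) -> None:
    """Placeholder: file a ticket or ping a channel rather than scheduling a meeting."""
    print(f"Agent paused work on {task!r} and is waiting for human review.")

def run_task(task: str, written_context: str) -> str | None:
    result = run_agent_step(task, written_context)
    if result.checks_passed:
        return result.output          # automatic approval path, no human touch point
    escalate_to_human(task, result)   # 'stop work authority': the agent halts itself
    return None
```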
What will citizenship mean in the age of AI? I have absolutely no idea. So how do you prepare for that? Largely the same goes for wellbeing. A lot of this could be thought of as: focus on the general and the adaptable, and focus less on the specific, including things aimed specifically at jobs and other current forms of paid work – you want to be creative and useful and flexible and able to roll with the punches.
That of course assumes that you are taking the world as given, rather than trying to change the course of history (change the world). In which case, there’s a very different calculation.
Language Models Don’t Offer Mundane Utility
Theo: Something I hate when using Cursor is that, sometimes, it will randomly delete some of my code for no reason, sometimes removing an entire feature.
I hate Apple Intelligence email/etc. summaries. They’re just off enough to make me think it is a new email in the thread, but not useful enough to be a good summary. Uncanny valley.
Flash in the Pan
The latest rival to at least o1-mini is Google Gemini-2.0-Flash-Thinking, which I’m tempted to refer to (because of reasons) as gf1.
One cool thing about Thinking is that (like DeepSeek’s Deep Thought) it explains its chain of thought much better than o1.
Other reports I’ve seen are less excited about the quality, and when o3 got announced it seemed like everyone got distracted.
The Six Million Dollar Model
Having no respect for American holidays, DeepSeek dropped their v3 today.
As always, not so fast. DeepSeek is not known to chase benchmarks, but one never knows the quality of a model until people have a chance to bang on it a bunch.
And I’ll Form the Head
Increasingly the correct solution to ‘what LLM or other AI product should I use?’ is ‘you should use a variety of products depending on your exact use case.’
- Gallabytes: OpenAI o1 Pro is by far the smartest single-turn model.
- Claude is still far better at conversation.
- Google Gemini can do many things quickly and is excellent at editing code.
Which almost makes me think the ideal programming workflow right now is something somewhat unholy like:
- Discuss, plan, and collect context with Claude Sonnet.
- Sonnet provides a detailed request to OpenAI o1 (Pro).
- o1 spits out the tricky code.
- In simple cases (most of them), it could make the edit directly.
- For complicated changes, it could instead output a detailed plan for each file it needs to change and pass the actual making of that change to Google Gemini Flash.
- This is too many steps. LLM orchestration spaghetti. But this feels like a real direction.
This is mostly the same workflow I used before o1, when there was only Sonnet. I’d discuss to form a plan, then use that to craft a request, then make the edits. The swap doesn’t seem like it makes things that much trickier; the logistical trick is getting all the code implementation automated.
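As a rough sketch of what that orchestration might look like (the ask_* wrappers are hypothetical stand-ins for whichever SDKs you actually use, not real vendor APIs):

```python
def ask_sonnet(prompt: str) -> str:
    """Placeholder: call Claude Sonnet via your SDK of choice."""
    raise NotImplementedError

def ask_o1(prompt: str) -> str:
    """Placeholder: call o1 (Pro) via your SDK of choice."""
    raise NotImplementedError

def ask_flash(prompt: str) -> str:
    """Placeholder: call Gemini Flash via your SDK of choice."""
    raise NotImplementedError

def implement_change(feature_request: str, codebase_context: str) -> str:
    # 1. Discuss, plan, and collect context with Sonnet, ending in a detailed spec.
    spec = ask_sonnet(
        "Turn this request into a detailed implementation spec.\n"
        f"Request: {feature_request}\nCodebase context: {codebase_context}"
    )
    # 2. Hand the spec to o1 for the tricky code plus a per-file edit plan.
    plan = ask_o1(f"Write the core code and a per-file edit plan for:\n{spec}")
    # 3. Fan the mechanical per-file edits out to a fast, cheap model.
    return ask_flash(f"Apply this edit plan file by file and return the edits:\n{plan}")
```

The specific wiring doesn’t matter; the point is that each hop corresponds to one step in the list above, and it is all plumbing rather than anything clever.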
o1 Reactions