Zvi Mowshowitz: AI #132 Part 1: Improved AI Detection. One result of going on vacation was that I wasn’t able to spin events off into focused posts this week, so I’m going to fall back on splitting the weekly instead, plus reserving a few subtopics for later posts, including AI craziness (the Tim Hua post on this is excellent), some new OpenAI largely policy-related shenanigans, and the continuing craziness of some people who should very much know better confidently saying that we are not going to hit AGI any time soon, plus some odds and ends including dead internet theory.
That still leaves tons of other stuff.
Table of Contents
- Language Models Offer Mundane Utility. How much improvement have we seen?
- Language Models Don’t Offer Mundane Utility. Writing taste remains elusive.
- On Your Marks. Opus 4.1 on METR graph, werewolf, WeirdML, flash fiction.
- Choose Your Fighter. The right way to use the right fighter, and a long tail.
- Fun With Media Generation. Justine Moore’s slate of AI creative tools.
- Deepfaketown and Botpocalypse Soon. Maybe AI detectors work after all?
- Don’t Be Evil. Goonbots are one thing, but at some point you draw the line.
- They Took Our Jobs. A second finding suggests junior hiring is suffering.
- School Daze. What do you need to learn in order to be able to learn from AIs?
- The Art of the Jailbreak. Prompt engineering game Gandalf.
- Overcoming Bias. AIs find center-left think tanks superior, AEI reports.
- Get Involved. MATS 9.0, AIGS needs Canadian dollars, Anthropic Futures Forum.
- Introducing. Grok Code Fast 1, InstaLILY, Brave Leo AI browser.
- Unprompted Attention. OpenAI offers a realtime prompting guide.
- In Other AI News. Google survives its antitrust case. GOOG +9%.
- Show Me the Money. Anthropic raises $13b at $183b. Meta might need help.
Language Models Offer Mundane Utility
How much have LLMs improved for practical purposes in the last year? Opinions are split, but the consensus is a little above Somewhat Better.
To me the answer is very clearly Considerably Better, to the point that about half my uses wouldn’t have been worth bothering with a year ago, and to the extent I’m doing coding, it is way better.
Language Models Don’t Offer Mundane Utility
This one time, the rumors of a model suddenly getting worse were true: there was a nine-hour period where Claude Opus quality was accidentally degraded by a rollout of the inference stack. The change has now been rolled back and quality has recovered.
Taco Bell is rethinking its use of artificial intelligence (AI) to power drive-through restaurants in the US after comical videos of the tech making mistakes were viewed millions of times.
This seems very obviously a Skill Issue on multiple fronts. The technology can totally handle this, especially given a human can step in at any time if there is an issue.
Getting models to have writing taste remains a struggle. At least to my eyes, even the models with relatively good taste reliably have terrible taste in absolute terms, and even the samples people say are good are not good.
The problem is that the people have terrible taste, really no good, very bad taste, as confirmed every time we do a comparison that says GPT-4.5 is preferred over Emily Dickinson and Walt Whitman or whatnot. Are you actually going to maximize for ‘elite taste’ over the terrible taste of users, and do so sufficiently robustly to overcome all your other forms of feedback? I don’t know that you could, or, if you could, that you would even want to.
On Your Marks
Claude Opus 4.1 joins the METR graph, 30% beyond Opus 4 and in second place behind GPT-5, although within the margin of error.
A new math benchmark looks at questions that stump at least one active model. GPT-5 leads with 43%.
Choose Your Fighter
If you use Google Gemini for something other than images, a reminder to always use it in AI Studio, never in the Gemini app, if you need high performance. Quality in AI Studio is much higher.
If you use GPT-5, of course, only use the router if you need very basic stuff.
Near: gpt5 router gives me results equivalent to a 1995 markov chain bot.
I do have some narrow use cases where I’ve found GPT-5-Auto is the right tool.
Nikunj Kothari: Here's what Cursor assumes: you want to code. Replit? You want to ship. But Claude Code starts somewhere else entirely. It assumes you have a problem.
Yes, the terminal looks technical because it is. But when you only need to explain problems, not understand solutions, everything shifts.
Fun With Media Generation
Justine Moore gives us a presentation on the state of play for AI creative tools. Nothing surprising, but details are always good.
Deepfaketown and Botpocalypse Soon
How accurate are AI writing detectors? Brian Jabarian and Alex Imas put four to the test. RoBERTa tested as useless, but Pangram, Originality and GPTZero all had low false positive rates on pre-LLM text passages (under 2.5% across the board, usually under 1%), at settings that also had acceptable false negative rates.
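To pin down what those numbers mean, here is a minimal sketch of the evaluation setup, assuming a detector that returns an AI-likelihood score; the `detector` function and the text corpora are hypothetical stand-ins, not the paper’s actual models or data.

```python
# Minimal sketch of the detector-evaluation setup described above.
# The detector and the text samples are hypothetical stand-ins, not
# the paper's actual models or corpus.

def false_positive_rate(detector, human_texts, threshold=0.5):
    """Share of genuinely human-written texts wrongly flagged as AI."""
    flags = [detector(text) >= threshold for text in human_texts]
    return sum(flags) / len(flags)

def false_negative_rate(detector, ai_texts, threshold=0.5):
    """Share of AI-written texts that slip through as 'human'."""
    misses = [detector(text) < threshold for text in ai_texts]
    return sum(misses) / len(misses)

# Pre-LLM passages (e.g. text published before 2020) make a clean
# human-only test set: any positive on them is a false positive.
# Raising the threshold trades false positives for false negatives,
# which is why both rates are reported at given settings.
```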
Don’t Be Evil
Yishan: People ask me why I invested in [AN AI HOROSCOPE COMPANY]. They’re like “it’s just some slop AI horoscope!”
My reply is “do you have ANY IDEA how many women are into horoscopes and astrology??? And it’ll run on your phone and know you intimately and help you live your life?”
AI is not just male sci-fi tech. Men thought it would be sex robots but it turned out to be AI boyfriends. The AI longhouse is coming for you and none of you are ready.
Tracing Woods: People ask me why I invested in the torment nexus from the classic sci-fi novel “don’t invest in the torment nexus”
my reply is “do you have ANY IDEA how profitable the torment nexus will be?”
the torment nexus is coming for you and none of you are ready
Seriously. Don’t be evil.
They Took Our Jobs
Ethan Mollick: A second paper also finds Generative AI is reducing the number of junior people hired (while not impacting senior roles).
No, this doesn’t show that AI is ‘an existential threat to human labor’ via this sort of job taking. I do think AI poses an existential threat to human labor, but more as a side effect of the way it poses an existential threat to humans, which would also threaten their labor and jobs, and I agree that this result doesn’t tell us much about that.
Context switching is a superpower if you can get good at it, which introduces new maximization problems.
Nabeel Qureshi: Watching this guy code at a wework [in Texas]. He types something into the Cursor AI pane, the AI agent starts coding, he switches tabs and plays 1 min bullet chess for 5 mins; checks in with the agent, types a bit more, switches back to the chess, repeats...
Davidad: If you can context-switch to a game or puzzle while your AI agent is processing, then you should try instead context-switching to another AI agent instance where you are working on a different branch or codebase
Not all context switching is created equal. Switching into a chess game is a different move than switching into another coding task.
Still, yes, multi-Clauding will always be the dream, if you can pull it off. And if you don’t net gain productivity but do get to do a bunch of other little tasks, that still counts (to me, anyway) as a massive win.
Kevin Frazier: In the not-so-distant future, access to AI-informed healthcare will distinguish good versus bad care. I'll take Dr. AI. Case in point below.
Radiologists are not yet going away, and AIs are not perfect, but AIs are already less imperfect than doctors at a wide range of tasks, in a ‘will kill the patient less often’ type of way. With access to 5-level models, failure to consult them in any case where you are even a little uncertain is malpractice. Not in a legal sense, not yet, but in a ‘do right by the patient’ sense.
- I commented with a question about this, but got no replies. I'm kinda surprised/dubious about "pure LLMs" having these medical success rates. Am I missing something?
And/or: they talk about it being impossible to point to sources, but don't some LLMs do that already? Even ChatGPT5? Because I can't help but wonder if the "vignettes" mimicked the literature where diseases were discussed/identified. So "prompt engineering" becomes "vignette engineering". Maybe most doctors can do/learn that, but it needs to be shown?
And aren't the radiology-image AIs more CNN than LLM?
I also posted direct to the source tweet.
School Daze
My position has long been:
If you want to use AI to learn, it is the best tool ever invented for learning.
If you want to use AI to not learn, it is the best tool ever invented for that too.
Which means the question is, which will students choose? Are you providing them with reason to want to learn?
Here’s a plausible hypothesis: to use LLMs to learn, you first need to establish basic skills, or else you end up using them to not learn instead.
Henry Shevlin: High-school teacher friend of mine says there’s a discontinuity between (i) 17-18 year olds who learned basic research/writing before ChatGPT and can use LLMs effectively, vs (ii) 14-16 year olds who now aren’t learning core skills to begin with, and use LLMs as pure crutches.
Natural General Intelligence (obligatory): Kids with “Google” don’t know how to use the library. TV has killed their attention span, nobody reads anymore. Etc.
You definitely need some level of basic skills. If you can’t read and write, and you’re not using LLMs in modes designed explicitly to teach you those basic skills, you’re going to have a problem.
I am still skeptical that this is a real phenomenon. We do not yet have, to my knowledge, any graphs that show this discontinuity as expressed in skills and test scores, either over time or between cohorts. We should be actively looking and testing for it, and be prepared to respond if it happens, but the response needs to focus on ‘rethink the way schools work’ rather than ‘try in vain to ban LLMs’, which would only backfire.
The Art of the Jailbreak
Overcoming Bias
A study from the American Enterprise Institute found that top LLMs (OpenAI, Google, Anthropic, xAI and DeepSeek) consistently rate think tanks better the closer they are to center-left on the American political spectrum. This is consistent with prior work and comes as no surprise whatsoever. It is a question of magnitude only.
Sentiment analysis has what seems like a bigger gap than the ultimate ratings.
Note that the gaps reported here are center-left versus right, not left versus right, which would be smaller, as there is as much ‘center over extreme’ preference here as there is for left versus right.
Why it matters
LLM-generated reputations already steer who is cited, invited, and funded. If LLMs systematically boost center-left institutes and depress right-leaning ones, writers, committees, and donors may unknowingly amplify a one-sided view, creating feedback loops that entrench any initial bias.
My model of how funding works for think tanks is that support comes from ideologically aligned sources, and citations are mostly motivated by politics.
Some of these are constructive steps, but I have another idea? One could treat these evaluations of lacking morality, research quality and objectivity as pointing to real problems, and work to fix them? Perhaps they are not errors, or are only partly the result of bias, especially if you are not highly ranked within your own ideological sector.
Get Involved
Introducing
Grok Code Fast 1, available in many places or at $0.20/$1.50 per million input/output tokens on the API. They offer a guide here which seems mostly similar to what you’d do with any other AI coder.
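For the curious, calling it looks roughly like this; the xAI API is OpenAI-compatible as far as I know, but treat the base URL and model identifier below as assumptions to check against their guide rather than details from the post.

```python
# Hypothetical sketch of calling Grok Code Fast 1 through xAI's
# OpenAI-compatible API. The base URL and model name are assumptions
# to verify against xAI's documentation.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",  # assumed xAI endpoint
    api_key="YOUR_XAI_API_KEY",
)

response = client.chat.completions.create(
    model="grok-code-fast-1",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)
print(response.choices[0].message.content)
```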
Unprompted Attention
OpenAI released a Realtime Prompting Guide. Carlos Perez looked into some of its suggestions, starting with ‘before any call, speak neutral filler, then call’ to avoid ‘awkward silence during tool calls.’ Um, no, thanks?
In Other AI News
The Time 100 AI 2025 list is out, including Pliny the Liberator. The list has plenty of good picks, as it would be very hard to avoid that, but it also has some obvious holes. How can I take such a list seriously if it doesn’t include Demis Hassabis?
Google will not be forced to do anything crazy like divest Chrome or Android, the court rightfully calling it overreach to have even asked. Nor will Google be barred from paying for Chrome to get top placement, so long as users can switch, as the court realized that this mainly devastates those currently getting payments. For their supposed antitrust violations, Google will also be forced to turn over certain tailored search index and user-interaction data, but not ads data, to competitors. I am very happy with the number of times the court replied to requests with ‘that has nothing to do with anything involved in this case, so no.’
Show Me the Money
Anthropic finalizes its raise of $13 billion at a $183 billion post-money valuation. They note they started 2025 at $1 billion in run-rate revenue and passed $5 billion just eight months later, over 10% of which is from Claude Code which grew 10x in three months.
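For scale, a quick back-of-the-envelope on what those figures imply (my arithmetic, not the post’s):

```python
# Implied growth rates from the figures above (back-of-the-envelope).
start, end, months = 1e9, 5e9, 8          # run-rate revenue over 2025 so far
overall = (end / start) ** (1 / months)   # ~1.22x per month overall
claude_code = 10 ** (1 / 3)               # 10x in 3 months -> ~2.15x per month
claude_code_run_rate = 0.10 * end         # "over 10%" of $5B -> over $500M

print(f"overall: ~{overall:.2f}x/month, Claude Code: ~{claude_code:.2f}x/month")
print(f"Claude Code run-rate: over ${claude_code_run_rate / 1e6:,.0f}M")
```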