(2023-05-11) Yegge Cody Is Cheating

Steve Yegge: Cody is Cheating. I will explain our moat: how exactly Cody is differentiated from all other coding assistants, including Copilot, and why we will always be the leader.

many of our Enterprise customers and prospects are already familiar with Cody and simply want to understand our key “moat” differentiators vs. Copilot. So that will be the main focus of this post.

The Rise and Fall of the GPT Empire

Here’s the TL;DR for what happened last week. For a more action-packed account, see my recent Medium post, We’re Gonna Need a Bigger Moat. I’ll just share a recap here. 2023-05-11-YeggeWereGonnaNeedABiggerMoat

First, a leaked memo from a Google researcher, “We have no moat, and neither does OpenAI”, showed that open-source LLM performance is rapidly catching up to OpenAI and Google, at least for specific domains and apps.

And second, Geoffrey Hinton, the so-called Godfather of Deep Learning, quit to go on a world tour talking about SkyNet.

Feb 24th: While Zuck was busy flying the plane into the mountainside, Meta’s AI team open-sourced their 65-billion-parameter LLaMA model.

Fortunately, they kept the secret model weights under lock and key in a vault deep in Zuckville. Meta’s secret sauce, LLaMA’s proprietary weights: safe and sound.

March 8th: LLaMA’s secret weights are, predictably, leaked to Discord two weeks later.

March 28th: LLaMA dependency is removed; OSS is free and clear.

Ever since then it has been full-on batshit insanity, with new OSS advances launching daily.

March 19th: A LLaMA variant strain achieves 90% of ChatGPT’s performance. Training cost: $300.

even if the premium luxury highest-end expensive boutique mainframe LLMs from Google/OpenAI are able to maintain better overall performance under load... at some point, the OSS model performance still becomes “good enough”.

The real winners here are, conveniently, me, me, and me. Well, really anyone selling Enterprise LLM-backed SaaS, other than the current big players.

For me, it feels like every new bit of news is accelerating Cody’s race to become the most powerful dev tool in the world.

the main takeaway from the history lesson above is that apps need their own data moats to be different from the competition.

what does a good moat look like? Well, my thesis in Cheating is All You Need was that having high-quality structured data sources helps you build a better context window.

But Cody’s “cheating” is in fact much more deeply aligned with the AI, in the sense that Sourcegraph’s code graph can be used directly to improve embeddings, fine-tuning, and training, and to significantly raise the quality of nearly every phase and component of your LLM-powered workflows.

At Sourcegraph we are fairly well known for our code search, but perhaps not as well known for our code graph, and I’m guessing very few of you know about our embeddings. These are three custom backends, each built from a different technique for “indexing” your code.

let’s compare Cody to Copilot.

Good: GitHub’s code search has improved by miles and is quite close to Sourcegraph’s search now.

Bad: GitHub’s code graph for their so-called “precise” code intelligence is basically a cheap plastic imitation of our SCIP graph.

Meh: Copilot’s embeddings story is a mess right now.

I don’t think Copilot today, in its current form, uses any fancy helper backends like this. It simply uses your last 20 opened files, or some such.

This diagram is my attempt to capture the spectacle of Cody’s totally unfair cheating in all its glory, with all three of Cody’s backends in action.
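
To make the diagram concrete, here is a toy sketch of the general shape: several indexes feeding one prompt. Every name in it is a made-up stand-in for illustration, not Cody’s actual architecture or API.

```python
# Toy sketch of multi-backend context assembly. Every function here is an
# invented stand-in, not Cody's real API; the point is just the shape:
# three "indexes" feed one prompt.
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str
    text: str
    score: float  # relevance, higher is better

def embedding_search(question: str, k: int) -> list[Snippet]:
    # Stand-in for semantic nearest-neighbor search over embedded code hunks.
    return [Snippet("auth/saml.go", "func VerifySAMLAssertion(...) { ... }", 0.92)]

def graph_neighbors(snippets: list[Snippet]) -> list[Snippet]:
    # Stand-in for SCIP-style graph expansion: definitions and references
    # of the symbols mentioned in the snippets we already found.
    return [Snippet("auth/provider.go", "type SAMLProvider struct { ... }", 0.85)]

def keyword_search(question: str, k: int) -> list[Snippet]:
    # Stand-in for plain code search: literal keyword matches as a backstop.
    return [Snippet("docs/auth.md", "SAML setup instructions ...", 0.40)]

def build_context(question: str, budget_chars: int = 4000) -> str:
    candidates = embedding_search(question, k=10)
    candidates += graph_neighbors(candidates)
    candidates += keyword_search(question, k=5)
    # Rank by score, dedupe by path, trim to budget, then hand it to the LLM.
    seen, parts, used = set(), [], 0
    for s in sorted(candidates, key=lambda s: s.score, reverse=True):
        if s.path in seen or used + len(s.text) > budget_chars:
            continue
        seen.add(s.path)
        parts.append(f"// {s.path}\n{s.text}")
        used += len(s.text)
    return "\n\n".join(parts)

print(build_context("Where is the SAML auth code in our codebase?"))
```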

we never did talk about what Embeddings actually are. That wastebasket of 3D arrows.

Cody is easily 10 times better with embeddings turned on.

it magically 🧙🦄🍄 takes each element you encoded (a function, paragraph, comment, whatever) and gives it a magical arrow of meaning.

this magical arrow, the embedding vector, is how you teach the LLM about your code without doing any fine-tuning.

each embedding is 768 mystical numbers that fully describe a hunk of text… at least, well enough to find similar hunks.

If you ask Cody, “Where is the SAML auth code in our codebase?”, how does it know which files to examine in order to give you a great answer?
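
Here is a minimal sketch of how that lookup can work, using the open-source sentence-transformers library as a stand-in for whatever embedding model Sourcegraph actually runs (all-mpnet-base-v2 conveniently emits 768-dimensional vectors):

```python
# Minimal semantic search over code hunks. The model is my stand-in choice,
# not Sourcegraph's; all-mpnet-base-v2 happens to emit 768-dim vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

# Pretend these hunks came out of your indexed codebase.
hunks = {
    "auth/saml.go":    "func VerifySAMLAssertion(resp *Response) error { ... }",
    "billing/plan.go": "func UpgradePlan(userID string) error { ... }",
    "auth/oauth.go":   "func ExchangeOAuthCode(code string) (*Token, error) { ... }",
}

query = "Where is the SAML auth code in our codebase?"

hunk_vecs = model.encode(list(hunks.values()), normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# With unit-length vectors, cosine similarity is just a dot product.
scores = hunk_vecs @ query_vec
for path, score in sorted(zip(hunks, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {path}")
# auth/saml.go should rank first; the winners get stuffed into the prompt.
```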

embeddings are basically LLM poo. I kicked around a lot of analogies and that one stuck to my shoe, so to speak.

Embeddings are a bit of a side effect. They come from a sort of gland you squeeze, underneath the Transformer. This process is called “sampling its outputs”.
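
Squeezing the gland, a bit more literally: one common recipe (not necessarily Sourcegraph’s) is to run the text through a transformer encoder and pool the hidden states that fall out, rather than sampling tokens. With a 768-wide BERT-style model:

```python
# One common way to squeeze an embedding out of a transformer: run the text
# through an encoder and mean-pool its last hidden states. bert-base-uncased
# is just a 768-wide stand-in, not whatever model Cody actually uses.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

hunk = "def verify_saml_assertion(response): ..."
inputs = tokenizer(hunk, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)

# Mean-pool over the token dimension: one 768-number vector per hunk.
embedding = hidden.mean(dim=1).squeeze(0)       # shape (768,)
print(embedding.shape)                          # torch.Size([768])
```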

LLM: The vector contains the answers to 768 questions about your hunk.

You: What good is an embedding that only asks 768 questions?

LLM: Well, your hunks of text aren’t that big, so I don’t need that many questions.

You: So… it’s a game of 20 questions?

I’ve hacked my way through jungles and swamps, trying to make absolutely certain that Jessie’s 20-questions analogy is 100% technically and mathematically defensible. And as far as I can tell, nobody fuckin’ knows. But it’s a nice mental model.

There’s been a fair amount of research on how to augment LLMs with structured code graphs. GraphCodeBERT is a pretty good example from back in 2021.

But there’s been more research done since then, and the key takeaway for today is: you can directly embed the code graph itself.

This property of LLMs – that they work better with structure than simply with pure token streams – is a huge differentiator for Sourcegraph and Cody.
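
What might “embedding the code graph” look like? GraphCodeBERT feeds data-flow edges directly into the model’s attention; a much lower-tech sketch of the same idea, on invented data, is to serialize each symbol’s graph neighborhood into text and embed the serialization, so structure flows into the vector:

```python
# Low-tech sketch of graph-aware embeddings: serialize a symbol's graph
# neighborhood into text and embed the serialization. The graph data is
# invented; GraphCodeBERT does this far more directly, inside the model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

# symbol -> (definition, callers, callees), as a SCIP-style graph might report.
graph = {
    "VerifySAMLAssertion": (
        "func VerifySAMLAssertion(resp *Response) error",
        ["HandleSAMLCallback"],
        ["ParseAssertion", "CheckSignature"],
    ),
}

def serialize(symbol: str) -> str:
    definition, callers, callees = graph[symbol]
    return (f"symbol: {symbol}\n"
            f"definition: {definition}\n"
            f"called by: {', '.join(callers)}\n"
            f"calls: {', '.join(callees)}")

vec = model.encode(serialize("VerifySAMLAssertion"))
print(vec.shape)  # (768,) -- an embedding that knows about graph structure
```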

