(2023-02-09) Chiang: ChatGPT Is a Blurry JPEG of the Web

Ted Chiang: ChatGPT Is a Blurry JPEG of the Web. Imagine that you’re about to lose your access to the Internet forever. In preparation, you plan to create a compressed copy of all the text on the Web, so that you can store it on a private server. Unfortunately, your private server has only one per cent of the space needed.

You write a lossy algorithm that identifies statistical regularities in the text and stores them in a specialized file format.

The only catch is that, because the text has been so highly compressed, you can’t look for information by searching for an exact quote; you’ll never get an exact match, because the words aren’t what’s being stored. To solve this problem, you create an interface that accepts queries in the form of questions and responds with answers that convey the gist of what you have on your server.
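The passage above can be made concrete with a toy sketch (my illustration of the analogy, not ChatGPT's actual mechanism): keep only bigram statistics of a text, discard the text itself, then answer "queries" by regenerating plausible word sequences from those statistics. The exact original wording is unrecoverable, which is why exact-quote search fails.

```python
# Toy "lossy text compression": store which word follows which (a bigram
# table) instead of the text itself. Hypothetical illustration only.
from collections import defaultdict
import random

def compress(text):
    """Keep only a statistical regularity: for each word, its observed followers."""
    model = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return dict(model)

def decompress(model, start, length=8):
    """Regenerate plausible text from the statistics; the original is gone."""
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

text = "the web is a copy of the web and the web is blurry"
model = compress(text)
# Produces text that conveys the gist, not an exact quote:
print(decompress(model, "the"))
```

The output reads like the source without matching it word for word, which is the "acceptable blurriness" the essay describes; hallucination is what happens when the regenerated sequence asserts something the source never contained.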

What I’ve described sounds a lot like ChatGPT, or most any other large-language model (LLM). Think of ChatGPT as a blurry JPEG of all the text on the Web.

This analogy to lossy compression is not just a way to understand ChatGPT’s facility at repackaging information found on the Web by using different words. It’s also a way to understand the “hallucinations,” or nonsensical answers to factual questions, to which large-language models such as ChatGPT are all too prone.

Since 2006, an A.I. researcher named Marcus Hutter has offered a cash reward known as the Prize for Compressing Human Knowledge, or the Hutter Prize.

This isn’t just an exercise in smooshing. Hutter believes that better text compression will be instrumental in the creation of human-level artificial intelligence, in part because the greatest degree of compression can be achieved by understanding the text.
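The link between regularity and compression is easy to demonstrate (my own minimal example, not the Hutter Prize benchmark): a general-purpose compressor like zlib shrinks text with exploitable statistical structure dramatically, while random bytes, which contain no regularities, barely compress at all.

```python
# Statistical regularity is what makes compression possible:
# repetitive text shrinks a lot, random bytes do not.
import os
import zlib

regular = b"the cat sat on the mat. " * 100   # 2,400 bytes of regular text
random_ = os.urandom(len(regular))            # 2,400 bytes with no structure

print(len(zlib.compress(regular)), len(regular))   # far smaller than the input
print(len(zlib.compress(random_)), len(random_))   # roughly the same size
```

Hutter's point is the limit of this idea: the more of the text's structure a model captures, up to and including its meaning, the closer it can get to the theoretical minimum size.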

A lot of uses have been proposed for large-language models. Thinking about them as blurry JPEGs offers a way to evaluate what they might or might not be well suited for. Let’s consider a few scenarios.

Can large-language models take the place of traditional search engines? For us to have confidence in them, we would need to know that they haven’t been fed propaganda and conspiracy theories.

There’s still the matter of blurriness. There’s a type of blurriness that is acceptable, which is the re-stating of information in different words. Then there’s the blurriness of outright fabrication, which we consider unacceptable when we’re looking for facts.

I’d say that anything that’s good for content mills is not good for people searching for information. The rise of this type of repackaging is what makes it harder for us to find what we’re looking for online right now.

A useful criterion for gauging a large-language model’s quality might be the willingness of a company to use the text that it generates as training material for a new model.

Can large-language models help humans with the creation of original writing? To answer that, we need to be specific about what we mean by that question.

No one can speak for all writers, but let me make the argument that starting with a blurry copy of unoriginal work isn’t a good way to create original work. If you’re a writer, you will write a lot of unoriginal work before you write something original.

Just how much use is a blurry JPEG, when you still have the original?
