(2023-09-04) Willison Llm Now Provides Tools For Working With Embeddings
Simon Willison: LLM now provides tools for working with embeddings. LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.
I explain embeddings in detail (with both a video and heavily annotated slides) in Embeddings: What they are and why they matter. ((2023-10-23) Willison Embeddings What They Are And Why They Matter)
Things you can do with embeddings include:
- Find related items. I use this on my TIL site to display related articles, as described in Storing and serving related documents with openai-to-sqlite and embeddings.
- Build semantic search. As shown above, an embeddings-based search engine can find content relevant to the user’s search term even if none of the keywords match.
- Implement retrieval augmented generation (RAG)—the trick where you take a user’s question, find relevant documentation in your own corpus and use that to get an LLM to spit out an answer. More on that here.
- Clustering: you can find clusters of nearby items and identify patterns in a corpus of documents.
- Classification: calculate the embedding of a piece of text and compare it to pre-calculated “average” embeddings for different categories.
The new release adds several command-line tools for working with embeddings, plus a new Python API for working with embeddings in your own code.
It also adds support for installing additional embedding models via plugins. I’ve released one plugin for this so far: llm-sentence-transformers, which adds support for new models based on the sentence-transformers library.
LLM already uses SQLite to store prompts and responses. It was a natural fit to use SQLite to store embeddings as well.
LLM 0.9 introduces the concept of a collection of embeddings. A collection has a name—like readmes—and contains a set of embeddings, each of which has an ID and an embedding vector.
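Populating a collection from the command line looks something like this sketch, where the collection name, ID, content and database file are my own illustrations rather than examples from the post:

```bash
# Embed one string and store it in the "readmes" collection inside
# embeddings.db under the ID "demo-readme"; --store also keeps the text
llm embed readmes demo-readme \
  -d embeddings.db \
  -c 'Example README content to embed' \
  --store
```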
Embedding similarity search
Once you’ve built a collection, you can search for similar embeddings using the llm similar command.
The -c "term" option will embed the text you pass in using the embedding model for the collection and use that as the comparison vector:
The llm embed command embeds a single string at a time. llm embed-multi is much more powerful: you can feed a CSV or JSON file, a SQLite database or even have it read from a directory of files in order to embed multiple items at once. Many embeddings models are optimized for batch operations, so embedding multiple items at a time can provide a significant speed boost.
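For example, this sketch (the file and collection names are my own illustration) embeds every row of a CSV file, where my understanding is that the first column is treated as the ID:

```bash
# Embed each row of items.csv into the "items" collection; the first
# column provides the ID and the remaining columns become the content
llm embed-multi items items.csv \
  --format csv \
  -d embeddings.db \
  --store
```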
First, I’m going to create embeddings for every single one of my Apple Notes.
Next, I’m going to embed the content of all of those notes using the sentence-transformers/all-MiniLM-L6-v2 model:
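A sketch of that command, assuming the notes have already been exported into a notes table (with id, title and body columns) in notes.db:

```bash
# Embed every note, storing the vectors (and the original text, via --store)
# in a collection called "notes" inside notes.db
llm embed-multi notes \
  -d notes.db \
  --sql 'select id, title, body from notes' \
  -m sentence-transformers/all-MiniLM-L6-v2 \
  --store
```

Using that model requires the llm-sentence-transformers plugin mentioned above.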
This took around 15 minutes to run, and increased the size of my database by 13MB.
And now I can run embedding similarity operations against all of my Apple Notes! `llm similar notes -d notes.db -c 'ideas for blog posts'`
Embedding files in a directory
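The --files option to llm embed-multi reads every file matching a glob pattern and embeds its contents, using (as I understand the option) each file's relative path as its ID. A sketch with an illustrative directory and collection name:

```bash
# Embed the contents of every README.md found beneath the current
# directory into a "readmes" collection, keyed by relative file path
llm embed-multi readmes \
  -d embeddings.db \
  --files . '**/README.md' \
  --store
```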
Embeddings in Python
LLM 0.9 also introduces a new Python API for working with embeddings.
If you just want to embed content and handle the resulting vectors yourself, you can use llm.get_embedding_model()
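A minimal sketch of that lower-level API; the model ID here assumes the llm-sentence-transformers plugin is installed:

```python
import llm

# Load an embedding model and turn a string into a list of floats
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
vector = model.embed("ideas for blog posts")
print(len(vector))  # the dimensionality of the embedding vector
```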
The second aspect of the Python API is the llm.Collection class, for working with collections of embeddings.
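Here is a hedged sketch of how that class can be used, with made-up IDs and content, following the pattern in the LLM documentation as I understand it:

```python
import sqlite_utils
import llm

# A collection of embeddings stored in a SQLite database
db = sqlite_utils.Database("embeddings.db")
collection = llm.Collection(
    "entries", db, model_id="sentence-transformers/all-MiniLM-L6-v2"
)

# store=True keeps the original text alongside each vector
collection.embed("1", "A note about SQLite backups", store=True)
collection.embed("2", "Thoughts on embedding models", store=True)

# Find the stored entries most similar to a query string
for entry in collection.similar("database snapshots", number=2):
    print(entry.id, entry.score)
```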
As with everything else in LLM, the goal is that anything you can do with the CLI can be done with the Python API, and vice-versa.
Clustering with llm-cluster
Another interesting application of embeddings is that you can use them to cluster content—identifying patterns in a corpus of documents.
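The llm-cluster plugin adds an llm cluster command for this. A sketch of running it against the notes collection built earlier (the cluster count of 10 is arbitrary):

```bash
llm install llm-cluster

# Group the "notes" collection into 10 clusters and print them as JSON
llm cluster notes 10 -d notes.db
```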
The clusters it finds look pretty good! But wouldn’t it be neat if we had a snappy title for each one?
The --summary option can provide exactly that.
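Something like this, as I understand the plugin's options:

```bash
# Ask an LLM to generate a short descriptive title for each cluster
llm cluster notes 10 -d notes.db --summary
```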
Future plans
Indexing
The llm similar command and collection.similar() Python method currently use effectively the slowest brute force approach possible: they compare the query vector against every stored embedding in the collection.
This works fine for collections with a few hundred items, but will start to suffer for collections of 100,000 or more.
Chunking
When building an embeddings-based search engine, the hardest challenge is deciding how best to “chunk” the documents.
I’m still trying to get a good feeling for the strategies that make sense here. Some that I’ve seen include:
- Split a document up into fixed length shorter segments.
- Split into segments, but include a ~10% overlap with the previous and next segments, to reduce problems caused by words and sentences being split in a way that disrupts their semantic meaning (see the sketch after this list).
- Split by sentence, using NLP techniques.
- Split into higher-level sections, based on things like document headings.
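As an illustration of the first two strategies (this helper is my own sketch, not part of LLM), a fixed-length chunker with overlap might look like this:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-length chunks, each starting `overlap`
    characters before the end of the previous one (~10% of chunk_size)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A 1,200 character document becomes three overlapping chunks
print(len(chunk_text("x" * 1200)))
```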
Then there are more exciting, LLM-driven approaches:
- Generate an LLM summary of a document and embed that.
- Ask an LLM “What questions are answered by the following text?” and then embed each of the resulting questions!
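A hedged sketch of that last idea using LLM's Python API; the model choice, prompt wording and naive line-by-line splitting are all my own assumptions:

```python
import llm

# Ask a chat model which questions a passage answers, then embed each question
chat_model = llm.get_model("gpt-3.5-turbo")  # requires an OpenAI API key
embedding_model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")

text = "SQLite databases can be copied while in use with the .backup command."
response = chat_model.prompt(
    "What questions are answered by the following text? "
    "Reply with one question per line.\n\n" + text
)
questions = [line.strip() for line in response.text().splitlines() if line.strip()]
vectors = [embedding_model.embed(question) for question in questions]
```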
Get involved