Can I Use ClickHouse for Vector Search?
Yes, ClickHouse can perform vector search. The main advantages of using ClickHouse for vector search compared to using more specialized vector databases include:
- Using ClickHouse’s filtering and full-text search capabilities to refine your dataset before performing a search.
- Performing analytics on your datasets.
- Running a JOIN against your existing data.
- No need to manage yet another database and complicate your infrastructure.
- Create embeddings
Your data (documents, images, or structured data) must be converted to embeddings. We recommend creating embeddings using the OpenAI Embeddings API or using the open-source Python library SentenceTransformers.
You can think of an embedding as a large array of floating-point numbers that represents your data. Check out this guide from OpenAI to learn more about embeddings.
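To make the "array of floating-point numbers" idea concrete, here is a minimal sketch in Python. It does not call the OpenAI Embeddings API or SentenceTransformers; `toy_embedding` is a hypothetical stand-in that uses simple feature hashing, and the 8-dimension size is an assumption chosen for readability. A real model produces far larger and far more meaningful vectors.

```python
import hashlib
import math


def toy_embedding(text, dim=8):
    """Feature-hashing stand-in for a real embedding model (illustration only)."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # Hash each token to a stable bucket so the output is deterministic.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    # Scale to unit length; fall back to 1.0 so empty text does not divide by zero.
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


embedding = toy_embedding("a dog playing in the park")
print(len(embedding), embedding)
```

Whatever model you choose, the result has the same shape: one fixed-length list of floats per document or image.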
- Store the embeddings
Once you have generated embeddings, you need to store them in ClickHouse. Each embedding should be stored in a separate row and can include metadata for filtering, aggregations, or analytics. Here’s an example of a table that can store images with captions:
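A minimal sketch of such a table, kept as a Python string so it can sit next to the loading code. The table name `images`, the column `image_embedding`, the `ORDER BY _file` clause, and the helper `to_row` are assumptions made for illustration; `_file` and `caption` are the columns the text refers to later.

```python
# Hypothetical DDL: `images`, `image_embedding`, and the ORDER BY key are assumed names.
CREATE_IMAGES_TABLE = """
CREATE TABLE images
(
    _file LowCardinality(String),
    caption String,
    image_embedding Array(Float32)
)
ENGINE = MergeTree
ORDER BY _file
"""

EMBEDDING_DIM = 4  # assumed toy size; real models emit hundreds of dimensions or more


def to_row(file_name, caption, embedding, dim=EMBEDDING_DIM):
    """Build one row (one embedding per row) and check the vector length."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} values, got {len(embedding)}")
    return (file_name, caption, [float(x) for x in embedding])


# Made-up sample values, for illustration only.
row = to_row("dog_001.jpg", "A dog running on a beach", [0.1, 0.2, 0.3, 0.4])
print(row)
```

Checking the length up front is a design choice here: distance functions compare two arrays element by element, so every stored vector should have the same number of values as the query vector.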
- Search for related embeddings
Let’s say you want to search for pictures of dogs in your dataset. You can use a distance function like cosineDistance to take an embedding of a dog image and search for related images:
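The search step can be sketched as follows. `cosine_distance` is a plain-Python reference for the value `cosineDistance` computes (one minus the cosine similarity), and `build_search_query` is a hypothetical helper that renders the query text. The table `images`, the column `image_embedding`, and the alias `score` are assumed names, and the two-value vector is a made-up placeholder for a real embedding.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_search_query(query_embedding, limit=10):
    """Render a top-N similarity query against the assumed `images` table."""
    vector_literal = "[" + ", ".join(repr(float(x)) for x in query_embedding) + "]"
    return (
        "SELECT _file, caption, "
        f"cosineDistance({vector_literal}, image_embedding) AS score\n"
        "FROM images\n"
        "ORDER BY score ASC\n"  # smaller distance means more similar
        f"LIMIT {int(limit)}"
    )


print(build_search_query([0.5, 0.25], limit=10))
```

Formatting the vector into the query string keeps the sketch short; an application may prefer to pass it as a query parameter through its client library.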
This query returns the _file names and captions of the top 10 images most likely to be related to your provided dog image.