Introduction
The LAION 5b dataset contains 5.85 billion image-text embeddings and associated image metadata. The embeddings were generated using the OpenAI CLIP model ViT-L/14. The
dimension of each embedding vector is 768.
This dataset can be used to model the design, sizing, and performance aspects of a large-scale,
real-world vector search application. It can be used for both text-to-image search and
image-to-image search.
Dataset details
The complete dataset is available as a mixture of npy and Parquet files at the-eye.eu.
ClickHouse has made a subset of 100 million vectors available in an S3 bucket.
The S3 bucket contains 10 Parquet files, each holding 10 million rows.
We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the documentation.
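As a rough starting point for that exercise, the raw vector payload can be estimated from the row count and the embedding dimension. The sketch below assumes each vector component is stored as a 4-byte Float32, which this page does not state, and it ignores the metadata columns, compression, and any index overhead.

```python
# Back-of-the-envelope size of the raw vector data.
# Assumption: 4-byte Float32 components; metadata columns, compression
# and index overhead are ignored.
ROWS = 100_000_000        # vectors in the ClickHouse subset
DIMENSIONS = 768          # length of each embedding vector
BYTES_PER_COMPONENT = 4   # assumed Float32


def raw_vector_bytes(rows: int, dimensions: int, bytes_per_component: int) -> int:
    """Return the uncompressed size in bytes of `rows` dense vectors."""
    return rows * dimensions * bytes_per_component


subset_bytes = raw_vector_bytes(ROWS, DIMENSIONS, BYTES_PER_COMPONENT)
# Uncompressed vector bytes for the 10 million rows held by one Parquet file.
per_10m_rows_bytes = raw_vector_bytes(10_000_000, DIMENSIONS, BYTES_PER_COMPONENT)

print(f"subset:       {subset_bytes / 1e9:.1f} GB ({subset_bytes / 2**30:.1f} GiB)")
print(f"per 10M rows: {per_10m_rows_bytes / 1e9:.1f} GB")
```

Under these assumptions the vectors of the 100 million row subset alone come to roughly 307 GB uncompressed. Changing `BYTES_PER_COMPONENT` shows how a narrower storage type would shift the estimate.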
Steps
Create table
Create the laion_5b_100m table to store the embeddings and their associated attributes:
The id column is just an incrementing integer. The additional attributes can be used in predicates to understand
vector similarity search combined with post-filtering/pre-filtering, as explained in the documentation.
Load data
To load the dataset from all Parquet files, run the following SQL statement:
Run a brute-force vector similarity search
KNN (k-Nearest Neighbours) search, or brute-force search, involves calculating the distance from each vector in the dataset to the search embedding vector and then ordering the distances to find the nearest neighbours. We can use one of the vectors from the dataset itself as the search vector. For example:
Query
Response
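The brute-force procedure described in this section can also be illustrated outside the database. The following is a minimal pure-Python sketch, not the ClickHouse implementation: it scores every vector against the search vector with cosine distance and sorts the results. The toy 3-dimensional vectors and the helper names are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_knn(vectors, query, k):
    """Return the ids of the k vectors closest to `query`.

    `vectors` maps id -> vector. Every vector is scored, so the cost
    grows linearly with the size of the dataset.
    """
    scored = [(cosine_distance(vec, query), vec_id) for vec_id, vec in vectors.items()]
    scored.sort()  # ascending distance: nearest first
    return [vec_id for _, vec_id in scored[:k]]


# Toy 3-dimensional data standing in for the 768-dimensional embeddings.
toy_vectors = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
    4: [-1.0, 0.0, 0.0],
}

# Use a vector from the dataset itself as the search vector.
nearest = brute_force_knn(toy_vectors, toy_vectors[1], k=2)
print(nearest)
```

Because the search vector is taken from the dataset, it is its own nearest neighbour at distance 0, which is a quick sanity check on the result.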
Build a vector similarity index
Run the following SQL to define and build a vector similarity index on the vector column of the laion_5b_100m table:
The index is configured through the HNSW hyperparameters M and ef_construction.
You need to select values for these parameters carefully by evaluating the index build time and the search result quality
that each candidate setting produces.
Building and saving the index could take a few hours for the full 100 million dataset, depending on the number of CPU cores available and the storage bandwidth.
Perform ANN search
Once the vector similarity index has been built, vector search queries will automatically use the index:
Query
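Because ANN search trades some accuracy for speed, one way to assess search result quality when tuning M and ef_construction is recall: the share of the brute-force neighbours that the index-backed query also returns. A minimal sketch follows; the id lists are made-up stand-ins for real query results and the function name is hypothetical.

```python
def recall_at_k(exact_ids, ann_ids, k):
    """Fraction of the true top-k neighbours that the ANN search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one id")
    found = truth.intersection(ann_ids[:k])
    return len(found) / len(truth)


# Hypothetical id lists, as if returned by the same search run once as a
# brute-force scan and once through the vector similarity index.
exact = [101, 57, 903, 12, 440]
approx = [101, 903, 57, 77, 440]

print(recall_at_k(exact, approx, k=5))
```

Here four of the five exact neighbours are recovered, so the recall is 0.8. The order of the ids does not matter for this metric, only membership in the top k.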
Generate embeddings for search query
The LAION 5b dataset embedding vectors were generated using the OpenAI CLIP model ViT-L/14.
An example Python script is provided below to demonstrate how to programmatically generate
embedding vectors using the CLIP APIs. The search embedding vector
is then passed as an argument to the cosineDistance() function in the SELECT query.
To install the clip package, please refer to the OpenAI GitHub repository.
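The flow described above could look roughly like the following sketch. It assumes the clip, torch, and clickhouse_connect packages are installed and that a ClickHouse server is reachable with default connection settings. The function names are illustrative, and only the id column is selected because it is the only attribute named on this page.

```python
def embed_text(prompt):
    """Return the CLIP ViT-L/14 text embedding of `prompt` as a list of floats."""
    # Heavy dependencies are imported lazily; both packages must be installed.
    import clip
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _preprocess = clip.load("ViT-L/14", device=device)
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        features = model.encode_text(tokens)
    return features[0].float().cpu().tolist()


def build_search_query(embedding, limit=20):
    """Build a SELECT that orders rows by cosine distance to `embedding`."""
    vector_literal = "[" + ",".join(repr(float(x)) for x in embedding) + "]"
    return (
        "SELECT id FROM laion_5b_100m "
        f"ORDER BY cosineDistance(vector, {vector_literal}) ASC "
        f"LIMIT {int(limit)}"
    )


def search(prompt, limit=20):
    """Embed `prompt` and return the ids of the nearest rows."""
    # Assumption: default connection settings; pass host and credentials
    # to get_client() for a non-local server.
    import clickhouse_connect

    client = clickhouse_connect.get_client()
    query = build_search_query(embed_text(prompt), limit)
    return [row[0] for row in client.query(query).result_rows]
```

Calling `search("a dog and a cat")` would then return the ids of the rows whose embeddings are closest to the prompt. Loading the model once and reusing it across calls would be preferable in anything beyond a quick experiment.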