Data preparation
The embeddings and the metadata are stored in separate files in the raw data. A data preparation step downloads the data, merges the files, converts them to CSV, and imports them into ClickHouse. You can use the following download.sh script for that:
process.py is defined as follows:
seq 0 9 | ....
(The Python script above is very slow (~2-10 minutes per file), takes a lot of memory (41 GB per file), and produces large CSV files (10 GB each), so be careful. If you have enough RAM, increase the -P1 number for more parallelism. If this is still too slow, consider a better ingestion procedure, for example converting the .npy files to Parquet and then doing all other processing with ClickHouse.)
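The merge-and-convert step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the process.py script itself: the field names url and caption, and the helper names to_array_literal and merge_to_csv, are hypothetical, and the sketch uses only the standard library, whereas a real script would read the .npy and metadata files directly. It pairs each metadata row with its embedding and writes the vector as a bracketed array literal in a single CSV field.

```python
import csv
import io


def to_array_literal(vector):
    """Render a sequence of numbers as a bracketed array literal, e.g. [0.5,-0.25]."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def merge_to_csv(metadata_rows, embeddings, out):
    """Write one CSV row per (metadata, embedding) pair to the file-like object `out`.

    `metadata_rows` is a list of dicts with the hypothetical keys "url" and
    "caption"; `embeddings` is a list of equal length holding one vector per row.
    """
    if len(metadata_rows) != len(embeddings):
        raise ValueError("metadata and embeddings must have the same number of rows")
    # The csv module quotes the array field automatically because it contains commas.
    writer = csv.writer(out, lineterminator="\n")
    for meta, emb in zip(metadata_rows, embeddings):
        writer.writerow([meta["url"], meta["caption"], to_array_literal(emb)])


# Tiny in-memory example with a 2-element vector instead of 512 elements.
buf = io.StringIO()
merge_to_csv(
    [{"url": "http://example.com/a.jpg", "caption": "a cat"}],
    [[0.5, -0.25]],
    buf,
)
csv_text = buf.getvalue()
```

Streaming rows to the writer one at a time keeps memory use proportional to a single row, which matters given the file sizes mentioned above.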
Create table
To create a table initially without indexes, run:
The id column is just for illustration and is populated by the script with non-unique values.
Run a brute-force vector similarity search
To run a brute-force vector search, run:
Here, target is a client parameter holding an array of 512 elements.
A convenient way to obtain such arrays will be presented at the end of the article.
For now, we can use the embedding of a random LEGO set picture as target.
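To make clear what a brute-force search computes, here is a minimal Python sketch. It assumes the distance metric is cosine distance (one minus cosine similarity), matching the cosineDistance function used later in this article. The names cosine_distance and brute_force_search and the toy 2-element vectors are illustrative only. Every stored vector is compared against the target, which is why this approach is exact but scales linearly with the table size.

```python
import heapq
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(rows, target, k):
    """Scan all (row_id, vector) pairs and return the k closest as (distance, row_id)."""
    scored = ((cosine_distance(vec, target), row_id) for row_id, vec in rows)
    return heapq.nsmallest(k, scored)


# Toy data: three 2-element vectors instead of 512-element embeddings.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
nearest = brute_force_search(rows, [1.0, 0.0], k=2)
```

In this toy data set, row "a" points in the same direction as the target and is therefore ranked first, followed by row "c".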
Result
Run an approximate vector similarity search with a vector similarity index
Let’s now define two vector similarity indexes on the table.
Creating embeddings with UDFs
One usually wants to create embeddings for new images or new image captions and search for similar image / image caption pairs in the data. We can use a UDF to create the target vector without leaving the client. It is important to use the same model to create both the stored data and the new embeddings used for searches. The following scripts utilize the ViT-B/32 model which also underlies the dataset.
Text embeddings
First, store the following Python script in the user_scripts/ directory of your ClickHouse data path and make it executable (chmod +x encode_text.py).
encode_text.py:
Then create encode_text_function.xml in a location referenced by <user_defined_executable_functions_config>/path/to/*_function.xml</user_defined_executable_functions_config> in your ClickHouse server configuration file.
The output of encode_text() can then be copied into SET param_target=... to write queries easily. Alternatively, the encode_text() function can be used directly as an argument to the cosineDistance function:
Note that the encode_text() UDF itself could require a few seconds to compute and emit the embedding vector.
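The general contract an executable UDF script follows can be sketched as below. This is not the encode_text.py script: placeholder_encode is a hypothetical stand-in that does not load any model, and the sketch assumes the function is configured so that ClickHouse sends one argument per line on standard input and reads one result per line from standard output. The loop reads a line, encodes it, prints the vector as an array literal, and flushes so the server is not left waiting on buffered output.

```python
import io


def placeholder_encode(text):
    """Hypothetical stand-in for a real embedding model (NOT ViT-B/32).

    Returns a 2-element vector: the text length and its number of spaces.
    """
    return [float(len(text)), float(text.count(" "))]


def serve(stdin, stdout, encode):
    """Read one input text per line and write one array literal per line."""
    for line in stdin:
        text = line.rstrip("\n")
        vector = encode(text)
        stdout.write("[" + ",".join(repr(v) for v in vector) + "]\n")
        # Flush after every row so each result is available immediately.
        stdout.flush()


# In-memory demonstration; a real script would pass sys.stdin and sys.stdout.
fake_input = io.StringIO("cat\na dog\n")
fake_output = io.StringIO()
serve(fake_input, fake_output, placeholder_encode)
output_lines = fake_output.getvalue().splitlines()
```

Loading the model once before entering the loop, rather than once per row, is what would keep repeated calls fast in a real script.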
Image embeddings
Image embeddings can be created similarly, and we provide a Python script that can generate an embedding of an image stored locally as a file.
encode_image.py:
encode_image_function.xml