Introduction
The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated using SentenceTransformers model all-MiniLM-L6-v2. The dimension of each embedding vector is384.
This dataset can be used to walk through the design, sizing and performance aspects for a large scale,
real world vector search application built on top of user generated, textual data.
Dataset details
The complete dataset with vector embeddings is made available by ClickHouse as a singleParquet file in a S3 bucket
We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the documentation.
Steps
Create table
Create thehackernews table to store the postings & their embeddings and associated attributes:id is just an incrementing integer. The additional attributes can be used in predicates to understand
vector similarity search combined with post-filtering/pre-filtering as explained in the documentationLoad data
To load the dataset from theParquet file, run the following SQL statement:Build a vector similarity index
Run the following SQL to define and build a vector similarity index on thevector column of the hackernews table:M and ef_construction.
You need to carefully select optimal values for these parameters by evaluating index build time and search results quality
corresponding to selected values.Building and saving the index could even take a few minutes/hour for the full 28.74 million dataset, depending on the number of CPU cores available and the storage bandwidth.Perform ANN search
Once the vector similarity index has been built, vector search queries will automatically use the index:Query
Generate embeddings for search query
Sentence Transformers provide local, easy to use embedding models for capturing the semantic meaning of sentences and paragraphs.The dataset in this HackerNews dataset contains vector emebeddings generated from the all-MiniLM-L6-v2 model.An example Python script is provided below to demonstrate how to programmatically generate embedding vectors usingsentence_transformers1 Python package. The search embedding vector is then passed as an argument to the [cosineDistance()](/core/reference/functions/regular-functions/distance-functions#cosineDistance) function in the SELECT` query.Summarization demo application
The example above demonstrated semantic search and document retrieval using ClickHouse.A very simple but high potential generative AI example application is presented next.The application performs the following steps:- Accepts a topic as input from the user
- Generates an embedding vector for the topic by using the
SentenceTransformerswith modelall-MiniLM-L6-v2 - Retrieves highly relevant posts/comments using vector similarity search on the
hackernewstable - Uses
LangChainand OpenAIgpt-3.5-turboChat API to summarize the content retrieved in step #3. The posts/comments retrieved in step #3 are passed as context to the Chat API and are the key link in Generative AI.
OPENAI_API_KEY. The OpenAI API key can be obtained after registering at https://platform.openai.com.This application demonstrates a Generative AI use-case that is applicable to multiple enterprise domains like :
customer sentiment analysis, technical support automation, mining user conversations, legal documents, medical records,
meeting transcripts, financial statements, etc