In this tutorial, you’ll insert 28 million rows of Hacker News data into a ClickHouse table from both CSV and Parquet formats and run some simple queries to explore the data.
CSV
Sample the data
clickhouse-local allows you to perform fast processing on local files without
having to deploy and configure the ClickHouse server.Before storing any data in ClickHouse, let’s sample the file using clickhouse-local.
From the console run:Query
Response
file operator allows you to read the file from a local disk, specifying only the format CSVWithNames.
Most importantly, the schema is automatically inferred for you from the file contents.
Note also how clickhouse-local is able to read the compressed file, inferring the gzip format from the extension.
The Vertical format is used to more easily see the data for each column.Load the data with schema inference
The simplest and most powerful tool for data loading is theclickhouse-client: a feature-rich native command-line client.
To load data, you can again exploit schema inference, relying on ClickHouse to determine the types of the columns.Run the following command to create a table and insert the data directly from the remote CSV file, accessing the contents via the url function.
The schema is automatically inferred:DESCRIBE TABLE command allows us to understand these assigned types.Query
Response
INSERT INTO, SELECT command.
Together with the url function, data will be streamed directly from the URL:Explore the data
Sample the Hacker News stories and specific columns by running the following query:Query
Response
Define a schema
An obvious immediate optimization is to define a type for each field. In addition to declaring the time field as aDateTime type, we define an appropriate type for each of the fields below after dropping our existing dataset.
In ClickHouse the primary key id for the data is defined via the ORDER BY clause.Selecting appropriate types and choosing which columns to include in the ORDER BY
clause will help to improve query speed and compression.Run the query below to drop the old schema and create the improved schema:Query
clickhouse-client, insert the file using the INFILE clause with an explicit INSERT INTO.Query
Run sample queries
Some sample queries are presented below to give you inspiration for writing your own queries.How pervasive a topic is “ClickHouse” in Hacker News?
The score field provides a metric of popularity for stories, while theid field and ||
concatenation operator can be used to produce a link to the original post.Query
Response
time field
as a DateTime is shown, as using a proper data type allows you to use the toYYYYMM() function:Query
Response
Who are the top commenters on ClickHouse related articles?
Query
Response
Which comments generate the most interest?
Query
Response
Parquet
One of the strengths of ClickHouse is its ability to handle any number of formats. CSV represents a rather ideal use case, and isn’t the most efficient for data exchange. Next, you’ll load the data from a Parquet file which is an efficient column-oriented format. Parquet has minimal types, which ClickHouse needs to respect, and this type information is encoded in the format itself. Type inference on a Parquet file will invariably lead to a slightly different schema than the one for the CSV file.Insert the data
Run the following query to read the same data in Parquet format, again using the url function to read the remote data:Null keys with ParquetAs a condition of the Parquet format, we have to accept that keys might be
NULL,
even though they aren’t in the data.Query
Add a skipping-index to speed up queries
To find out how many comments mention “ClickHouse”, run the following query:Query
Response
system.data_skipping_indices).Run the query again once the index has been created:Query
Response
EXPLAIN clause can be used to understand why the addition of this index
improved the query around 3.4x.Response
Query
Response
Query