Dataset
This dataset by Foursquare is available to download and to use for free under the Apache 2.0 license. It contains over 100 million records of commercial points-of-interest (POI), such as shops, restaurants, parks, playgrounds, and monuments. It also includes additional metadata about those places, such as categories and social media information.Data exploration
For exploring the data we’ll useclickhouse-local, a small command-line tool
that provides the full ClickHouse engine, although you could also use
ClickHouse Cloud, clickhouse-client or even chDB.
Run the following query to select the data from the s3 bucket where the data is stored:
Query
Response
ᴺᵁᴸᴸ, so we can add some additional conditions
to our query to get back more usable data:
Query
DESCRIBE:
Query
Response
Loading the data into ClickHouse
If you’d like to persist the data on disk, you can useclickhouse-server
or ClickHouse Cloud.
To create the table, run the following command:
Query
LowCardinality
data type for several columns which changes the internal representation of the data
types to be dictionary-encoded. Operating with dictionary encoded data significantly
increases the performance of SELECT queries for many applications.
Additionally, two UInt32 MATERIALIZED columns, mercator_x and mercator_y are created
that map the lat/lon coordinates to the Web Mercator projection
for easier segmentation of the map into tiles:
longitude + 180shifts the longitude range from [-180, 180] to [0, 360]- Dividing by 360 normalizes this to a value between 0 and 1
- Multiplying by
0xFFFFFFFF(hex for maximum 32-bit unsigned integer) scales this normalized value to the full range of a 32-bit integer
latitude + 90shifts latitude from [-90, 90] to [0, 180]- Dividing by 360 and multiplying by pi() converts to radians for the trigonometric functions
- The
log(tan(...))part is the core of the Mercator projection formula - multiplying by
0xFFFFFFFFscales to the full 32-bit integer range
MATERIALIZED makes sure that ClickHouse calculates the values for these
columns when we INSERT the data, without having to specify these columns (which aren’t
part of the original data schema) in the `INSERT statement.
The table is ordered by mortonEncode(mercator_x, mercator_y) which produces a
Z-order space-filling curve of mercator_x, mercator_y in order to significantly
improve geospatial query performance. This Z-order curve ordering ensures data is
physically organized by spatial proximity:
minmax indices are also created for faster search: