- Creating a dbt project and setting up the ClickHouse adapter.
- Defining a model.
- Updating a model.
- Creating an incremental model.
- Creating a snapshot model.
- Using materialized views.
Setup
Follow the instructions in the Setup of dbt and the ClickHouse adapter section to prepare your environment. Important: The following is tested under python 3.9.Prepare ClickHouse
dbt excels when modeling highly relational data. For the purposes of example, we provide a small IMDB dataset with the following relational schema. This dataset originates from the relational dataset repository. This is trivial relative to common schemas used with dbt but represents a manageable sample: We use a subset of these tables as shown. Create the following tables:The column
created_at for the table roles, which defaults to a value of now(). We use this later to identify incremental updates to our models - see Incremental Models.s3 function to read the source data from public endpoints to insert data. Run the following commands to populate the tables:
Connecting to ClickHouse
-
Create a dbt project. In this case we name this after our
imdbsource. When prompted, selectclickhouseas the database source. -
cdinto your project folder: - At this point, you will need the text editor of your choice. In the examples below, we use the popular VS Code. Opening the IMDB directory, you should see a collection of yml and sql files:
-
Update your
dbt_project.ymlfile to specify our first model -actor_summaryand set profile toclickhouse_imdb. -
We next need to provide dbt with the connection details for our ClickHouse instance. Add the following to your
~/.dbt/profiles.yml.Note the need to modify the user and password. There are additional available settings documented here. -
From the IMDB directory, execute the
dbt debugcommand to confirm whether dbt is able to connect to ClickHouse.Confirm the response includesConnection test: [OK connection ok]indicating a successful connection.
Creating a simple view materialization
When using the view materialization, a model is rebuilt as a view on each run, via aCREATE VIEW AS statement in ClickHouse. This doesn’t require any additional storage of data but will be slower to query than table materializations.
-
From the
imdbfolder, delete the directorymodels/example: -
Create a new file in the
actorswithin themodelsfolder. Here we create files that each represent an actor model: -
Create the files
schema.ymlandactor_summary.sqlin themodels/actorsfolder.The fileschema.ymldefines our tables. These will subsequently be available for use in macros. Editmodels/actors/schema.ymlto contain this content:Theactors_summary.sqldefines our actual model. Note in the config function we also request the model be materialized as a view in ClickHouse. Our tables are referenced from theschema.ymlfile via the functionsourcee.g.source('imdb', 'movies')refers to themoviestable in theimdbdatabase. Editmodels/actors/actors_summary.sqlto contain this content:Note how we include the columnupdated_atin our final actor_summary. We use this later for incremental materializations. -
From the
imdbdirectory execute the commanddbt run. -
dbt will represent the model as a view in ClickHouse as requested. We can now query this view directly. This view will have been created in the
imdb_dbtdatabase - this is determined by the schema parameter in the file~/.dbt/profiles.ymlunder theclickhouse_imdbprofile.Querying this view, we can replicate the results of our earlier query with a simpler syntax:
Creating a table materialization
In the previous example, our model was materialized as a view. While this might offer sufficient performance for some queries, more complex SELECTs or frequently executed queries may be better materialized as a table. This materialization is useful for models that will be queried by BI tools to ensure users have a faster experience. This effectively causes the query results to be stored as a new table, with the associated storage overheads - effectively, anINSERT TO SELECT is executed. Note that this table will be reconstructed each time i.e., it isn’t incremental. Large result sets may therefore result in long execution times - see dbt Limitations.
-
Modify the file
actors_summary.sqlsuch that thematerializedparameter is set totable. Notice howORDER BYis defined, and notice we use theMergeTreetable engine: -
From the
imdbdirectory execute the commanddbt run. This execution may take a little longer to execute - around 10s on most machines. -
Confirm the creation of the table
imdb_dbt.actor_summary:You should the table with the appropriate data types: -
Confirm the results from this table are consistent with previous responses. Notice an appreciable improvement in the response time now that the model is a table:
Feel free to issue other queries against this model. For example, which actors have the highest ranking movies with more than 5 appearances?
Creating an Incremental Materialization
The previous example created a table to materialize the model. This table will be reconstructed for each dbt execution. This may be infeasible and extremely costly for larger result sets or complex transformations. To address this challenge and reduce the build time, dbt offers Incremental materializations. This allows dbt to insert or update records into a table since the last execution, making it appropriate for event-style data. Under the hood a temporary table is created with all the updated records and then all the untouched records as well as the updated records are inserted into a new target table. This results in similar limitations for large result sets as for the table model. To overcome these limitations for large sets, the adapter supports ‘inserts_only’ mode, where all the updates are inserted into the target table without creating a temporary table (more about it below). To illustrate this example, we will add the actor “Clicky McClickHouse”, who will appear in an incredible 910 movies - ensuring he has appeared in more films than even Mel Blanc.-
First, we modify our model to be of type incremental. This addition requires:
- unique_key - To ensure the adapter can uniquely identify rows, we must provide a unique_key - in this case, the
idfield from our query will suffice. This ensures we will have no row duplicates in our materialized table. For more details on uniqueness constraints, see here. - Incremental filter - We also need to tell dbt how it should identify which rows have changed on an incremental run. This is achieved by providing a delta expression. Typically this involves a timestamp for event data; hence our updated_at timestamp field. This column, which defaults to the value of now() when rows are inserted, allows new roles to be identified. Additionally, we need to identify the alternative case where new actors are added. Using the
{{this}}variable, to denote the existing materialized table, this gives us the expressionwhere id > (select max(id) from {{ this }}) or updated_at > (select max(updated_at) from {{this}}). We embed this inside the{% if is_incremental() %}condition, ensuring it is only used on incremental runs and not when the table is first constructed. For more details on filtering rows for incremental models, see this discussion in the dbt docs.
actor_summary.sqlas follows:Note that our model will only respond to updates and additions to therolesandactorstables. To respond to all tables, users would be encouraged to split this model into multiple sub-models - each with their own incremental criteria. These models can in turn be referenced and connected. For further details on cross-referencing models see here. - unique_key - To ensure the adapter can uniquely identify rows, we must provide a unique_key - in this case, the
-
Execute a
dbt runand confirm the results of the resulting table: -
We will now add data to our model to illustrate an incremental update. Add our actor “Clicky McClickHouse” to the
actorstable: -
Let’s have “Clicky” star in 910 random movies:
-
Confirm he is indeed now the actor with the most appearances by querying the underlying source table and bypassing any dbt models:
-
Execute a
dbt runand confirm our model has been updated and matches the above results:
Internals
We can identify the statements executed to achieve the above incremental update by querying ClickHouse’s query log.- The adapter creates a temporary table
actor_sumary__dbt_tmp. Rows that have changed are streamed into this table. - A new table,
actor_summary_new,is created. The rows from the old table are, in turn, streamed from the old to new, with a check to make sure row ids don’t exist in the temporary table. This effectively handles updates and duplicates. - The results from the temporary table are streamed into the new
actor_summarytable: - Finally, the new table is exchanged atomically with the old version via an
EXCHANGE TABLESstatement. The old and temporary tables are in turn dropped.
Append Strategy (inserts-only mode)
To overcome the limitations of large datasets in incremental models, the adapter uses the dbt configuration parameterincremental_strategy. This can be set to the value append. When set, updated rows are inserted directly into the target table (a.k.a imdb_dbt.actor_summary) and no temporary table is created.
Note: Append only mode requires your data to be immutable or for duplicates to be acceptable. If you want an incremental table model that supports altered rows don’t use this mode!
To illustrate this mode, we will add another new actor and re-execute dbt run with incremental_strategy='append'.
-
Configure append only mode in actor_summary.sql:
-
Let’s add another famous actor - Danny DeBito
-
Let’s star Danny in 920 random movies.
-
Execute a dbt run and confirm that Danny was added to the actor-summary table
imdb_dbt.actor_summary table and there is no table creation involved.
Delete and insert mode (experimental)
Historically ClickHouse has had only limited support for updates and deletes, in the form of asynchronous Mutations. These can be extremely IO-intensive and should generally be avoided. ClickHouse 22.8 introduced lightweight deletes and ClickHouse 25.7 introduced lightweight updates. With the introduction of these features, modifications from single update queries, even when being materialized asynchronously, will occur instantly from the user’s perspective. This mode can be configured for a model via theincremental_strategy parameter i.e.
- The adapter creates a temporary table
actor_sumary__dbt_tmp. Rows that have changed are streamed into this table. - A
DELETEis issued against the currentactor_summarytable. Rows are deleted by id fromactor_sumary__dbt_tmp - The rows from
actor_sumary__dbt_tmpare inserted intoactor_summaryusing anINSERT INTO actor_summary SELECT * FROM actor_sumary__dbt_tmp.
insert_overwrite mode (experimental)
Performs the following steps:- Create a staging (temporary) table with the same structure as the incremental model relation:
CREATE TABLE {staging} AS {target}. - Insert only new records (produced by SELECT) into the staging table.
- Replace only new partitions (present in the staging table) into the target table.
This approach has the following advantages:
- It is faster than the default strategy because it doesn’t copy the entire table.
- It is safer than other strategies because it doesn’t modify the original table until the INSERT operation completes successfully: in case of intermediate failure, the original table isn’t modified.
- It implements “partitions immutability” data engineering best practice. Which simplifies incremental and parallel data processing, rollbacks, etc.
Creating a snapshot
dbt snapshots allow a record to be made of changes to a mutable model over time. This in turn allows point-in-time queries on models, where analysts can “look back in time” at the previous state of a model. This is achieved using type-2 Slowly Changing Dimensions where from and to date columns record when a row was valid. This functionality is supported by the ClickHouse adapter and is demonstrated below. This example assumes you have completed Creating an Incremental Table Model. Make sure your actor_summary.sql doesn’t set inserts_only=True. Your models/actor_summary.sql should look like this:-
Create a file
actor_summaryin the snapshots directory. -
Update the contents of the actor_summary.sql file with the following content:
- The select query defines the results you wish to snapshot over time. The function ref is used to reference our previously created actor_summary model.
- We require a timestamp column to indicate record changes. Our updated_at column (see Creating an Incremental Table Model) can be used here. The parameter strategy indicates our use of a timestamp to denote updates, with the parameter updated_at specifying the column to use. If this isn’t present in your model you can alternatively use the check strategy. This is significantly more inefficient and requires the user to specify a list of columns to compare. dbt compares the current and historical values of these columns, recording any changes (or doing nothing if identical).
-
Run the command
dbt snapshot.
-
Sampling this data you will see how dbt has included the columns dbt_valid_from and dbt_valid_to. The latter has values set to null. Subsequent runs will update this.
-
Make our favorite actor Clicky McClickHouse appear in another 10 films.
-
Re-run the dbt run command from the
imdbdirectory. This will update the incremental model. Once this is complete, run the dbt snapshot to capture the changes. -
If we now query our snapshot, notice we have 2 rows for Clicky McClickHouse. Our previous entry now has a dbt_valid_to value. Our new value is recorded with the same value in the dbt_valid_from column, and a dbt_valid_to value of null. If we did have new rows, these would also be appended to the snapshot.
Using seeds
dbt provides the ability to load data from CSV files. This capability isn’t suited to loading large exports of a database and is more designed for small files typically used for code tables and dictionaries, e.g. mapping country codes to country names. For a simple example, we generate and then upload a list of genre codes using the seed functionality.-
We generate a list of genre codes from our existing dataset. From the dbt directory, use the
clickhouse-clientto create a fileseeds/genre_codes.csv: -
Execute the
dbt seedcommand. This will create a new tablegenre_codesin our databaseimdb_dbt(as defined by our schema configuration) with the rows from our csv file. -
Confirm these have been loaded: