When should refreshable materialized views be used?
ClickHouse incremental materialized views are enormously powerful and typically scale much better than the approach used by refreshable materialized views, especially in cases where an aggregate over a single table needs to be performed. By only computing the aggregation over each block of data as it is inserted and merging the incremental states in the final table, the query only ever executes on a subset of the data. This method scales to potentially petabytes of data and is usually the preferred method. However, there are use cases where this incremental process isn’t required or isn’t applicable. Some problems are either incompatible with an incremental approach or don’t require real-time updates, with a periodic rebuild being more appropriate. For example, you may want to regularly perform a complete re-computation of a view over the full dataset because it uses a complex join, which is incompatible with an incremental approach.
Refreshable materialized views can run batch processes performing tasks such as denormalization. Dependencies can be created between refreshable materialized views such that one view depends on the results of another and only executes once it is complete. This can replace scheduled workflows or simple DAGs such as a dbt job. To find out more about how to set dependencies between refreshable materialized views go to CREATE VIEW, Dependencies section.
How do you refresh a refreshable materialized view?
Refreshable materialized views are refreshed automatically on an interval that’s defined during creation. For example, the following materialized view is refreshed every minute:SYSTEM REFRESH VIEW clause:
When was a refreshable materialized view last refreshed?
To find out when a refreshable materialized view was last refreshed, you can query thesystem.view_refreshes system table, as shown below:
How can I change the refresh rate?
To change the refresh rate of a refreshable materialized view, use theALTER TABLE...MODIFY REFRESH syntax.
Using APPEND to add new rows
The APPEND functionality allows you to add new rows to the end of the table instead of replacing the whole view.
One use of this feature is to capture snapshots of values at a point in time. For example, let’s imagine that we have an events table populated by a stream of messages from Kafka, Redpanda, or another streaming data platform.
4096 values in the uuid column. We can write the following query to find the ones with the highest total count:
uuid every 10 seconds and store it in a new table called events_snapshot. The schema of events_snapshot would look like this:
events_snapshot to get the count over time for a specific uuid:
Examples
Lets now have a look at how to use refreshable materialized views with some example datasets.Stack Overflow
The denormalizing data guide shows various techniques for denormalizing data using a Stack Overflow dataset. We populate data into the following tables:votes, users, badges, posts, and postlinks.
In that guide, we showed how to denormalize the postlinks dataset onto the posts table with the following query:
posts_with_links table, but in a production system, we’d want to run this operation periodically.
Both the posts and postlinks table could potentially be updated. Therefore, rather than attempt to implement this join using incremental materialized views, it may be sufficient to simply schedule this query to run at a set interval, e.g., once every hour, storing the results in a post_with_links table.
This is where a refreshable materialized view helps, and we can create one with the following query:
The syntax here is identical to an incremental materialized view, except we include a
REFRESH clause:IMDb
In the dbt and ClickHouse integration guide we populated an IMDb dataset with the following tables:actors, directors, genres, movie_directors, movies, and roles.
We can then write the following query can be used to compute a summary of each actor, ordered by the most movie appearances.