This is Part 2 of a guide on migrating from PostgreSQL to ClickHouse. Using a practical example, it demonstrates how to efficiently carry out the migration with a real-time replication (CDC) approach. Many of the concepts covered are also applicable to manual bulk data transfers from PostgreSQL to ClickHouse.Most SQL queries from your PostgreSQL setup should run in ClickHouse without modification and will likely execute faster.
Deduplication using CDC
When using real-time replication with CDC, keep in mind that updates and deletes may result in duplicate rows. To manage this, you can use techniques involving Views and Refreshable Materialized Views. Refer to this guide to learn how to migrate your application from PostgreSQL to ClickHouse with minimal friction when migrating using real-time replication with CDC.Optimize queries in ClickHouse
While this is possible to migrate with minimum query rewriting, it is recommended to leverage ClickHouse features to significantly simplify queries and further improve query performance. The examples here covers common query patterns and show how to optimize them with ClickHouse. They use the full Stack Overflow dataset (up to April 2024) on equivalent resources in PostgreSQL and ClickHouse (8 cores, 32GiB RAM).For simplicity, the queries below omit the use of techniques to deduplicate the data.
Counts here will slightly differ as the Postgres data only contains rows which satisfy the referential integrity of the foreign keys. ClickHouse imposes no such constraints and thus has the full dataset e.g. inc. anon users.Users (with more than 10 questions) which receive the most views:
tags receive the most views: