Migrating from self-managed ClickHouse to ClickHouse Cloud using backup commands
Overview
There are two primary methods to migrate data from self-managed ClickHouse (OSS) to ClickHouse Cloud:- Using the
remoteSecure()function in which data is directly pulled/pushed. - Using
BACKUP/RESTOREcommands via cloud object storage
This migration guide focuses on thePrerequisitesBACKUP/RESTOREapproach and offers a practical example of migrating a database or full service in open source ClickHouse to Cloud via an S3 bucket.
- You have Docker installed
- You have an S3 bucket and IAM user
- You’re able to create a new service ClickHouse Cloud service
Cluster requiredThis backup method requires a ClickHouse cluster because tables must be converted from the
MergeTree engine to ReplicatedMergeTree.
If you’re running a single instance, follow the steps in “Migrating between self-managed ClickHouse and ClickHouse Cloud using remoteSecure” instead.OSS preparation
We’ll first spin up a ClickHouse cluster using a Docker Compose configuration from our examples repository. You can ignore spinning up the ClickHouse cluster if you already have one running.- Clone the examples repository to your local machine
- From your terminal,
cdintoexamples/docker-compose-recipes/recipes/cluster_2S_2R - Make sure Docker is running, then start the ClickHouse cluster:
From MergeTree table to ReplicatedMergeTree table
ClickHouse Cloud works withSharedMergeTree.
When restoring a backup, ClickHouse automatically converts tables with ReplicatedMergeTree to SharedMergeTree tables.
It’s likely your tables are already using the ReplicatedMergeTree engine if you’re running a cluster.
If not, you will need to convert any MergeTree tables to ReplicatedMergeTree before backing them up.
For the sake of demonstration of how to convert MergeTree tables to ReplicatedMergeTree, we will begin with a MergeTree table and convert it to ReplicatedMergeTree after wards.
We’re going to follow the first two steps of the New York taxi data guide to create a sample table and load data into it.
Those steps are included below for your convenience.
Run the following commands to create a new database and insert data from an S3 bucket into a new table:
DETACH the table.
ReplicatedMergeTree:
Distributed tables with ReplicatedMergeTree
If your setup uses distributed tables across multiple shards, you’ll need a localReplicatedMergeTree table on each node and a Distributed table as the query entry point.
Run the following command to create the local replicated table on all cluster nodes:
Distributed table on top:
Cloud preparation
You will be restoring your data into a new Cloud service. Follow the steps below to create a new Cloud service.Open Cloud Console
Go to https://console.clickhouse.cloud/Create a new service
Configure and create a service
Choose your desired region and configuration, then clickCreate serviceCreate an access role
Open SQL consoleSet up S3 access
To restore your backup from S3, you’ll need to configure secure access between ClickHouse Cloud and your S3 bucket.- Follow the steps in “Accessing S3 data securely” to create an access role and obtain the role ARN.
- Update the S3 bucket policy you created in “How to create an S3 bucket and IAM role” by adding the role ARN from the previous step.
- IAM user (
docs-s3-user): Allows your self-managed ClickHouse cluster to back up to S3 - ClickHouse Cloud role (
ClickHouseAccess-001): Allows your Cloud service to restore from S3
Taking the backup (on self-managed deployment)
Each shard must be backed up independently. Connect to a node on each shard and run the backup command with a unique destination path per shard. ReplaceBUCKET_URL, KEY_ID and SECRET_KEY with your own AWS credentials.
The guide “How to create an S3 bucket and IAM role”
shows you how to obtain these if you don’t yet have them.
Shard 1:
Single-node deploymentsIf you are not using distributed tables, you can back up the entire database with a single command:
- All user databases and tables
- User accounts and passwords
- Roles and permissions
- Settings profiles
- Row policies
- Quotas
- User-defined functions
TO S3() (for both AWS and GCP) and TO AzureBlobStorage() syntax.
For very large databases, consider using ASYNC to run the backup in the background:
Restore to ClickHouse Cloud
Restore each shard’s backup one at a time into your Cloud service. SetROLE_ARN to the
value obtained from “Accessing S3 data securely”.
Use SETTINGS allow_non_empty_tables=true on the second (and any subsequent) restore so
that shard data is appended to the already-restored tables rather than failing on a conflict:
Shard 1:
non-distributed deploymentsIf you are not using distributed tables, restore the database with a single command:
SharedMergeTree internally, the old distributed table is no longer needed. You can drop it and replace it with a view that preserves the original table name for your queries:
ReplicatedMergeTree tables will be restored as SharedMergeTree: