Usage in Yandex.Metrica and other Yandex services
ClickHouse serves multiple purposes in Yandex.Metrica. Its main task is to build reports in online mode using non-aggregated data. It uses a cluster of 374 servers, which store over 20.3 trillion rows in the database. The volume of compressed data is about 2 PB, without accounting for duplicates and replicas. The volume of uncompressed data (in TSV format) would be approximately 17 PB. ClickHouse also plays a key role in the following processes:
- Storing data for Session Replay from Yandex.Metrica.
- Processing intermediate data.
- Building global reports with Analytics.
- Running queries for debugging the Yandex.Metrica engine.
- Analyzing logs from the API and the user interface.
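
To put the figures above in perspective, the short calculation below derives the implied compression ratio and the average row size. It is a rough sketch: the inputs are the approximate numbers quoted in the text, and the choice of 1 PB = 10^15 bytes is an assumption (the source does not say whether decimal or binary petabytes are meant).

```python
# Approximate figures quoted in the text.
ROWS = 20.3e12            # "over 20.3 trillion rows"
COMPRESSED_PB = 2         # "about 2 PB", excluding duplicates and replicas
UNCOMPRESSED_PB = 17      # "approximately 17 PB" in TSV format

# Assumption: decimal petabytes. The ratio below does not depend on this choice.
BYTES_PER_PB = 10**15

compression_ratio = UNCOMPRESSED_PB / COMPRESSED_PB
compressed_bytes_per_row = COMPRESSED_PB * BYTES_PER_PB / ROWS
uncompressed_bytes_per_row = UNCOMPRESSED_PB * BYTES_PER_PB / ROWS

print(f"compression ratio:          {compression_ratio:.1f}x")
print(f"compressed bytes per row:   {compressed_bytes_per_row:.0f}")
print(f"uncompressed bytes per row: {uncompressed_bytes_per_row:.0f}")
```

Under these assumptions the data compresses about 8.5 times, from roughly 840 bytes per row as TSV to just under 100 bytes per row on disk. Because the row count is a lower bound ("over 20.3 trillion"), the per-row figures are upper estimates.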
Aggregated and non-aggregated data
There is a widespread opinion that to calculate statistics effectively, you must aggregate data since this reduces the volume of data. However, data aggregation comes with a lot of limitations:
- You must have a pre-defined list of required reports.
- The user can’t make custom reports.
- When aggregating over a large number of distinct keys, the data volume is barely reduced, so aggregation is useless.
- For a large number of reports, there are too many aggregation variations (combinatorial explosion).
- When aggregating keys with high cardinality (such as URLs), the volume of data isn’t reduced by much (less than twofold).
- As a result, storing many aggregation variations that each shrink the data only slightly can make the total volume of data grow instead of shrink.
- Users don’t view all the reports we generate for them. A large portion of those calculations are useless.
- The logical integrity of the data may be violated for various aggregations.
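
Several of the limitations above can be illustrated with a small simulation. The sketch below is not taken from Yandex.Metrica: the column names, the cardinalities, and the helpers `aggregate` and `all_groupings` are hypothetical, and the hit log is synthetic. It pre-aggregates the same rows by a low-cardinality key and by a high-cardinality key, then counts how many rows would have to be stored if every possible grouping of three dimensions were pre-computed.

```python
import random
from itertools import combinations


def aggregate(rows, key_columns):
    """Pre-aggregate rows: count hits per distinct combination of key_columns."""
    counts = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        counts[key] = counts.get(key, 0) + 1
    return counts


def all_groupings(dimensions):
    """Every non-empty subset of dimensions, one per possible pre-built report."""
    return [
        subset
        for size in range(1, len(dimensions) + 1)
        for subset in combinations(dimensions, size)
    ]


# Synthetic hit log (illustrative data only, seeded for repeatability).
random.seed(0)
rows = [
    {
        "country": random.choice(["RU", "US", "DE", "TR"]),
        "browser": random.choice(["chrome", "firefox", "safari"]),
        "url": f"/page/{random.randrange(80_000)}",
    }
    for _ in range(100_000)
]

by_country = aggregate(rows, ["country"])   # low-cardinality key
by_url = aggregate(rows, ["url"])           # high-cardinality key

groupings = all_groupings(["country", "browser", "url"])
total_aggregated_rows = sum(len(aggregate(rows, g)) for g in groupings)

print(f"raw rows:                    {len(rows)}")
print(f"rows after GROUP BY country: {len(by_country)}")
print(f"rows after GROUP BY url:     {len(by_url)}")
print(f"possible groupings:          {len(groupings)}")
print(f"rows across all groupings:   {total_aggregated_rows}")
```

In this setup, grouping by country collapses the log to a handful of rows, while grouping by URL shrinks it less than twofold. The number of groupings is 2^n - 1 for n dimensions (7 here, 1023 for ten dimensions), and the rows stored across all of them add up to more than the raw log, which is the case where aggregation grows the data instead of shrinking it.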