The Case of the Mysterious ClickHouse Slowdown: Cloudflare's Journey to Restore Billing Aggregations

Introduction

At Cloudflare, ClickHouse is the engine behind our billing pipeline, processing millions of queries daily to calculate usage charges. When this pipeline suddenly slowed after a routine migration, invoices threatened to become unreconcilable—and that’s a multibillion-dollar problem. The usual suspects—I/O bottlenecks, memory pressure, high row scans—came up clean. This article recounts how we uncovered a hidden bottleneck deep within ClickHouse’s internals and the three patches we wrote to fix it.

Source: blog.cloudflare.com

The Scene: A Petabyte-Scale Analytics Platform

ClickHouse, an open-source OLAP database, stores over a hundred petabytes of data across dozens of clusters at Cloudflare. To simplify data onboarding for internal teams, we built Ready-Analytics in early 2022. The concept: instead of designing custom tables, teams stream data into a single massive table. Namespaces distinguish datasets, and each record follows a standard schema (20 float fields, 20 string fields, a timestamp, and an indexID).
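As a rough illustration of the standard schema described above, here is a minimal Python sketch of one record. The class name, field names, and sample values are assumptions made for this example; treating the namespace as a field on the record is also an assumption. This is not the actual table definition.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

NUM_FLOATS = 20   # fixed number of float fields in the standard schema
NUM_STRINGS = 20  # fixed number of string fields in the standard schema


@dataclass
class Record:
    """Hypothetical model of one row in the shared table."""
    namespace: str       # distinguishes one team's dataset from another
    index_id: str        # string field chosen by the team
    timestamp: datetime
    floats: list[float] = field(default_factory=lambda: [0.0] * NUM_FLOATS)
    strings: list[str] = field(default_factory=lambda: [""] * NUM_STRINGS)

    def __post_init__(self) -> None:
        # Reject rows that do not fit the fixed-width schema.
        if len(self.floats) != NUM_FLOATS or len(self.strings) != NUM_STRINGS:
            raise ValueError("record does not match the standard schema")


rec = Record("dns_logs", "zone-42", datetime(2024, 12, 1, tzinfo=timezone.utc))
rec.floats[0] = 12.5  # a team decides what each slot means for its namespace
```

The fixed shape is what lets many teams share one table: each namespace assigns its own meaning to the generic slots.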

Sorting is key to ClickHouse performance. The indexID—a string field—is part of the primary key, allowing each namespace to optimize its data order. The primary key looks like: (namespace, indexID, timestamp). By December 2024, Ready-Analytics had grown to over 2 PiB of data, ingesting millions of rows per second. But it had one critical flaw: its retention policy.
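To see why the key order matters, the following Python sketch sorts a handful of rows by (namespace, indexID, timestamp) and finds one namespace's rows with a binary search rather than a full scan. It is only an analogy for how a sorted layout narrows the range to read; the row values and the helper name rows_for are invented for this example, and it does not model ClickHouse's actual index structures.

```python
from bisect import bisect_left, bisect_right

# Toy rows: (namespace, index_id, timestamp). All values are made up.
rows = [
    ("billing", "acct-2", 1700000300),
    ("dns", "zone-9", 1700000100),
    ("billing", "acct-1", 1700000200),
    ("billing", "acct-1", 1700000100),
    ("dns", "zone-1", 1700000400),
]
rows.sort()  # tuples compare field by field: namespace, index_id, timestamp


def rows_for(namespace, index_id):
    """Return the contiguous slice of rows matching namespace and index_id."""
    lo = bisect_left(rows, (namespace, index_id))
    hi = bisect_right(rows, (namespace, index_id, float("inf")))
    return rows[lo:hi]


matches = rows_for("billing", "acct-1")
```

Because matching rows sit next to each other in sorted order, the lookup touches only a small slice of the data.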

The Problem: One Retention Policy to Rule Them All

Cloudflare adopted ClickHouse before native Time‑to‑Live (TTL) features existed. Our homegrown retention system relied on partitioning: the Ready-Analytics table was partitioned by day, and a job dropped partitions older than 31 days. This “one‑size‑fits‑all” policy was a major limitation. Some teams needed years of data for compliance; others needed only days. Teams whose requirements did not fit the 31-day window couldn’t use Ready-Analytics and were forced into the conventional, more complex onboarding process.
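The partition-based retention job can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: partitions are represented as plain dates, and the function name partitions_to_drop is hypothetical. The real job would issue partition drops against the database rather than return a list.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # the single, table-wide retention window


def partitions_to_drop(partitions, today, retention_days=RETENTION_DAYS):
    """Return the daily partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


today = date(2024, 12, 31)
# Forty daily partitions, aged 0 to 39 days.
partitions = [today - timedelta(days=n) for n in range(40)]
expired = partitions_to_drop(partitions, today)
```

Every namespace shares the same cutoff, which is exactly the limitation described above.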

We needed a system with per‑namespace retention. That migration—introducing dynamic partitioning—triggered the slowdown.
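A minimal sketch of what per-namespace retention means in practice follows. The namespace names, retention values, and the helper is_expired are all hypothetical; the source does not describe how the real system stores or evaluates these settings.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31
# Hypothetical per-namespace overrides, in days.
RETENTION_DAYS = {"compliance_audit": 730, "debug_traces": 3}


def is_expired(namespace, day, today):
    """True if this namespace's data for `day` is past its retention window."""
    keep = RETENTION_DAYS.get(namespace, DEFAULT_RETENTION_DAYS)
    return day < today - timedelta(days=keep)


today = date(2024, 12, 31)
ten_days_ago = today - timedelta(days=10)
```

With these example settings, ten-day-old data has already expired for debug_traces but is kept for compliance_audit and for any namespace using the default.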

The Slowdown: When the Pipeline Stuttered

After migrating to per‑namespace retention, the daily aggregation jobs responsible for billing, fraud detection, and revenue reporting slowed to a crawl. Jobs that had finished in hours now ran past their deadlines. The downstream implications were severe: delayed invoices, manual reconciliation, and a risk of revenue leakage.

Our troubleshooting checklist began with the obvious:

I/O – disk throughput looked normal.
Memory – no pressure or swapping.
Rows scanned & parts read – within expected ranges.
Query profiles – no obvious slow operations.

Every metric we typically monitor was healthy. The bottleneck was invisible.

The Discovery: A Hidden Bottleneck in ClickHouse Internals

After weeks of investigation, we traced the issue to an unexpected interaction between the new per‑namespace retention logic and ClickHouse’s internal merge scheduling. In short, the retention changes introduced many small partitions across namespaces, which dramatically increased the number of merge tasks competing for resources. The merge scheduler, designed for uniform partitions, became a bottleneck—consuming CPU and threads that aggregation queries needed.
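A back-of-the-envelope count shows how quickly the number of partitions can grow. The sketch below assumes, purely for illustration, that the new scheme keys partitions by (namespace, day) instead of by day alone; the namespace count and date range are invented, and the source does not give the real partitioning expression.

```python
from datetime import date, timedelta
from itertools import product

namespaces = [f"ns{i}" for i in range(50)]  # hypothetical namespaces
days = [date(2024, 12, 1) + timedelta(days=d) for d in range(31)]
rows = list(product(namespaces, days))      # one (namespace, day) pair per row


def count_partitions(rows, partition_key):
    """Number of distinct partitions produced by a partitioning function."""
    return len({partition_key(row) for row in rows})


shared = count_partitions(rows, lambda row: row[1])     # partition by day only
per_namespace = count_partitions(rows, lambda row: row)  # by (namespace, day)
```

In this toy setup the same data lands in 31 partitions under the old scheme and 1,550 under the assumed new one, each of which needs its own background merges.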

We had to dig into ClickHouse’s source code. The culprit was a subtle contention point in the merge algorithm, not exposed by any standard monitoring. The discovery led to three targeted patches.

The Fix: Three Patches to Remove the Bottleneck

Patch 1 – Merge Priority Rework: Adjusted the algorithm to prioritize merge tasks for larger partitions, preventing small namespace partitions from flooding the queue.
Patch 2 – Thread Pool Isolation: Separated threads reserved for merge operations from those used for query execution, so background tasks wouldn’t starve aggregation queries.
Patch 3 – Dynamic Concurrency Limits: Introduced a throttle on the number of concurrent merges based on real‑time resource availability, preventing overload during peak ingestion.
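The three ideas above can be sketched together in Python. This is a toy model under stated assumptions, not the code of the actual patches: the class MergeQueue, the function merge_limit, the pool sizes, and the idle-CPU heuristic are all invented for this example.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor


class MergeQueue:
    """Toy merge queue in which larger partitions are scheduled first."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal sizes keep submission order

    def submit(self, partition, size_bytes):
        # heapq is a min-heap, so negate the size to pop the largest first.
        heapq.heappush(self._heap, (-size_bytes, self._seq, partition))
        self._seq += 1

    def next_batch(self, limit):
        batch = []
        while self._heap and len(batch) < limit:
            batch.append(heapq.heappop(self._heap)[2])
        return batch


def merge_limit(idle_cpu_fraction, max_merges=16, min_merges=1):
    """Scale the number of concurrent merges with the idle share of CPU."""
    idle = min(max(idle_cpu_fraction, 0.0), 1.0)
    return max(min_merges, int(max_merges * idle))


queue = MergeQueue()
for name, size in [("small-a", 10), ("big", 900), ("small-b", 12), ("mid", 300)]:
    queue.submit(name, size)

# With 12.5% of CPU idle, only two merges may run, and the largest go first.
batch = queue.next_batch(merge_limit(0.125))

# Separate pools so background merges cannot occupy query threads.
with ThreadPoolExecutor(max_workers=2, thread_name_prefix="merge") as merge_pool, \
        ThreadPoolExecutor(max_workers=4, thread_name_prefix="query") as query_pool:
    merge_thread = merge_pool.submit(lambda: threading.current_thread().name).result()
    query_thread = query_pool.submit(lambda: threading.current_thread().name).result()
```

The design point is that each mechanism bounds a different resource: the queue decides which merge runs next, the limit decides how many run at once, and the separate pools decide which threads they may use.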

These patches, contributed back to the open‑source community, restored the billing pipeline to its former speed—and improved overall cluster stability.

Conclusion

This investigation taught us that even mature systems like ClickHouse can hide performance problems that only surface under new workloads. The key was to keep looking after the usual metrics came back clean, and to be willing to read the internals. Today, Cloudflare’s billing pipeline runs smoothly, and per‑namespace retention is a success. The three patches remain a testament to the power of deep debugging and open‑source collaboration.
