Talk

Reading Faster Than Ceph Can Serve: How We Built S3 Sharding With No Extra Infrastructure

In Russian

Avito's analytical storage is Trino and Vertica on top of Ceph S3 running on HDDs, and at some point we hit a ceiling: we want to read faster than the largest Ceph cluster configuration available to us can deliver.

Scaling Ceph itself any further isn't practical. Instead, we took several Ceph clusters and started spreading every table across all of them at once: we shard not by bucket but by the hash of the object key, so every table is read from all clusters in parallel. The most interesting part is where the sharding logic lives: rather than building a separate proxy layer, we extended — with Lua — the HAProxy sidecars that were already running on every Trino and Vertica compute node. The result is stateless routing that adds neither a new bottleneck nor a single point of failure to the architecture.

I'll cover how we wrote and tested this extension, how we rolled it out to production step by step, and what it gave us: GET latency dropped from a minute to 1–2 seconds, and real read throughput grew from 20 to 60 GB/s. I'll also take a close look at ListObjects — a paginated operation that, with this approach, turns into a fan-out across all shards with a merge of their responses — and at our resharding process, which runs with zero read downtime and almost zero write downtime.

Speakers

Talks