A specialist in messaging systems, distributed systems and formal verification. Previously a member of the RabbitMQ core team, now at Splunk working on Apache BookKeeper, Pulsar and event streaming within Splunk.


Jack Vanlightly

Principal Software Engineer @ Splunk

Ivan Kelly

Principal Software Engineer @ Splunk

Ivan Kelly is a software engineer at Splunk, working on Messaging as a Service. Ivan has been active in Apache BookKeeper since its very early days as a project in Yahoo! Research Barcelona. He has been a contributor to Pulsar since 2017.

Pushing Pulsar Performance to the Limits

Wed Jun 16, 10:45 AM - 11:20 AM, PT

In this talk we show the work that has been done at Splunk to lower the cost and increase the performance of the storage layer. We run clusters that handle multiple gigabytes of incoming data per second. Reducing infrastructure cost by making Pulsar use fewer resources for the same load has had a large financial impact on operations.

Pulsar offers stronger data safety guarantees than Kafka, thanks to Apache BookKeeper’s journal (its write-ahead log). This safety comes at a price: every entry must be written twice. But what if we could run BookKeeper safely without the journal? We would halve disk writes, effectively doubling the write capacity of each bookie and reducing the amount of hardware needed for a Pulsar cluster.
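The arithmetic behind the double-write claim can be sketched in a few lines. This is a back-of-the-envelope model only: it assumes a single fixed disk write budget per bookie shared by the journal and ledger storage, and it ignores compaction, index writes and batching. The function name and the 400 MB/s figure are hypothetical, chosen purely for illustration.

```python
def ingest_capacity_mb_s(disk_write_budget_mb_s: float, writes_per_entry: int) -> float:
    """Entry ingest rate a bookie could sustain if every entry byte
    is written to disk `writes_per_entry` times (simplified model)."""
    if writes_per_entry < 1:
        raise ValueError("each entry must be written at least once")
    return disk_write_budget_mb_s / writes_per_entry


# Hypothetical per-bookie disk write budget, not a measured value.
budget_mb_s = 400.0

with_journal = ingest_capacity_mb_s(budget_mb_s, writes_per_entry=2)     # journal + ledger storage
without_journal = ingest_capacity_mb_s(budget_mb_s, writes_per_entry=1)  # ledger storage only

print(f"with journal:    {with_journal:.0f} MB/s")
print(f"without journal: {without_journal:.0f} MB/s")
print(f"capacity ratio:  {without_journal / with_journal:.1f}x")
```

Under these assumptions, dropping the second write doubles the ingest rate a bookie can absorb from the same write budget, which is where the hardware saving would come from.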

In this talk we do a technical deep dive into the following:
- Explain what the journal is and what safety guarantees it enables
- Explain the trade-off of safety against throughput due to the double-write
- But what if we’re willing to trade off a tiny bit of safety for double the throughput?
- Explain the dangers of running without the journal
- How to solve this safety problem?
- Introduce new BookKeeper features that make running without the journal safer, up to the same level of safety as Kafka.
- Explain how we used TLA+ to formally verify the correctness of the solution.

We’ll also cover another important cost-saving strategy: running with a replication factor of 2 for less critical data:
- Explain how monolithic log implementations like Kafka cannot be safely operated with a replication factor of 2
- Explain how the segment-oriented log design of BookKeeper allows Pulsar to run safely with a replication factor of 2