Why I rewrote the merge engine in Rust
This is a post about a piece of frostvex that hadn't aged well. It's also a post about how I went from "this is fine" to "this is the bottleneck for everything" in about ten profiling sessions, and what I did about it.
If you're not interested in CRDT-shaped internals, skim the diagrams and skip to the numbers at the end. The TL;DR is: the new engine is roughly 3–4× faster on the merge path, and p99 stat latency dropped from 184 ms to 14 ms.
What the merge engine does
When two peers come back into contact, frostvex has to reconcile their manifests. Each peer holds its own version of "what files exist, what their hashes are, when each was last seen." A correct sync needs to produce one merged manifest both sides agree on.
This is a classic CRDT problem, but with two practical wrinkles:
- The manifest is read by a lot of things at once — a sync in progress, a stat, a verify, mDNS announcements. Concurrent reads need to not step on each other.
- The manifest is mutable while the merge is happening. Writes have to be visible somewhere consistent, but not necessarily fully ordered with respect to reads happening at the same moment.
0.2's solution was a `RwLock<Manifest>`. The merge code took a write lock, did its thing, dropped the lock. Reads took read locks, blocked until the writer released, served their query. Simple and correct.
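A minimal sketch of the 0.2 shape (illustrative types, not frostvex's actual code):

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Illustrative stand-ins for the real manifest types.
struct Manifest {
    files: BTreeMap<String, [u8; 32]>, // path -> content hash
}

struct Pool {
    manifest: RwLock<Manifest>,
}

impl Pool {
    // Readers queue behind any merge holding the write lock.
    fn stat(&self, path: &str) -> Option<[u8; 32]> {
        self.manifest.read().unwrap().files.get(path).copied()
    }

    // One writer at a time; every reader waits until it's done.
    fn apply_merge(&self, incoming: BTreeMap<String, [u8; 32]>) {
        self.manifest.write().unwrap().files.extend(incoming);
    }
}
```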
Where it broke
It broke when frostvex started getting used on machines with multiple pools and a real number of peers. A backup server with 12 pools and 4 active peers per pool will have ~50 readers and 8 writers contending for that lock. Tokio's scheduling on a 4-core box was happy to interleave them aggressively, and we'd see runs where a single stat took 180 ms — not because the work was slow, but because it spent 175 ms waiting for write locks.
The fix wasn't "make the lock more granular", even though that's where I started. Per-directory locks helped a bit, then introduced their own deadlock class when two merges raced for adjacent subtrees. Per-file locks made the lock table itself a contention point. Classic.
What I replaced it with
The new engine has three pieces:
- An immutable layered tree — every directory is a node, every node has the BLAKE3 hash of its sorted children. The tree is logically immutable; mutations produce a new root with shared structure (path-copy CoW).
- An atomic root pointer — readers `load(Acquire)` the current root and operate on whatever they got. Writers stage a new tree, then `compare_exchange` the root pointer (sketched below).
- A small per-pool merge log — append-only WAL containing the operations that produced each root. Used for crash recovery and for resolving merge conflicts (which are now data, not exceptions).
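For the shape of it, here's a minimal sketch of the read and write paths. It uses `arc-swap` for the root pointer (the crate the engine actually uses, mentioned below) and `im::OrdMap` for children; every name and field layout here is illustrative, not frostvex's actual code:

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;
use im::OrdMap;

// Illustrative stand-in: the real node carries more metadata.
#[derive(Clone)]
struct Node {
    hash: [u8; 32],                 // BLAKE3 of content (file) or children (dir)
    children: OrdMap<String, Node>, // empty for files
}

// A directory's hash covers its sorted children; OrdMap iterates
// in key order, so the digest is deterministic.
fn dir_hash(children: &OrdMap<String, Node>) -> [u8; 32] {
    let mut h = blake3::Hasher::new();
    for (name, child) in children {
        h.update(name.as_bytes());
        h.update(&child.hash);
    }
    *h.finalize().as_bytes()
}

struct MergeEngine {
    root: ArcSwap<Node>,
}

impl MergeEngine {
    // Readers grab a snapshot of the root pointer and never block.
    // The tree they got stays valid for as long as they hold it,
    // even if a writer publishes a new root meanwhile.
    fn stat(&self, path: &[&str]) -> Option<[u8; 32]> {
        let snapshot = self.root.load();
        let mut node: &Node = &snapshot;
        for part in path {
            node = node.children.get(*part)?;
        }
        Some(node.hash)
    }

    // Writers stage an entire new tree off to the side (path-copy,
    // shared structure) and publish it with one pointer swap.
    fn publish(&self, new_root: Arc<Node>) {
        self.root.store(new_root);
    }
}
```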
If you've used the `im` crate in Rust or persistent maps in Haskell — same idea. Path-copy on update, structural sharing for everything else.
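The whole trick in miniature, using `im` directly:

```rust
use im::OrdMap;

fn main() {
    let v1: OrdMap<String, u64> =
        OrdMap::new().update("photos/2026/march".to_string(), 1);

    // `update` is copy-on-write: it path-copies the touched spine and
    // shares every other node, so v1 is untouched and still readable.
    let v2 = v1.update("photos/2026/april".to_string(), 2);

    assert_eq!(v1.len(), 1);
    assert_eq!(v2.len(), 2);
}
```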
The tricky part
I expected the tricky part to be the lock-free atomic stuff. It wasn't — `arc-swap` handles that nicely, and the operations involved are confined to a single struct field. The tricky part was reconciling concurrent writes.
Concretely: peer A pushes a manifest update for `/photos/2026/april`; peer B simultaneously pushes one for `/photos/2026/march`. Both writers stage a new root. Both call `compare_exchange` on the root pointer. Only one wins.
The naive approach is retry — the loser re-derives its update against the new root and tries again. This is correct but pathological under contention: if you have N writers, you do O(N²) work in the worst case. We measured this on a contended pool and watched a 4-writer test take 800 ms when it should have taken 60 ms.
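To make the pathology concrete, here's a sketch of that retry loop. `Tree` is a toy flat path-to-version map standing in for the real layered tree, and `derive_update` is a hypothetical helper that re-applies a writer's staged operations:

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;

// Toy stand-ins: a flat path -> version map instead of the real tree.
type Tree = im::OrdMap<String, u64>;
type Op = (String, u64);

// Hypothetical helper: apply this writer's staged ops on top of `base`.
fn derive_update(base: &Tree, ops: &[Op]) -> Arc<Tree> {
    let mut t = base.clone(); // cheap: structural sharing
    for (path, version) in ops {
        t.insert(path.clone(), *version);
    }
    Arc::new(t)
}

fn publish_retry(root: &ArcSwap<Tree>, ops: &[Op]) {
    loop {
        let cur = root.load_full();
        let proposed = derive_update(&cur, ops);
        let prev = root.compare_and_swap(&cur, proposed);
        if Arc::ptr_eq(&prev, &cur) {
            return; // our swap went through
        }
        // Lost the race: throw the staged tree away and start over
        // against whoever won. With N writers this is O(N²) work.
    }
}
```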
The fix was merge instead of retry: when a CAS fails, the loser doesn't throw its work away — it computes the merge of its proposed root and the new current root, and retries with that. CRDTs let you do this because the merge is associative and commutative; the result is the same root regardless of which writer "got there first."
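The merge variant, continuing the same toy types. The only change is what a loser does with its staged work; `merge` here is a hedged stand-in for the real CRDT join (take the higher version per path), not frostvex's actual rule:

```rust
// Hedged stand-in for the CRDT join: union the maps, keeping the
// higher version on conflict. Associative, commutative, idempotent.
fn merge(a: &Tree, b: &Tree) -> Arc<Tree> {
    Arc::new(a.clone().union_with(b.clone(), |x, y| x.max(y)))
}

fn publish_merge(root: &ArcSwap<Tree>, ops: &[Op]) {
    let mut expected = root.load_full();
    let mut proposed = derive_update(&expected, ops);
    loop {
        let prev = root.compare_and_swap(&expected, Arc::clone(&proposed));
        if Arc::ptr_eq(&prev, &expected) {
            return; // published
        }
        // Lost the race: instead of starting over, join our staged
        // tree with the winner's. Because the join is associative and
        // commutative, the final root is the same no matter which
        // writer "got there first".
        expected = Arc::clone(&prev);
        proposed = merge(&expected, &proposed);
    }
}
```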
Implementation-wise this is a few hundred lines of Rust on top of `im::OrdMap`. The scary part — proving correctness — turned out to be tractable because the operations form a join-semilattice; the joins are well-defined and equivalence is easy to test.
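Testing the laws is mechanical once the join is plain data. A hand-rolled sketch of the shape, reusing the toy `merge` above (a real suite would generate random manifests rather than three fixed ones):

```rust
#[cfg(test)]
mod semilattice_laws {
    use super::*;

    fn tree(entries: &[(&str, u64)]) -> Tree {
        entries.iter().map(|(k, v)| (k.to_string(), *v)).collect()
    }

    #[test]
    fn join_laws_hold() {
        let a = tree(&[("photos/2026/march", 1)]);
        let b = tree(&[("photos/2026/april", 2)]);
        let c = tree(&[("photos/2026/march", 3)]);

        // Commutative: order of arguments doesn't matter.
        assert_eq!(*merge(&a, &b), *merge(&b, &a));
        // Associative: grouping doesn't matter.
        assert_eq!(*merge(&merge(&a, &b), &c), *merge(&a, &merge(&b, &c)));
        // Idempotent: joining with yourself changes nothing.
        assert_eq!(*merge(&a, &a), a);
    }
}
```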
What almost worked but didn't
Two things I tried first and abandoned:
Lock-free linked list (Harris-Michael). The idea was to use a flat sorted list of (path, hash) entries with a Harris-Michael lock-free linked list. Worked on the toy test, fell apart at scale because real manifests are deeply nested — the list ended up being 100k entries long, and traversal alone became the bottleneck.
RCU-style read-side. An RCU approach (separate reader epochs, deferred reclamation) gave great read latency but the writer-side bookkeeping became its own complexity tax. For our workload — write-heavy bursts after a peer reconnects — the win wasn't worth it.
Both were technically correct. Both required reasoning about hazard pointers and reclamation lifetimes that I didn't trust myself to maintain across releases. The path-copy + atomic-root approach is boring by comparison, which is exactly what I wanted.
Numbers
On my laptop, with a 50k-file pool, 4 simulated concurrent peers, and 1 active local sync:
| scenario | 0.2.7 (RwLock) | 0.3.0 (lock-free) |
|---|---|---|
| median stat | 62 ms | 9 ms |
| p99 stat | 184 ms | 14 ms |
| median merge | 340 ms | 110 ms |
| p99 merge | 1.2 s | 290 ms |
| 4-writer contention | 820 ms | 180 ms |
The stat numbers are the most important. Almost all CLI flows touch stat; if it's fast, everything feels responsive.
What I'd do differently
Two things.
First, I would not have started with the granular-lock approach. It cost me three weeks of work that I threw away. The signal was that every mitigation introduced a new deadlock or pathological case — when that's happening repeatedly, the abstraction is wrong, not the implementation.
Second, I would have written the merge-instead-of-retry test case earlier. Once I had it as a reproducible bench, the path forward was clear within a day. Lesson for next time: write the worst-case test before you start optimizing.
If you want to read the actual code — it's around `src/merge/tree.rs` in the source tree, and it's annotated heavily because future-me will not remember any of this.
Thanks to M. and D. who put up with me bouncing CRDT-correctness questions off them at odd hours. Errors mine.