Rethinking Object Storage with Log-Structured Semantics
ZeroFS bypasses external metadata databases, bringing strict POSIX compliance and raw block devices to S3-compatible object storage.
Mounting S3 as a primary filesystem has historically been an exercise in compromise. Object storage is fundamentally at odds with POSIX filesystem semantics. S3 expects objects to be written whole rather than modified in place, suffers from high per-request latency, and historically offered only eventual consistency (Amazon S3 has provided strong read-after-write consistency since December 2020). POSIX, on the other hand, demands random writes, fine-grained locking, and immediate consistency.
Early attempts to bridge this gap, like FUSE-based mounts, translate POSIX operations directly to S3 API calls. A simple file rename becomes a costly copy-and-delete operation, and parallel writes easily corrupt data. Second-generation tools like JuiceFS improved performance by decoupling metadata from data, storing the directory tree in an external database like Redis or PostgreSQL while dumping the actual file blocks into S3. But this architecture introduces a critical point of failure. If you lose the metadata database, you lose the map to your data, rendering the raw objects in your bucket useless.
ZeroFS takes a different approach. It is a log-structured filesystem designed specifically for S3-compatible storage that keeps its metadata index directly inside the bucket as a Log-Structured Merge (LSM) tree. By running as a single userspace process, it exposes S3 as an NFS server, a 9P server, or a raw Network Block Device (NBD). This design eliminates the external database dependency while enabling strict POSIX compliance and raw block-level access.
The Architecture: LSM-Trees and Immutable Segments
To understand why ZeroFS can handle workloads that typically choke S3 mounts, you have to look at its storage engine. Instead of writing files directly as individual S3 objects, ZeroFS splits file contents into 32 KiB extents. These extents are compressed using zstd or lz4, encrypted with XChaCha20-Poly1305, and packed into immutable segment objects up to 256 MiB in size.
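The extent-and-segment layout described above can be sketched in a few lines. This is a minimal illustration, not ZeroFS's actual code: the function names and the index layout are hypothetical, zlib stands in for zstd/lz4 because it ships with Python, and the encryption step is left out entirely.

```python
import zlib

EXTENT_SIZE = 32 * 1024            # 32 KiB extents, per the article
SEGMENT_LIMIT = 256 * 1024 * 1024  # 256 MiB segment cap, per the article


def split_into_extents(data: bytes, extent_size: int = EXTENT_SIZE) -> list[bytes]:
    """Cut file contents into fixed-size extents (the last one may be short)."""
    return [data[i:i + extent_size] for i in range(0, len(data), extent_size)]


def pack_segments(
    extents: list[bytes], segment_limit: int = SEGMENT_LIMIT
) -> tuple[list[bytes], list[tuple[int, int, int]]]:
    """Compress each extent and pack the blobs into size-capped segments.

    Returns the segments plus an index of (segment_no, offset, length),
    one entry per extent. zlib is a stand-in for zstd/lz4 (assumption).
    """
    segments: list[bytes] = []
    index: list[tuple[int, int, int]] = []
    current = bytearray()
    for extent in extents:
        blob = zlib.compress(extent)
        # Seal the current segment if this blob would push it past the cap.
        if current and len(current) + len(blob) > segment_limit:
            segments.append(bytes(current))
            current = bytearray()
        index.append((len(segments), len(current), len(blob)))
        current += blob
    if current:
        segments.append(bytes(current))
    return segments, index


def read_extent(segments: list[bytes], index: list[tuple[int, int, int]], n: int) -> bytes:
    """Locate extent n through the index and decompress it."""
    seg_no, offset, length = index[n]
    return zlib.decompress(segments[seg_no][offset:offset + length])
```

The point of the index is that a random read only needs one ranged fetch from one segment, rather than a download of the whole file.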
Because the segments are immutable, ZeroFS never rewrites data in place. When a file is modified or appended to, the new data is written to a new segment, and the metadata index is updated. This log-structured design yields several immediate benefits:
- Atomic Writes: Writes are inherently atomic. A write is only committed when its metadata is flushed, eliminating partial write corruptions.
- Consistent Checkpoints: You can capture the state of the filesystem at any point in time. Because older segments are never modified, these checkpoints remain perfectly consistent and can be opened read-only.
- Read Replicas: A single master instance can handle writes while multiple read-only instances serve the same bucket, picking up changes automatically without risking split-brain states.
To reclaim space from deleted or modified files, ZeroFS relies on background compaction. When a file is deleted or truncated, the filesystem issues a TRIM discard. The compaction engine then repacks the remaining live data from fragmented segments into new ones and deletes the old, emptied segments from S3, directly reducing your storage bill.
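One way to picture the repacking step is the sketch below. It is an assumption-laden illustration rather than ZeroFS's compaction policy: the data layout, the `threshold` parameter, and the "live fraction" heuristic are all hypothetical. An extent counts as live only if the index still points at its segment; anything trimmed or overwritten is dead weight.

```python
def compact(
    segments: dict[int, dict[str, bytes]],
    index: dict[str, int],
    threshold: float = 0.5,
) -> list[int]:
    """Repack live extents out of mostly-dead segments.

    segments: segment id -> {extent key -> blob}
    index:    extent key -> segment id holding the live copy
    Returns the ids of the segments that were deleted.
    """
    victims = []
    for seg_id, extents in segments.items():
        live = [key for key in extents if index.get(key) == seg_id]
        # Segments whose live fraction falls below the threshold get repacked.
        if not extents or len(live) / len(extents) < threshold:
            victims.append(seg_id)
    if not victims:
        return []

    new_id = max(segments) + 1
    new_segment: dict[str, bytes] = {}
    for seg_id in victims:
        for key, blob in segments[seg_id].items():
            if index.get(key) == seg_id:
                new_segment[key] = blob  # copy live data forward
                index[key] = new_id      # repoint the index at the new copy
    if new_segment:
        segments[new_id] = new_segment
    for seg_id in victims:
        del segments[seg_id]             # in practice, a DELETE against the bucket
    return victims
```

Note that every repacked extent costs a read and a write against the bucket before the delete pays off, which is where the background API and bandwidth cost comes from.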
Protocols: NFS, 9P, and the Power of NBD
ZeroFS does not force you to install a proprietary client on every machine. Instead, it speaks standard network protocols that modern operating systems already understand.
        +---------------------------------+
        |          ZeroFS Engine          |
        |  (LSM-Tree Metadata in Bucket)  |
        +---------------------------------+
                /        |        \
            /            |            \
        v                v                v
     [ NFS ]          [ 9P ]           [ NBD ]
        |                |                |
  Any Major OS      FUSE Client   Raw Block Device
 (Native Mount)      (No Root)    (ZFS, ext4, VMs)
NFS allows any major operating system (macOS, Linux, Windows, and BSDs) to mount the bucket natively with no extra client software. For environments requiring stricter POSIX compliance, the 9P protocol ensures that fsync operations return only after data has reached stable storage. ZeroFS also bundles a custom FUSE client that speaks 9P to the server. It lets non-root users mount the filesystem, and it reconnects automatically when the connection drops.
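From the application side, that fsync guarantee only matters if the application actually calls fsync and waits for it. The sketch below shows the usual pattern with Python's standard library; the helper name is hypothetical, and the path you pass would be wherever the ZeroFS mount happens to live on your machine.

```python
import os


def durable_append(path: str, payload: bytes) -> None:
    """Append payload and return only after fsync has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(payload)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Block until the filesystem reports the data as stable.
        os.fsync(fd)
    finally:
        os.close(fd)
```

If `os.fsync` raises, the caller should treat the append as not durable rather than assume it landed.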
However, the most compelling protocol option is NBD. By exposing the S3 bucket as a raw block device, ZeroFS allows you to run native filesystems like ext4 or even OpenZFS directly on top of object storage.
This makes geo-distributed storage pools surprisingly simple. For example, you can create a ZFS mirror that spans three different S3 regions, treating each regional ZeroFS instance as a local disk:
# Attach block devices from different regions
nbd-client 10.0.1.5 10809 /dev/nbd0 -N storage -persist # us-east
nbd-client 10.0.2.5 10809 /dev/nbd1 -N storage -persist # eu-west
nbd-client 10.0.3.5 10809 /dev/nbd2 -N storage -persist # ap-southeast
# Create a mirrored ZFS pool across continents
zpool create global-pool mirror /dev/nbd0 /dev/nbd1 /dev/nbd2
If one region goes offline, the ZFS pool degrades but remains online, serving data from the remaining two regions. When the offline region returns, a standard ZFS scrub repairs the pool.
Performance and the Developer Trade-off
ZeroFS claims impressive performance numbers in its benchmarks. With a warm local cache, random reads (such as running SQLite queries) execute in as little as 1.6 microseconds. Small write latency, such as file appends over NFS, averages 0.83 milliseconds.
But developers must look closely at how these numbers are achieved. A raw S3 round-trip takes anywhere from 50 to 300 milliseconds. To achieve microsecond-level reads, ZeroFS relies heavily on its local memory and disk caches.
This introduces a clear operational trade-off: your performance is only as good as your cache. If your workload exhibits poor data locality or exceeds the size of your local disk cache, you will frequently hit the cold-read wall, dropping performance back down to raw S3 latencies. Furthermore, while the log-structured engine makes writes fast and atomic, it shifts the performance cost to the background. Compaction processes consume S3 API calls and network bandwidth, which can lead to unexpected costs and performance throttling if your write volume is highly volatile.
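A little arithmetic shows how sharply the cache hit rate dominates. The function below is a back-of-the-envelope model, not a benchmark: it takes the 1.6 microsecond warm-read figure quoted above and assumes a 100 millisecond miss penalty, a value picked from inside the 50 to 300 millisecond S3 range.

```python
def effective_read_latency_us(
    hit_rate: float,
    hit_us: float = 1.6,          # warm-cache read, from the quoted benchmark
    miss_us: float = 100_000.0,   # assumed 100 ms S3 round-trip
) -> float:
    """Expected read latency in microseconds for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_us + (1.0 - hit_rate) * miss_us


for rate in (1.0, 0.999, 0.99, 0.9):
    print(f"hit rate {rate:.3f}: {effective_read_latency_us(rate):,.1f} us")
```

Under these assumptions, a 99% hit rate already averages roughly one millisecond per read, hundreds of times slower than the warm-cache figure, because the rare misses swamp everything else.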
Production Readiness and Testing Rigor
Many developer-tool startups launch with shaky reliability claims, but ZeroFS has built an unusually rigorous testing pipeline to prove its stability. On every code change, its public CI runs:
- pjdfstest: Over 8,600 tests validating POSIX permissions, ownership, symlinks, and rename behaviors across NFS, 9P, and FUSE.
- xfstests: The standard Linux kernel filesystem test suite used to validate production filesystems like ext4 and XFS.
- ZFS Scrubs: Building a ZFS pool on ZeroFS block devices, extracting the Linux kernel source tree, and running a full scrub to verify zero checksum errors.
- Jepsen local-fs & failover: Model-based checking that injects crashes mid-run to verify that recovered states match the last acknowledged fsync, and leader/standby failover tests to ensure no acknowledged writes are lost during a network partition.
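The crash-recovery invariant in the last item can be stated in a few lines of code. What follows is a toy sketch of the idea only: it is not Jepsen and not ZeroFS's test harness, and every class and function name is hypothetical. An in-memory model stands in for the filesystem, and a deliberately broken variant shows what a failing run looks like.

```python
class ModelFile:
    """In-memory stand-in for a file with buffered writes."""

    def __init__(self) -> None:
        self.durable = b""  # contents that survive a crash
        self.pending = b""  # written but not yet fsynced

    def write(self, data: bytes) -> None:
        self.pending += data

    def fsync(self) -> bytes:
        self.durable += self.pending
        self.pending = b""
        return self.durable  # the state that was just acknowledged

    def crash(self) -> None:
        self.pending = b""  # unsynced data is lost, by design


class LossyFile(ModelFile):
    """Buggy variant: a crash also eats the last acknowledged byte."""

    def crash(self) -> None:
        super().crash()
        self.durable = self.durable[:-1]


def recovered_state_matches_ack(ops: list[str], make_file=ModelFile) -> bool:
    """Replay ops; after each crash, compare recovery to the last acked fsync."""
    f = make_file()
    acked = b""
    for i, op in enumerate(ops):
        if op == "write":
            f.write(bytes([i % 256]))
        elif op == "fsync":
            acked = f.fsync()
        elif op == "crash":
            f.crash()
            if f.durable != acked:
                return False  # an acknowledged write went missing
        else:
            raise ValueError(f"unknown op: {op}")
    return True
```

A real harness drives an actual filesystem and kills the process instead of calling a method, but the check after recovery is the same comparison.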
This level of validation is rare for early-stage storage projects and suggests that the engine is built on a solid, predictable foundation.
For developers looking to consolidate their storage footprint, ZeroFS presents a viable path toward using S3 as a primary storage layer. It eliminates the operational headache of managing a separate metadata database, offers native block-device integration, and backs its claims with a highly disciplined testing suite. Just be prepared to allocate sufficient local SSD space for caching if you expect to match the performance of native local storage.
Ji-ho covers the increasingly tangled overlap between cloud architecture and security, drawing on a background as a penetration tester to keep his reporting grounded in real-world attack paths. He never lets a vendor claim go unquestioned and insists that every buzzword come with a proof of concept.