
As always, the YMMV of caching comes down to access patterns, but the most consistently cacheable pattern for me has been ext4 journals.

They are tiny and often hit with a huge number of IOPS.

Ext4 supports external journals, and moving them onto a single SSD in front of a large number of otherwise slow SMR disks has worked great for me in the past.
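
If anyone wants to try it, the setup is roughly this (device names below are just placeholders, not my actual layout): carve out a dedicated journal partition on the SSD for each data disk and point that disk's filesystem at it.

    # journal device on an SSD partition; block size must match the data fs
    mke2fs -O journal_dev -b 4096 /dev/sdX1
    # new ext4 on the slow SMR disk, using the external journal
    mkfs.ext4 -b 4096 -J device=/dev/sdX1 /dev/sdY
    # or retrofit an existing (unmounted, clean) filesystem
    tune2fs -O ^has_journal /dev/sdY
    tune2fs -J device=/dev/sdX1 /dev/sdY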

However, that SSD becomes a single root cause of data loss across several disks when it fails (unlike a read cache).

Where I was working that didn't matter much: I was mostly dealing with HDFS, which prefers a JBOD layout of several disks over RAID (no battery-backed write caches), tolerates a single node failing completely, and generates a ton more metadata operations because it writes a single large file as many fixed-size files named blk_<something>, spread across a lot of directories containing thousands of files.

SSDs were expensive then, but they've had a decade of getting cheaper since.



The same goes for ZFS; there's provisioning to make a "ZIL" device - the ZFS Intent Log, basically the journal. ZFS is a little nicer in that this journal is explicitly disposable - if you lose your ZIL device, you lose any writes since its horizon, but you don't lose the whole array.
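
Adding one is a one-liner, roughly like this (pool and device names are placeholders):

    # dedicated log (SLOG) vdev; mirroring it is cheap insurance
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
    # or a single device, accepting that in-flight sync writes die with it
    zpool add tank log /dev/nvme0n1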

The next step up is building a "metadata" device, which stores the filesystem metadata but not data. This is dangerous in the way the ext4 journal is; lose the metadata, and you lose everything.
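
Which is why a special vdev should basically always be mirrored; roughly (placeholder names again):

    # metadata ("special") vdev - losing it loses the pool, so mirror it
    zpool add tank special mirror /dev/sdX /dev/sdY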

Both are massive speedups. When doing big writes, a bunch of spinning rust can't achieve full throughput without an SSD ZIL. My 8+2 array can write nearly two gigabits, but it's abysmal (roughly the speed of a single drive) without a ZIL.

Likewise, a metadata device can make the whole filesystem feel as snappy as an SSD, but it's unnecessary if you have enough cache space; ZFS prefers to keep metadata cached, so if your metadata fits on your cache SSD, most of it will stay loaded.
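
The disposable alternative is a cache (L2ARC) device, roughly like this (placeholder names), and you can even tell a dataset to keep only metadata on it:

    # disposable L2ARC on an SSD; losing it only costs speed
    zpool add tank cache /dev/sdZ
    # optionally keep only this dataset's metadata in the L2ARC
    zfs set secondarycache=metadata tank/bulk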


I just want to mention that the ZIL only speeds up sync writes: the syscall returns once the data has been written to the ZIL, even though the write to the slower main storage may still be in progress.

The ZIL is also basically write-only storage, so an SSD without very significant over-provisioning will die quickly (you only read from the ZIL after an unclean shutdown).

If you don't really care about the latest version of a file (the risk of losing recent changes is acceptable), you can set sync=disabled for that dataset and get great performance without a ZIL.
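
For reference, that's just the following (dataset name is a placeholder), and it's trivially reversible:

    # acknowledge sync writes immediately; a crash or power loss can cost
    # the last few seconds of writes
    zfs set sync=disabled tank/scratch
    # go back to honouring sync requests
    zfs set sync=standard tank/scratch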


Minor nitpick: your post is primarily talking about the SLOG, the separate intent log.

The pool always has a ZIL, but you can put it on a separate device, or devices, with a SLOG[1].

[1]: https://www.truenas.com/docs/references/zilandslog/


There's a configuration option that amounts to putting a directory (or maybe a volume) entirely into the metadata drive.

It's been a long time since I set that up, but the home storage has spinning rust plus a RAID 1 of Crucial SSDs (SATA! but ones with a capacitor, to hopefully handle writes after power loss), where the directory I care about performance for lives on the SSD sub-array. It still presents as one blob of storage. Metadata is on the SSDs too; probably no ZIL, but I could be wrong about that. It made ls a lot more reasonable.
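
If I remember right the knob was special_small_blocks: set it at or above the dataset's recordsize and all of that dataset's blocks land on the special (SSD) vdev rather than the spinning rust. Something like (dataset name is made up):

    # send every block of this dataset to the special (SSD) vdev
    zfs set recordsize=128K tank/fast
    zfs set special_small_blocks=128K tank/fast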

Thinking about it, that system must be trundling towards its expected death; it might be a decade old now.


This reminds me of hybrid drives. When the NVM failed it was a nightmare to deal with. IMHO it's a bad idea from a stability perspective to be caching off-drive to non-volatile memory.


Your last sentence does not follow from the preceding one. Hybrid drives were doomed by having truly tiny caches, which made them not particularly fast (you need a lot of flash chips in parallel to get high throughput), prone to cache thrashing, and quick to wear out their NAND flash. These days, even if you try, it's hard to build a caching system that bad. There just aren't SSDs small and slow enough to have such a crippling effect. Even using a single consumer SSD as a cache for a full shelf of hard drives wouldn't be as woefully unbalanced as the SSHDs that tried to get by with only 8GB of NAND.


> However, that SSD becomes a single root cause of data loss across several disks when it fails (unlike a read cache).

In theory you could massively reduce this risk by keeping a copy of the journal in memory, so it only corrupts if you have a disk loss and a power outage within a few seconds of each other. But I don't know whether the available tools would let you do that properly.


Twin SSDs and RAID 1.
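
E.g. with md (device names are placeholders), and then the ext4 journal lives on the mirror:

    # mirror two SSDs and use the array as the external journal device
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
    mke2fs -O journal_dev -b 4096 /dev/md0
    mkfs.ext4 -b 4096 -J device=/dev/md0 /dev/sdZ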



