
As always, the YMMV of caching comes down to access patterns, but the most consistently cacheable pattern for me has been ext4 journals.

They are tiny and often hit with a huge number of IOPS.

Ext4 supports external journals, and moving them onto a single SSD in front of a large number of otherwise slow SMR disks has worked great for me in the past.
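
If anyone wants to try it, the setup is roughly this (device names below are just placeholders, not my actual layout): carve out a dedicated journal partition on the SSD for each data disk and point that disk's filesystem at it.

    # journal device on an SSD partition; block size must match the data fs
    mke2fs -O journal_dev -b 4096 /dev/sdX1
    # new ext4 on the slow SMR disk, using the external journal
    mkfs.ext4 -b 4096 -J device=/dev/sdX1 /dev/sdY
    # or retrofit an existing (unmounted, clean) filesystem
    tune2fs -O ^has_journal /dev/sdY
    tune2fs -J device=/dev/sdX1 /dev/sdY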

However, that SSD becomes a single root cause of data loss across several disks when it fails (unlike a read cache).

Where I was working that didn't matter much: I was mostly dealing with HDFS, which prefers a JBOD layout of several disks over RAID (no battery-backed write caches), tolerates a single node failing completely, and generates a ton more metadata operations because it writes a single large file as many fixed-size files named blk_<something>, spread across a lot of directories containing thousands of files.

SSDs were expensive then, but they've had a decade of getting cheaper since.



The same goes for ZFS; there's provisioning to make a "ZIL" device - the ZFS Intent Log, basically the journal. ZFS is a little nicer in that this journal is explicitly disposable - if you lose your ZIL device, you lose any writes since its horizon, but you don't lose the whole array.
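
Adding one is a one-liner, roughly like this (pool and device names are placeholders):

    # dedicated log (SLOG) vdev; mirroring it is cheap insurance
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
    # or a single device, accepting that in-flight sync writes die with it
    zpool add tank log /dev/nvme0n1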

The next step up is building a "metadata" device, which stores the filesystem metadata but not data. This is dangerous in the way the ext4 journal is; lose the metadata, and you lose everything.
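
Which is why a special vdev should basically always be mirrored; roughly (placeholder names again):

    # metadata ("special") vdev - losing it loses the pool, so mirror it
    zpool add tank special mirror /dev/sdX /dev/sdY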

Both are massive speedups. When doing big writes, a bunch of spinning rust can't achieve full throughput without an SSD ZIL. My 8+2 array can write nearly two gigabits, but it's abysmal (roughly the speed of a single drive) without a ZIL.

Likewise, a metadata device can make the whole filesystem feel as snappy as an SSD, but it's unnecessary if you have enough cache space; ZFS prefers to keep metadata cached, so if your metadata fits on your cache SSD, most of it will stay loaded.
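
The disposable alternative is a cache (L2ARC) device, roughly like this (placeholder names), and you can even tell a dataset to keep only metadata on it:

    # disposable L2ARC on an SSD; losing it only costs speed
    zpool add tank cache /dev/sdZ
    # optionally keep only this dataset's metadata in the L2ARC
    zfs set secondarycache=metadata tank/bulk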


I just want to mention that the ZIL only speeds up sync writes: the syscall returns once the data has been written to the ZIL, even though the write to the slower main storage may still be in progress.

The ZIL is also basically write-only storage, so an SSD without very significant over-provisioning will die quickly (you only read from the ZIL after an unclean shutdown).

If you don't really care about the latest version of a file (the risk of losing recent changes is acceptable), you can set sync=disabled for that dataset and get great performance without a ZIL.
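
For reference, that's just the following (dataset name is a placeholder), and it's trivially reversible:

    # acknowledge sync writes immediately; a crash or power loss can cost
    # the last few seconds of writes
    zfs set sync=disabled tank/scratch
    # go back to honouring sync requests
    zfs set sync=standard tank/scratch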


Minor nitpick: your post is primarily talking about the SLOG, the separate intent log.

The pool always has a ZIL, but you can put it on a separate device, or devices, with a SLOG[1].

[1]: https://www.truenas.com/docs/references/zilandslog/


There's a configuration option that amounts to putting a directory (or maybe a volume) entirely into the metadata drive.

It's been a long time since I set that up, but the home storage has spinning rust plus a RAID 1 of Crucial SSDs (SATA! but ones with a capacitor, to hopefully handle writes after power loss), where the directory I care about performance for lives on the SSD sub-array. It still presents as one blob of storage. Metadata is on the SSDs too; probably no ZIL, but I could be wrong about that. It made ls a lot more reasonable.
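
If I remember right the knob was special_small_blocks: set it at or above the dataset's recordsize and all of that dataset's blocks land on the special (SSD) vdev rather than the spinning rust. Something like (dataset name is made up):

    # send every block of this dataset to the special (SSD) vdev
    zfs set recordsize=128K tank/fast
    zfs set special_small_blocks=128K tank/fast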

Thinking about it, that system must be trundling towards its expected death; it might be a decade old now.


This reminds me of hybrid drives. When the NVM failed it was a nightmare to deal with. IMHO it's a bad idea from a stability perspective to be caching off-drive to non-volatile memory.


Your last sentence does not follow from the preceding one. Hybrid drives were doomed by having truly tiny caches, which made them not particularly fast (you need a lot of flash chips in parallel to get high throughput), prone to cache thrashing, and quick to wear out their NAND flash. These days, even if you try, it's hard to build a caching system that bad. There just aren't SSDs small and slow enough to have such a crippling effect. Even using a single consumer SSD as a cache for a full shelf of hard drives wouldn't be as woefully unbalanced as the SSHDs that tried to get by with only 8GB of NAND.


> However, that SSD becomes a single root cause of data loss across several disks when it fails (unlike a read cache).

In theory you could massively reduce this risk by keeping a copy of the journal in memory, so it only corrupts if you have a disk loss and a power outage within a few seconds of each other. But I don't know whether the available tools would let you do that properly.


Twin SSDs and RAID 1.
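
E.g. with md (device names are placeholders), and then the ext4 journal lives on the mirror:

    # mirror two SSDs and use the array as the external journal device
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
    mke2fs -O journal_dev -b 4096 /dev/md0
    mkfs.ext4 -b 4096 -J device=/dev/md0 /dev/sdZ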



