>> On Chrome OS, our real-world benchmark that browses popular websites in multiple tabs demonstrates 51% less CPU usage from kswapd and 52% (full) less PSI on v5.11
In addition, direct reclaim latency is reduced by 22% at 99th percentile and the number of refaults is reduced 7%. These metrics are important to phones and laptops as they are correlated to user experience.
>> Use cases
On Android, our most advanced simulation that generates memory pressure from realistic user behavior shows 18% fewer low-memory kills, which in turn reduces cold starts by 16%.
On Borg, a similar approach enables us to identify jobs that underutilize their memory and downsize them considerably without compromising any of our service level indicators.
On Chrome OS, our field telemetry reports 96% fewer low-memory tab discards and 59% fewer OOM kills from fully utilized devices and no UX regressions from underutilized devices.
> Quadruply-segmented LRU. Four queues are maintained at levels 0 to 3. On a cache miss, the item is inserted at the head of queue 0. On a cache hit, the item is moved to the head of the next higher queue (items in queue 3 move to the head of queue 3). Each queue is allocated 1/4 of the total cache size and items are evicted from the tail of a queue to the head of the next lower queue to maintain the size invariants. Items evicted from queue 0 are evicted from the cache.
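In toy-simulator terms, that quadruply-segmented scheme works out to something like the Python sketch below (my own naming and structure, not kernel code; the sizing is just the simple 1/4 split described above):

    from collections import OrderedDict

    class SegmentedLRU:
        """Toy quadruply-segmented LRU: four queues, each 1/4 of capacity."""

        def __init__(self, capacity, levels=4):
            self.per_queue = capacity // levels
            # queues[0] is the lowest level; within an OrderedDict the most
            # recently inserted entry plays the role of the queue head.
            self.queues = [OrderedDict() for _ in range(levels)]

        def _find(self, key):
            for level, q in enumerate(self.queues):
                if key in q:
                    return level
            return None

        def access(self, key):
            level = self._find(key)
            if level is None:
                self.queues[0][key] = True                # miss: head of queue 0
            else:
                del self.queues[level][key]               # hit: promote one level
                up = min(level + 1, len(self.queues) - 1)
                self.queues[up][key] = True
            self._rebalance()
            return level is not None

        def _rebalance(self):
            # Items falling off the tail of a queue move to the head of the
            # next lower queue; overflow from queue 0 leaves the cache.
            for level in range(len(self.queues) - 1, -1, -1):
                q = self.queues[level]
                while len(q) > self.per_queue:
                    victim, _ = q.popitem(last=False)     # oldest entry = tail
                    if level > 0:
                        self.queues[level - 1][victim] = True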
> Consider, for example, an application that is reading sequentially through a file. Each page of the file will be put into the page cache as it is read, but the application will never need it again; in this case, recent access is not a sign that the page will be used again soon.
Going off on a tangent here, but I've always felt there should be an easy way to read through files without causing the kernel to cache them in memory. When I grep through the odd multi-GB file I sometimes/often don't want that to be cached. Looking through the flags for open(2) I see O_DIRECT. Wonder if it'd make sense to expose that as an option in grep, or if there's a handy library somebody made that I could preload into anything to get that effect.
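For the ad-hoc case you can get most of the effect without O_DIRECT's alignment headaches by reading normally and then asking the kernel to drop the pages with posix_fadvise(POSIX_FADV_DONTNEED). A rough Python sketch of the idea (the hint is advisory, and dirty pages won't be dropped until written back):

    import os

    def read_uncached(path, chunk_size=1 << 20):
        """Stream a file while asking the kernel to drop its page-cache pages."""
        fd = os.open(path, os.O_RDONLY)
        try:
            offset = 0
            while True:
                chunk = os.read(fd, chunk_size)
                if not chunk:
                    break
                yield chunk
                # Advise that we won't need these pages again; the kernel
                # is free to ignore this, but clean pages usually go.
                os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
                offset += len(chunk)
        finally:
            os.close(fd)

    for chunk in read_uncached("/var/log/syslog"):
        pass  # grep-like processing would go here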
nocache¹. It is LD_PRELOAD-based, so it will only function within the normal limitations of that approach. It also comes with a couple of tools for examining and modifying a file's cache state, which makes it easy to judge its value in your environment.
One of my biggest use cases for it is with soxi, which I use solely for playlist manipulation. I don't alias my source tree search tools, because invariably I'm going to want all that stuff in the cache anyway for builds and such.
Except that the fastest (and also parallel) versions of grep tend to use memory mapping. For grep the access tends to be mostly linear (except in the parallel case), but for others like sort it might not be the case that linear access to the file is always best.
That caching is pretty much free with only one exception: it might evict other cached pages.
> When I grep through the odd multi-GB file I sometimes/often don't want that to be cached.
What are you hoping to gain here? The only place I could imagine this being an issue is on an HDD-based (as opposed to SSD) server that serves a lot of files, whose cached pages you might evict with your grepping, causing a lot of random I/O activity later because the server needs to read everything from disk again to serve requests.
But as an "odd" one-off I probably just wouldn't care.
I'd like to see the kernel figure this out automatically. If the last 100,000 files opened by grep were never accessed again, is it a good bet to cache the 100,001st?
I think the solution is not to come up with clever heuristics, but to use a tiny neural net which predicts "how many hours till this page is accessed again?".
Train the net using a tiny sample of reads and writes.
Obviously the net needs to be really tiny to run on every single page read, but I believe even say a network with 50 weights would outperform today's heuristics.
In a previous comment[1] I mentioned finding a paper that made a neural net cache predictor with seemingly great success.
A key thing, however, was that their actual predictor was a simpler model, an SVM, designed using insight discovered from the behavior of the neural-network one. In particular, the source address was a strong indicator for the model.
My first intuitive thought is "gosh, please, don't". It sounds like a crazy idea to integrate a neural net into everything - it might work in 99.999% of cases, but when something like this fails, good luck debugging it, re-training the network and then verifying that your improvement helped on a tight schedule. And it's also a dangerous architectural paradigm - will we end up with syscalls whose side effects depend on whether the neural network thinks the file is going to be useful in the future?
I'm not arguing for excess simplicity, but if you can't explain (critical) behavior of your code without referring to a big matrix of NN weights, it's probably a bad idea.
As I noted in my previous comment I had a similar gut reaction. However reading the paper, for me the most interesting conclusion was this:
Thus, this paper has shown how we can use deep learning in an offline setting to derive insights that lead to an improved set of features with which to make predictions for cache replacement. More broadly, our approach in designing Glider suggests that deep learning can play an important role in systematically exploring features and feature representations that can improve the effectiveness of much simpler models, such as perceptrons and SVMs.
I'm no kernel developer so I don't know. The dataset they used for training the neural net was a recorded dataset, so training could be done offline.
The specific predictor they created was integer-based, so it could be used in a non-floating-point kernel:
We then use an SVM with the k-sparse binary feature. Since integer operations are much cheaper in hardware than floating point operations, we use an Integer SVM (ISVM) with an integer margin and learning rate of 1.
Again this highlights the interesting point of the paper, IMHO.
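For the curious, an integer-weight, perceptron-style predictor in that spirit could look roughly like the sketch below. To be clear, this is not the paper's code: the hashed features, table size, and saturation bounds are all made up for illustration; the only things taken from the quote are the all-integer arithmetic and the learning rate of 1.

    class IntReusePredictor:
        """Toy integer-weight reuse predictor (illustrative, not Glider's ISVM)."""

        def __init__(self, table_size=256, threshold=0):
            self.weights = [0] * table_size
            self.threshold = threshold

        def _features(self, pc, addr):
            # Hash the requesting PC and address into a few table indices;
            # everything stays in integer arithmetic.
            return [(pc ^ (addr >> s)) % len(self.weights) for s in (0, 6, 12)]

        def predict(self, pc, addr):
            score = sum(self.weights[i] for i in self._features(pc, addr))
            return score >= self.threshold        # True: likely to be reused soon

        def train(self, pc, addr, was_reused):
            # Perceptron-style update with a learning rate of 1, saturated
            # so the weights stay in a small integer range.
            delta = 1 if was_reused else -1
            for i in self._features(pc, addr):
                self.weights[i] = max(-31, min(31, self.weights[i] + delta))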
I feel like if you already know exactly how you're about to access the data, then it would be more efficient to be able to just tell the kernel rather than making it figure it out on its own. At the minimum, the first accesses would perform better since it wouldn't have to waste time learning.
This is why instructions like dcbz exist on POWER - to tell it to not bother reading in the next cache line because you're about to write to it.
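On the memory-management side Linux does have a limited form of this already: madvise(2) and posix_fadvise(2) access-pattern hints. A quick sketch of advertising a one-pass sequential scan over a mapped file (the hints are purely advisory and the kernel may ignore them):

    import mmap
    import os

    def count_lines_sequential(path):
        """Map a file and tell the kernel the access pattern up front."""
        fd = os.open(path, os.O_RDONLY)
        try:
            with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
                # One front-to-back pass: let the kernel read ahead
                # aggressively and recycle pages behind us.
                m.madvise(mmap.MADV_SEQUENTIAL)
                lines = 0
                while m.readline():
                    lines += 1
                return lines
        finally:
            os.close(fd)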
A tweak: it may be that application access patterns are inefficient WRT block I/O, so that some caching (e.g. read-ahead) is useful but should be aggressively expired.
There was some work done on a RWF_UNCACHED flag a while back, but I'm not sure if it went anywhere. It was supposed to use the page cache if the page is already there, but not add it (or at least not keep it around) if it isn't.
The description of the current system as being composed of two LRU queues sounds a lot like the 2Queues cache replacement algorithm. However, I was under the impression that Linux used CLOCK-Pro?
They decided against it, I believe because ClockPro was too complicated (but maybe NIH?). Instead they devised their own algorithm, DClock, which borrows ideas from it but is also different [1]. They use an ad hoc sizing rule (50%-99%) [2] which can dramatically impact its performance, neither extreme being universally better. When I reimplemented their algorithm for analysis, the hit rates skewed towards being okay but not great [3]. You can compare ClockPro [4] and DClock [5] in the simulator.
A FIFO cache of this kind can be very bad in some situations. Imagine you have a cache of 8 entries and the application reads 9 entries in a loop from beginning to end, over and over. In this case the FIFO cache algorithm will always evict the next entry just before it's needed. Evicting even a random entry is much better (see the toy simulation after this comment).
It's an incredibly difficult problem to figure out what to keep in cache, so I find that the first thing one should allow is for the user to give hints.
It would be great if there were a way for applications to designate some memory allocations as "hot" or "cold" to indicate the usage pattern, or to indicate that they're reading something linearly, or that they're about to need something.
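To put a number on the loop case from the first paragraph, a toy simulation (nothing to do with the kernel's real code) with 8 slots and a 9-entry loop shows FIFO missing on every access, while random eviction keeps a healthy hit rate:

    import random

    def simulate(policy, slots=8, items=9, rounds=10_000):
        """Cyclically access `items` keys through a cache of `slots` entries."""
        cache, order, hits = set(), [], 0
        for i in range(rounds):
            key = i % items
            if key in cache:
                hits += 1
                continue
            if len(cache) == slots:                      # full: pick a victim
                if policy == "fifo":
                    victim = order.pop(0)
                else:
                    victim = order.pop(random.randrange(len(order)))
                cache.remove(victim)
            cache.add(key)
            order.append(key)
        return hits / rounds

    print("FIFO   hit rate:", simulate("fifo"))    # 0.0 -- every access misses
    print("random hit rate:", simulate("random"))  # well above zero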
But this algorithm isn't a FIFO: it's a multi-generational LRU cache. That means LRU plus some extra details (very well explained in the post) to make it perform even better. There are some metrics in the post, and some in the comments here too.
That assumes the developer can provide correct hints, which unfortunately is rarely true. In my experience it is better to invest in an algorithm that can detect these patterns, learn from the workload, and optimize accordingly. Modern cache policies are robustly near optimal, but developers don't bother to implement them.
What I would love to see is an expiration mechanism for files, e.g. you create a file that will get cleaned up in an hour. Would simplify my life on multiple occasions.
You can with systemd (*cue boos from the audience*), although it involves a bit of additional configuration: specifying which files to clean up, and setting up a timer.
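For reference, the piece that does the aging is systemd-tmpfiles: a small tmpfiles.d drop-in like the one below (the path, ownership, and age are placeholders) declares the policy, and on most distros the stock systemd-tmpfiles-clean.timer already runs the cleanup periodically:

    # /etc/tmpfiles.d/scratch.conf
    # Create /var/tmp/scratch if missing and remove entries older than 1 hour.
    d /var/tmp/scratch 0755 myuser myuser 1h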
Wow, thank you! I didn't know it not only exists but is provided by systemd. I'm already using systemd to deploy my services, so this sounds like a perfect fit!
tmpfile isn't guaranteed by the Linux kernel. It's POSIX, and POSIX is a userspace layer with elements of it (like IPC and I/O) implemented in the kernel for efficiency reasons.
As you said, "cache handling" possibly did. That's one of the few well-defined areas where you're handling GC by definition of the task. Systems programming is about a lot more than that.