>> On Chrome OS, our real-world benchmark that browses popular websites in multiple tabs demonstrates 51% less CPU usage from kswapd and 52% (full) less PSI on v5.11
In addition, direct reclaim latency is reduced by 22% at 99th percentile and the number of refaults is reduced 7%. These metrics are important to phones and laptops as they are correlated to user experience.
>> Use cases
On Android, our most advanced simulation that generates memory pressure from realistic user behavior shows 18% fewer low-memory kills, which in turn reduces cold starts by 16%.
On Borg, a similar approach enables us to identify jobs that underutilize their memory and downsize them considerably without compromising any of our service level indicators.
On Chrome OS, our field telemetry reports 96% fewer low-memory tab discards and 59% fewer OOM kills from fully utilized devices and no UX regressions from underutilized devices.
> Quadruply-segmented LRU. Four queues are maintained at levels 0 to 3. On a cache miss, the item is inserted at the head of queue 0. On a cache hit, the item is moved to the head of the next higher queue (items in queue 3 move to the head of queue 3). Each queue is allocated 1/4 of the total cache size and items are evicted from the tail of a queue to the head of the next lower queue to maintain the size invariants. Items evicted from queue 0 are evicted from the cache.
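In toy-simulator terms, that quadruply-segmented scheme works out to something like the Python sketch below (my own naming and structure, not kernel code; the sizing is just the simple 1/4 split described above):

    from collections import OrderedDict

    class SegmentedLRU:
        """Toy quadruply-segmented LRU: four queues, each 1/4 of capacity."""

        def __init__(self, capacity, levels=4):
            self.per_queue = capacity // levels
            # queues[0] is the lowest level; within an OrderedDict the most
            # recently inserted entry plays the role of the queue head.
            self.queues = [OrderedDict() for _ in range(levels)]

        def _find(self, key):
            for level, q in enumerate(self.queues):
                if key in q:
                    return level
            return None

        def access(self, key):
            level = self._find(key)
            if level is None:
                self.queues[0][key] = True                # miss: head of queue 0
            else:
                del self.queues[level][key]               # hit: promote one level
                up = min(level + 1, len(self.queues) - 1)
                self.queues[up][key] = True
            self._rebalance()
            return level is not None

        def _rebalance(self):
            # Items falling off the tail of a queue move to the head of the
            # next lower queue; overflow from queue 0 leaves the cache.
            for level in range(len(self.queues) - 1, -1, -1):
                q = self.queues[level]
                while len(q) > self.per_queue:
                    victim, _ = q.popitem(last=False)     # oldest entry = tail
                    if level > 0:
                        self.queues[level - 1][victim] = True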
> Consider, for example, an application that is reading sequentially through a file. Each page of the file will be put into the page cache as it is read, but the application will never need it again; in this case, recent access is not a sign that the page will be used again soon.
Going off on a tangent here, but I've always felt there should be an easy way to read through files without causing the kernel to cache them in memory. When I grep through the odd multi-GB file I sometimes/often don't want that to be cached. Looking through the flags for open(2) I see O_DIRECT. Wonder if it'd make sense to expose that as an option in grep, or if there's a handy library somebody made that I could preload into anything to get that effect.
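For the ad-hoc case you can get most of the effect without O_DIRECT's alignment headaches by reading normally and then asking the kernel to drop the pages with posix_fadvise(POSIX_FADV_DONTNEED). A rough Python sketch of the idea (the hint is advisory, and dirty pages won't be dropped until written back):

    import os

    def read_uncached(path, chunk_size=1 << 20):
        """Stream a file while asking the kernel to drop its page-cache pages."""
        fd = os.open(path, os.O_RDONLY)
        try:
            offset = 0
            while True:
                chunk = os.read(fd, chunk_size)
                if not chunk:
                    break
                yield chunk
                # Advise that we won't need these pages again; the kernel
                # is free to ignore this, but clean pages usually go.
                os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
                offset += len(chunk)
        finally:
            os.close(fd)

    for chunk in read_uncached("/var/log/syslog"):
        pass  # grep-like processing would go here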
nocache¹. It is LD_PRELOAD-based, so it will only function within the normal limitations of that approach. It also comes with a couple of tools for examining and modifying a file's cache state, which makes it easy to judge its value in your environment.
One of my biggest use cases for it is with soxi, which I use solely for playlist manipulation. I don't alias my source tree search tools, because invariably I'm going to want all that stuff in the cache anyway for builds and such.
Except that the fastest (and also parallel) versions of grep tend to use memory mapping. For grep the access tends to be mostly linear (except in the parallel case), but for others like sort it might not be the case that linear access to the file is always best.
That caching is pretty much free with only one exception: it might evict other cached pages.
> When I grep through the odd multi-GB file I sometimes/often don't want that to be cached.
What are you hoping to gain here? The only place I could imagine this being an issue is on an HDD-based (as opposed to SSD) server that serves a lot of files, whose cached pages you might evict with your grepping, causing a lot of random I/O activity later because the server needs to read everything from disk again to serve requests.
But as an "odd" one-off I probably just wouldn't care.
I'd like to see the kernel figure this out automatically. If the last 100,000 files opened by grep were never accessed again, is it a good bet to cache the 100,001st?
I think the solution is not to come up with clever heuristics, but to use a tiny neural net which predicts "how many hours till this page is accessed again?".
Train the net using a tiny sample of reads and writes.
Obviously the net needs to be really tiny to run on every single page read, but I believe even say a network with 50 weights would outperform today's heuristics.
In a previous comment[1] I mentioned finding a paper that made a neural net cache predictor with seemingly great success.
A key thing, however, was that their actual predictor was a simpler model, an SVM, designed using insight discovered from the behavior of the neural-network one. In particular, the source address was a strong indicator for the model.
My first intuitive thought is "gosh, please, don't". It sounds like a crazy idea to integrate a neural net into everything - it might work in 99.999% of cases, but when something like this fails, good luck debugging it, re-training the network and then verifying that your improvement helped on a tight schedule. And it's also a dangerous architectural paradigm - will we end up with syscalls whose side effects depend on whether the neural network thinks the file is going to be useful in the future?
I'm not arguing for excess simplicity, but if you can't explain (critical) behavior of your code without referring to a big matrix of NN weights, it's probably a bad idea.
As I noted in my previous comment I had a similar gut reaction. However reading the paper, for me the most interesting conclusion was this:
Thus, this paper has shown how we can use deep learning in an offline setting to derive insights that lead to an improved set of features with which to make predictions for cache replacement. More broadly, our approach in designing Glider suggests that deep learning can play an important role in systematically exploring features and feature representations that can improve the effectiveness of much simpler models, such as perceptrons and SVMs.
I'm no kernel developer so I don't know. The dataset they used for training the neural net was a recorded dataset, so training could be done offline.
The specific predictor they created was integer-based, so it could be used in a non-floating-point kernel:
We then use an SVM with the k-sparse binary feature. Since integer operations are much cheaper in hardware than floating point operations, we use an Integer SVM (ISVM) with an integer margin and learning rate of 1.
Again this highlights the interesting point of the paper, IMHO.
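For the curious, an integer-weight, perceptron-style predictor in that spirit could look roughly like the sketch below. To be clear, this is not the paper's code: the hashed features, table size, and saturation bounds are all made up for illustration; the only things taken from the quote are the all-integer arithmetic and the learning rate of 1.

    class IntReusePredictor:
        """Toy integer-weight reuse predictor (illustrative, not Glider's ISVM)."""

        def __init__(self, table_size=256, threshold=0):
            self.weights = [0] * table_size
            self.threshold = threshold

        def _features(self, pc, addr):
            # Hash the requesting PC and address into a few table indices;
            # everything stays in integer arithmetic.
            return [(pc ^ (addr >> s)) % len(self.weights) for s in (0, 6, 12)]

        def predict(self, pc, addr):
            score = sum(self.weights[i] for i in self._features(pc, addr))
            return score >= self.threshold        # True: likely to be reused soon

        def train(self, pc, addr, was_reused):
            # Perceptron-style update with a learning rate of 1, saturated
            # so the weights stay in a small integer range.
            delta = 1 if was_reused else -1
            for i in self._features(pc, addr):
                self.weights[i] = max(-31, min(31, self.weights[i] + delta))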
I feel like if you already know exactly how you're about to access the data, then it would be more efficient to be able to just tell the kernel rather than making it figure it out on its own. At the minimum, the first accesses would perform better since it wouldn't have to waste time learning.
This is why instructions like dcbz exist on POWER - to tell it to not bother reading in the next cache line because you're about to write to it.
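On the memory-management side Linux does have a limited form of this already: madvise(2) and posix_fadvise(2) access-pattern hints. A quick sketch of advertising a one-pass sequential scan over a mapped file (the hints are purely advisory and the kernel may ignore them):

    import mmap
    import os

    def count_lines_sequential(path):
        """Map a file and tell the kernel the access pattern up front."""
        fd = os.open(path, os.O_RDONLY)
        try:
            with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
                # One front-to-back pass: let the kernel read ahead
                # aggressively and recycle pages behind us.
                m.madvise(mmap.MADV_SEQUENTIAL)
                lines = 0
                while m.readline():
                    lines += 1
                return lines
        finally:
            os.close(fd)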
A tweak: it may be that application access patterns are inefficient WRT block I/O, so that some caching (e.g. read-ahead) is useful but should be aggressively expired.
There was some work done on a RWF_UNCACHED flag a while back, but I'm not sure if it went anywhere. It was supposed to use the page cache if the page is already there, but not add it (or at least not keep it around) if it isn't.
The description of the current system as being composed of two LRU queues sounds a lot like the 2Queues cache replacement algorithm. However, I was under the impression that Linux used CLOCK-Pro?
They decided against it, I believe because ClockPro was too complicated (but maybe NIH?). Instead they devised their own algorithm, DClock, which borrows ideas from it but is also different [1]. They use an ad hoc sizing rule (50%-99%) [2] which can dramatically impact its performance, neither extreme being universally better. When I reimplemented their algorithm for analysis, the hit rates skewed towards being okay but not great [3]. You can compare ClockPro [4] and DClock [5] in the simulator.
A FIFO cache of this kind can be very bad in some situations. Imagine you have a cache of 8 entries and the application reads 9 entries in a loop from beginning to end, over and over. In this case the FIFO cache algorithm will always evict the next entry just before it's needed. Evicting even a random entry is much better (see the toy simulation after this comment).
It's an incredibly difficult problem to figure out what to keep in cache, so I find that the first thing one should allow is for the user to give hints.
It would be great if there were a way for applications to designate some memory allocations as "hot" or "cold" to indicate the usage pattern, or to indicate that they're reading something linearly, or that they're about to need something.
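To put a number on the loop case from the first paragraph, a toy simulation (nothing to do with the kernel's real code) with 8 slots and a 9-entry loop shows FIFO missing on every access, while random eviction keeps a healthy hit rate:

    import random

    def simulate(policy, slots=8, items=9, rounds=10_000):
        """Cyclically access `items` keys through a cache of `slots` entries."""
        cache, order, hits = set(), [], 0
        for i in range(rounds):
            key = i % items
            if key in cache:
                hits += 1
                continue
            if len(cache) == slots:                      # full: pick a victim
                if policy == "fifo":
                    victim = order.pop(0)
                else:
                    victim = order.pop(random.randrange(len(order)))
                cache.remove(victim)
            cache.add(key)
            order.append(key)
        return hits / rounds

    print("FIFO   hit rate:", simulate("fifo"))    # 0.0 -- every access misses
    print("random hit rate:", simulate("random"))  # well above zero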
But this algorithm isn't a FIFO: it's a multi-generational LRU cache. That means LRU plus some extra details (very well explained in the post) to make it perform even better. There are some metrics in the post, and some in the comments here too.
That assumes the developer can provide correct hints, which unfortunately is rarely true. In my experience it is better to invest in an algorithm that can detect these patterns, learn from the workload, and optimize accordingly. Modern cache policies are robustly near optimal, but developers don't bother to implement them.
What I would love to see is an expiration mechanism for files, e.g. you create a file that will get cleaned up in an hour. Would simplify my life on multiple occasions.
You can with systemd (*cue boos from the audience*), although it involves a bit of additional configuration: specifying which files to clean up, and setting up a timer.
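For reference, the piece that does the aging is systemd-tmpfiles: a small tmpfiles.d drop-in like the one below (the path, ownership, and age are placeholders) declares the policy, and on most distros the stock systemd-tmpfiles-clean.timer already runs the cleanup periodically:

    # /etc/tmpfiles.d/scratch.conf
    # Create /var/tmp/scratch if missing and remove entries older than 1 hour.
    d /var/tmp/scratch 0755 myuser myuser 1h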
Wow, thank you! I didn't know it not only exists but is provided by systemd. I'm already using systemd to deploy my services, so this sounds like a perfect fit!
tmpfile isn't guaranteed by the Linux kernel. It's POSIX, and POSIX is a userspace layer with elements of it (like IPC and I/O) implemented in the kernel for efficiency reasons.
As you said, "cache handling" possibly did. That's one of the few well-defined areas where you're handling GC by definition of the task. Systems programming is about a lot more than that.