In my experience shared memory is really hard to implement well and manage:
1. Unless you're using either fixed-size or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy) - see the sketch after this list.
2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or with other schemes involving a side channel, but it's not "easy".
3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory.
4. There are other, separate questions around how to manage the shared memory segments you are using (one big ring buffer? a segment per message?), and how to communicate between processes that different segments are in use and that new messages are available for subscribers. Doable, but also not simple.
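To make point 1 concrete, here is a minimal Rust sketch (the types `Fixed`, `Variable`, and `write_into_shared` are hypothetical, not from any particular library): a pointer-free `repr(C)` struct can be copied byte-for-byte into a mapped segment, while anything holding heap pointers has to be serialized first because those pointers mean nothing in another address space.

```rust
// `Fixed` can be placed directly into a shared-memory segment; `Variable`
// owns heap memory, so it must be serialized before it can cross processes.

#[repr(C)]
#[derive(Clone, Copy)]
struct Fixed {
    seq: u64,
    len: u32,
    payload: [u8; 1024], // inline, fixed-size payload
}

struct Variable {
    seq: u64,
    payload: Vec<u8>, // heap pointer: invalid in another address space
}

/// Copy a fixed-size, pointer-free message into a raw shared buffer.
/// Sound only because `Fixed` is `repr(C)`, `Copy`, and contains no pointers.
fn write_into_shared(buf: &mut [u8], msg: &Fixed) {
    let bytes = unsafe {
        std::slice::from_raw_parts(
            msg as *const Fixed as *const u8,
            std::mem::size_of::<Fixed>(),
        )
    };
    buf[..bytes.len()].copy_from_slice(bytes);
}
```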
It's a tough pill to swallow - you're taking on a lot of complexity in exchange for that low latency. If you can, it's better to put things in the same process space - you can use smart pointers and a queue and go just as fast, with less complexity. Anything CUDA will want to be single process anyway (ignoring CUDA IPC). The number of places where you need (a) ultra low latency, (b) high bandwidth/message size, (c) can't put everything in the same process, (d) are using data structures suited to shared memory, and finally (e) are okay with taking on a bunch of complexity just isn't that high. (It's totally possible I'm missing a Linux feature that makes things easy, though.)
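For comparison, the in-process "smart pointers and a queue" approach is just this (std only; the `Frame` type is a made-up placeholder): only the `Arc` crosses the queue, so the payload is never copied or serialized.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Stand-in for a large payload we never want to copy or serialize.
struct Frame {
    data: Vec<u8>,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Arc<Frame>>();

    let consumer = thread::spawn(move || {
        while let Ok(frame) = rx.recv() {
            // Only the Arc (a pointer plus refcount) crossed the queue.
            println!("got frame of {} bytes", frame.data.len());
        }
    });

    let frame = Arc::new(Frame { data: vec![0u8; 8 * 1024 * 1024] });
    tx.send(Arc::clone(&frame)).unwrap();
    drop(tx); // close the channel so the consumer exits
    consumer.join().unwrap();
}
```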
I plan on integrating iceoryx into a message passing framework I'm working on now (users will ask for SHM), but honestly either "shared pointers and a queue" or "TCP/UDS" are usually better fits.
> In my experience shared memory is really hard to implement well and manage:
I second that. It took us quite some time to arrive at the right architecture. After all, iceoryx2 is the third incarnation of this piece of software, with elfepiff and me working on the last two.
> 1. Unless you're using either fixed sized or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy).
Indeed, we are using fixed-size structures with a bucket allocator. We have ideas on how to enable usage with types that support custom allocators, and even with raw pointers, but that is a crazy idea which might not pan out.
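A toy illustration of the fixed-size bucket idea (not iceoryx2's actual implementation): the allocator hands out indices into a flat arena rather than pointers, so the same values stay meaningful in every process that maps the segment. In a real layout the free list itself would also live in shared memory and be lock-free.

```rust
// Toy fixed-size "bucket" allocator over a flat arena, as one might lay it
// out inside a shared-memory segment.

const BUCKET_SIZE: usize = 256;
const BUCKET_COUNT: usize = 64;

struct BucketAllocator {
    arena: Vec<u8>, // would be the mapped shared segment
    free: Vec<u32>, // stack of free bucket indices (here kept process-local)
}

impl BucketAllocator {
    fn new() -> Self {
        Self {
            arena: vec![0; BUCKET_SIZE * BUCKET_COUNT],
            free: (0..BUCKET_COUNT as u32).rev().collect(),
        }
    }

    /// Hand out a bucket as an index/offset; offsets remain valid in
    /// every process that maps the same segment.
    fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }

    fn release(&mut self, idx: u32) {
        self.free.push(idx);
    }

    fn bucket_mut(&mut self, idx: u32) -> &mut [u8] {
        let start = idx as usize * BUCKET_SIZE;
        &mut self.arena[start..start + BUCKET_SIZE]
    }
}
```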
> 2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or by other schemes with a side channel, but it's not "easy".
>
> 3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory.
Indeed, this is a complicated topic, and support from the OS would be appreciated. We found a few ways to make it feasible, though.
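One generic way to make crash cleanup feasible - an assumption for illustration, not necessarily what iceoryx2 actually does - is to encode the creator's PID into the segment name and let any peer reap segments whose owner is gone. On Linux, POSIX shared memory objects show up as files under /dev/shm, so this can be done with plain filesystem calls:

```rust
use std::fs;
use std::path::Path;

// Owner liveness check via /proc/<pid>.
fn owner_alive(pid: u32) -> bool {
    Path::new(&format!("/proc/{pid}")).exists()
}

/// Remove stale segments named "<prefix>_<pid>" whose owning process died.
fn reap_dead_segments(prefix: &str) -> std::io::Result<()> {
    for entry in fs::read_dir("/dev/shm")? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if let Some(pid_str) = name.strip_prefix(prefix).and_then(|s| s.strip_prefix('_')) {
            if let Ok(pid) = pid_str.parse::<u32>() {
                if !owner_alive(pid) {
                    fs::remove_file(entry.path())?; // unlink the stale segment
                }
            }
        }
    }
    Ok(())
}
```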
The origins of iceoryx are in automotive, where it is required to split functionality across multiple processes. When one process goes down, the system can still operate in a degraded mode or simply restart the faulty process. For this to work, one needs an efficient, low-latency solution, otherwise the CPU spends more time copying data than doing real work.
Of course there are issues like the producer mutating data after delivery, but there are also solutions for this. They will affect latency, but should still be better than using e.g. Unix domain sockets.
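One classic mitigation for the producer-mutates-after-delivery problem - again a sketch of a general technique, not a claim about what iceoryx implements - is a seqlock-style generation counter: the reader copies the payload and then checks that the sequence number did not move. A real cross-process version would need volatile or atomic accesses for the payload itself to avoid data races.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A slot as it might be laid out in the shared segment.
#[repr(C)]
struct Slot {
    seq: AtomicU64,     // odd = writer in progress, even = stable
    payload: [u8; 128],
}

/// Copy the payload out and report whether the copy is a clean snapshot,
/// i.e. the writer did not touch the slot while we were reading it.
fn read_consistent(slot: &Slot, out: &mut [u8; 128]) -> bool {
    let before = slot.seq.load(Ordering::Acquire);
    if before % 2 != 0 {
        return false; // writer is mid-update, try again later
    }
    out.copy_from_slice(&slot.payload);
    let after = slot.seq.load(Ordering::Acquire);
    before == after // unchanged sequence => the snapshot is consistent
}
```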
Fun fact: for iceoryx1 we supported only 4GB memory chunks, and some time ago someone asked whether we could lift this limitation because he wanted to transfer a 92GB large language model via shared memory.
Thanks for sharing here -- yeah, these are definitely huge issues that make shared memory hard -- the when-things-go-wrong case is quite hairy.
I wonder if it would work well as a sort of opt-in specialization? Start with TCP/UDS/STDIN/whatever, and then maybe graduate, and if anything goes wrong, report errors via the fallback?
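Something like the following shape might work for that opt-in idea - `ShmTx` and `SocketTx` are hypothetical stand-ins, not a real API: the link prefers the shared-memory path, and the first error demotes it to the boring transport while surfacing what went wrong.

```rust
use std::io;

// Hypothetical stand-ins for the two transports.
struct ShmTx;
struct SocketTx;

impl ShmTx {
    fn send(&self, _msg: &[u8]) -> io::Result<()> {
        // Simulate a failure on the fast path.
        Err(io::Error::new(io::ErrorKind::Other, "reader crashed, segment unusable"))
    }
}
impl SocketTx {
    fn send(&self, _msg: &[u8]) -> io::Result<()> {
        Ok(())
    }
}

// Try the fast shared-memory path; on any error, report it and demote the
// link to the plain socket transport for all further sends.
struct Link {
    shm: Option<ShmTx>,
    socket: SocketTx,
}

impl Link {
    fn send(&mut self, msg: &[u8]) -> io::Result<()> {
        if let Some(shm) = &self.shm {
            match shm.send(msg) {
                Ok(()) => return Ok(()),
                Err(err) => {
                    eprintln!("shm path failed ({err}); falling back to socket");
                    self.shm = None;
                }
            }
        }
        self.socket.send(msg)
    }
}
```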
I do agree it's rarely worth it (and same-machine UDS is probably good enough), but with the essentially 10x gain I'm quite surprised.
One thing I've also found that actually performed very well is ipc-channel[0]. I tried it because I wanted to see how something I might actually use would perform, and it was basically 1/10th the perf of shared memory.
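For reference, a minimal ipc-channel sketch looks roughly like this (the message type has to be serde-serializable, which also hints at where a chunk of that 10x goes: every send is serialized and pushed through an OS-level channel rather than handed over in place).

```rust
use ipc_channel::ipc;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Msg {
    seq: u64,
    payload: Vec<u8>,
}

fn main() {
    // Both ends created in one process for brevity; in real use the receiver
    // is typically handed to another process (e.g. via IpcOneShotServer).
    let (tx, rx) = ipc::channel::<Msg>().unwrap();
    tx.send(Msg { seq: 1, payload: vec![0u8; 4096] }).unwrap();
    let got = rx.recv().unwrap();
    println!("received seq {} ({} bytes)", got.seq, got.payload.len());
}
```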
The other thing is that a 10x improvement on basically nothing is quite small. Whatever time it takes for a message to be processed is going to be dominated by actually consuming the message. If you have a great abstraction, cool - use it anyhow, but it's probably not worth developing a shared memory library yourself.
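To put rough, made-up numbers on that: if handling a message costs 200 µs, shaving the transport from 5 µs to 0.5 µs is only about a 2% end-to-end win.

```rust
// Back-of-the-envelope illustration with invented numbers: the transport
// gets 10x faster, but the end-to-end time barely moves.
fn main() {
    let handle_us = 200.0_f64; // time to actually consume the message
    let uds_us = 5.0;          // hypothetical UDS transport cost
    let shm_us = 0.5;          // hypothetical shared-memory transport cost
    let speedup = (handle_us + uds_us) / (handle_us + shm_us);
    println!("end-to-end speedup: {:.3}x", speedup); // ~1.022x
}
```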