I think that threat is generally overblown in these discussions. Yes, container escape is less difficult than VM escape, but it still requires major kernel 0day to do; it is by no means easy to accomplish. Doubly so if you have some decent hygiene and don't run anything as root or anything else dumb.
When was the last time we have heard container escape actually happening?
Just because you haven't heard of it doesn't mean the risk isn't real.
It's probably better to make some kind of risk assessment and decide whether you're willing to accept this risk for your users / business. And what you can do to mitigate this risk. The truth is the risk is always there and gets smaller as you add several isolation mechanisms to make it insignificant.
I think you meant “container escape is not as difficult as VM escape.”
A malicious workload doesn’t need to be root inside the container, the attack surface is the shared linux kernel.
Not allowing root in a container might mitigate a container getting root access outside of a namespace. But if an escape succeeds the attacker could leverage yet another privilege escalation mechanism to go from non-root to root
Better not rely on unprivileged containers to save you. The problem is:
Breaking out of a VM requires a hypervisor vulnerability, which are rare.
Breaking out of a shared-kernel container requires a kernel syscall vulnerability, which are common. The syscall attack surface is huge, and much of it is exploitable even by unprivileged processes.
They both can be highly unescapable. The podman community is smaller but it's more focused on solving technical problems than docker is at this point, which is trying to increase subscription revenue. I have gotten a configuration for running something in isolation that I'm happy with in podman, and while I think I could do exactly the same thing in Docker, it seems simpler in podman to me.
Apologies for repeating myself all over this part of the thread, but the vulnerabilities here are something that Podman and Docker can't really do anything about as long as they're sharing a kernel between containers.
If you're going to make containers hard to escape, you have to host them under a hypervisor that keeps them apart. Firecracker was invented for this. If Docker could be made unescapable on its own, AWS wouldn't need to run their container workloads under Firecracker.
This same, not especially informative content is being linked to again and again in this thread. If container escapes are so common, why has nobody linked to any of them rather than a comment saying "There are lots" from 3 years ago?
Perspective is everything, I guess. You look at that three year old comment and think it's not particularly informative. I look at that comment and see an experienced infosec pro at Fly.io, who runs billions of container workloads and doesn't trust the cgroups+namespaces security boundary enough so goes to the trouble of running Firecracker instead. (There are other reasons they landed there, but the security angle's part of it.)
Anyway if you want some links, here are a few. If you want more, I'm sure you can find 'em.
Some are covered off by good container deployment hygiene and reducing privilege, but from my POV it looks like the container devs are plugging their fingers in a barrel that keeps springing new leaks.
(To be fair, modern Docker's a lot better than it used to be. If you run your container unprivileged and don't give it extra capabilities and don't change syscall filters or MAC policies, you've closed off quite a bit of the attack surface, though far from all of it.)
But keep in mind that shared-kernel containers are only as secure as the kernel, and today's secure kernel syscall can turn insecure tomorrow as the kernel evolves. There are other solutions to that (look into gVisor and ask yourself why Google went to the trouble to make it -- and the answer is not "because Docker's security mechanisms are good enough"), but if you want peace of mind I believe it's better to sidestep the whole issue by using a hypervisor that's smaller and much more auditable than a whole Linux kernel shared across many containers.
I mean docker runs in sudo privileges for the most part, yes I know that docker can run rootless too but podman does it out of the box.
So if your docker container gets vulnerable and it can somehow break through a container, I think that with default sudo docker, you might get sudo privileges whereas in default podman, you would be having it as a user run executable and might need another zero day or smth to have sudo privilege y'know?