
I'm the last person to claim any knowledge in this area, but isn't this analysis completely missing the possibility of a more accidental, fundamental, lower-level OS or networking bug that cascades? The whole stack is presumably built on the same underlying frameworks and TCP/UDP/IP protocols, so if there's a precarious update or an unexposed time- or memory-dependent bug, that'd surely be quite damaging? I know he speaks of resilience and says that "they’ve seen everything imaginable that could go wrong" but that just seems like hubris, no?

Also, with the FB BGP disaster we saw an example of how their resilience/red-team/etc. learnings failed to highlight how hard recovery from a real-world outage would be with respect to something as basic as building access. Plus there's the hilarious fact that widespread network outages hinder the very kinds of cross-location/timezone communication needed to collaborate and apply fixes. FB teams apparently experienced this. The tools they rely on to communicate evidently assumed that such a network failure could never occur. They were left relying on non-internet comms and non-FB platforms.

To claim that Amazon is simply too experienced to let such things occur seems quite arrogant and naive.



This type of failure could conceivably knock us-east-1 down, but I think it would be pretty easy to recover from (precarious update? roll back!). I think Bray is considering total obliteration of us-east-1.


Yeah, fair. I wonder what duration of outage would cause a spiral of the economic and social effects mentioned. I'm sure a fix made within a few days would not set the economy on a different path, but it's interesting to consider what duration or degree of outage would.



