I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.
Yes, and that's exactly the problem. It's like choosing a microservice architecture for resiliency and building all the services on top of the same database or message queue without underlying redundancy.
afaik they have a tiered service architecture, where tier 1 services are allowed to rely on tier 0 services but not vice versa, and tier 0 services carry stricter reliability guarantees than tier 1.
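Very roughly, that rule can be checked mechanically. A minimal Python sketch with made-up service names and tiers (not anything AWS actually runs), flagging any dependency that points from a lower-numbered tier to a higher-numbered one:

    # Tier rule from the comment above: a service may only depend on services
    # at the same or a lower tier number. All names and tiers are hypothetical.
    SERVICE_TIERS = {
        "auth": 0,       # pretend tier-0 foundation service
        "storage": 0,
        "billing": 1,    # pretend tier-1 service
        "reporting": 1,
    }

    DEPENDENCIES = {
        "billing": ["auth", "storage"],  # fine: tier 1 -> tier 0
        "reporting": ["billing"],        # fine: tier 1 -> tier 1
        "auth": ["storage"],             # fine: tier 0 -> tier 0
        "storage": ["reporting"],        # violation: tier 0 -> tier 1
    }

    def check_tier_rule(tiers, deps):
        violations = []
        for service, its_deps in deps.items():
            for dep in its_deps:
                if tiers[dep] > tiers[service]:
                    violations.append(
                        f"{service} (tier {tiers[service]}) depends on "
                        f"{dep} (tier {tiers[dep]})"
                    )
        return violations

    print(check_tier_rule(SERVICE_TIERS, DEPENDENCIES))
    # -> ['storage (tier 0) depends on reporting (tier 1)']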
It is kinda cool that the worst AWS outages are still confined to a single region and not global.
But I think what wasn't well considered was the async effect: if something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, the workflows that have piled up during that time become a problem in themselves. Worst case, they turn into poison pills which then break the system again.
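To make that poison-pill failure mode concrete, here is a toy Python sketch (message shapes and limits are invented): a capped retry plus a dead-letter queue is one common way to keep a single unprocessable message from wedging the backlog drain after an outage.

    from collections import deque

    MAX_ATTEMPTS = 3
    backlog = deque([{"id": 1}, {"id": 2, "poison": True}, {"id": 3}])
    dead_letter = []

    def process(msg):
        # Stand-in for real work; the "poison" message always fails.
        if msg.get("poison"):
            raise RuntimeError("unprocessable message")

    while backlog:
        msg = backlog.popleft()
        attempts = msg.get("attempts", 0)
        try:
            process(msg)
            print(f"processed {msg['id']}")
        except RuntimeError:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letter.append(msg)      # park it instead of retrying forever
                print(f"dead-lettered {msg['id']}")
            else:
                msg["attempts"] = attempts + 1
                backlog.append(msg)          # retry later, behind the rest of the backlog

    print("dead letters:", [m["id"] for m in dead_letter])

Without the MAX_ATTEMPTS cap, message 2 would cycle forever and the backlog would never fully drain.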
I think a lot of it's probably technical debt. So much internally still relies on legacy systems in us-east-1, and every time this happens I'm sure there's an internal discussion about decoupling that reliance, which then turns into a massive diagram, like a family tree dating back a thousand years, of all the things that need to change to stop it happening.
There's also the issue of sometimes needing actual strong consistency. Things like auth or billing, for example, where you absolutely can't tolerate eventual consistency or split-brain situations; in those cases you need one region to serve as the ultimate source of truth.
Interesting point that banks actually tolerate a lot more eventual consistency than most software that just uses a billing backend ever does.
Stuff like 503-ing a SaaS request because the billing system was down and you couldn't check limits: that check could absolutely be served from a local cache, and eventual consistency would hurt very little. Unless your cost per request is quite high, I would much rather keep the API up and deal with the over-usage later.
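A rough sketch of that fail-open approach, assuming an illustrative limit cache and a stand-in billing call (not any real billing API): check against a locally cached limit, and if the backend is unreachable with nothing cached, let the request through and reconcile usage later.

    import time

    CACHE_TTL_SECONDS = 300
    _limit_cache = {}  # customer_id -> (limit, fetched_at)

    def fetch_limit_from_billing(customer_id):
        # Stand-in for a remote call; simulate the billing backend being down.
        raise ConnectionError("billing backend unreachable")

    def allowed(customer_id, current_usage):
        cached = _limit_cache.get(customer_id)
        if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
            return current_usage < cached[0]
        try:
            limit = fetch_limit_from_billing(customer_id)
            _limit_cache[customer_id] = (limit, time.time())
            return current_usage < limit
        except ConnectionError:
            if cached:
                return current_usage < cached[0]  # stale but usable
            return True  # fail open: keep the API up, settle the over-usage later

    print(allowed("cust-42", current_usage=120))  # True: billing down, we fail open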
Banking/transactions is full of split-brains where everyone involved prays for eventual consistency.
If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.
Sounds plausible. It's also a "fat and happy" symptom: not being able to fix deep underlying issues despite an ever-growing pile of cash in the company.
Fixing deep underlying issues tends to fare poorly on performance reviews because success is not an easily traceable victory event. It is the prolonged absence of events like this, and it's hard to prove a negative.
Yeah I think there are a number of "hidden" dependencies on different regions, especially us-east-1. It's an artifact of it being AWS' largest region, etc.
us-east-2 does exist; it's in Ohio. One major issue is that a number of services have (had? Not sure if it's still this way) a control plane in us-east-1, so if it goes down, so do a number of other services, regardless of their location.
> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.
AWS has stuff failing alllllllll the time; it's not very surprising that many of the outages that do become visible to you involve multi-system failures - lots of other ones don't become visible!