
I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.


You know how people say X startup is a ChatGPT wrapper? A significant chunk of AWS services are wrappers around the main services (DynamoDB, EC2, S3, etc.).


Yes, and that's exactly the problem. It's like choosing a microservice architecture for resiliency and building all the services on top of the same database or message queue without underlying redundancy.


afaik they have a tiered service architecture, where tier 1 services are allowed to rely on tier 0 services but not vice versa, and the reliability guarantees on tier 0 services are stricter than those on tier 1 (rough sketch of the rule below).

It is kinda cool that the worst AWS outages are still within a single region and not global.
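
Not claiming this is how AWS enforces it internally; just a toy sketch of that tier rule, with made-up service names and tiers:

    # Toy sketch: a service may only depend on services at the same or a
    # lower (more foundational) tier. Names and tiers are hypothetical.
    TIER = {
        "storage-core": 0,   # tier 0: foundational, highest reliability bar
        "auth-core": 0,
        "queue-service": 1,  # tier 1: may depend on tier 0, not vice versa
        "dashboard-api": 2,
    }

    DEPENDS_ON = {
        "queue-service": ["storage-core", "auth-core"],   # ok: 1 -> 0
        "dashboard-api": ["queue-service", "auth-core"],  # ok: 2 -> 1, 2 -> 0
        "storage-core": [],                               # tier 0 depends on nothing above it
    }

    def violations(tiers, deps):
        """Return (service, dependency) pairs where a service depends on
        something in a higher (less foundational) tier, which the rule forbids."""
        bad = []
        for svc, ds in deps.items():
            for d in ds:
                if tiers[d] > tiers[svc]:  # depending "upward" breaks the rule
                    bad.append((svc, d))
        return bad

    print(violations(TIER, DEPENDS_ON))  # [] means the graph respects the tiers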


There IS a huge amount of redundancy built into the core services but nothing is perfect.


DNS is always the single point of failure.

But I think what wasn't well considered was the async effect: if something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, the workflows that have piled up during that time become a problem in themselves. Worst case, they turn into poison pills which then break the system again.
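
Rough illustration (nothing AWS-specific, all names made up): a consumer draining a post-outage backlog usually needs a retry cap and a dead-letter path, otherwise a single poison-pill message keeps failing and re-breaking the system:

    import queue

    MAX_ATTEMPTS = 3

    def drain(backlog, dead_letter, handle):
        """Work through messages piled up during an outage.

        A message that keeps failing (a 'poison pill') is parked in a
        dead-letter queue after MAX_ATTEMPTS instead of being retried
        forever and blocking everything behind it.
        """
        while not backlog.empty():
            msg = backlog.get()
            attempts = msg.get("attempts", 0)
            try:
                handle(msg)
            except Exception:
                if attempts + 1 >= MAX_ATTEMPTS:
                    dead_letter.put(msg)          # park it for manual inspection
                else:
                    msg["attempts"] = attempts + 1
                    backlog.put(msg)              # retry later, behind the rest

    # usage (handler is hypothetical): drain(pending, dlq, handle=process_event)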


I think a lot of it's probably technical debt. So much internally still relies on legacy systems in us-east-1, and every time this happens I'm sure there's a discussion internally about decoupling that reliance, which then turns into a massive diagram that looks like a family tree dating back a thousand years of all the things that need to be changed to stop it happening.


There's also the issue of sometimes needing actual strong consistency. Things like auth or billing, for example, where you absolutely can't tolerate eventual consistency or split-brain situations, in which case you need one region to serve as the ultimate source of truth.


> billing […] can't tolerate eventual consistency

Interesting point: banks actually tolerate a lot more eventual consistency than most software that just uses a billing backend ever does.

Stuff like 503-ing a SaaS request because the billing system was down and you couldn't check limits: those limits could absolutely be cached locally, and eventual consistency would hurt very little. Unless your cost is quite high, I would much rather keep the API up and deal with the over-usage later.
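
A minimal sketch of that idea (fetch_from_billing and the TTL are hypothetical): fall back to a locally cached quota when billing is unreachable and reconcile the over-usage later, instead of failing the request:

    import time

    CACHE_TTL = 300      # seconds; how stale a cached quota we're willing to trust
    _quota_cache = {}    # customer_id -> (quota, fetched_at)

    def remaining_quota(customer_id, fetch_from_billing):
        """Prefer a fresh answer from billing, but fall back to the cached
        value (eventual consistency) rather than 503-ing the request."""
        now = time.time()
        try:
            quota = fetch_from_billing(customer_id)   # authoritative, may be down
            _quota_cache[customer_id] = (quota, now)
            return quota
        except Exception:
            cached = _quota_cache.get(customer_id)
            if cached and now - cached[1] < CACHE_TTL:
                return cached[0]                      # slightly stale beats an outage
            raise  # no usable cache: only now surface the failure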


Banking/transactions is full of split-brains where everyone involved prays for eventual consistency.

If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.


Interestingly, TigerBeetle manages to have distributed strict consistency over 6 machines.


Banking is full of examples of eventually consistent systems. ACH, credit card transactions, blockchain...


Sounds plausible. It's also a "fat and happy" symptom not to be able to fix deep underlying issues despite an ever-growing pile of cash in the company.


Fixing deep underlying issues tends to fare poorly on performance reviews because success is not an easily traceable victory event. It is the prolonged absence of events like this, and it's hard to prove a negative.


Yeah I think there are a number of "hidden" dependencies on different regions, especially us-east-1. It's an artifact of it being AWS' largest region, etc.


Why don't they have us-east-2, 3, 4, etc., in actually different cities?


us-east-1 is actually dozens of physical buildings distributed over a massive area. It's not like a single data center somewhere...


us-east-2 does exist; it's in Ohio. One major issue is that a number of services have (had? Not sure if it's still this way) a control plane in us-east-1, so if it goes down, so do a number of other services, regardless of their location.


you can't possibly know that?

surely you mean:

> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.

AWS has stuff failing alllllllll the time; it's not very surprising that many of the outages that become visible to you involve multi-system failures. Lots of other ones don't become visible!


Sure, but none of those other issues are ever documented by AWS as their status page is usually just one big lie.



