I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.
Yes, and that's exactly the problem. It's like choosing a microservice architecture for resiliency and building all the services on top of the same database or message queue without underlying redundancy.
afaik they have a tiered service architecture, where tier 1 services are allowed to rely on tier 0 services but not vice versa, and tier 0 services carry stricter reliability guarantees than tier 1.
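Very roughly, that rule can be checked mechanically. A minimal Python sketch with made-up service names and tiers (not anything AWS actually runs), flagging any dependency that points from a lower-numbered tier to a higher-numbered one:

    # Tier rule from the comment above: a service may only depend on services
    # at the same or a lower tier number. All names and tiers are hypothetical.
    SERVICE_TIERS = {
        "auth": 0,       # pretend tier-0 foundation service
        "storage": 0,
        "billing": 1,    # pretend tier-1 service
        "reporting": 1,
    }

    DEPENDENCIES = {
        "billing": ["auth", "storage"],  # fine: tier 1 -> tier 0
        "reporting": ["billing"],        # fine: tier 1 -> tier 1
        "auth": ["storage"],             # fine: tier 0 -> tier 0
        "storage": ["reporting"],        # violation: tier 0 -> tier 1
    }

    def check_tier_rule(tiers, deps):
        violations = []
        for service, its_deps in deps.items():
            for dep in its_deps:
                if tiers[dep] > tiers[service]:
                    violations.append(
                        f"{service} (tier {tiers[service]}) depends on "
                        f"{dep} (tier {tiers[dep]})"
                    )
        return violations

    print(check_tier_rule(SERVICE_TIERS, DEPENDENCIES))
    # -> ['storage (tier 0) depends on reporting (tier 1)']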
It is kinda cool that the worst AWS outages are still confined to a single region and not global.
But I think what wasn't well considered was the async effect: if something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, the workflows that have piled up during that time become a problem in themselves. Worst case, they turn into poison pills which then break the system again.
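To make that poison-pill failure mode concrete, here is a toy Python sketch (message shapes and limits are invented): a capped retry plus a dead-letter queue is one common way to keep a single unprocessable message from wedging the backlog drain after an outage.

    from collections import deque

    MAX_ATTEMPTS = 3
    backlog = deque([{"id": 1}, {"id": 2, "poison": True}, {"id": 3}])
    dead_letter = []

    def process(msg):
        # Stand-in for real work; the "poison" message always fails.
        if msg.get("poison"):
            raise RuntimeError("unprocessable message")

    while backlog:
        msg = backlog.popleft()
        attempts = msg.get("attempts", 0)
        try:
            process(msg)
            print(f"processed {msg['id']}")
        except RuntimeError:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letter.append(msg)      # park it instead of retrying forever
                print(f"dead-lettered {msg['id']}")
            else:
                msg["attempts"] = attempts + 1
                backlog.append(msg)          # retry later, behind the rest of the backlog

    print("dead letters:", [m["id"] for m in dead_letter])

Without the MAX_ATTEMPTS cap, message 2 would cycle forever and the backlog would never fully drain.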
I think a lot of it's probably technical debt. So much internally still relies on legacy systems in us-east-1, and every time this happens I'm sure there's an internal discussion about decoupling that reliance, which then turns into a massive diagram, like a family tree dating back a thousand years, of all the things that need to change to stop it happening.
There's also the issue of sometimes needing actual strong consistency. Things like auth or billing, for example, where you absolutely can't tolerate eventual consistency or split-brain situations; in those cases you need one region to serve as the ultimate source of truth.
Interesting point that banks actually tolerate a lot more eventual consistency than most software that just uses a billing backend ever does.
Stuff like 503-ing a SaaS request because the billing system was down and you couldn't check limits: that check could absolutely be served from a local cache, and eventual consistency would hurt very little. Unless your cost per request is quite high, I would much rather keep the API up and deal with the over-usage later.
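A rough sketch of that fail-open approach, assuming an illustrative limit cache and a stand-in billing call (not any real billing API): check against a locally cached limit, and if the backend is unreachable with nothing cached, let the request through and reconcile usage later.

    import time

    CACHE_TTL_SECONDS = 300
    _limit_cache = {}  # customer_id -> (limit, fetched_at)

    def fetch_limit_from_billing(customer_id):
        # Stand-in for a remote call; simulate the billing backend being down.
        raise ConnectionError("billing backend unreachable")

    def allowed(customer_id, current_usage):
        cached = _limit_cache.get(customer_id)
        if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
            return current_usage < cached[0]
        try:
            limit = fetch_limit_from_billing(customer_id)
            _limit_cache[customer_id] = (limit, time.time())
            return current_usage < limit
        except ConnectionError:
            if cached:
                return current_usage < cached[0]  # stale but usable
            return True  # fail open: keep the API up, settle the over-usage later

    print(allowed("cust-42", current_usage=120))  # True: billing down, we fail open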
Banking/transactions is full of split-brains where everyone involved prays for eventual consistency.
If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.
Sounds plausible. It's also a "fat and happy" symptom: not being able to fix deep underlying issues despite an ever-growing pile of cash in the company.
Fixing deep underlying issues tends to fare poorly on performance reviews because success is not an easily traceable victory event. It is the prolonged absence of events like this, and it's hard to prove a negative.
Yeah I think there are a number of "hidden" dependencies on different regions, especially us-east-1. It's an artifact of it being AWS' largest region, etc.
us-east-2 does exist; it's in Ohio. One major issue is that a number of services have (had? Not sure if it's still this way) a control plane in us-east-1, so if it goes down, so do a number of other services, regardless of their location.
> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.
AWS has stuff failing alllllllll the time; it's not very surprising that many of the outages that do become visible to you involve multi-system failures - lots of other ones don't become visible!