I wonder how many premium accounts Anna’s Archive had to use to scrape the whole thing. Surely Spotify has scrape protection and wouldn’t allow a single account to stream (download) millions of separate tracks.
I haven't looked at the code but I would be surprised if the premium account "requirement" is anything more than an if statement that can be commented out.
I’m really curious what their rollout procedure is, because it seems like many of their past outages would have been caught if they had released these configuration changes to 1% of global traffic first.
They don't appear to have a rollout procedure for some of their globally replicated application state. They've had a number of major outages over the past few years, all with the same root cause: "a global config change exposed a bug in our code and everything blew up".
I guess it's an organizational consequence of mitigating attacks in real time, where rollout delays can be risky as well. But if you're going to do that, the code has to be written much more defensively than it is right now.
Yeah, agreed. This is the same discussion point that came up the last time they had an incident.
I really don’t buy this requirement to always deploy state changes 100% globally immediately.
Why can’t they just roll out to 1%, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That would go a long way towards reducing the impact of these regressions.
And if they really think something is so critical that it has to go everywhere immediately, then sure, set the rollout to start at 100%.
Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
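To make that concrete, here's a rough Python sketch of the kind of ramping rollout I mean. The set_rollout_percent(), error_rate() and rollback() hooks are hypothetical stand-ins for whatever control plane actually pushes config; this is just the shape of the mechanism, not anyone's real deployment system.

    import time

    RAMP_STEPS = [1, 5, 25, 50, 100]   # percent of global traffic
    STEP_INTERVAL_S = 60               # configurable; ~5 minutes end to end
    ERROR_BUDGET = 0.01                # abort if the canary cohort regresses past this

    def set_rollout_percent(change_id: str, percent: int) -> None:
        # Placeholder: push the config change to `percent` of traffic.
        print(f"{change_id}: now serving {percent}% of traffic")

    def error_rate(change_id: str) -> float:
        # Placeholder: read an aggregate health metric for the canary cohort.
        return 0.0

    def rollback(change_id: str) -> None:
        # Placeholder: revert the change everywhere.
        print(f"{change_id}: rolled back")

    def progressive_rollout(change_id: str, start_at: int = 1) -> bool:
        # A change deemed truly critical can pass start_at=100 and skip the ramp.
        for percent in [p for p in RAMP_STEPS if p >= start_at]:
            set_rollout_percent(change_id, percent)
            time.sleep(STEP_INTERVAL_S)          # let the cohort bake
            if error_rate(change_id) > ERROR_BUDGET:
                rollback(change_id)              # automated pause/abort on regression
                return False
        return True

The point is that the ramp, the bake time and the abort are owned by the rollout system, so a routine WAF rule change gets the slow path by default.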
For hypothetical conflicting changes (read: worst case, unupgraded nodes/services can't interop with upgraded nodes/services), what's the best practice for a partial rollout?
Blue/green and temporarily ossify capacity? Regional?
That's ok, but it doesn't solve issues you only notice on actual prod traffic. It can be a nice addition for catching problems earlier with minimal user impact, but best practice on large-scale systems still calls for a staged/progressive prod rollout.
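For the interop worst case upthread, one common pattern (hypothetical names here, not any vendor's actual API) is to have upgraded nodes keep emitting the old format until every peer in the cohort reports a version that can read the new one:

    from dataclasses import dataclass

    NEW_FORMAT_MIN_VERSION = 2

    @dataclass
    class Peer:
        name: str
        version: int

    def fleet_supports_new_format(peers: list[Peer]) -> bool:
        # Only flip behavior once every node in the cohort has been upgraded.
        return all(p.version >= NEW_FORMAT_MIN_VERSION for p in peers)

    def encode_message(payload: dict, peers: list[Peer]) -> bytes:
        if fleet_supports_new_format(peers):
            return ("v2:" + repr(payload)).encode()   # every reader understands v2 now
        return ("v1:" + repr(payload)).encode()       # stay on v1 while the fleet is mixed

    mixed = [Peer("a", 1), Peer("b", 2)]
    upgraded = [Peer("a", 2), Peer("b", 2)]
    print(encode_message({"k": 1}, mixed))     # v1 while the rollout is partial
    print(encode_message({"k": 1}, upgraded))  # v2 once everyone can read it

That splits the change into two independently rampable rollouts (ship the reader first, flip the writer later), so a partial rollout never strands traffic between incompatible nodes.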
If there is a proper rollout procedure that would've caught this, and they bypass it for routine WAF configuration changes, they might as well not have one.
The update they describe should never bring down all services. I agree with other posters that they must lack a rollout strategy, yet they sent spam emails mocking the reliability of other clouds.