Google calls this style of design "Non-Abstract Large System Design" (NALSD): https://sre.google/workbook/non-abstract-design/

Working from a concrete design with concrete numbers helps identify the main scaling and reliability limits and puts a cost on them, for example: "Design X costs $A/year for Y scheduled fly.io tasks."

Building such a design relies on knowing fundamentals such as the performance characteristics of CPU, disk, and network: "How many disks would it take to serve 50k QPS at 20 ms, with each query performing 1 KB of random disk I/O?"
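As a rough illustration of that disk question, here is a back-of-envelope sketch in Python. The numbers are assumptions, not measurements: roughly 10 ms per random read on a spinning disk, one random I/O per query, and a simple M/M/1 queueing model in which mean response time is service time / (1 - utilization). The function name `disks_needed` is made up for this example.

```python
import math

def disks_needed(qps, ios_per_query, service_ms, latency_budget_ms):
    """Estimate disk count for a random-I/O workload (single-queue M/M/1 model)."""
    if service_ms >= latency_budget_ms:
        raise ValueError("one I/O already exceeds the latency budget")
    # M/M/1: response = service / (1 - utilization), so solve for the
    # highest utilization that still meets the latency budget.
    max_utilization = 1 - service_ms / latency_budget_ms
    # Raw random IOPS of one disk, then derated by the utilization cap.
    usable_iops = (1000 / service_ms) * max_utilization
    return math.ceil(qps * ios_per_query / usable_iops)

# Assumed: ~10 ms per random read on a spinning disk, one I/O per query.
print(disks_needed(qps=50_000, ios_per_query=1, service_ms=10, latency_budget_ms=20))
```

Under these assumptions each disk can only run at 50% utilization, so the answer is about 1,000 disks rather than the 500 that raw IOPS alone would suggest. A faster device changes the answer by orders of magnitude, which is exactly the kind of thing the exercise is meant to surface.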

Knowing this helps identify where in your stack you want flexibility, and why you'd want it.

"We log 100 MB/s spread across 200 machines, which can be done with vanilla, mature Elasticsearch."

"We need to be able to perform container routing at 0.5ms overhead, and updates need to be atomic. eBPF can do this but existing solutions are immature. Since this is also our core competency, let's do this ourselves."
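The logging figure above can be turned into per-machine and per-day numbers with the same kind of arithmetic. This is only a sketch: the 30-day retention period and the `log_volume` helper are assumptions for illustration, and it uses decimal units (1 TB = 1,000,000 MB).

```python
def log_volume(total_mb_per_s, machines, retention_days):
    """Convert an aggregate log rate into per-machine and storage figures."""
    per_machine_mb_per_s = total_mb_per_s / machines
    # 86,400 seconds per day; decimal units, 1 TB = 1,000,000 MB.
    tb_per_day = total_mb_per_s * 86_400 / 1_000_000
    retained_tb = tb_per_day * retention_days
    return per_machine_mb_per_s, tb_per_day, retained_tb

# 100 MB/s over 200 machines comes from the example; 30 days is assumed.
per_machine, daily_tb, retained_tb = log_volume(100, 200, retention_days=30)
print(f"{per_machine:.2f} MB/s per machine, {daily_tb:.2f} TB/day, {retained_tb:.1f} TB retained")
```

That works out to 0.5 MB/s per machine and 8.64 TB of new logs per day, before any replication or indexing overhead.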