I am interested in distributed systems and database internals (both traditional and new databases) but find that many database resources tend to be either introductory SQL queries or related to tuning.
I personally like to find new distributed systems, and then learn what techniques they use.
For example, learning how serf.io uses Vivaldi, how CockroachDB uses multi-group Raft, or why FoundationDB has different processes and what they each do.
I try to write interesting stuff on distributed systems, but there's also a great Discord on software internals, created by Phil Eaton, that has a lot of great discussions: https://twitter.com/eatonphil
There is overhead to a token. If that was the "simple trick", then why don't hash-based systems like Cassandra, Scylla, Temporal, etc. do that by default?
It's only somewhat effective if you start at massive scale tbh, and it still doesn't solve hot partitions, because a partition can be hot because of a single tenant (e.g. company ID) that can't be split across hash tokens, but can be split across ranges.
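To make the hot-tenant point concrete, here is a minimal Python sketch. It assumes the partition key is the tenant ID alone; `NUM_TOKENS`, `hash_token`, `range_partition`, and the `acme` tenant are all made up for illustration and are not any real database's API.

```python
import bisect
import hashlib

NUM_TOKENS = 8  # hypothetical token count, chosen only for the example


def hash_token(tenant_id: str) -> int:
    """Map a partition key to a token. Every row sharing the key lands here."""
    digest = hashlib.md5(tenant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_TOKENS


def range_partition(split_points: list, key: tuple) -> int:
    """Return the index of the range owning `key`, given sorted split points."""
    return bisect.bisect_right(split_points, key)


# One hot tenant with 1000 rows, keyed by (tenant_id, row_id).
rows = [("acme", i) for i in range(1000)]

# Hash partitioning on tenant_id: all rows collapse onto a single token.
hash_tokens = {hash_token(tenant) for tenant, _ in rows}

# Range partitioning on the full key: split points can sit inside the tenant.
splits = [("acme", 250), ("acme", 500), ("acme", 750)]
range_ids = {range_partition(splits, row) for row in rows}

print(len(hash_tokens), len(range_ids))
```

Under those assumptions the hashed tenant always occupies exactly one token no matter how many tokens exist, while the range scheme can keep adding split points inside the tenant's key space.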
Why can’t hash based partitioning systems just store the full hash with the key for fast rehashing if the number of buckets needs to change, or else recompute the hash?
1. The hash would be an extra column that can already be calculated from existing data, so it's wasted storage
2. You effectively have to rewrite the entire database to itself to redistribute, and keeping the DB available during this process is _very_ complicated
3. You're putting an extreme load on the DB for a substantial amount of time. This takes away from your DB performance and makes node downtime even more severe
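A rough Python sketch of why points 2 and 3 bite, assuming plain `hash % num_buckets` placement (consistent hashing and token rings exist precisely to reduce this). The `bucket` and `fraction_moved` helpers and the key names are invented for the example.

```python
import hashlib


def bucket(key: str, num_buckets: int) -> int:
    """Place a key with plain modulo hashing (an assumption of this sketch)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def fraction_moved(keys, old_buckets: int, new_buckets: int) -> float:
    """Fraction of keys whose bucket changes when the bucket count changes."""
    moved = sum(
        1 for k in keys if bucket(k, old_buckets) != bucket(k, new_buckets)
    )
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_buckets=10, new_buckets=11)
print(f"{moved:.1%} of keys change bucket going from 10 to 11 buckets")
```

With a uniform hash, a key keeps its bucket only when `h % 110 < 10`, so about 10 in 11 keys move. Storing the hash next to the key would save recomputing it, but the rows still have to be physically moved, which is where the load comes from.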
In a distributed DB, you have to remember that the probability your node _doesn't_ have the data you need increases with the size of the cluster, which creates a vicious cycle when you have to rewrite.
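As a back-of-the-envelope illustration in Python: assume data is spread uniformly, each key lives on `replication_factor` nodes, and a request lands on a random node. The function name and the default replication factor of 3 are assumptions for the sketch, not numbers from any particular system.

```python
def miss_probability(cluster_size: int, replication_factor: int = 3) -> float:
    """Chance that a randomly chosen node holds no replica of a given key.

    Assumes uniform placement and that each key is stored on exactly
    `replication_factor` distinct nodes.
    """
    if cluster_size <= 0 or replication_factor <= 0:
        raise ValueError("cluster_size and replication_factor must be positive")
    return max(0.0, 1 - replication_factor / cluster_size)


probs = {n: miss_probability(n) for n in (3, 6, 12, 48)}
for n, p in probs.items():
    print(f"{n:>3} nodes: {p:.2%} of lookups need another node")
```

Under these assumptions the miss rate is `1 - RF / N`, so it climbs toward 100% as the cluster grows, and a full rewrite ends up shipping almost every row over the network.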