As such, I have a question for you: contrary to your article, I've always been taught that random primary keys are better than sequential ones. The reason for this, I was told, was to avoid "hotspots". I guess it only really applies once sharding comes into play, and perhaps also only if your primary key is your sharding key, but I think that's a pretty common setup.
I'm not really sure how to formulate a concrete question here, I guess I would like to hear your thoughts on any tradeoffs on sequential Vs random keys in sharded setups? Is there a case there random keys are valid, or have I been taught nonsense?
If you're sharding based purely on sequential ID ranges, then yes this is a problem. Its better practice to shard based on a hash of your ID, so sequential id assignments turn into non-sequential shard keys, keeping things evenly distributed.
And since it's only used for speedy lookup we can even use a fast, cheap and non-secure hashing algorithm, so it's really a low-cost operation!
Thanks! This was really one of those aha-moments where I feel kinda stupid to not have thought of it myself!
yes, that's the crux of the problem. when you have a sharded database, typically you want to be able to add (and/or remove) shards easily and non-disruptively.
for example - your database is currently sharded across N nodes, and it's overloaded due to increased traffic, so you want to increase it to N+1 nodes (or N+M nodes, which can add complexity in some cases)
if adding a shard causes a significant increase in load on the database, that's usually a non-starter for a production workload, because at the time you want to do it, the database is already overloaded
you can read about this in the original Dynamo paper [0] from almost 20 years ago - consistent hashing is used to select 3 of the N nodes to host a given key. when node N+1 is added, it joins the cluster in such a way that it will "take over" hosting 1/Nth of the data, from each of the N nodes - meaning that a) the joining process places a relatively small load on each of those N nodes and b) once the node is fully joined, it reduces overall load evenly across all N nodes.
0: https://www.allthingsdistributed.com/2007/10/amazons_dynamo....
It's not as good as just a sequential ID at keeping the fragmentation and data movement down. However, it does ultimately lead to the best write performance for us because the user data ends up likely still appending to an empty page. It allows for more concurrent writes to the same table because they aren't all fighting over that end page.
UUIDv4 is madness.
That's another thing, some say to use uuid7 for sharded DBs, but this is a serious counterexample.