> MySQL (technically InnoDB) and SQL Server have a huge problem with random primary keys because of their design as a clustered index – since the tuples are physically located, or clustered, around the PK, the randomness means that a simple query like "SELECT id FROM customer WHERE date_created > NOW() - INTERVAL 1 WEEK" will hit tons of different pages. Even if everything is cached, that's still a lot of extra work.
The Postgres query planner keeps a correlation statistic for how well an index's order matches the order of the table data on disk, and if the correlation is bad and the query is expected to return enough rows, it'll skip the index entirely. Instead it'll do a sequential scan and sort, to work with the disk cache instead of against it.
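You can inspect that statistic yourself in the pg_stats view; a quick sketch (the table and column names here are hypothetical):

```sql
-- How well does the physical row order match this column's order?
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'customer' AND attname = 'date_created';

-- correlation near 1.0 (or -1.0): an index scan touches mostly
-- sequential pages; near 0.0: every index hit lands on a different
-- page, so the planner may prefer a sequential scan + sort instead.
```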
Postgres provides a CLUSTER command to reorder the table data on disk to match a given index, resulting in a high correlation and allowing it to do an index scan without the penalty for random disk access. If you're on an SSD, or know you have enough memory to keep all the data cached, you can also lower the random-access penalty (random_page_cost) in the configuration; it exists mainly for spinning disks.
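A rough sketch of both knobs, assuming a hypothetical customer table with an index on date_created:

```sql
-- Rewrite the table in index order. Note this takes an ACCESS
-- EXCLUSIVE lock for the duration of the rewrite.
CLUSTER customer USING customer_date_created_idx;

-- CLUSTER is one-shot: new rows are not kept in order, so it has to
-- be re-run as correlation degrades. Refresh statistics afterwards.
ANALYZE customer;

-- On SSDs, tell the planner that random page reads are cheap
-- (default is 4.0; values close to seq_page_cost are common for SSDs).
SET random_page_cost = 1.1;
```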
I agree CLUSTER should help, although pg_repack is even better, since it doesn't hold an exclusive lock while it copies the entire table. Even so, pg_repack is still very slow compared to the continuous B-tree rebalancing MySQL does as part of normal writes.
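For reference, a minimal sketch of how pg_repack is typically driven (extension plus command-line tool; the database, table, and column names are hypothetical):

```sql
-- The extension must be installed in the target database first.
CREATE EXTENSION pg_repack;

-- The actual reorganization is run from a shell; it rebuilds the
-- table in the background and only takes a brief exclusive lock to
-- swap the new table in:
--   pg_repack --table=customer --order-by=date_created mydb
```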
The problem isn't fixed by SSDs, RAM, or Postgres settings. Fast random access is great, but if your data isn't localized you're still leaving a roughly 100x optimization on the table.
The original comment I replied to explained the data localization problem as if it were a MySQL problem that Postgres doesn't have. That's like saying that Rust allows unsafe and this can cause memory corruption, so let's use C++. It's hard to overstate how little sense it makes.
Postgres is basically the only DB that doesn't offer a decent solution for data localization. Every other major DB lets you control the physical layout of rows automatically with minimal resource overhead; Postgres's tools are extremely crude in comparison.
How do you fix this problem in Postgres?