I'm aware that Rust has something similar for things like `std::collections::Ha...

GrumpySloth · on Dec 11, 2023

Which is why the Rust compiler itself uses a non-cryptographic hash, which takes just 3 x86 instructions and can work on 8 bytes at a time: <https://github.com/rust-lang/rustc-hash/blob/master/src/lib....>

eesmith · on Dec 11, 2023

Python 3.11 appears to have switched to SipHash 1-3 for strings, from 2-4, following the lead of Rust and Ruby. https://github.com/python/cpython/issues/73596

However, Python does not use it for integers:

  >>> hash(10)
  10
  >>> hash(100)
  100
  >>> hash(2**61-2) == 2**61-2
  True
  >>> hash(2**61-1)
  0
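
The outputs above are consistent with CPython reducing non-negative integers modulo `sys.hash_info.modulus` (2**61 - 1 on 64-bit builds). A minimal sketch to check that; `int_hash_model` is a hypothetical helper of mine, not a CPython API, and it deliberately ignores negative numbers:

```python
import sys

# 2**61 - 1 on 64-bit CPython builds; smaller on 32-bit builds.
M = sys.hash_info.modulus


def int_hash_model(n: int) -> int:
    """Hypothetical model of hash(n) for non-negative ints: n reduced mod M."""
    return n % M


# Small values hash to themselves; values at or above M wrap around.
samples = [0, 10, 100, M - 1, M, M + 1, 7 * M + 3]
for n in samples:
    assert hash(n) == int_hash_model(n)
```

So it's only an identity hash below the modulus, and every multiple of the modulus hashes to 0.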

danhau · on Dec 11, 2023

That’s good though, right? Is there a reason for not using an identity hash (is that the right term?) for integers?

ynik · on Dec 11, 2023

That depends on the hash table implementation and the distribution of the integers.

For the commonly used hash tables with prime size that use modulo to turn the hash code into a slot index, an identity hash for integers is usually fine (unless many integers are multiples of the prime size).

But other hash tables use power-of-two size to replace the modulo operation with a faster bit-and operation. Now an identity hash for integers is much more problematic, e.g. if all integers are multiples of 1000, only 1/8th of the table slots can be used.
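
To make the 1/8th figure concrete, here is a rough sketch that counts how many slots an identity hash can reach. The table sizes (1024 and the prime 1021) and the `used_slots` helper are my own picks for illustration, not taken from any particular hash table:

```python
def used_slots(keys, table_size, use_mask):
    """Count distinct slot indices reached when the key itself is the hash."""
    if use_mask:
        # Power-of-two table: index = hash & (size - 1)
        return len({k & (table_size - 1) for k in keys})
    # Prime-sized table: index = hash % size
    return len({k % table_size for k in keys})


keys = [i * 1000 for i in range(10_000)]

pow2_used = used_slots(keys, 1024, use_mask=True)
prime_used = used_slots(keys, 1021, use_mask=False)

print(pow2_used, "of 1024 slots used with bit-and")
print(prime_used, "of 1021 slots used with modulo")
```

1000 = 8 * 125, so the low three bits of every key are zero and the bit-and can only ever land on every 8th slot.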

The latter kind of hash table wants all bits of the hash value to be well distributed, and this is typically not true of the underlying integers. So an additional mixing operation needs to be used. Whether that mixing happens in the hash function or in the hash table depends on the implementation (for some, it's even configurable, e.g. the is_avalanching marker in ankerl::unordered_dense).
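
As an illustration of what such a mixing step can look like, here is a sketch using multiplicative ("Fibonacci") hashing: multiply by an odd 64-bit constant and keep the top bits. This is just one well-known option I picked for the example; I'm not claiming it is what ankerl::unordered_dense or any other specific table does, and the function names are made up:

```python
MASK64 = (1 << 64) - 1
# Odd 64-bit constant close to 2**64 divided by the golden ratio.
GOLDEN = 0x9E3779B97F4A7C15


def slot_identity(key, bits):
    """Slot index from the low bits of the key, with no mixing."""
    return key & ((1 << bits) - 1)


def slot_mixed(key, bits):
    """Slot index from the top bits of a 64-bit multiplicative mix."""
    return ((key * GOLDEN) & MASK64) >> (64 - bits)


keys = [i * 1000 for i in range(10_000)]
bits = 10  # table with 2**10 = 1024 slots

identity_used = len({slot_identity(k, bits) for k in keys})
mixed_used = len({slot_mixed(k, bits) for k in keys})

print("identity:", identity_used, "slots used")
print("mixed:   ", mixed_used, "slots used")
```

Taking the high bits matters here: the low bits of the product are still as badly distributed as the low bits of the key.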

vhcr · on Dec 11, 2023

Don't use user-supplied integers as keys in dicts or sets in Python:

>>> {i for i in range(10000)}

Takes 0.005s

>>> import sys
>>> {i * sys.hash_info.modulus for i in range(10000)}

Takes 0.76s
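
The reason for the slowdown, as far as I can tell, is that every multiple of `sys.hash_info.modulus` hashes to 0, so all keys collide and each insert has to probe past the earlier ones. A small sketch to check this yourself; I use a smaller `N` than above to keep it quick, and the timings will vary by machine:

```python
import sys
import timeit

M = sys.hash_info.modulus
N = 2_000  # smaller than the 10000 above so the slow case stays quick

colliding = [i * M for i in range(N)]

# Every multiple of the modulus hashes to the same value.
distinct_hashes = {hash(k) for k in colliding}
print("distinct hash values:", len(distinct_hashes))

# Compare building a set from well-spread keys and from colliding keys.
t_fast = timeit.timeit(lambda: set(range(N)), number=1)
t_slow = timeit.timeit(lambda: set(colliding), number=1)
print(f"range keys:     {t_fast:.5f}s")
print(f"colliding keys: {t_slow:.5f}s")
```

The set still ends up with all `N` distinct elements; only the cost of getting there blows up, roughly quadratically in the number of colliding keys.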