Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm aware that Rust has something similiar for things like `std::collections::HashMap`. By default:

> The default hashing algorithm is currently SipHash 1-3, though this is subject to change at any point in the future. While its performance is very competitive for medium sized keys, other hashing algorithms will outperform it for small keys such as integers as well as large keys such as long strings, though those algorithms will typically not protect against attacks such as HashDoS.

https://doc.rust-lang.org/std/collections/struct.HashMap.htm...

However, Rust also lets you pick or implement your own hash algorithm if you want to optimise for your usecase.



Which is why the Rust compiler itself uses a non-cryptographic hash, which takes just 3 x86 instructions and can work on 8 bytes at a time: <https://github.com/rust-lang/rustc-hash/blob/master/src/lib....>


Python 3.11 appears to have switched to SipHash 1-3 for strings, from 2-4, following the lead of Rust and Ruby. https://github.com/python/cpython/issues/73596

However, Python does not use it for integers;

  >>> hash(10)
  10
  >>> hash(100)
  100
  >>> hash(2**61-2) == 2**61-2
  True
  >>> hash(2**61-1)
  0


That’s good though, right? Is there a reason for not using an identity hash (is that the right term?) for integers?


That depends on the hash table implementation and the distribution of the integers.

For the commonly used hash tables with prime size that use modulo to turn the hash code into a slot index, an identity hash for integers is usually fine (unless many integers are multiples of the prime size).

But other hash tables use power-of-two size to replace the modulo operation with a faster bit-and operation. Now an identity hash for integers is much more problematic, e.g. if all integers are multiples of 1000, only 1/8th of the table slots can be used.

The latter kind of hash tables would like all bits in the hash value to be well-distributed; and this is typically not true of the underlying integers. So an additional mixing operation needs to be used. Whether that mixing happens in the hash function or in the hash table depends on the implementation (for some, it's even configurable, e.g. is_avalanching marker in ankerl::unordered_dense).


Don't use user-supplied integers on dicts or sets on Python:

>>> {i for i in range(10000)}

Takes 0.005s

>>> {i * sys.hash_info.modulus for i in range(10000)}

Takes 0.76s




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: