In my experience / tests, it's way easier to write a high-performance open-addressing hash table than a high-performance chaining one.
That being said, chaining is easier to write.
> Chaining also tends to waste a lot less memory.
How so? A typical 32-bit integer or 32-bit float uses 4 bytes, while a typical pointer is 8 bytes. And there's also the internal malloc/new chunk header to consider.
That means, to store a 4-byte integer in a chained hash table, you need 8 bytes (next pointer) + 8 bytes (glibc malloc chunk header) + 4 bytes (the integer/float itself), or 20 bytes total in practice.
If you have, say, 50 integers inside the hashmap, then you'll use 50 nodes * 20 bytes each == 1000+ bytes for the pointers/data alone, plus even more for the table itself. (ex: a table of size 50 needs 50 * 8-byte bucket pointers even when empty, for a total of 1000 bytes + 400 bytes == 1400 bytes or so)
In contrast, each slot of an open-addressing hash table of 4-byte values takes up only 4 bytes. So 50 integers in a hashmap at ~50% load factor (ie: size the table to 128 or so) will be 128 * 4 == 512 bytes.
EDIT: It's because of this huge practical difference in memory used that I'm pretty sure open addressing is so high performance in comparison. When your data structures are half the bytes of the competition or smaller, it's easier to stay in L1 cache, among other such size benefits.
It is not necessary for an allocator to have any hidden overhead inside an allocated block. That would be quite unacceptable for small blocks like a 32 byte block holding four pointer-sized values.
In any case, applications can do their own aggregation for small allocations: allocate the nodes in an array and dole it out from that. It can recycle unused nodes itself.
> In any case, applications can do their own aggregation for small allocations: allocate the nodes in an array and dole it out from that. It can recycle unused nodes itself.
Or you could just write the open-addressing HashMap, which is easier than implementing your own custom small allocation malloc()/free()?
That's the thing. To make chaining competitive against open-addressing hash maps, you've got to bend over backwards, and it's suddenly not easy anymore. Though you're right that different versions of malloc/free could be more efficient. Ex: tcmalloc already has small-allocation handling built in, like you mention.
But even if you erase those 8 bytes from the glibc chunk implementation of malloc, you're _still_ 8 bytes over the open-addressing implementation (you can't get rid of the 8-byte "next" pointer). So all that work, and you're still worse off than the other methodology.
I can write an aggregating allocator in C with my eyes closed, in a tiny number of lines. These days I mostly wouldn't. If the target system thinks it's a good idea to add a header to every 32-byte allocation, that's their problem. No worthwhile malloc implementation does that, other than in a debugging configuration, where it adds "red zones".
The calculations in my other post show that it's not as simple as that next pointer being pure overhead compared to open addressing.
> Nodes don't have to be allocated with malloc (that is actually the worst thing you could possibly do).
I mean... the most obvious place to put the nodes is... inside the table itself. Also known as Open Addressing.
Or what, are you going to implement a 2nd, custom heap algorithm to manage your nodes? I guess there's "buddy allocators", but buddy-allocators aren't exactly free either. Whatever method you're using to keep track of nodes is going to have some level of inefficiency.
Whatever allocation you're doing to get these nodes (above and beyond generic malloc) is time you could have spent writing open addressing instead.
> An open-addressing hash table only performs well up to 50-75% bucket usage. Chaining doesn't care and performs the same up to 100%.
Hardly an apples-to-apples comparison. As I pointed out, 50x uint32 in open addressing at 50% load factor is only ~400 bytes used (100 slots * 4 bytes).
While 100% load factor on 50x uint32 is 1400 bytes used with chaining (plus additional overhead in practice due to malloc fragmentation).
Perhaps we need to start talking with actual code? Here's just a simple idea that's in my brain right now.
#include <vector>

template <class Data>
struct HashNode{
    // Probably should be shared_ptr<HashNode<Data>>, but that's
    // now adding even more inefficiencies like a ref_count per element
    HashNode<Data>* nextPointerChain;
    Data d; // Maybe Data* d if you're sharing it with other data-structures?
};

template <class Data, int size>
struct ChainHashTable{
    HashNode<Data>* theTable[size]; // bucket heads; nullptr means empty chain
};

template <class Data, int size>
struct LinearProbingHashTable{
    Data theTable[size];
    std::vector<bool> occupied; // sized to "size": false means empty, true means occupied
};
nextPointerChain takes up 8 bytes, even if it's nullptr. No other data structure needs to carry the nextPointerChain. If we use shared_ptr<HashNode<Data>> as per typical modern C++, there are even more inefficiencies that I forgot about. EDIT: You probably can get away with unique_ptr, now that I think of it.
----------
Pointers aren't free, btw. If you're sharing the pointer with many different parts of your program, that's definitely one of those things you'll want to start getting rid of. Not only does a pointer cost 8 bytes, it's also _SIGNIFICANTLY_ slower and cache-unfriendly in practice.
L1 cache works by reading data in and around anything you access. If you're only using 8 bytes of a cache line for pointer indirection (8/64), you're wasting 56 bytes of your fetch.
Linear Probing is extremely fast because it tends to blitz through L1 cache thanks to locality of data. Simple Linear Probing (or Robin-hood augmented probing) is probably the fastest, and simplest, approach for modern CPUs.
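To make that concrete, here's a minimal linear-probing sketch for 32-bit keys. It assumes a power-of-two capacity and (unlike the `occupied` bit-vector sketch above) reserves 0 as the empty marker; a real table would also resize when the load factor gets high:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Minimal linear-probing set for nonzero 32-bit keys, to show the flat
// layout: all probing walks contiguous memory, so lookups stay in cache.
struct LinearSet {
    std::vector<uint32_t> slots; // 0 == empty; capacity must be a power of two

    explicit LinearSet(size_t capacity) : slots(capacity, 0) {}

    bool insert(uint32_t key) {
        size_t mask = slots.size() - 1;
        size_t i = (key * 2654435761u) & mask; // Knuth multiplicative hash
        for (size_t probes = 0; probes < slots.size(); ++probes) {
            if (slots[i] == 0)   { slots[i] = key; return true; }
            if (slots[i] == key) { return true; }  // already present
            i = (i + 1) & mask;                    // linear probe, wrap around
        }
        return false; // table full; a real table would grow instead
    }

    bool contains(uint32_t key) const {
        size_t mask = slots.size() - 1;
        size_t i = (key * 2654435761u) & mask;
        for (size_t probes = 0; probes < slots.size(); ++probes) {
            if (slots[i] == 0)   return false;     // hit an empty slot: absent
            if (slots[i] == key) return true;
            i = (i + 1) & mask;
        }
        return false;
    }
};
```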
You still fail to understand what being part of multiple data structures means.
Shared pointers, separate container to check occupancy, suggesting making a container of pointers, all these things suggest you have no idea how to do hybrid data structures.
The nodes contain the data along with multiple pointers in them to chain them to various data structures they are part of (trees, lists, hash tables etc.). The object in question contains state that is several cache lines long.
You cannot just put the nodes inside your buckets, as you don't get to control where the nodes live, and moving them (e.g. on rehash) would invalidate the other data structures. Though I suppose it's a possibility if you know ahead of time how many nodes you're going to need, but in that case you may be able to go for perfect hashing to begin with (unless you're building some kind of fixed-size cache).
A given object for example can be part of several hash maps, based on the various pieces of state it is useful to index by.
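A rough sketch of the hybrid-node shape being described, with one heap object linked into several containers at once via embedded link fields (the `Session` name and fields are purely illustrative):

```cpp
#include <cstdint>

// One object, simultaneously a member of two chained hash tables and an
// LRU list. Each container reaches it through its own embedded links, so
// the object cannot be relocated the way open-addressing slots are.
struct Session {
    Session* hashByIdNext;       // chain link for an id-keyed hash table
    Session* hashByNameNext;     // chain link for a name-keyed hash table
    Session* lruPrev;            // doubly-linked LRU list links
    Session* lruNext;
    uint64_t id;
    char     name[64];
    char     state[256];         // bulky payload, several cache lines
};
```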
In practice you'd use a pool to allocate all your nodes in blocks. There is no per-node overhead for this allocation strategy.
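A minimal sketch of that pool strategy: carve nodes out of big slabs and recycle freed ones through an intrusive free list, so a node costs exactly sizeof(Node) with no per-node header. Assumes trivially-constructible nodes at least one pointer wide; names are illustrative:

```cpp
#include <cstddef>
#include <vector>
#include <new>

template <class Node>
struct NodePool {
    static constexpr size_t SLAB = 256; // nodes per slab (tunable)

    std::vector<Node*> slabs;           // owned raw blocks
    Node* freeList = nullptr;           // recycled nodes, linked in-place
    size_t used = 0;                    // nodes handed out of the last slab

    Node* alloc() {
        if (freeList) {                 // reuse a recycled node first
            Node* n = freeList;
            freeList = *reinterpret_cast<Node**>(n);
            return n;
        }
        if (slabs.empty() || used == SLAB) { // need a fresh slab
            slabs.push_back(static_cast<Node*>(::operator new(SLAB * sizeof(Node))));
            used = 0;
        }
        return slabs.back() + used++;
    }

    void free(Node* n) {                // store the free-list link inside
        *reinterpret_cast<Node**>(n) = freeList; // the dead node itself
        freeList = n;
    }

    ~NodePool() {
        for (Node* s : slabs) ::operator delete(s);
    }
};
```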
Note that when Data is itself a pointer, LinearProbingHashTable only uses 8 bytes per slot, while ChainHashTable still needs the nextPointerChain on top, taking up 16 bytes per node.
The *2 is for 50% load factor on the linear table. But linear still wins if we're at, say, 75% (simple linear probing) or 95% load factor (which is doable with Robin Hood).
Agreed - I do mention in the post that open addressing is impractical when using intrusive linked lists, which are common in low level data structures.
One situation where I painfully learned that it doesn't is when you iterate over one hashtable to fill another. To defend against that, one needs to add some per-hashtable randomized state into the hash IV.
Through bad experience I also learned that you need to grow not just due to fill factor, but also due to disproportionate chain length, even with the above defense in place.
> To defend against that one needs to add some per-hashtable randomized state into the hash IV.
You should almost always be doing this anyway, since most hash tables use user-provided data as keys, which opens up a well-known avenue for denial-of-service attacks (hash flooding / hashDoS):
I think that's often fixed using a per-run random seed, rather than a per-table seed. The patch in the Python bug referenced from that article does seem to do that, for example.
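One way to sketch the per-table variant: draw a random seed when the table is constructed and fold it into every hash. The mixer here is a splitmix64-style finalizer; the constants and seed source are illustrative, not a vetted SipHash replacement:

```cpp
#include <cstdint>
#include <random>

// Per-table seeded hash: two tables get different seeds, so an attacker
// (or an innocent table-to-table copy) can't line up collision patterns.
struct SeededHash {
    uint64_t seed;

    SeededHash() {
        std::random_device rd;                     // per-table entropy
        seed = (uint64_t(rd()) << 32) | rd();
    }

    uint64_t operator()(uint64_t key) const {
        uint64_t x = key ^ seed;                   // fold in the secret
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;  // splitmix64 finalizer
        x ^= x >> 27; x *= 0x94d049bb133111ebull;  // (a bijective mixer)
        x ^= x >> 31;
        return x;
    }
};
```

Since the finalizer is a bijection, distinct keys never collide before the table-size reduction, while the seed scrambles which bucket each key lands in.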
> (that is actually the worst thing you could possibly do)
Nonsense, there are many worse things you could do. For example, you could allocate nodes with mmap, at 4KB per node. (Which is a thing I've actually done in the context of cursed bootstrap code where malloc isn't available and having more than a handful of (in that case doubly-linked-list) nodes means the environment is deranged anyway.)