In my experience / tests, it's way easier to write a high-performance open-addressing hash table than a high-performance chaining one.
That being said, chaining is easier to write.
> Chaining also tends to waste a lot less memory.
How so? A typical 32-bit integer or 32-bit float uses 4 bytes, while a typical pointer is 8 bytes. And there's also the internal malloc/new chunk header to consider.
That means, to store a 4-byte integer in a chained hash table, you need 8 bytes (next pointer) + 8 bytes (glibc malloc chunk header) + 4 bytes (the integer/float itself), or 20 bytes total in practice.
If you have, say, 50 integers inside the hashmap, then you'll use 50 nodes * 20 bytes each == 1000+ bytes for the pointers/data alone, plus even more for the table itself. (ex: a table of size 50 needs 50 * 8-byte bucket pointers even when empty, for a total of 1000 bytes + 400 bytes == 1400 bytes or so)
In contrast, each slot of an open-addressing hash table of 4-byte values takes up only 4 bytes. So 50 integers in a hashmap at ~50% load factor (ie: size the table to 128 or so) will be 128 * 4 == 512 bytes.
EDIT: It's because of this huge practical difference in memory used that I'm pretty sure open addressing is so high performance in comparison. When your data structures are half the bytes of the competition or smaller, it's easier to stay in L1 cache, among other such size benefits.
It is not necessary for an allocator to have any hidden overhead inside an allocated block. That would be quite unacceptable for small blocks like a 32 byte block holding four pointer-sized values.
In any case, applications can do their own aggregation for small allocations: allocate the nodes in an array and dole it out from that. It can recycle unused nodes itself.
> In any case, applications can do their own aggregation for small allocations: allocate the nodes in an array and dole it out from that. It can recycle unused nodes itself.
Or you could just write the open-addressing HashMap, which is easier than implementing your own custom small allocation malloc()/free()?
That's the thing. To make chaining competitive against open-addressing hash maps, you've got to bend over backwards, and it's suddenly not easy anymore. Though you're right that different versions of malloc/free could be more efficient. Ex: tcmalloc already has small-allocation handling built in, like you mention.
But even if you erase those 8 bytes from the glibc chunk implementation of malloc, you're _still_ 8 bytes over the open-addressing implementation (you can't get rid of the 8-byte "next" pointer). So all that work, and you're still worse off than the other methodology.
I can write an aggregating allocator in C with my eyes closed, in a tiny number of lines. These days I mostly wouldn't. If the target system thinks it's a good idea to add a header to every 32-byte allocation, that's their problem. No worthwhile malloc implementation does that, other than in a debugging configuration, where it adds "red zones".
The calculations in my other post show that it's not as simple as that next pointer being pure overhead compared to open addressing.
> Nodes don't have to be allocated with malloc (that is actually the worst thing you could possibly do).
I mean... the most obvious place to put the nodes is... inside the table itself. Also known as Open Addressing.
Or what, are you going to implement a 2nd, custom heap algorithm to manage your nodes? I guess there's "buddy allocators", but buddy-allocators aren't exactly free either. Whatever method you're using to keep track of nodes is going to have some level of inefficiency.
Whatever allocation you're doing to get these nodes (above and beyond generic malloc) is time you could have spent writing open addressing instead.
> An open-addressing hash table only performs well up to 50-75% bucket usage. Chaining doesn't care and performs the same up to 100%.
Hardly an apples-to-apples comparison. As I pointed out, 50x uint32 in open addressing at 50% load factor is only ~400 bytes used (100 slots * 4 bytes).
While 100% load factor on 50x uint32 is 1400 bytes used with chaining (plus additional overhead in practice due to malloc fragmentation).
Perhaps we need to start talking with actual code? Here's just a simple idea that's in my brain right now.
#include <vector>

template <class Data>
struct HashNode{
    // Probably should be shared_ptr<HashNode<Data>>, but that's
    // now adding even more inefficiencies like a ref_count per element
    HashNode<Data>* nextPointerChain;
    Data d; // Maybe Data* d if you're sharing it with other data-structures?
};

template <class Data, int size>
struct ChainHashTable{
    HashNode<Data>* theTable[size]; // bucket heads; nullptr means empty chain
};

template <class Data, int size>
struct LinearProbingHashTable{
    Data theTable[size];
    std::vector<bool> occupied; // sized to "size": false means empty, true means occupied
};
nextPointerChain takes up 8 bytes, even if it's nullptr. No other data structure needs to carry the nextPointerChain. If we use shared_ptr<HashNode<Data>> as per typical modern C++, there are even more inefficiencies that I forgot about. EDIT: You probably can get away with unique_ptr, now that I think of it.
----------
Pointers aren't free, btw. If you're sharing the pointer with many different parts of your program, that's definitely one of those things you'll want to start getting rid of. Not only does a pointer cost 8 bytes, it's also _SIGNIFICANTLY_ slower and cache-unfriendly in practice.
L1 cache works by reading data in and around anything you access. If you're only using 8 bytes of a cache line for pointer indirection (8/64), you're wasting 56 bytes of your fetch.
Linear Probing is extremely fast because it tends to blitz through L1 cache thanks to locality of data. Simple Linear Probing (or Robin-hood augmented probing) is probably the fastest, and simplest, approach for modern CPUs.
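To make that concrete, here's a minimal linear-probing sketch for 32-bit keys. It assumes a power-of-two capacity and (unlike the `occupied` bit-vector sketch above) reserves 0 as the empty marker; a real table would also resize when the load factor gets high:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Minimal linear-probing set for nonzero 32-bit keys, to show the flat
// layout: all probing walks contiguous memory, so lookups stay in cache.
struct LinearSet {
    std::vector<uint32_t> slots; // 0 == empty; capacity must be a power of two

    explicit LinearSet(size_t capacity) : slots(capacity, 0) {}

    bool insert(uint32_t key) {
        size_t mask = slots.size() - 1;
        size_t i = (key * 2654435761u) & mask; // Knuth multiplicative hash
        for (size_t probes = 0; probes < slots.size(); ++probes) {
            if (slots[i] == 0)   { slots[i] = key; return true; }
            if (slots[i] == key) { return true; }  // already present
            i = (i + 1) & mask;                    // linear probe, wrap around
        }
        return false; // table full; a real table would grow instead
    }

    bool contains(uint32_t key) const {
        size_t mask = slots.size() - 1;
        size_t i = (key * 2654435761u) & mask;
        for (size_t probes = 0; probes < slots.size(); ++probes) {
            if (slots[i] == 0)   return false;     // hit an empty slot: absent
            if (slots[i] == key) return true;
            i = (i + 1) & mask;
        }
        return false;
    }
};
```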
You still fail to understand what being part of multiple data structures means.
Shared pointers, separate container to check occupancy, suggesting making a container of pointers, all these things suggest you have no idea how to do hybrid data structures.
The nodes contain the data along with multiple pointers in them to chain them to various data structures they are part of (trees, lists, hash tables etc.). The object in question contains state that is several cache lines long.
You cannot just put the nodes inside your buckets, as you don't get to control where the nodes live, and moving them (e.g. on rehash) would invalidate the other data structures. Though I suppose it's a possibility if you know ahead of time how many nodes you're going to need, but in that case you may be able to go for perfect hashing to begin with (unless you're building some kind of fixed-size cache).
A given object for example can be part of several hash maps, based on the various pieces of state it is useful to index by.
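A rough sketch of the hybrid-node shape being described, with one heap object linked into several containers at once via embedded link fields (the `Session` name and fields are purely illustrative):

```cpp
#include <cstdint>

// One object, simultaneously a member of two chained hash tables and an
// LRU list. Each container reaches it through its own embedded links, so
// the object cannot be relocated the way open-addressing slots are.
struct Session {
    Session* hashByIdNext;       // chain link for an id-keyed hash table
    Session* hashByNameNext;     // chain link for a name-keyed hash table
    Session* lruPrev;            // doubly-linked LRU list links
    Session* lruNext;
    uint64_t id;
    char     name[64];
    char     state[256];         // bulky payload, several cache lines
};
```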
In practice you'd use a pool to allocate all your nodes in blocks. There is no per-node overhead for this allocation strategy.
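A minimal sketch of that pool strategy: carve nodes out of big slabs and recycle freed ones through an intrusive free list, so a node costs exactly sizeof(Node) with no per-node header. Assumes trivially-constructible nodes at least one pointer wide; names are illustrative:

```cpp
#include <cstddef>
#include <vector>
#include <new>

template <class Node>
struct NodePool {
    static constexpr size_t SLAB = 256; // nodes per slab (tunable)

    std::vector<Node*> slabs;           // owned raw blocks
    Node* freeList = nullptr;           // recycled nodes, linked in-place
    size_t used = 0;                    // nodes handed out of the last slab

    Node* alloc() {
        if (freeList) {                 // reuse a recycled node first
            Node* n = freeList;
            freeList = *reinterpret_cast<Node**>(n);
            return n;
        }
        if (slabs.empty() || used == SLAB) { // need a fresh slab
            slabs.push_back(static_cast<Node*>(::operator new(SLAB * sizeof(Node))));
            used = 0;
        }
        return slabs.back() + used++;
    }

    void free(Node* n) {                // store the free-list link inside
        *reinterpret_cast<Node**>(n) = freeList; // the dead node itself
        freeList = n;
    }

    ~NodePool() {
        for (Node* s : slabs) ::operator delete(s);
    }
};
```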
Note that when Data is itself a pointer, LinearProbingHashTable only uses 8 bytes per slot, while ChainHashTable still needs the nextPointerChain on top, taking up 16 bytes per node.
The *2 is for 50% load factor on the linear table. But linear still wins if we're at, say, 75% (simple linear probing) or 95% load factor (which is doable with Robin Hood).
Agreed - I do mention in the post that open addressing is impractical when using intrusive linked lists, which are common in low level data structures.
One situation where I painfully learned that it doesn't is when you iterate over one hashtable to fill another. To defend against that, one needs to add some per-hashtable randomized state into the hash IV.
Through bad experience I also learned that you need to grow not just due to fill factor, but also due to disproportionate chain length, even with the above defense in place.
> To defend against that one needs to add some per-hashtable randomized state into the hash IV.
You should almost always be doing this anyway, since most hash tables use user-provided data as keys, which opens up a well-known avenue for denial-of-service attacks (hash flooding / hashDoS):
I think that's often fixed using a per-run random seed, rather than a per-table seed. The patch in the Python bug referenced from that article does seem to do that, for example.
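One way to sketch the per-table variant: draw a random seed when the table is constructed and fold it into every hash. The mixer here is a splitmix64-style finalizer; the constants and seed source are illustrative, not a vetted SipHash replacement:

```cpp
#include <cstdint>
#include <random>

// Per-table seeded hash: two tables get different seeds, so an attacker
// (or an innocent table-to-table copy) can't line up collision patterns.
struct SeededHash {
    uint64_t seed;

    SeededHash() {
        std::random_device rd;                     // per-table entropy
        seed = (uint64_t(rd()) << 32) | rd();
    }

    uint64_t operator()(uint64_t key) const {
        uint64_t x = key ^ seed;                   // fold in the secret
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;  // splitmix64 finalizer
        x ^= x >> 27; x *= 0x94d049bb133111ebull;  // (a bijective mixer)
        x ^= x >> 31;
        return x;
    }
};
```

Since the finalizer is a bijection, distinct keys never collide before the table-size reduction, while the seed scrambles which bucket each key lands in.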
> (that is actually the worst thing you could possibly do)
Nonsense, there are many worse things you could do. For example, you could allocate nodes with mmap, at 4KB per node. (Which is a thing I've actually done in the context of cursed bootstrap code where malloc isn't available and having more than a handful of (in that case doubly-linked-list) nodes means the environment is deranged anyway.)