So this isn't entirely true; binaries above a certain size have their own refcounted, process-independent GC going on in the BEAM; I wonder why the VM doesn't use a similar system for highly mobile data structures (copy to a common refcounted "scratchpad" on first send).
binaries never contain references to anything else, and therefore can never be part of a reference loop. so it's safe to toss them out into a shared space and use reference counting to track them. you don't want to do it with small binaries, as the overhead of shared-allocating / locking / incrementing / locking / decrementing would probably be worse than just copying the small binary in the first place.
if you put complex reference-bearing data out in the void, whatever memory arena allocated it may contain data that it references, and now you have to track those references and treat all of the shared items as new roots into your data. and if some other memory arena receives it and plucks out a value, do you only then copy it, or do you reference back to the original? if you send a message back and form a loop, the gc for either process now has to walk through both, and you've destroyed your cheap gc cycles. if you shunt all of the data into shared memory to avoid cross-gc reference loops, you've just created one big shared gc that has all the problems of big shared gcs.
the per-process small gcs with full copying between them mean you can do things like just dropping the arena when the process dies without even checking liveness for anything in it (you'll need to run destructors if anything needs cleanup (like references to shared large binaries needing a decrement), but you can keep track of the items that need that in a bitmap or something, and avoiding the trace is a win on its own).
Minor nit: shared allocation, counter incrementing, and counter decrementing can all be done lock-free. They'd still need memory fence operations (and retries in case of contention), and the associated performance hits, but not actual locking.
LWN.net just published an article where a comment (https://lwn.net/Articles/849239/) described how there are no truly lock-free data structures on modern CPUs:
A couple of decades of writing concurrent algorithms has taught me that scalability is really defined by the frequency the concurrent algorithm accesses/updates shared data, not whether the software algorithm is considered "lockless" or not.
Nobody is claiming that memory fences are free or that livelock isn't possible with lock-free algorithms. (Except in corner cases, such as carefully constructed RISC-V code, where the architectural specification does guarantee forward progress in short, tight LL/SC loops with proper instruction alignment.)
The LWN comment mentions falling back to spinlocks to improve worst-case performance of some lock-free algorithms, but that criticism doesn't apply to atomic increments/decrements on multi-issue CPUs: the latency of the contended memory operation completely hides the cost of the add/subtract itself, so the atomic add/subtract is already no worse than a spinlock. (Also, the LWN comment doesn't apply where the algorithm can be written within the guaranteed-progress corner cases of the architecture, where such guarantees exist.)
It would seem off-heap messages could be a base for this. I've not gone all the way into the weeds on those, but it seems like it could be possible to start by sending a whole message to another process by bumping a refcount and adding a reference to that process's mailbox, without needing to make a new copy.
If that works, and is useful, you could extend it by bumping the refcount and adding a new type of subreference to a process's mailbox.
I don't think you'd want to send something referencing multiple off-heap messages, or a mix of new terms and a reference (although maybe, who knows), because that would be a lot of complexity. You'd also want to be sensitive to subreference memory leaks, like can happen with binaries where a binary is large but only small subreferences stick around, leading to lots of wasted space (this is extra fun when you put a small sub-binary into ets/mnesia and it lives forever (well, until you restart the beam, but that can be a long time)).
This is sort of what Pony did with its ORCA-based GC. It has some tricky edge cases to optimize in practice, but it can be made to work especially well if data is immutable.
The analog for that would be shared literals, which are not copied (super important for reducing small-map overhead when the constructor initializes a fixed set of keys, like Elixir structs, since the key index is shared). The persistent_term module allows any value to be registered as such at runtime and avoids a lot of the caveats that came with mochi_global and the like. While they aren't refcounted, it can definitely help cut down on copying in my experience.