Even for the algorithms that look easy on paper (nothing easier than a matrix multiplication, right?), the real issue is memory access. A huge amount of work went into making these libraries cache-friendly. Much less effort went into making them NUMA-friendly, and multicore support was often added after the fact, probably not as efficiently as it could have been.
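To make the cache-friendly part concrete, here is a rough sketch of loop tiling (blocking), one of the standard tricks. It is not taken from any particular library: the name `matmul_tiled` and the `tile` parameter are made up for illustration, and pure Python will not show the speedup. It only shows the loop structure that tuned kernels build on.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x m) by b (m x p), both lists of lists, tile by tile.

    Hypothetical illustration of loop tiling; real libraries do this in
    compiled code with tile sizes tuned to the cache hierarchy.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops walk over tiles, the three inner loops
    # do the ordinary multiply-accumulate inside one tile.
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # In a compiled language, finishing all work on these small
                # blocks of a, b and c before moving on keeps them in cache.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

The arithmetic is exactly the same as the naive triple loop; only the order in which memory is touched changes, which is the whole point.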

And then there are many parts of the algorithms, deep in things like the various matrix decompositions, that are difficult to parallelise because of non-trivial data dependencies. It’s easy to write some code to pivot a matrix; it’s very hard to do it in an efficient and scalable way. And because all the higher-level routines depend on them, they have a large effect on overall performance.
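For a feel of where the dependencies come from, here is a minimal textbook-style LU factorisation with partial pivoting. Again a sketch under assumptions: `lu_partial_pivot` is a hypothetical name, the matrix is a plain list of lists, and no real library is written this way. The thing to look at is the outer loop: each step has to pick its pivot from values that the previous step just updated.

```python
def lu_partial_pivot(a):
    """Return (perm, lu) for a square matrix given as a list of lists.

    lu holds the multipliers of L below the diagonal and U on and above
    it; row i of the factorised matrix is row perm[i] of the input.
    Hypothetical illustration, not a library routine.
    """
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # The pivot search reads column k as left by all earlier steps,
        # so step k cannot start before step k-1 has finished.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        # Within one step the row updates are independent of each other,
        # which is the part that parallelises easily.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            factor = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= factor * lu[k][j]
    return perm, lu
```

The easy parallelism is inside one step; the pivot choice serialises the steps themselves, and that is the bit that is hard to make scale.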