
I would love to be corrected: do you have some specific examples or some underlying principles?

My professional and academic background is in numerical analysis and scientific computing (including BLAS/LAPACK level implementations), but I admit I haven't done deep numerical implementations in years.



Anything involving pivoting for stability isn’t trivial to parallelize. Plain old Gaussian elimination on huge sparse matrices doesn’t scale in parallel very well.
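A minimal sketch of the dependency problem, assuming a dense, unblocked, row-major LU with partial pivoting (a toy for illustration, not how LAPACK's dgetrf is actually structured):

    /* Toy LU factorization with partial pivoting, in place, no blocking,
       no error handling. Only meant to show the dependency structure. */
    #include <math.h>
    #include <stddef.h>

    void lu_partial_pivot(double *A, int *piv, size_t n) {
        for (size_t k = 0; k < n; k++) {
            /* Pivot search: must scan the whole remaining column and commit
               the row swap before ANY update of step k can start, so every
               step has a serial prefix that depends on all previous steps. */
            size_t p = k;
            for (size_t i = k + 1; i < n; i++)
                if (fabs(A[i*n + k]) > fabs(A[p*n + k])) p = i;
            piv[k] = (int)p;
            if (p != k)
                for (size_t j = 0; j < n; j++) {
                    double t = A[k*n + j]; A[k*n + j] = A[p*n + j]; A[p*n + j] = t;
                }

            /* The trailing update is the parallel-friendly part... */
            for (size_t i = k + 1; i < n; i++) {
                A[i*n + k] /= A[k*n + k];
                for (size_t j = k + 1; j < n; j++)
                    A[i*n + j] -= A[i*n + k] * A[k*n + j];
            }
            /* ...but it cannot overlap the next pivot search without
               lookahead tricks, which is exactly what makes this hard. */
        }
    }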

At the other end of the spectrum, getting a matrix-matrix multiply to run fast isn’t easy either. That’s what necessitated the kind of approach the BLIS authors adopted. On paper it’s easy, but actually getting it to run fast on real hardware isn’t.


Even for the algorithms that look easy on paper (nothing easier than a matrix multiplication, right?), the real issue is memory access. A huge amount of work went into making these libraries cache-friendly. Not nearly as much effort was put into making them NUMA-friendly, and multi-core support was often added after the fact, probably not as efficiently as it could have been.
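As a rough illustration of the cache-friendliness point, here is a loop-blocked C += A*B sketch; the tile size BS is an arbitrary placeholder, and real libraries add tile packing, vectorized microkernels, and per-cache-level blocking that this deliberately omits:

    /* Cache-blocked matrix multiply sketch: C += A*B, all n x n, row-major. */
    #include <stddef.h>

    #define BS 64  /* assumed tile size; tune so three BS x BS tiles fit in cache */

    static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    void matmul_blocked(const double *A, const double *B, double *C, size_t n) {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    /* Work on one tile of C at a time so the A and B tiles
                       stay resident in cache while they are reused. */
                    for (size_t i = ii; i < min_sz(ii + BS, n); i++)
                        for (size_t k = kk; k < min_sz(kk + BS, n); k++) {
                            double a = A[i*n + k];
                            for (size_t j = jj; j < min_sz(jj + BS, n); j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
    }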

And then there are many parts of the algorithms, deep inside things like the various matrix decompositions, that are difficult to parallelise because of non-trivial data dependencies. It’s easy to write some code to pivot a matrix; it’s very hard to do it in an efficient and scalable way. And because all the higher-level routines depend on them, they have a large effect on overall performance.


Embarrassingly parallel means "single task, multiple data" or "multiple tasks, single data". Canonical example: Monte Carlo simulations.
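For contrast, a Monte Carlo estimate of pi really is embarrassingly parallel: every sample is independent and the only coordination is one reduction at the end. A minimal OpenMP sketch (sample count and seeding scheme are arbitrary choices; rand_r is POSIX; compile with -fopenmp):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        const long samples = 10 * 1000 * 1000;
        long hits = 0;

        #pragma omp parallel reduction(+:hits)
        {
            /* Per-thread RNG state: no sharing, no ordering constraints. */
            unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
            #pragma omp for
            for (long i = 0; i < samples; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x*x + y*y <= 1.0) hits++;
            }
        }
        printf("pi ~= %f\n", 4.0 * hits / samples);
        return 0;
    }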

With that out of the way, basic linear algebra operations that require sophisticated algorithms and are not "embarrassingly parallel": matrix multiplication, matrix inversion, matrix decompositions (SVD, QR, etc.). Some of these algorithms fall under BLAS and others under LAPACK.


The BLIS parallelism strategy is written up in at least one of the references they provide. (Serial OpenBLAS is typically a bit faster than BLIS, but it has (had?) problems with multi-threading.) See also the distributed equivalents (ScaLAPACK and competitors) for what may be of more interest in serious HPC.


Using just basic large dense square matrix multiplication on a modern CPU as an example, and ignoring numerical stability: embarrassingly parallel implementations have high bandwidth-to-compute ratios, bottlenecking your results by 10x - 1000x. Embarrassingly parallelizing cache-friendlier solutions is prone to stragglers and other scheduling issues. And if you're using asymptotically faster algorithms like Strassen's, it's at least a little nontrivial to keep the few largest recursive layers from dominating the computation and ending up not much faster than other parallel algorithms.
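To put rough numbers on the bandwidth-to-compute point, here is a back-of-the-envelope roofline check; the peak FLOP rate and memory bandwidth below are assumed figures for illustration, not measurements of any particular CPU:

    #include <stdio.h>

    int main(void) {
        double peak_gflops = 1000.0;   /* assumed: 1 TFLOP/s FP64 across all cores */
        double mem_gbs     = 50.0;     /* assumed: 50 GB/s sustained memory bandwidth */

        /* Naive inner product: roughly 2 flops per 8-byte element of B
           streamed from memory when B misses cache. */
        double naive_intensity  = 2.0 / 8.0;              /* flops per byte */
        double needed_intensity = peak_gflops / mem_gbs;  /* flops/byte to be compute-bound */

        printf("intensity needed to hit peak: %.1f flops/byte\n", needed_intensity);
        printf("naive matmul intensity:       %.2f flops/byte\n", naive_intensity);
        printf("=> memory-bound slowdown of roughly %.0fx vs peak\n",
               needed_intensity / naive_intensity);
        return 0;
    }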

As soon as you start mixing numerical stability concerns (data-dependent operation reordering) into linear algebra routines -- arguably one of the main points of LAPACK -- parallelization necessarily gets trickier, because you either need a good enough heuristic to ignore those reorderings or you need data-dependent scheduling without too many pathologies.


Try to implement a simple matmul and you’ll understand. You can start with a naive one as a baseline. Then a few standard tricks can be used to speed it up. Then compare that to a standard BLAS call, and you’ll find that even with those tricks it is nowhere close to off-the-shelf BLAS libraries.

But from this exercise alone, knowing the tricks you already used, you can see how un-embarrassingly parallel this task is (frankly, if it were truly embarrassingly parallel, then `#pragma omp for` alone would bring you close to the best possible performance).
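For reference, this is roughly the naive-plus-pragma baseline being described; it parallelizes cleanly over rows of C yet still sits far from a tuned dgemm, because each thread is typically memory-bound:

    /* Naive matmul baseline with a single OpenMP pragma: C = A*B, row-major. */
    #include <stddef.h>

    void matmul_naive_omp(const double *A, const double *B, double *C, size_t n) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                /* Strided access to B thrashes the cache for large n,
                   so adding threads mostly adds contention for memory. */
                for (size_t k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }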

I don’t think your prof got it wrong; it’s just that you misunderstood them.




