BLIS: A BLAS-like framework for basic linear algebra routines (github.com/flame)
156 points by ogogmad on Jan 24, 2024 | 45 comments


The real money shot is here: https://github.com/flame/blis/blob/master/docs/Performance.m...

It seems that the selling point is that BLIS does multi-core quite well. I am especially impressed that it does as well as the highly optimized Intel MKL on Intel CPUs.

I do not see the selling point of BLIS-specific APIs, though. The whole point of having an open BLAS API standard is that numerical libraries should be drop-in replaceable, so when a new library (such as BLIS here) comes along, one could just re-link the library and reap the performance gain immediately.

What is interesting is that numerical algebra work, by nature, is mostly embarrassingly parallel, so it should not be too difficult to write multi-core implementations. And yet, BLIS here performs so much better than some other industry-leading implementations on multi-core configurations. So the question is not why BLIS does so well; the question is why some other implementations do so poorly.


> numerical algebra work [...] is mostly embarrassingly parallel

It's the exact opposite: most numerical linear algebra is _not_ embarrassingly parallel, and it takes quite an effort to code properly.

That is why BLAS/LAPACK is popular and there are few competing implementations.


I would love to be corrected: do you have some specific examples or some underlying principles?

My professional and academic background is in numerical analysis and scientific computing (including BLAS/LAPACK level implementations), but I admit I haven't done deep numerical implementations in years.


Anything involving pivoting for stability isn’t trivial to parallelize. Plain old Gaussian elimination on huge sparse matrices doesn’t scale in parallel very well.

At the other end of the spectrum, getting a matrix-matrix multiply to run fast isn’t easy either. It’s what necessitated the kind of approach the authors of BLIS adopted. On paper it’s easy, but actually getting it to run fast on a computer isn’t.


Even for the algorithms that look easy on paper (nothing easier than a matrix multiplication, right?), the real issues are memory accesses. A huge amount of work went into making these libraries cache-friendly. Not so much effort was put in making them NUMA-friendly and multi core was often added after the fact, probably not as efficiently as it could have been.

And then there are parts deep inside the algorithms, such as the various matrix decompositions, that are difficult to parallelise because of non-trivial data dependencies. It's easy to write some code to pivot a matrix; it's very hard to do it in an efficient and scalable way. And because all the higher-level routines depend on these pieces, they have a large effect on overall performance.


Embarrassingly parallel means "single task, multiple data" or "multiple tasks, single data", with essentially no communication between the parallel pieces. Canonical example: Monte Carlo simulations.

With that out of the way, basic linear algebra operations that require sophisticated algorithms and are not "embarrassingly parallel": matrix multiplication, matrix inversion, matrix decompositions (SVD, QR, etc.). Some of these fall under BLAS and others under LAPACK.


The BLIS parallelism strategy is written up in at least one of the references they provide. (Serial OpenBLAS is typically a bit faster than BLIS, but it has (had?) problems with multi-threading.) See also the distributed equivalents (ScaLAPACK and competitors) for what may be of more interest in serious HPC.


Take basic large dense square matrix multiplication on a modern CPU as an example, ignoring numerical stability. Naive embarrassingly parallel implementations have high bandwidth-to-compute ratios, which can cost you 10x - 1000x in throughput. Embarrassingly parallelizing cache-friendlier solutions is prone to stragglers and other scheduling issues. And if you use asymptotically faster algorithms like Strassen's, it's at least a little nontrivial to keep the few largest recursive layers from dominating the computation, leaving you not much faster than other parallel algorithms.

As soon as you start mixing numerical stability (data-dependent operation reordering) concerns into linear algebra routines -- arguably one of the main points of LAPACK -- parallelization necessarily gets trickier, because you either need a good enough heuristic to ignore those reorderings or you also need data-dependent scheduling without too many pathologies.


Try to implement a simple matmul and you'll understand. Start with a naive one as a baseline, then apply a few standard tricks to speed it up, and compare that to a standard BLAS call. You'll find that even with those tricks it is nowhere close to off-the-shelf BLAS libraries.

But from this exercise alone, knowing the tricks you already used, you can see how un-embarrassingly parallel this task is (frankly, if it were truly embarrassingly parallel, then `#pragma omp parallel for` should already bring you close to the best possible performance).
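To make the baseline concrete, here's a minimal sketch of the naive version with the OpenMP pragma mentioned above (the layout and the lack of any blocking are illustrative; a tuned BLAS is typically an order of magnitude or more faster than this even single-threaded):

    /* naive_dgemm.c - compile with: cc -O3 -fopenmp -c naive_dgemm.c
     * Naive baseline: C = A*B for n x n row-major matrices.
     * If dense matmul were truly embarrassingly parallel, the pragma alone
     * would get close to peak; in practice this is memory-bound and far
     * from what a tuned BLAS achieves. */
    void naive_dgemm(int n, const double *A, const double *B, double *C)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
        }
    }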

I don’t think your prof got it wrong, it is just that you misunderstood them.


BLIS mixed precision interfaces seem quite interesting and might be a good reason to expand on the BLAS api.

If I recall correctly, MKL also has interfaces that allow different array orders (row order or column order).


I can't comment on BLIS-specific interfaces as I do not know them, but the BLAS standards (including the MKL API) cover both different precision levels and matrix representations.


By mixed precision, I mean something like AB+C where A and B are single and C is double. That way you don’t have to make a double precision copy of A and B.
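A plain-C sketch of just the semantics (this is not the BLIS interface, only an illustration of why no double-precision copies of A and B are needed):

    /* Conceptual mixed-precision C += A*B: A and B stay single precision,
     * C and the accumulator are double, so A and B are never upcast and
     * copied. Row-major, m x k times k x n; illustrative, not tuned. */
    void mixed_gemm(int m, int n, int k,
                    const float *A, const float *B, double *C)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;               /* accumulate in double */
                for (int p = 0; p < k; p++)
                    acc += (double)A[i * k + p] * (double)B[p * n + j];
                C[i * n + j] += acc;
            }
    }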


Many of the algorithms in BLAS are not easily parallelized. For example, a QR factorization is an inherently sequential algorithm. Optimizing BLAS performance comes mainly from rewriting the sequential algorithm into larger blocks so as to access memory efficiently. As Jim Demmel is fond of saying, floating point operations are cheap; memory movement is expensive.


QR decomposition isn’t in BLAS, you’re probably thinking of LAPACK.


Holy cow, my main takeaway is that MKL is complete garbage on AMD parts, and that isn't an accident. This makes me think I need to go back and relink some programs against BLIS or OpenBLAS for AMD systems... Even though we switched our cluster hardware primarily to AMD, it never occurred to us that MKL would run at half the throughput of open implementations.


You can override the detection function and get the faster Intel AVX kernels:

https://danieldk.eu/Posts/2020-08-31-MKL-Zen

Some back history: they used to fall back to (slow) SSE kernels when a non-Intel CPU was detected. Over time, they started adding kernels for Zen, but the last time I tested, the Zen kernels were still slower than the AVX kernels. So it still pays off to override the Intel CPU detection.

It looks like these benchmarks use MKL 2020 update 3. If I recall correctly, this version did not have the Zen sgemm kernels yet. A newer MKL version would perform much better and if you disable Intel detection, MKL would be competitive on Zen.
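For anyone who wants to try this, the workaround from that post boils down to preloading a one-function shim. The symbol name below is taken from that write-up; it is an internal, unsupported MKL detail and may change between MKL versions:

    /* fakeintel.c - make MKL's vendor check report an Intel CPU.
     * Build:  cc -shared -fPIC -o libfakeintel.so fakeintel.c
     * Run:    LD_PRELOAD=./libfakeintel.so ./your_mkl_program
     * mkl_serv_intel_cpu_true is an internal MKL symbol (per the post
     * above), not a supported interface. */
    int mkl_serv_intel_cpu_true(void)
    {
        return 1;
    }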


That sort of support has come and gone in different versions, though I couldn't tell you which. But, really, why bother? There is too much mythology around Intel stuff.


I don't understand why people use MKL on AMD hardware rather than AMD's (current) library, which is basically upstream BLIS for those architectures; their old one from the Opteron era wasn't good. OpenBLAS has been largely competitive with MKL even on Intel anyway over many years, and the MKL licence used not to allow the sort of use people made of it. MKL is infinitely slower than OpenBLAS/BLIS when averaged over the hardware relevant to HPC, like the POWER system I support, where it doesn't run at all.


Numerics code is annoying to maintain, and most of us have code bases that already use MKL. Intel's compilers make it easy and quick to use, whereas OpenBLAS was always a headache in my experience. Plus, AMD hardware only became competitive in the past few years, so it hasn't mattered.


I don't follow. Assuming you're talking about BLAS/LAPACK (libflame is the LAPACK equivalent built on BLIS), I don't see why the implementation matters, and if you're dependent on MKL for some reason, you can't run on some of the biggest and best HPC systems, amongst others. You can expect BLAS and LAPACK implementations to be ABI-compatible (via a common Fortran 77 ABI), so even choosing at run time is an LD_PRELOAD away. Better, it's an ldconfig, alternatives, or LD_LIBRARY_PATH away with a sensible policy like Debian's. AMD was competitive with Opteron and Bulldozer for us long ago, even if their BLAS (done by NAG?) wasn't.


> I do not see the selling point of BLIS-specific APIs

BLAS assumes your matrices are column-major (CBLAS adds row-major). BLIS uses general stride instead (both rows and columns are parameterized by a stride), which makes it more flexible about matrix storage and makes it easier to implement tensor contractions in higher dimensions.

This necessitates a different API.
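Roughly, that looks like this in the BLIS typed API (signature quoted from memory, so treat it as a sketch and check docs/BLISTypedAPI.md): every operand takes both a row stride and a column stride instead of a single leading dimension.

    #include "blis.h"

    /* C := beta*C + alpha*A*B, with independent row/column strides for
     * each operand, so none of A, B, C has to be plainly row- or
     * column-major. Thin wrapper for illustration only. */
    void gemm_general_stride(dim_t m, dim_t n, dim_t k, double alpha,
                             double *a, inc_t rs_a, inc_t cs_a,
                             double *b, inc_t rs_b, inc_t cs_b,
                             double beta,
                             double *c, inc_t rs_c, inc_t cs_c)
    {
        bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k,
                  &alpha, a, rs_a, cs_a, b, rs_b, cs_b,
                  &beta, c, rs_c, cs_c);
    }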


The batched matmul API would be very useful for convolutions. It's also used on Nvidia GPUs.

Often you get tensors that are sliced views on one or more dimensions; with the BLAS API you have to copy them into a contiguous buffer first. BLIS does not need that: matrix multiplication already repacks operands to work around memory bandwidth limits (search for the "roofline model"), and with the BLIS API the slicing/copy can be fused with that repacking, as sketched below.
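As a concrete case (the helper name here is made up; only the stride arithmetic matters): fixing the last index of a row-major d0 x d1 x d2 tensor gives a d0 x d1 matrix view in which neither stride is 1, which CBLAS cannot express with a single leading dimension but a general-stride gemm can consume directly.

    /* View T[:, :, k2] of a row-major tensor T[d0][d1][d2] as a d0 x d1
     * matrix. Element (i, j) sits at T[i*d1*d2 + j*d2 + k2], hence
     *   row stride    rs = d1 * d2
     *   column stride cs = d2
     * Neither is 1, so cblas_dgemm would need a packed copy first. */
    static inline double *slice_last(double *T, long d1, long d2, long k2,
                                     long *rs, long *cs)
    {
        *rs = d1 * d2;
        *cs = d2;
        return T + k2;   /* element (0, 0) of the view */
    }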


I don't think these benchmarks are realistic. BLIS does far worse than MKL in practice on Intel CPUs in my tests, especially if the arrays are not huge.


MKL has a JIT similar to libxsmm for small sizes.


That's for very small matrices, but I forget the dimensions. It was only because of libxsmm that MKL got it. (You can use libxsmm to dispatch to OpenBLAS or BLIS for non-tiny cases.) BLIS has "small"-dimension support in many cases; I don't remember what they call it, but you can see changelog items about it.


It's not that difficult to be faster than MKL when you write the benchmarks.

MKL is a general-purpose library; it makes compromises to be good across most sizes. If you care about the performance of specific sizes, you can write code that beats it on that very specific benchmark.

Interestingly, BLIS does poorly at small sizes, which is the area where it is easiest to beat MKL.


MKL is very fast, but only on Intel, due to how they do CPU detection.

It is a specialized library and they JIT their kernels.

Usually BLIS is slower.


I think it has different goals than the BLAS API - it seems like they want to be a higher level generator of BLAS-ish APIs.


I took one of the group's courses, Linear Algebra: Foundations to Frontiers, for fun at the beginning of COVID lockdown and loved it. Here's the course that teaches their basic approach to creating BLIS:

https://www.edx.org/learn/computer-programming/the-universit...

Neat stuff. The instructors are great, so I'll try to make room in my schedule for this later in the year.


Looks very interesting, thanks for the pointer.


Discussed just a little in the past:

BLIS: BLAS-Like Library Instantiation Software Framework - https://news.ycombinator.com/item?id=11369541 - March 2016 (2 comments)


A tangential request:

Is there any documentation on the architecture and engineering of the OpenBLAS kernels? As a non-expert, delving into the assembly code did not give me any insight into how they work, and I would love it if there were some way to ease people like me into the inner workings of those routines. In particular, I would love to understand the engineering against CPU specifics (caches, superscalar architecture, speculative execution) and similar low-level details.

Thanks in advance!


It’s interesting, it does very well in multicore and with complex numbers. The issue is that there’s no way we’ll rewrite all our code littered with calls to BLAS and LAPACK to use a different API. It looks like they have a BLAS compatibility layer though; I hope it’s good.

It even has a nice, friendly licence.


It's probably worth saying that ~10% differences in BLAS performance are probably lost in the noise of run time variance on an HPC system, especially as things don't typically spend most of their time in BLAS. Also, you don't have licence issues with BLIS and OpenBLAS. For instance when I last looked at cp2k, a major materials science copyleft application, it didn't have a licence exception for linking with MKL.


I used this for all my early edge AI work starting in 2016, after benchmarking it against the competition. Surprised that so few people seem to know of it.


For people new to BLIS, I can recommend this talk from the EasyBuild User Meeting (2021): https://www.youtube.com/watch?v=eb3dXivyTzE


Is this approach similar to or inspired by ATLAS? Sounds like it to me at first glance.


They are solving the same problem (tuning a BLAS).

ATLAS tries to automate it away.

For BLIS, they instead isolated the part that requires hand-tuning, so you only have to write a couple of kernels, which are then used throughout the whole library.

BLIS is quite a bit newer and I’d be surprised if ATLAS kept up.


Right, I guess DGEMM (or parts of it) is all you need for surprisingly many operations.


Wonder if anyone has taken “GEMM is all you need” as a paper title yet.


No, and ATLAS is dead. BLIS is a systematic evolution of the "Goto" approach.


New? Isn't this AMD's reference BLAS implementation?


BLIS is a framework to instantiate (among other things) a BLAS. The exciting bit is that they distilled it down to a couple of compute kernels. You provide things like the inner loop of a matrix multiplication, and it turns that into a whole numerical library.
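Conceptually, the kernel you supply is a small register-blocked update of an MR x NR tile of C, using panels of A and B that the framework has already packed. Here is a deliberately naive C sketch of that contract (not BLIS's actual kernel signature; real kernels are written in SIMD intrinsics or assembly):

    /* Conceptual gemm microkernel: C(MR x NR) += A_panel * B_panel, where
     * the framework has packed A so that each step of the k loop reads MR
     * contiguous elements, and B so that each step reads NR contiguous
     * elements. MR/NR are small constants sized to the register file. */
    #define MR 4
    #define NR 4

    void gemm_ukernel_sketch(long k,
                             const double *a_packed,  /* MR * k elements */
                             const double *b_packed,  /* k * NR elements */
                             double *c, long rs_c, long cs_c)
    {
        double ab[MR][NR] = {{0.0}};   /* accumulator tile ("registers") */

        for (long p = 0; p < k; p++)
            for (long i = 0; i < MR; i++)
                for (long j = 0; j < NR; j++)
                    ab[i][j] += a_packed[p * MR + i] * b_packed[p * NR + j];

        for (long i = 0; i < MR; i++)
            for (long j = 0; j < NR; j++)
                c[i * rs_c + j * cs_c] += ab[i][j];
    }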

AMD did use it to create their BLAS library though.

Also, side note: when I've heard "reference BLAS," it has been used in the opposite way. Netlib BLAS is the reference BLAS: its performance is basically bad, but it defines the functionality. AMD used BLIS to create a tuned vendor BLAS.


BLIS is a different library from BLAS, with a better API. For example when you pass an array, you can pass independent strides for all axes, and you can give independent transpose/conjugate flags. It does have a BLAS compatibility interface.

The interface isn't perfect though. For example they didn't commit to supporting zero strides. That used to work anyway, but it's broken now for some archs.


We've taken the word 'new' out of the title. Thanks!



