The BLIS parallelism strategy is written up in at least one of the references they provide. (Serial OpenBLAS is typically a bit faster than BLIS, but it has (had?) problems with multi-threading.) See also the distributed equivalents (ScaLAPACK and competitors), which may be of more interest in serious HPC.