Try implementing a simple matmul yourself and you'll understand. Start with a naive triple loop as a baseline, then apply a few standard tricks to speed it up, and compare that against a standard BLAS call. You'll find that even with those tricks you are nowhere close to off-the-shelf BLAS libraries.
But from this exercise alone, knowing the tricks you've already used, you can see how un-embarrassingly parallel this task is (frankly, if it were truly embarrassingly parallel, a `#pragma omp parallel for` should already bring you close to the best possible performance).
I don’t think your prof got it wrong; it’s just that you misunderstood them.