As I interview people (e.g., engineers) or I get interviewed for a new project or a new job, I can barely stop wondering why we have this unfounded belief that we are the best in what we are doing. See what I mean by reading the entire post.
For the implementation of Matrix Multiplication, and thus the implementation of the BLAS library, I have seen several project and libraries. Two of them are very impressive: ATLAS by Whaley and GotoBLAS by Goto. As a context, BLAS is the building block for vector and matrix operations. Most of scientific computing applications are based upon this library and the operations within it. LAPACK is built on BLAS. For a funny connection, if you use python and you used scipy (scipy is built on top of LAPACK and BLAS). ATLAS or GotoBLAS are high performance implementation of the BLAS and often written in FORTRAN (the inner kernel in assembly).
In turn, the general matrix multiplication (GEMM) is THE matrix operation within BLAS and the scientific computing work horse (GEMM is the work horse for QR-factorization which is the work horse for the solution of linear systems).
What is my point then? There was a time, when I thought I could do a better job than the other guys, probably because I thought I was better. The problem is that I do NOT have the chops to do half of the job that Goto or Whaley did and are doing. My implementation of the general MM has been close to but always behind these two top designers and developers.
As the time passes by and I see other engineers/students attempt to climb the high peak of GEMM performance, I see the allure. The application is simple to explain and present. Any developer can write a working code using a single liner in C. This is like playing football, the beautiful game, it is simple to explain and play, everybody can play. Writing high performance GEMM is like playing football. I can play a few tricks but I cannot even stand close to a professional player like Zinadine Zidan (I mention him because we have the same age).
How come I can claim I can improve the performance of GEMM and also have the fastest MM? Because I understood that there are different algorithms and there is space for different algorithms. Personally, I am better in thinking and writing recursive algorithms. In another post, I will present the code itself and I believe it is beautiful. The simplicity of the code is embarrassing and you may wonder if there was any contribution to start with.
Winograd–Strassen algorithms have useful applications to build on top of these high performance GEMM. Think for one second, when the GEMM reaches 90-95 % of the machine peak performance, 150 GFLOPS and more, even the original author has little space to improve GEMM. Actually, gaining anything further will not be worth it. A new machine will come along and the process to fit the code to the new system will restart.
I believe that any significant improvement of the GEMM by software alone is by the application of new algorithms: Winograd–Strassen recursive algorithms to start. These new algorithms do not substitute ATLAS nor GotoBLAS GEMM implementations. They build on top of them. They need the fastest implementation of the MM in order to achieve the fastest implementation.
I will draw two conclusions for this post: one optimistic and one cynical.
[optimistic] I can have dinner with Whaley or Goto tonight and I could feel comfortable (I do not know how conformable they could be though). I discovered I did not need to compete and actually I can help them to extend their algorithms to performance they could not achieve otherwise. Together we will succeed, divided we will fail.
[cynical] I discovered late in time the old adagio: “when you cannot beat them, either join them or …”. In this case, my ego was the real obstacle because it was Big but not BIG enough.
The morale of this post? When you have a MM kernel that achieves 95% peak performance, do not be afraid to use it :O. You better invest your time to learn about your problem than to show to the world that you can write better code. In fact, very likely you will not. And, if you do write better code, you will be able to improve performance by little, as much as 5%. In contrast, I could improve performance by 27% with no extra effort.