If you have a Matrix Multiply Kernel that achieves 90-95% of peak performance

As I interview people (e.g., engineers), or as I get interviewed for a new project or a new job, I cannot stop wondering why we have this unfounded belief that we are the best at what we do. Read the entire post to see what I mean.

For the implementation of matrix multiplication, and thus of the BLAS library, I have seen several projects and libraries. Two of them are very impressive: ATLAS by Whaley and GotoBLAS by Goto. For context, BLAS is the building block for vector and matrix operations. Most scientific computing applications are based on this library and the operations within it. LAPACK is built on BLAS. For a fun connection: if you use Python and you have used scipy, you have used them too, because scipy is built on top of LAPACK and BLAS. ATLAS and GotoBLAS are high-performance implementations of BLAS, often written in FORTRAN with the inner kernels in assembly.
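To make the layering concrete, here is a minimal sketch of what sitting on top of BLAS looks like from C. It assumes a library providing the standard CBLAS interface (ATLAS, for instance) is installed and linked; cblas_dgemm is the standard CBLAS entry point for double-precision GEMM.

    /* Minimal sketch: calling a tuned BLAS GEMM through the CBLAS interface.
       Build by linking your BLAS, e.g.: cc gemm_call.c -lcblas (or -lopenblas). */
    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        enum { N = 2 };
        double A[N * N] = {1, 2, 3, 4};   /* row-major 2x2 */
        double B[N * N] = {5, 6, 7, 8};
        double C[N * N] = {0};

        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }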

In turn, the general matrix multiplication (GEMM) is THE matrix operation within BLAS and the workhorse of scientific computing (GEMM is the workhorse of QR factorization, which in turn is the workhorse of the solution of linear systems).

What is my point then? There was a time when I thought I could do a better job than the other guys, probably because I thought I was better. The problem is that I do NOT have the chops to do half of the job that Goto or Whaley did and are doing. My implementation of the general MM has been close to, but always behind, those of these two top designers and developers.

As time passes and I see other engineers and students attempt to climb the high peak of GEMM performance, I see the allure. The application is simple to explain and present. Any developer can write working code in a few lines of C. It is like football, the beautiful game: simple to explain, simple to play, and everybody can play. Writing high-performance GEMM is like playing football at the professional level. I can pull off a few tricks, but I cannot even stand close to a professional player like Zinedine Zidane (I mention him because we are the same age).
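For reference, this is the kind of code anybody can write: the textbook triple loop, correct but nowhere near 90-95% of peak. (A sketch, assuming square row-major matrices.)

    /* The textbook GEMM anyone can write: C += A * B for square n x n
       matrices stored row-major. Correct, but far from peak performance. */
    void gemm_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = C[i * n + j];
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }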

How come I can claim that I can improve the performance of GEMM and also have the fastest MM? Because I understood that there are different algorithms, and there is room for different algorithms. Personally, I am better at thinking about and writing recursive algorithms. In another post, I will present the code itself, and I believe it is beautiful. The simplicity of the code is embarrassing, and you may wonder whether there was any contribution to begin with.
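Until then, here is a generic sketch of the recursive blocking idea, not the code I am referring to: split each operand into quadrants, recurse on the eight half-size products, and hand off to a tuned kernel below a cutoff (the plain triple loop stands in for it here, and the cutoff value is purely illustrative).

    #define CUTOFF 64

    /* C += A * B for n x n blocks of row-major matrices with leading
       dimension ld; n is assumed to be a power of two, C zeroed by the caller. */
    void gemm_recursive(int n, int ld, const double *A, const double *B, double *C) {
        if (n <= CUTOFF) {     /* base case: a tuned kernel would go here;
                                  the plain triple loop stands in for it */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
            return;
        }
        int h = n / 2;
        const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A21 + h;
        const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B21 + h;
        double       *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C21 + h;

        /* Eight half-size products: Cij += Ai1 * B1j + Ai2 * B2j */
        gemm_recursive(h, ld, A11, B11, C11); gemm_recursive(h, ld, A12, B21, C11);
        gemm_recursive(h, ld, A11, B12, C12); gemm_recursive(h, ld, A12, B22, C12);
        gemm_recursive(h, ld, A21, B11, C21); gemm_recursive(h, ld, A22, B21, C21);
        gemm_recursive(h, ld, A21, B12, C22); gemm_recursive(h, ld, A22, B22, C22);
    }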

Winograd–Strassen algorithms find a useful application when built on top of these high-performance GEMMs. Think about it for one second: when the GEMM reaches 90-95% of the machine peak performance, 150 GFLOPS and more, even the original author has little room left to improve GEMM. Actually, gaining anything further will not be worth the effort. A new machine will come along and the process of fitting the code to the new system will restart.

I believe that any significant improvement of GEMM by software alone will come from the application of new algorithms: Winograd–Strassen recursive algorithms, to start. These new algorithms do not replace the ATLAS or GotoBLAS GEMM implementations; they build on top of them. They need the fastest implementation of the MM underneath in order to be the fastest themselves.
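As an illustration of "building on top", here is a sketch of one classic Strassen step (not the Winograd variant, and not either of the schedules discussed below): the seven half-size products are delegated to a fast GEMM, so this layer only contributes O(n^2) additions. The helper names and the fast_gemm signature are mine, standing in for a tuned BLAS call; error checks are omitted.

    #include <stdlib.h>

    /* fast_gemm stands in for a tuned BLAS GEMM on contiguous h x h
       row-major blocks: C = A * B. (Hypothetical signature.) */
    typedef void (*gemm_fn)(int h, const double *A, const double *B, double *C);

    /* Z (contiguous h x h) = X op Y, where X and Y are h x h blocks of
       row-major matrices with leading dimension ld. */
    static void blk_add(int h, int ld, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) Z[i * h + j] = X[i * ld + j] + Y[i * ld + j];
    }
    static void blk_sub(int h, int ld, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) Z[i * h + j] = X[i * ld + j] - Y[i * ld + j];
    }
    static void blk_copy(int h, int ld, const double *X, double *Z) {
        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) Z[i * h + j] = X[i * ld + j];
    }

    /* One Strassen step: C = A * B for n x n (n even), leading dimension ld. */
    void strassen_step(int n, int ld, const double *A, const double *B,
                       double *C, gemm_fn fast_gemm) {
        int h = n / 2;
        size_t sz = (size_t)h * h;
        const double *A11 = A, *A12 = A + h, *A21 = A + (size_t)h * ld, *A22 = A21 + h;
        const double *B11 = B, *B12 = B + h, *B21 = B + (size_t)h * ld, *B22 = B21 + h;
        double *T1 = malloc(sz * sizeof *T1), *T2 = malloc(sz * sizeof *T2);
        double *M  = malloc(7 * sz * sizeof *M);
        double *M1 = M,       *M2 = M1 + sz, *M3 = M2 + sz, *M4 = M3 + sz,
               *M5 = M4 + sz, *M6 = M5 + sz, *M7 = M6 + sz;

        blk_add(h, ld, A11, A22, T1); blk_add(h, ld, B11, B22, T2); fast_gemm(h, T1, T2, M1);
        blk_add(h, ld, A21, A22, T1); blk_copy(h, ld, B11, T2);     fast_gemm(h, T1, T2, M2);
        blk_copy(h, ld, A11, T1);     blk_sub(h, ld, B12, B22, T2); fast_gemm(h, T1, T2, M3);
        blk_copy(h, ld, A22, T1);     blk_sub(h, ld, B21, B11, T2); fast_gemm(h, T1, T2, M4);
        blk_add(h, ld, A11, A12, T1); blk_copy(h, ld, B22, T2);     fast_gemm(h, T1, T2, M5);
        blk_sub(h, ld, A21, A11, T1); blk_add(h, ld, B11, B12, T2); fast_gemm(h, T1, T2, M6);
        blk_sub(h, ld, A12, A22, T1); blk_add(h, ld, B21, B22, T2); fast_gemm(h, T1, T2, M7);

        for (int i = 0; i < h; i++)
            for (int j = 0; j < h; j++) {
                size_t b = (size_t)i * h + j;
                C[i * ld + j]             = M1[b] + M4[b] - M5[b] + M7[b]; /* C11 */
                C[i * ld + j + h]         = M3[b] + M5[b];                 /* C12 */
                C[(i + h) * ld + j]       = M2[b] + M4[b];                 /* C21 */
                C[(i + h) * ld + j + h]   = M1[b] - M2[b] + M3[b] + M6[b]; /* C22 */
            }
        free(T1); free(T2); free(M);
    }

Compared with the eight-product recursion sketched earlier, Strassen trades one multiplication for extra additions, which is exactly why it only pays off when the underlying GEMM is already near peak.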

I will draw two conclusions from this post: one optimistic and one cynical.

[optimistic] I could have dinner with Whaley or Goto tonight and feel comfortable (I do not know how comfortable they would be, though). I discovered that I did not need to compete, and that I can actually help them extend their algorithms to performance they could not achieve otherwise. Together we will succeed; divided we will fail.

[cynical] I discovered, late in the game, the old adage: “when you cannot beat them, either join them or …”. In this case, my ego was the real obstacle, because it was Big but not BIG enough.

The moral of this post? When you have a MM kernel that achieves 95% of peak performance, do not be afraid to use it :O. You are better off investing your time in learning about your problem than in showing the world that you can write better code. In fact, very likely you will not. And even if you do write better code, you will improve performance only a little, 5% at most. In contrast, I could improve performance by 27% with no extra effort.

 

Parallel (Winograd) Fast MM

The following is taken from the abstract of my latest paper, to be published in Transactions on Mathematical Software:

We have found a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen’s and Winograd’s fast matrix multiply, or their combination with the 3M algorithm for complex matrices (i.e., hybrid: a recursive algorithm such as Strassen’s is used until a highly tuned BLAS matrix multiplication offers performance advantages).

We have found that modern symmetric multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful and natural parallelism exploitation at the function level (optimizations), such as function-call parallelism, function percolation, and function software pipelining. We make three contributions: first, we provide a performance overview for double and double-complex (published) and single and single-complex (private) precision matrices on state-of-the-art SMP systems; second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new, different schedules of Winograd’s matrix multiplication (achieving up to 20% speedup with respect to regular matrix multiplication). About the latter Winograd’s algorithms: one is designed to minimize the number of matrix additions and the other to minimize the computation latency of the matrix additions. Third, we apply software pipelining and thread allocation to all the algorithms and show how this yields up to 10% further performance improvements.
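For readers unfamiliar with the 3M idea, here is a sketch of the basic scheme (not the paper’s variant): a complex product (Ar + i·Ai)(Br + i·Bi) computed with three real GEMMs instead of four. The function and parameter names are mine; dgemm is a placeholder for any real matrix multiply, ideally the tuned BLAS kernel.

    #include <stdlib.h>

    /* Placeholder for a real n x n matrix multiply: P = A * B, row-major. */
    typedef void (*gemm_fn)(int n, const double *A, const double *B, double *P);

    /* 3M complex GEMM on split real/imaginary storage:
       Cr = Ar*Br - Ai*Bi,  Ci = (Ar+Ai)*(Br+Bi) - Ar*Br - Ai*Bi */
    void zgemm_3m(int n, const double *Ar, const double *Ai,
                         const double *Br, const double *Bi,
                         double *Cr, double *Ci, gemm_fn dgemm) {
        size_t sz = (size_t)n * n;
        double *T1 = malloc(sz * sizeof *T1);   /* Ar * Br */
        double *T2 = malloc(sz * sizeof *T2);   /* Ai * Bi */
        double *SA = malloc(sz * sizeof *SA);   /* Ar + Ai */
        double *SB = malloc(sz * sizeof *SB);   /* Br + Bi */
        double *T3 = malloc(sz * sizeof *T3);   /* (Ar+Ai) * (Br+Bi) */

        for (size_t k = 0; k < sz; k++) { SA[k] = Ar[k] + Ai[k]; SB[k] = Br[k] + Bi[k]; }
        dgemm(n, Ar, Br, T1);
        dgemm(n, Ai, Bi, T2);
        dgemm(n, SA, SB, T3);
        for (size_t k = 0; k < sz; k++) {
            Cr[k] = T1[k] - T2[k];
            Ci[k] = T3[k] - T1[k] - T2[k];
        }
        free(T1); free(T2); free(SA); free(SB); free(T3);
    }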

In this work, and for one architecture (Nehalem), we achieve the ideal performance speedup using Winograd/Strassen-like algorithms. This means that, for one architecture, we have the fastest implementation of the Winograd algorithm.