Is this adaptive-Winograd MM really new?

Once upon a time I showed the adaptive Strassen-Winograd algorithm presented in the previous post, Prof. Nicolau (the co-author) asked why this did not come up in the literature before. Why is it new?

Let us breath and let the question sink in ….  I always find these questions really fun(ny) because they are unanswerable. I will not know that ever, the previous authors know that, probably. Unfortunately, researchers do not publish or better they won’t publish it (like a car will not start without a battery) and, often, they want to forget why something did not work out. I will feel  like a moron to publish a paper saying that “Well, I liked this problem, but I am an idiot and I could not solve it, better luck to you all“. It is not good for your career and  for your self esteem, it is just not right.

Ironically, here I give an answer even though I may out of my element. If you try to reproduce the previous algorithm without paying attention at the matrix sizes of the temporary results, very likely you will have an incorrect algorithm. I know this because I tried to take the original algorithm and apply it. I failed miserably. Very likely you would fail at first and I guess this is what happened for previous practitioners. Previous researchers had to try, they may had the same problems but went for the path with least resistance. For a solution they understood.

When in CMU, I could chat with one professor who was involved in the implementation of Winograd (not the balanced one). I could finally ask the unanswerable question. What I discovered is kind of illuminating how we think: his main goal was to apply matrix tensors for the description of fast algorithms (Winograd) and he really did not care about a balanced or not balanced algorithm. In CMU I was working on the implementation of the Fast Fourier Transforms (FFT) and matrix tensors are powerful tools for the description of FFT implementations —there is a tensor in every SPIRAL paper :).

Why is this illuminating?

I cared for a balanced Winograd algorithm because I wanted to use a set of  tools that  would fit a balanced computation: optimality of the computation, simplicity, and exploiting data locality in caches. So I went through a path to show that I could use such tools. The previous researcher probably never consider my problem because they had other tools they wanted to use. They may have never tried to have a balanced computation, thus never failed.


When somebody ask you an answerable question that you cannot know the answer; for example, your co-author, your wife, your boss or your daughter,  you must give an answer, hopefully in the future you will stumble upon the right one.






Adaptive-Balanced Winograd

Once I thought I knew Strassen-Winograd’s algorithm. Well, I did not.  Probably, you do not too. In this post, I am going to discuss a variation of the classic algorithm

Take the Strassen’s algorithm presented in wikipedia and change the notation just slightly:

We partition A, B and C into equally sized block matrices

C = \begin{bmatrix}C_0 & C_1\\ C_2& C_3 \end{bmatrix}, A = \begin{bmatrix}A_0 & A_1\\ A_2& A_3 \end{bmatrix}, \text{ and } B = \begin{bmatrix}B_0 & B_1\\ B_2& B_3 \end{bmatrix}

with  C_i, A_i, B_i \in \mathbb{R} ^{2^n\times 2^n}

Now comes the important part. We define new matrices

P_1 = (A_0+A_3)*(B_0+B_3), \\ P_2 = (A_2+A_3)*B_0,  \\ P_3 = A_0*(B_1-B3), \\ P_4 = A_3*(B_2-B3), \\ P_5 = (A_0+A_1)*B3, \\ P_6=(A_2-A0)*(B_0+B_1), \text{ and  }\\ P_7 = (A_1- A_3)*(B_2+B3)

which are then used to express the submatrices of C in terms of these products Ps We eliminated one matrix multiplication and reduced the number of multiplications to 7 as

C = \begin{bmatrix} C_0=P_1+P_4-P_5+P_7 & C_1 = P_3+P_5 \\ C_2=P2+P4 & C_3= P_1-P_2+P_3+P_6 \end{bmatrix}

The first question that came to my mind  was: Why power of two matrices?

Intuitively, I knew  that if the matrices are power of two  the division process could be done recursively and stop when only when the operands are single-element matrices; that is, scalars. In fact, it is easy to write a recurrence equations:

(1)   \begin{equation*}   T(2^n) = \begin{cases} 7T(2^{n-1})+18O(2^{n-1}) & n>0 \\ 1 & n==1 \end{cases} \end{equation*}

Equation 2 states that a matrix multiplication is reduces into 7 MMs and 18 matrix additions (MAs) recursively. The definition of the matrix sizes simplifies the algorithm and, more importantly, it simplifies the recurrence equation:

(2)   \begin{equation*}   T(2^n) =  O((2^n)^{\log_{2}7}) \end{equation*}

If you try to look up implementations of the algorithm above, you will find solutions for any matrix sizes:

First: The most common implementation is based on a preliminary division of the matrices into quadrants as above, but where the left-top is the largest even quadrant: for example, A \in \mathbb{R}^{m \times n} where m=2k+1 and n=2j+1 are odd, then we can do a first division by setting A_0 \in   \mathbb{R}^{2k \times 2j}.  In practice, we can apply once Strassen’s recursive approach for the product  C_0 = A_0*B_0. The remain of the computation is carried out as a sequence of Matrix-Vector and VEctor-Vector operations

    \[  C_0 += A_1*B_2, \\ C_1= A_0*B_1 +A_1*B3,  \\ C_2 = A_2*B_0 + A_3*B_2, \\ C_3 = A_2*B_1+A_3*B_3. \]

Notice that right most quadrant of the result matrix C is a scalar and requires O(n) operations, the rest takes 3O(nm) operations (like 3 MAs) and  O(n+m).

Second: We can pad with zeros the matrices to the closest power-of-two size. This is academics.

Third: We can pad the matrices once so that A and B are matrices with even sizes, then we can apply Strassen’s once. Recursively, we can pad the matrices as we see fit during the recursion.

The first solution is the most common in the literature and real libraries (e.g., IBM ESSL). The second one is simply academics. The third one is metioned and if I remember correctly has the cute nickname of peeling, but considered inferior to the first one.

Personally, I find the first solution uninspiring and we have found an elegant and more efficient solution, which is closer to the third one. A reviewer suggested the nick name of  dynamic padding and also suggested that this was found already but not published. Probable.

Before I introduce the algorithm let me provide two teasers that provide the basic idea of our solution:

  • Notice: the best way to exploit the speed up by Strassen/Winograd algorithm is by braking the matrices into to balanced and square submatrices. This assures fewer operations and we can avoid the Matrix-Vector operations. A balanced division provides the most efficient algorithm so that to get closer to the performance of the original algorithm.
  • Notice: Any high performance implementation of the Strassen–Winograd algorithm requires 2 or 3 temporary matrices to store matrix additions for As, Bs, and Cs. We do not need to pad matrices as long as we redefine what is a Matrix addition for uneven matrices.

Now that I gave away the two main idea, Let’s go one step at a time and present a better algorithm.

We partition A, B and C into equally sized block matrices

C = \begin{bmatrix}C_0 & C_1\\ C_2& C_3 \end{bmatrix}\in \mathbb{R}^{m\times p}, \\ A = \begin{bmatrix}A_0 & A_1\\ A_2& A_3 \end{bmatrix} \in \mathbb{R}^{m\times n}, \text{ and } \\ B = \begin{bmatrix}B_0 & B_1\\ B_2& B_3 \end{bmatrix} \in \mathbb{R}^{n\times p}

with the large quadrant

C_0 \in \mathbb{R}^{\lceil\frac{m}{2}\rceil\times \lceil\frac{p}{2}\rceil}, A_0  \in \mathbb{R}^{\lceil\frac{m}{2}\rceil\times \lceil\frac{p}{2}\rceil},  B_0  \in \mathbb{R}^{\lceil\frac{n}{2}\rceil\times \lceil\frac{p}{2}\rceil}

and small quadrant C_3 \in \mathbb{R}^{\lfloor\frac{m}{2}\rfloor\times \lfloor\frac{p}{2}\rfloor},A_3  \in \mathbb{R}^{\lfloor\frac{m}{2}\rfloor\times \lfloor\frac{p}{2}\rfloor}, B_3  \in \mathbb{R}^{\lfloor\frac{n}{2}\rfloor\times \lfloor\frac{p}{2}\rfloor}.

The only original thing we are going to do is to define the MA for operations such as S = A_0 +A_3. Everything is like we pad A with an bottom row of  zeros and a left column with zero. In practice, the definition  of matrix addition is simple, it follows real unoptimized code …


// C = A + B
void add_t(DEF(c), DEF(a), DEF(b)) {

  int i,j,x,y;

   /* minimum sizes */
   x = min(a.m,b.m); y = min(a.n,b.n);

   //# pragma omp parallel for
   for (i=0; i<x; i++) {
     /* core of the computation */
     for (j=0;j<y;j++)  E_(,i,j,c.M,c.N) = a.beta*E_(,i,j,a.M,a.N) + b.beta*E_(,i,j,b.M,b.N);

     if (y<a.n)  E_(,i,j,c.M,c.N) =  a.beta*E_(,i,j,a.M,a.N) ; /* A is larger than B */
     else if (y<b.n) E_(,i,j,c.M,c.N) = b.beta*E_(,i,j,b.M,b.N); /* B is larger than A */
   /* last row */
   if (x<a.m)  {/* A is taller than B */
     for (j=0;j<a.n;j++)  E_(,i,j,c.M,c.N)  = a.beta*E_(,i,j,a.M,a.N);
   if (x<b.m)  {/* B is taller than A */
     for (j=0;j<b.n;j++)   E_(,i,j,c.M,c.N) = b.beta*E_(,i,j,b.M,b.N);
   //   c.beta = 1;
  \mathbf{A} = \begin{bmatrix} \mathbf{A}_{1,1} & \mathbf{A}_{1,2} \\ \mathbf{A}_{2,1} & \mathbf{A}_{2,2} \end{bmatrix} \mbox { , } \mathbf{B} = \begin{bmatrix} \mathbf{B}_{1,1} & \mathbf{B}_{1,2} \\ \mathbf{B}_{2,1} & \mathbf{B}_{2,2} \end{bmatrix} \mbox { , } \mathbf{C} = \begin{bmatrix} \mathbf{C}_{1,1} & \mathbf{C}_{1,2} \\ \mathbf{C}_{2,1} & \mathbf{C}_{2,2} \end{bmatrix}


\mathbf{A}_{i,j}, \mathbf{B}_{i,j}, \mathbf{C}_{i,j} \in R^{2^{n-1} \times 2^{n-1}}


\mathbf{C}_{1,1} = \mathbf{A}_{1,1} \mathbf{B}_{1,1} + \mathbf{A}_{1,2} \mathbf{B}_{2,1}
\mathbf{C}_{1,2} = \mathbf{A}_{1,1} \mathbf{B}_{1,2} + \mathbf{A}_{1,2} \mathbf{B}_{2,2}
\mathbf{C}_{2,1} = \mathbf{A}_{2,1} \mathbf{B}_{1,1} + \mathbf{A}_{2,2} \mathbf{B}_{2,1}
\mathbf{C}_{2,2} = \mathbf{A}_{2,1} \mathbf{B}_{1,2} + \mathbf{A}_{2,2} \mathbf{B}_{2,2}

With this construction we have not reduced the number of multiplications. We still need 8 multiplications to calculate the Ci,j matrices, the same number of multiplications we need when using standard matrix multiplication.

Now comes the important part. We define new matrices

\mathbf{M}_{1} := (\mathbf{A}_{1,1} + \mathbf{A}_{2,2}) (\mathbf{B}_{1,1} + \mathbf{B}_{2,2})
\mathbf{M}_{2} := (\mathbf{A}_{2,1} + \mathbf{A}_{2,2}) \mathbf{B}_{1,1}
\mathbf{M}_{3} := \mathbf{A}_{1,1} (\mathbf{B}_{1,2} - \mathbf{B}_{2,2})
\mathbf{M}_{4} := \mathbf{A}_{2,2} (\mathbf{B}_{2,1} - \mathbf{B}_{1,1})
\mathbf{M}_{5} := (\mathbf{A}_{1,1} + \mathbf{A}_{1,2}) \mathbf{B}_{2,2}
\mathbf{M}_{6} := (\mathbf{A}_{2,1} - \mathbf{A}_{1,1}) (\mathbf{B}_{1,1} + \mathbf{B}_{1,2})
\mathbf{M}_{7} := (\mathbf{A}_{1,2} - \mathbf{A}_{2,2}) (\mathbf{B}_{2,1} + \mathbf{B}_{2,2})

which are then used to express the Ci,j in terms of Mk. Because of our definition of the Mk we can eliminate one matrix multiplication and reduce the number of multiplications to 7 (one multiplication for each Mk) and express the Ci,j as

\mathbf{C}_{1,1} = \mathbf{M}_{1} + \mathbf{M}_{4} - \mathbf{M}_{5} + \mathbf{M}_{7}
\mathbf{C}_{1,2} = \mathbf{M}_{3} + \mathbf{M}_{5}
\mathbf{C}_{2,1} = \mathbf{M}_{2} + \mathbf{M}_{4}
\mathbf{C}_{2,2} = \mathbf{M}_{1} - \mathbf{M}_{2} + \mathbf{M}_{3} + \mathbf{M}_{6}

Ok, We defined the way we decompose the matrices. We defined how we add matrices. Now here is our algorithm:

Adaptive Winograd

S = A_2 + A_3 \\ T = B_1 + B_0 \\C_3 = U, C_1 = U \\ \\ U  = A_0 {*_w}B_0 \\C_0  = U \\ \\ C_0 {+=} A_1 {*_w} B_2    \\ \\S = S + A_0\\T = B_3 - T \\U {+=} S {*_w} T \\C_1 {+=}  U \\ \\S = S + A_1   \\C_1 {+=} S {*_w} B_3  \\ \\T = B_2 - T   \\C_2 = A_3 {*_w} T \\ \\ S = A_0 + A_2 \\T = B_3 + B_1 \\ U {+=} S  {*_w} T\\C_3 {+=} U\\C_2 {+=} U


and the code verbatim:

// Winograd's matrix multiply                                                                                                                                                      
// Notation and order taken from                                                                                                                                                   

int wmul(DEF(c), DEF(a), DEF(b)) {
  c.beta =1;
  if (a.m<= LEAF || a.n<= LEAF || b.n<=LEAF) {
      // Go GotoBLAS or ATLAS
      CMC(USE(c), = , USE(a),  mm_leaf_computation , USE(b));
  else {

    Matrix s = {0, S0(a.m,a.n),S0(a.m,a.n)   ,a.trans,a.beta};
    Matrix t = {0, S0(b.m,b.n),S0(b.m,b.n)   ,b.trans,b.beta};
    Matrix p  = {0, S0(c.m,c.n),S0(c.m,c.n)  ,'n',1};                                                                                                                 
    Matrix u2  = {0, S0(c.m,c.n),S0(c.m,c.n) ,'n',1};                                                                                                                 
    Matrix tc0 = Q0(c),tc1 = Q1(c), tc2 =Q2(c),tc3=Q3(c);
    Matrix ta0 = Q0(a),ta1 = Q1(a), ta2 =Q2(a),ta3=Q3(a);
    Matrix tb0 = Q0(b),tb1 = Q1(b), tb2 =Q2(b),tb3=Q3(b);  = (Mat *) CALLOC(s);  = (Mat *) CALLOC(t);  =  (Mat *) CALLOC(p); =  (Mat *) CALLOC(u2);
    /* P1 */
    /* P = A0*B0   */ CMC (RQ0(u2,c), =, ta0, wmul, tb0);
    /* C0  = P     */ copy(tc0    , u2);
    /* U2  = P           copy(u2, p); */

    /* P2 */
    /* P = A1 * B2 */  CMC(p,   =, ta1, wmul,   tb2);
    /* C0  += P    */  CMC(tc0, =, tc0, s_add,  p);

    /* P3 */
    /* S = A2 + A3 */  CMC(RQ2(s,a), =, ta2,      s_add,   ta3);
    /* T = B1 - B0 */  CMC(t,        =, tb1,      s_sub,   tb0);
    /* P = S * T   */  CMC(RQ2(p,c), =, RQ2(s,a), wmul ,   t);
    /* C3  =  P    */  copy(tc3, RQ3(p,c));
    /* C1  =  P    */  copy(tc1, RQ3(p,c));

    /* P4 */
    /* S  = S - A0 */  CMC(s,   =, RQ2(s,a),   s_sub,  ta0);
    /* T  = B3 - T */  CMC(t,   =, tb3,        s_sub,  t);
    /* P = S*T   */    CMC(p,   =, s,          wmul, t);
    /* U3 =U2 +=P */   CMC(u2,  =, u2,         s_add,  p);
    /* C1 +=U2, */     CMC(tc1, =, RQ3(tc1,c), s_add,  RQ1(u2,c));

    /* P6 */
    /* S = A1 - S */   CMC(s,        =, ta1,       s_sub,   s);
    /* P = S * B3 */   CMC(RQ1(p,c), =, RQ1(s,a),  wmul,    tb3);
    /* C1  += P   */   CMC(tc1,      =, tc1,       s_add,   RQ1(p,c));

    /* P7 */
    /* T = B2 - T */   CMC(RQ2(t,b), =, tb2,    s_sub,  RQ2(t,b));
    /* P = A3*T  */    CMC(RQ2(p,c), =, ta3,    wmul,   RQ2(t,b));
    /* C2 = P    */    copy(tc2, RQ2(p,c));

    /* P5 */
    /* S = A0 - A2 */  CMC(s,        =, ta0,       s_sub,  ta2);
    /* T = B3 - B1 */  CMC(RQ1(t,b), =, tb3,       s_sub,  tb1);
    /* P = S*T     */  CMC(RQ1(p,c), =, s,         wmul,   RQ1(t,b));
    /* U3  += P    */  CMC(u2,       =, u2,        s_add,  RQ1(p,c));
    /* C3  += U3   */  CMC(tc3,      =, tc3,       s_add,  RQ3(u2,c));
    /* C2  += U3   */  CMC(tc2,      =, RQ2(u2,c), s_add,  tc2);

  return recursive;



After such a boring, full of notations, post how can we summarize ? What is the punch line ?

Despise the long history of Strassen-Winograd algorithm, we present here the true generalization for every matrix sizes and shape. Yep, the algorithm can be applied to rectangular matrices as well and as it is. (well this is a white lie, for rectangular matrices we need to change one line of the program).


If you have a Matrix Multiply Kernel that achieves 90-95% peak performance

As I interview people (e.g., engineers) or I get interviewed for a new project or a new job, I can barely stop wondering why we have this unfounded belief that we are the best in what we are doing. See what I mean by reading the entire post.

For the implementation of Matrix Multiplication, and thus the implementation of the BLAS library, I have seen several project and libraries. Two of them are very impressive: ATLAS by Whaley and GotoBLAS by Goto.  As a context, BLAS is the building block for vector and matrix operations. Most of scientific computing applications are  based upon this library and the operations within it.  LAPACK is built on BLAS. For a funny connection, if you use python and you used scipy (scipy is built on top of LAPACK and BLAS).  ATLAS or GotoBLAS are high performance implementation of the BLAS and often written in FORTRAN (the inner kernel in assembly).

In turn, the general matrix multiplication (GEMM) is THE matrix operation within BLAS and the scientific computing work horse (GEMM is the work horse for QR-factorization which is the work horse for the solution of linear systems).

What is my point then? There was a time, when I thought I could do a better job than the other guys, probably because I thought I was better. The problem is that I do NOT have the chops to do half of the job that  Goto or Whaley did and are doing. My implementation of the general MM has been close to  but always behind these two top designers and developers.

As the time passes by and I see other engineers/students attempt to climb the high peak of GEMM performance, I see the allure. The application is simple to explain and present. Any developer can write a working code using a single liner in C. This is like playing football, the beautiful game, it is simple to explain and play, everybody can play. Writing high performance GEMM is like playing football. I can play a few tricks but I cannot even stand close to a professional player like  Zinadine Zidan (I mention him because we have the same age).

How come I can claim I can improve the performance of GEMM and also have the fastest MM?  Because I understood that there are different algorithms and there is space for different algorithms.  Personally, I am better in thinking and writing recursive algorithms. In another post, I will present the code itself and I believe it is beautiful. The simplicity of the code is embarrassing and you may wonder if there was any contribution to start with.

Winograd–Strassen algorithms have useful applications to build on top of these high performance GEMM. Think for one second,  when the GEMM reaches 90-95 % of the machine peak performance, 150 GFLOPS and more, even the original author has little space to improve GEMM. Actually, gaining anything further will not be worth it. A new machine will come along and the process to fit the code to the new system  will restart.

I believe that any significant improvement of the GEMM by software alone is by the application of new algorithms: Winograd–Strassen recursive algorithms to start.  These new algorithms do not substitute ATLAS nor GotoBLAS GEMM implementations. They build on top of them. They need the fastest implementation of the MM in order to achieve the fastest implementation.

I will draw two conclusions for this post: one optimistic and one cynical.

[optimistic] I can have dinner with Whaley or Goto tonight and I could  feel comfortable (I do not know how conformable they could be though).  I discovered I did not need to compete and actually I can help them to extend their algorithms to performance they could not achieve otherwise. Together we will succeed, divided we will fail.

[cynical] I discovered late in time the old adagio: “when you cannot beat them, either join them or …”. In this case, my ego was the real obstacle because it was Big but not BIG enough.

The morale of this post? When you have a MM kernel that achieves 95% peak performance, do not be afraid to use it :O. You better invest your time to learn about your problem than to show to the world that you can write better code.  In fact,  very likely you will not. And, if you do write better code, you will be able to improve performance by  little, as much as  5%. In contrast, I could improve performance by 27% with no extra effort.