## Blue duo Ellesmere vs. Red duo Fiji

To compare the two cards, we measure the performance of each single GPU and of different ways to fire up the computation, using the following sequence of four operations. This is a re-edit of the post, with better and more consistent performance plots.

TIME;
C0 += A*B;
C1 += A*B;
C2 += A*B;
C3 += A*B;
TIME;

GPUs 0 and 1 are Ellesmere; GPUs 2 and 3 are Fiji. We show performance plots for the firing sequences 0-0-0-0, 1-1-1-1, and 0-1-0-1 for the blue Ellesmere, and 2-2-2-2, 3-3-3-3, and 2-3-2-3 for the red Fiji. No data is re-used in this experiment, so all performance numbers account for memory movement; you could imagine, however, a scenario where A and B do not need to go back and forth. We also show a parallel computation 0-1-2-3.

The code is the clBLAS code generated for the Fiji architecture. No new code was generated for the Ellesmere; I hope this slack will be addressed in the future, because clBLAS is now a “closed” project and I could not “autoGEMM” it. clBLAS for AMD GPUs is a fundamental stepping stone for so many applications; I hope we are going to see it rise like Lazarus. Matrix sizes are in thousands (Mille, or Kilo).

For single precision, we see that having 16GB per GPU pays off for larger problem sizes, but not as much as I thought. In single precision, we have a small window between 20K and 25K; for single complex, the window is smaller, 15-20K, and the effect even smaller. For larger or smaller problems, Fiji is superior. In single precision, a 16Kx16Kx16K problem fits in the 4GB memory; for larger problems we use a Winograd algorithm (we could use a faster one), and we can see that we cope well. For problems larger than 30K, we need more than 16GB, and thus Ellesmere will lose its edge as well.

We can see that in single precision the peak performance of the Fiji system is 6TFLOPS and of the Ellesmere 4TFLOPS. There is a practical 20-40% peak-performance gap. All considered, this is pretty good.

Let us check the Double precision computation.

Now the Fiji is always better, even when we need software help. The peak-performance gap is negligible: the paper peak is 400×2 + 300×2 ~ 1.4TFLOPS, and we achieve 1.25TFLOPS, about 15% less.

Using the frequency and the number of computational units, we know that Fiji has 64 units at 1GHz while Ellesmere has 36 units at 1.23GHz, so on paper the Fiji should be about 44% faster. If we consider Watts, the two GPUs should have about the same ops/Joule, which means the operating cost is the same.

So if you have the time to wait for the result, the blue Ellesmere is a good choice and, for certain problem sizes, it is even faster (considering that the larger memory allows more work to stay on the card). In single precision it is more efficient for a set of problem sizes (same speed, less power), and in general it copes performance-wise with the Fiji just fine. Alas, I cannot measure the power consumption, so I assume the GPUs consume their peak Watts. The blue card is quiet and efficient; it is a long card, and it has a smaller current draw, which helps your budget. The air-cooling solution penalizes one of the GPUs, and the card gets quite a bit hotter at the exhaust end, especially during double-precision computations.

If we plug in an extra PSU and unleash all GPUs at the same time, we can achieve an empirical peak performance of 7-8 TFLOPS in single precision and 350GFLOPS in double-complex precision. Considering that we should have about 10TFLOPS (6+4) of achievable peak performance, we are shortchanged 2 TFLOPS. At this time, I will not linger on the subject. Double-precision performance is consistent and additive: the double-precision plot shows a peak at 1.2 TFLOPS (0.8 Fiji + 0.4 Ellesmere).

This is 2017: I can match the performance achieved in 2004 using a G4 with the AltiVec velocity engine by Apple (which Google search consistently likes better).

I hope I will see clBLAS for the newer generations of GPUs (ROCm has only single precision; that is, it does not have a full BLAS). I hope to see a way to communicate across the GPU memories, and to have that capability available at the OpenCL level. These two wishes could make these dual-GPU cards so much more appealing. In the future, I will consider other OpenCL implementations of BLAS. Better codes can do miracles.

## The Unbearable Lightness of Being Wrong

Everyone seems to know that fast MM based on Winograd-like algorithms cannot be as accurate as the regular MM. A few believe that Winograd-like algorithms are numerically unstable. A few believe they are stable for any practical purpose. All have very strong opinions. No one has a clear understanding.

In this post, I will talk about the contributions of Prof. Richard P. Brent, who really started the investigation of fast MM; Prof. Higham, one of the gods of numerical algorithms; and Prof. Bini, who brought forth the generalized analysis and estimates for Strassen-like algorithms. To them, I must pay my dues and give credit. If I have any contribution, it is in the empirical measurement of the accuracy of fast MM codes for large problem sizes, where fast MM is actually applicable and useful on modern architectures.

Let me start with the main statement. The forward error bound for the conventional MM C=AB with matrices of size nxn can be written as follows:

$|C - \hat{C}| \le n\, u\, |A|\,|B| + O(u^2)$   (1)

where the term on the left of the inequality represents the component-wise difference, or forward error, between the result matrix and its practical computation. The right side is the bound: it says that the error is a function of the range of the matrices A and B, multiplied by the size of the matrices and the precision u of the architecture. When the matrices have large values, the error we commit in computing the MM is large and, as the problem gets bigger, the error increases linearly with the problem size. To achieve the bound in Equation 1, the algorithm has to perform at least n^3 multiplications. Fast MM cannot achieve the same component-wise bound (this has been shown by Miller 1975) because it computes fewer multiplications. I am still reading and digesting Miller's original paper.

In practice, each component of the matrix C is the sum of n terms. Each component is independent and can be computed independently. Optimizations such as tiling break the computations apart and interleave them, but very often the original order of the additions remains unchanged (just interleaved with other sums).

If we take matrices A and B with elements in the range [0,1], we can see that each component of the product |A||B| is bounded by n, and we can easily bound the right side by u n^2.

The main problem with addition arises when two large and similar operands cancel each other. For example, take 1.0001*10^5 – 0.9999*10^5 = 20: due to the finite representation of the mantissa, the actual computation may return 0 (instead of 20). An error of 20 is small w.r.t. the operands, but very large w.r.t. the result. You can see how the values in the matrices come into play when something goes wrong. Small matrices will produce small errors (note that we cannot say the same about relative errors).

In practice, no component-wise bound is available for fast MM; instead, a norm-wise bound is presented. A component-wise bound provides an independent bound for each element of the result matrix; a norm-wise bound provides a single bound for all of them. A norm bound is weaker in this sense.

$\|C - \hat{C}\| \le n^2\, u\, \|A\|\,\|B\| + O(u^2)$   (2)

The norm of a matrix A is defined here as the max norm:

$\|A\| = \max_{i,j} |a_{i,j}|$

The first thing you may notice between the bounds in Equations 1 and 2 is the n^2 factor. This is because ||AB|| <= n ||A|| ||B||. Again, take the matrices with elements in [0,1] and ||A|| = ||B|| = 1.

By reading the references again, I finally understand the contributions of Brent and Bini, and thus I appreciate their research much better. Let us see if I can convey their message here and shed some light on the accuracy of fast MM and its error:

$E = \hat{C} - C$   (3)

We want to estimate E or its norm ||E||.

Brent works out the bound like a real mathematician (through the eyes of an engineer). He starts with an assumption. Assume that

$\|E\| \le f(n)\, u\, \|A\|\,\|B\| + O(u^2)$   (4)

and the goal is to find f(n) for the specific algorithm: Strassen, Winograd, or any of their variations. Brent uses Strassen, and here we follow him.

Where

Of course, the P_i products are computed using Strassen recursively. If we consider for now the product P_0 = (A_0-A_1)B_3, and we assume that the error bound holds for matrices of size (n/2)x(n/2), we have:

Note: I do not understand why we have the additive term n/2 in (n/2+f(n/2)); I believe we are using the norm of the products. In exactly the same manner, and using the same bound, we can state the bound for P_1, P_2, and P_3. For P_4, P_5, and P_6 we have the following bound:

Adding up the terms for the submatrices of C, for example C_0:

We commit an error computing the products and by adding them in the same order (the error of P_0 and P_2 affects 3 additions, that of P_4 affects 2). Please work out the details of the formula above; it is a useful exercise.

$\|E\| \le \left( 44\,\tfrac{n}{2} + 12\, f\!\left(\tfrac{n}{2}\right) \right) u\, \|A\|\,\|B\| + O(u^2)$   (5)

We have the recursive formula f(1) = 1 and f(n) = 44(n/2) + 12 f(n/2), and the final solution is

$f(n) = \tfrac{27}{5}\, n^{\log_2 12} - \tfrac{22}{5}\, n \approx 5.4\, n^{3.585}$   (6)

If we do not recurse all the way down to problem size 1 but stop earlier, say at size n_0 (where we switch to conventional MM, with f(n_0) = n_0^2), and we apply the recursion k times:

$f(n) = \left(\tfrac{n}{n_0}\right)^{\log_2 12} \left( n_0^2 + \tfrac{22}{5}\, n_0 \right) - \tfrac{22}{5}\, n$   (7)


Brent suggested that each time we apply a recursion of Strassen, we lose about 2 bits of precision (up to three levels are common in my tests, which means about 6 bits; in practice it is between 3 and 4).

Working with Higham's bounds and using a few simple manipulations, I could find a lower coefficient: 3^k (instead of 4^k).

As I write this post, I appreciate much better Brent's work (1970), way ahead of its time, and even more the elegant and far-reaching work by Bini and Lotti (1980), who took Brent's analysis and generalized it to families of Strassen-like algorithms, and thus to Winograd's variants.

Interpretation.

Let us start with an observation. Take a matrix product like P_0 and use it in the computation of C_0 and C_1: any error in P_0 will affect both C_0 and C_1 equally. For Strassen's algorithm, each P_i is used twice, in different computations. We will not show it here, but a Winograd variant has a P_2 that is used in all C_i. Per se, this is not a big deal: the error of a single product spreads evenly to the results, increasing the unlucky possibility of adding up error from another product. The spread in itself is not dangerous.

Here comes the core of this long post.

I have one number I would like to highlight:

• The factor 12 in Equation 5.

The factor 12 is the result of four terms: 2 + 2 + 4 + 4. As a rule of thumb, a 2 means that one operand involves the addition of two submatrices, such as A_0+A_1; a 4 arises when both operands involve additions. If we perform an MA prior to the product, we increase the error of the computation because of any unlucky cancellation; that is, the more MAs before the products, the larger the error.

I see two main problems:

• The algorithms have about 4 MAs after the product computations (2 more than the regular computation). We save operations, but we create a longer computation chain that may, and does, increase the error accumulation. Because we are seeking cancellation, we increase the injection of unwanted error.
• The pre-product MAs may have a significant effect on the error because, even if it is small, it will be amplified by the MM that follows. Unfortunately, this is something I cannot grasp very well: the MA is among the inputs, no fancy computations, just a short set of additions. I cannot see how there could be any catastrophic cancellation.

In their work, Bini and Lotti took these two ideas and provided a generalization.

Enough food for thought.

Just to make a little sense of the legend:

—WOPT: Winograd’s algorithm with fewer MAs.

—WIDEAL: Winograd’s algorithm optimized for a pipeline execution (but not pipelined).

—GOTOS: MM implementation as available in GotoBLAS.

—BLAS MM or MM only: MM implementation row-by-column (this is used in the error analysis only).

—GOTOS 3M: 3M algorithm as available in GotoBLAS where matrices are stored as
complex matrices.

—3M (GOTOS/WINOGRAD/STRASSEN/ATLAS): our implementation of the 3M algorithm as presented in Table II, where MM is implemented as STRASSEN, WINOGRAD, GOTOS, or ATLAS and thus complex matrices are stored as two distinct real matrices.

We believe this is the first attempt to show how all these algorithms work in practice and how they affect the final result and its error. We do not justify the blind use of fast algorithms when accuracy is paramount; however, we want to make sure that fast algorithms are not rejected just because of an unfounded fear of their instability.

Once I thought I knew Strassen-Winograd's algorithm. Well, I did not. Probably, you do not either. In this post, I am going to discuss a variation of the classic algorithm.

Take Strassen's algorithm as presented on Wikipedia and change the notation just slightly:

We partition A, B and C into equally sized block matrices

with

Now comes the important part. We define new matrices

which are then used to express the submatrices of C in terms of these products P. We eliminated one matrix multiplication and reduced the number of multiplications to 7.

The first question that came to my mind was: why power-of-two matrices?

Intuitively, I knew that if the matrices are powers of two, the division process can be done recursively and stop only when the operands are single-element matrices, that is, scalars. In fact, it is easy to write a recurrence equation:

$T(n) = 7\, T\!\left(\tfrac{n}{2}\right) + 18 \left(\tfrac{n}{2}\right)^2$   (1)

Equation 1 states that a matrix multiplication reduces recursively into 7 MMs and 18 matrix additions (MAs). The power-of-two sizes simplify the algorithm and, more importantly, the recurrence equation, which solves to:

$T(n) = O\!\left(n^{\log_2 7}\right) \approx O\!\left(n^{2.81}\right)$   (2)

If you try to look up implementations of the algorithm above, you will find solutions for any matrix size:

First: the most common implementation is based on a preliminary division of the matrices into quadrants as above, but where the top-left quadrant is the largest even one: for example, where m=2k+1 and n=2j+1 are odd, we can do a first division by taking a 2k×2j top-left quadrant. In practice, we apply Strassen's recursive approach once to that quadrant product; the remainder of the computation is carried out as a sequence of matrix-vector and vector-vector operations.

Notice that the right-most quadrant of the result matrix C is a scalar and requires O(n) operations; the rest takes 3·O(nm) operations (like 3 MAs) plus O(n+m).

Second: we can pad the matrices with zeros to the closest power-of-two size. This is academic.

Third: we can pad the matrices once so that A and B have even sizes, then apply Strassen's once. Recursively, we can pad the matrices as we see fit during the recursion.

The first solution is the most common in the literature and in real libraries (e.g., IBM ESSL). The second one is simply academic. The third one is mentioned and, if I remember correctly, has the cute nickname of peeling, but it is considered inferior to the first one.

Personally, I find the first solution uninspiring, and we have found an elegant and more efficient solution, closer to the third one. A reviewer suggested the nickname dynamic padding, and also suggested that this was found already but not published. Probably.

Before I introduce the algorithm, let me provide two teasers that convey the basic idea of our solution:

• Notice: the best way to exploit the speedup of the Strassen/Winograd algorithm is by breaking the matrices into balanced and square submatrices. This assures fewer operations, and we can avoid the matrix-vector operations. A balanced division provides the most efficient algorithm, getting closest to the performance of the original algorithm.
• Notice: any high-performance implementation of the Strassen–Winograd algorithm requires 2 or 3 temporary matrices to store matrix additions for the As, Bs, and Cs. We do not need to pad matrices as long as we redefine what a matrix addition is for uneven matrices.

Now that I have given away the two main ideas, let's go one step at a time and present a better algorithm.

We partition A, B and C into equally sized block matrices

The only original thing we are going to do is to define the MA for operations on uneven operands. Everything works as if we padded the smaller operand with a bottom row of zeros and a right column of zeros. In practice, the definition of matrix addition is simple; it follows real, unoptimized code …

// C = A + B, where A and B may differ in size by one row/column
int i, j, x, y;

/* minimum sizes */
x = min(a.m, b.m);
y = min(a.n, b.n);

//# pragma omp parallel for
for (i = 0; i < x; i++) {
    /* core of the computation */
    for (j = 0; j < y; j++)
        E_(c.data,i,j,c.M,c.N) = a.beta*E_(a.data,i,j,a.M,a.N) + b.beta*E_(b.data,i,j,b.M,b.N);

    if (y < a.n)        /* A is wider than B: last column from A alone */
        E_(c.data,i,j,c.M,c.N) = a.beta*E_(a.data,i,j,a.M,a.N);
    else if (y < b.n)   /* B is wider than A */
        E_(c.data,i,j,c.M,c.N) = b.beta*E_(b.data,i,j,b.M,b.N);
}
/* last row */
if (x < a.m) {          /* A is taller than B */
    for (j = 0; j < a.n; j++)
        E_(c.data,i,j,c.M,c.N) = a.beta*E_(a.data,i,j,a.M,a.N);
}
if (x < b.m) {          /* B is taller than A */
    for (j = 0; j < b.n; j++)
        E_(c.data,i,j,c.M,c.N) = b.beta*E_(b.data,i,j,b.M,b.N);
}
//   c.beta = 1;
}

$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{1,1} & \mathbf{A}_{1,2} \\ \mathbf{A}_{2,1} & \mathbf{A}_{2,2} \end{bmatrix} \mbox { , } \mathbf{B} = \begin{bmatrix} \mathbf{B}_{1,1} & \mathbf{B}_{1,2} \\ \mathbf{B}_{2,1} & \mathbf{B}_{2,2} \end{bmatrix} \mbox { , } \mathbf{C} = \begin{bmatrix} \mathbf{C}_{1,1} & \mathbf{C}_{1,2} \\ \mathbf{C}_{2,1} & \mathbf{C}_{2,2} \end{bmatrix}$

with

$\mathbf{A}_{i,j}, \mathbf{B}_{i,j}, \mathbf{C}_{i,j} \in R^{2^{n-1} \times 2^{n-1}}$

then

$\mathbf{C}_{1,1} = \mathbf{A}_{1,1} \mathbf{B}_{1,1} + \mathbf{A}_{1,2} \mathbf{B}_{2,1}$
$\mathbf{C}_{1,2} = \mathbf{A}_{1,1} \mathbf{B}_{1,2} + \mathbf{A}_{1,2} \mathbf{B}_{2,2}$
$\mathbf{C}_{2,1} = \mathbf{A}_{2,1} \mathbf{B}_{1,1} + \mathbf{A}_{2,2} \mathbf{B}_{2,1}$
$\mathbf{C}_{2,2} = \mathbf{A}_{2,1} \mathbf{B}_{1,2} + \mathbf{A}_{2,2} \mathbf{B}_{2,2}$

With this construction we have not reduced the number of multiplications. We still need 8 multiplications to calculate the Ci,j matrices, the same number of multiplications we need when using standard matrix multiplication.

Now comes the important part. We define new matrices

$\mathbf{M}_{1} := (\mathbf{A}_{1,1} + \mathbf{A}_{2,2}) (\mathbf{B}_{1,1} + \mathbf{B}_{2,2})$
$\mathbf{M}_{2} := (\mathbf{A}_{2,1} + \mathbf{A}_{2,2}) \mathbf{B}_{1,1}$
$\mathbf{M}_{3} := \mathbf{A}_{1,1} (\mathbf{B}_{1,2} - \mathbf{B}_{2,2})$
$\mathbf{M}_{4} := \mathbf{A}_{2,2} (\mathbf{B}_{2,1} - \mathbf{B}_{1,1})$
$\mathbf{M}_{5} := (\mathbf{A}_{1,1} + \mathbf{A}_{1,2}) \mathbf{B}_{2,2}$
$\mathbf{M}_{6} := (\mathbf{A}_{2,1} - \mathbf{A}_{1,1}) (\mathbf{B}_{1,1} + \mathbf{B}_{1,2})$
$\mathbf{M}_{7} := (\mathbf{A}_{1,2} - \mathbf{A}_{2,2}) (\mathbf{B}_{2,1} + \mathbf{B}_{2,2})$

which are then used to express the Ci,j in terms of Mk. Because of our definition of the Mk we can eliminate one matrix multiplication and reduce the number of multiplications to 7 (one multiplication for each Mk) and express the Ci,j as

$\mathbf{C}_{1,1} = \mathbf{M}_{1} + \mathbf{M}_{4} - \mathbf{M}_{5} + \mathbf{M}_{7}$
$\mathbf{C}_{1,2} = \mathbf{M}_{3} + \mathbf{M}_{5}$
$\mathbf{C}_{2,1} = \mathbf{M}_{2} + \mathbf{M}_{4}$
$\mathbf{C}_{2,2} = \mathbf{M}_{1} - \mathbf{M}_{2} + \mathbf{M}_{3} + \mathbf{M}_{6}$

Ok, we defined the way we decompose the matrices, and we defined how we add matrices. Now here is our algorithm:

and the code verbatim:

// Winograd's matrix multiply
// Notation and order taken from
// http://www.cs.duke.edu/~alvy/papers/sc98/index.htm

int wmul(DEF(c), DEF(a), DEF(b)) {
    c.beta = 1;
    if (a.m <= LEAF || a.n <= LEAF || b.n <= LEAF) {
        // Go GotoBLAS or ATLAS
        CMC(USE(c), =, USE(a), mm_leaf_computation, USE(b));
    }
    else {
        Matrix s   = {0, S0(a.m,a.n), S0(a.m,a.n), a.trans, a.beta};
        Matrix t   = {0, S0(b.m,b.n), S0(b.m,b.n), b.trans, b.beta};
        Matrix p   = {0, S0(c.m,c.n), S0(c.m,c.n), 'n', 1};
        Matrix u2  = {0, S0(c.m,c.n), S0(c.m,c.n), 'n', 1};
        Matrix tc0 = Q0(c), tc1 = Q1(c), tc2 = Q2(c), tc3 = Q3(c);
        Matrix ta0 = Q0(a), ta1 = Q1(a), ta2 = Q2(a), ta3 = Q3(a);
        Matrix tb0 = Q0(b), tb1 = Q1(b), tb2 = Q2(b), tb3 = Q3(b);

        s.data  = (Mat *) CALLOC(s);
        t.data  = (Mat *) CALLOC(t);
        p.data  = (Mat *) CALLOC(p);
        u2.data = (Mat *) CALLOC(u2);

        /* P1 */
        /* P = A0*B0   */  CMC(RQ0(u2,c), =, ta0, wmul, tb0);
        /* C0 = P      */  copy(tc0, u2);
        /* U2 = P            copy(u2, p); */

        /* P2 */
        /* P = A1 * B2 */  CMC(p,   =, ta1, wmul,  tb2);
        /* C0 += P     */  CMC(tc0, =, tc0, s_add, p);

        /* P3 */
        /* S = A2 + A3 */  CMC(RQ2(s,a), =, ta2,      s_add, ta3);
        /* T = B1 - B0 */  CMC(t,        =, tb1,      s_sub, tb0);
        /* P = S * T   */  CMC(RQ2(p,c), =, RQ2(s,a), wmul,  t);
        /* C3 = P      */  copy(tc3, RQ3(p,c));
        /* C1 = P      */  copy(tc1, RQ3(p,c));

        /* P4 */
        /* S = S - A0  */  CMC(s,   =, RQ2(s,a),   s_sub, ta0);
        /* T = B3 - T  */  CMC(t,   =, tb3,        s_sub, t);
        /* P = S*T     */  CMC(p,   =, s,          wmul,  t);
        /* U3 = U2 += P */ CMC(u2,  =, u2,         s_add, p);
        /* C1 += U2    */  CMC(tc1, =, RQ3(tc1,c), s_add, RQ1(u2,c));

        /* P6 */
        /* S = A1 - S  */  CMC(s,        =, ta1,      s_sub, s);
        /* P = S * B3  */  CMC(RQ1(p,c), =, RQ1(s,a), wmul,  tb3);
        /* C1 += P     */  CMC(tc1,      =, tc1,      s_add, RQ1(p,c));

        /* P7 */
        /* T = B2 - T  */  CMC(RQ2(t,b), =, tb2, s_sub, RQ2(t,b));
        /* P = A3*T    */  CMC(RQ2(p,c), =, ta3, wmul,  RQ2(t,b));
        /* C2 = P      */  copy(tc2, RQ2(p,c));

        /* P5 */
        /* S = A0 - A2 */  CMC(s,        =, ta0,       s_sub, ta2);
        /* T = B3 - B1 */  CMC(RQ1(t,b), =, tb3,       s_sub, tb1);
        /* P = S*T     */  CMC(RQ1(p,c), =, s,         wmul,  RQ1(t,b));
        /* U3 += P     */  CMC(u2,       =, u2,        s_add, RQ1(p,c));
        /* C3 += U3    */  CMC(tc3,      =, tc3,       s_add, RQ3(u2,c));
        /* C2 += U3    */  CMC(tc2,      =, RQ2(u2,c), s_add, tc2);

        FREE(s.data);
        FREE(t.data);
        FREE(p.data);
        FREE(u2.data);
    }
    dept--;
    return recursive;
}


## Conclusions

After such a boring post, so full of notation, how can we summarize? What is the punch line?

Despite the long history of the Strassen-Winograd algorithm, we present here a true generalization to every matrix size and shape. Yep, the algorithm can be applied to rectangular matrices as well, as it is (well, this is a white lie: for rectangular matrices we need to change one line of the program).


## If you have a Matrix Multiply Kernel that achieves 90-95% peak performance

As I interview people (e.g., engineers), or get interviewed for a new project or a new job, I cannot stop wondering why we have this unfounded belief that we are the best at what we are doing. You will see what I mean by reading the entire post.

For the implementation of matrix multiplication, and thus of the BLAS library, I have seen several projects and libraries. Two of them are very impressive: ATLAS by Whaley and GotoBLAS by Goto. For context, BLAS is the building block for vector and matrix operations; most scientific computing applications are based on this library and the operations within it. LAPACK is built on BLAS. For a fun connection: if you use Python and you have used SciPy, note that SciPy is built on top of LAPACK and BLAS. ATLAS and GotoBLAS are high-performance implementations of the BLAS, often written in FORTRAN (with the inner kernel in assembly).

In turn, the general matrix multiplication (GEMM) is THE matrix operation within BLAS and the scientific-computing workhorse (GEMM is the workhorse of the QR factorization, which is the workhorse of the solution of linear systems).

What is my point then? There was a time when I thought I could do a better job than the other guys, probably because I thought I was better. The problem is that I do NOT have the chops to do half the job that Goto or Whaley did and are doing. My implementation of the general MM has been close to, but always behind, these two top designers and developers.

As time passes by and I see other engineers and students attempt to climb the high peak of GEMM performance, I see the allure. The application is simple to explain and present, and any developer can write working code using a one-liner in C. It is like football, the beautiful game: simple to explain and simple to play, so everybody plays. But writing high-performance GEMM is like playing professional football: I know a few tricks, yet I cannot stand close to a professional player like Zinedine Zidane (I mention him because we are the same age).

How come I can claim I can improve the performance of GEMM and also have the fastest MM? Because I understood that there are different algorithms, and there is space for different algorithms. Personally, I am better at thinking and writing recursive algorithms. In another post, I will present the code itself, and I believe it is beautiful. The simplicity of the code is embarrassing, and you may wonder if there was any contribution to start with.

Winograd–Strassen algorithms are a useful application to build on top of these high-performance GEMMs. Think for one second: when the GEMM reaches 90-95% of the machine peak performance, 150 GFLOPS and more, even the original author has little space to improve it. Actually, gaining anything further will not be worth it: a new machine will come along, and the process of fitting the code to the new system will restart.

I believe that any significant improvement of GEMM by software alone will come from the application of new algorithms, Winograd–Strassen recursive algorithms to start. These new algorithms do not substitute for the ATLAS or GotoBLAS GEMM implementations; they build on top of them. They need the fastest implementation of the MM in order to be the fastest implementation.

I will draw two conclusions for this post: one optimistic and one cynical.

[optimistic] I could have dinner with Whaley or Goto tonight and feel comfortable (I do not know how comfortable they would be, though). I discovered I did not need to compete, and actually I can help them extend their algorithms to performance they could not achieve otherwise. Together we will succeed; divided we will fail.

[cynical] I discovered late the old adage: “when you cannot beat them, either join them or …”. In this case, my ego was the real obstacle, because it was big but not BIG enough.

The moral of this post? When you have an MM kernel that achieves 95% of peak performance, do not be afraid to use it :O. You are better off investing your time in learning about your problem than in showing the world that you can write better code. In fact, very likely you will not. And if you do write better code, you will improve performance by a little, as much as 5%. In contrast, I could improve performance by 27% with no extra effort.