No, GPU Matrix Multiplication ain’t tiled.

Is wisdom a gift of old age? Or is it patience, maybe? I do not remember which one; my age is affecting my memory.

Throughout these random posts, my research, and the presentations of my research, the topic of error analysis is an annoying itch that everybody around me wants to scratch. Paraphrasing the lyric "See the stone set in your eyes": whenever I talk about Fast MM, there is always a cry that using any Fast Algorithm is inconceivable, because the error those algorithms introduce is a huge stone I am supposed to lift. The wise see my stone and want me to scratch its itch.

For anyone attending ICS 2013, I recommend the talk "Improving Numerical Accuracy for Non-Negative Matrix Multiplication on GPUs using Recursive Algorithms."

There are a few interesting ideas I would like to express (not necessarily aligned with all the authors), beyond the obvious point that technology makes us all lazy:

  • When new architectures are introduced, like these beautiful GPUs, we tend to forgive the quality of the code written for them. In the rush to publish, we forget what has been done previously (by ourselves?).
  • These things can awe you with 3 TFLOPS of performance, which is a slap in the face to any high-end general-purpose processor. For the same reason, any decent code will do.
  • MM codes for GPUs do not use blocking. The irony is that the people who developed the GPU codes are the same authors who proposed blocking in the first place.

But WTF is blocking/tiling, and why ain't MM tiled?

Tiling/blocking is actually easy to summarize: it breaks the computation into smaller computations, following a divide-and-conquer strategy. An MM is tiled if it is decomposed into similar and smaller MMs. For example, Fast algorithms apply blocking naturally (i.e., recursive blocking).
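To make the distinction concrete, here is a minimal sketch of what I mean by tiling, in plain C++ rather than any vendor's kernel language: the n x n product is carried out as many small TILE x TILE products, each a smaller MM in its own right. The names (tiled_mm, TILE) and the assumption that n is a multiple of the tile size are mine, for illustration only.

```cpp
#include <vector>
#include <cstddef>

// Minimal sketch of tiled matrix multiplication: C += A * B, with all
// matrices n x n, row-major, and n a multiple of the tile size.
// The point is only the structure: the big product is decomposed into
// many small (TILE x TILE) matrix products, each with a small working set.
// Illustrative only, not a tuned kernel.
constexpr std::size_t TILE = 64;

void tiled_mm(const std::vector<float>& A,
              const std::vector<float>& B,
              std::vector<float>& C,
              std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < n; jj += TILE)
            for (std::size_t kk = 0; kk < n; kk += TILE)
                // One small MM:
                // C[ii..ii+T, jj..jj+T] += A[ii..ii+T, kk..kk+T] * B[kk..kk+T, jj..jj+T]
                for (std::size_t i = ii; i < ii + TILE; ++i)
                    for (std::size_t j = jj; j < jj + TILE; ++j) {
                        float acc = 0.0f;
                        for (std::size_t k = kk; k < kk + TILE; ++k)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += acc;
                    }
}
```

Notice that the accumulation along the k dimension is split into chunks of TILE additions per tile, instead of one chain of n additions per element of C.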

If you pick up the MM code for NVIDIA's or AMD's GPUs as of today, April 2013, the code is not tiled. The authors break the matrix C into square sub-blocks of fixed size (k_0 x k_0) and then perform "long" matrix multiplications; in practice, they exploit dot products. This is not me being obsessive about the definition of blocking: it has serious consequences for the error the computation commits. The code has long strings of dependent additions that very likely introduce and accumulate error. Here is the stone set in your eyes.
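To see why the length of the addition chain matters, here is a toy C++ experiment of mine (not the vendor's code): it accumulates one long dot product in single precision, once as a single running sum and once with blocked partial sums, and compares both against a double-precision reference. The block size of 64 and the random positive data are arbitrary choices for illustration.

```cpp
#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <vector>

// Toy experiment: one "long" dot product of length n in single precision,
// accumulated two ways, checked against a double-precision reference.
//  - "long chain": one running sum, n dependent additions (what a
//    dot-product style kernel does along the k dimension).
//  - "blocked": partial sums over blocks of 64, then a sum of the partials,
//    i.e. the shorter addition chains that tiling gives you.
int main() {
    const std::size_t n = 1 << 22;                      // 4M terms
    std::vector<float> x(n), y(n);
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = static_cast<float>(rand()) / RAND_MAX;   // positive entries,
        y[i] = static_cast<float>(rand()) / RAND_MAX;   // like a probability matrix
    }

    double ref = 0.0;                                   // double-precision reference
    for (std::size_t i = 0; i < n; ++i) ref += (double)x[i] * (double)y[i];

    float chain = 0.0f;                                 // one long chain of additions
    for (std::size_t i = 0; i < n; ++i) chain += x[i] * y[i];

    const std::size_t blk = 64;                         // blocked accumulation
    float blocked = 0.0f;
    for (std::size_t b = 0; b < n; b += blk) {
        float partial = 0.0f;
        for (std::size_t i = b; i < b + blk; ++i) partial += x[i] * y[i];
        blocked += partial;
    }

    std::printf("long chain relative error: %g\n", std::fabs(chain - ref) / ref);
    std::printf("blocked    relative error: %g\n", std::fabs(blocked - ref) / ref);
    return 0;
}
```

The error bound for the blocked sum grows with roughly n/b + b additions instead of n, which is the whole point of keeping the chains short.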

The wise man would say: do not worry, young man! The error committed here is not even close to the error committed by the Fast MM algorithms. After all, the wise man thinks anything is more accurate than a Fast MM, which is unstable. What is the matter, you do not like 3 TFLOPS?

If I apply a Fast Algorithm, I may add another 1 TFLOPS (a 25% improvement), but the wise would say: inconceivable. Because my Fast Algorithm will introduce error of inconceivable size, making the result plain useless. I like using "inconceivable" here. Unfortunately, it seems I cannot ask: what is the matter, you do not like 4 TFLOPS for the price of 3?

In the ICS talk, the authors will show that for positive matrices (like probability matrices) there is a Winograd variant that is both 27% faster and 3-4 bits more accurate. The authors also show an accuracy comparison with Goto's implementation, which should have been the reference for numerical error analysis all along (but was not). The fast MM loses one bit to Goto's implementation. 4 TFLOPS at the cost of one bit of precision: can anyone tell me what is wrong with that?
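For reference, this is the textbook Winograd variant of Strassen's recursion (7 sub-products and 15 matrix additions per level), sketched in C++ for one level with a classical MM at the base. It is my illustration of the kind of recursion involved, not necessarily the exact non-negative variant the authors present, and every name in it is mine.

```cpp
#include <vector>
#include <cstddef>

// One level of the Winograd variant of Strassen's algorithm for n x n
// row-major matrices, n even. Sketch only: a real code would recurse or
// call a tuned GEMM for the sub-products.
using Mat = std::vector<float>;

static Mat add(const Mat& X, const Mat& Y) {
    Mat Z(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) Z[i] = X[i] + Y[i];
    return Z;
}
static Mat sub(const Mat& X, const Mat& Y) {
    Mat Z(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) Z[i] = X[i] - Y[i];
    return Z;
}
// Classical MM used for the seven m x m sub-products.
static Mat mul(const Mat& X, const Mat& Y, std::size_t m) {
    Mat Z(m * m, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t k = 0; k < m; ++k)
            for (std::size_t j = 0; j < m; ++j)
                Z[i * m + j] += X[i * m + k] * Y[k * m + j];
    return Z;
}
// Read / write the (bi, bj) quadrant of an n x n matrix (quadrants are m x m, m = n/2).
static Mat quad(const Mat& X, std::size_t n, int bi, int bj) {
    std::size_t m = n / 2;
    Mat Z(m * m);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < m; ++j)
            Z[i * m + j] = X[(bi * m + i) * n + (bj * m + j)];
    return Z;
}
static void put(Mat& X, const Mat& Z, std::size_t n, int bi, int bj) {
    std::size_t m = n / 2;
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < m; ++j)
            X[(bi * m + i) * n + (bj * m + j)] = Z[i * m + j];
}

Mat winograd_mm(const Mat& A, const Mat& B, std::size_t n) {
    std::size_t m = n / 2;
    Mat A11 = quad(A, n, 0, 0), A12 = quad(A, n, 0, 1),
        A21 = quad(A, n, 1, 0), A22 = quad(A, n, 1, 1);
    Mat B11 = quad(B, n, 0, 0), B12 = quad(B, n, 0, 1),
        B21 = quad(B, n, 1, 0), B22 = quad(B, n, 1, 1);

    // 8 "pre-additions" on the operands, 4 per side.
    Mat S1 = add(A21, A22), S2 = sub(S1, A11), S3 = sub(A11, A21), S4 = sub(A12, S2);
    Mat T1 = sub(B12, B11), T2 = sub(B22, T1), T3 = sub(B22, B12), T4 = sub(T2, B21);

    // 7 sub-products instead of the classical 8.
    Mat P1 = mul(A11, B11, m), P2 = mul(A12, B21, m), P3 = mul(S4, B22, m),
        P4 = mul(A22, T4, m),  P5 = mul(S1, T1, m),  P6 = mul(S2, T2, m),
        P7 = mul(S3, T3, m);

    // 7 "post-additions" assemble the quadrants of C.
    Mat U2 = add(P1, P6), U3 = add(U2, P7), U4 = add(U2, P5);

    Mat C(n * n);
    put(C, add(P1, P2), n, 0, 0);   // C11
    put(C, add(U4, P3), n, 0, 1);   // C12
    put(C, sub(U3, P4), n, 1, 0);   // C21
    put(C, add(U3, P5), n, 1, 1);   // C22
    return C;
}
```

The saving comes from trading one of the eight sub-products of the classical 2x2 block decomposition for extra matrix additions; that trade is the source of both the speedup and the different error behavior.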

Again, my conclusion is simple: keep an open mind. There are places where Fast MMs should not be used and places where they should; let's learn together when and where.