## On the Fast MM for GPUs

This month,  I read a paper about a Strassen-like MM for a GPU published in a distributed computing conference.  The paper is academic but it can be read quickly and it provides what it promises: the implementation of both Strassen and Winograd algorithms on a GPU.  I am always pleased to read that the community starts investigating Fast MM and in particular Winograd’s, this means that the HW is reaching the performance limit and an algorithmic view kicks in. Practitioners and more theoretical researchers start using old algorithms in this new arena and claim performance and error: being a new arena, GPUs are nice little computational engines, allows a freedom that rarely is provided.  It is like a field covered with fresh snow and we are the first stepping on it.

The literature liberty to compare the GPUs to a white snow field is a big one. The GPUs are the current computational engines and god-like capable scientist  such as prof. Dongarra and prof. Demmel are providing powerful Matrix Multiplication kernels and quite smart researchers with them for quite some time now. Some kernels are designed to be fast and others to be more accurate and less fast.  I noticed that  because the computational engines is new the algorithm is simplified: thread computation are simple row-by-column computations. It makes sense, these engines allows lots of thread and to make them independent the easy way out is the tiling of matrix C into independent computations. But this is not the core of my train of thoughts.

In the article that is the inspiration for this post, the authors show a 30% performance improvement and 200x error for both Strassen and Winograd algorithms.  The authors know want they want to show and provide a set of experiments to confirm this: they use power of two matrices to show performance and error analysis.

What is wrong with that? Really, nothing. Power of two are useful, the code is simple, the problem size double every time and the number-of-operation complexity increases by 8. Every thing is clean and it is easy. It easy to explain and the message is simple to understand. If you need performance and you are willing to loose 2 digits, the least you can achieve is just 30% speed up on matrices of size 16Kx 16K. Sorry, I forgot to tell you the problem size. Is it  big, uh?

Such a result and such a paper will scare any engineer working on the implementation of MM kernels: you need a huge/large problem size; when you do have a large one, instead of 100 seconds, the computation will take 70 seconds on these machines, and  you will have the result correct to the second position.  This is disarming, right?  I will bet that the final conclusion any reader will have is negative. The reader will not care to use any fast algorithms, simply put it, it will  not be worth it.

I think this is the course of being the first stepping on the fresh snow: if no new snow will cover the path traced, most engineers evaluating their options, especially who will not have time to read other papers or evaluate the performance on their own, they will have to skip Fast MM.

What frustrates me is the too narrow punch line. Despite the classic scientific approach, theory, algorithms, implementation, and experimental results,  I know that the paper presents one side of a coin, and there is the other one just right there, waiting patiently to be told. Furthermore, I know it has been told several times previously.