## Improving the Accuracy of High Performance BLAS Implementations using Adaptive Blocked Algorithms

That is a long title uh? It is actually the title of a paper by Matt Badin, in which I have a small contribution, if you like. Accuracy of a Fast MM is always a problem and I have already wrote a post about it: The lightness of being wrong. I have to say that the previous title is so much fun and this is just too academic. Bottom line: we can and we do improve the accuracy of Fast MMs.

If you read the previous post with lots of notations, formula, and a set of conclusions, then, this post will be easy. If you did not, please go read it, I can wait,  and come back to this post.

As a reminder, Fast MMs perform gathering operations by adding sub-matrices together from the operands A and B, perform recursively matrix multiplication, and then combine the partial results.  Let us call the additions of the operands  as pre-addition and the combination of the results as post-additions.

The numerical error that everybody hates is the so called cancellation: when two large numbers of opposite sign are added up, the result is so small that in finite precision it is reduced to zero. This cancellation error is frightening for a few. How this error comes forth in Fast MM?

Pre-addition: If we add two matrices, the probability that to addenda are of opposite signs and of the same magnitude, is really small but possible. Considering that for large matrices,  we are actually playing a Russian roulette, we are bound to hit the target. Most commonly the error that we commit in a pre-addition is the common addition error. I can see that there can be error but there is little to be done. Pre-addition cancellations are anomalies too difficult to eradicate.

• The depth of the computation is small, we perform one addition. There is no trick  to compensate for the error.
• The problem is that the error will be injected into a computation with a much larger sequence of operations: will be injected into a MM, if we stop recursion right there, or recursively it will increase and amplify even further.

Post-Addition: Here comes the pickle, now that we have performed our MMs we must perform the post-addition and actually in perfect precision we want to have perfect cancellation. We may not have perfect cancellation because of the error injected in the pre-addition (which is small)  and because of the computation of MM itself, which is at most linear in the size of the problem size (N). The post-addition can do very little to correct the error, because we can perform only a few (but on large matrices).

• The depth of the computation is small, we cannot really compensate for the error.

MM Hint: The key of this post is that to improve the accuracy of Fast MM, the only place we can do something about is by improving the accuracy of the Kernel MM. Funny, is it ?

In the paper, Matt does a good job describing the basic idea that the MM kernel should be broken up, blocked, so that to reduce the effect of the summation error. This is actually known in the literature. Just look up any reference about blocked linear algebra application and stability. The novelty is that to achieve the optimal accuracy and optimal performance there is very little work about it. I do not really want to give away the punch line, but a top-down algorithm with what is called a post load addition, will provide both accuracy and speed up.

Matt shows that with the right kernel, we can improve the accuracy of Fast MM.

For example, If you are working with probability matrices, doing  lots of MMs to determine the evolution of a probabilistic system, then  you should work with  Winograd’s algorithm with this adapted kernel.  You will have a faster and more accurate result than using ATLAS, GotoBLAS, MKL or  my Uncle’s GEMM.

FastMM will provide the speed, the modified MM kernel, will provide the accuracy.

For other problems and for Strassen’s Fast MM, this approach will work as well. You will have a faster algorithm but you will not have the most accurate: However,  much closer to what you could have with the most accurate GotoBLAS GEMM.

Take a pick at the paper, it has been published recently.