The Better Accuracy of Fast Matrix Multiply III

In the previous post, I showed that Strassen-Winograd algorithms have a directionality in their errors. This is the reason for their numerical properties and it has been used mainly to express bounds. Today, we use it to improve their numerical properties: after all if you have a constructive proof to prove a bound the least we can do is to use it for the optimization of the algorithm itself.

I admit there are several way to improve the algorithm. Randomization was my first guess. It is simple, it breaks patterns, or we make those pattern not predictable thus not pattern any more. In fact, randomization can be used to spread the error across the result matrix, but I could not make it reducing the intensity of the error.

I came to the idea of using permutations designed tilt the error without actually involve extra computations. These permutations must be tailored to the algorithm but they are not that difficult to grasp or implement. What are these permutations and how they are applied is in the paper. I want to show the effect of those permutation on the maximum error and their locations:


Take the previous post and look at scatter plots (this and the previous). First of all, I reduced by half the maximum error.  I can do that because I spread the error.

Once again, the geo-locality of the error is the cause of the constant but extra error FastMMW algorithms have. On one side, I find comfort in knowing where the error is. On the other side,  this knowledge can be use to break the error pattern and improve the error we commit. I see a nice connection with yet another one application of Werner Heisenberg’s principle.



The Better Accuracy of Fast Matrix Multiply II

What is the goal of analytic combinatorics? In algorithms, we use it to get an estimation of complexity. We use this complexity to choose a priori an algorithm: The one with fewer operations or fewer memory reads. Speed is often related to fewer operations but there are interesting exceptions. In the previous post, I wrote about sorting algorithms, their variety, their complexity and, in turn, their assumed performance. The numerical analysis of an algorithm is that different than it complexity analysis?

I had someone introducing me to the analytic combinatorics of algorithms. This is part of the curriculum of any computer science/engineering background.  However, I am self taught about the numerical analysis. Information theory and thus probability is integral part of CS/EECS.  However, Statistics and probability has been a recent deep dive. So to answer the question above, the CS educational system seems considering numerical analysis a topic for special interests.

Interestingly the error analysis methods do not look like much different from the methods used for the complexity analysis. The idea is to bound the error instead of bounding the number of operations, say. For Matrix Multiplication the result is a matrix, very often this is a two dimensional array.

We can bound the error of each matrix element. Each element is important equally. For the regular matrix multiplication, this is simple because each element of the result matrix has the same computation though on different data. If we assume that the inputs do not matter and only the computation is injecting errors, the regular matrix multiplication has a uniform error. In fact, the bounds are identical for all element of the result matrix.

Fast Matrix multiplication are different: by construction, the algorithm build the components of the result matrix using asymmetric computations. We used to write simplified equations so that to bound the whole matrix. The final result looks and feel as any complexity analysis:

    \[ \|D\| \leq \Big[ \Big(\frac{n}{n_0}\Big)^{\log_2K}(n_0^2 +5n_0)-5n \Big] u \|A\|\|B\| \]

Higham’s Book Chapter 23, Theorem 23.3

In principle, it  is possible to write bounds for each elements even for Fast matrix multiplication; however, in practice, no one actually tried. The equation above is beautiful  in so many ways once you start to know it.

About four years ago, I started looking at how the bounds above were computed.  I was awed by the complexity and elegance. I started understanding it when I saw the original proof by Brent. I saw its power when I saw for the first time the proof by Bini and Lotti.  All of the previous masters went at the problem to reduce the result to a single equation: a summary.

More I saw the equation more I hated it because  the result was hiding the methodology to reach such a concise result. The methodology had to compute a bound for each elements and in practice the proofs teach you how to write better algorithms and thus tighter bounds, better equations.

So to achieve a better algorithm we must somehow summarize the error analysis. HeatMError150x15010000Times[-1,1]Orthogonal

So I decided to show the location of the error graphically. Take 10,000 runs of four algorithms Strassen, SW Winograd variation 1, SWOPT Winograd variation 2 and Goto’s SGEMM. The figure show the location of the maximum errors and the maximum of the maxima. The beautiful equation hides these even more beautiful patterns.

Nonetheless, when I show this picture (or pictures like this) every one understands that the higher error of Fast MM is dues to the limited and focus pattern: “if we could spread the area as in the SGEMM we could reduce the error”.  The picture let you have the right intuition immediately. Once we combine the intuition with the equation proof, you have the main idea.

There are algorithm optimizations that allow to achieve more accurate fast matrix multiplications.