The Better Accuracy of Fast Matrix Multiply II

What is the goal of analytic combinatorics? In algorithms, we use it to estimate complexity. We use this complexity to choose an algorithm a priori: the one with fewer operations or fewer memory reads. Speed is often related to fewer operations, but there are interesting exceptions. In the previous post, I wrote about sorting algorithms, their variety, their complexity and, in turn, their assumed performance. Is the numerical analysis of an algorithm really that different from its complexity analysis?

Someone introduced me to the analytic combinatorics of algorithms; it is part of the curriculum of any computer science/engineering background. However, I am self-taught in numerical analysis. Information theory, and thus probability, is an integral part of CS/EECS, but statistics and probability have been a recent deep dive for me. So, to answer the question above: the CS educational system seems to consider numerical analysis a topic for special interests.

Interestingly, the error-analysis methods do not look much different from the methods used for complexity analysis. The idea is to bound the error instead of, say, bounding the number of operations. For matrix multiplication the result is a matrix, very often a two-dimensional array.

We can bound the error of each matrix element, and each element is equally important. For the regular matrix multiplication this is simple, because each element of the result matrix is produced by the same computation, only on different data. If we assume that the inputs do not matter and only the computation injects errors, the regular matrix multiplication has a uniform error. In fact, the bounds are identical for all elements of the result matrix.
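To make the uniformity concrete, here is the standard inner-product bound (in the spirit of Higham's book, with u the unit roundoff) applied to one entry of the classical product; this is a generic sketch, not something specific to this post:

    \[ |\hat{c}_{ij} - c_{ij}| \leq \gamma_n \sum_{k=1}^{n} |a_{ik}|\,|b_{kj}|, \qquad \gamma_n = \frac{nu}{1-nu} \]

The right-hand side has exactly the same shape for every pair (i, j), which is why the classical bounds are identical across the result matrix.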

Fast matrix multiplications are different: by construction, the algorithm builds the components of the result matrix using asymmetric computations. We typically write simplified equations so as to bound the whole matrix. The final result looks and feels like any complexity analysis:

    \[ \|D\| \leq \Big[ \Big(\frac{n}{n_0}\Big)^{\log_2 K}(n_0^2 + 5n_0) - 5n \Big] u \,\|A\|\,\|B\| \]

Higham's book, Chapter 23, Theorem 23.3.
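To get a feel for what the equation says, here is a minimal Python sketch (my own, with K = 12 as in the Strassen case of Higham's analysis; the post keeps K generic) that evaluates the bracketed factor for a fixed n and a few cutoff sizes n_0. With n_0 = n the factor reduces to n^2, the classical value, and it grows as the recursion gets deeper:

    import math

    def fast_mm_error_factor(n, n0, K=12):
        """Bracketed factor in ||D|| <= [ ... ] u ||A|| ||B||.

        n  : matrix dimension (assumed n = 2^k * n0)
        n0 : cutoff below which classical multiplication is used
        K  : 12 for Strassen's method (assumption; the post keeps K generic)
        """
        return (n / n0) ** math.log2(K) * (n0 ** 2 + 5 * n0) - 5 * n

    n = 1024
    for n0 in (1024, 256, 64, 16):
        print(f"n0 = {n0:5d}: factor = {fast_mm_error_factor(n, n0):.3e}")
    # n0 = n gives the classical factor n^2; smaller cutoffs (deeper
    # recursion) make the factor, and hence the worst-case bound, grow.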

In principle, it is possible to write bounds for each element even for fast matrix multiplication; however, in practice, no one has actually tried. The equation above is beautiful in so many ways once you get to know it.

About four years ago, I started looking at how the bounds above were computed. I was awed by the complexity and the elegance. I started understanding it when I saw the original proof by Brent, and I saw its power when I first saw the proof by Bini and Lotti. All of the previous masters went at the problem so as to reduce the result to a single equation: a summary.

The more I looked at the equation, the more I hated it, because the result hides the methodology needed to reach such a concise statement. The methodology has to compute a bound for each element, and in practice the proofs teach you how to write better algorithms and thus tighter bounds and better equations.

So, to achieve a better algorithm, we must somehow summarize the error analysis.

[Figure: heat map of the maximum-error locations, 150x150 matrices, 10,000 runs, inputs in [-1,1], orthogonal]

So I decided to show the location of the error graphically. Take 10,000 runs of four algorithms: Strassen, SW (Winograd variation 1), SWOPT (Winograd variation 2), and Goto's SGEMM. The figure shows the locations of the maximum errors and the maximum of the maxima. The beautiful equation hides these even more beautiful patterns.
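Here is a minimal sketch of the kind of experiment behind the picture; it is my reconstruction in NumPy, not the original code. It compares a float32 product against a float64 reference over many random trials and counts where the largest per-element error lands; a one-level Strassen stands in for the fast algorithms (the SW/SWOPT/SGEMM codes are not reproduced), and the size and trial count are mine:

    import numpy as np

    def strassen_one_level(A, B):
        """One recursion level of Strassen's algorithm, classical MM below."""
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)
        C = np.empty_like(A)
        C[:n, :n] = M1 + M4 - M5 + M7
        C[:n, n:] = M3 + M5
        C[n:, :n] = M2 + M4
        C[n:, n:] = M1 - M2 + M3 + M6
        return C

    def error_heatmap(mm, n=128, trials=1000, rng=np.random.default_rng(0)):
        """Count, per position, how often the maximum |error| lands there."""
        hits = np.zeros((n, n), dtype=np.int64)
        for _ in range(trials):
            A = rng.uniform(-1.0, 1.0, (n, n)).astype(np.float32)
            B = rng.uniform(-1.0, 1.0, (n, n)).astype(np.float32)
            ref = A.astype(np.float64) @ B.astype(np.float64)  # reference
            err = np.abs(mm(A, B).astype(np.float64) - ref)
            hits[np.unravel_index(np.argmax(err), err.shape)] += 1
        return hits

    classical = error_heatmap(lambda A, B: A @ B)
    strassen = error_heatmap(strassen_one_level)
    # The classical product spreads the hits over the whole matrix, while
    # Strassen concentrates them: the pattern the heat maps make visible.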

Nonetheless, when I show this picture (or pictures like it), everyone understands that the higher error of fast MM is due to the limited and focused pattern: "if we could spread the area as in the SGEMM, we could reduce the error." The picture gives you the right intuition immediately. Once we combine the intuition with the proof of the equation, you have the main idea.

There are algorithmic optimizations that allow us to achieve more accurate fast matrix multiplications.