## Matrix Multiply, All-Pairs Shortest Path, and Large Dense Graphs

Take a node in a graph and think of it as a user on Twitter. A link between two users is the minimum retweet time; the link represents how closely connected the users are. You may be interested in computing the distance among all users. Why? To determine clusters, to find the closest-knit group of friends, to find relations between celebrities, or because WTF we can.

In the era of large graphs it may be interesting to compute distances among nodes. If a weighted edge represents the distance between two nodes, we could compute the all-pairs shortest path (APSP) so that, given any node, we know its distance from any other node in the graph. APSP and MM are computationally equivalent (*The Design and Analysis of Computer Algorithms*, Aho, Hopcroft, and Ullman, 1974; a great book to have anyway, I stole my advisor's copy), and we have shown that, incidentally, they are the same. I will clarify in this post.

Recursive algorithms are extremely appealing for the design of matrix computations because of their natural exploitation of cache locality and their intuitive formulation. In this section, we present a recursive D&C algorithm derived from Kleene's algorithm, Figure 1.(a). Notice that Kleene's algorithm was originally designed to solve the transitive closure (TC) of an adjacency matrix, that is, to find whether or not there is a path connecting two nodes in a directed graph. However, Kleene's algorithm is also a standard solution for APSP. In fact, in a closed semi-ring TC and APSP are the same problem, and Kleene's algorithm is a solution (when the scalar operators ∗ and + are specified as in the following paragraph) that determines every edge of the (shortest) path directly [2].

A brief description of Kleene's algorithm follows. We divide the basic problem into two sub-problems, and we solve each sub-problem directly using the Floyd-Warshall algorithm, see Figure 1.(a). Then, we perform several MMs to combine the results of the sub-problems using a temporary matrix, where the scalar addition of two numbers is actually the minimum of the two –i.e., a + b = min(a, b)– and the scalar multiplication is the regular arithmetic addition –i.e., a ∗ b = a + b.
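In code, the MM in this semi-ring is the usual triple loop with the two scalar operators swapped. A minimal Python sketch (the function name `minplus_mm` is mine):

```python
import math

INF = math.inf  # "no edge": the identity of the min operation

def minplus_mm(A, B):
    """(min, +) matrix product: scalar + becomes min, scalar * becomes +."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[INF] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            # c_ij = min_k (a_ik + b_kj): one product relaxes every
            # two-edge path i -> k -> j
            C[i][j] = min(A[i][k] + B[k][j] for k in range(m))
    return C
```

One such product of the distance matrix with itself relaxes every path of two edges, which is exactly the combination step the algorithm needs.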

Figure 1.(a): Kleene's algorithm. Figure 1.(b): the R-Kleene algorithm. Figure 1.(c).

In this case, the MM is defined in a closed semi-ring and, using the properties within, we reorganize the algorithm in Figure 1.(a) and obtain the R-Kleene algorithm in Figure 1.(b). Thus, R-Kleene is a solution for APSP in a closed semi-ring. Please note that the literature suggests a power method using MM (log(n) times), but as you can see it is not necessary.
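A minimal sketch of the recursion in Python (my own transcription of the scheme described above, not the paper's code: Floyd-Warshall closes the diagonal blocks, the (min, +) product combines them; it assumes the matrix size is a power of two):

```python
import math

INF = math.inf

def minplus(A, B):
    # (min, +) matrix product of square blocks: a + b -> min, a * b -> +
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def eadd(A, B):
    # "matrix addition" in this semi-ring: element-wise min
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def floyd_warshall(D):
    # base case: solve the sub-problem directly
    n = len(D)
    D = [row[:] for row in D]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D

def r_kleene(D, cutoff=2):
    n = len(D)
    if n <= cutoff:
        return floyd_warshall(D)
    h = n // 2
    A = [row[:h] for row in D[:h]]; B = [row[h:] for row in D[:h]]
    C = [row[:h] for row in D[h:]]; E = [row[h:] for row in D[h:]]
    A = r_kleene(A, cutoff)                  # close the top-left block
    B = minplus(A, B); C = minplus(C, A)     # route paths through it
    E = r_kleene(eadd(E, minplus(C, B)), cutoff)  # close bottom-right
    C = minplus(E, C); B = minplus(B, E)     # route back through it
    A = eadd(A, minplus(B, C))
    return [A[i] + B[i] for i in range(h)] + [C[i] + E[i] for i in range(h)]
```

The diagonal blocks are solved recursively; everything else is MM, which is where the parallelism lives.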

We can say that APSP can be formulated as a self-matrix multiplication. The finding is not that interesting per se; what matters is that MM is highly parallel while APSP was considered inherently sequential, which made APSP unappealing for large graphs.

So, if you would like to compute the shortest path across all nodes in a large, relatively dense graph using a highly parallel algorithm: R-Kleene + MM is the algorithm for you, hands down.

What is the connection with Fast MM?

Fast MMs are in general NOT applicable to speed up this MM. This is because Fast MMs deploy cancellation operations (subtractions, or additions of opposites). In the APSP problem the + operation is a min() operation, which is not invertible. Let me clarify this point: take a + b = c, where the operation + is the arithmetic addition; then, having c and either operand, we can recover the other: b = c − a or a = c − b. Unfortunately, min() does not have this basic property (in fact, this is a semiring, dang it). In the literature, there are examples showing how to extend a few problems from the semiring to a ring, making it possible to use Fast MM.

If your links are discrete numbers (integers), there are simple ways to extend the problem to a ring. I would love to see how fast algorithms do here; in this case there is no numerical error to bother us either.
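Here is one way such an embedding can work, a toy sketch of mine following the standard trick (the function name, the base `BASE`, and the weight cap `W` are my choices, not the post's): encode a weight w as the integer power BASE^(W−w), so that one ORDINARY integer matrix product recovers the (min, +) product.

```python
def distance_product_via_ring(A, B, W):
    """(min, +) product of integer matrices with entries in [0, W]
    (None = no edge), computed with one ordinary (+, *) product.
    Encode w as BASE**(W - w): the largest surviving power in each
    integer entry reveals the minimum path weight."""
    n = len(A)
    b = max(n, 2).bit_length()   # BASE = 2**b > n, so a sum of at most
    BASE = 1 << b                # n terms cannot carry into the next power
    enc = lambda M: [[0 if x is None else BASE ** (W - x) for x in row]
                     for row in M]
    Ae, Be = enc(A), enc(B)
    # plain (+, *) matrix product over the integers -- a ring,
    # so fast MM (Strassen and friends) applies here
    C = [[sum(Ae[i][k] * Be[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    # decode: exponent of the leading power is 2W - min_k (a_ik + b_kj)
    return [[None if c == 0 else 2 * W - (c.bit_length() - 1) // b
             for c in row] for row in C]
```

The price is huge integers (each entry carries up to W·log BASE bits), which is why this is a curiosity for small integer weights rather than a drop-in replacement.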

## The Unbearable Lightness of Being Wrong

Everyone seems to know that Fast MM based on Winograd-like algorithms cannot be as accurate as the regular MM. A few believe that Winograd-like algorithms are numerically unstable. A few believe they are stable for any practical purpose. All have very strong opinions. No one has a clear understanding.

In this post, I will talk about the contributions of Prof. Richard P. Brent, who really started the investigation of Fast MM accuracy, Prof. Higham, one of the gods of numerical algorithms, and Prof. Bini, who really brought forth the generalized analysis and estimates for Strassen-like algorithms. To them, I must pay my dues and give credit. If I have any contribution, it is in the empirical measurement of the accuracy of fast MM codes for large problem sizes, where fast MM is actually applicable and useful on modern architectures.

Let me start with the main statement. The forward error bound for the conventional MM C = AB with matrices of size n×n can be written as follows:

$$|\hat{C} - C| \le n\, u\, |A|\,|B| + O(u^2) \qquad (1)$$

where the term on the left of the inequality represents the component-wise difference, or forward error, between the result matrix and its practical computation. The right side is the bound: it says that the error is a function of the magnitudes of matrices A and B, multiplied by the size of the matrices and by the precision of the architecture u. When the matrices have large values, the error we commit in computing the MM is large and, as the problem gets bigger, the error increases linearly with the problem size. To achieve the bound in Equation 1 the algorithm has to perform at least n^3 multiplications. Fast MM cannot achieve the same component-wise bound (this has been shown by Miller, 1975) because it computes fewer multiplications. I am still reading and digesting Miller's original paper.

In practice, each component of the matrix C is the sum of n terms. Each component is independent and can be computed independently. Optimizations such as tiling break the computation apart and interleave the pieces, but very often the original order of the additions remains unchanged (just interleaved with other sums).

If we take matrices A and B with elements in the range [0,1], each component of the product |A||B| is bounded by n, and we can easily bound the right side by n²u.

The main problem with addition arises when two large and similar operands cancel each other. For example, take 1.0001·10^5 − 0.9999·10^5 = 20: due to the finite representation of the mantissa, the actual computation may return 0 instead of 20. An error of 20 is small with respect to the operands, but very large with respect to the result. You can see how the values in the matrices come into play when something goes wrong. Small matrices will produce small errors (note that we cannot say the same about relative errors).
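Staying with matrices in [0,1], the component-wise bound of Equation 1 is easy to see in action. A throwaway numpy experiment of mine (not the paper's code): multiply in float32, compare against a float64 reference, and check every component against n·u·|A||B|.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
A = rng.random((n, n), dtype=np.float32)   # entries in [0, 1]
B = rng.random((n, n), dtype=np.float32)

C_ref = A.astype(np.float64) @ B.astype(np.float64)  # (nearly) exact reference
C_hat = A @ B                                        # computed in float32

err = np.abs(C_hat.astype(np.float64) - C_ref)       # component-wise forward error
u = 2.0 ** -24                                       # float32 unit roundoff
# Equation 1, evaluated per component: n * u * (|A||B|)_ij
bound = n * u * (np.abs(A.astype(np.float64)) @ np.abs(B.astype(np.float64)))

print(err.max(), bound.max())
```

With nonnegative inputs there is no cancellation, and the measured error sits comfortably below the worst-case bound.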

In practice, no component-wise bound is available for Fast MM; a norm-wise bound is presented instead. A component-wise bound provides an independent bound for each element of the result matrix; a norm-wise bound provides a single bound for all of them. A norm-wise bound is weaker in this sense.

$$\|\hat{C} - C\| \le n^2\, u\, \|A\|\,\|B\| + O(u^2) \qquad (2)$$

The norm of a matrix A used here is the max norm:

$$\|A\| = \max_{i,j} |a_{ij}|$$

The first thing you may notice between the bounds in Equations 1 and 2 is the n² factor. This is because ||AB|| <= n ||A|| ||B||. Again, take the matrices with elements in [0,1], so that ||A|| = ||B|| = 1.
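The inequality ||AB|| <= n ||A|| ||B|| (in the max norm) is easy to sanity-check numerically; a quick check of mine, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
A = rng.random((n, n))   # entries in [0, 1], so the max norm is at most 1
B = rng.random((n, n))

norm = lambda M: np.abs(M).max()   # max norm: ||A|| = max_ij |a_ij|
# each entry of AB is a sum of n products, each at most ||A|| ||B||
print(norm(A @ B), n * norm(A) * norm(B))
```

Here the right-hand side is essentially n, since both norms are close to 1.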

By reading the references again, I finally understood the contributions of Brent and Bini, and thus I appreciate their research much better. Let us see if I can convey their message here and shed some light on the accuracy of Fast MM and its error:

$$E = \hat{C} - C \qquad (3)$$

We want to estimate E or its norm ||E||.

Brent works out a bound like a real mathematician (to the eyes of an engineer). He starts with an assumption. Assume that

$$\|E\| \le f(n)\, u\, \|A\|\,\|B\| + O(u^2) \qquad (4)$$

and the goal is to find f(n) for the specific algorithm: Strassen, Winograd, or any of their variations. Brent uses Strassen, and here we follow him.

where the P_i are Strassen's seven products of the (n/2)×(n/2) submatrices of A and B.

Of course, the P_i products are computed using Strassen recursively. If we consider for now the product P_0 = (A_0 − A_1)B_3, and we assume that the error bound holds for matrices of size (n/2)×(n/2), we have:

$$\|\hat{P}_0 - P_0\| \le \left(\frac{n}{2} + f\!\left(\frac{n}{2}\right)\right) u\, \|A_0 - A_1\|\, \|B_3\| + O(u^2)$$

The additive term n/2 in (n/2 + f(n/2)) comes, I believe, from taking norms of the products: for (n/2)×(n/2) matrices, ||XY|| <= (n/2) ||X|| ||Y||, so the O(u) error committed in the pre-addition A_0 − A_1 picks up a factor n/2 as it passes through the multiplication. In exactly the same manner, and using the same bound, we can state the bounds for P_1, P_2, and P_3. For P_4, P_5, and P_6 we have the following bound:

Adding up the terms for the submatrices of C, for example, C_0

We commit an error computing the products and then another by adding them, in the same order (the error of P_0 and P_2 propagates through 3 additions, that of P_4 through 2). Please work out the details of the formula above; it is a useful exercise.

$$\|\hat{C} - C\| \le \left(44\,\frac{n}{2} + 12\, f\!\left(\frac{n}{2}\right)\right) u\, \|A\|\,\|B\| + O(u^2) \qquad (5)$$

We have the recursive formula f(1) = 1 and f(n) = 44(n/2) + 12 f(n/2), and the final solution is

$$f(n) = \frac{27}{5}\, n^{\log_2 12} - \frac{22}{5}\, n \qquad (6)$$

If we do not carry the recursion down to a problem size of 1 but stop earlier, say at n_0, and we apply the recursion k times:

$$f(n) = 12^k\, n_0^2 + \frac{22}{5}\left(6^k - 1\right) n, \qquad n_0 = \frac{n}{2^k} \qquad (7)$$
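The recursive formula is easy to unroll numerically; solving it with f(1) = 1 gives the closed form f(n) = (27/5) n^(log2 12) − (22/5) n, which this quick check of mine confirms:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def f(n):
    # Brent-style recursion for Strassen: f(1) = 1, f(n) = 44(n/2) + 12 f(n/2)
    return 1 if n == 1 else 44 * (n // 2) + 12 * f(n // 2)

def f_closed(n):
    # the closed-form solution of the recursion above
    return 27 / 5 * n ** math.log2(12) - 22 / 5 * n

for m in (2, 4, 8, 16, 64):
    print(m, f(m), round(f_closed(m)))   # recursion and closed form agree
```

The n^(log2 12) ≈ n^3.585 growth of the coefficient, versus the linear n of Equation 1, is the whole story of Strassen's extra error.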

—3M (WINOGRAD/STRASSEN) PIPE: our implementation of the 3M algorithm as presented in Table II, where MM is implemented as STRASSEN PIPE or WINOGRAD PIPE, so there is software pipelining between MMs and MAs, and complex matrices are stored as two real matrices.

Brent suggested that each time we apply one level of Strassen's recursion, we lose 2 bits of precision (up to three levels are common in my tests, which would mean about 6 bits; in practice the loss is between 3 and 4 bits).

Working out Higham's bounds with a few simple manipulations, I could find a lower coefficient: 3^k (instead of 4^k).

As I write this post I appreciate Brent's work (1970) much better, way ahead of its time, and even more the elegant and far-reaching work of Bini and Lotti (1980), who took Brent's analysis and generalized it to families of Strassen-like algorithms, thus covering Winograd's variants.

Interpretation.

Let us start with an observation. Take a matrix product like P_0, which is used in the computation of both C_0 and C_1: any error in P_0 affects C_0 and C_1 equally. In Strassen's algorithm, each P_i is used twice, in different computations. We will not show it here, but a Winograd variant has a P_2 that is used in all the C_i. Per se, this is not a big deal: the error of a single product spreads evenly to the results, increasing the unlucky possibility of adding up with the error of another product. The spread in itself is not dangerous.

Here comes the core of this long post.

I have one number I would like to highlight:

• The factor 12 in Equation 5.

The factor 12 is the result of four terms: 2 + 2 + 4 + 4. As a rule of thumb, 2 means that one operand is the addition of two submatrices, such as A_0 + A_1 (say); the factor 4 appears when both operands involve additions. If we perform an MA prior to the product, we increase the error of the computation because of possible unlucky cancellations; that is, the more MAs before the products, the larger the error.

I see two main problems:

• The algorithms have about 4 MAs after the product computations (2 more than the regular computation). We save operations, but we create a longer computation chain that may, and does, increase the error accumulation. Because we seek cancellation, we increase the injection of unwanted error.
• The pre-product MAs may have a significant effect on the error because, even if their error is small, it is amplified by the MM that follows. Unfortunately, this is something I cannot grasp very well: the MA operates on the inputs, no fancy computations, just a short set of additions. I cannot see how there could be any catastrophic cancellation.
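To make the pre- and post-additions concrete, here is one level of Strassen in the classic indexing (the post's own numbering of the P_i may differ), with the error of a float32 run measured against a float64 reference; a rough experiment of mine, not the paper's code:

```python
import numpy as np

def strassen_1level(A, B):
    """One recursion level of classic Strassen; the seven quadrant
    products use the ordinary algorithm. Pre-additions (e.g. A11 + A22)
    happen before the products; post-additions combine the products."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    P1 = (A11 + A22) @ (B11 + B22)   # both operands pre-added
    P2 = (A21 + A22) @ B11           # one operand pre-added
    P3 = A11 @ (B12 - B22)
    P4 = A22 @ (B21 - B11)
    P5 = (A11 + A12) @ B22
    P6 = (A21 - A11) @ (B11 + B12)
    P7 = (A12 - A22) @ (B21 + B22)
    return np.block([[P1 + P4 - P5 + P7, P3 + P5],
                     [P2 + P4, P1 - P2 + P3 + P6]])

rng = np.random.default_rng(2)
n = 256
A = rng.random((n, n), dtype=np.float32)
B = rng.random((n, n), dtype=np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)   # reference result

err_classic = np.abs(A @ B - ref).max()
err_strassen = np.abs(strassen_1level(A, B) - ref).max()
print(err_classic, err_strassen)   # Strassen's error is typically larger
```

Note that P1, P6, and P7 carry the factor 4 (both operands pre-added) while P2 through P5 carry the factor 2, which is exactly the 2 + 2 + 4 + 4 accounting above.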

In Bini and Lotti's work, they took these two ideas and provided a generalization.

Enough food for thoughts.

Just to make a little sense of the legend:

—WOPT: Winograd’s algorithm with fewer MAs.

—WIDEAL: Winograd’s algorithm optimized for a pipeline execution (but not pipelined).

—GOTOS: MM implementation as available in GotoBLAS.

—BLAS MM or MM only: MM implementation row-by-column (this is used in the error analysis only).

—GOTOS 3M: 3M algorithm as available in GotoBLAS where matrices are stored as
complex matrices.

—3M (GOTOS/WINOGRAD/STRASSEN/ATLAS): our implementation of the 3M algorithm as presented in Table II, where MM is implemented as STRASSEN, WINOGRAD, GOTOS, or ATLAS and thus complex matrices are stored as two distinct real matrices.

We believe this is the first attempt to show how all these algorithms behave in practice and how they affect the final result and its error. We do not justify the blind use of fast algorithms when accuracy is paramount; however, we want to make sure that fast algorithms are not rejected just because of an unfounded fear of their instability.