AFDS 2012: a comment

I found the symposium refreshing and I had a lot of fun. My presentation slides and the paper are available here on this site if you like; the talk was also recorded and is available on demand through the digital symposium framework.

It was fun because I could see things done instead of just talked about, in the experience pavilion. The food was good. The coffee was plenty. The people were interesting and sharp. The technical sessions may have less academic appeal than, let's say, the famous conferences. But you could feel a sincere attempt to create a community around these new machines and the tools to develop software for them.

I personally liked it and I attended just two days. Unfortunately, I could not stay longer.

About Matrix Multiplication, there were three other technical sessions besides mine: mostly GPU-related talks with incredible performance. My session was the only one that addressed all the computational engines and presented a little about Fast Matrix Multiplication. I am happy that I refrained from presenting yet another implementation of FastMMW for this architecture. My main goal was to present a case for an algorithmic solution to heterogeneous computing of MM. Presenting Fast MM would have been a step too far. Let us learn to walk before running.

An attendee was so kind as to share his opinion that Winograd's is not worth pursuing, because of performance and because of its numerical instability. First, about performance: the new GPUs have a single, in-hardware multiply-add operation, cutting the operation count in half and making any fast algorithm less appealing. Second, the instability part was not really well explained, but it was firmly embedded in his mind.

As a response to the performance caveat: in the past, I worked with architectures where multiply-add instructions were sped up in hardware by avoiding register-file access between the multiply and the addition, in practice using fewer cycles for the operation and supposedly improving accuracy. For those machines, I applied Winograd successfully. Matt Badin currently has an implementation for an NVidia GPU with fused multiply-add where Fast MM is applicable. I would say that it is architecture dependent, but it is not a good reason to dismiss fast algorithms.

As a response to the instability caveat: if you want to ignore the work I did, just Google the following: “fast matrix multiply is stable” and argue with Prof. Demmel and, before him, with Prof. Higham. Please, leave me alone. However, let me ask this question: did anyone measure the error of the GPU implementation of MM?

OpenCL + APU + GPU + Fast Matrix Multiply

Next month, I will be going to Bellevue to present a short work about matrix multiplication and heterogeneous computing at the AMD Fusion Developer Summit. I had better prepare my presentation because it will be recorded and made available. I just posted the paper online; search my bio or Google the title above.

I have tested several architectures during these last seven years, since I started working on Winograd's Fast Matrix Multiplication (fastMM). Nowadays it is the age of GPUs as computational accelerators, and I did not have much hands-on experience. So, to learn something new, last year at this time I started considering working on APUs. In fact, during a conversation with Matteo Frigo, the topic of processors with cores and a GPU came up: the so-called APU. At least, AMD calls them APUs. The underlying question was whether or not Fast MM could be used for APUs. The short answer is yes, but this is not my point for today.

In my previous posts, I talked about the recent work on MM and GPUs. Winograd's algorithm can be applied to GPUs and it is only a question of problem size. It may also be beneficial for error analysis; yes, you read that right. But APUs are hybrid, and they are not a separate unit with astronomical GFLOPS. You may use the CPUs (e.g., 4) or the GPU. The A8 processor I had the luck to test provides this dichotomy, and each part may provide 100 GFLOPS; if you add an equivalent external GPU, as I did, you will have a system with about 200 GFLOPS peak performance for Matrix Multiplication.

This may look like a wimpy system; I have received this comment already. In fact, to publish anything today you may have to use a GPU with 1 TFLOPS peak performance. Otherwise, it is not supercomputing. I have a different opinion, which is based on the economy of the system: my build has the same performance as a Cell processor, the same cost as a PS3 PlayStation, the same power consumption, and it is quiet enough that I can write this post on it while listening to Ravel's Bolero. Think about a supercomputing centre without the annoying whistle, air conditioning, and electric bill. The washing machine is making more noise.

From the beginning, one year ago, I built a system (a new thing for me), I learned a new framework, I looked into the code written for ATI's GPUs, and finally I started looking at how I could write Fast MM for this parallel system. Having the GFLOPS/$ of a system built on a Cell, there is no need to brag about performance. An APU takes advantage of the software written for x86_64 architectures, with which I am very familiar, and of the style of programming for GPUs. Unlike with a Cell, I do not have to write new code for this and I can reuse the optimal code already written. I used OpenCL to glue everything together. It required some time due to the learning curve, but if I could do it, any younger developer will be able to, actually faster.

Once I was ready, my interest and curiosity moved from the development of fast MM (i.e., Winograd's) to understanding the new challenges of such a system. In particular, I was very curious about using all these computational units: the CPUs, the internal GPU, and the external GPU. I wanted to write codes that exploit the system but are easy to maintain, just in case I unplug any of the chips.

From the paper the conclusions read as follows:

4.4. Conclusions
In our system, the APU provides a software solution using only CPUs that can achieve 90 GFLOPS (GotoBLAS). If we would like to improve performance by just working on a different and fast algorithm, we can achieve 120 GFLOPS. If we take advantage of both GPUs, we can achieve sustainable performance of about 160 GFLOPS (and a peak performance of 200 GFLOPS). This is a first attempt at putting together an OpenCL solution for the implementation of MM using hybrid parallel systems. The heterogeneous system presents interesting challenges and, thanks to the OpenCL API, ultimately a flexible and powerful environment.

The communications to and from memory are the bottleneck of the system. In theory, that is without communication costs, we could use the CPUs, the internal GPU, and the external GPU concurrently. Once communication costs are included, the utilisation of the system is lower.
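
To make the division of labour concrete, here is a minimal sketch (in Python, not the paper's OpenCL code) of the scheduling idea: split the rows of C across the devices proportionally to their assumed sustained rates. The device names and GFLOPS figures below are placeholders for illustration, not measurements.

```python
# Sketch: split C = A*B by rows across heterogeneous devices, proportionally to
# assumed sustained rates. NOT the paper's OpenCL implementation; the rates and
# device names are placeholders only.
import numpy as np

def split_rows(n_rows, rates):
    """Assign contiguous row blocks proportionally to each device's rate."""
    total = sum(rates.values())
    splits, start = {}, 0
    for i, (dev, r) in enumerate(rates.items()):
        count = n_rows - start if i == len(rates) - 1 else int(round(n_rows * r / total))
        splits[dev] = (start, start + count)
        start += count
    return splits

def hetero_matmul(A, B, rates):
    """Each 'device' computes its row block; here numpy stands in for the kernels."""
    C = np.empty((A.shape[0], B.shape[1]))
    for dev, (lo, hi) in split_rows(A.shape[0], rates).items():
        C[lo:hi, :] = A[lo:hi, :] @ B   # on a real system: enqueue on device `dev`
    return C

rates = {"cpu_gotoblas": 90.0, "apu_gpu": 100.0, "discrete_gpu": 100.0}  # assumed GFLOPS
A, B = np.random.rand(512, 512), np.random.rand(512, 512)
print(np.allclose(hetero_matmul(A, B, rates), A @ B))
```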

Please, take a look at the paper OpenCL+APU+GPU+Fast Matrix Multiply or Google it.

On the comparison of time series (a little detour from fast Matrix Multiply)

This week, I self-published a paper about non-parametric methods for the comparison of series: for the detection of anomalies or for the mining of similarities in (time) series.

Yep, I meant self-publishing. If the contents of the paper belong neither in a journal nor in a conference, it is my fault more than the machine-learning community's. The paper is long, counting 65 pages; it has too many experiments, collecting 22 figures, as well as too much notation, 40 equations, and about 100 references.

This paper has been a work of passion and love. Since I started working for the Search Engine group within Yahoo! in 2007, I have been in love with statistics and its applications, in particular how to measure differences between two processes (stochastic or not). My background is engineering, where a communications class is a must; probability is fundamental in signal processing and communications; statistical tools, not so much. But in this new environment, so many engineering processes can be described as stochastic processes, and not necessarily because they are. Tools were and are extremely scarce.

We started by applying FIR and IIR filters such as moving averages and exponential moving averages with seasonal trends for one-dimensional series. We also used stochastic measures applied to probability distribution functions (PDFs), such as entropy. See, these methods are close to my digital signal processing roots. At the beginning the challenge was to build software and to use it according to fuzzy requirements. Our customers were different and their intents very different.

After two years, we had a monitoring tool handling quite a few big systems, and it was a pleasure working with the engineers who built the search engine.

I will be forever grateful to the managers who let me go deeper and work on and off on multidimensional series. For multidimensional series the methods are quite a bit more challenging and publicly available tools are even scarcer.

Compression methods, conformal prediction, and kernel methods are a few that really attracted me, because of the people working with me: Prof. Kifer suggested Vovk's work, that is, conformal prediction; Prof. Smola was the king of kernel methods (both working in Y! at that time); and compression because it was called The measure at a SIAM conference. From my previous work, I wanted to investigate again the application of PDFs and CDFs to multivariate statistics. For all of the above methods, there is a very rich literature spanning 40 years of research, and very talented people came up with brilliant solutions. At the beginning, I wanted to enrich our stochastic library with these methods and, if I could, with something original from my left brain.

Implementing these methods is actually much harder than reading and understanding the papers. Without the help of the original authors I would have failed to reach any results; furthermore, considering that this process helped me come up with my own way to compare stochastic processes, I must be, and I am, very grateful.

My methods apply a classic idea in statistics: I create an order among the elements in the multidimensional space and then apply CDF measures. The tricky part is the order. The PDF and CDF measures are not trivial, and we have a little contribution here as well. The methods boil down to computing a topological order of the samples and then computing a CDF based on this topological order. The order can be achieved by sorting algorithms (you may be surprised how little is available for sorting multidimensional vectors) or by clustering algorithms (we used Dijkstra's).
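
To give a flavour of the order-then-CDF idea, here is a minimal sketch, not the method from the paper: the pooled samples are ordered by a placeholder key (distance from the pooled mean, standing in for the topological order), and the two empirical CDFs along that order are compared with a KS-like statistic.

```python
# Sketch of the order-then-CDF idea (not the paper's exact method): impose an
# order on the pooled multidimensional samples, then compare the two empirical
# CDFs along that order. The ordering key here is a placeholder.
import numpy as np

def ks_like_statistic(X, Y):
    """X, Y: (n_x, d) and (n_y, d) samples. Returns the max CDF gap along the order."""
    pooled = np.vstack([X, Y])
    labels = np.array([0] * len(X) + [1] * len(Y))
    key = np.linalg.norm(pooled - pooled.mean(axis=0), axis=1)  # placeholder order
    order = np.argsort(key)
    cdf_x = np.cumsum(labels[order] == 0) / len(X)   # empirical CDF of X along the order
    cdf_y = np.cumsum(labels[order] == 1) / len(Y)   # empirical CDF of Y along the order
    return np.max(np.abs(cdf_x - cdf_y))

rng = np.random.default_rng(0)
same = ks_like_statistic(rng.normal(size=(500, 3)), rng.normal(size=(500, 3)))
diff = ks_like_statistic(rng.normal(size=(500, 3)), rng.normal(1.0, 1.0, size=(500, 3)))
print(same, diff)   # the statistic should be larger for the shifted sample
```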

Note: the topological order is basically a breadth-first visit of a graph and, strangely, I never remember how to do that during interviews. I swear I wrote my own version for directed acyclic graphs for this library; however, it seems I cannot keep it in my long-term memory.
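
For the record, here is a minimal textbook version of that visit (Kahn's algorithm for a DAG), not the library's code:

```python
# A minimal topological sort for a DAG (Kahn's algorithm), i.e., the
# breadth-first-flavored visit mentioned above.
from collections import deque

def topological_order(nodes, edges):
    """nodes: iterable of hashables; edges: list of (u, v) meaning u -> v."""
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in succ if indeg[n] == 0)   # sources first
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(indeg):
        raise ValueError("graph has a cycle")
    return order

print(topological_order("abcd", [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```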

With so much work done, with so much research done, I am in a strange and unfortunate spot: I am biased about the contribution of the paper and I do not want to compromise my intentions. I could shorten the paper, simplify its contents and presentation, just present a classic comparison paper, and provide a limited version of the code. But I do not want to do that: when I started this work, I wanted to find a reference like this paper, an attempt to provide not only ideas but also practical insight into the methods through constructive observations.

At the end of the paper, I provide the reviews from a journal where I submitted the paper and it was rejected. I do this for my orphan papers, and I do it not because I am upset or seeking revenge: I want to present the reviews because these are smart people who gave their time to help me improve the paper, and their opinions must be told. I often disagree with reviewers, but I really appreciate and respect their work.

The paper is available in my bio http://www.fastmmw.com/wp-content/uploads/2011/04/seriesPaolo.pdf.gz and in arXiv  http://arxiv.org/abs/1205.1880

On GPUs and Matrix Multiplication (ICS 2012)

This year, I was very lucky to be a member of the International Conference on Supercomputing program committee. This is my first time on the other side of the barricades. I have a new appreciation for the organization, selection, and reviewing of technical papers at a conference. In the future, I will be careful about uttering words such as “Those idiot reviewers, they do not understand shit ..”. If you wrote a paper and have been rejected, at a minimum you have thought those lines.

I have been reviewing technical papers for quite some time now. I wrote a few myself. Rejection is a personal affair. If you submitted and got rejected, I am sorry.

Now that the conference program has been published, I can talk about it: look for the GPUs, CPUs, and Linear Algebra session in the conference program. I can talk about it because their intent is similar to mine: understanding how GPUs and CPUs must work together. By the way, I will present my own work at the AMD Fusion12 Developer Summit (June 11-14, Bellevue WA); I will talk about it as soon as I am back from the conference.

I am fascinated by how these GPUs are becoming ubiquitous and how only recently the research community is taking notice of how to use these SIMD computation engines in combination with multi-core multiprocessor systems. I have mixed feelings about GPUs and the body of work about them: these are marvelous computing systems that can outperform any multi-core processor on a set of applications. But let us remember that not all applications are equal, and sometimes you would not touch a GPU with a barge pole.

Last year, I spent less than $500 and built my own AMD machine with an APU and a GPU (16GB 1866 MHz), and I could write Fast MM codes in less than 6 months. This was my first build, I did not know anything about OpenCL and GPU code, and pretty much I am an idiot: this means that the tools and the examples available to code for these complex systems are pretty good. My build configuration has a practical peak performance for my codes of about 200 GFLOPS (only the Cell processor could achieve the same performance for the buck). But if you are willing to spend for a state-of-the-art GPU, you may well build a sustainable 1 TFLOPS machine (see the ICS session).

I finally witness the study of the practical cases where the cost of moving data to and from memory is accounted for. Often, in the literature and in the available code, time is measured within the GPU; that is, the execution time is measured between the start and the end of the GPU execution, when the data are already available in the internal memory. In general, the data are not in the GPU; they are at least in memory. Often the problem size is so big that it may not fit into the internal 2-4GB memory, and we need to break up the computation. When there are multiple GPUs, or a GPU and a CPU, there may be contention, and the computation must account for it.
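
A back-of-the-envelope model makes the point; the kernel rate and bus bandwidth below are assumptions for illustration, not measurements of any specific card.

```python
# Effective GFLOPS for C = A*B when the 3*n^2 matrices must cross the bus.
# kernel_gflops and bus_gb_per_s are assumed numbers, not measurements.
def effective_gflops(n, kernel_gflops=400.0, bus_gb_per_s=6.0, bytes_per_elem=4):
    flops = 2.0 * n**3                                    # multiply-adds of the classic MM
    t_compute = flops / (kernel_gflops * 1e9)             # time with data already on the GPU
    t_transfer = 3 * n * n * bytes_per_elem / (bus_gb_per_s * 1e9)  # A, B in and C out
    return flops / (t_compute + t_transfer) / 1e9

for n in (1024, 4096, 16384):
    print(n, round(effective_gflops(n), 1))   # transfers hurt the smaller sizes the most
```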

Once again, communication is the bottleneck.

Take a sneak peek at the conference schedule http://verona.dei.unipd.it/ics2012/program.html and consider going; Venice is lovely in June.

Improving the Accuracy of High Performance BLAS Implementations using Adaptive Blocked Algorithms

That is a long title, uh? It is actually the title of a paper by Matt Badin, in which I have a small contribution, if you like. Accuracy of a Fast MM is always a problem and I have already written a post about it: The Unbearable Lightness of Being Wrong. I have to say that the previous title is so much more fun and this one is just too academic. Bottom line: we can and we do improve the accuracy of Fast MMs.

If you read the previous post, with lots of notation, formulas, and a set of conclusions, then this post will be easy. If you did not, please go read it, I can wait, and come back to this post.

As a reminder, Fast MMs perform gathering operations by adding sub-matrices of the operands A and B together, perform matrix multiplications recursively, and then combine the partial results. Let us call the additions of the operands pre-additions and the combinations of the results post-additions.
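
To fix ideas, here is one level of a common Winograd variant written so that the pre-additions and post-additions are explicit. This is the textbook formulation, not necessarily the exact schedule used in FastMMW, and the sub-products call plain numpy instead of recursing.

```python
# One level of a common Winograd variant (7 multiplies, 15 matrix additions),
# written to make pre-additions and post-additions explicit.
import numpy as np

def winograd_step(A, B):
    """A, B: square matrices with even size; the sub-products use plain numpy."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    # pre-additions (on the operands)
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
    # the seven sub-products (recursive in the real algorithm)
    M1 = S2 @ T2; M2 = A11 @ B11; M3 = A12 @ B21
    M4 = S3 @ T3; M5 = S1 @ T1; M6 = S4 @ B22; M7 = A22 @ T4
    # post-additions (combining the partial results)
    U1 = M1 + M2; U2 = U1 + M4
    C11 = M2 + M3; C12 = U1 + M5 + M6; C21 = U2 - M7; C22 = U2 + M5
    return np.block([[C11, C12], [C21, C22]])

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.max(np.abs(winograd_step(A, B) - A @ B)))   # forward error vs. plain numpy
```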

The numerical error that everybody hates is the so-called cancellation: when two large numbers of opposite sign are added, the result is so small that, in finite precision, most of its significant digits are lost (possibly all of them, reducing it to zero). This cancellation error is frightening for a few. How does this error come forth in Fast MM?

Pre-additions: if we add two matrices, the probability that two addends are of opposite sign and of the same magnitude is really small, but it is possible. Considering that, for large matrices, we are actually playing Russian roulette, we are bound to hit the target. Most commonly, the error that we commit in a pre-addition is the common addition error. I can see that there can be error, but there is little to be done. Pre-addition cancellations are anomalies too difficult to eradicate.

  • The depth of the computation is small, we perform one addition. There is no trick to compensate for the error.
  • The problem is that the error will be injected into a computation with a much larger sequence of operations: it will be injected into a MM if we stop the recursion right there, or it will grow and be amplified even further through the recursion.

Post-additions: here comes the pickle. Now that we have performed our MMs, we must perform the post-additions, and in exact arithmetic we actually want perfect cancellation. We may not have perfect cancellation because of the error injected in the pre-additions (which is small) and because of the error of the MM computation itself, which is at most linear in the problem size (N). The post-additions can do very little to correct the error, because we perform only a few of them (but on large matrices).

  • The depth of the computation is small; we cannot really compensate for the error.

MM hint: the key point of this post is that, to improve the accuracy of Fast MM, the only place where we can do something is the accuracy of the kernel MM. Funny, isn't it?

In the paper, Matt does a good job describing the basic idea that the MM kernel should be broken up, blocked, so as to reduce the effect of the summation error. This is actually known in the literature; just look up any reference about blocked linear algebra applications and stability. The novelty is in achieving optimal accuracy and optimal performance together, about which there is very little work. I do not really want to give away the punch line, but a top-down algorithm with what is called a post-load addition will provide both accuracy and speedup.
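
The effect the blocking builds on can be seen on a single dot product; the sketch below is not Matt's adaptive algorithm, only the textbook observation that summing by blocks shortens the error accumulation chain.

```python
# Not the paper's adaptive algorithm: just the effect it builds on, i.e.,
# blocking the long inner summation of a dot product shortens the error
# accumulation chain. Errors are measured against a float64 reference.
import numpy as np

def dot_naive(x, y):
    s = np.float32(0.0)
    for a, b in zip(x, y):
        s += a * b                      # one long chain of float32 additions
    return s

def dot_blocked(x, y, block=64):
    partial = [np.dot(x[i:i+block], y[i:i+block])   # float32 block sums
               for i in range(0, len(x), block)]
    return np.float32(sum(partial))     # short chain: sum of the block sums

rng = np.random.default_rng(0)
x = rng.random(1 << 16, dtype=np.float32)
y = rng.random(1 << 16, dtype=np.float32)
ref = np.dot(x.astype(np.float64), y.astype(np.float64))
print(abs(dot_naive(x, y) - ref), abs(dot_blocked(x, y) - ref))
```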

Matt shows that with the right kernel, we can improve the accuracy of Fast MM.

For example, if you are working with probability matrices, doing lots of MMs to determine the evolution of a probabilistic system, then you should work with Winograd's algorithm with this adapted kernel. You will have a faster and more accurate result than using ATLAS, GotoBLAS, MKL or my Uncle's GEMM.

Fast MM will provide the speed; the modified MM kernel will provide the accuracy.

For other problems, and for Strassen's Fast MM, this approach works as well. You will have a faster algorithm but not the most accurate one; however, it will be much closer to what you could have with the most accurate GotoBLAS GEMM.

Take a peek at the paper, it has been published recently.

On the Fast MM for GPUs

This month, I read a paper about a Strassen-like MM for a GPU, published in a distributed computing conference. The paper is academic, but it can be read quickly and it provides what it promises: the implementation of both the Strassen and Winograd algorithms on a GPU. I am always pleased to read that the community is starting to investigate Fast MM, and in particular Winograd's; this means that the HW is reaching its performance limit and an algorithmic view kicks in. Practitioners and more theoretical researchers start using old algorithms in this new arena and claim performance and error results: being a new arena, GPUs are nice little computational engines, and they allow a freedom that is rarely provided. It is like a field covered with fresh snow, and we are the first stepping on it.

The literary liberty of comparing GPUs to a white snow field is a big one. GPUs are the current computational engines, and godlike scientists such as Prof. Dongarra and Prof. Demmel, together with quite smart researchers, have been providing powerful Matrix Multiplication kernels for quite some time now. Some kernels are designed to be fast and others to be more accurate and less fast. I noticed that, because the computational engine is new, the algorithm is simplified: the thread computations are simple row-by-column computations. It makes sense: these engines allow lots of threads, and to make them independent the easy way out is tiling matrix C into independent computations. But this is not the core of my train of thoughts.

In the article that is the inspiration for this post, the authors show a 30% performance improvement and a 200x error for both the Strassen and Winograd algorithms. The authors know what they want to show and provide a set of experiments to confirm it: they use power-of-two matrices to show performance and error analysis.

What is wrong with that? Really, nothing. Powers of two are useful: the code is simple, the problem size doubles every time, and the number-of-operations complexity increases by 8. Everything is clean and easy. It is easy to explain and the message is simple to understand. If you need performance and you are willing to lose 2 digits, the least you can achieve is just a 30% speedup on matrices of size 16Kx16K. Sorry, I forgot to tell you the problem size. It is big, uh?

Such a result and such a paper will scare any engineer working on the implementation of MM kernels: you need a huge problem size; when you do have a large one, instead of 100 seconds the computation will take 70 seconds on these machines, and you will have the result correct to the second position. This is disarming, right? I bet that the final conclusion any reader will reach is negative. The reader will not care to use any fast algorithm; simply put, it will not be worth it.

I think this is the curse of being the first stepping on the fresh snow: if no new snow covers the path traced, most engineers evaluating their options, especially those who will not have time to read other papers or evaluate the performance on their own, will have to skip Fast MM.

What frustrates me is the too-narrow punch line. Despite the classic scientific approach, theory, algorithms, implementation, and experimental results, I know that the paper presents one side of a coin, and the other side is right there, waiting patiently to be told. Furthermore, I know it has been told several times previously.

Matrix Multiply, All-Pairs Shortest Path, and Large Dense Graphs

Take a node in a graph and consider it as a user in Twitter. A link between two users is the minimum retweet time. The link represents the close connection between users. You may be interested in computing the distance among all users. Why? To determine clusters, to find the closest knit of friends, to find relations between celebrities, or because WTF we can.

In the era of large graphs it may be interesting to compute distances among nodes. If a weighted edge represents the distance between two nodes, we could compute the all-pairs shortest path (APSP) so as to know, for any node, its distance from any other node in the graph. APSP and MM are computationally equivalent (“The Design and Analysis of Computer Algorithms”, Aho, Hopcroft, and Ullman, 1974; this is a great book to have anyway, I stole my advisor's copy), and we have shown that, incidentally, they are the same. I will clarify in this post.

Recursive algorithms are extremely appealing for the design of matrix computations because of their natural exploitation of cache locality and their intuitive formulation. In this section, we present a recursive D&C algorithm derived from Kleene's algorithm, Figure 1.(a). Notice that Kleene's algorithm was originally designed to solve the transitive closure (TC) of an adjacency matrix, that is, finding whether or not there is a path connecting two nodes in a directed graph. However, Kleene's algorithm is also a standard algorithm/solution for the APSP. In fact, in a closed semi-ring TC and APSP are the same problem and Kleene's algorithm is a solution (when the scalar operators ∗ and + are specified as in the following paragraph), and it determines every edge of the (shortest) path directly [2].

A brief description of Kleene's algorithm follows. We divide the basic problem into two sub-problems, and we solve each sub-problem directly using the Floyd-Warshall algorithm, see Figure 1.(a). Then, we perform several MMs to combine the results of the sub-problems using a temporary matrix, where the scalar addition of two numbers is actually the minimum of the two numbers (i.e., a + b = min(a, b)) and the scalar multiplication of two numbers is the regular arithmetic addition (i.e., a ∗ b = a + b).
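
For readers who want to play with the semi-ring, here is a minimal sketch of the (min, +) product and of the power method mentioned below; it is not the R-Kleene code from our paper, just the textbook alternative it improves upon.

```python
# The (min, +) product described above, plus the power-method APSP (squaring
# the distance matrix ceil(log2(n-1)) times). A sketch, not the paper's code.
import numpy as np

def minplus(A, B):
    """Scalar '+' is min, scalar '*' is the arithmetic addition."""
    # A[:, :, None] + B[None, :, :] has shape (n, k, m); reduce over k with min.
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def apsp_by_squaring(D):
    """D: matrix of edge weights, np.inf where there is no edge, 0 on the diagonal."""
    n = D.shape[0]
    for _ in range(max(1, int(np.ceil(np.log2(n - 1))))):
        D = minplus(D, D)
    return D

INF = np.inf
D = np.array([[0, 3, INF, 7],
              [8, 0, 2, INF],
              [5, INF, 0, 1],
              [2, INF, INF, 0]], dtype=float)
print(apsp_by_squaring(D))   # all-pairs shortest distances
```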

Figure 1.(a), Figure 1.(b), Figure 1.(c)

In this case, the MM is defined in a closed semi-ring and, using the properties within, we reorganize the algorithm in Figure 1.(a) and obtain the R-Kleene algorithm in Figure 1.(b). Thus, R-Kleene is a solution for APSP in a closed semi-ring. Please note that in the literature it is suggested to use a power method based on MM (log(n) times), but as you can see it is not necessary.

We can say that the APSP can be formulated as a self-matrix multiplication. This finding is not that interesting per se; what matters is that MM is highly parallel, while APSP was known as a really sequential computation, which made it not very compelling for large graphs.

So, if you would like to compute the shortest paths across all nodes in a large, relatively dense graph using a highly parallel algorithm, R-Kleene + MM is the algorithm for you, hands down.

What is the connection with Fast MM?

Fast MM is in general NOT applicable to speed up this MM. This is because Fast MMs deploy cancellation operations (subtractions, or additions of opposites). In the APSP problem, the + operation is a min() operation, which is not invertible. Let me clarify this point: take a + b = c where the operation + is the arithmetic addition; then, having c and either operand, we can get the other: b = c - a or a = c - b. Unfortunately, min() does not have this basic property (in fact, this is a semi-ring, dang it). In the literature, there are examples showing how to extend a few problems from the semi-ring to a ring, making it possible to use Fast MM.

If your links are discrete numbers (integers), there are simple ways to extend the problem to a ring. I would love to see how fast algorithms do here; in this case, there is no numerical error to bother with either.

For more information, please take a look at our Algorithmica paper.

The Unbearable Lightness of Being Wrong


Everyone seems to know that Fast MM based on Winograd-like algorithms cannot be as accurate as the regular MM. A few believe that Winograd-like algorithms are numerically unstable. A few believe that they are stable for any practical purpose. All have very strong opinions. No one has a clear understanding.

In this post, I will talk about the contributions by Prof. Richard P. Brent, who really started the investigation of Fast MM accuracy, Prof. Higham, one of the gods of numerical algorithms, and Prof. Bini, who really brought forth the generalized analysis and estimates for Strassen-like algorithms. To them, I must pay my dues and give credit. If I have any contribution, it is in the empirical measurement of the accuracy of fast MM codes for the large problem sizes where fast MM is actually applicable and useful on modern architectures.

Let me start with the main statement. The forward error bound for the conventional C=AB  MM with matrices of size nxn can be written as follows:

(1)   \begin{equation*} |{\bf  C - \dot{C} }| \leq nu{\bf |A||B|} + O(u^2) \end{equation*}

The term on the left of the inequality represents the component-wise difference, or forward error, between the result matrix and its practical computation. The right side is the bound, and it says that the error is a function of the range of the matrices A and B multiplied by the size of the matrices and by the precision of the architecture u. When the matrices have large values, the error we commit in computing the MM is large and, as the problem gets bigger, the error increases linearly with the problem size. To achieve the bound in Equation 1 the algorithm has to perform at least n^3 multiplications. Fast MM cannot have the same component-wise bound (this has been shown by Miller, 1975) because it computes fewer multiplications. I am reading and digesting Miller's original paper.

In practice, each component of the matrix C is the sum of n terms. Each component is independent and can be computed independently. Optimizations such as tiling break up the computations and interleave them, but very often the original order of the additions remains unchanged (just interleaved with other sums).

If we take the matrices A and B with elements in the range [0,1], we can see that each component of the product |A||B| is bounded by n, and we can easily bound the right side by un^2.
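
This bound is easy to check empirically; the quick sketch below measures the worst component-wise error of a float32 product against a float64 reference and compares it with the un^2 bound (the measured error is typically far below the bound, which is a worst case).

```python
# Empirical check of the bound above: for A, B with entries in [0, 1], measure
# the worst component-wise error of a float32 product against a float64
# reference and compare it with n^2 * u (u = float32 unit roundoff).
import numpy as np

u = np.finfo(np.float32).eps / 2            # unit roundoff of float32
rng = np.random.default_rng(0)
for n in (256, 1024, 4096):
    A = rng.random((n, n), dtype=np.float32)
    B = rng.random((n, n), dtype=np.float32)
    C64 = A.astype(np.float64) @ B.astype(np.float64)   # reference product
    err = np.max(np.abs(A @ B - C64))                   # measured forward error
    print(n, err, n * n * u)                            # error vs. the u*n^2 bound
```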

The main problem with additions is when two large and similar operands cancel each other. For example, in 1.0001*10^5 - 0.9999*10^5 = 20, it may be that, due to the finite representation of the mantissa, the actual computation returns 0 (instead of 20). An error of 20 with respect to the operands is small, but with respect to the result it is very large. You can see how the values of the matrices come into play when something goes wrong. Small matrices will produce small errors (note that we cannot say the same thing about relative errors).

In practice, no component-wise bound is available for Fast MM, and a norm-wise bound is presented instead. A component-wise bound provides an independent bound for each element of the result matrix. A norm-wise bound provides a single bound for all of them. A norm bound is weaker in this sense.

(2)   \begin{equation*} \|{\bf C -\dot{C}} \|\leq n^2u\|{\bf A}\|\|{\bf B}\| +O(u^2)  \end{equation*}

The norm of a matrix A is defined as

\|A\| \equiv \max_{i,j}|a_{ij}|.

The first thing you may notice between the two bounds in Equations 1 and 2 is the n^2 factor. This is because ||AB|| <= n ||A|| ||B||. Again, take the matrices with elements in [0,1] and ||A|| = ||B|| = 1.

By reading the references again, I finally understand the contributions by Brent and Bini, and thus I appreciate their research much better. Let us see if I can convey their message here and shed some insight on the accuracy of Fast MM and its error:

(3)   \begin{equation*} \dot{\bf C} = {\bf C}+{\bf E} \end{equation*}

We want to estimate E or its norm ||E||.

Brent works out a bound like a real mathematician (from the eye of an engineer). He starts with an assumption. Assume that

(4)   \begin{equation*} \|{\bf E}\| \leq uf(n)\|{\bf A}\|\|{\bf B}\| \end{equation*}

and the goal is to find f(n) for the specific algorithm: Strassen, Winograd or any of their variations. Brent uses Strassen and here we follow.

{\bf C}_0 = P_0 - P_2-P_4+P_6 \\ {\bf C}_1 = P_3 - P_0 \\ {\bf C}_2 = P_1 + P_2 \\ {\bf C}_3 = -P_1 - P_3-P_4+P_5

Where

P_0 = ({\bf A}_0 - {\bf A}_1){\bf B}_3 \\P_1 = ({\bf A}_2 - {\bf A}_3){\bf B}_0 \\P_2 = {\bf A}_3({\bf B}_0 + {\bf B}_2) \\P_3 = {\bf A}_0 ({\bf B}_0 + {\bf B}_3) \\P_4 = ({\bf A}_0 - {\bf A}_3)({\bf B}_3 - {\bf B}_0) \\P_5 = ({\bf A}_0 - {\bf A}_2)({\bf B}_0 + {\bf B}_1) \\P_6 = ({\bf A}_1 - {\bf A}_3)({\bf B}_2 + {\bf B}_3)

Of course, the P_i products are computed using Strassen recursively. If we consider for now the product P_0 = (A_0 - A_1)B_3 and we assume that the error bound is true for matrices of size (n/2)x(n/2), writing a and b for bounds on the norms of the sub-matrices of A and B, we have:

\dot{\bf P}_0 = {\bf P}_0+E_0 \\ \|E_0\| \leq u(\frac{n}{2}+f(\frac{n}{2}))(\|A_0\|+\|A_1\|)\|B_3\| \\ \|E_0\| \leq u(\frac{n}{2}+f(\frac{n}{2}))2ab

Note, I do not understand why we have the additive term n/2 in (n/2+f(n/2)); I believe we are using the norm of the products. In exactly the same manner and using the same bound we can state the bounds for P_1, P_2, and P_3. For P_4, P_5, and P_6 we have the following bound:

\|E_4\| \leq u(n+f(\frac{n}{2}))2a2b

Adding up the terms for the submatrices of C, for example, C_0

\|\dot{\bf C}_0 - {\bf C}_0\| \leq \\ \|E_0\| +\|E_2\| +\|E_4\|+\|E_6\| + \\ u(3 \|P_0\|+ 3\|P_2\| +2\|P_4\| +\|P_6\|)

We commit an error in computing the products and in adding them in the same order (the error of P_0 and P_2 affects 3 additions, that of P_4 affects 2). Please work out the details of the formula above, it is a useful exercise.

(5)   \begin{equation*} \|\dot{\bf C}_0 - {\bf C}_0\| \leq u(44\frac{n}{2}+12f(\frac{n}{2}))ab  \end{equation*}

We have the recursive formula f(1) = 1 and f(n) = 44(n/2) + 12f(n/2), and the final solution is

(6)   \begin{equation*} \|{\bf E} \| \leq u 65n^{\log_2 12}\|{\bf A}\|\|{\bf B}\|  \end{equation*}

If we do not carry the recursion down to a problem size of 1 but stop earlier, let us say at n_0, and we apply the recursion k times:

(7)   \begin{equation*} \|{\bf E }\| \leq u 4^kn^2\|{\bf A}\|\|{\bf B}\| \end{equation*}


Brent suggested that each time we apply a recursion step of Strassen, we lose 2 bits of precision (up to three levels are common in my tests, which would mean about 6 bits; in practice it is between 3 and 4).

Working with Higham's bounds and using a few simple manipulations, I could find a lower coefficient: 3^k (instead of 4^k).

As I write this post I appreciate much better Brent's work (1970), way ahead of his time, and even more the elegant and far-reaching work by Bini and Lotti (1980), where they take Brent's analysis and generalize it to families of Strassen-like algorithms, and thus to Winograd's variants.

Interpretation.

Let us start with an observation. We take a matrix product like P_0 and use it in the computation of C_0 and C_1; any error on P_0 will affect both C_0 and C_1 equally. For Strassen's algorithm, each P_i is used twice, in different computations. We will not show it here, but a Winograd variant will have a P_2 that is used in all the C_i. Per se, it is not a big deal. The error of a single product will spread evenly to the results, increasing the unlucky possibility of adding up error from another product. The spread in itself is not dangerous.

Here comes the core of this long post.

I have one number I would like to highlight:

  • The factor 12 in Equation 5.

The factor 12 is the result of four terms: 2 + 2 + 4 + 4. As a rule of thumb, a 2 means that one operand is the addition of two sub-matrices, such as A_0 + A_1 (say). A factor 4 appears when both operands are additions. If we perform an MA prior to the product, we increase the error of the computation because of any unlucky cancellation; that is, the more MAs before the products, the larger the error.

I see two main problems:

  • The algorithms have about 4 MAs after the product computations (2 more than the regular computation). We save operations, but we create a longer computation chain that may, and does, increase the error accumulation. Because we are seeking cancellation, we increase the injection of unwanted error.
  • The pre-MAs may have a significant effect on the error because, even if it is small, it will be amplified by the MM that follows. Unfortunately, this is something I cannot grasp very well. The MA is among the inputs, no fancy computations, just a short set of additions. I cannot see how there could be any catastrophic cancellation.

In Bini and Lotti's work, they took these two ideas and provided a generalization.

Enough food for thought.

Just to make a little sense of the legend:

—WOPT: Winograd’s algorithm with fewer MAs.

—WIDEAL: Winograd’s algorithm optimized for a pipeline execution (but not pipelined).

—GOTOS: MM implementation as available in GotoBLAS.

—BLAS MM or MM only: MM implementation row-by-column (this is used in the error analysis only).

—GOTOS 3M: 3M algorithm as available in GotoBLAS where matrices are stored as
complex matrices.

—3M (GOTOS/WINOGRAD/STRASSEN/ATLAS): our implementation of the 3M algorithm as presented in Table II, where MM is implemented as STRASSEN, WINOGRAD, GOTOS, or ATLAS and thus complex matrices are stored as two distinct real matrices.

—3M (WINOGRAD/STRASSEN) PIPE: our implementation of the 3M algorithm as presented in Table II, where MM is implemented as STRASSEN PIPE or WINOGRAD PIPE, and thus there is software pipelining between MMs and MAs, and complex matrices are stored as two real matrices.

We believe this is the first attempt to show how in practice all these algorithms work and how they affect the final result and error. We do not justify the blind use of fast algorithms when accuracy is paramount; however, we want to make sure that fast algorithms are not rejected just because of an unfounded fear of their instability.

Practical, Theoretical, or Plain Frustrating

In the previous post, I talked about a new Winograd’s algorithm and its novelty. In this post I will talk about its practical benefits.

I understand that a few of the performance advantages are really subjective, in the sense that they may not really add up to practical advantages. However, it is a fact that having a balanced computation is a necessary step to exploit the best speedup of the Winograd algorithm. In other words, if an algorithm is not balanced, it performs more computations and thus will eventually be slower. An educated question is “so what?”: is it really adding any performance? This question is an old one; since the appearance of Strassen's algorithm, researchers have wondered and tested when these algorithms should be applied. You may find the same question reformulated for the taste of the audience: is it a theoretical result (good only on paper) or a practical result (it finds application in real problems)?

Winograd’s algorithm has two clear curses:

  1. Numerical: a few have the misconception that the algorithm is numerically unstable. I will have a post about this in the future.
  2. Practicality: a few believe that the performance advantages are available either only for galactic problem sizes or for all sizes.

I borrow the term galactic from Lipton's blog: http://rjlipton.wordpress.com/2010/10/23/galactic-algorithms/#comments

Today I will talk about the latter, probably the easier to discuss: is Strassen-Winograd's algorithm at all practical? If you have time and look up the papers by those who really implemented and tested Strassen-Winograd's algorithms, the results are often contrasting. There are two reasons: over time the hardware has become more and more complex, and the efficient implementation of MM is trying hard to follow the architecture's evolution. Remember, the question boils down to finding the smallest problem size for which Fast MM is faster than the regular MM (and by regular I mean the high-performance MM).

Let us try to go back in time and see how the changes in HW may affect the performance of the classic MM and of Fast MM. In the past, machines had little memory and thus the problems they could compute were small. They had a relatively flat, and slow, memory. They had relatively few registers, if any, and often the operations of multiplication and addition were done in software without any dedicated HW (floating point units). In this scenario, multiplications were very expensive, while additions were dirt cheap. Fast MM addressed two main concerns: first, if you really wanted, you could eliminate multiplications and thus simplify the HW computations; second, the memory was flat and expensive, and if we could reduce the number of operations, we could reduce the memory accesses and the time spent reading data. The real drawback of Fast MM was the need for more space (in registers and memory). So, if the problem fit in memory, Fast MM was appealing and useful. Because of the cost of multiplications, Fast MM found application for extremely small problem sizes. In the literature, you may find that Fast MM was always applicable.

With integration and the advent of co-processors designed to perform complex operations in HW, the cost difference between an addition and a multiplication became smaller, much smaller, but there was still a small difference anyway. However, the memory did not change much. In such a scenario, the reduction of operations was beneficial even for small problems, but the difference shrank. In this new architecture, communication between processor and co-processor started counting towards the total execution time.

When the co-processor became a functional unit within the processor (the Floating Point Unit), the cost of a multiplication was pretty much the cost of an addition. In the previous architectures, Fast MM could trade off multiplications with additions directly; the benefit was immediate. Now the trading of the operations is not so appealing.

With the FPU, the bottleneck became the register file, the communication means between memory and the FPU. Fast MM uses more space and thus more registers. But if the registers are not enough and spills to memory happen, we pay dearly for those additions, and now those additions cost as much as multiplications. Fast MM will kick in only when the reduction in the number of operations offsets these costs; now we need a problem size in the tens.

If you play with the O() notation and count the number of operations, with multiplications and additions costing equally, say 1, you will find that the problem size should be at least 12 to break even. In practice, you need a larger problem because of the possible spill costs.

The memory is still basically flat, so counting operations works just fine and the O() notation is elegant and simple. The possible spills are expensive, but manageable if we are careful.

Notice that in this scenario we start trading off Matrix Multiplications with Matrix Additions. We must start taking advantage of trading a single Matrix Multiplication, O(N^3), for 15-18 Matrix Additions, O(N^2). Of course, as a side effect, we perform fewer multiplications as in the previous cases, but we need a problem size for which this is advantageous. Just for fun, for N=10 the cost of an MM is 1000 and the cost of an MA is 100. N=13 seems a good start for searching for the right break-even point.
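
A quick count, under the simplifying assumption that every scalar operation costs 1 and only one level of recursion is applied, puts the break-even in the same ballpark:

```python
# Operation count behind the break-even figures above: one level of
# Strassen-Winograd does 7 half-size multiplications plus 15-18 matrix
# additions, versus the 2N^3 - N^2 scalar operations of the classic MM.
def classic_ops(n):
    return 2 * n**3 - n**2                  # n^3 multiplies + n^2 * (n - 1) additions

def one_level_fast_ops(n, mas=15):
    half = n // 2
    return 7 * classic_ops(half) + mas * half**2   # 7 half-size MMs + the extra MAs

for mas in (15, 18):
    n = 2
    while one_level_fast_ops(n, mas) >= classic_ops(n):
        n += 2                              # keep n even so the split is exact
    print(mas, n)                           # break-even lands around N = 12-16
```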

With the integration of caches and memory hierarchies in a processor, the situation got tricky. If an application reuses data and these data are stored in caches, in practice we reduce the waiting time for access: we pay dearly for the first access, but then we can amortize the total cost because the following accesses are cheap: 3 cycles.

MM, if implemented carefully, can exploit optimal reuse and thus can take advantage of caches extremely well. With caches and deep pipelines, MM can reach peak performance.

Fast MM is penalized by the memory hierarchy because MAs have no data reuse and their performance is basically bound by the throughput of the memory level where the data are stored. In my experience, caches are rarely big enough for Fast MM and MM to be competitive at cache-resident sizes. In my experience, only when the data live in main memory can Fast MM start kicking some FLOPS.

Again, Fast MM trades off one MM for 15-18 MAs. The break-even point for modern architectures varies. For one architecture, a Xeon-based system with 2 processors of 4 cores each at 3.2 GHz, Fast MM and MM break even at N=7,000. This system has a very poor memory bandwidth, one bus and a single common memory. Thus, a break-even point of 7,000 says a lot more about the architecture than about the algorithm. For other architectures with the same or better peak performance, the break-even point is in the range 2,000 to 3,500.

An interesting question is related to future architectures. How will Fast MM face off against future multicore and multiprocessor systems? What about GPUs?

About GPUs I have no clue because I am not an expert. But if they do not have a memory hierarchy, or their caches are big enough, I can see applications and great potential for Fast MM even for moderate problem sizes. About multicore, we will need larger problem sizes. What I like is that, with the appearance of larger machines, there is a tendency to solve larger problems. In such a case, what is practical and what is not will be defined by the applications where MM is used.