Pratical, Theoretical, or Plain Frustrating

In the previous post, I talked about a new Winograd’s algorithm and its novelty. In this post I will talk about its practical benefits.

I understand that a few performance advantages are really subjective, in the sense that they may not really add up to the level of practical advantages.  However, it is a fact that having a balanced computation is a necessary step to exploit the best speed up for the Winograd algorithm. In other words, if an algorithm is not balanced, it performs more computations and thus will be eventually slower. An educated question is “so what?“, it is really adding any performance? This question is an old one, since the appearance of Strassen’s algorithm researchers wondered and tested when these algorithms should be applied: you may find the same question reformulated for the taste of the audience: is it a theoretical result (good only on paper), a practical result  (it find application in real problems).

Winograd’s algorithm has two clear curses:

  1. Numerical, a few have the misconception that the algorithm is numerically instable, I will have a post in the future.
  2. Practicality, a few believe that the performance advantages are available only for either galactic problem sizes or for all sizes.

I use the  term galactic from Lipton Blog: http://rjlipton.wordpress.com/2010/10/23/galactic-algorithms/#comments

Today I will talk about the latter and probably the easier to discuss: is Strassen–Winograd’ss algorithm any practical? If you have time and look up the papers by who really implemented and tested the implementation of Strassen–Winograd’s algorithms, the results are often contrasting. There are two reasons: in time the hardware is becoming more and more complex and the efficient implementation of MM is trying hard to follow up the architecture evolution. Remember the question boils down to find the smallest problem size for which Fast MM are faster than the regular MM (and for regular I mean the high performance MM).

Let us try to go back in time and see how the change of HW may affect the performance of the classic MM and the Fast MM. In the past, machines had little memory and thus the problems they could compute were small. They had a relatively flat memory and slow. They had relatively a few registers if any, and often the operations of multiplication and addition were done in software without any dedicated HW (floating point units). In this scenario, multiplication were very expensive, while additions dirt cheap.  Fast MM addressed two main concerns: First, if you really wanted you could eliminate multiplications and thus simplifying  the HW computations;  Second, the memory was flat and expensive, if we could  reduce the number of operation, we could reduce the memory access and the time spent reading data. The real draw back for Fast MM was the need for more space (in registers and memory). So if the problem was a fit in memory, Fast MM were appealing and useful. Because of the cost of multiplications, Fast MM found application for extremely small problem sizes. In the literature, you may find that Fast MM were always applicable.

With integration and the event of the co-processor designed to perform complex operations in HW, the cost difference between an addition and multiplication became smaller, much smaller, but there was a small difference any way. However, the memory did not change much. In such a scenario, the reduction of operations was beneficial even for small problems but the difference shrank.  In this new architecture, communication among processor and co-processor start counting towards the total execution time.

When the co-processor became  the functional-unit within the processor (Floating Point Unit), multiplication cost was pretty much the cost of an addition.  In the previous architectures, Fast MM could trade off multiplications with additions directly. The benefits was immediate. Now the trading of the operations is not so appealing.

With the FPU, the bottle neck became the register file, the communication means between memory and the FPU. Fast MM use more space and thus more registers.  But if the registers are not enough and spills to memory happen, we pay dearly for those additions and now those addition cost as much as multiplications.  Fast MM will kick in only when the number of operations offset for the costs, now we need a problem size in the tens.

If you play with an O() notation and count the number of operations such as multiplication and additions and they cost equally, say 1, you will find that the problem size should be at least 12 to break even. In practice, you must have a larger problem because the possible spill costs.

The memory is still basically flat, thus counting operation works just fine and the O() notation is elegant and simple. The possible spills are expensive but if we are careful.

Notice that In this scenario, we start trading off Matrix Multiplications with Matrix Additions. We must start taking advantage that we trade a single Matrix Multipliciation O(N^3) with 15-18 Matrix Additions O(N^2). Of course, as side effect, we perform fewer multiplications as in the previous cases, but we need a problem size for which this is advantageous. Just for fun, for N=10, the cost of a MM is 1000 and the cost of a MA is 100.  N=13 seems a good start for searching the right break even point.

With the integration of caches and memory hierarchies in a processor, the situation has  got tricky. If an application reuses data and these data are stored in caches, in practice we reduce the waiting time to access: we pay dearly for the first access, but then we can amortize the total cost because the following accesses are cheap: 3 cycles.

The MM if implemented carefully can exploit optimal reuse and thus can take advantage of caches extremely well. With caches and deep pipeline, MM can reach peak performance.

Fast MM are penalized by memory hierarchy because MA have no data reuse and their performance is basically bound to the throughput of the memory level where the data are stored. In my experience, caches are rarely big enough so that Fast MM and MM are competitive. In my experience, only when in memory FastMM can start kicking some FLOPS.

Again, Fast MMs trade off one MM for 15-18 MAs. The break even point for modern architectures varies. For one architecture, 4 core 2 processors 3.2 GHz Xeon Based system, Fast MM and MM break even at N=7,000.This system has a very poor memory bandwitdth one bus and a single common memory. Thus, a break even point of 7,000 says a  lot about the architecture more than about the algorithm. For other architectures with the same or better peak performance, the break even point is in the range 2,000 — 3,500.

An interesting question is related the future architectures. How FastMM will face off against the future Multicore and multi-processor systems? What about GPUs?

About GPUs I have no clue because I am not an expert. But if they do not have an memory hierarchy or caches are big enough, I can see application and great potential for FastMM even for moderate problem sizes. About multicore, we will need larger problem sizes. What I like is that with the appearance of larger machines, there is  the tendency of solving larger problems. In such a case, what is practical and what is not will be defined by the application where MM are used.