Fast Matrix Multiplication using AMD clBLAS

Now we can use clBLAS GEMM subroutines within the FastMMW framework, it is something  that I wanted to do for a long time; finally, I have got a chance.

The clBLAS+FastMMW addition  to the project started because I wanted to play with Deep Learning software on my PC. I installed Nervana’s code and Tensor Flow. Oh Boy! My system is ancient. Any training makes my CPU’s fan going nuts. Of course, I do not have any green  NVIDIA GPUs to off load the work, I do not have a TPU either. I have a A10 APU that needs upgrade. The training and testing does take a night of spinning.

OpenCL is an up and coming option for DL software. But there is only one, SINGA, that  provides support for CUDA and OpenCL please take a look at the complete list Unfortunately, I could make work not a single one. I upgraded my Ubuntu from 14 to 16, installed the new amd-pro driver and I tried. Oh Boy, I tried, and tried.

Clearly, I do not understand it and my installation is a Frankenstein’s of packages. Too messy to figure out what is so wrong that TF tests cannot run on my red GPUs. Furthermore, it is not enough to send off  matrix multiplications operations to a GPU:  the code must be optimized. I am  in that kind of business.  OpenCL interface is a good  way to move data but the original BLAS is about an interface to intensive computing routines, clBLAS is the same. There are two reasons I would like to use OpenCL for DL:

  1. Deep Learning is hot and I would like to learn (i.e., after all, My data science profession pays my bills).
  2. GEMM is a work horse in DL,
    1. FastMMW can be  used for CPUs
    2. OpenCL is one open source to unleash GPUs power, but so far FastMMW did not use in its distribution.
      • FastMMW should be used at kernel level, here we use it at Host level.

The GEMM in clBLAS is a set of operations that are tailored to the GPUs (like OpenBLAS or ATLAS for CPUs). So I went back to my experiments and my setting to figure out what I can do to use clBLAS, like I use BLAS.

I have found that I have two platforms: Clover and AMD Accelerated Parallel Processing. Some DL interfaces may have a way to specify the device (by a device number) using a environment variable. But I could not find a way to specify the platform. I realized that  Deep Learning frameworks  using OpenCL must overcome these trivial issues before they are accessible to the rest of us.

What I can do, it  is to give examples how to wrap clBLAS GEMMs so that they can be used similarly as BLAS is used but with whatever GPU you have. I am not completely happy about what I did but it is a solid step forward. In principle, you could use the code to run any GEMM (but float complex) and build a library using OpenBLAS and clBLAS (my experiments will use only one but the libraries will have all).

Yesterday, I finished to integrate the FastMMW code so that I can run sgemm, dgemm and zgemm. For the  cgemm I will need help.

For example: NxNxN problem double complex on a Fiji GPU

N= 1000 GFLOPS COLD     5.4 GFLOPS HOT   45.4
N= 3000 GFLOPS COLD   59.1 GFLOPS HOT 106.6
N= 5000 GFLOPS COLD   98.8 GFLOPS HOT 115.7
N= 7000 GFLOPS COLD 113.6 GFLOPS HOT 119.3
N= 9000 GFLOPS COLD 118.2 GFLOPS HOT 120.7
N=11000 GFLOPS COLD 49.7  GFLOPS HOT 50.0

Larger problems do not fit the 4GB of memory and thus I could use Winograd algorithm to break the problem in subproblems smaller than 10Kx10K.

N = 13000 GFLOPS COLD 112.1 GFLOPS HOT 112.6
N = 15000 GFLOPS COLD 115.9 GFLOPS HOT 116.3
N = 17000 GFLOPS COLD 114.1 GFLOPS HOT 112.3

I use a Pro Duo (Fiji) card and use only one at anytime. This is in line with the peak performance of the system. We do not do any matrix preparation so to make it easier to break the problem into subproblems.

This problem reduction works better for double complex but the idea will work for single precision (float) where we can reach Tera FLOPS (per GPU)

  1000 GFLOPS COLD 3.00     HOT 181.07
  3000 GFLOPS COLD 42.62   HOT 1116.74
  5000 GFLOPS COLD 267.32 HOT 2180.79
  7000 TFLOPS COLD 0.64     HOT 2.57
  9000 TFLOPS COLD 1.02     HOT 2.88
11000 TFLOPS COLD 1.48     HOT 3.25
13000 TFLOPS COLD 1.91     HOT 3.27
15000 TFLOPS COLD 2.31     HOT 3.40
17000 TFLOPS COLD 2.59     HOT 3.43
20000 TFLOPS COLD 0.25     HOT 0.26

The results above are a fit with respect my previous experiments with dedicated code. In my previous post, I have already shown that Fast algorithms are not really applicable for sizes that fit the GPU memory (4GB is you use oneGPU and 8GB if you use both). The performance plot does not reach a plateau where a change of algorithm is beneficial. In this post,  we suggest that if the problem size is larger enough and you are thinking to find a policy to break down the problem down, well Fast Algorithms are competitive and already use a “memory hierarchy oblivious” solution (fancy name for optimal).

On a personal note, I could sell the 295×2 and the funds will go towards the new system (thread ripper) next month. At that time, I will be able to have better memory communication; the CPUs will be competitive and provide a better support for those Matrix Additions (i.e., necessary for fast matrix multiplications). The month of July will bring more opportunities to work on related projects. I should start a github as well.

Upgrading the system to Ryzen/Naples and Radeon new Pro-duo

I am considering upgrading my hardware and thus start a new set of experiments.
I would like to hear suggestions and I will consider any help to raise the funds; that is money, but I am willing to accept also processors (naples), motherboards, memory and  GPUs.  I will help raising the funds by letting go my Pro-duo and my 295×2 (this  is even better than the Pro-duo in double precision) if necessary.

So if you have ideas or you want to help in practice:




PCI-Extension: thus a SGEMM 4 way ?

I received in the mail a PCI-Cable extension thus I could experiment a different configuration: it was a disaster.

I have a ASUS Crossblade Ranger, thus I could actually test a 3-way crossfire. For this experiment, I use a R9 Turk GPU into slot 0, the 295×2 into slot 1, and in the slot 2 the ProDuo.  In practice, I can see 5 GPUs but there is a “but”.

GPU_0 takes care of the monitors
GPU_1 and GPU2 are Fiji thus the ProDuo
GPU_3 and GPU4 are Hawaii thus the 295×2

(But) In this configuration, the ProDuo get half the bandwidth. Obviously, this is wrong in so many way that any smart person will not even try. Why waste energy ?

I wanted to make the computation configurable in order to take advantage of such a skew system even such a system. The idea is simple, if a GPU is slower, I quantify it and I reduce the problem size it can handle. If a Fiji is used in conjunction with a Hawaii, the Hawaii get 2/3 of the problem and Fiji 1/3. This will be useful in the future, but you can see the problem and the simple solution.

In practice, 4 is possible, 3 is better.



Fiji ProDuo vs Hawaii 295×2: clBLAS preliminary performance.

What can you do with a few TeraFLOPs? I have got  a beautiful (if not a little noisy) Radeon 295×2. I plugged to my second PCI port and I have run a few experiments using clBLAS. This was my last post (below). Of course, the main limitation of my system is not the GPU, it  is every thing else. The 295×2 has a peak performance of 11 TFLOPS; however I  know that the data have to come from somewhere and the results often cannot stay tucked in the GPU. In my tests, I assume the data will come from the main memory and has to come back to it. I show that of the 11TFLOPS, I can achieve 3TFLOPS using single precision.  This is quite something:  considering that the card consume about  500W and I did not buy to play games, I thought I had a little supercomputer that I can keep beside my monitor.

Recently I added the ProDuo card based on the Fiji architecture in my GPUs arsenal. There are a few advantages (without considering the cost) versus the 295×2. The solution and packaging allows a smaller card without external fan (quiter), similar water cooling, and reduce power consumption, and more compute nodes (from 80 to 120). All of these goodies using the same 28nm technology, so this is pure reorganization thanks to the high bandwidth memory setup.  In practice, the ProDuo is clearly a better card and it should replace the 295×2. Right ?

single singlecomplex
Due to my limited PSU (1KW), because my setup limits any card in the first slot, because clBLAS has a new setup, I decided to re-run the experiments using the 295×2 and the ProDuo in the second slot (OpenCL would say that the card will take the place of GPU1 and GPU2). These cards are designed for single precision computations: The ProDuo has a peack performance of 16TFLOPS and the 295×2 has 11 TFLOPS (sorry I tend to repeat my self). The new clBLAS provides better kernels and you can see that The 295×2 achieves 5TFLOPS and ProDuo about 6TFLOPS. Good thing, I spent some time to redesign the experiments and I re-run the tests once again.  Once again, the bottle neck is my system and the way I feed the data, but you can see that having two GPUs in the card will allow a 2x speed up (instead of a single GPU).

A note, the plots above and the ones that follows will have missing points thanks to a few quirks in OpenCL that we are working on. Second, the cards have no modifications, they are coming directly from the box to the bench.

The 295×2 is still an awesome card considering that the difference is 0.9 TFLOPS. On the other side, the ProDuo is 20% faster, 30% more energy efficient. I can actually plug 2 ProDuo in my system without any further change, but I cannot plug the 295×2 and ProDuo together.  But what next it is still more interesting.

double doublecomplex

Yep. In double precision the ProDuo stays behind.  First, I am coming from the general purpose CPUs and their bench-marking,  I expect a factor of two penalty from going from single to double precision. Here, we can see that the factor is about 5. The Hawaii card can reach the 1 TFLOPS threshold marking, which sounds good;  Fiji’s has a 0.5 TFLOPS  upper limit. So the 0.9 TFLOP loss in single precision is a 0.4 TFLOP gain in double precision. Indeed, life is the stage for a variety of compromises.  In this case, I am not really sure if it is an architecture difference or kernels deployment. We will have to investigate, but it will require some help and effort.

But for me the most interesting part comes now, when I bought an extra PSU and now I can feed electricity to both cards on the same board.


In the next post, I will have a PCI extension so that I can put the big cards into the 2nd and  3rd slot,  I Will loose bandwidth but I should get full potential from the Hawaii and Fiji GPUs. Currently the two Hawaii are a little constrained and I can get the performance of a single GPU. With the extension,  I should be able to see  5 GPUS (the Turk on the first slot, 2 Hawaii and two Fiji). The system allows a three-way crossfire.

single precisionsingle complex preciston

Now, We have an heterogeneous system, in practice we have 3 GPUs effectively. The current experiments do not balanced the work load as a function of the GPUs throughput and thus the plots can be better, higher.

We can see that we earned yet another 1TFLOPS in single precision. The good news is that even in my system, the problem size and the computation time has such a ratio that more hardware will provide better performance and I can show it.  Also the introduction of more GPUs shows that the computation time become linear (the communication is the bottle neck). If I can unleash the forth GPUs, likely I will have little  improvement. But for double precision the curves are little different.

double precision  double complex precision

The three GPUS (Hawaii in position 1 and Fiji 2 and 3) provide a scalable solution but it is not always the best. The beauty of these plots is their complexity: considering the problem size and the configuration available, the best solution is not always straightforward.

The Future and motivations: 
At this time, my research is two fold:

First, I am investigating the field of deep learning for application of feature selection in advertising (yeah my real job) and GPUs seems the hardware of choice, so I wanted to have a test bed close by my real bed. These new systems promise and deliver performance unprecedented.

Second, with the coming of age of AutoGemm in clBLAS, we start having a self tuning BLAS for GPUs and an open source at that; this is an opportunity to re-evaluate kernels written using Strassen’s algorithm. In a system like mine, Strassen’s algorithm can be really appreciated only if they are done at kernel level: the computation performance plot it is too steep to take advantage of a divide and conquer (from the CPU) approach.

Toying with R9 295×2 and SGEMM

So I build this red and black little thing


and I set the R9 295×2 into the second PCI slot. Then I installed clAmdBLAS, OpenCL3 and the catalyst driver for Ubuntu 14. Then I modified the BLAS example so that to use one or both graphic chips. These are the performance plots: SGEMM can achieve 3.4 TFLOPS sustainable peak performance. See below a quick summary for all GEMMs.



Single Precision above and double precision below.


In practice, I show when it is better having two GPUs instead of one for the  R9 252×2 for any GEMM. Performance in GigaFLOPS is measured by accounting the execution time for all data movements from and to memory and the number of operations are normalized to 2N^3. Please, notice that complex matrices matrix multiplications require more operations by a factor of 4.


Scoring as Sparse Matrix Multiplication

This is my first time working on sparse matrix multiplication and I must say that the project was quite something and it is worth sharing it. Let us do this in a top-down fashion. Let us start by describing the problem.


Consider we are watching a YouTube’s video about the 2014 Kawasaki ZX14R (I like this one in particular At the beginning of the video it appears a banner ad or a 30 sec pre-runner. We may click it, kill it or let it be. The Ad is tailored to our history, our  preferences,  where we live  or the time of the day; it is also tailored to the campaign that YouTube is paid to deliver.

Before the delivery of the ad, this can be a bid request and as such a third party may bid and win it, taking control of the ad contents that is ultimately delivered. The same awesome ZX14R will attract different viewers that will be matched to different ads thus to different campaigns.

Consider that the bid is simply summarized by a set of xs: for example, the time of the day, the browser type, the IP network. These Xs are called features and, if we are logged in YouTube, they can be quite complex and they can be very personal (age, income, gender, purchase of a ZX14R).

YouTube may have thousands of campaign running concurrently and probably we may fit the audience targeting of about 10 to 20 campaign at any time. Each campaign may compete with several others in bidding for our attention.  We are summarized by xs, the campaign is summarized by betas. Assume we have one campaign, say campaign 1, and one bid request, we can quantify the match by a probability function

    \[ \pi_1 \sim \pi({\bf \beta}_1,{\bf x}^t) \]

In particular, the relation is quite interesting and well known as logistic regression

    \[ \ln(\frac{\pi_1}{1-\pi_1})  = \beta_0^1 +\sum_i \beta_i^1x_i = \beta^1{\bf x}^t \]

This is a dot product where the element in the betas matches exactly the right element in the x. If we are targeted by multiple campaigns, we may represent this a vectors

    \[ [\pi_1,..,\pi_N]^t  \sim [ \beta^1 {\bf x}^t,.. ,\beta^N {\bf x}^t ] = \beta{\bf x}^t \]

Which is a matrix-by-Vector computation.  Now there are thousands of us watching this video or something like this at any time.

    \[  \Pi \sim \beta{\bf X} \]

This is a sparse matrix multiplication. Each row of betas represents a model, which is associated with a campaign. Each model is composed of different dimensions or attributes and each bid, each x, may intersect only partially with the model. Thus the computations is sparse. The result matrix is a probability measure of why the i-th campaign should be matched with the j-th bid. A threshold can be applied to create discrete values eventually and then we can argue about an approach in breaking ties.

So far so good, this is after all matrix multiplication: at least the last computation of this process. Before the MM computation, there must be the computation of the Betas.
This is beyond the scope of this post (and this site). I will share   reference and it is fascinating the theory and the practice of building these models: I loved this statistical and machine learning part  of the project.

The core of the project is the ability to perform the sparse matrix multiplication fast enough that we can make an intelligent matching between campaign and user in the allocated time to submit a bid for the impression or impressions. The decision has to be made in the order of milliseconds per bid to make sure that we can watch our video as soon as possible without long waiting time because the ad does not load.

The latency is important but the throughput is key: YouTube may have thousands of campaigns running and millions of users connected at any time. Of course, they will count the money to the bank, but this is the basic requirement for any Ad Targeting company in the market.

Three facets: the machines, the models, and performance

For this project, we wanted to use the Intel Phi. This is the first generation PCA connected system based on 50 something cores, with 4 threads each for a total of 200 cores. The cores are designed for this architecture from scratch and based on Pentium architecture.  The peak performance is in the 4TFLOPS ball park. We had one card  without fan for about 5K USD (yes you read it right). Because this guy has a 300Watts consumption at full load, we had to stick a fan to force cool air into the card in such a way we could actually use it. We do not show the artisan job.



To develop the code, we built a system based on Westemere CPUs. The picture above shows the 2-processor system with 12 real core plus 12 other multithreaded cores (24). We added as well a AMD GPU (another 4 TFLOPS)  and we wanted to test what system was the most efficient for our production system.

This built has a cost of 5K USD.  We install CentOS ans we had a 9TFLOPS machine literally under my desk in office. The Fan noise was acceptable but I did not keep it always on.

We then purchased the license for the Intel compiler (1K USD) and finally we were ready to code for the Phi.  The one advantage of this configuration is that we can write normal code that we can test on the 2 processor system, scale it up to 24 cores. Then we need to use the same code and scale it to 200 for the phi. There is no specific code for the Intel System, the compiler, or better Matteo Frigo’s legacy, takes the C code and create different executables.

Per se, the code is not that interesting and we can skip it.

The models: the construction of the betas.

All impressions delivered by the company reside(d) in a dedicated DB. I wanted to build as many models as I could. I could work with about 120 campaign and thus I could build as many models. I deployed the machine above and two other smaller ones: in practice I could investigate and build 18 models  at any time (10 + 4 +4). For this process multithreads were detrimental.

This process could take as long as one day.


There are two type of performance: speed and accuracy. I can clearly say that the speed was disappointing.

We collected 26Million bids and used 102 Models. We moved all models and impressions into the Phi and scored them. We did the same on the Westemeres. After all the efforts the two systems had the same throughput. Considering that the impressions would have to move through  the PCA, the latency is not really comparable

The Full story is available in the following paper: ctr.

New references and a full algorithm collection

I am lucky and I am happy to work in this field.  I recently started a conversation with Axel Kemper. He shared his list of references and algorithms and I bluntly put it up on my reference page: This will require to click and to download, but it is so worth it.

As separate files there are two important lists.

List one:  Solution using Bini formulation. Axel provides not the code but the three matrices as Bini et al would write. This is important for me because I can go back and take my little optimization engine and try to write code. I cannot wait to  see optimized code  for the <3,3,3> or <4,4,4>? For the moment being, I am swamped by other things but in principle I have the first instrument to extend the FastMMW library with other algorithms.

The longer list: Code is much longer list with basic code description.

This is the first time I have found such a complete list of algorithms with their creators. The list is public and ready available for any one to take, to use and to write code.

I will love to add these to the library especially the cases in between <2,2,2> and <3,3,3> providing algorithms for rectangular matrices and for the range <2,2,2> and <4,4,4>. The latter is a two recursions of <2,2,2>  and it will require twice as large matrices or problem sizes.  The case <3,3,3> will require about 1.5 large matrices. In practice, this provides a finer granularity where we can torque-fitting the algorithm performance.




Searching for Fast Matrix Multiplication …

… Frameworks are appearing.

Regularly, I search for the keywords Fast Matrix Multiplication or Multiply. I use Google and I use Bing/Yahoo. The two search engines have different algorithms and the results are often different. I search for two separate reasons: First, narcissism, I want to find my site in the first page. Second, to see what is new in the field (similar research tends to use the same terminology).

In Bing and Yahoo, I regularly appear in the first page close to wikipedia. This provides great visibility and their algorithm validates my effort to provide technical and high quality contents. As such, I cannot write a new blog every week and thus this search engine does not really consider me as a classic blog.

If I could publish regularly, my contents bubble up in google. As I write today, I appear in either the first or the second page, which is very good and not common. In general,  papers by Stanford or Berkeley alumni are higher ranked. If you repeat the search the result may well be tailored to your history and this site could be a long shot.

Nonetheless, I like this scenario: one search engine is consistent and it values my effort and the other often provides “temporal” importance, I will explain the negative effect first and in the next paragraph the positive one. The negative effect is simple to explain: Not being in the top in Google searches would suck if I had monetary investments or my professional career depends on this type of work. For example, citations from peers are important in so many ways in the research field especially in academia: for advancement, funding and personal satisfaction (pat on your back kind of thing). Not being in the top is equivalent to not being published in the top tier conferences or journals: it will diminish your contribution to the eyes of new researchers and your research peers. If they cannot find your work it means the majority will not know what is your contribution.

The temporal ranking effect helps me to stay updated in the current research easily: new research will bubble up rather quickly and I can easily being in the midst of smarter people thoughts. I found out that the field attracts interesting work and interest in general, this is exciting.

Smart people are reaching the conclusion that it is good if not necessary to have development framework for the generation of algorithms for Fast Matrix multiplication.

The first is by Alexey Smirnov: he provides a systematic way to create recursive algorithms using matrix notations a la Bini’s style. His work is available at If you ever wanted to create code for these algorithms, you need the matrix description and from them the code. He proposes the solution of larger problem where the saving in matrix multiplications are bigger than what I use to work on.

One of the obstacles the use of  Fast MM algorithms is their poor formulation for rectangular matrices. One of my humble contributions is the presentation of one single algorithm that can be used for any matrix size (my first TOMS paper). Now, I am happy to see that there is a practical interest in tackling this problem.

Recently, I have been contacted to share my adaptive library to tackle exactly rectangular matrices, and thank to Google, I have found that  Benson and Ballard  ( are proposing a new framework combining Smirnov’s idea. My library is used as reference and seems having a little edge when applicable. The good news is that the optimizations presented could be applied to mine in principle.

I am happy to see that the current research is going towards the direction I would go: a practical framework where algorithms are presented as matrices and automation will take these to create codes: for every matrices and for every architectures.







The Better Accuracy of Fast Matrix Multiply III

In the previous post, I showed that Strassen-Winograd algorithms have a directionality in their errors. This is the reason for their numerical properties and it has been used mainly to express bounds. Today, we use it to improve their numerical properties: after all if you have a constructive proof to prove a bound the least we can do is to use it for the optimization of the algorithm itself.

I admit there are several way to improve the algorithm. Randomization was my first guess. It is simple, it breaks patterns, or we make those pattern not predictable thus not pattern any more. In fact, randomization can be used to spread the error across the result matrix, but I could not make it reducing the intensity of the error.

I came to the idea of using permutations designed tilt the error without actually involve extra computations. These permutations must be tailored to the algorithm but they are not that difficult to grasp or implement. What are these permutations and how they are applied is in the paper. I want to show the effect of those permutation on the maximum error and their locations:


Take the previous post and look at scatter plots (this and the previous). First of all, I reduced by half the maximum error.  I can do that because I spread the error.

Once again, the geo-locality of the error is the cause of the constant but extra error FastMMW algorithms have. On one side, I find comfort in knowing where the error is. On the other side,  this knowledge can be use to break the error pattern and improve the error we commit. I see a nice connection with yet another one application of Werner Heisenberg’s principle.



The Better Accuracy of Fast Matrix Multiply II

What is the goal of analytic combinatorics? In algorithms, we use it to get an estimation of complexity. We use this complexity to choose a priori an algorithm: The one with fewer operations or fewer memory reads. Speed is often related to fewer operations but there are interesting exceptions. In the previous post, I wrote about sorting algorithms, their variety, their complexity and, in turn, their assumed performance. The numerical analysis of an algorithm is that different than it complexity analysis?

I had someone introducing me to the analytic combinatorics of algorithms. This is part of the curriculum of any computer science/engineering background.  However, I am self taught about the numerical analysis. Information theory and thus probability is integral part of CS/EECS.  However, Statistics and probability has been a recent deep dive. So to answer the question above, the CS educational system seems considering numerical analysis a topic for special interests.

Interestingly the error analysis methods do not look like much different from the methods used for the complexity analysis. The idea is to bound the error instead of bounding the number of operations, say. For Matrix Multiplication the result is a matrix, very often this is a two dimensional array.

We can bound the error of each matrix element. Each element is important equally. For the regular matrix multiplication, this is simple because each element of the result matrix has the same computation though on different data. If we assume that the inputs do not matter and only the computation is injecting errors, the regular matrix multiplication has a uniform error. In fact, the bounds are identical for all element of the result matrix.

Fast Matrix multiplication are different: by construction, the algorithm build the components of the result matrix using asymmetric computations. We used to write simplified equations so that to bound the whole matrix. The final result looks and feel as any complexity analysis:

    \[ \|D\| \leq \Big[ \Big(\frac{n}{n_0}\Big)^{\log_2K}(n_0^2 +5n_0)-5n \Big] u \|A\|\|B\| \]

Higham’s Book Chapter 23, Theorem 23.3

In principle, it  is possible to write bounds for each elements even for Fast matrix multiplication; however, in practice, no one actually tried. The equation above is beautiful  in so many ways once you start to know it.

About four years ago, I started looking at how the bounds above were computed.  I was awed by the complexity and elegance. I started understanding it when I saw the original proof by Brent. I saw its power when I saw for the first time the proof by Bini and Lotti.  All of the previous masters went at the problem to reduce the result to a single equation: a summary.

More I saw the equation more I hated it because  the result was hiding the methodology to reach such a concise result. The methodology had to compute a bound for each elements and in practice the proofs teach you how to write better algorithms and thus tighter bounds, better equations.

So to achieve a better algorithm we must somehow summarize the error analysis. HeatMError150x15010000Times[-1,1]Orthogonal

So I decided to show the location of the error graphically. Take 10,000 runs of four algorithms Strassen, SW Winograd variation 1, SWOPT Winograd variation 2 and Goto’s SGEMM. The figure show the location of the maximum errors and the maximum of the maxima. The beautiful equation hides these even more beautiful patterns.

Nonetheless, when I show this picture (or pictures like this) every one understands that the higher error of Fast MM is dues to the limited and focus pattern: “if we could spread the area as in the SGEMM we could reduce the error”.  The picture let you have the right intuition immediately. Once we combine the intuition with the equation proof, you have the main idea.

There are algorithm optimizations that allow to achieve more accurate fast matrix multiplications.