Fiji ProDuo vs Hawaii 295×2: clBLAS preliminary performance.

What can you do with a few TeraFLOPs? I have got  a beautiful (if not a little noisy) Radeon 295×2. I plugged to my second PCI port and I have run a few experiments using clBLAS. This was my last post (below). Of course, the main limitation of my system is not the GPU, it  is every thing else. The 295×2 has a peak performance of 11 TFLOPS; however I  know that the data have to come from somewhere and the results often cannot stay tucked in the GPU. In my tests, I assume the data will come from the main memory and has to come back to it. I show that of the 11TFLOPS, I can achieve 3TFLOPS using single precision.  This is quite something:  considering that the card consume about  500W and I did not buy to play games, I thought I had a little supercomputer that I can keep beside my monitor.

Recently I added the ProDuo card based on the Fiji architecture in my GPUs arsenal. There are a few advantages (without considering the cost) versus the 295×2. The solution and packaging allows a smaller card without external fan (quiter), similar water cooling, and reduce power consumption, and more compute nodes (from 80 to 120). All of these goodies using the same 28nm technology, so this is pure reorganization thanks to the high bandwidth memory setup.  In practice, the ProDuo is clearly a better card and it should replace the 295×2. Right ?

single singlecomplex
Due to my limited PSU (1KW), because my setup limits any card in the first slot, because clBLAS has a new setup, I decided to re-run the experiments using the 295×2 and the ProDuo in the second slot (OpenCL would say that the card will take the place of GPU1 and GPU2). These cards are designed for single precision computations: The ProDuo has a peack performance of 16TFLOPS and the 295×2 has 11 TFLOPS (sorry I tend to repeat my self). The new clBLAS provides better kernels and you can see that The 295×2 achieves 5TFLOPS and ProDuo about 6TFLOPS. Good thing, I spent some time to redesign the experiments and I re-run the tests once again.  Once again, the bottle neck is my system and the way I feed the data, but you can see that having two GPUs in the card will allow a 2x speed up (instead of a single GPU).

A note, the plots above and the ones that follows will have missing points thanks to a few quirks in OpenCL that we are working on. Second, the cards have no modifications, they are coming directly from the box to the bench.

The 295×2 is still an awesome card considering that the difference is 0.9 TFLOPS. On the other side, the ProDuo is 20% faster, 30% more energy efficient. I can actually plug 2 ProDuo in my system without any further change, but I cannot plug the 295×2 and ProDuo together.  But what next it is still more interesting.

double doublecomplex

Yep. In double precision the ProDuo stays behind.  First, I am coming from the general purpose CPUs and their bench-marking,  I expect a factor of two penalty from going from single to double precision. Here, we can see that the factor is about 5. The Hawaii card can reach the 1 TFLOPS threshold marking, which sounds good;  Fiji’s has a 0.5 TFLOPS  upper limit. So the 0.9 TFLOP loss in single precision is a 0.4 TFLOP gain in double precision. Indeed, life is the stage for a variety of compromises.  In this case, I am not really sure if it is an architecture difference or kernels deployment. We will have to investigate, but it will require some help and effort.

But for me the most interesting part comes now, when I bought an extra PSU and now I can feed electricity to both cards on the same board.


In the next post, I will have a PCI extension so that I can put the big cards into the 2nd and  3rd slot,  I Will loose bandwidth but I should get full potential from the Hawaii and Fiji GPUs. Currently the two Hawaii are a little constrained and I can get the performance of a single GPU. With the extension,  I should be able to see  5 GPUS (the Turk on the first slot, 2 Hawaii and two Fiji). The system allows a three-way crossfire.

single precisionsingle complex preciston

Now, We have an heterogeneous system, in practice we have 3 GPUs effectively. The current experiments do not balanced the work load as a function of the GPUs throughput and thus the plots can be better, higher.

We can see that we earned yet another 1TFLOPS in single precision. The good news is that even in my system, the problem size and the computation time has such a ratio that more hardware will provide better performance and I can show it.  Also the introduction of more GPUs shows that the computation time become linear (the communication is the bottle neck). If I can unleash the forth GPUs, likely I will have little  improvement. But for double precision the curves are little different.

double precision  double complex precision

The three GPUS (Hawaii in position 1 and Fiji 2 and 3) provide a scalable solution but it is not always the best. The beauty of these plots is their complexity: considering the problem size and the configuration available, the best solution is not always straightforward.

The Future and motivations: 
At this time, my research is two fold:

First, I am investigating the field of deep learning for application of feature selection in advertising (yeah my real job) and GPUs seems the hardware of choice, so I wanted to have a test bed close by my real bed. These new systems promise and deliver performance unprecedented.

Second, with the coming of age of AutoGemm in clBLAS, we start having a self tuning BLAS for GPUs and an open source at that; this is an opportunity to re-evaluate kernels written using Strassen’s algorithm. In a system like mine, Strassen’s algorithm can be really appreciated only if they are done at kernel level: the computation performance plot it is too steep to take advantage of a divide and conquer (from the CPU) approach.

Toying with R9 295×2 and SGEMM

So I build this red and black little thing


and I set the R9 295×2 into the second PCI slot. Then I installed clAmdBLAS, OpenCL3 and the catalyst driver for Ubuntu 14. Then I modified the BLAS example so that to use one or both graphic chips. These are the performance plots: SGEMM can achieve 3.4 TFLOPS sustainable peak performance. See below a quick summary for all GEMMs.



Single Precision above and double precision below.


In practice, I show when it is better having two GPUs instead of one for the  R9 252×2 for any GEMM. Performance in GigaFLOPS is measured by accounting the execution time for all data movements from and to memory and the number of operations are normalized to 2N^3. Please, notice that complex matrices matrix multiplications require more operations by a factor of 4.


Scoring as Sparse Matrix Multiplication

This is my first time working on sparse matrix multiplication and I must say that the project was quite something and it is worth sharing it. Let us do this in a top-down fashion. Let us start by describing the problem.


Consider we are watching a YouTube’s video about the 2014 Kawasaki ZX14R (I like this one in particular At the beginning of the video it appears a banner ad or a 30 sec pre-runner. We may click it, kill it or let it be. The Ad is tailored to our history, our  preferences,  where we live  or the time of the day; it is also tailored to the campaign that YouTube is paid to deliver.

Before the delivery of the ad, this can be a bid request and as such a third party may bid and win it, taking control of the ad contents that is ultimately delivered. The same awesome ZX14R will attract different viewers that will be matched to different ads thus to different campaigns.

Consider that the bid is simply summarized by a set of xs: for example, the time of the day, the browser type, the IP network. These Xs are called features and, if we are logged in YouTube, they can be quite complex and they can be very personal (age, income, gender, purchase of a ZX14R).

YouTube may have thousands of campaign running concurrently and probably we may fit the audience targeting of about 10 to 20 campaign at any time. Each campaign may compete with several others in bidding for our attention.  We are summarized by xs, the campaign is summarized by betas. Assume we have one campaign, say campaign 1, and one bid request, we can quantify the match by a probability function

    \[ \pi_1 \sim \pi({\bf \beta}_1,{\bf x}^t) \]

In particular, the relation is quite interesting and well known as logistic regression

    \[ \ln(\frac{\pi_1}{1-\pi_1})  = \beta_0^1 +\sum_i \beta_i^1x_i = \beta^1{\bf x}^t \]

This is a dot product where the element in the betas matches exactly the right element in the x. If we are targeted by multiple campaigns, we may represent this a vectors

    \[ [\pi_1,..,\pi_N]^t  \sim [ \beta^1 {\bf x}^t,.. ,\beta^N {\bf x}^t ] = \beta{\bf x}^t \]

Which is a matrix-by-Vector computation.  Now there are thousands of us watching this video or something like this at any time.

    \[  \Pi \sim \beta{\bf X} \]

This is a sparse matrix multiplication. Each row of betas represents a model, which is associated with a campaign. Each model is composed of different dimensions or attributes and each bid, each x, may intersect only partially with the model. Thus the computations is sparse. The result matrix is a probability measure of why the i-th campaign should be matched with the j-th bid. A threshold can be applied to create discrete values eventually and then we can argue about an approach in breaking ties.

So far so good, this is after all matrix multiplication: at least the last computation of this process. Before the MM computation, there must be the computation of the Betas.
This is beyond the scope of this post (and this site). I will share   reference and it is fascinating the theory and the practice of building these models: I loved this statistical and machine learning part  of the project.

The core of the project is the ability to perform the sparse matrix multiplication fast enough that we can make an intelligent matching between campaign and user in the allocated time to submit a bid for the impression or impressions. The decision has to be made in the order of milliseconds per bid to make sure that we can watch our video as soon as possible without long waiting time because the ad does not load.

The latency is important but the throughput is key: YouTube may have thousands of campaigns running and millions of users connected at any time. Of course, they will count the money to the bank, but this is the basic requirement for any Ad Targeting company in the market.

Three facets: the machines, the models, and performance

For this project, we wanted to use the Intel Phi. This is the first generation PCA connected system based on 50 something cores, with 4 threads each for a total of 200 cores. The cores are designed for this architecture from scratch and based on Pentium architecture.  The peak performance is in the 4TFLOPS ball park. We had one card  without fan for about 5K USD (yes you read it right). Because this guy has a 300Watts consumption at full load, we had to stick a fan to force cool air into the card in such a way we could actually use it. We do not show the artisan job.



To develop the code, we built a system based on Westemere CPUs. The picture above shows the 2-processor system with 12 real core plus 12 other multithreaded cores (24). We added as well a AMD GPU (another 4 TFLOPS)  and we wanted to test what system was the most efficient for our production system.

This built has a cost of 5K USD.  We install CentOS ans we had a 9TFLOPS machine literally under my desk in office. The Fan noise was acceptable but I did not keep it always on.

We then purchased the license for the Intel compiler (1K USD) and finally we were ready to code for the Phi.  The one advantage of this configuration is that we can write normal code that we can test on the 2 processor system, scale it up to 24 cores. Then we need to use the same code and scale it to 200 for the phi. There is no specific code for the Intel System, the compiler, or better Matteo Frigo’s legacy, takes the C code and create different executables.

Per se, the code is not that interesting and we can skip it.

The models: the construction of the betas.

All impressions delivered by the company reside(d) in a dedicated DB. I wanted to build as many models as I could. I could work with about 120 campaign and thus I could build as many models. I deployed the machine above and two other smaller ones: in practice I could investigate and build 18 models  at any time (10 + 4 +4). For this process multithreads were detrimental.

This process could take as long as one day.


There are two type of performance: speed and accuracy. I can clearly say that the speed was disappointing.

We collected 26Million bids and used 102 Models. We moved all models and impressions into the Phi and scored them. We did the same on the Westemeres. After all the efforts the two systems had the same throughput. Considering that the impressions would have to move through  the PCA, the latency is not really comparable

The Full story is available in the following paper: ctr.

New references and a full algorithm collection

I am lucky and I am happy to work in this field.  I recently started a conversation with Axel Kemper. He shared his list of references and algorithms and I bluntly put it up on my reference page: This will require to click and to download, but it is so worth it.

As separate files there are two important lists.

List one:  Solution using Bini formulation. Axel provides not the code but the three matrices as Bini et al would write. This is important for me because I can go back and take my little optimization engine and try to write code. I cannot wait to  see optimized code  for the <3,3,3> or <4,4,4>? For the moment being, I am swamped by other things but in principle I have the first instrument to extend the FastMMW library with other algorithms.

The longer list: Code is much longer list with basic code description.

This is the first time I have found such a complete list of algorithms with their creators. The list is public and ready available for any one to take, to use and to write code.

I will love to add these to the library especially the cases in between <2,2,2> and <3,3,3> providing algorithms for rectangular matrices and for the range <2,2,2> and <4,4,4>. The latter is a two recursions of <2,2,2>  and it will require twice as large matrices or problem sizes.  The case <3,3,3> will require about 1.5 large matrices. In practice, this provides a finer granularity where we can torque-fitting the algorithm performance.




Searching for Fast Matrix Multiplication …

… Frameworks are appearing.

Regularly, I search for the keywords Fast Matrix Multiplication or Multiply. I use Google and I use Bing/Yahoo. The two search engines have different algorithms and the results are often different. I search for two separate reasons: First, narcissism, I want to find my site in the first page. Second, to see what is new in the field (similar research tends to use the same terminology).

In Bing and Yahoo, I regularly appear in the first page close to wikipedia. This provides great visibility and their algorithm validates my effort to provide technical and high quality contents. As such, I cannot write a new blog every week and thus this search engine does not really consider me as a classic blog.

If I could publish regularly, my contents bubble up in google. As I write today, I appear in either the first or the second page, which is very good and not common. In general,  papers by Stanford or Berkeley alumni are higher ranked. If you repeat the search the result may well be tailored to your history and this site could be a long shot.

Nonetheless, I like this scenario: one search engine is consistent and it values my effort and the other often provides “temporal” importance, I will explain the negative effect first and in the next paragraph the positive one. The negative effect is simple to explain: Not being in the top in Google searches would suck if I had monetary investments or my professional career depends on this type of work. For example, citations from peers are important in so many ways in the research field especially in academia: for advancement, funding and personal satisfaction (pat on your back kind of thing). Not being in the top is equivalent to not being published in the top tier conferences or journals: it will diminish your contribution to the eyes of new researchers and your research peers. If they cannot find your work it means the majority will not know what is your contribution.

The temporal ranking effect helps me to stay updated in the current research easily: new research will bubble up rather quickly and I can easily being in the midst of smarter people thoughts. I found out that the field attracts interesting work and interest in general, this is exciting.

Smart people are reaching the conclusion that it is good if not necessary to have development framework for the generation of algorithms for Fast Matrix multiplication.

The first is by Alexey Smirnov: he provides a systematic way to create recursive algorithms using matrix notations a la Bini’s style. His work is available at If you ever wanted to create code for these algorithms, you need the matrix description and from them the code. He proposes the solution of larger problem where the saving in matrix multiplications are bigger than what I use to work on.

One of the obstacles the use of  Fast MM algorithms is their poor formulation for rectangular matrices. One of my humble contributions is the presentation of one single algorithm that can be used for any matrix size (my first TOMS paper). Now, I am happy to see that there is a practical interest in tackling this problem.

Recently, I have been contacted to share my adaptive library to tackle exactly rectangular matrices, and thank to Google, I have found that  Benson and Ballard  ( are proposing a new framework combining Smirnov’s idea. My library is used as reference and seems having a little edge when applicable. The good news is that the optimizations presented could be applied to mine in principle.

I am happy to see that the current research is going towards the direction I would go: a practical framework where algorithms are presented as matrices and automation will take these to create codes: for every matrices and for every architectures.







The Better Accuracy of Fast Matrix Multiply III

In the previous post, I showed that Strassen-Winograd algorithms have a directionality in their errors. This is the reason for their numerical properties and it has been used mainly to express bounds. Today, we use it to improve their numerical properties: after all if you have a constructive proof to prove a bound the least we can do is to use it for the optimization of the algorithm itself.

I admit there are several way to improve the algorithm. Randomization was my first guess. It is simple, it breaks patterns, or we make those pattern not predictable thus not pattern any more. In fact, randomization can be used to spread the error across the result matrix, but I could not make it reducing the intensity of the error.

I came to the idea of using permutations designed tilt the error without actually involve extra computations. These permutations must be tailored to the algorithm but they are not that difficult to grasp or implement. What are these permutations and how they are applied is in the paper. I want to show the effect of those permutation on the maximum error and their locations:


Take the previous post and look at scatter plots (this and the previous). First of all, I reduced by half the maximum error.  I can do that because I spread the error.

Once again, the geo-locality of the error is the cause of the constant but extra error FastMMW algorithms have. On one side, I find comfort in knowing where the error is. On the other side,  this knowledge can be use to break the error pattern and improve the error we commit. I see a nice connection with yet another one application of Werner Heisenberg’s principle.



The Better Accuracy of Fast Matrix Multiply II

What is the goal of analytic combinatorics? In algorithms, we use it to get an estimation of complexity. We use this complexity to choose a priori an algorithm: The one with fewer operations or fewer memory reads. Speed is often related to fewer operations but there are interesting exceptions. In the previous post, I wrote about sorting algorithms, their variety, their complexity and, in turn, their assumed performance. The numerical analysis of an algorithm is that different than it complexity analysis?

I had someone introducing me to the analytic combinatorics of algorithms. This is part of the curriculum of any computer science/engineering background.  However, I am self taught about the numerical analysis. Information theory and thus probability is integral part of CS/EECS.  However, Statistics and probability has been a recent deep dive. So to answer the question above, the CS educational system seems considering numerical analysis a topic for special interests.

Interestingly the error analysis methods do not look like much different from the methods used for the complexity analysis. The idea is to bound the error instead of bounding the number of operations, say. For Matrix Multiplication the result is a matrix, very often this is a two dimensional array.

We can bound the error of each matrix element. Each element is important equally. For the regular matrix multiplication, this is simple because each element of the result matrix has the same computation though on different data. If we assume that the inputs do not matter and only the computation is injecting errors, the regular matrix multiplication has a uniform error. In fact, the bounds are identical for all element of the result matrix.

Fast Matrix multiplication are different: by construction, the algorithm build the components of the result matrix using asymmetric computations. We used to write simplified equations so that to bound the whole matrix. The final result looks and feel as any complexity analysis:

    \[ \|D\| \leq \Big[ \Big(\frac{n}{n_0}\Big)^{\log_2K}(n_0^2 +5n_0)-5n \Big] u \|A\|\|B\| \]

Higham’s Book Chapter 23, Theorem 23.3

In principle, it  is possible to write bounds for each elements even for Fast matrix multiplication; however, in practice, no one actually tried. The equation above is beautiful  in so many ways once you start to know it.

About four years ago, I started looking at how the bounds above were computed.  I was awed by the complexity and elegance. I started understanding it when I saw the original proof by Brent. I saw its power when I saw for the first time the proof by Bini and Lotti.  All of the previous masters went at the problem to reduce the result to a single equation: a summary.

More I saw the equation more I hated it because  the result was hiding the methodology to reach such a concise result. The methodology had to compute a bound for each elements and in practice the proofs teach you how to write better algorithms and thus tighter bounds, better equations.

So to achieve a better algorithm we must somehow summarize the error analysis. HeatMError150x15010000Times[-1,1]Orthogonal

So I decided to show the location of the error graphically. Take 10,000 runs of four algorithms Strassen, SW Winograd variation 1, SWOPT Winograd variation 2 and Goto’s SGEMM. The figure show the location of the maximum errors and the maximum of the maxima. The beautiful equation hides these even more beautiful patterns.

Nonetheless, when I show this picture (or pictures like this) every one understands that the higher error of Fast MM is dues to the limited and focus pattern: “if we could spread the area as in the SGEMM we could reduce the error”.  The picture let you have the right intuition immediately. Once we combine the intuition with the equation proof, you have the main idea.

There are algorithm optimizations that allow to achieve more accurate fast matrix multiplications.

The Better Accuracy of Strassen-Winograd Matrix Multiply I

On March 1 2014 a paper of mine titled as above will be published in Advances in Linear Algebra and Matrix Theory Vol. 4 Number 1. This will be a open publishing venue. I am new to this kind as publishing experience, and I will provide here support, considerations, and  opinions.  I will post a commentary series to help the distribution and to explain the contents.

The title is to poke fun to the subject. In so far, the numerical analysis of these algorithms has been stale in the sense that has not been used to design more accurate algorithms: the analysis has been used to describe the worst case error for  FastMMW but has not been used neither to design better algorithms nor to select the best algorithm for the input at hand.  This point is embarrassingly important.

Let’s have a thought experiment about the sort algorithm, this will be appealing to CS people and Googler’s interviewers. There are about three dozens sorting algorithms with different best, average, and worst case scenarios.  This seems like we have a lot of algorithms to choose from. For example, the merge sort and the quick sort have the same best and average performance and the Quick sort has the worst case performance of O(n^2). That is, every time we choose the pivot element to split the vector into two and sort recursively, we choose always the smallest or the largest for n consecutive times. Albert King would say that he would not know luck at all if it was not for bad luck  or the vector is constant. Would you reject the quick sort a priori for that?

The community did not: the Quick sort comes  with multiple pivots variations, with hierarchical algorithms to improve the overall performance:  we choose the pivot and the number of recursions a finite number of times for the specific input, etc.  For Sorting the attitude is different, the variety and number of algorithms is exciting, the combination of different algorithms to fit a specific purpose is elegant and it makes a great subject for job interviews (usually not very deep after all you need to know three dozens algorithms and you cannot use Google while interviewing at Google).

FastMMW have thousands of variations without considering the even faster ones. I am not kidding during a personal conversation with smarter people, one counted them in the Ks (thousands). The Bounds are  for the family of algorithms (i.e., all sorting)  not for each implementations, the bounds provide only worst case scenarios (forget about best and average), and there is no consideration about the inputs and their nature.

I understand we are talking about precision and not performance, we must be cautious. However, it is true that if we apply Strassen 2 times recursively, the error increases by a bounded constant. It is like to choose to use the quick sort for two level of recursions and then yield to merge sort.

In this work, The Better Accuracy of Strassen-Winograd Matrix Multiply I show that we can design more accurate algorithm even if we use a finite number of recursions, that the theory started 40 years ago is a great starting point if used in combination with new tools and ideas, especially new ideas.

The format of this series is still in my head and thus … but I will start with a new idea.


Clearness, contents, contributions, and rejections

I have seen different facets about the presentation of new ideas. This is my collection of thoughts about a few hurdles thrown to any naive contributor of original contents in the scientific field.

Point A: I am reading an old booklet: An introduction to Mathematics  by Alfred North Whitehead (1910), you can find for a penny in Amazon but by a few is considered as the top 50 books of scientific literature. There is a paragraph about mathematical clearness I shall quote here:

“…. In the second place it is necessary for research. It makes for clearness of thought, and thence for boldness of thought and for fertility in trying new combinations of  ideas. When the initial statements are vague and slipshod, at every subsequence state of thought common sense has to step in to limit application and to explain meanings. Now in creating thought common sense is a bad master. Its sole criterion for judgement is that the new ideas shall look like the old ones. In other words it can only act by suppressing originality.”

This is the real reason for clarity and it shows the hard consequences. I like  that mathematical clearness is the foundation of creative thinking and not just a notational game: if I do not have to think all the time of the meaning of terms and ideas, my creative side will be able to tackle new ideas. This requires also some flexibility from the reader: if I write u and explain that is a unit of measure, please let it be. Because if the reader (you) is more familiar with the term b instead, it is just a short hand. If my notation has to be exactly like yours to make sense to you, it sounds more like forcing me to conform to your common sense and thus even less likely to convince you about a result that is truly original and  it is not by you.

Point B: I am a reviewer for a journal and recently I rejected a paper. The authors presented two algorithms A and B, they described them in fairly good details and they chose to present experiments only for one, B. Algorithm  A did not have experiments because it was not appealing, it was not appealing because the implementation was poorly parallel w.r.t. B. One processor was doing 1/4 of the work no matter how many processors were used.  In my review, I presented a different algorithm: that particular 1/4 of the work can be highly parallel, the parallel algorithm is elegant and short to describe,  and there is code for it. My concern was on the contents and what was missing: a better A algorithm. Strangely, if the authors would have given me only B, I would never bothered with it.

Sincerely, I did not fully understand the paper. That was not my point for the rejection. What I did unconsciously and consciously is to focus on the contents and see if B was better than A. Clearness should provide that because was the main goal of the paper. The authors may think that I am an idiot, their paper is just fine, and they are writing a rebuttal to me and the editor to reconsider the publication. If you are a small potatoes, being reviewed sucks. If you are a small potatoes and you are reviewing, you may get carry away.

Point C: At the beginning of 2013, I have been working on a project for a few months. In the second half of the year, I tried to collect my thoughts and results into a paper and submit to publication. The next post will be about the contents of this publication.  IF you follow my posts, you may know that I am an experimental person; that is, observations and experiments are the first part of my work. Nowadays, I often communicate these results during the investigation to  researchers who worked on the same subject and especially if my work may shed a different light on their work. Mostly I want to know if I did something obscenely wrong. Other researches go to a great length avoiding sharing or paying their dues (as B.B. King would say). Some times I even understand the reasons for not sharing but I understand better the reasons for it.

For me mathematical clearness is an attempt to organize and write a process that is often scruffy, based on inductions and sometime intuitions. Sometimes, I start during the experiments to guide me in finding the right questions and the right answers. More often the writing is after words when I have good enough answer. The experiments are driven by curiosity: the desire to know and the disappointment of remaining an idiot (i.e., who ignores). Thought, the writing is hard, I like seeing  the white and empty paper becoming an article. An original contribution describing in the best way I could  what I did and share it.  Without considering the publication process, this process may take months. With the publication process, it may take years.

I submitted this work to an old and good journal where I have already published but I knew the process will be tough. I believed and still I believe it was the right course. During the preparation of the work, I thought I had the blessing of my predecessors (or at least who matter to the topic and to me).   In fact, the paper was rejected. The reviews were mostly aiming at the writing part and how the writing cluttered the understanding of the material.  In practice, clarity is the responsibility of the authors.  A paper cannot be published if it is not clear.  I rewrote the paper, wrote a detailed rebuttal, and submitted somewhere else, it will be published shortly. This paper will prove better bounds to the error of FastMMW: the last such a result was published 40 years ago. Yep, creative thinking in action.

Conclusions: With experience and knowledge, I hope to reach the mathematical clearness that will help my successors and convince my contemporaries.






No, GPU Matrix Multiplication ain’t tiled.

Is wisdom  a gift of old age? Or patience may be? I do not remember which one. My age is affecting my memory.

All along these random posts, my research, and the presentation of my research  the topic of error analysis is a fastidious itch that everybody around me wants to scratch. Paraphrasing the sentence “See the stone set in your eyes”, when I talk about Fast MM there is  always a cry saying the use of any Fast Algorithms is inconceivable: the error those algorithms introduce is a huge stone that I should lift. The wise see my stone and  want me to scratch its itch.

I would recommend a talk to who will attend  ICS 2013:  Improving Numerical Accuracy for Non-Negative Matrix Multiplication on GPUs using Recursive Algorithms.

There are a few interesting ideas that I would like to express (not necessarily aligned with all the authors)  beside the obvious point that technology makes all lazy:

  • When new architectures are introduced, like these beautiful GPUs,  we tend to forgive the quality of the code written for them. For fast publication,  we forget what has been done previously (by ourselves?).
  • These things can awe you with 3TFLOPS performance, which is a slap in the face to  any high-end general purpose processors. For the same reason,  any decent code will do.
  • MM for GPUs do not use blocking. The irony is that who developed code for the GPUs  are the same authors who proposed blocking.

But WTF is blocking/tiling and why MM ain’t tiled?

It is actually easy to summarize Tiling/blocking as a way to break the computation into smaller computations following a divide-and-conquer algorithm. The MM is tiled if it is  decomposed into similar and smaller MM. For example, Fast algorithms are applying blocking naturally (i.e., recursive blocking).

If you pick up the code for MM for NVidia’s or for AMD’s GPUS as today April 2013, the code is not tiled. The authors break the matrix C in square sub blocks of fixed sizes (k_0 x k_0) and they perform  “long” matrix multiplications, in practice they exploit dot products. This is not me being obsessive about the definition of blocking: it has serious consequences on the error the computation commits. The code has long string of dependent additions that very likely introduce and accumulate error. Here is the stone set in your eyes.

The wise man would say: do not worry, young man! The error committed here is not even close to the error committed by the Fast MM Algorithms. After all, the wise is thinking anything is more accurate than a Fast MM, which is unstable. What the matter you do not like 3TFLOPS?

If I apply any Fast Algorithm, I may add another 1TFLOP (25% improvement), but the wise  would say: inconceivable. Because, my Fast Algorithm will introduce error of inconceivable sizes making the result plain useless. I like to use “inconceivable” here. Unfortunately, it seems I cannot say, what the matter you do not like 4TFLOPS for the price of 3?

In the ICS talk, the authors will show that for positive matrices (like probability matrices) there is a Winograd’s variant  that is  both 27% faster and 3-4 bits more accurate. The authors also show an accuracy comparison with Goto’s implementation, which should have been used as reference for numerical error analysis previously (but did not). The fast MM looses one bit to Goto’s implementation. 4TFLOPS and 1 decimal precision back, can anyone tell me what is wrong with that ?

Again, my conclusion is simple: keep an open mind. There are places where Fast MMs should not be used and places where they should, let’s learn together when and where.