Merry Christmas and Full Ahead 2013

As Benigni used to say “Happy birthday Little Boy Jesus (buon hompleanno bambin Gesu’)“. I would like to remind myself, the ones I love, and those very few  who have  read and will read Fast Matrix Multiply (fastMMW)  blog to enjoy these merry days.

Today please, join  me and  thank, kiss, and hug  who you love; and, remember to repeat this  everyday.



Self Installing Package: Fast Matrix Multiplication

In the last two weeks, I have been working on practical ways to improve the distribution of FastMMW library: to make my code less a research project and more like a software package that anyone can download, build, install and use.

I am almost done with the testing of an installation process, which is  written in Python. I chose Python because I am lazy and it is simple. I just finished up writing a Python package so that any one could use FastMMW and provide a simple example and pictures, see previous post.

I tested the installation process on my A8 machine, which is a 4 core single processor system, and it is running Ubuntu 10.4.   I am running the installation on one of my multiprocessor system with two dual core processor (Opteron), and it is running Fedora 9.  Just to test the installation and the process with a complete different system: Boy it is always needed.

Per se, the installation process is not really interesting. The library comes with an INSTALL.txt file where I present the process step by step. I wanted to have the process done a little automatically. For some architectures with  multi-cores multiprocessors systems, where cores are hardware threads, I recommend a little of imagination and testing: do it manually! It is very instructive you will get a feeling about how the system distribute the computation by using the command “mpstat -P ALL”.

Let me not digress,  back to the installation process.

The idea is to create 4 libraries: one for each basic type such as float, double, complex float and complex double. Each library is built independently of each other and this will allow different optimizations.

Matrix addition optimization

MA is a two  dimensional loop code: the inner loop follow the inner dimension (i.e., column Fortran, Python, R matrices or row C matrices). The computation as spatial locality and it is trivial to parallelize (outer loop). The first step is to measure the performance with different number of threads and different unrolling of the inner loop (exploiting register allocation).

I chose unrolling of U = [4,8,16,32,64], in order. With everything left along but the unrolling, I will expect to increase performance until register spill reduces performance. For example, if I measure a drop in performance between 16 and 32, I will not test 64.

I chose a thread allocation. For example, if I have 4 cores [0,1,2,3] I will test the following configurations: L = {[0], [0,1], [0,1,2] and [0,1,2,3]}.

When I found the best configuration L and U, I will estimate the break even point when Fast algorithms are as fast as Goto’s or Whaley’s. Let us call this estimate B.

Break even evaluation

With a first estimate B, I start a linear search for the problem size where Winograd’s algorithm is as fast as the best GEMM. Once I have found the break even point I can create the library. I create also binary code where the user can test the real performance.

A note of wisdom

The main advantage of breaking the installation per type is the opportunity to have different L, U and B as a function of the basic GEMM installed. In fact, on my A8 machine GotoBLAS has better performance than ATLAS, and somehow the configurations are different, in particular L and U (the number of cores and the unrolling  used for MA) are different for GotoBLAS and ATLAS.

This is no good: it means that the BLAS package installation affects the performance in some subtle ways  that are not really apparent. This is good: because the little voice in side my head always saying “I told you there is something interesting even here”.

Once again, it seems I concluded this post with one further open question.Good research often does feed itself.

If you like to help and test this new framework, please contact me, I will follow up after Oct 1 2012.







Fast Matrix Multiplication Python package

After working on a R package, I wanted to work a little bit longer and see what it must take to write up a Python package. I have to say it was fun and actually easier than I expected and remembered. The process is partially automatic, most of it, and without writing manuals, description, comments and all the good stuff they teach in Software engineering classes is pretty straightforward.

The R and the Python package share the same C interface, but each has its own way. Using numpy array object and the ability to write different types of array (float, double, complex float and complex double) the interface is quite simple and quite nimble. We can pass large matrices as reference without copy.

There are few bugs to take care but I could make it run a simple example and run the usual error comparison as done in the previous post. With positive matrices and using double precision as reference I could compare the error committed:








The first thing I have noticed was the striking difference from the picture presented in the previous post. Here, each algorithm has its own range without intersection, Strassen algorithm has the worst maximum  error (as expected) and the two Winograd algorithms present different errors.  In contrast, In the R exercise,  I have all fast algorithms clustering in the same area and having no great distinction. And using the same random number generator but in the interval [-1,1] this is the plot:

That is very similar to the R exercise.

Having two different environments (actually three: R, Python and C), it will make easier to investigate the nuances of the algorithms’ numerical properties. Just playing with these I have already questions about the code generators and the effect of the difference results. Experiments are integral part of science and good experiments always beg more questions.

I will be clean up the code and make available these modifications and interfaces in the next distribution.



Fast Matrix Multiplication R Package



My work experience taught me one thing: people use a lot of different languages and environments; they thrive where one application can be done in different ways. In previous projects, I wrote fast kernels in C and  then wrote different interfaces: R, Java, Python, and Perl.  The top reason is that the basic functionality did not change, during the years we have changed the environments and the languages.  So to keep up with the new products, to keep a single repository of the kernels,  and finally to keep to a minimum the code maintenance, we spent quite some time learning how to write interfaces. Fortunately, most languages are very friendly to C kernels.

R is one of my preferred languages and interpreter. It is easy to use. It is a functional language. It has “lotsa”  of statistical tools. Friendly environment in creating packages and to create your own C kernels. Today, I can say I accomplished the creation of the first FastMMWR package: making possible to use FastMMW algorithms on top of GotoBLAS2 on my A8 AMD machine.

R and the R-interface is simple to create and to use. However, the standard copy of data  from R to C and vice versa, is a overwhelming: the speed up obtained by FastMMW algorithm is almost obliterated by the cost of preparing the input and the outputs. This is because I wanted to use a R wrapper. It will be more efficient to use the .C() interface directly without copy of the data.

But there is a silver lining to have the complete set of FastMMW algorithms and their R interface: R uses double and double complex matrices and thus the regular matrix multiplication  C <- A %*% B is basically of these two types. We could speed up the process by casting double to single and double-complex to single complex. This down cast can be done using the R copy interface, and then call the appropriate function.

If FastMMWR will not really shine for performance, why bother to have an interface and a package?

Because we can create error analysis tests for FastMMW and create beautiful graph in a matter of minutes:








For example, let us consider the Figure above.  We create 50 examples of matrices and B from a uniform distribution between 0 ans 1. The matrices have size 5000×5000. We create positive matrices in double precision and we compute the product reference in double precision. Then, we cast the input in single precision and performed the same product but in single precision (SGEMM). We ran also Winograd algorithm (SWGEMM), Strassen algorithm (SSGEMM) and our fastest algorithm (SBGEMM) where B stands for Bodrato, Marco (the original creator of the algorithm). All FastMMW algorithms have a single recursion because the break even point for this architecture is 3000×3000.

So our reference is in double precision and the comparables are four single precision algorithms. For this architecture, Winograd’s and Strassen’s algorithms have the same error on an average.  In my experience, this is new and interesting because in previous experiments Winograd’s algorithms had always better maximum error.  On average, the fast algorithms for one recursion has a multiplicative error factor of 1.5 only. The theoretical factor for Strassen is 3 and for Winograd is 4. We loose half of a decimal point.







If instead of creating positive matrices, we create matrices from the normal distribution with average 0 and variance 1, the plot is different from before but overall the error is as expected: that is, Strassen’s has smaller maximum error than Winograd’s algorithm. However, the Strassen’s has a multiplicative factor of 1.5 or less. Winograd’s factor is about 3.

The amazing thing is the speed in writing the test, collecting the results, and plotting the results. This is the code:

compare <- function(n,MM,type="normal") {

  Col =c("red", "blue", "black", "yellow", "green","orange","gray")

  dr = vector("numeric",n)
  dw = vector("numeric",n)
  db = vector("numeric",n)
  ds = vector("numeric",n)
  sr = vector("numeric",n)
  sw = vector("numeric",n)
  sb = vector("numeric",n)

  i = 1
  while (i<=n) {
    as = 1.0
    bs =1.0

    if (type =="normal")  {
       AS = array(rnorm(MM*MM),dim=c(MM,MM))
       BS = array(rnorm(MM*MM),dim=c(MM,MM))
    else {
      AS = array(runif(MM*MM,0,1),dim=c(MM,MM))
      BS = array(runif(MM*MM,0,1),dim=c(MM,MM))

    print (c(dim(AS),dim(BS)))

    CD =AS%*%BS

    CS = s_mm_leaf_computationR(as,AS,bs,BS)
    sr[i] =  max(abs(CD-CS))

    CS = s_wmpipeR(as,AS,bs,BS)
    sw[i] =  max(abs(CD-CS))

    CS = s_BMOWR_PIPER(as,AS,bs,BS)
    sb[i] =  max(abs(CD-CS))

    CDC = s_smpipeR(as,AS,bs,BS)
    dw[i] = max(abs(CD-CDC))  

    i = i +1

  L = list(sr,sw,sb,dw)
  i = 1
  while (i<=length(L)) {
    average = sum(L[[i]])/length(L[[i]])
    var = sqrt(sum((L[[i]]-average)^2)/length(L[[i]]))

    T = c(average+var,average-var,L[[i]])
    L[[i]] = T

    print (c(i,average,var))
    i = i + 1



plotL <- function(L,l,Col,t,f) {

  png(filename=f, height=512, width=728, bg="white")

  i =1
  M = 0
  while (i<length(L)) {
    M =max(c(M,L[[i]]))
    i =i+1    

  X = c(0,0,1:(length(L[[1]])-2))

  i = 1
  plot(X,L[[i]],xlab="sample", ylab="maximum Error", main=t, type="o", col=Col[i], ylim=c(0,M),pch=22) <- lm(L[[i]] ~ X)
  i = 2
  while (i<=length(L)) {
    lines(X,L[[i]], type="o", col=Col[i], ylim=c(0,M),pch=22) <- lm(L[[i]] ~ X)
    i = i +1   


L1 = compare(50,5000)
L2 = compare(50,5000,"uniform")


The FastMMWR package shines here: if you want to check whether or not you algorithm can afford to loose a little of accuracy by means of a faster algorithm, this package provides the tools to make such a decision. Knowledge is a good thing (sorry Martha, I know it is your trademark catch phrase).

AFDS 2012 a comment

I have found the symposium refreshing and I had a lot of fun. My presentation slides and the paper are available here in this site if you like; also, the talk has been recorded and it is available for broadcast on demand at the digital symposium framework.

It was fun because I could see things done instead of just talked about, the experience pavilion. The food was good. The coffee was plenty. The people interesting and sharp. The technical sessions may have less academic appealing than let’s say famous conferences. But you could feel that there is a sincere attempt to create a community for these new machines and  tools to develop software for them.

I personally liked it and I attended just two days. Unfortunately, I could not stay longer.

About Matrix Multiplication, there were other three technical sessions comprised mine:  mostly GPU related talks with incredible performance. My session was the only one that addressed all computational engines and presented a little about Fast Matrix Multiplication. I am happy that I refrained myself in presenting yet another implementation of FastMMW for this architecture. My main goal was to present a case for an algorithmic solution for heterogeneous computing of MM. Presenting Fast MM would be a step too much. Let us learn walking before running.

An attendee was so kind to share his opinion that  Winograd’s should not be worth to pursue because of performance and because of its numerical instability.  First about Performance,   the new GPUs have a single and  in hardware multiply-add operation, cutting by half the computation of the operations making less appealing any fast algorithms. Second, the unstable part was not really well explained but it was firmly embedded into the person mind.

As response for the performance caveat.  In the past, I worked with architectures where multiply-add instructions were  speeded up in hardware by avoiding register file access between the multiply and the addition, in practice by having fewer cycles for the operation  and supposedly improving accuracy. For those Machines, I applied successfully Winograd. Matt Badin has currently an implementation for a NVidia GPU with fused multiply-add where Fast MM are applicable.  I would say that it is architecture dependent but it is not a good reason.

As a response for the instability caveat. If you want to ignore the work I did, just Google the following: “fast matrix multiply is stable” and argue with prof. Demmel and previously with prof. Higham. Please, leave me alone. However, let me ask this question: did anyone measure the error of the GPU implementation of MM?





OpenCL + APU + GPU + Fast Matrix Multiply

Next Month, I will be going to Bellevue to present a short work about matrix multiplication and heterogeneous computing at the AMD Fusion developer Symposium. I better prepare my presentation because it will be recorded and made available. I just posted the paper online, search my bio or Google the title above.

I tested several architectures during these last seven years, since I started working on Winograd’s Fast Matrix Multiplication (fastMM). Nowadays,  it is the age of GPUs as computational accelerators and I did not have much hands-on experience.  So to learn something new, last year at this time, I started considering working on APUs. In fact, During a conversation with Matteo Frigo, it came up the topic about processors with cores and a GPU: the so called APU. At least, AMD calls them APU.  The underlying question was whether or not Fast MM could be used for APU. The short answer is yes, but this is not my point for today.

In my previous posts, I talked about the recent work about MM and GPU.  Winograd algorithm can be applied to GPUs and it is only a question of problem size. Also it may be beneficial for error analysis; yes, you read me. But APUs are hybrid and they are not a separate unit with astronomical GFLOPS. You may use the CPUs (e.g., 4) or the GPU. The A8 processor,  I had the luck to test, provides this dichotomy and each part may provide 100 GFLOPS, if you add an equivalent and external  GPU, like I did, you will have a system with about 200 GFLOPS peak performance for Matrix Multiplication.

This may look like a wimpy system —I received this comment already. In fact,  to publish anything today you may have to use a GPU with 1TFLOPS peak performance. Otherwise, it is not supercomputing.  I have a different opinion, which is based on the economy of the system: my built has the same performance of a Cell processor with the same cost of a PS3 Playstation, with the same power consumption, and quite enough that I can write this post by using this built while listening Ravel’s Bolero. Think about a supercomputing centre without the annoying whistle, air conditioning, and electric bill.  The washing machine is making more noise.

From the beginning, one year ago, I built a system (new thing for me), I learn a new framework, I looked into the code written for ATI’s GPUs, and  finally I start looking how I could write Fast MM for this parallel system. Having the GFLOPS/$ of a system built on a Cell, there is not need to brag performance.  An APU will take advantage of the software written for x86_64 architectures, for which I am very familiar, and for the type of programming for GPUs. Differently from a Cell, I do not have to write new code for this and I can reuse the optimal one already written.  I used OpenCL to glue everything together. It required some time due to the learning curve, but if I could do it, any younger developer will be able to, actually faster.

Once I was  ready, my interest and curiosity moved from the development of fast MM (i.e., Winograd’s) to the understanding of the new challenges of such as system. Especially, I was very curious about using all these computational units: CPUs, internal GPU and the external GPU.  I wanted to write codes that exploit the system but they are easy to maintain just in case  I will plug out any of the chips.

From the paper the conclusions read as follows:

4.4. Conclusions
In our system, the APU provides a software solution using only CPUs that can achieve 90GFLOPS (GotoBLAS). If we would like to improve performance by just working on a different and fast algorithm, we can achieve 120 GFLOPS. If we take advantage of both GPUs, we can achieve sustainable performance of about 160 GFLOPS (and a peak performance of 200 GFLOPS). This is a first attempt in putting together a OpenCL solution for the implementation of MM using hybrid parallel
systems. The heterogeneous system presents interesting challenges and, thanks to the OpenCL API, ultimately a flexible and powerful environment

The communications to and from memory are the bottleneck of the system. In theory, that is without communication costs, we could use the CPUs, internal GPU and external GPUs concurrently. Once communication costs are included, the utilisation of the system is less.

Please, take a look at the paper OpenCL+APU+GPU+Fast Matrix Multiply or Google it.




On the comparison of time series (a little detour from fast Matrix Multiply)

This week, I self published a paper about Non-Parametric methods for the comparison of series: for the detection of anomalies  or for the mining of similarities in (time) series.

Yep, I meant self publishing. If the contents of the paper does not belong to a journal nor a conference is it is my fault more than the machine-learning community. The paper is long counting 65 pages, it has too many experiments collecting 22 Figures, as well as too many notations, 40 equations and about 100 references.

This paper has been  a work of passion and love. Since I started working for the Search Engine group within  Yahoo! in  2007, I fell in love with statistics and its applications, in particular how to measure differences in between two processes (stochastic or not). My background is Engineering and communication is a must class; probability in signal processing and communication is fundamental; statistical tools are not so much. But in this new environment, so many engineering processes can be described as stochastic processes  and not necessarily because they are. Tools were and are extremely scarce.

We started by applying FIR and IR filters such moving average and exponential moving average with seasonal trends for single dimensional series. We also used stochastic measures applied to probability distributions functions (PDF) such as Entropy. See these  methods are close to my Digital signal processing roots. At the beginning the challenge was to build software and to use it accordingly to fuzzy requirements. Our costumers were different and the intents very different.

After two years, we had a monitoring tool handling quite a few big systems and it was a pleasure working with the engineers that built the search engine.

I will be for ever grateful to the managers who let me went deeper and I could work on and off on multidimensional series.  In  multidimensional series the methods are quite more challenging and publicly available tools are even more scarce.

Compression methods, conformal prediction, and Kernel methods are a few that really attracted me, because of the people working with me: prof. Kifer suggested Vovk’s work that is conformal prediction; prof. Smola was the king of kernel methods (both working in Y! at that time); and compression because it was called The measure in a SIAM conference. From my previous work, I wanted to investigate again the application of PDF and CDF for multivariate statistics. For all of the above methods,  there is a very rich literature ranging 40 years of research and very talented people came up with brilliant solutions. At the beginning, I wanted to enrich our stochastic library with these methods and if I could with something original from my left brain.

Implementing these methods is actually much harder than reading and understanding the papers. Without the help of the original authors I would have failed to reach any results; furthermore, considering that this process help me to come up with my own way to compare stochastic processes I must and I am very grateful.

My methods imply a classic idea in statistics: I create an order among the elements in the multidimensional space and then apply CDF measures. The tricky part is the order. The PDF and CDF measure are not trivial and we have also a little contribution in here. The methods boil down to the computation of a topological order of the samples and then compute a CDF, which is based on this topological order. The order can be achieved by sorting algorithms, you may be surprised how little is available in sorting multidimensional vectors, or by clustering algorithms, we used Dijkstra’s.

Note: The topological order is basically a breadth-first visit of a graph and, strangely,  I never remember how to do that during interviews. I swear I wrote my own version for directed acyclic graphs for this library, however it seems I cannot keep it into my long term memory.

With so much work done, with so much research done, I am in a strange and unfortunate spot: I am biased on the contribution of the paper and I do not want to compromise my intentions. I could shorten the paper, simplify it in contents and presentation, just present a classic comparison paper and provide a limited version of the code. But I do not want to do that: when I started this work I wanted to find a reference like this paper: an attempt to provide not only ideas but also practical insight into the methods by constructive observations.

At the end of the paper, I provide  the reviews from a journal where I submit the paper and it has been rejected. I do this for  my orphan papers and I do it  not because I am upset or for seeking revenge: I want to present the reviews because these are smart people that gave their time to help me improving the paper and their opinions must be told. I often disagree with reviewers but I really appreciate and respect their work.

The paper is available in my bio and in arXiv


On GPUs and Matrix Multiplication (ICS 2012)

This year, I was very lucky to be a member of the International Conference on Supercomputing program committee.  This is my first time on the other side of the barricades. I have a new appreciation about the organization, selecting, and reviewing of technical papers in a conference. In the future, I will be careful in uttering words such that  “Those idiot reviewers, they do not understand shit ..”. If you wrote a paper, and have being rejected, at a minimum, you thought those lines.

I have been reviewing technical papers for quite some time now. I wrote a few myself. Rejection is a personal affair. If you applied and got rejected,  I am Sorry.

Now that the conference program has been published I can talk about it: look for the GPUs, CPUs, and Linear Algebra session in the conference program. I can talk about it because their intent is similar to mine in understanding how GPUs and  CPUs must work together. By the way, I will present my own work at  AMD Fusion12 Developer Summit (June 11-14 Bellevue WA), I will talk about it as soon as I will be back from the conference.

I am fascinated by how these GPUs are becoming ubiquitous and how just recently the research community is taking notice how to use these SIMD computation engines  in combination with multi-core multiprocessor systems. I have mixed feeling about GPUs and the body of wok about them: these are marvelous computing systems that can outperform any multi-core processor on a set of applications. But let us remember that not all applications are equal and sometimes you will not touch a GPU with a barge pole.

Last year, I spend less than $500 and built my own AMD machine with APU and GPU (16GB 1866 MHz) and I could write Fast MM codes in less than 6 moths. This was my first built, I did not know anything about OpenCL and GPU’s codes, and pretty much I am an idiot: this means that the tools and the examples available to code for these complex system are pretty good. My built configuration has a practical peak performance for my codes of about 200 GFLOPS (only the Cell processor could achieve the same performance for the buck). But if you are willing to spend for the-state-of-the-art GPU you may well build a sustainable 1 TFLOPS machine (see the ICS session).

I finally witness the study of the practical cases when the cost to move data to and from memory is accounted: Often in the literature and in the code available you can measure time within the GPU; that is, the execution time is measured as the start time and end time of the GPU execution when the data are already available in the internal memory. In general, the data are not in the GPU and they are at least in memory. Often the problem size is so big that it may not fit into the internal 2-4GB memory, and we need to break the computation. When there are multiple GPUs or there are GPU and CPU, there may be contention and the computation must consider it.

Once again, communication is the bottle neck.

Take a sneak peak at the conference schedule and consider to go, Venice is lovely in June.








Improving the Accuracy of High Performance BLAS Implementations using Adaptive Blocked Algorithms

That is a long title uh? It is actually the title of a paper by Matt Badin, in which I have a small contribution, if you like. Accuracy of a Fast MM is always a problem and I have already wrote a post about it: The lightness of being wrong. I have to say that the previous title is so much fun and this is just too academic. Bottom line: we can and we do improve the accuracy of Fast MMs.

If you read the previous post with lots of notations, formula, and a set of conclusions, then, this post will be easy. If you did not, please go read it, I can wait,  and come back to this post.

As a reminder, Fast MMs perform gathering operations by adding sub-matrices together from the operands A and B, perform recursively matrix multiplication, and then combine the partial results.  Let us call the additions of the operands  as pre-addition and the combination of the results as post-additions.

The numerical error that everybody hates is the so called cancellation: when two large numbers of opposite sign are added up, the result is so small that in finite precision it is reduced to zero. This cancellation error is frightening for a few. How this error comes forth in Fast MM?

Pre-addition: If we add two matrices, the probability that to addenda are of opposite signs and of the same magnitude, is really small but possible. Considering that for large matrices,  we are actually playing a Russian roulette, we are bound to hit the target. Most commonly the error that we commit in a pre-addition is the common addition error. I can see that there can be error but there is little to be done. Pre-addition cancellations are anomalies too difficult to eradicate.

  • The depth of the computation is small, we perform one addition. There is no trick  to compensate for the error.
  • The problem is that the error will be injected into a computation with a much larger sequence of operations: will be injected into a MM, if we stop recursion right there, or recursively it will increase and amplify even further.


Post-Addition: Here comes the pickle, now that we have performed our MMs we must perform the post-addition and actually in perfect precision we want to have perfect cancellation. We may not have perfect cancellation because of the error injected in the pre-addition (which is small)  and because of the computation of MM itself, which is at most linear in the size of the problem size (N). The post-addition can do very little to correct the error, because we can perform only a few (but on large matrices).

  • The depth of the computation is small, we cannot really compensate for the error.

MM Hint: The key of this post is that to improve the accuracy of Fast MM, the only place we can do something about is by improving the accuracy of the Kernel MM. Funny, is it ?

In the paper, Matt does a good job describing the basic idea that the MM kernel should be broken up, blocked, so that to reduce the effect of the summation error. This is actually known in the literature. Just look up any reference about blocked linear algebra application and stability. The novelty is that to achieve the optimal accuracy and optimal performance there is very little work about it. I do not really want to give away the punch line, but a top-down algorithm with what is called a post load addition, will provide both accuracy and speed up.

Matt shows that with the right kernel, we can improve the accuracy of Fast MM.

For example, If you are working with probability matrices, doing  lots of MMs to determine the evolution of a probabilistic system, then  you should work with  Winograd’s algorithm with this adapted kernel.  You will have a faster and more accurate result than using ATLAS, GotoBLAS, MKL or  my Uncle’s GEMM.

FastMM will provide the speed, the modified MM kernel, will provide the accuracy.

For other problems and for Strassen’s Fast MM, this approach will work as well. You will have a faster algorithm but you will not have the most accurate: However,  much closer to what you could have with the most accurate GotoBLAS GEMM.

Take a pick at the paper, it has been published recently.