On GPUs and Matrix Multiplication (ICS 2012)

This year, I was very lucky to be a member of the International Conference on Supercomputing program committee. This was my first time on the other side of the barricades, and I have a new appreciation for the organizing, selecting, and reviewing of technical papers for a conference. In the future, I will be careful before uttering words such as “Those idiot reviewers, they do not understand shit...”. If you have ever written a paper and had it rejected, at a minimum you have thought those lines.

I have been reviewing technical papers for quite some time now, and I have written a few myself. Rejection is a personal affair. If you submitted and got rejected, I am sorry.

Now that the conference program has been published, I can talk about it: look for the GPUs, CPUs, and Linear Algebra session in the program. I can talk about it because the authors' intent is similar to mine: understanding how GPUs and CPUs must work together. By the way, I will present my own work at the AMD Fusion12 Developer Summit (June 11-14, Bellevue, WA), and I will write about it as soon as I am back from the conference.

I am fascinated by how GPUs are becoming ubiquitous and how only recently the research community has taken notice of how to use these SIMD computation engines in combination with multi-core, multiprocessor systems. I have mixed feelings about GPUs and the body of work about them: they are marvelous computing systems that can outperform any multi-core processor on a certain set of applications. But let us remember that not all applications are equal, and sometimes you would not touch a GPU with a barge pole.

Last year, I spent less than $500 to build my own AMD machine with an APU and a GPU (16 GB, 1866 MHz), and I could write fast matrix-multiplication (MM) codes in less than six months. This was my first build, I did not know anything about OpenCL or GPU code, and I am pretty much an idiot: this means that the tools and the examples available for coding these complex systems are pretty good. My configuration has a practical peak performance for my codes of about 200 GFLOPS (only the Cell processor could achieve the same performance per buck). But if you are willing to spend on a state-of-the-art GPU, you may well build a machine with a sustained 1 TFLOPS (see the ICS session).
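
To give a flavor of what those six months start from, here is a minimal sketch of a naive OpenCL matrix multiply. This is not my actual code, just an illustration: no error checking, no tiling into local memory, and the kernel name (mmul) and sizes are mine for the example.

/* Minimal sketch of a naive OpenCL matrix multiply. Illustrative only:
 * no error checking, no local-memory tiling. Link with -lOpenCL. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src =
    "__kernel void mmul(__global const float *A, __global const float *B,\n"
    "                   __global float *C, int n) {\n"
    "    int i = get_global_id(1), j = get_global_id(0);\n"
    "    float acc = 0.0f;\n"
    "    for (int k = 0; k < n; k++) acc += A[i*n + k] * B[k*n + j];\n"
    "    C[i*n + j] = acc;\n"
    "}\n";

int main(void) {
    cl_int n = 1024;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 1.0f; }

    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    /* Profiling enabled so event timestamps are valid (used further below). */
    cl_command_queue q = clCreateCommandQueue(ctx, dev,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "mmul", NULL);

    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
    clSetKernelArg(kern, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &dB);
    clSetKernelArg(kern, 2, sizeof(cl_mem), &dC);
    clSetKernelArg(kern, 3, sizeof(cl_int), &n);

    /* Ship the inputs, launch one work-item per output element, read back. */
    clEnqueueWriteBuffer(q, dA, CL_FALSE, 0, bytes, A, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, dB, CL_FALSE, 0, bytes, B, 0, NULL, NULL);
    size_t gws[2] = { (size_t)n, (size_t)n };
    clEnqueueNDRangeKernel(q, kern, 2, NULL, gws, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dC, CL_TRUE, 0, bytes, C, 0, NULL, NULL);
    printf("C[0] = %.1f (expected %d)\n", C[0], n);  /* all-ones inputs */
    return 0;
}

As a back-of-the-envelope check: a dense multiply of order n costs 2n^3 floating-point operations, so at 200 GFLOPS an n = 1024 multiply should take roughly 11 ms.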

I am finally witnessing studies of the practical case where the cost of moving data to and from memory is accounted for. Often, in the literature and in the available code, time is measured within the GPU; that is, the execution time is taken as the interval between the start and the end of the GPU execution, when the data are already available in the device's internal memory. In general, the data do not start out on the GPU; at best, they are in main memory. Often the problem size is so big that it does not fit into the 2-4 GB of internal memory, and we need to break up the computation. When there are multiple GPUs, or a GPU and a CPU working together, there may be contention, and the computation must take it into account.
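
To make the difference concrete, here is a sketch of the two measurements side by side, reusing the names from the sketch above (again illustrative, not a reference implementation): OpenCL event profiling gives the kernel-only number often reported, while a wall-clock measurement around the transfers gives what a user actually waits for.

/* Continues main() from the sketch above; add #include <time.h> up top.
 * Two timings of the same multiply: kernel-only via OpenCL events vs.
 * end-to-end wall clock including the host<->device transfers. Requires
 * the queue to have been created with CL_QUEUE_PROFILING_ENABLE. */
cl_event ev;
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
clEnqueueWriteBuffer(q, dA, CL_FALSE, 0, bytes, A, 0, NULL, NULL);
clEnqueueWriteBuffer(q, dB, CL_FALSE, 0, bytes, B, 0, NULL, NULL);
clEnqueueNDRangeKernel(q, kern, 2, NULL, gws, NULL, 0, NULL, &ev);
clEnqueueReadBuffer(q, dC, CL_TRUE, 0, bytes, C, 0, NULL, NULL); /* blocking */
clock_gettime(CLOCK_MONOTONIC, &t1);

cl_ulong ks, ke;   /* device timestamps, in nanoseconds */
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof ks, &ks, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof ke, &ke, NULL);
double kernel_s = (double)(ke - ks) * 1e-9;
double total_s  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
double flops    = 2.0 * n * n * (double)n;
printf("kernel only:    %.1f GFLOPS\n", flops / kernel_s * 1e-9);
printf("with transfers: %.1f GFLOPS\n", flops / total_s  * 1e-9);

The gap between the two numbers is the communication cost, and it only grows once the problem is too large for device memory and every tile must be shipped across the bus.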

Once again, communication is the bottleneck.
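
That bottleneck also shapes how out-of-core GPU codes are organized. When the matrices do not fit in device memory, the host loops over tiles and streams them through the GPU. Here is a plain C sketch of that blocking structure, with the device work reduced to an ordinary loop (and a comment marking where the transfers and the kernel launch would go) so it runs on its own.

/* Sketch of the host-side blocking loop used when A, B, C exceed device
 * memory. Each tile product here is a plain C loop so the sketch is
 * self-contained; on a real system the marked call is a host->device
 * copy, a kernel launch, and an accumulate back. Sizes are illustrative
 * and N must be a multiple of T. */
#include <stdlib.h>

enum { N = 2048, T = 512 };   /* matrix order, tile order */

/* C[ib..ib+T, jb..jb+T] += A[ib.., kb..] * B[kb.., jb..] */
static void tile_mm(const float *A, const float *B, float *C,
                    int ib, int jb, int kb) {
    for (int i = ib; i < ib + T; i++)
        for (int k = kb; k < kb + T; k++)
            for (int j = jb; j < jb + T; j++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}

int main(void) {
    float *A = malloc(sizeof(float) * N * N);
    float *B = malloc(sizeof(float) * N * N);
    float *C = calloc((size_t)N * N, sizeof(float));
    for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 1.0f; }

    for (int ib = 0; ib < N; ib += T)
        for (int jb = 0; jb < N; jb += T)
            for (int kb = 0; kb < N; kb += T)
                /* On a GPU: ship three T x T tiles, launch, copy back. */
                tile_mm(A, B, C, ib, jb, kb);

    free(A); free(B); free(C);
    return 0;
}

Overlapping the tile transfers with the tile multiplies is exactly where the engineering, and the contention among devices, comes in.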

Take a sneak peek at the conference schedule http://verona.dei.unipd.it/ics2012/program.html and consider going; Venice is lovely in June.