OpenCL + APU + GPU + Fast Matrix Multiply

Next month I will be going to Bellevue to present a short work about matrix multiplication and heterogeneous computing at the AMD Fusion Developer Symposium. I had better prepare my presentation, because it will be recorded and made available. I just posted the paper online; search my bio or Google the title above.

I have tested several architectures over the last seven years, since I started working on Winograd's fast matrix multiplication (fast MM). Nowadays it is the age of GPUs as computational accelerators, and I did not have much hands-on experience with them. So, to learn something new, at this time last year I started considering APUs. In fact, during a conversation with Matteo Frigo, the topic of processors combining CPU cores and a GPU came up: the so-called APU (at least, that is what AMD calls them). The underlying question was whether fast MM could be used on an APU. The short answer is yes, but that is not my point for today.

In my previous posts, I talked about recent work on MM and GPUs. Winograd's algorithm can be applied to GPUs; it is only a question of problem size. It may even be beneficial for error analysis; yes, you read that right. But APUs are hybrid: they are not a separate unit with astronomical GFLOPS. You may use the CPUs (e.g., four cores) or the integrated GPU. The A8 processor I had the luck to test provides this dichotomy, and each part may provide about 100 GFLOPS; if you add an equivalent external GPU, as I did, you get a system with about 200 GFLOPS peak performance for matrix multiplication.
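To make "Winograd's algorithm" concrete, here is a minimal NumPy sketch of one level of the Strassen-Winograd recursion (seven half-size products instead of eight). It is only an illustration of the schedule, not the tuned OpenCL code of the paper, and it assumes square matrices of even dimension:

    import numpy as np

    def winograd_one_level(A, B):
        """One recursion level of the Strassen-Winograd variant:
        7 half-size products and 15 block additions."""
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

        S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
        T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

        # The 7 half-size products; here they are plain NumPy GEMMs,
        # while a hybrid implementation would dispatch them to devices.
        M1 = A11 @ B11; M2 = A12 @ B21; M3 = S4 @ B22; M4 = A22 @ T4
        M5 = S1 @ T1;   M6 = S2 @ T2;   M7 = S3 @ T3

        U2 = M1 + M6; U3 = U2 + M7; U4 = U2 + M5
        C = np.empty_like(A)
        C[:n, :n] = M1 + M2   # C11
        C[:n, n:] = U4 + M3   # C12
        C[n:, :n] = U3 - M4   # C21
        C[n:, n:] = U3 + M5   # C22
        return C

    A, B = np.random.rand(256, 256), np.random.rand(256, 256)
    print(np.allclose(winograd_one_level(A, B), A @ B))  # True

In practice such a recursion bottoms out into a conventional GEMM below some cutoff size, which is why problem size matters so much.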

This may look like a wimpy system (I have received this comment already). In fact, to publish anything today you may have to use a GPU with 1 TFLOPS peak performance; otherwise it is not supercomputing. I have a different opinion, based on the economy of the system: my build has the same performance as a Cell processor, at the same cost as a PlayStation 3, with the same power consumption, and it is quiet enough that I can write this post on it while listening to Ravel's Bolero. Think about a supercomputing centre without the annoying whistle, the air conditioning, and the electric bill. The washing machine is making more noise.

From the beginning, one year ago, I built a system (a new thing for me), I learned a new framework, I looked into code written for ATI's GPUs, and finally I started looking at how I could write fast MM for this parallel system. With the GFLOPS/$ of a system built on a Cell, there is no need to brag about performance. An APU takes advantage of software written for x86_64 architectures, with which I am very familiar, and of the programming style used for GPUs. Unlike with a Cell, I do not have to write new code for this; I can reuse the optimized code already written. I used OpenCL to glue everything together. It required some time due to the learning curve, but if I could do it, any younger developer will be able to, and probably faster.

Once I was ready, my interest and curiosity moved from the development of fast MM (i.e., Winograd's) to understanding the new challenges of such a system. In particular, I was very curious about using all these computational units: the CPUs, the internal GPU, and the external GPU. I wanted to write code that exploits the whole system but is easy to maintain in case I unplug any of the chips.
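As a hint of what the glue looks like, here is a tiny sketch using pyopencl (my choice of binding for illustration, not the code from the paper) that enumerates the compute devices the OpenCL runtime exposes on such a box, i.e., the CPU cores, the APU's integrated GPU, and the discrete GPU, so that work can be partitioned among whatever is actually plugged in:

    import pyopencl as cl

    # Enumerate every device the OpenCL runtime exposes; on an APU box with a
    # discrete card this typically yields one CPU device and two GPU devices.
    cpus, gpus = [], []
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            print(f"{platform.name}: {device.name}")
            if device.type & cl.device_type.GPU:
                gpus.append(device)
            elif device.type & cl.device_type.CPU:
                cpus.append(device)

    # Crude fallback so the code keeps working if a chip is unplugged:
    # dispatch to the GPUs when present, otherwise to the CPU device.
    targets = gpus if gpus else cpus
    print("dispatching MM tiles to:", [d.name for d in targets])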

From the paper, the conclusions read as follows:

4.4. Conclusions
In our system, the APU provides a software solution using only CPUs that can achieve 90 GFLOPS (GotoBLAS). If we would like to improve performance by just working on a different and fast algorithm, we can achieve 120 GFLOPS. If we take advantage of both GPUs, we can achieve sustainable performance of about 160 GFLOPS (and a peak performance of 200 GFLOPS). This is a first attempt at putting together an OpenCL solution for the implementation of MM using hybrid parallel systems. The heterogeneous system presents interesting challenges and, thanks to the OpenCL API, ultimately a flexible and powerful environment.

Communications to and from memory are the bottleneck of the system. In theory, that is, without communication costs, we could use the CPUs, the internal GPU, and the external GPU concurrently. Once communication costs are included, the utilisation of the system is lower.
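A back-of-the-envelope sketch of that trade-off (the peak and bandwidth numbers below are assumptions for illustration, not measurements from the paper): for an n x n single-precision product shipped to a discrete GPU, compute grows as n^3 while traffic grows as n^2, so small tiles are badly transfer-bound.

    def effective_gflops(n, peak_gflops=100.0, link_gb_per_s=8.0):
        """Attainable GFLOPS for one n x n float32 GEMM when A and B must be
        sent over a link and C copied back (assumed peak and bandwidth)."""
        flops = 2.0 * n ** 3             # multiply-adds in C = A * B
        bytes_moved = 3 * n * n * 4      # A and B in, C out, 4 bytes each
        t_compute = flops / (peak_gflops * 1e9)
        t_transfer = bytes_moved / (link_gb_per_s * 1e9)
        return flops / (t_compute + t_transfer) / 1e9

    for n in (256, 1024, 4096):
        print(n, round(effective_gflops(n), 1))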

Please take a look at the paper, OpenCL + APU + GPU + Fast Matrix Multiply, or Google it.


On the comparison of time series (a little detour from fast Matrix Multiply)

This week, I self-published a paper about non-parametric methods for the comparison of series: for the detection of anomalies or for the mining of similarities in (time) series.

Yep, I meant self-publishing. If the contents of the paper belong in neither a journal nor a conference, it is my fault more than the machine-learning community's. The paper is long, counting 65 pages; it has too many experiments, collecting 22 figures, as well as too much notation, 40 equations, and about 100 references.

This paper has been a work of passion and love. Since I started working for the search-engine group within Yahoo! in 2007, I have been in love with statistics and its applications, in particular how to measure differences between two processes (stochastic or not). My background is engineering, where communications is a required class; probability in signal processing and communications is fundamental, but statistical tools are not so much. In this new environment, many engineering processes can be described as stochastic processes, and not necessarily because they are. Tools were, and are, extremely scarce.

We started by applying FIR and IIR filters, such as the moving average and the exponential moving average with seasonal trends, to one-dimensional series. We also used stochastic measures applied to probability distribution functions (PDFs), such as entropy. You see, these methods are close to my digital signal processing roots. At the beginning, the challenge was to build software and to use it according to fuzzy requirements. Our customers were different and their intents very different.
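As a toy illustration of the filtering side (not the code we used at Yahoo!; the parameters here are made up), the exponential moving average is just a one-pole IIR filter, and an anomaly detector can flag points that stray too far from the smoothed baseline:

    def ema(series, alpha=0.1):
        """Exponential moving average: a one-pole IIR filter.
        alpha close to 1 tracks the signal, close to 0 smooths it."""
        smoothed, state = [], series[0]
        for x in series:
            state = alpha * x + (1.0 - alpha) * state
            smoothed.append(state)
        return smoothed

    def anomalies(series, alpha=0.1, threshold=3.0):
        """Indices where the residual exceeds `threshold` residual RMS values."""
        baseline = ema(series, alpha)
        residuals = [x - b for x, b in zip(series, baseline)]
        scale = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
        return [i for i, r in enumerate(residuals) if abs(r) > threshold * scale]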

After two years, we had a monitoring tool handling quite a few big systems, and it was a pleasure working with the engineers who built the search engine.

I will be forever grateful to the managers who let me go deeper, so that I could work on and off on multidimensional series. For multidimensional series the methods are considerably more challenging, and publicly available tools are even scarcer.

Compression methods, conformal prediction, and kernel methods are a few that really attracted me, largely because of the people working with me: Prof. Kifer suggested Vovk's work, that is, conformal prediction; Prof. Smola was the king of kernel methods (both were at Y! at the time); and compression because it was called "the" measure at a SIAM conference. From my previous work, I wanted to investigate again the application of PDFs and CDFs to multivariate statistics. For all of the above methods there is a very rich literature spanning 40 years of research, and very talented people have come up with brilliant solutions. At the beginning, I wanted to enrich our stochastic library with these methods and, if I could, with something original from my left brain.

Implementing these methods is actually much harder than reading and understanding the papers. Without the help of the original authors, I would have failed to reach any results; furthermore, considering that this process helped me come up with my own way to compare stochastic processes, I must be, and I am, very grateful.

My methods rely on a classic idea in statistics: I create an order among the elements in the multidimensional space and then apply CDF measures. The tricky part is the order. The PDF and CDF measures are not trivial either, and we have a little contribution there as well. The methods boil down to computing a topological order of the samples and then computing a CDF based on this topological order. The order can be achieved by sorting algorithms (you may be surprised how little is available for sorting multidimensional vectors) or by clustering algorithms; we used Dijkstra's.
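To make the general recipe concrete, here is a minimal sketch, and emphatically not the construction in the paper: order the multidimensional samples along a one-dimensional score (here, simply the distance from the reference sample's mean, standing in for the graph-based order), build the two empirical CDFs along that order, and compare them with a Kolmogorov-Smirnov-style statistic. The ordering rule and the example data are illustrative assumptions.

    import numpy as np

    def ks_along_order(X, Y):
        """Compare two multidimensional samples X, Y (rows = observations) by
        ordering all points along a 1-D score and taking a KS-style distance
        between the two empirical CDFs induced by that order."""
        pooled = np.vstack([X, Y])
        # Illustrative ordering: distance from the mean of the reference
        # sample X. The paper's order comes from a graph-based construction.
        score = np.linalg.norm(pooled - X.mean(axis=0), axis=1)
        labels = np.array([0] * len(X) + [1] * len(Y))
        is_x = (labels[np.argsort(score)] == 0)
        cdf_x = np.cumsum(is_x) / len(X)
        cdf_y = np.cumsum(~is_x) / len(Y)
        return np.max(np.abs(cdf_x - cdf_y))

    rng = np.random.default_rng(0)
    same = ks_along_order(rng.normal(size=(500, 3)), rng.normal(size=(500, 3)))
    shifted = ks_along_order(rng.normal(size=(500, 3)),
                             rng.normal(1.0, 1.0, size=(500, 3)))
    print(same, shifted)  # the shifted pair should score noticeably higher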

Note: the topological order is basically a breadth-first visit of a graph and, strangely, I never remember how to do it during interviews. I swear I wrote my own version for directed acyclic graphs for this library; however, it seems I cannot keep it in my long-term memory.
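For what it is worth, a minimal breadth-first (Kahn-style) topological sort of a DAG looks like this; the example graph is made up, not the one from the library:

    from collections import deque

    def topological_order(graph):
        """Kahn's algorithm: breadth-first topological sort of a DAG given as
        {node: [successors]}. Raises ValueError if the graph has a cycle."""
        indegree = {u: 0 for u in graph}
        for u in graph:
            for v in graph[u]:
                indegree[v] = indegree.get(v, 0) + 1
        queue = deque(u for u, d in indegree.items() if d == 0)
        order = []
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph.get(u, []):
                indegree[v] -= 1
                if indegree[v] == 0:
                    queue.append(v)
        if len(order) != len(indegree):
            raise ValueError("graph has a cycle")
        return order

    print(topological_order({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}))
    # ['a', 'b', 'c', 'd'] (or ['a', 'c', 'b', 'd'])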

With so much work done, with so much research done, I am in a strange and unfortunate spot: I am biased about the contribution of the paper and I do not want to compromise my intentions. I could shorten the paper, simplify its contents and presentation, present just a classic comparison paper, and provide a limited version of the code. But I do not want to do that: when I started this work, I wished I had a reference like this paper, an attempt to provide not only ideas but also practical insight into the methods through constructive observations.

At the end of the paper, I include the reviews from a journal where I submitted the paper and it was rejected. I do this for my orphan papers, and not because I am upset or seeking revenge: I want to present the reviews because these are smart people who gave their time to help me improve the paper, and their opinions should be heard. I often disagree with reviewers, but I really appreciate and respect their work.

The paper is available in my bio (http://www.fastmmw.com/wp-content/uploads/2011/04/seriesPaolo.pdf.gz) and on arXiv (http://arxiv.org/abs/1205.1880).