Next Month, I will be going to Bellevue to present a short work about matrix multiplication and heterogeneous computing at the AMD Fusion developer Symposium. I better prepare my presentation because it will be recorded and made available. I just posted the paper online, search my bio or Google the title above.
I tested several architectures during these last seven years, since I started working on Winograd’s Fast Matrix Multiplication (fastMM). Nowadays, it is the age of GPUs as computational accelerators and I did not have much hands-on experience. So to learn something new, last year at this time, I started considering working on APUs. In fact, During a conversation with Matteo Frigo, it came up the topic about processors with cores and a GPU: the so called APU. At least, AMD calls them APU. The underlying question was whether or not Fast MM could be used for APU. The short answer is yes, but this is not my point for today.
In my previous posts, I talked about the recent work about MM and GPU. Winograd algorithm can be applied to GPUs and it is only a question of problem size. Also it may be beneficial for error analysis; yes, you read me. But APUs are hybrid and they are not a separate unit with astronomical GFLOPS. You may use the CPUs (e.g., 4) or the GPU. The A8 processor, I had the luck to test, provides this dichotomy and each part may provide 100 GFLOPS, if you add an equivalent and external GPU, like I did, you will have a system with about 200 GFLOPS peak performance for Matrix Multiplication.
This may look like a wimpy system —I received this comment already. In fact, to publish anything today you may have to use a GPU with 1TFLOPS peak performance. Otherwise, it is not supercomputing. I have a different opinion, which is based on the economy of the system: my built has the same performance of a Cell processor with the same cost of a PS3 Playstation, with the same power consumption, and quite enough that I can write this post by using this built while listening Ravel’s Bolero. Think about a supercomputing centre without the annoying whistle, air conditioning, and electric bill. The washing machine is making more noise.
From the beginning, one year ago, I built a system (new thing for me), I learn a new framework, I looked into the code written for ATI’s GPUs, and finally I start looking how I could write Fast MM for this parallel system. Having the GFLOPS/$ of a system built on a Cell, there is not need to brag performance. An APU will take advantage of the software written for x86_64 architectures, for which I am very familiar, and for the type of programming for GPUs. Differently from a Cell, I do not have to write new code for this and I can reuse the optimal one already written. I used OpenCL to glue everything together. It required some time due to the learning curve, but if I could do it, any younger developer will be able to, actually faster.
Once I was ready, my interest and curiosity moved from the development of fast MM (i.e., Winograd’s) to the understanding of the new challenges of such as system. Especially, I was very curious about using all these computational units: CPUs, internal GPU and the external GPU. I wanted to write codes that exploit the system but they are easy to maintain just in case I will plug out any of the chips.
From the paper the conclusions read as follows:
4.4. Conclusions
In our system, the APU provides a software solution using only CPUs that can achieve 90GFLOPS (GotoBLAS). If we would like to improve performance by just working on a different and fast algorithm, we can achieve 120 GFLOPS. If we take advantage of both GPUs, we can achieve sustainable performance of about 160 GFLOPS (and a peak performance of 200 GFLOPS). This is a first attempt in putting together a OpenCL solution for the implementation of MM using hybrid parallel
systems. The heterogeneous system presents interesting challenges and, thanks to the OpenCL API, ultimately a flexible and powerful environment
The communications to and from memory are the bottleneck of the system. In theory, that is without communication costs, we could use the CPUs, internal GPU and external GPUs concurrently. Once communication costs are included, the utilisation of the system is less.
Please, take a look at the paper OpenCL+APU+GPU+Fast Matrix Multiply or Google it.
























where m=2k+1 and n=2j+1 are odd, then we can do a first division by setting
In practice, we can apply once Strassen’s recursive approach for the product
The remain of the computation is carried out as a sequence of Matrix-Vector and VEctor-Vector operations![Rendered by QuickLaTeX.com \[ C_0 += A_1*B_2, \\ C_1= A_0*B_1 +A_1*B3, \\ C_2 = A_2*B_0 + A_3*B_2, \\ C_3 = A_2*B_1+A_3*B_3. \]](http://www.fastmmw.com/wp-content/ql-cache/quicklatex.com-f3218c30a5cf389688a0737271c91a95_l3.png)


.
Everything is like we pad
with an bottom row of zeros and a left column with zero. In practice, the definition of matrix addition is simple, it follows real unoptimized code …
















