Fiji ProDuo vs Hawaii 295×2: clBLAS preliminary performance.

What can you do with a few TeraFLOPs? I have got  a beautiful (if not a little noisy) Radeon 295×2. I plugged to my second PCI port and I have run a few experiments using clBLAS. This was my last post (below). Of course, the main limitation of my system is not the GPU, it  is every thing else. The 295×2 has a peak performance of 11 TFLOPS; however I  know that the data have to come from somewhere and the results often cannot stay tucked in the GPU. In my tests, I assume the data will come from the main memory and has to come back to it. I show that of the 11TFLOPS, I can achieve 3TFLOPS using single precision.  This is quite something:  considering that the card consume about  500W and I did not buy to play games, I thought I had a little supercomputer that I can keep beside my monitor.

Recently I added the ProDuo card based on the Fiji architecture in my GPUs arsenal. There are a few advantages (without considering the cost) versus the 295×2. The solution and packaging allows a smaller card without external fan (quiter), similar water cooling, and reduce power consumption, and more compute nodes (from 80 to 120). All of these goodies using the same 28nm technology, so this is pure reorganization thanks to the high bandwidth memory setup.  In practice, the ProDuo is clearly a better card and it should replace the 295×2. Right ?

single singlecomplex
Due to my limited PSU (1KW), because my setup limits any card in the first slot, because clBLAS has a new setup, I decided to re-run the experiments using the 295×2 and the ProDuo in the second slot (OpenCL would say that the card will take the place of GPU1 and GPU2). These cards are designed for single precision computations: The ProDuo has a peack performance of 16TFLOPS and the 295×2 has 11 TFLOPS (sorry I tend to repeat my self). The new clBLAS provides better kernels and you can see that The 295×2 achieves 5TFLOPS and ProDuo about 6TFLOPS. Good thing, I spent some time to redesign the experiments and I re-run the tests once again.  Once again, the bottle neck is my system and the way I feed the data, but you can see that having two GPUs in the card will allow a 2x speed up (instead of a single GPU).

A note, the plots above and the ones that follows will have missing points thanks to a few quirks in OpenCL that we are working on. Second, the cards have no modifications, they are coming directly from the box to the bench.

The 295×2 is still an awesome card considering that the difference is 0.9 TFLOPS. On the other side, the ProDuo is 20% faster, 30% more energy efficient. I can actually plug 2 ProDuo in my system without any further change, but I cannot plug the 295×2 and ProDuo together.  But what next it is still more interesting.

double doublecomplex

Yep. In double precision the ProDuo stays behind.  First, I am coming from the general purpose CPUs and their bench-marking,  I expect a factor of two penalty from going from single to double precision. Here, we can see that the factor is about 5. The Hawaii card can reach the 1 TFLOPS threshold marking, which sounds good;  Fiji’s has a 0.5 TFLOPS  upper limit. So the 0.9 TFLOP loss in single precision is a 0.4 TFLOP gain in double precision. Indeed, life is the stage for a variety of compromises.  In this case, I am not really sure if it is an architecture difference or kernels deployment. We will have to investigate, but it will require some help and effort.

But for me the most interesting part comes now, when I bought an extra PSU and now I can feed electricity to both cards on the same board.

IMG_20160530_170231168

In the next post, I will have a PCI extension so that I can put the big cards into the 2nd and  3rd slot,  I Will loose bandwidth but I should get full potential from the Hawaii and Fiji GPUs. Currently the two Hawaii are a little constrained and I can get the performance of a single GPU. With the extension,  I should be able to see  5 GPUS (the Turk on the first slot, 2 Hawaii and two Fiji). The system allows a three-way crossfire.

single precisionsingle complex preciston

Now, We have an heterogeneous system, in practice we have 3 GPUs effectively. The current experiments do not balanced the work load as a function of the GPUs throughput and thus the plots can be better, higher.

We can see that we earned yet another 1TFLOPS in single precision. The good news is that even in my system, the problem size and the computation time has such a ratio that more hardware will provide better performance and I can show it.  Also the introduction of more GPUs shows that the computation time become linear (the communication is the bottle neck). If I can unleash the forth GPUs, likely I will have little  improvement. But for double precision the curves are little different.

double precision  double complex precision

The three GPUS (Hawaii in position 1 and Fiji 2 and 3) provide a scalable solution but it is not always the best. The beauty of these plots is their complexity: considering the problem size and the configuration available, the best solution is not always straightforward.

The Future and motivations: 
At this time, my research is two fold:

First, I am investigating the field of deep learning for application of feature selection in advertising (yeah my real job) and GPUs seems the hardware of choice, so I wanted to have a test bed close by my real bed. These new systems promise and deliver performance unprecedented.

Second, with the coming of age of AutoGemm in clBLAS, we start having a self tuning BLAS for GPUs and an open source at that; this is an opportunity to re-evaluate kernels written using Strassen’s algorithm. In a system like mine, Strassen’s algorithm can be really appreciated only if they are done at kernel level: the computation performance plot it is too steep to take advantage of a divide and conquer (from the CPU) approach.