## The 8x8x8 Comparison with 4 GPUs (and 2 Fiji GPUs and 1 Fiji)

The last in the family: the 8x8x8 algorithm with 343 products. The family is now 2x2x2 M7, 3x3x3 M23, 4x4x4 M49, 5x5x5 M99, and 8x8x8 M343.

The algorithms are not fully correct yet: the matrix additions are sound only if the problem size is divisible by the base (2, 3, 4, 5, or 8), and basically we are computing O(N/base) more operations, which is negligible with respect to O(N^3). Having all these parallel products, I just plugged in all I have got: 4 GPUs.

All the algorithms M7, M23, M49, M99, and M343 use one level of recursion. That is, we want to apply the M algorithm once. However, Fiji has smaller memory, so we apply a hybrid approach: if the leaf computation reaches an Ellesmere, we call clBLAS because the problem just fits; otherwise, we apply a recursive Winograd 7 algorithm with leaf size 16000 (single), 12000 (double and single complex), and 9000 (double complex). In the previous posts, I showed that Winograd (RM7) on a Fiji, for problems larger than the sizes above, is faster than or comparable to clBLAS on an Ellesmere.
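The leaf policy can be sketched in a few lines of C. This is only an illustration: the names `precision_t`, `leaf_size`, and `winograd_levels` are hypothetical, and the stopping rule (halve the size, rounding up, until it is no larger than the leaf) is an assumption about how the recursion ends, not the actual FastMMW code.

```c
#include <assert.h>

/* Element types used in the experiments. */
typedef enum { SINGLE, DOUBLE, SCOMPLEX, DCOMPLEX } precision_t;

/* Leaf sizes quoted in the text for the recursive Winograd on a Fiji. */
int leaf_size(precision_t p) {
    switch (p) {
    case SINGLE:   return 16000;
    case DOUBLE:   return 12000;
    case SCOMPLEX: return 12000;
    case DCOMPLEX: return 9000;
    }
    return 0; /* not reached for a valid precision */
}

/* Number of 2x2x2 recursion steps before the sub-problem is handed to
 * the GPU GEMM. Assumption: each step halves the size, rounding up. */
int winograd_levels(int n, int leaf) {
    assert(n > 0 && leaf > 0);
    int levels = 0;
    while (n > leaf) {
        n = (n + 1) / 2;
        levels++;
    }
    return levels;
}
```

With the 9000 leaf, a double complex problem needs a second recursion step only above 18000; with the 12000 leaf, a single complex problem needs it only above 24000.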

This is an example of experiments where the strategy is really hybrid. This comes with the disclosure that the performance makes sense only if the problem does not fit in a GPU; otherwise clBLAS on the GPU is by far superior. So the goal is to compare the performance of the M algorithms and figure out, in practice, when one is superior. A reminder: GFLOPS means operations/time, and the operations are always counted as 2N^3 (complex operations involve 2x more operations but we still count them as 2N^3).
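As a small sketch of this normalization (the function names `gemm_ops` and `gflops` are mine, not part of the FastMMW code), the reported GFLOPS can be reproduced from the problem size and the wall-clock time:

```c
#include <assert.h>

/* Normalized operation count for `count` square GEMMs of size n:
 * 2*n^3 each, whatever the element type or the algorithm. */
double gemm_ops(double n, int count) {
    assert(n > 0.0 && count > 0);
    return 2.0 * n * n * n * count;
}

/* GFLOPS = operations / seconds / 1e9. */
double gflops(double ops, double seconds) {
    assert(seconds > 0.0);
    return ops / seconds / 1e9;
}
```

For instance, the four GEMMs of size 40,000 logged in the Pro Duo scheduling experiment further down count 5.12e14 operations; at 367.22 seconds that is about 1394 GFLOPS, which matches the log.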

M343 does not have the opportunity to shine and M99 gives its best; however, these algorithms never lead the pack. M49 is M7 applied twice: for DComplex, the M7 and M49 performance curves cross a little earlier than 18K (the leaf is 9K), which is the ideal, theoretical crossing point.

These experiments show that 2x2x2 M7, 3x3x3 M23 and 4x4x4  M49 cover a continuous performance space where each has a contribution.

With only Ellesmere:

With fewer resources and thus less parallelism, M99 gets the bronze. I will append the plots for Fiji only to see if the efficiency of the GPU makes any difference. Will I provide plots for a single GPU? Most likely not.

With Ellesmeres, we have slower kernels but larger problems solved in the GPUs. With Fijis, we have faster kernels. What is the overall effect?

M99 does not lead the pack but has  a smooth and predictable delivery for single precision. In the Single complex performance plot we can see clearly the second level of recursion in M7 at 24K. The knee is evident in the following as well.

We conclude this post by providing the data in csv format and the comparison of the M algorithms by type (Single, Double, SComplex, and DComplex) and number of GPUs. The idea is to show you how the different parallelism and GPUs affect the performance of each algorithm. The plots above should already have given a comparison among algorithms with fixed type and parallelism.

picture_TR_all.zip

In this link, you can find all the performance data and the plots by type and algorithm. Enjoy. It is a little messy because the naming is not completely clear. However, most of the performance plots are strangely consistent despite the asymmetric nature of my “poor scientist” rig.

What is left to me is to fix the matrix addition and then I may have a numerical comparison, wrap this up, write a paper, and publish it.

## The 5x5x5 with 99 Products a comparison

This will be fast: Winograd with 7 products (2x2x2 W7W), with 23 products (3x3x3 M23), and with 99 products (5x5x5 M99). I should add 4x4x4 M49.

## A close up Winograd M7 vs M23

I know there is an asymptotic measure of performance, and Axel Kemper would suggest that Winograd with 7 multiplications is better than the algorithm with 23 (M23). Considering that my M23 is still under investigation for correctness and performance verification, I still want to measure their normalized performance.

I considered the following experiment: we run M7 (the optimized Winograd W7 I have used since 2005) and the new M23, but we use a single Ellesmere GPU with 32GB. So M23 cannot exploit the parallelism and requires the same temporary space as W7 (one more temporary matrix for C). What else is different? If I take a 36x36x36 problem, then M23 will split it into 23 MMs of size 12x12x12; W7 will split it into 7 MMs of size 18x18x18. We perform only one recursion step and the leaf computation fits the 32GB.
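The 36x36x36 example can be worked out with a small helper. This is a sketch: the function name is hypothetical, the size is assumed to be divisible by the base, and the matrix additions are deliberately left out of the count.

```c
#include <assert.h>

/* Flops spent in the leaf products after ONE recursion step of a
 * base x base x base algorithm that uses `products` multiplications.
 * Assumes n is divisible by base; matrix additions are not counted. */
long long one_step_product_flops(long long n, int base, int products) {
    assert(n > 0 && base > 0 && products > 0);
    assert(n % base == 0);
    long long leaf = n / base;
    return (long long)products * 2 * leaf * leaf * leaf;
}
```

For N=36 this gives 79,488 flops for M23 (23 products of size 12), 81,648 for W7 (7 products of size 18), and 93,312 for the classic algorithm (`base` 1, one product). So in a single step M23 does slightly less multiplication work than W7, but on smaller leaves.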

Performance is related to the saving, fewer multiplications, and the peak performance of the leaf computation. This is the classic problem of algorithm strategy.

For single precision, I cannot say anything. For single complex, it seems that W7 is better, and I would suggest that this is because its leaves are faster.

For Double and Double complex, the story is reversed: I would choose M23.

Alas, there is no single winner. It would make such a nice result if there were one. The commonly used theory describes the complexity problem in a simple form. This theory would say that W7 should be better: the reality is richer. We must be willing to enrich the O() notation with more details and we must be open to experiment.

Even if this is a partial investigation and the results must be confirmed, I think that for a library it is important to have different algorithms and different strategies, and to collect different points of view.

Likely, the same experiment with the other GPU (Fiji), or with future ones, will provide different results. My point is that having a place and the attitude to experiment, even with algorithms that are not optimal, is important.

## My first 23-Product Fast Matrix Multiplication (3x3x3)

Today, we present my first practical performance evaluation of the 3x3x3 Fast algorithm by Jinsoo and Oh. In 2015, Axel Kemper shared his algorithm references: the list of papers, the representation of the algorithms in pseudo code, and Bini’s matrix format. The latter describes any algorithm using three matrices, and this led to a general-purpose interpreter for the execution of fast algorithms. It took only a few years.

In the past, I wanted to take the code and optimize it before executing it. A 3x3x3 algorithm has 23 products and quite a few matrix additions. The size of the code and its complexity asked for a compiler approach. I took the pseudo code and tried to optimize the schedule so as to improve the product/addition ratio and reduce the number of matrix temporaries. I thought it was time to have a general approach for it. The idea is still the right one, but I could not get any satisfactory results and it still does not work. What I can do now is create an interpreter that actually creates a schedule and runs these algorithms. Today, I will show execution time comparisons.

With the arrival of GPUs, and basically the idea of GEMM accelerators, the poor scientist's problem is how to use different accelerators together, coping with a limited budget and mixing old technology with newer. In the previous post, I showed a practical problem where we want to execute four matrix multiplications using 2 or 4 GPUs as helpers. Today, we have a practical approach to compute the 3x3x3 algorithm using 23 products.

The interpreter and the execution of the algorithm are not fully tested, but I am confident about the general direction, and the performance results are consistent and interesting to share.

Let’s start with the algorithm: s3x3x3_23.JinsooOh20131108a.bini. I have other versions of the same algorithm, and I think I will also be able to run others if they are balanced (such as 2x2x2, 4x4x4, and 5x5x5).

We prepare for real matrices instead of integer matrices; we will run solutions with scaling and other trickery. The idea is simple: take the first row of the matrices Gamma, Alpha, and Beta. Product 1 is the product of one sub-matrix of A (read the alpha row) and one sub-matrix of B (read the beta row), and it will be added to three sub-matrices of C (read the gamma row). The algorithm exposes 23 independent product computations. If you consider that we divide the matrices into 3×3 parts, the regular matrix multiplication requires 27 products and thus we save 4.
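The row-by-row reading described above can be sketched as a tiny interpreter. Several things here are assumptions: the contents of the .bini file are not shown in this post, so the sketch uses the textbook Strassen 2x2x2 coefficients (7 products) as a stand-in for the 3x3x3 ones; the layout (one row per product, one column per block, blocks in row-major order) is my reading of the description; and the blocks are plain scalars, where a real run would use sub-matrices and GEMM calls.

```c
#include <assert.h>

#define PRODUCTS 7   /* rows of alpha, beta, gamma: one per product */
#define BLOCKS   4   /* 2x2 blocks in row-major order: 11, 12, 21, 22 */

/* Strassen's 2x2x2 algorithm written as three coefficient matrices. */
static const int ALPHA[PRODUCTS][BLOCKS] = {
    { 1, 0, 0, 1}, { 0, 0, 1, 1}, { 1, 0, 0, 0}, { 0, 0, 0, 1},
    { 1, 1, 0, 0}, {-1, 0, 1, 0}, { 0, 1, 0,-1}
};
static const int BETA[PRODUCTS][BLOCKS] = {
    { 1, 0, 0, 1}, { 1, 0, 0, 0}, { 0, 1, 0,-1}, {-1, 0, 1, 0},
    { 0, 0, 0, 1}, { 1, 1, 0, 0}, { 0, 0, 1, 1}
};
static const int GAMMA[PRODUCTS][BLOCKS] = {
    { 1, 0, 0, 1}, { 0, 0, 1,-1}, { 0, 1, 0, 1}, { 1, 0, 1, 0},
    {-1, 1, 0, 0}, { 0, 0, 0, 1}, { 1, 0, 0, 0}
};

/* Interpret the three matrices, one product per row:
 * combine blocks of A (alpha row), combine blocks of B (beta row),
 * multiply once, and scatter the product into C (gamma row). */
void bilinear_multiply(const double a[BLOCKS], const double b[BLOCKS],
                       double c[BLOCKS]) {
    assert(a && b && c);
    for (int k = 0; k < BLOCKS; k++)
        c[k] = 0.0;
    for (int p = 0; p < PRODUCTS; p++) {
        double left = 0.0, right = 0.0;
        for (int i = 0; i < BLOCKS; i++) {
            left  += ALPHA[p][i] * a[i];
            right += BETA[p][i]  * b[i];
        }
        double product = left * right;   /* the only multiplication */
        for (int k = 0; k < BLOCKS; k++)
            c[k] += GAMMA[p][k] * product;
    }
}
```

The products are independent of each other, which is what makes it possible to hand them to different GPUs.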

The classic Strassen-Winograd algorithms describe the matrices as 2×2 with 7 products (instead of 8). If you like to do it twice, the matrices are 4×4 and the number of products is 49 (instead of 64). I have been interested in the 3x3x3 algorithm because it is an intermediate algorithm in between 2x2x2 and 4x4x4. Even though saving 4 products does not make it better asymptotically, it allows more granularity. If I can use 3x3x3 and not 4x4x4 because the leaf computation is too small to tickle the architecture, I have a better solution than using only the 2x2x2.
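To put numbers on the asymptotic remark: applying a base-b algorithm with p products recursively costs O(N^(log_b p)), so the exponents can be compared directly. This is the standard textbook analysis, not something measured in these experiments.

```latex
\log_2 7 \approx 2.807, \qquad
\log_3 23 \approx 2.854, \qquad
\log_4 49 = \log_2 7 \approx 2.807, \qquad
\text{classic: } \log_b b^3 = 3
```

So 3x3x3 with 23 products beats the classic algorithm but has a slightly worse exponent than 2x2x2 with 7 products.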

Ok, let us get started. I use M23 to specify the algorithm that uses 23 products. WINOGRAD-OPT is my classic Winograd algorithm. I have 4 GPUs: 0 and 1 stand for the Ellesmeres from a Polaris Pro Duo, and 2 and 3 are the Fijis from a Fiji Pro Duo.

Thus 0-M23 is the M23 algorithm that uses Ellesmere 0 as its single accelerator. 0-1-2-3-M23 uses all 4 GPUs. M23 is not a recursive algorithm, in the sense that whenever possible it will call the clBLAS GEMM. There are a few caveats, but I will skip over them. WINOGRAD-OPT uses the clBLAS GEMM directly for problems smaller than 16K, and you can see that for single precision staying in the GPU has the best performance. Of course, this solution penalizes the Ellesmere, which could solve up to 30K/12K problems in memory.

M23 scales up, but not very nicely. It is true that we could solve larger problems with 0-1-2-3-M23, and other advantages will show up as the problem scales up. For small problems, we are better off using the clBLAS GEMM directly. For computations that do not fit in GPU memory, you can see that Fiji shoulders most of the computation. This is not a final comparison. The largest problem is 27Kx27Kx27K, thus every product is just a 9Kx9Kx9K problem. Much larger problems should be used for M23, where 45x45x45 should provide the peak performance.

But there will be a lot of work before closing this door. I would like you to pay attention to 2-WINOGRAD and 2-M23 in Single precision. For the problem size 21K, both algorithms perform a single recursion step and M23 is faster. For Single complex, WINOGRAD is faster (but M23 seems to be catching up). I would expect that, for practical purposes, 2x2x2 should have a window with better performance and then 3x3x3 should top it.

Double precision plots are so much prettier. The contribution of each GPU is clear. 0-1-2-3-M23 is always better (size>5000K). The performance plots show that the 2x2x2 algorithm has a different slope and there is a break-even point where 3x3x3 is faster (because it performs fewer operations). We can reach 1 TFLOPS in double precision, and we did use the Ripper for matrix additions. If we could do the additions during the product computations, we could squeeze out yet another 200-400 GFLOPS. For comparison, I could reach 800 GFLOPS in double precision on two Epyc processors on a single board (8000\$+).

Independently, the setup I have can be considered, for all purposes, the poor man's scientific rig. I know it is still expensive. On paper I have about 23+ TFLOPS in single precision and 250 (Ellesmere) + 350 (Fiji) + 120 (TR) ~ 800W of power consumption; 10 TFLOPS of practical performance accounting for perfect scaling and data movement; 8 TFLOPS if you really test it with representative tests; and really, really 4 TFLOPS. The real workhorse is a 2-year-old GPU (Fiji) running code written for Fiji (the Ellesmere is underutilized). This should show that

• we have software solutions,
• that real applications need a lot of data movements,
• magnificent peak performance is achieved only when a lot of conditions are met,
• double precision GEMM provide the cleanest environment and
• the cleanest scalability.

## Playing with Pro Duos

In this post, we present an idea for how to use multiple GPUs to solve large matrix multiplications under constraints.

I upgraded my system as follows: I added more memory for a total of 64GB and I added the Polaris Pro Duo. To avoid naming confusion, I have a 4-GPU system with two Fiji GPUs (Fiji Pro Duo) and two Ellesmere GPUs (Polaris Pro Duo). Remember, I sold my 295×2 Hawaii to fund the Ripper. The two cards are quite different: their performance is different and their memory requirements are different (their full comparison is tricky and it will be a different post).

For now, the main differences are memory and power consumption: Ellesmere has 16GB and Fiji 4GB (i.e., in single precision a Fiji can solve locally a problem of size 17Kx17Kx17K and an Ellesmere about 34K, possibly larger). I am using a single 1KW PSU because I will fire up one GPU at a time. It is like creating a peak power constraint: on paper the Ellesmeres draw 250W and the Fijis 350W, for a total of 600W. I should still be able to sustain the computation, but that is not the point (I have another 1KW PSU if I wanted). On paper, the overall system composed of 4 GPUs and the Threadripper should have 10+ TFLOPS (6 Fiji + 4 Ellesmere + 1 Ripper) of practical performance.

Now take a problem size N (e.g., 20000) and suppose we want to solve these simple matrix computations:

```
TIME;
C0 += A*B;
C1 += A*B;
C2 += A*B;
C3 += A*B;
TIME;
```

Assume we want to run the sequence in order, but we have the opportunity to annotate the source with the GPU that will run each operation and the algorithm for each operation. Using my latest code, I can write the operations above as

```
Matrix c[6];    // specify the matrices: c[0], c[1] are the operands, c[2..5] the results
MatrixComputation m[4] = {
  gpuGEMMS, gpuGEMMS, gpuGEMMS, gpuGEMMS
};  // specify the algorithm for each operation

printf("schedule 0 1 2 3 \n");
// read the GPU assigned to each result matrix
scanf("%d %d %d %d", &c[2].gpu, &c[3].gpu, &c[4].gpu, &c[5].gpu);

TIMING_ITER(
  {
    CMC(c[2], =, c[0], m[0], c[1]);
    CMC(c[3], =, c[0], m[1], c[1]);
    CMC(c[4], =, c[0], m[2], c[1]);
    CMC(c[5], =, c[0], m[3], c[1]);
  },
  // ... the listing is cut off here in the original post
```

Now that I have been using matrix operations on computational processors other than CPUs, the clBLAS interface requires a device identifier. To keep the matrix computation as seamless as a BLAS computation, the information about the computation unit has to be “hidden” somewhere, and I chose to introduce an entry “gpu” in the matrix definition. If it is used and it specifies a valid GPU, we will schedule the operation on the wanted GPU. Otherwise, we will choose a default solution.
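A minimal sketch of this rule follows. The struct and function names are hypothetical and only the fields needed for the illustration are shown; the real `Matrix` definition has more entries, and how the default device is chosen is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down matrix descriptor. */
typedef struct {
    int M, N;   /* rows and columns */
    int gpu;    /* requested device id; any invalid value means "no preference" */
} matrix_desc;

/* Return the device that should run an operation writing into `c`. */
int select_gpu(const matrix_desc *c, int gpu_count, int default_gpu) {
    assert(c != NULL && gpu_count > 0);
    assert(default_gpu >= 0 && default_gpu < gpu_count);
    if (c->gpu >= 0 && c->gpu < gpu_count)
        return c->gpu;       /* valid request: honor it */
    return default_gpu;      /* unset or invalid: fall back to the default */
}
```

Because the device travels with the result matrix, the call site of the operation does not change when the schedule changes.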

In the code example above,   I can choose to fire the operation on GPU 0, 1,2, and finally 3. Once I chose the schedule, I may choose also the algorithm: for example if the problem size does not fit the GPU memory, I can choose a Winograd variant.

```
{
  int i;
  // device memory in KB for GPUs 0-3: two Ellesmere (E), then two Fiji (F)
  int sizes[4] = { 16580608, 16580608, 3956940, 4148964 };

  for (i = 0; i < 4; i++) {
    int j = i + 2;   // c[2..5] are the result matrices
    // footprint of A, B, and C in KB (the labels print MB, the values are KB)
    size_t need = sizeof(float) * 3 * c[j].M * c[j].N / 1024;
    printf("GPU %d MEM %dMB size %zuMB \n", c[j].gpu, sizes[c[j].gpu], need);
    if (need > (size_t)sizes[c[j].gpu]) {
      printf("\t change algorithm GPU %d \n", c[j].gpu);
      m[i] = s_wm;   // Winograd algorithm with LEAF 16,000
    }
  }
}
```

For example, if N=40,000 and we run the code above with schedule 0, 1, 1, 0 (Ellesmere only):

```
N of GPUS 4
schedule 0 1 2 3
GPU 0 MEM 16580608MB size 18750000MB
change algorithm GPU 0
GPU 1 MEM 16580608MB size 18750000MB
change algorithm GPU 1
GPU 1 MEM 16580608MB size 18750000MB
change algorithm GPU 1
GPU 0 MEM 16580608MB size 18750000MB
change algorithm GPU 0
----------> get time 3.672235e+02 sec<------
average 367.223521
Time Cold 3.672235e+02
MUL OPS 5.120000e+14
INT OPS 5.120000e+14 GFLOPS COLD 1.394246e+03
----------> get time 7.363572e+02 sec<------
average 368.178595 times 2
Time HOT 3.681786e+02
MUL OPS 5.120000e+14
INT OPS 5.120000e+14 GFLOPS HOT 1.390629e+03
```

The execution ran in parallel while I was writing this post. As you can see, I decided to use only the Ellesmere GPUs (16GB) and I had to change the algorithm strategy to make sure I could work within the limited memory. The computation used one GPU at a time. To improve performance I could use a different schedule: 0, 2, 1, and 3. Winograd will require more memory resources and data copies, but Fiji is faster than Ellesmere; we keep the power cap and improve the overall performance by up to 20%.

```
schedule 0 2 1 3
GPU 0 MEM 16580608MB size 18750000MB
change algorithm GPU 0
GPU 2 MEM 3956940MB size 18750000MB
change algorithm GPU 2
GPU 1 MEM 16580608MB size 18750000MB
change algorithm GPU 1
GPU 3 MEM 4148964MB size 18750000MB
change algorithm GPU 3

----------> get time 3.451676e+02 sec<------
average 345.167597
Time Cold 3.451676e+02
MUL OPS 5.120000e+14
INT OPS 5.120000e+14 GFLOPS COLD 1.483337e+03
----------> get time 6.880446e+02 sec<------
average 344.022295 times 2
Time HOT 3.440223e+02
MUL OPS 5.120000e+14
INT OPS 5.120000e+14 GFLOPS HOT 1.488276e+03
```

As you can see, we can decide a schedule and a computation allocation without changing the basic formulation of the algorithm. We can just set the parameter of the result matrix and we are done. This choice makes the code portable because the resource allocation can be done at run time without further modifications. We can go a step further and imagine changing strategies, and thus algorithms, as a function of the destination GPU.

The next and natural step is to optimize the schedule, the resources, and the algorithm strategy as function of the run time. For this example, Peak Power  is the main constraint. If speed is the goal, the computation is 4 independent operations and they can be  executed in parallel on all 4 GPUs. A single thread can be associated to the matrix operation and a sequential and parallel strategy can be applied.

My personal goal is to create a code generator for the FastMMW algorithms that I have not implemented yet, and to use a framework like this to run them using all the resources at my disposal. Considering that no 3×3 Fast MM is available for consumption or for any testing, having a running computation will be a great first step.

## Playing with the ripper 1950x

Using Winograd algorithms, I can achieve 1 TFLOPS of equivalent performance in single precision out of a Threadripper 1950x. This is pretty good.

In the last two weeks, I have been building my personal rig based on the Threadripper 1950x processor. As before, I do not use a case nor a development platform.

The motherboard is the Zenith Extreme X399, 4x8GB (32GB) Trident Z RGB G.Skill, and water cooling Corsair H115i. The setup is out of the box; that is, no over clocking. The power supply is a 1000W P2 EV3A.  In this post, I will present the performance for Matrix multiplication without GPUs but it is in my schedule to redo the performance comparison because the memory and PCI communication are now faster.

I installed OpenBLAS and ATLAS once again for this architecture. I did not touch the configuration of the BLAS libraries. OpenBLAS will use the first 16 cores and ATLAS will use all of them. Currently, I am not able to drive the core utilization of the BLAS libraries and I take them as they are. I am kind of able to work on core assignment for the Epyc processor and for the ATLAS system, but that will be another post and another story.

Then, I probed the system for the break even point for different algorithms and precision.

Clearly, ATLAS GEMM is half the performance of OpenBLAS. We can see that 10K  and 15K is the break even point for single precision computation.

For the Ripper, the performance plots are consistent. Winograd algorithms can be applied for problem sizes larger than 10Kx10K. These are large problems. There is no denying the break-even points are getting larger, but this is the first system where we can see both a core speed up and a memory speed up. Also, this is the first time I can buy a processor that can reach 1 TFLOPS and keep it on the side of my desk. This is performance usually achieved by GPUs and FPGAs.

To conclude, I re-ran the tests on GPU 0 to check if the expensive DDR4 memory or anything else helps the PCI connection and thus the GPU GEMM performance. I cannot achieve the same performance as on the previous, cheaper system: there is about a 10% loss of peak performance (single precision).

Fiji GPU loves Single precision computation but it is dragging its feet for double precision (295×2 was much better in this respect). Clearly Fiji humiliates The-Ripper in Single Precision but they are equal for double precision.

Then I tried to check if I can saturate the system by running three independent single precision GEMMs at the same time:

```
OpenBLAS OPS 1.166400e+13 GFLOPS HOT 8.291898e+02
GPU0     OPS 1.166400e+13 GFLOPS HOT 2.977494e+03
GPU1     OPS 1.166400e+13 GFLOPS HOT 2.972505e+03
```

Overall the full performance is good and scalable, and the Ripper holds its ground. In double precision, if we organize our work accordingly, DGEMM can achieve 1.3 TFLOPS, which is yet another first for me.

Let me add to this post the performance of ATLAS when we force to use 16 Cores (instead of 32).

First the anomaly: ATLAS using 16 cores in double complex precision has worse performance. Otherwise, for double, single complex, and single precision, having fewer threads is better.

A note before I forget: there are cases where ATLAS and OpenBLAS performance decreases as the problem size increases. This is really uncommon and this is the first time I see such a drop. The reasons? I am guessing core throttling because of heat dissipation issues. In practice, the performance decreases when the motherboard displays the processor temperature or when I monitor the cores with lm-sensors.

## Fast Matrix Multiplication using AMD clBLAS

Now we can use the clBLAS GEMM subroutines within the FastMMW framework. It is something that I wanted to do for a long time; finally, I have got a chance.

The clBLAS+FastMMW addition to the project started because I wanted to play with Deep Learning software on my PC. I installed Nervana’s code and TensorFlow. Oh Boy! My system is ancient. Any training makes my CPU’s fan go nuts. Of course, I do not have any green NVIDIA GPUs to offload the work to, and I do not have a TPU either. I have an A10 APU that needs an upgrade. Training and testing take a night of spinning.

OpenCL is an up and coming option for DL software. But there is only one package, SINGA, that provides support for both CUDA and OpenCL; please take a look at the complete list at https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software. Unfortunately, I could not make a single one work. I upgraded my Ubuntu from 14 to 16, installed the new amd-pro driver, and I tried. Oh Boy, I tried, and tried.

Clearly, I do not understand it and my installation is a Frankenstein of packages. It is too messy to figure out what is so wrong that the TF tests cannot run on my red GPUs. Furthermore, it is not enough to send off matrix multiplication operations to a GPU: the code must be optimized. I am in that kind of business. The OpenCL interface is a good way to move data, but the original BLAS is an interface to compute-intensive routines, and clBLAS is the same. There are two reasons I would like to use OpenCL for DL:

1. Deep Learning is hot and I would like to learn (after all, my data science profession pays my bills).
2. GEMM is a workhorse in DL:
   1. FastMMW can be used for CPUs.
   2. OpenCL is one open-source way to unleash the power of GPUs, but so far FastMMW has not used it in its distribution.
      • FastMMW should be used at kernel level; here we use it at host level.

The GEMM in clBLAS is a set of operations that are tailored to the GPUs (like OpenBLAS or ATLAS for CPUs). So I went back to my experiments and my setting to figure out what I can do to use clBLAS, like I use BLAS.

I have found that I have two platforms: Clover and AMD Accelerated Parallel Processing. Some DL interfaces may have a way to specify the device (by a device number) using an environment variable. But I could not find a way to specify the platform. I realized that Deep Learning frameworks using OpenCL must overcome these trivial issues before they are accessible to the rest of us.

What I can do is give examples of how to wrap the clBLAS GEMMs so that they can be used like BLAS but with whatever GPU you have. I am not completely happy about what I did, but it is a solid step forward. In principle, you could use the code to run any GEMM (except float complex) and build a library using OpenBLAS and clBLAS (my experiments will use only one, but the libraries will have all).

Yesterday, I finished integrating the FastMMW code so that I can run sgemm, dgemm, and zgemm. For cgemm I will need help.

For example: NxNxN problem double complex on a Fiji GPU

```
N= 1000 GFLOPS COLD   5.4 GFLOPS HOT  45.4
N= 3000 GFLOPS COLD  59.1 GFLOPS HOT 106.6
N= 5000 GFLOPS COLD  98.8 GFLOPS HOT 115.7
N= 7000 GFLOPS COLD 113.6 GFLOPS HOT 119.3
N= 9000 GFLOPS COLD 118.2 GFLOPS HOT 120.7
N=11000 GFLOPS COLD  49.7 GFLOPS HOT  50.0
```

Larger problems do not fit the 4GB of memory and thus I could use Winograd algorithm to break the problem in subproblems smaller than 10Kx10K.

```
N = 13000 GFLOPS COLD 112.1 GFLOPS HOT 112.6
N = 15000 GFLOPS COLD 115.9 GFLOPS HOT 116.3
N = 17000 GFLOPS COLD 114.1 GFLOPS HOT 112.3
```
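The drop at N=11000 in the first table is what you would expect from a simple footprint estimate. The sketch below is an assumption about the accounting: it counts three square matrices (A, B, C) and nothing else, the function names are mine, and the 3956940 KB figure is the memory size I measured for one Fiji in a later scheduling experiment.

```c
#include <assert.h>

/* Footprint in KB of C = A*B with square n x n operands: three matrices. */
long long gemm_footprint_kb(long long n, int elem_bytes) {
    assert(n > 0 && elem_bytes > 0);
    return 3LL * n * n * elem_bytes / 1024;
}

/* Nonzero when the three matrices fit in `mem_kb` KB of device memory. */
int fits_on_gpu(long long n, int elem_bytes, long long mem_kb) {
    return gemm_footprint_kb(n, elem_bytes) <= mem_kb;
}
```

With 16-byte double complex elements, N=9000 fits in 3956940 KB and N=11000 does not; after one Winograd step, N=13000 becomes sub-problems of size 6500, which fit again.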

I use a Pro Duo (Fiji) card and use only one GPU at any time. This is in line with the peak performance of the system. We do not do any matrix preparation to make it easier to break the problem into subproblems.

This problem reduction works better for double complex but the idea will work for single precision (float) where we can reach Tera FLOPS (per GPU)

```
 1000 GFLOPS COLD   3.00 HOT  181.07
 3000 GFLOPS COLD  42.62 HOT 1116.74
 5000 GFLOPS COLD 267.32 HOT 2180.79
 7000 TFLOPS COLD   0.64 HOT    2.57
 9000 TFLOPS COLD   1.02 HOT    2.88
11000 TFLOPS COLD   1.48 HOT    3.25
13000 TFLOPS COLD   1.91 HOT    3.27
15000 TFLOPS COLD   2.31 HOT    3.40
17000 TFLOPS COLD   2.59 HOT    3.43
20000 TFLOPS COLD   0.25 HOT    0.26
```

The results above fit my previous experiments with dedicated code. In my previous post, I have already shown that Fast algorithms are not really applicable for sizes that fit the GPU memory (4GB if you use one GPU and 8GB if you use both). The performance plot does not reach a plateau where a change of algorithm is beneficial. In this post, we suggest that if the problem size is large enough and you are looking for a policy to break the problem down, Fast algorithms are competitive and already provide a “memory hierarchy oblivious” solution (a fancy name for optimal).

On a personal note, I could sell the 295×2 and the funds will go towards the new system (thread ripper) next month. At that time, I will be able to have better memory communication; the CPUs will be competitive and provide a better support for those Matrix Additions (i.e., necessary for fast matrix multiplications). The month of July will bring more opportunities to work on related projects. I should start a github as well.

I am considering upgrading my hardware and thus starting a new set of experiments.
I would like to hear suggestions and I will consider any help to raise the funds; that is money, but I am also willing to accept processors (Naples), motherboards, memory, and GPUs. I will help raise the funds by letting go of my Pro Duo and my 295×2 (this is even better than the Pro Duo in double precision) if necessary.

So if you have ideas or you want to help in practice: paolo@fastmmw.com

## PCI-Extension: thus a SGEMM 4 way ?

I received in the mail a PCI-Cable extension thus I could experiment a different configuration: it was a disaster.

I have an ASUS Crossblade Ranger, thus I could actually test a 3-way crossfire. For this experiment, I use an R9 Turk GPU in slot 0, the 295×2 in slot 1, and the ProDuo in slot 2. In practice, I can see 5 GPUs, but there is a “but”.

GPU_0 takes care of the monitors
GPU_1 and GPU2 are Fiji thus the ProDuo
GPU_3 and GPU4 are Hawaii thus the 295×2

(But) In this configuration, the ProDuo gets half the bandwidth. Obviously, this is wrong in so many ways that any smart person would not even try. Why waste energy?

I wanted to make the computation configurable in order to take advantage of even such a skewed system. The idea is simple: if a GPU is slower, I quantify it and I reduce the problem size it has to handle. If a Fiji is used in conjunction with a Hawaii, the Hawaii gets 2/3 of the problem and the Fiji 1/3. This will be useful in the future, but you can see the problem and the simple solution.
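A sketch of this proportional split is below. Two things are assumptions on my part: the work is divided by rows of the result matrix, and the relative speeds are given as small integer weights (2 for the Hawaii and 1 for the Fiji in the example above). The function name is hypothetical.

```c
#include <assert.h>

/* Split `rows` rows of the result among devices in proportion to their
 * relative throughput. The last device absorbs the rounding remainder. */
void split_rows(int rows, const int weight[], int devices, int share[]) {
    assert(rows >= 0 && devices > 0);
    int total = 0;
    for (int d = 0; d < devices; d++) {
        assert(weight[d] > 0);
        total += weight[d];
    }
    int assigned = 0;
    for (int d = 0; d < devices - 1; d++) {
        share[d] = (int)((long long)rows * weight[d] / total);
        assigned += share[d];
    }
    share[devices - 1] = rows - assigned;
}
```

With weights {2, 1} and 30000 rows, the first device gets 20000 rows and the second 10000.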

In practice, 4 is possible, 3 is better.

## Fiji ProDuo vs Hawaii 295×2: clBLAS preliminary performance.

What can you do with a few TeraFLOPS? I have got a beautiful (if a little noisy) Radeon 295×2. I plugged it into my second PCI port and I ran a few experiments using clBLAS. This was my last post (below). Of course, the main limitation of my system is not the GPU, it is everything else. The 295×2 has a peak performance of 11 TFLOPS; however, I know that the data have to come from somewhere and the results often cannot stay tucked in the GPU. In my tests, I assume the data come from the main memory and have to go back to it. I show that, of the 11 TFLOPS, I can achieve 3 TFLOPS using single precision. This is quite something: considering that the card consumes about 500W and I did not buy it to play games, I thought I had a little supercomputer that I can keep beside my monitor.

Recently I added the ProDuo card, based on the Fiji architecture, to my GPU arsenal. There are a few advantages (without considering the cost) versus the 295×2. The solution and packaging allow a smaller card without an external fan (quieter), similar water cooling, reduced power consumption, and more compute nodes (from 80 to 120). All of these goodies use the same 28nm technology, so this is pure reorganization thanks to the high bandwidth memory setup. In practice, the ProDuo is clearly a better card and it should replace the 295×2. Right?

Due to my limited PSU (1KW), because my setup limits any card in the first slot, and because clBLAS has a new setup, I decided to re-run the experiments using the 295×2 and the ProDuo in the second slot (OpenCL would say that the card will take the place of GPU1 and GPU2). These cards are designed for single precision computations: the ProDuo has a peak performance of 16 TFLOPS and the 295×2 has 11 TFLOPS (sorry, I tend to repeat myself). The new clBLAS provides better kernels and you can see that the 295×2 achieves 5 TFLOPS and the ProDuo about 6 TFLOPS. Good thing I spent some time redesigning the experiments and re-ran the tests once again. Once again, the bottleneck is my system and the way I feed the data, but you can see that using the two GPUs in the card allows a 2x speed up (compared to a single GPU).

A note: the plots above and the ones that follow have missing points because of a few quirks in OpenCL that we are working on. Second, the cards have no modifications; they come directly from the box to the bench.

The 295×2 is still an awesome card considering that the difference is 0.9 TFLOPS. On the other side, the ProDuo is 20% faster and 30% more energy efficient. I can actually plug 2 ProDuos into my system without any further change, but I cannot plug in the 295×2 and the ProDuo together. But what comes next is even more interesting.

Yep. In double precision the ProDuo stays behind. First, I am coming from general purpose CPUs and their benchmarking, so I expect a factor of two penalty going from single to double precision. Here, we can see that the factor is about 5. The Hawaii card can reach the 1 TFLOPS mark, which sounds good; Fiji has a 0.5 TFLOPS upper limit. So the 0.9 TFLOPS loss in single precision is a 0.4 TFLOPS gain in double precision. Indeed, life is the stage for a variety of compromises. In this case, I am not really sure if it is an architecture difference or a matter of kernel deployment. We will have to investigate, but it will require some help and effort.

But for me the most interesting part comes now: I bought an extra PSU and I can feed electricity to both cards on the same board.

In the next post, I will have a PCI extension so that I can put the big cards into the 2nd and 3rd slots. I will lose bandwidth but I should get the full potential from the Hawaii and Fiji GPUs. Currently the two Hawaii are a little constrained and I can get the performance of a single GPU. With the extension, I should be able to see 5 GPUs (the Turk in the first slot, two Hawaii, and two Fiji). The system allows a three-way crossfire.

Now we have a heterogeneous system; in practice we effectively have 3 GPUs. The current experiments do not balance the workload as a function of the GPUs' throughput, and thus the plots could be better, higher.

We can see that we earned yet another 1 TFLOPS in single precision. The good news is that, even in my system, the problem size and the computation time have such a ratio that more hardware provides better performance, and I can show it. Also, the introduction of more GPUs shows that the computation time becomes linear (the communication is the bottleneck). If I can unleash the fourth GPU, I will likely see little improvement. But for double precision the curves are a little different.

The three GPUS (Hawaii in position 1 and Fiji 2 and 3) provide a scalable solution but it is not always the best. The beauty of these plots is their complexity: considering the problem size and the configuration available, the best solution is not always straightforward.

The Future and motivations:
At this time, my research is two fold:

First, I am investigating the field of deep learning for the application of feature selection in advertising (yeah, my real job), and GPUs seem to be the hardware of choice, so I wanted to have a test bed close to my real bed. These new systems promise and deliver unprecedented performance.

Second, with the coming of age of AutoGemm in clBLAS, we start having a self-tuning BLAS for GPUs, and an open source one at that; this is an opportunity to re-evaluate kernels written using Strassen’s algorithm. In a system like mine, Strassen’s algorithms can be really appreciated only if they are applied at kernel level: the computation performance plot is too steep to take advantage of a divide and conquer approach driven from the CPU.