Fast Matrix Multiplication using AMD clBLAS

Now we can use clBLAS GEMM subroutines within the FastMMW framework, it is something  that I wanted to do for a long time; finally, I have got a chance.

The clBLAS+FastMMW addition  to the project started because I wanted to play with Deep Learning software on my PC. I installed Nervana’s code and Tensor Flow. Oh Boy! My system is ancient. Any training makes my CPU’s fan going nuts. Of course, I do not have any green  NVIDIA GPUs to off load the work, I do not have a TPU either. I have a A10 APU that needs upgrade. The training and testing does take a night of spinning.

OpenCL is an up and coming option for DL software. But there is only one, SINGA, that  provides support for CUDA and OpenCL please take a look at the complete list Unfortunately, I could make work not a single one. I upgraded my Ubuntu from 14 to 16, installed the new amd-pro driver and I tried. Oh Boy, I tried, and tried.

Clearly, I do not understand it and my installation is a Frankenstein’s of packages. Too messy to figure out what is so wrong that TF tests cannot run on my red GPUs. Furthermore, it is not enough to send off  matrix multiplications operations to a GPU:  the code must be optimized. I am  in that kind of business.  OpenCL interface is a good  way to move data but the original BLAS is about an interface to intensive computing routines, clBLAS is the same. There are two reasons I would like to use OpenCL for DL:

  1. Deep Learning is hot and I would like to learn (i.e., after all, My data science profession pays my bills).
  2. GEMM is a work horse in DL,
    1. FastMMW can be  used for CPUs
    2. OpenCL is one open source to unleash GPUs power, but so far FastMMW did not use in its distribution.
      • FastMMW should be used at kernel level, here we use it at Host level.

The GEMM in clBLAS is a set of operations that are tailored to the GPUs (like OpenBLAS or ATLAS for CPUs). So I went back to my experiments and my setting to figure out what I can do to use clBLAS, like I use BLAS.

I have found that I have two platforms: Clover and AMD Accelerated Parallel Processing. Some DL interfaces may have a way to specify the device (by a device number) using a environment variable. But I could not find a way to specify the platform. I realized that  Deep Learning frameworks  using OpenCL must overcome these trivial issues before they are accessible to the rest of us.

What I can do, it  is to give examples how to wrap clBLAS GEMMs so that they can be used similarly as BLAS is used but with whatever GPU you have. I am not completely happy about what I did but it is a solid step forward. In principle, you could use the code to run any GEMM (but float complex) and build a library using OpenBLAS and clBLAS (my experiments will use only one but the libraries will have all).

Yesterday, I finished to integrate the FastMMW code so that I can run sgemm, dgemm and zgemm. For the  cgemm I will need help.

For example: NxNxN problem double complex on a Fiji GPU

N= 1000 GFLOPS COLD     5.4 GFLOPS HOT   45.4
N= 3000 GFLOPS COLD   59.1 GFLOPS HOT 106.6
N= 5000 GFLOPS COLD   98.8 GFLOPS HOT 115.7
N= 7000 GFLOPS COLD 113.6 GFLOPS HOT 119.3
N= 9000 GFLOPS COLD 118.2 GFLOPS HOT 120.7
N=11000 GFLOPS COLD 49.7  GFLOPS HOT 50.0

Larger problems do not fit the 4GB of memory and thus I could use Winograd algorithm to break the problem in subproblems smaller than 10Kx10K.

N = 13000 GFLOPS COLD 112.1 GFLOPS HOT 112.6
N = 15000 GFLOPS COLD 115.9 GFLOPS HOT 116.3
N = 17000 GFLOPS COLD 114.1 GFLOPS HOT 112.3

I use a Pro Duo (Fiji) card and use only one at anytime. This is in line with the peak performance of the system. We do not do any matrix preparation so to make it easier to break the problem into subproblems.

This problem reduction works better for double complex but the idea will work for single precision (float) where we can reach Tera FLOPS (per GPU)

  1000 GFLOPS COLD 3.00     HOT 181.07
  3000 GFLOPS COLD 42.62   HOT 1116.74
  5000 GFLOPS COLD 267.32 HOT 2180.79
  7000 TFLOPS COLD 0.64     HOT 2.57
  9000 TFLOPS COLD 1.02     HOT 2.88
11000 TFLOPS COLD 1.48     HOT 3.25
13000 TFLOPS COLD 1.91     HOT 3.27
15000 TFLOPS COLD 2.31     HOT 3.40
17000 TFLOPS COLD 2.59     HOT 3.43
20000 TFLOPS COLD 0.25     HOT 0.26

The results above are a fit with respect my previous experiments with dedicated code. In my previous post, I have already shown that Fast algorithms are not really applicable for sizes that fit the GPU memory (4GB is you use oneGPU and 8GB if you use both). The performance plot does not reach a plateau where a change of algorithm is beneficial. In this post,  we suggest that if the problem size is larger enough and you are thinking to find a policy to break down the problem down, well Fast Algorithms are competitive and already use a “memory hierarchy oblivious” solution (fancy name for optimal).

On a personal note, I could sell the 295×2 and the funds will go towards the new system (thread ripper) next month. At that time, I will be able to have better memory communication; the CPUs will be competitive and provide a better support for those Matrix Additions (i.e., necessary for fast matrix multiplications). The month of July will bring more opportunities to work on related projects. I should start a github as well.