To compare the two cards, we measure the performance of each single GPU and of different ways to fire up the computation, using a sequence of four operations. This is a re-edit of the post, with better and more consistent performance plots.
TIME; C0 += A*B; C1 += A*B; C2 += A*B; C3 += A*B; TIME;
GPUs 0 and 1 are Ellesmere; 2 and 3 are Fiji. We show performance plots for the firing sequences 0-0-0-0, 1-1-1-1, and 0-1-0-1 for the blue Ellesmere, and 2-2-2-2, 3-3-3-3, and 2-3-2-3 for the red Fiji. No data is reused in this experiment, so all performance numbers account for memory movement, although in principle A and B would not need to go back and forth. We also show a parallel computation, 0-1-2-3.
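The timed sequence above can be sketched as follows. Since clBLAS needs the actual hardware, this minimal Python sketch uses a NumPy matmul as a stand-in for the per-device GEMM; the `run_sequence` and `cpu_gemm` names are mine, not from the benchmark code. A firing sequence such as 0-1-0-1 would simply be a list of four device handles, alternating between the two GPUs.

```python
import time
import numpy as np

def run_sequence(gemms, n):
    """Time the four back-to-back GEMMs of the benchmark:
    TIME; C0 += A*B; C1 += A*B; C2 += A*B; C3 += A*B; TIME.
    `gemms` is a list of four callables, one per firing slot
    (e.g. four handles to the same GPU for 0-0-0-0, or
    alternating handles for 0-1-0-1)."""
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    Cs = [np.zeros((n, n), dtype=np.float32) for _ in range(4)]

    t0 = time.perf_counter()            # TIME;
    for gemm, C in zip(gemms, Cs):
        gemm(A, B, C)                   # C += A*B on the chosen device
    t1 = time.perf_counter()            # TIME;

    flops = 4 * 2.0 * n**3              # four GEMMs, 2*n^3 flops each
    return flops / (t1 - t0) / 1e9      # GFLOPS

# CPU stand-in for a device GEMM; with clBLAS each slot would enqueue
# an SGEMM on that GPU's own command queue instead.
def cpu_gemm(A, B, C):
    C += A @ B

print(f"{run_sequence([cpu_gemm] * 4, 512):.1f} GFLOPS")
```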
The code is the clBLAS kernel generated for the Fiji architecture. No new code was generated for the Ellesmere; I hope this slack will be addressed in the future, but clBLAS is a “closed” project and I could not “autoGEMM” it. clBLAS for AMD GPUs is a fundamental stepping stone for so many applications; I hope we are going to see a Lazarus. Matrix sizes are in thousands (Mille or Kilo).
In single precision, we see that having 16GB per GPU pays off for larger problem sizes, but not as much as I thought: the Ellesmere wins only in a small window between 20 and 25K; for single complex, the window is smaller, 15-20K, and the effect even weaker. For larger or smaller problems, Fiji is superior. In single precision, a 16Kx16Kx16K problem fits in Fiji's 4GB memory; for larger problems we use a Winograd algorithm (we could use a faster one), and we can see that we cope well. For problems larger than 30K, we need more than 16GB, so the Ellesmere loses its edge as well.
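The idea behind the out-of-memory fallback is that one level of a Strassen-Winograd-style split turns one big GEMM into seven quarter-size products, each of which fits on the card. A minimal NumPy sketch, assuming square matrices of even size and using the classic Strassen formulas for clarity (the `gemm` callback stands in for the on-device multiply):

```python
import numpy as np

def strassen_one_level(A, B, gemm):
    """One recursion level of Strassen: 7 half-size products
    instead of 8. Each call to `gemm` works on quarter-size
    operands, so a problem too big for GPU memory can still be
    computed on-device piece by piece."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    M1 = gemm(A11 + A22, B11 + B22)
    M2 = gemm(A21 + A22, B11)
    M3 = gemm(A11, B12 - B22)
    M4 = gemm(A22, B21 - B11)
    M5 = gemm(A11 + A12, B22)
    M6 = gemm(A21 - A11, B11 + B12)
    M7 = gemm(A12 - A22, B21 + B22)

    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```

Winograd's variant rearranges the additions to use fewer of them, but the memory-footprint argument is the same.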
We can see that in single precision the peak performance of the Fiji system is 6 TFLOPS, and of the Ellesmere 4 TFLOPS: a practical gap of 20-40% depending on problem size. All considered, this is pretty good.
Let us check the double-precision computation.
Now the Fiji is always better, even when we need software help. The gap to peak performance is negligible: the theoretical peak is 2×400 + 2×300 GFLOPS ≈ 1.4 TFLOPS, and we achieve 1.25 TFLOPS, about 11% less.
Using frequency and number of compute units: Fiji has 64 units at 1GHz while Ellesmere has 36 units at 1.23GHz, so on paper Fiji should be about 44% faster. If we consider Watts, the two GPUs should have roughly the same ops/Joule, which means the running cost is the same.
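The 44% figure is simple arithmetic, assuming throughput scales with compute units times clock:

```python
# Back-of-the-envelope check of the "44% faster on paper" claim,
# assuming throughput ~ compute_units * clock (GHz).
fiji_units, fiji_ghz = 64, 1.00
ellesmere_units, ellesmere_ghz = 36, 1.23

ratio = (fiji_units * fiji_ghz) / (ellesmere_units * ellesmere_ghz)
print(f"Fiji / Ellesmere paper ratio: {ratio:.2f}")  # ~1.44
```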
So if you have the time to wait for the result, the blue Ellesmere is a good choice and, for certain problem sizes, it is even faster (the larger memory lets you do more on the card). In single precision it is more efficient for a range of problem sizes (same speed, less power), and in general it copes performance-wise with Fiji just fine. Alas, I cannot measure the power consumption, so I assume the GPUs draw their peak Watts. The blue card is quiet and efficient; it is a long card, and its smaller current draw helps your budget. The air cooling solution penalizes one of the two GPUs, and the card gets quite hot at the exhaust end, especially during double-precision computations.
If we plug in an extra PSU and unleash all four GPUs at the same time, we achieve an empirical peak performance of 7-8 TFLOPS in single precision and 350 GFLOPS in double-complex precision. Considering that the achievable peak should be about 10 TFLOPS (6 + 4), we are short-changed by about 2 TFLOPS. At this time, I will not linger on the subject. Double-precision performance is consistent and additive: the double-precision plot shows a peak at 1.2 TFLOPS (0.8 Fiji + 0.4 Ellesmere).
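The parallel 0-1-2-3 firing can be sketched with one worker per GPU. Here threads and a NumPy matmul replace the per-GPU command queues, and `gemm_on_device` is a hypothetical stand-in name; with clBLAS, each worker would enqueue its GEMM on its own GPU's command queue.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# One GEMM per "device", fired concurrently (NumPy's matmul releases
# the GIL, so even these CPU stand-ins overlap in time).
def gemm_on_device(device_id, A, B):
    # hypothetical stand-in: a real version would enqueue the GEMM
    # on the command queue bound to GPU `device_id`
    return A @ B

n = 256
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(gemm_on_device, d, A, B) for d in (0, 1, 2, 3)]
    results = [f.result() for f in futures]
```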
This is 2017, and I can match the performance achieved in 2004 using a G4 with Apple's AltiVec Velocity Engine (which Google search consistently likes better).
I hope I will see clBLAS for the newer generations of GPUs (ROCm has only single precision, that is, it does not have a full BLAS). I also hope to see a way to communicate across the GPU memories, with that capability exposed at the OpenCL level. These two wishes would make these dual-GPU cards so much more appealing. In the future, I will consider other OpenCL implementations of BLAS. Better codes can do miracles.