The last in the family: the 8x8x8 algorithm with 343 products (so we have the 2x2x2 M7, the 3x3x3 M23, the 4x4x4 M49, the 5x5x5 M99, and the 8x8x8 M343).
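As a quick back-of-the-envelope check (my own sketch, not part of the experiments), a bxbxb scheme with P products has asymptotic complexity O(N^log_b(P)), which puts the whole family on just two complexity curves:

```python
import math

# The family: base of the split -> number of products.
family = {2: 7, 3: 23, 4: 49, 5: 99, 8: 343}

for base, p in family.items():
    # A b x b x b scheme with P products runs in O(N^log_b(P)).
    print(f"M{p}: {base}x{base}x{base} split, O(N^{math.log(p, base):.3f})")
```

Since 49 = 7^2 and 343 = 7^3, M49 and M343 share M7's exponent log_2 7 ≈ 2.807, while M23 and M99 sit slightly higher at about 2.854.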

The algorithms are not fully correct: the matrix additions are sound only if the problem size is divisible by the base (2, 3, 4, 5, or 8); otherwise we are computing O(N/base) extra operations, which is negligible with respect to O(N^3). With all these parallel products available, I simply plugged in everything I have: 4 GPUs.
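To make the padding overhead concrete, here is a small sketch (the helper names are mine, not the actual implementation): pad N up to the next multiple of the base and compare the extra work against the 2N^3 operation count:

```python
def padded_size(n: int, base: int) -> int:
    """Smallest multiple of `base` that is >= n."""
    return -(-n // base) * base  # ceiling division

def relative_overhead(n: int, base: int) -> float:
    """Extra work from padding, relative to the 2*n^3 operation count."""
    m = padded_size(n, base)
    return (2 * m**3 - 2 * n**3) / (2 * n**3)

# Padding adds at most base-1 rows/columns, so the overhead is O(1/n):
print(padded_size(1000, 3))        # → 1002
print(relative_overhead(1000, 3))  # about 0.6%
```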

All the algorithms M7, M23, M49, M99, and M343 apply one level of recursion; that is, we want to apply the M algorithm once. However, Fiji has less memory, so we apply a hybrid approach: if the leaf computation reaches an Ellesmere we call clBLAS, since the problem just fits; otherwise, we apply a recursive Winograd M7 algorithm with leaf sizes 16000 (single), 12000 (double and single complex), and 9000 (double complex). In previous posts, I showed that for problems larger than these leaf sizes, Winograd (RM7) on a Fiji is faster than or comparable to clBLAS on an Ellesmere.
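The dispatch can be sketched as follows (a simplified model of the strategy; the function and device names are illustrative placeholders, not the actual code):

```python
# Leaf sizes per precision, as stated above.
LEAF = {"single": 16000, "double": 12000, "scomplex": 12000, "dcomplex": 9000}

def hybrid_gemm(n: int, precision: str, device: str):
    """Return a (nested) description of which kernel handles a leaf of size n."""
    if device == "ellesmere":
        return "clBLAS"          # the problem just fits in Ellesmere memory
    if n <= LEAF[precision]:
        return "clBLAS"          # small enough for a single Fiji kernel
    # Otherwise recurse: Winograd's scheme spawns 7 half-size products.
    return ["winograd"] + [hybrid_gemm((n + 1) // 2, precision, device)
                           for _ in range(7)]
```

For example, a 20000 single-precision leaf on a Fiji recurses once into 7 clBLAS calls of size 10000, while the same leaf on an Ellesmere goes straight to clBLAS.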

This is an example of experiments where the strategy is really hybrid. This comes with a disclaimer: the comparison makes sense only if the problem does not fit in a single GPU; otherwise, clBLAS on one GPU is by far superior. So the goal is to show the performance comparison for the M algorithms and figure out, in practice, when one is superior. A reminder: GFLOPS = operations/time, and the operations are always counted as 2N^3 (complex operations involve 2x more work, but we still count them as 2N^3).
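Written out explicitly, the GFLOPS convention used in all the plots is this trivial helper (my own, for clarity):

```python
def gflops(n: int, seconds: float) -> float:
    """Effective GFLOPS: always 2*N^3 operations, regardless of type."""
    return 2.0 * n**3 / seconds / 1e9

# e.g. a 16000x16000 product finishing in 10 s:
print(gflops(16000, 10.0))  # → 819.2
```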

M343 never gets its opportunity, and M99 gives its best here, but these algorithms never lead the pack. M49 is M7 applied twice: for double complex, the M7 and M49 curves cross a little earlier than 18K (the leaf is 9K), which is the ideal, theoretical crossover point.
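Why 18K is the theoretical crossover can be seen with a rough model (my own sketch): one level of M7 pays off only once its half-size subproblems are themselves above the BLAS leaf size.

```python
LEAF_DCOMPLEX = 9000  # double-complex leaf size from the setup above

def levels_of_m7(n: int, leaf: int = LEAF_DCOMPLEX) -> int:
    """How many M7 recursion levels fire before hitting the leaf."""
    k = 0
    while n > leaf:
        n = (n + 1) // 2  # each level halves the problem
        k += 1
    return k

print(levels_of_m7(18000))  # → 1: one level, leaves exactly at 9K
print(levels_of_m7(18001))  # → 2: the second level kicks in past 2*leaf
```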

These experiments show that the 2x2x2 M7, the 3x3x3 M23, and the 4x4x4 M49 cover a continuous performance space where each makes a contribution.

With only Ellesmere:

With fewer resources and thus less parallelism, M99 gets the bronze. I will append the plot for Fiji only, to see whether the efficiency of the GPU makes any difference. Will I provide plots for a single GPU? Most likely not.

With the Ellesmeres, we have slower kernels but larger problems done in the GPUs. With the Fijis, we have faster kernels. What is the overall effect?

M99 does not lead the pack, but it has a smooth and predictable delivery for single precision. In the single-complex performance plot we can clearly see the second level of recursion in M7 at 24K. The knee is evident in the following plots as well.

We conclude this post by providing the data in CSV format and a comparison of the M algorithms by type (Single, Double, SComplex, and DComplex) and number of GPUs. The idea is to show how the different parallelism and GPUs affect the performance of each algorithm. The plots above should already have given a comparison among algorithms at a fixed type and parallelism.

In this link, you can find all the performance data and the plots by type and algorithm. Enjoy. It is a little messy because the naming is not completely clear. However, most of the performance plots are strangely consistent despite the asymmetric nature of my “poor scientist” rig.

What is left for me is to fix the matrix addition; then I may do a numerical comparison, wrap this up, write a paper, and publish it.