Using Winograd algorithms, I can achieve the equivalent of 1 TFLOP in single precision out of a Threadripper 1950X. This is pretty good.
The motherboard is the Zenith Extreme X399, with 4x8GB (32GB) of G.Skill Trident Z RGB memory and a Corsair H115i water cooler. The setup is out of the box; that is, no overclocking. The power supply is a 1000W EVGA P2. In this post, I will present the performance of matrix multiplication without GPUs, but it is on my schedule to redo the performance comparison, because memory and PCI communication are now faster.
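For readers unfamiliar with the algorithm family, one level of the Winograd variant of Strassen's recursion replaces 8 block multiplications with 7, at the cost of 15 block additions. The following is an illustrative NumPy sketch of that one level, not the tuned code used for the measurements in this post:

```python
import numpy as np

def winograd_2x2(A, B):
    """One recursion level of the Winograd variant of Strassen's
    algorithm: 7 block multiplications and 15 block additions,
    instead of the 8 multiplications of the classical method.
    Assumes square matrices with even dimensions."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # 8 additions on the operands
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

    # the 7 block products (recursed on in a full implementation)
    P1 = A11 @ B11; P2 = A12 @ B21; P3 = S4 @ B22; P4 = A22 @ T4
    P5 = S1 @ T1;   P6 = S2 @ T2;   P7 = S3 @ T3

    # 7 additions to assemble the result
    U1 = P1 + P6; U2 = U1 + P7; U3 = U1 + P5
    return np.block([[P1 + P2, U3 + P3],
                     [U2 - P4, U2 + P5]])
```

In practice the recursion is applied until the blocks are small enough to hand off to GEMM; the extra additions are exactly why the method only pays off for large matrices, which is the break-even effect measured below.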
I installed OpenBLAS and ATLAS once again for this architecture. I did not touch the configuration of the BLAS libraries: OpenBLAS will use the first 16 cores and ATLAS will use all 32. Currently, I am not able to control the core utilization of these BLAS libraries, and I take them as given. I am beginning to work on core assignment for the Epyc processor and for the ATLAS system, but that will be another post and another story.
Then, I probed the system for the break-even point of the different algorithms and precisions.
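The probing itself is simple: time a GEMM at increasing sizes and convert to a rate. A minimal sketch of such a harness, assuming the classical 2n^3 flop count (the function name and parameters are mine, for illustration):

```python
import time
import numpy as np

def gemm_gflops(n, dtype=np.float32, reps=3):
    """Time C = A @ B for n x n matrices and report the best rate in
    GFLOPS, counting the classical 2*n^3 floating-point operations."""
    A = np.random.rand(n, n).astype(dtype)
    B = np.random.rand(n, n).astype(dtype)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        A @ B                       # dispatched to the linked BLAS
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9
```

Sweeping n per precision for both the classical GEMM and the Winograd code gives two curves; the break-even point is where the Winograd curve crosses above the GEMM one.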
Clearly, ATLAS GEMM has half the performance of OpenBLAS. We can see that the break-even point for single-precision computation falls between 10K and 15K.
For the Ripper, the performance plots are consistent. Winograd algorithms can be applied for problem sizes larger than 10Kx10K. These are large problems. There is no denying that the break-even points are getting larger, but this is the first system where we can see both a core speed-up and a memory speed-up. It is also the first time I can buy a processor that reaches 1 TFLOP and keep it on the side of my desk. This is performance previously achieved by GPUs and FPGAs.
To conclude, I re-ran the GPU (0) to check whether the expensive DDR4 memory, or anything else, would help the PCI connection and thus the GPU GEMM performance. I cannot achieve the same performance as on the previous, cheaper system: there is about a 10% loss of peak performance (single precision).
The Fiji GPU loves single-precision computation, but it drags its feet in double precision (the 295X2 was much better in this respect). Clearly, Fiji humiliates the Ripper in single precision, but they are equals in double precision.
Then I tried to check whether I can saturate the system by running three independent single-precision GEMMs at the same time.
OpenBLAS: OPS 1.166400e+13, GFLOPS HOT 8.291898e+02
GPU0: OPS 1.166400e+13, GFLOPS HOT 2.977494e+03
GPU1: OPS 1.166400e+13, GFLOPS HOT 2.972505e+03
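My run used one CPU GEMM plus two GPU GEMMs; the same saturation idea can be sketched on the CPU alone with a few threads, since NumPy releases the GIL inside the BLAS call and the multiplications genuinely overlap. A minimal sketch (function and parameter names are mine):

```python
import threading
import time
import numpy as np

def run_concurrent_gemms(n=512, workers=3):
    """Launch several independent SGEMMs in parallel threads and
    return the aggregate rate in GFLOPS plus the results."""
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    results = [None] * workers

    def work(i):
        results[i] = A @ B          # BLAS runs outside the GIL

    t0 = time.perf_counter()
    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - t0
    # aggregate rate: 2*n^3 flops per worker over the wall-clock time
    return workers * 2.0 * n**3 / elapsed / 1e9, results
```

If the aggregate rate is close to the sum of the individual rates, the devices are not stealing bandwidth from each other, which is what the numbers above suggest.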
Overall, the full performance is good and scalable, and the Ripper holds its ground. In double precision, if we organize our work accordingly, DGEMM can achieve 1.3 TFLOPS, which is yet another first for me.
Let me add to this post the performance of ATLAS when we force it to use 16 cores (instead of 32).
First, the anomaly: ATLAS using 16 cores has worse performance in double-complex precision. Otherwise, for double, single-complex, and single precision, having fewer threads is better.
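For OpenBLAS, the thread count can be capped without rebuilding, via environment variables read when the library loads (ATLAS, as far as I know, typically fixes its thread count at build time, which is why forcing it to 16 cores is more involved). A sketch of the OpenBLAS route:

```python
import os

# These must be set before the BLAS library is loaded (i.e., before
# "import numpy"); afterwards the default thread count is already fixed.
os.environ["OPENBLAS_NUM_THREADS"] = "16"  # OpenBLAS-specific cap
os.environ["OMP_NUM_THREADS"] = "16"       # honored by OpenMP-based builds

import numpy as np  # noqa: E402  (deliberately imported after the env setup)
```

The same variables can of course be exported in the shell before launching the benchmark, which avoids the import-order subtlety entirely.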
A note before I forget: there are cases where ATLAS and OpenBLAS performance decreases as the problem size increases. This is really uncommon, and it is the first time I have seen such a drop. The reason? I am guessing core throttling due to heat-dissipation issues. In practice, the performance decreases when the motherboard reports the processor temperature climbing, or when I watch the cores with lm-sensors.
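To correlate the drops with temperature, one can poll the thermal zones Linux exposes under sysfs alongside the GEMM sweep. A Linux-specific sketch (the sysfs path is standard, but not every board exposes useful zones there):

```python
import glob

def read_cpu_temps():
    """Read the thermal zones under /sys/class/thermal.
    Values are reported in milli-degrees Celsius; returns an
    empty list on systems without that interface."""
    temps = []
    for path in sorted(glob.glob("/sys/class/thermal/thermal_zone*/temp")):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            pass  # zone unreadable on this system; skip it
    return temps
```

A GFLOPS drop that coincides with a temperature spike in this log is the throttling signature I suspect above.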