Parallel (Winograd) Fast MM

The following is taken from the abstract of my latest paper, to be published in ACM Transactions on Mathematical Software:

We have found a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen and Winograd fast matrix multiplies, or their combination with the 3M algorithm for complex matrices (hybrid meaning: a recursive algorithm, as in Strassen's, until a highly tuned BLAS matrix multiplication offers a performance advantage).
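To make the 3M idea concrete, here is a minimal sketch (not the paper's implementation; all helper names are illustrative): a complex product (A + iB)(C + iD) normally costs four real matrix multiplications, but three suffice.

```python
# Sketch of the 3M algorithm: (A + iB)(C + iD) with 3 real products.
# Plain-Python list-of-lists matrices; a real implementation would call
# a tuned real-valued GEMM for each of the three products.

def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_mul(X, Y):
    return [[sum(x * y for x, y in zip(rx, cy)) for cy in zip(*Y)]
            for rx in X]

def three_m(A, B, C, D):
    """Return (real, imag) parts of (A + iB) @ (C + iD)."""
    T1 = mat_mul(A, C)
    T2 = mat_mul(B, D)
    T3 = mat_mul(mat_add(A, B), mat_add(C, D))
    real = mat_sub(T1, T2)               # AC - BD
    imag = mat_sub(mat_sub(T3, T1), T2)  # AD + BC
    return real, imag
```

The saved fourth multiplication is what makes 3M attractive once the matrices are large enough that additions are cheap relative to multiplications.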

We have found that modern symmetric multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful and natural exploitation of parallelism at the function level, such as function-call parallelism, function percolation, and function software pipelining. We make three contributions. First, we investigate the performance of double- and double-complex-precision (published) and single- and single-complex-precision (private) matrices on state-of-the-art SMP systems. Second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new schedules of Winograd's matrix multiplication, achieving up to 20% speedup with respect to regular matrix multiplication; of the two Winograd schedules, one is designed to minimize the number of matrix additions and the other to minimize the computation latency of the matrix additions. Third, we apply software pipelining and thread allocation to all the algorithms and show that this yields up to a further 10% performance improvement.
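For readers unfamiliar with Winograd's variant of Strassen's algorithm, here is one standard 7-multiplication schedule, sketched on scalars (this is the textbook recurrence, not necessarily either of the paper's two schedules; in practice each operand is a submatrix, each `*` is a recursive call or a tuned GEMM, and the choice of which additions to form first is exactly what the two schedules differ on):

```python
# One standard Winograd schedule for the 2x2 block product C = A @ B:
# 7 block multiplications and 15 block additions/subtractions
# (versus 8 multiplications for the classical algorithm).

def winograd_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    # Pre-additions on A and B blocks.
    s1 = a21 + a22
    s2 = s1 - a11
    s3 = a11 - a21
    s4 = a12 - s2
    t1 = b12 - b11
    t2 = b22 - t1
    t3 = b22 - b12
    t4 = t2 - b21
    # The 7 products (recursive calls on matrix blocks).
    m1 = a11 * b11
    m2 = a12 * b21
    m3 = s4 * b22
    m4 = a22 * t4
    m5 = s1 * t1
    m6 = s2 * t2
    m7 = s3 * t3
    # Post-additions assembling the result blocks.
    u2 = m1 + m6
    u3 = u2 + m7
    c11 = m1 + m2
    c12 = u2 + m5 + m3
    c21 = u3 - m4
    c22 = u3 + m5
    return c11, c12, c21, c22
```

Because the seven products are independent of each other, the dependence structure of the pre- and post-additions determines both the addition count and the latency, which is the trade-off the two schedules above target.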

In this work, and for one architecture (Nehalem), we achieve the ideal performance speedup using Winograd/Strassen-like algorithms. This means that for that architecture we have the fastest implementation of Winograd's algorithm.
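The function-call parallelism mentioned in the abstract can be illustrated by dispatching independent products to a thread pool (a toy sketch only: the paper works with compiled BLAS calls and explicit thread allocation, not Python threads, and the helper name here is hypothetical):

```python
# Illustration of function-call parallelism: the seven Winograd
# products are mutually independent, so they may execute concurrently.
# Here each "product" is a scalar multiply standing in for a GEMM call.
from concurrent.futures import ThreadPoolExecutor
import operator

def parallel_products(operand_pairs, workers=7):
    """Compute independent (left, right) products concurrently,
    preserving input order in the returned list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: operator.mul(*p), operand_pairs))
```

The same structure extends to software pipelining: while one thread finishes the post-additions of one recursion level, others can already start the products of the next.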