**Fast algorithms ** for matrix multiplication — i.e., algorithms that compute less than * O(N^3) * operations— are becoming attractive for two simple reasons:

- Todays software libraries are reaching the core peak performance (i.e., 90% of peak performance) and thus reaching the limits of current systems. Fast algorithms deploy new algorithmic strategies, new opportunities, thus reaching new performance and, in practice, stretching the system capabilities.
- Fast algorithms reduce the number of operations and thus data movements. This means that for very large problems they (may) reduce off-chip communication (dominant factor in multicore systems).

### State of the art

In the era of parallel processors on a chip, we will be able to solve larger and larger problems faster than we could do before. In practice, we will have a small parallel system (i.e., a single 50-100 GFLOPS node), and we must have more and faster memory (i.e., GBytes and GBytes/sec) to make these systems scalable and, if not, possible at all.

These systems will be complex to build and understand. A developer or a compiler will have hard time in generating good codes.

We foresee that experimentation will be a fundamental part of the next generation of libraries and code generators (and softare tools in general). The choice of the algorithm and its implementations will be a function of the system (i.e., Athlon vs. Pentium 4), problem size (i.e., Matrix multiplication of size N=10,000) and problem type (i.e., double or single precision).

### Our Contribution

We propose the hybrid implementation of fast algorithms with the classic algorithms. These are divide-and-conquer algorithms that are simple to formulate and to understand, and easy to deploy. They naturally exploit parallelism and data locality.

The simplicity of the code leads to easy maintanance and portatibility, as well as to a simple performance model, where few terms must be tuned for a (specific) systems. In fact, these hybrid fast algorithms split explicitly the computation into parts (i.e., kernel) that are compute bound and others that are memory bound. By a carefull but straigtforward Experimentation + modeling, we can develop high performance fast codes with no hassle.