Self-Installing Package: Fast Matrix Multiplication

In the last two weeks, I have been working on practical ways to improve the distribution of the FastMMW library: to make my code less of a research project and more of a software package that anyone can download, build, install and use.

I am almost done testing an installation process, which is written in Python. I chose Python because I am lazy and it is simple. I also just finished writing a Python package so that anyone can use FastMMW, together with a simple example and pictures (see the previous post).

I tested the installation process on my A8 machine, a 4-core, single-processor system running Ubuntu 10.04. I am now running the installation on one of my multiprocessor systems, with two dual-core processors (Opteron), running Fedora 9, just to exercise the installation and the process on a completely different system: boy, is that always needed.

Per se, the installation process is not really interesting. The library comes with an INSTALL.txt file where I present the process step by step, and I wanted the process to be a little automatic. For some architectures, multi-core multiprocessor systems where the cores are hardware threads, I recommend a little imagination and testing: do it manually! It is very instructive; you will get a feeling for how the system distributes the computation by using the command “mpstat -P ALL”.

But let me not digress; back to the installation process.

The idea is to create four libraries, one for each basic type: float, double, complex float and complex double. Each library is built independently of the others, which allows different optimizations for each type.
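At this level the installer is essentially a loop over the four types. Here is a minimal sketch of that idea; the target names are hypothetical placeholders, not the actual Makefile targets shipped with FastMMW:

```python
import subprocess

# Hypothetical target names, one per basic type; the real names are defined
# in the FastMMW Makefile. This only shows the shape of the per-type build.
targets = {
    "float":          "libfastmmw_s",
    "double":         "libfastmmw_d",
    "complex float":  "libfastmmw_c",
    "complex double": "libfastmmw_z",
}

for basic_type, target in targets.items():
    print("building library for", basic_type)
    # each type is built (and later tuned) independently of the others
    subprocess.run(["make", target], check=True)
```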

Matrix addition optimization

Matrix addition (MA) is a two-dimensional loop code: the inner loop follows the inner dimension (i.e., columns for Fortran, Python and R matrices, rows for C matrices). The computation has spatial locality and it is trivial to parallelize (over the outer loop). The first step is to measure the performance with different numbers of threads and different unrollings of the inner loop (exploiting register allocation).
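To make the structure concrete, here is a small numpy sketch of an outer-loop-parallel matrix addition (this is not the C kernel in the library, only an illustration): the outer dimension is split into blocks, one per thread, and each thread streams through contiguous memory, which is where the spatial locality comes from.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_madd(A, B, C, num_threads=4):
    """C = A + B, parallelized over blocks of the outer (row) dimension."""
    n = A.shape[0]
    bounds = np.linspace(0, n, num_threads + 1, dtype=int)

    def add_block(lo, hi):
        # contiguous rows of C-ordered arrays: the inner dimension is streamed,
        # which is the spatial locality mentioned above; numpy releases the GIL here
        np.add(A[lo:hi], B[lo:hi], out=C[lo:hi])

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(add_block, lo, hi)
                   for lo, hi in zip(bounds[:-1], bounds[1:])]
        for f in futures:
            f.result()

if __name__ == "__main__":
    A = np.random.rand(2048, 2048)
    B = np.random.rand(2048, 2048)
    C = np.empty_like(A)
    parallel_madd(A, B, C)
```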

I chose the unrolling factors U = [4, 8, 16, 32, 64], tested in order. With everything left alone but the unrolling, I expect performance to increase until register spilling reduces it. For example, if I measure a drop in performance between 16 and 32, I will not test 64.
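The early-stopping logic of that search looks like the sketch below, where time_madd(unroll) is a hypothetical helper that rebuilds and times the MA kernel for a given unrolling (the real installer drives the compiled C benchmark):

```python
def best_unrolling(time_madd, candidates=(4, 8, 16, 32, 64)):
    """Try unrolling factors in order; stop as soon as performance drops,
    i.e., as soon as register spilling starts to hurt."""
    best_u, best_time = None, float("inf")
    for u in candidates:
        t = time_madd(u)        # hypothetical: build and time the MA kernel
        if t >= best_time:      # performance dropped: no point testing larger factors
            break
        best_u, best_time = u, t
    return best_u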

I also choose a thread allocation. For example, if I have 4 cores [0,1,2,3], I will test the following configurations: L = {[0], [0,1], [0,1,2], [0,1,2,3]}.
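The configurations are simply the prefixes of the core list. One way to test them by hand on Linux (a sketch, not necessarily what the installer does internally) is to pin the process to each subset with os.sched_setaffinity and time MA for that subset:

```python
import os

cores = [0, 1, 2, 3]                      # e.g., the four cores of the A8 machine
configs = [cores[:i] for i in range(1, len(cores) + 1)]
# configs == [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]

for L in configs:
    os.sched_setaffinity(0, set(L))       # restrict this process to the cores in L
    # ... run and time the MA benchmark here, keep the fastest configuration ...
```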

Once I have found the best configuration L and U, I estimate the break-even point where the fast algorithms are as fast as Goto’s or Whaley’s GEMM. Let us call this estimate B.

Break-even evaluation

Starting from the first estimate B, I run a linear search for the problem size where Winograd’s algorithm is as fast as the best GEMM. Once I have found the break-even point, I can create the library. I also create binaries with which the user can test the real performance.
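The search itself is a simple walk up or down in problem size, sketched below with hypothetical time_winograd(n) and time_gemm(n) helpers that time the two routines for size n (in practice the installation drives the compiled benchmarks):

```python
def find_break_even(time_winograd, time_gemm, B, step=64, n_max=8192):
    """Starting from the first estimate B, walk the problem size up or down
    until Winograd's algorithm is at least as fast as the best GEMM."""
    n = B
    if time_winograd(n) <= time_gemm(n):
        # already ahead: walk down while Winograd keeps winning
        while n - step > 0 and time_winograd(n - step) <= time_gemm(n - step):
            n -= step
    else:
        # behind: walk up until Winograd catches up
        while n < n_max and time_winograd(n) > time_gemm(n):
            n += step
    return n
```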

A note of wisdom

The main advantage of breaking the installation down per type is the opportunity to have different L, U and B as a function of the basic GEMM installed. In fact, on my A8 machine GotoBLAS has better performance than ATLAS, and somehow the configurations are different too: in particular, L and U (the number of cores and the unrolling used for MA) differ between GotoBLAS and ATLAS.

This is not good: it means that the installed BLAS package affects performance in subtle ways that are not really apparent. This is also good: the little voice inside my head keeps saying, “I told you there is something interesting even here”.

Once again, it seems I am concluding this post with one further open question. Good research often does feed itself.

If you would like to help test this new framework, please contact me; I will follow up after Oct 1, 2012.


Fast Matrix Multiplication Python package

After working on an R package, I wanted to keep at it a little longer and see what it takes to write a Python package. I have to say it was fun, and actually easier than I expected and remembered. The process is mostly automatic, and if you skip writing manuals, descriptions, comments and all the good stuff they teach in software engineering classes, it is pretty straightforward.

The R and Python packages share the same C interface, but each uses it in its own way. Using the numpy array object and its ability to hold arrays of different types (float, double, complex float and complex double), the interface is quite simple and quite nimble: we can pass large matrices by reference, without copies.
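The mechanism is the usual numpy-plus-ctypes one. The sketch below uses hypothetical library and function names (libfastmmw_d.so and a routine dmul with a plain pointer interface), not the actual FastMMW Python module; it only shows how the numpy buffers go in by reference.

```python
import ctypes
import numpy as np

# Hypothetical names: the shared library and the C entry point are placeholders,
# not the real FastMMW interface.
lib = ctypes.CDLL("./libfastmmw_d.so")
dptr = ctypes.POINTER(ctypes.c_double)
lib.dmul.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int, dptr, dptr, dptr]

def fast_mult(A, B):
    """C = A * B in double precision. For contiguous float64 inputs the
    numpy buffers are passed by reference (no copy); only C is allocated here."""
    A = np.ascontiguousarray(A, dtype=np.float64)
    B = np.ascontiguousarray(B, dtype=np.float64)
    m, k = A.shape
    _, n = B.shape
    C = np.empty((m, n), dtype=np.float64)
    lib.dmul(m, n, k,
             A.ctypes.data_as(dptr), B.ctypes.data_as(dptr), C.ctypes.data_as(dptr))
    return C
```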

There are a few bugs left to take care of, but I could already run a simple example and the usual error comparison, as done in the previous post. With positive matrices, and using double precision as the reference, I could compare the errors committed.
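The measurement is of this kind; in the numpy sketch below a plain float32 product stands in for the fast algorithms, since the point is only the maximum-error comparison against a double-precision reference (the exact metric used for the plots may differ):

```python
import numpy as np

# Illustrative stand-in: a float32 product plays the role of the fast algorithms;
# the real comparison uses the FastMMW routines through the Python package.
rng = np.random.default_rng(0)
n = 1024
A = rng.random((n, n))      # positive entries in [0, 1)
B = rng.random((n, n))

reference = A @ B                                            # double-precision reference
approximate = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

max_rel_error = np.max(np.abs(approximate - reference) / np.abs(reference))
print("maximum relative error:", max_rel_error)
```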

[Plot: maximum error of Strassen and the two Winograd algorithms for positive matrices, measured against the double-precision reference]

The first thing I noticed was the striking difference from the picture presented in the previous post. Here, each algorithm has its own range, without intersection: Strassen’s algorithm has the worst maximum error (as expected) and the two Winograd algorithms present different errors. In contrast, in the R exercise all the fast algorithms cluster in the same area, with no great distinction among them. Using the same random number generator but with values in the interval [-1,1], this is the plot:

That is very similar to the R exercise.

Having two different environments (actually three: R, Python and C) will make it easier to investigate the nuances of the algorithms’ numerical properties. Just playing with these, I already have questions about the code generators and the effect of the differing results. Experiments are an integral part of science, and good experiments always beg more questions.

I will clean up the code and make these modifications and interfaces available in the next distribution.