Searching for Fast Matrix Multiplication …

… Frameworks are appearing.

Regularly, I search for the keywords Fast Matrix Multiplication or Multiply, using both Google and Bing/Yahoo. The two search engines use different algorithms, and the results are often different. I search for two separate reasons: first, narcissism, I want to find my site on the first page; second, to see what is new in the field (similar research tends to use the same terminology).

In Bing and Yahoo, I regularly appear on the first page, close to Wikipedia. This provides great visibility, and their algorithm validates my effort to provide technical, high-quality content. As it is, I cannot write a new post every week, and thus this search engine does not really treat me as a classic blog.

If I could publish regularly, my content would bubble up in Google. As I write today, I appear on either the first or the second page, which is very good and not common. In general, papers by Stanford or Berkeley alumni are ranked higher. If you repeat the search, the result may well be tailored to your history, and this site could be a long shot.

Nonetheless, I like this scenario: one search engine is consistent and values my effort, while the other often rewards "temporal" importance. I will explain the negative effect first and the positive one in the next paragraph. The negative effect is simple to explain: not being at the top of Google searches would hurt if I had monetary investments in this work or if my professional career depended on it. For example, citations from peers matter in so many ways in research, especially in academia: for advancement, funding, and personal satisfaction (a pat-on-the-back kind of thing). Not being at the top is equivalent to not being published in top-tier conferences or journals: it diminishes your contribution in the eyes of new researchers and your peers. If they cannot find your work, the majority will never know what your contribution is.

The temporal ranking effect helps me stay up to date with current research: new work bubbles up rather quickly, and I can easily be in the midst of smarter people's thoughts. I found out that the field attracts interesting work and interest in general, and this is exciting.

Smart people are reaching the conclusion that it is good, if not necessary, to have a development framework for the generation of Fast Matrix Multiplication algorithms.

The first is by Alexey Smirnov: he provides a systematic way to create recursive algorithms using matrix notations in Bini's style, and his work is publicly available. If you ever wanted to create code for these algorithms, you need the matrix descriptions and, from them, the code. He proposes solutions for larger base problems, where the savings in matrix multiplications are bigger than in the algorithms I used to work on.
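To give a flavor of what "algorithms as matrices" means, here is a minimal sketch (in Python/numpy, with Strassen's classic 2x2 algorithm as the example, not Smirnov's actual notation): a fast algorithm is fully described by three coefficient matrices U, V, W, and a tiny generic interpreter can execute any algorithm given in that form.

```python
import numpy as np

# Strassen's 2x2 algorithm encoded as three coefficient matrices:
#   m_k = (sum_i U[k,i] * a_i) * (sum_j V[k,j] * b_j)
#   c_l = sum_k W[l,k] * m_k
# where a, b, c are the operands flattened row-major: [x11, x12, x21, x22].
U = np.array([[ 1, 0, 0, 1],   # M1 = (A11 + A22)(B11 + B22)
              [ 0, 0, 1, 1],   # M2 = (A21 + A22) B11
              [ 1, 0, 0, 0],   # M3 = A11 (B12 - B22)
              [ 0, 0, 0, 1],   # M4 = A22 (B21 - B11)
              [ 1, 1, 0, 0],   # M5 = (A11 + A12) B22
              [-1, 0, 1, 0],   # M6 = (A21 - A11)(B11 + B12)
              [ 0, 1, 0,-1]])  # M7 = (A12 - A22)(B21 + B22)
V = np.array([[ 1, 0, 0, 1],
              [ 1, 0, 0, 0],
              [ 0, 1, 0,-1],
              [-1, 0, 1, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [ 0, 0, 1, 1]])
W = np.array([[ 1, 0, 0, 1,-1, 0, 1],   # C11 = M1 + M4 - M5 + M7
              [ 0, 0, 1, 0, 1, 0, 0],   # C12 = M3 + M5
              [ 0, 1, 0, 1, 0, 0, 0],   # C21 = M2 + M4
              [ 1,-1, 1, 0, 0, 1, 0]])  # C22 = M1 - M2 + M3 + M6

def bilinear_mm(A, B):
    """Execute any 2x2 bilinear algorithm described by (U, V, W)."""
    a, b = A.reshape(4), B.reshape(4)
    m = (U @ a) * (V @ b)        # 7 multiplications instead of 8
    return (W @ m).reshape(2, 2)
```

In a real framework the entries a_i and b_j are sub-matrices and the elementwise products become recursive calls, but the (U, V, W) description stays the same; this is the kind of matrix description a generator works from.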

One of the obstacles to the use of Fast MM algorithms is their poor formulation for rectangular matrices. One of my humble contributions is the presentation of a single algorithm that can be used for any matrix size (my first TOMS paper). Now, I am happy to see that there is practical interest in tackling this problem.

Recently, I have been contacted to share my adaptive library to tackle exactly rectangular matrices and, thanks to Google, I have found that Benson and Ballard are proposing a new framework building on Smirnov's idea. My library is used as a reference and seems to have a little edge when applicable. The good news is that the optimizations they present could, in principle, be applied to mine.

I am happy to see that current research is going in the direction I would go: a practical framework where algorithms are presented as matrices, and automation takes these descriptions and creates code: for any matrix size and any architecture.







The Better Accuracy of Fast Matrix Multiply III

In the previous post, I showed that Strassen-Winograd algorithms have a directionality in their errors. This is the reason for their numerical properties, and so far it has been used mainly to express bounds. Today, we use it to improve their numerical properties: after all, if you have a constructive proof of a bound, the least you can do is use it to optimize the algorithm itself.

I admit there are several ways to improve the algorithm. Randomization was my first guess. It is simple and it breaks patterns, or rather it makes those patterns unpredictable and thus no longer patterns. In fact, randomization can be used to spread the error across the result matrix, but I could not make it reduce the intensity of the error.

I came to the idea of using permutations designed to tilt the error without actually involving extra computations. These permutations must be tailored to the algorithm, but they are not that difficult to grasp or implement. What these permutations are and how they are applied is in the paper. Here, I want to show the effect of those permutations on the maximum error and its locations:
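To make the mechanism concrete, here is a small sketch (mine, for illustration; the paper's permutations are tailored, not random): permuting the rows of A and the columns of B changes which output elements travel through the error-prone linear combinations, and the exact product is recovered by inverting the permutations, at no extra multiplication cost.

```python
import numpy as np

def strassen_one_level(A, B):
    # one recursion of Strassen, plain matmul on the half-size blocks
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C = np.empty((2 * n, 2 * n), dtype=A.dtype)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

def permuted_strassen(A, B, rng):
    # (P A)(B Q) = P (A B) Q: undoing P and Q on the output restores A B,
    # while the error pattern of the inner algorithm lands elsewhere.
    n = A.shape[0]
    p, q = rng.permutation(n), rng.permutation(n)
    C_perm = strassen_one_level(A[p, :], B[:, q])
    C = np.empty_like(C_perm)
    C[np.ix_(p, q)] = C_perm
    return C
```

Only data movement is added; the number of multiplications is untouched.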


Take the previous post and look at the scatter plots (this one and the previous one). First of all, I reduced the maximum error by half. I can do that because I spread the error.

Once again, the geo-locality of the error is the cause of the constant but extra error FastMMW algorithms commit. On one side, I find comfort in knowing where the error is. On the other side, this knowledge can be used to break the error pattern and improve the error we commit. I see a nice connection with yet another application of Werner Heisenberg's principle.



The Better Accuracy of Fast Matrix Multiply II

What is the goal of analytic combinatorics? In algorithms, we use it to get an estimate of complexity. We use this complexity to choose an algorithm a priori: the one with fewer operations or fewer memory reads. Speed is often related to fewer operations, but there are interesting exceptions. In the previous post, I wrote about sorting algorithms, their variety, their complexity and, in turn, their assumed performance. Is the numerical analysis of an algorithm that different from its complexity analysis?

Someone introduced me to the analytic combinatorics of algorithms; it is part of the curriculum of any computer science or engineering background. However, I am self-taught in numerical analysis. Information theory, and thus probability, is an integral part of CS/EECS; statistics, however, has been a recent deep dive for me. So, to answer the question above: the CS educational system seems to consider numerical analysis a topic for special interests.

Interestingly, the error analysis methods do not look much different from the methods used for complexity analysis. The idea is to bound the error instead of bounding, say, the number of operations. For matrix multiplication the result is a matrix, very often stored as a two-dimensional array.

We can bound the error of each matrix element, and each element is equally important. For the regular matrix multiplication this is simple, because each element of the result matrix is produced by the same computation, though on different data. If we assume that the inputs do not matter and only the computation injects errors, the regular matrix multiplication has a uniform error. In fact, the bounds are identical for all elements of the result matrix.

Fast matrix multiplications are different: by construction, the algorithm builds the components of the result matrix using asymmetric computations. We are used to writing simplified equations so as to bound the whole matrix. The final result looks and feels like any complexity analysis:

    \[ \|D\| \leq \Big[ \Big(\frac{n}{n_0}\Big)^{\log_2K}(n_0^2 +5n_0)-5n \Big] u \|A\|\|B\| \]

Higham’s Book Chapter 23, Theorem 23.3

In principle, it is possible to write bounds for each element even for fast matrix multiplication; however, in practice, no one has actually tried. The equation above is beautiful in so many ways once you get to know it.
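Even without per-element bounds, the summary equation is easy to probe numerically. A small sketch (assuming K = 12, the value corresponding to Strassen's algorithm in this family of bounds; Winograd-type variants have a larger K): the bracketed factor collapses to the classical n^2 when no recursion is taken (n0 = n), and inflates with every extra level of recursion.

```python
import math

def fastmm_bound_factor(n, n0, K):
    # the bracket in  ||D|| <= [ (n/n0)^{log2 K} (n0^2 + 5 n0) - 5 n ] u ||A|| ||B||
    return (n / n0) ** math.log2(K) * (n0 ** 2 + 5 * n0) - 5 * n

n = 1024
# no recursion: the fast bound degenerates to the classical n^2 factor
assert fastmm_bound_factor(n, n, 12) == n ** 2
# each recursion (halving n0) inflates the worst-case factor
factors = [fastmm_bound_factor(n, n0, 12) for n0 in (1024, 512, 256, 128)]
assert factors == sorted(factors)
```

This is exactly why a bounded number of recursions costs only a bounded constant in the error.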

About four years ago, I started looking at how the bounds above were computed. I was awed by their complexity and elegance. I started understanding them when I saw the original proof by Brent, and I saw their power when I first saw the proof by Bini and Lotti. All of the previous masters approached the problem by reducing the result to a single equation: a summary.

The more I saw the equation, the more I hated it, because the result was hiding the methodology used to reach such a concise statement. The methodology has to compute a bound for each element, and in practice the proofs teach you how to write better algorithms and thus tighter bounds, better equations.

So to achieve a better algorithm we must somehow summarize the error analysis.

[Figure: heat map of the maximum error locations, 150x150 matrices, 10,000 runs, inputs in [-1,1].]

So I decided to show the location of the error graphically. Take 10,000 runs of four algorithms: Strassen, SW (Winograd variation 1), SWOPT (Winograd variation 2), and Goto's SGEMM. The figure shows the locations of the maximum errors and of the maximum of the maxima. The beautiful equation hides these even more beautiful patterns.

Nonetheless, when I show this picture (or pictures like it), everyone understands that the higher error of Fast MM is due to the limited and focused pattern: "if we could spread the area as in the SGEMM, we could reduce the error". The picture gives you the right intuition immediately. Once you combine the intuition with the proof of the equation, you have the main idea.

There are algorithmic optimizations that make it possible to achieve more accurate fast matrix multiplications.

The Better Accuracy of Strassen-Winograd Matrix Multiply I

On March 1, 2014, a paper of mine with the title above will be published in Advances in Linear Algebra and Matrix Theory, Vol. 4, Number 1. This is an open publishing venue. I am new to this kind of publishing experience, and I will provide here support, considerations, and opinions. I will post a commentary series to help the distribution and to explain the contents.

The title pokes fun at the subject. So far, the numerical analysis of these algorithms has been stale, in the sense that it has not been used to design more accurate algorithms: the analysis has been used to describe the worst-case error of FastMMW, but it has been used neither to design better algorithms nor to select the best algorithm for the input at hand. This point is embarrassingly important.

Let's have a thought experiment about sorting algorithms; this will appeal to CS people and Google interviewers. There are about three dozen sorting algorithms with different best, average, and worst-case scenarios. It seems we have a lot of algorithms to choose from. For example, merge sort and quicksort have the same best and average performance, but quicksort has a worst-case performance of O(n^2): that is the case where, every time we choose a pivot element to split the vector in two and sort recursively, we pick the smallest or the largest element n consecutive times. Albert King would say that he would not know luck at all if it was not for bad luck; or the vector is constant. Would you reject quicksort a priori for that?

The community did not: quicksort comes with multiple-pivot variations and with hierarchical algorithms to improve the overall performance: we choose the pivot and apply the recursion a finite number of times for the specific input, etc. For sorting, the attitude is different: the variety and number of algorithms is exciting, the combination of different algorithms to fit a specific purpose is elegant, and it makes a great subject for job interviews (usually not very deep ones; after all, you need to know three dozen algorithms and you cannot use Google while interviewing at Google).

FastMMW has thousands of variations, without even considering the asymptotically faster ones. I am not kidding: during a personal conversation, someone smarter than me counted them in the Ks (thousands). The bounds are for the family of algorithms (as if we bounded all sorting algorithms at once), not for each implementation; the bounds provide only worst-case scenarios (forget about best and average); and there is no consideration of the inputs and their nature.

I understand we are talking about precision and not performance, so we must be cautious. However, it is true that if we apply Strassen twice recursively, the error increases only by a bounded constant. It is like choosing quicksort for two levels of recursion and then yielding to merge sort.
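The analogy can be made literal with a toy sketch (mine, for illustration): apply quicksort's partitioning for a fixed number of levels, then hand the sub-vectors to a guaranteed O(n log n) method, exactly like running a fast MM for a bounded number of recursions before yielding to the classical kernel.

```python
def hybrid_sort(xs, depth=2):
    # quicksort partitioning for `depth` levels, then a guaranteed
    # O(n log n) fallback (sorted() stands in for merge sort here)
    if len(xs) <= 1:
        return list(xs)
    if depth == 0:
        return sorted(xs)
    pivot = xs[len(xs) // 2]
    left  = [x for x in xs if x < pivot]
    mid   = [x for x in xs if x == pivot]
    right = [x for x in xs if x > pivot]
    return hybrid_sort(left, depth - 1) + mid + hybrid_sort(right, depth - 1)
```

Two levels of partitioning cost only O(n) extra work each, so the worst case stays O(n log n), just as two recursions of Strassen increase the error bound only by a constant.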

In this work, The Better Accuracy of Strassen-Winograd Matrix Multiply, I show that we can design more accurate algorithms even if we use a finite number of recursions, and that the theory started 40 years ago is a great starting point when used in combination with new tools and ideas, especially new ideas.

The format of this series is still in my head and thus … but I will start with a new idea.


Clearness, contents, contributions, and rejections

I have seen different facets of the presentation of new ideas. This is my collection of thoughts about a few hurdles thrown at any naive contributor of original content in the scientific field.

Point A: I am reading an old booklet, An Introduction to Mathematics by Alfred North Whitehead (1910); you can find it for a penny on Amazon, but some consider it among the top 50 books of scientific literature. There is a paragraph about mathematical clearness I shall quote here:

“…. In the second place it is necessary for research. It makes for clearness of thought, and thence for boldness of thought and for fertility in trying new combinations of ideas. When the initial statements are vague and slipshod, at every subsequent stage of thought common sense has to step in to limit application and to explain meanings. Now in creative thought common sense is a bad master. Its sole criterion for judgement is that the new ideas shall look like the old ones. In other words it can only act by suppressing originality.”

This is the real reason for clarity, and it shows the hard consequences. I like that mathematical clearness is the foundation of creative thinking and not just a notational game: if I do not have to think all the time about the meaning of terms and ideas, my creative side is free to tackle new ideas. This requires some flexibility from the reader too: if I write u and explain that it is a unit of measure, please let it be. If the reader (you) is more familiar with the term b instead, mine is just a different shorthand. If my notation has to be exactly like yours to make sense to you, it sounds more like forcing me to conform to your common sense, and thus I am even less likely to convince you of a result that is truly original and is not yours.

Point B: I am a reviewer for a journal, and recently I rejected a paper. The authors presented two algorithms, A and B; they described them in fairly good detail and chose to present experiments only for one, B. Algorithm A did not have experiments because it was not appealing, and it was not appealing because the implementation was poorly parallel w.r.t. B: one processor was doing 1/4 of the work no matter how many processors were used. In my review, I presented a different algorithm: that particular 1/4 of the work can be made highly parallel, the parallel algorithm is elegant and short to describe, and there is code for it. My concern was about the contents and what was missing: a better algorithm A. Strangely, if the authors had given me only B, I would never have bothered with it.

Sincerely, I did not fully understand the paper, but that was not my reason for the rejection. What I did, unconsciously and consciously, was focus on the contents and check whether B was better than A. Clearness should provide that, because it was the main goal of the paper. The authors may think that I am an idiot, that their paper is just fine, and they may be writing a rebuttal to me and the editor to reconsider the publication. If you are small potatoes, being reviewed sucks. If you are small potatoes and you are reviewing, you may get carried away.

Point C: At the beginning of 2013, I had been working on a project for a few months. In the second half of the year, I tried to collect my thoughts and results into a paper and submit it for publication. The next post will be about the contents of this publication. If you follow my posts, you may know that I am an experimental person; that is, observations and experiments are the first part of my work. Nowadays, I often communicate these results during the investigation to researchers who worked on the same subject, especially if my work may shed a different light on theirs. Mostly, I want to know if I did something obscenely wrong. Other researchers go to great lengths to avoid sharing or paying their dues (as B.B. King would say). Sometimes I even understand the reasons for not sharing, but I understand the reasons for sharing better.

For me, mathematical clearness is an attempt to organize and write down a process that is often scruffy, based on induction and sometimes intuition. Sometimes I start writing during the experiments, to guide me in finding the right questions and the right answers. More often, the writing comes afterwards, when I have a good enough answer. The experiments are driven by curiosity: the desire to know, and the disappointment of remaining an idiot (i.e., one who ignores). Though the writing is hard, I like seeing the white, empty page become an article: an original contribution describing in the best way I can what I did, to be shared. Without considering the publication process, this may take months. With the publication process, it may take years.

I submitted this work to an old and good journal where I had already published, but I knew the process would be tough. I believed, and still believe, it was the right course. During the preparation of the work, I thought I had the blessing of my predecessors (or at least of those who matter to the topic and to me). In fact, the paper was rejected. The reviews were mostly aimed at the writing and at how the writing cluttered the understanding of the material. In practice, clarity is the responsibility of the authors: a paper cannot be published if it is not clear. I rewrote the paper, wrote a detailed rebuttal, and submitted it somewhere else; it will be published shortly. This paper proves better bounds on the error of FastMMW: the last such result was published 40 years ago. Yep, creative thinking in action.

Conclusions: with experience and knowledge, I hope to reach the mathematical clearness that will help my successors and convince my contemporaries.






No, GPU Matrix Multiplication ain’t tiled.

Is wisdom a gift of old age? Or is it patience, maybe? I do not remember which one. My age is affecting my memory.

All along these random posts, my research, and the presentation of my research, the topic of error analysis is a fastidious itch that everybody around me wants to scratch. Paraphrasing the line "See the stone set in your eyes": when I talk about Fast MM, there is always a cry that the use of any fast algorithm is inconceivable, that the error those algorithms introduce is a huge stone I should lift. The wise see my stone and want me to scratch its itch.

To anyone attending ICS 2013, I would recommend a talk: Improving Numerical Accuracy for Non-Negative Matrix Multiplication on GPUs using Recursive Algorithms.

There are a few interesting ideas I would like to express (not necessarily aligned with all the authors), beside the obvious point that technology makes us all lazy:

  • When new architectures are introduced, like these beautiful GPUs, we tend to forgive the quality of the code written for them. For the sake of fast publication, we forget what has been done previously (by ourselves?).
  • These things can awe you with 3 TFLOPS of performance, which is a slap in the face to any high-end general-purpose processor. For the same reason, any decent code will do.
  • MM codes for GPUs do not use blocking. The irony is that the people who developed code for the GPUs are the same authors who proposed blocking.

But WTF is blocking/tiling and why MM ain’t tiled?

It is actually easy to summarize tiling/blocking as a way to break the computation into smaller computations following a divide-and-conquer algorithm. An MM is tiled if it is decomposed into similar and smaller MMs. For example, fast algorithms apply blocking naturally (i.e., recursive blocking).
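In a few lines of numpy, a sketch of what "tiled" means: the n x n product is assembled from products of t x t tiles, each of which is again a (smaller) matrix multiplication and could itself be decomposed recursively.

```python
import numpy as np

def tiled_mm(A, B, t):
    # C is assembled from t x t sub-multiplications of the same shape:
    # each innermost update is itself a smaller MM (recursively blockable)
    n = A.shape[0]
    assert n % t == 0
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, t):
        for j in range(0, n, t):
            for k in range(0, n, t):
                C[i:i+t, j:j+t] += A[i:i+t, k:k+t] @ B[k:k+t, j:j+t]
    return C
```

A code that instead fixes a sub-block of C and runs one long dot product along the whole k dimension is precisely what is not tiled in this sense.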

If you pick up the MM code for NVidia's or AMD's GPUs as of today, April 2013, the code is not tiled. The authors break the matrix C into square sub-blocks of fixed size (k_0 x k_0) and perform "long" matrix multiplications; in practice, they exploit dot products. This is not me being obsessive about the definition of blocking: it has serious consequences for the error the computation commits. The code has long strings of dependent additions that very likely introduce and accumulate error. Here is the stone set in your eyes.
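The effect of those long chains of dependent additions is easy to see in a few lines (a generic float32 summation sketch, not the actual GPU kernel): a single running sum has a worst-case error that grows like n*u, while a blocked/pairwise scheme grows only like log2(n)*u.

```python
import numpy as np

def seq_sum(x):
    # one long chain of dependent float32 additions (the "long" dot product)
    s = np.float32(0.0)
    for v in x:
        s = np.float32(s + v)
    return s

def pairwise_sum(x):
    # blocked summation: the dependence chain is only log2(n) deep
    if len(x) == 1:
        return x[0]
    m = len(x) // 2
    return np.float32(pairwise_sum(x[:m]) + pairwise_sum(x[m:]))

rng = np.random.default_rng(0)
x = rng.random(1 << 16).astype(np.float32)
ref = float(np.sum(x.astype(np.float64)))   # double-precision reference
print(abs(seq_sum(x) - ref), abs(pairwise_sum(x) - ref))
```

Typically the pairwise sum lands much closer to the double-precision reference than the sequential chain; the same phenomenon plays out inside a long, unblocked dot product.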

The wise man would say: do not worry, young man! The error committed here is not even close to the error committed by the fast MM algorithms. After all, the wise is thinking that anything is more accurate than a fast MM, which is unstable. What's the matter, you do not like 3 TFLOPS?

If I apply any fast algorithm, I may add another 1 TFLOPS (a 25% improvement), but the wise would say: inconceivable. Because my fast algorithm will introduce errors of inconceivable size, making the result plain useless. I like to use "inconceivable" here. Unfortunately, it seems I cannot ask: what's the matter, you do not like 4 TFLOPS for the price of 3?

In the ICS talk, the authors will show that for positive matrices (like probability matrices) there is a Winograd variant that is both 27% faster and 3-4 bits more accurate. The authors also show an accuracy comparison with Goto's implementation, which should have been used as the reference for numerical error analysis before (but was not). The fast MM loses one bit to Goto's implementation. 4 TFLOPS and 1 decimal digit of precision back: can anyone tell me what is wrong with that?

Again, my conclusion is simple: keep an open mind. There are places where fast MMs should not be used and places where they should; let's learn together when and where.


Merry Christmas and Full Ahead 2013

As Benigni used to say, "Happy birthday, Little Boy Jesus (buon hompleanno bambin Gesu')". I would like to remind myself, the ones I love, and those very few who have read and will read the Fast Matrix Multiply (FastMMW) blog to enjoy these merry days.

Today, please join me and thank, kiss, and hug those you love; and remember to repeat this every day.



Self Installing Package: Fast Matrix Multiplication

In the last two weeks, I have been working on practical ways to improve the distribution of the FastMMW library: to make my code less of a research project and more like a software package that anyone can download, build, install, and use.

I am almost done testing an installation process written in Python. I chose Python because I am lazy and it is simple. I also just finished writing a Python package so that anyone can use FastMMW; a simple example and pictures are in the previous post.

I tested the installation process on my A8 machine, a single-processor, 4-core system running Ubuntu 10.4. I am now running the installation on one of my multiprocessor systems, with two dual-core processors (Opteron), running Fedora 9, just to test the process on a completely different system. Boy, is that always needed.

Per se, the installation process is not really interesting. The library comes with an INSTALL.txt file where I present the process step by step, and I wanted to automate it a little. For some architectures with multi-core, multiprocessor systems, where cores are hardware threads, I recommend a little imagination and testing: do it manually! It is very instructive; you will get a feeling for how the system distributes the computation by using the command "mpstat -P ALL".

Let me not digress,  back to the installation process.

The idea is to create 4 libraries, one for each basic type: float, double, complex float, and complex double. Each library is built independently of the others, and this allows different optimizations.

Matrix addition optimization

MA (matrix addition) is a two-dimensional loop code: the inner loop follows the inner dimension (i.e., columns for Fortran and R matrices, rows for C matrices). The computation has spatial locality, and it is trivial to parallelize (the outer loop). The first step is to measure the performance with different numbers of threads and different unrollings of the inner loop (exploiting register allocation).

I chose unrollings of U = [4, 8, 16, 32, 64], tried in order. With everything else left alone but the unrolling, I expect performance to increase until register spilling reduces it. For example, if I measure a drop in performance between 16 and 32, I will not test 64.
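The early-stopping search over unrolling factors can be sketched in a few lines (generic code, not the installer's actual implementation; `measure` stands for whatever benchmark reports the performance of a candidate):

```python
def pick_best(candidates, measure):
    # walk the ordered candidates, stopping at the first performance drop
    # (e.g., a drop between 16 and 32 means 64 is never even timed)
    best, best_perf = None, float("-inf")
    for c in candidates:
        perf = measure(c)
        if perf < best_perf:
            break
        best, best_perf = c, perf
    return best

# synthetic performance curve: register spill kicks in past unroll 16
curve = {4: 1.0, 8: 1.6, 16: 2.1, 32: 1.4, 64: 0.9}
assert pick_best([4, 8, 16, 32, 64], curve.get) == 16
```

The same loop serves for the thread-allocation search by swapping the candidate list.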

I also choose a thread allocation. For example, if I have 4 cores [0,1,2,3], I will test the following configurations: L = {[0], [0,1], [0,1,2], [0,1,2,3]}.

Once I have found the best configurations L and U, I estimate the break-even point where fast algorithms are as fast as Goto's or Whaley's. Let us call this estimate B.

Break even evaluation

Starting from the first estimate B, I run a linear search for the problem size where Winograd's algorithm is as fast as the best GEMM. Once I have found the break-even point, I can create the library. I also create binaries so that the user can test the real performance.
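A sketch of the linear search (with synthetic timing models in place of real measurements; the constants are made up to give a known crossing point):

```python
def break_even(classic_time, fast_time, start, step=128, limit=8192):
    # linear search for the smallest tested size where the fast routine wins
    n = start
    while n <= limit:
        if fast_time(n) <= classic_time(n):
            return n
        n += step
    return None

# toy models: classic ~ n^3, fast ~ 4 n^2.807 (true crossing near n ~ 1317)
classic = lambda n: n ** 3
fast = lambda n: 4 * n ** 2.807
assert break_even(classic, fast, start=256) == 1408
```

In a real installation, `classic_time` and `fast_time` would time the GEMM and the Winograd routine on actual matrices.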

A note of wisdom

The main advantage of breaking the installation up per type is the opportunity to have different L, U, and B as a function of the basic GEMM installed. In fact, on my A8 machine GotoBLAS has better performance than ATLAS, and somehow the configurations differ; in particular, L and U (the number of cores and the unrolling used for MA) are different for GotoBLAS and ATLAS.

This is no good: it means that the BLAS package installed affects the performance in subtle ways that are not really apparent. This is also good: the little voice inside my head is always saying, "I told you there is something interesting even here".

Once again, it seems I concluded this post with one further open question. Good research often feeds itself.

If you would like to help test this new framework, please contact me; I will follow up after Oct 1, 2012.







Fast Matrix Multiplication Python package

After working on an R package, I wanted to work a little longer and see what it takes to write a Python package. I have to say it was fun, and actually easier than I expected and remembered. The process is mostly automatic and, if you skip writing the manuals, descriptions, comments, and all the good stuff they teach in software engineering classes, it is pretty straightforward.

The R and Python packages share the same C interface, but each in its own way. Using the numpy array object, with its ability to represent the different types of arrays (float, double, complex float, and complex double), the interface is quite simple and quite nimble: we can pass large matrices by reference, without copies.
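The "by reference, no copy" part can be illustrated with a tiny ctypes sketch (the actual FastMMW entry points and their signatures are not shown; `as_c_pointer` is just an illustrative helper): a numpy array hands its raw buffer to C, and writes through the pointer are visible in the array.

```python
import ctypes
import numpy as np

def as_c_pointer(a):
    # expose a numpy array's buffer to C with no copy
    assert a.flags["C_CONTIGUOUS"] and a.dtype == np.float32
    return a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

A = np.ones((4, 4), dtype=np.float32)
p = as_c_pointer(A)        # this is what a C kernel would receive
p[0] = 7.0                 # the C side writes...
assert A[0, 0] == 7.0      # ...and the numpy array sees it: shared memory
```

A real binding would declare argtypes for the library's functions and pass pointers like this for A, B, and C.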

There are a few bugs left to take care of, but I could already run a simple example and the usual error comparison, as done in the previous post. With positive matrices, and using double precision as the reference, I could compare the error committed:








The first thing I noticed was the striking difference from the picture presented in the previous post. Here, each algorithm has its own range, without intersections: Strassen's algorithm has the worst maximum error (as expected), and the two Winograd algorithms present different errors. In contrast, in the R exercise, all the fast algorithms clustered in the same area with no great distinction. Using the same random number generator, but on the interval [-1,1], this is the plot:

That is very similar to the R exercise.

Having two different environments (actually three: R, Python, and C) will make it easier to investigate the nuances of the algorithms' numerical properties. Just playing with these, I already have questions about the code generators and the effect of the differing results. Experiments are an integral part of science, and good experiments always beg more questions.

I will clean up the code and make these modifications and interfaces available in the next distribution.



Fast Matrix Multiplication R Package



My work experience taught me one thing: people use a lot of different languages and environments, and they thrive where one application can be built in different ways. In previous projects, I wrote fast kernels in C and then wrote different interfaces: R, Java, Python, and Perl. The top reason is that the basic functionality did not change while, over the years, we changed the environments and the languages. So, to keep up with the new products, to keep a single repository of kernels, and finally to keep code maintenance to a minimum, we spent quite some time learning how to write interfaces. Fortunately, most languages are very friendly to C kernels.

R is one of my preferred languages and interpreters. It is easy to use, it is a functional language, it has "lotsa" statistical tools, and it is a friendly environment for creating packages and plugging in your own C kernels. Today, I can say I accomplished the creation of the first FastMMWR package, making it possible to use FastMMW algorithms on top of GotoBLAS2 on my A8 AMD machine.

The R interface is simple to create and to use. However, the standard copying of data from R to C, and vice versa, is overwhelming: the speedup obtained by the FastMMW algorithms is almost obliterated by the cost of preparing the inputs and outputs. This is because I wanted to use an R wrapper; it would be more efficient to use the .C() interface directly, without copying the data.

But there is a silver lining to having the complete set of FastMMW algorithms and their R interface: R uses double and double-complex matrices, so the regular matrix multiplication C <- A %*% B comes in basically these two types. We can speed up the computation by casting double to single and double complex to single complex. This down-cast can be done using the R copy interface before calling the appropriate function.
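The numpy analogue of the down-cast experiment (a sketch of the idea, not the FastMMWR code): keep the double-precision product as the reference, down-cast the inputs to single, and measure what the cheaper product costs in accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((256, 256))             # double precision, entries in [0, 1)
B = rng.random((256, 256))

C_ref = A @ B                          # reference product in double
C_single = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

max_err = np.max(np.abs(C_ref - C_single))
print(max_err)                         # the price of single precision
```

The same comparison, with the fast algorithms added as extra comparables, is what the figures in this post report.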

If FastMMWR does not really shine for performance, why bother with an interface and a package?

Because we can create error analysis tests for FastMMW and produce beautiful graphs in a matter of minutes:








For example, let us consider the figure above. We create 50 pairs of matrices A and B with entries from a uniform distribution between 0 and 1. The matrices have size 5000×5000. We create positive matrices in double precision and compute the reference product in double precision. Then, we cast the inputs to single precision and perform the same product in single precision (SGEMM). We also ran Winograd's algorithm (SWGEMM), Strassen's algorithm (SSGEMM), and our fastest algorithm (SBGEMM, where B stands for Bodrato, Marco, the original creator of the algorithm). All FastMMW algorithms use a single recursion, because the break-even point for this architecture is 3000×3000.

So our reference is in double precision, and the comparables are four single-precision algorithms. For this architecture, Winograd's and Strassen's algorithms have the same error on average. In my experience, this is new and interesting, because in previous experiments Winograd's algorithms always had a better maximum error. On average, the fast algorithms with one recursion have a multiplicative error factor of only 1.5. The theoretical factor for Strassen is 3 and for Winograd is 4. We lose half a decimal digit.







If, instead of positive matrices, we create matrices from the normal distribution with mean 0 and variance 1, the plot is different from before, but overall the error is as expected: that is, Strassen's has a smaller maximum error than Winograd's algorithm. Moreover, Strassen's has a multiplicative factor of 1.5 or less, while Winograd's factor is about 3.

The amazing thing is the speed of writing the test, collecting the results, and plotting them. This is the code:

compare <- function(n, MM, type="normal") {

  sr = vector("numeric", n)   # SGEMM reference kernel
  sw = vector("numeric", n)   # Winograd
  sb = vector("numeric", n)   # Bodrato
  dw = vector("numeric", n)   # Strassen

  i = 1
  while (i <= n) {
    as = 1.0
    bs = 1.0

    if (type == "normal") {
      AS = array(rnorm(MM*MM), dim=c(MM,MM))
      BS = array(rnorm(MM*MM), dim=c(MM,MM))
    } else {
      AS = array(runif(MM*MM,0,1), dim=c(MM,MM))
      BS = array(runif(MM*MM,0,1), dim=c(MM,MM))
    }

    print(c(dim(AS), dim(BS)))

    CD = AS %*% BS              # double-precision reference product

    CS = s_mm_leaf_computationR(as, AS, bs, BS)
    sr[i] = max(abs(CD - CS))

    CS = s_wmpipeR(as, AS, bs, BS)
    sw[i] = max(abs(CD - CS))

    CS = s_BMOWR_PIPER(as, AS, bs, BS)
    sb[i] = max(abs(CD - CS))

    CS = s_smpipeR(as, AS, bs, BS)
    dw[i] = max(abs(CD - CS))

    i = i + 1
  }

  L = list(sr, sw, sb, dw)
  i = 1
  while (i <= length(L)) {
    average = sum(L[[i]]) / length(L[[i]])
    var = sqrt(sum((L[[i]] - average)^2) / length(L[[i]]))

    # prepend average +/- one standard deviation to each error vector
    L[[i]] = c(average + var, average - var, L[[i]])

    print(c(i, average, var))
    i = i + 1
  }
  L
}

plotL <- function(L, Col, t, f) {

  png(filename=f, height=512, width=728, bg="white")

  i = 1
  M = 0
  while (i <= length(L)) {
    M = max(c(M, L[[i]]))
    i = i + 1
  }

  # the first two entries of each vector are average +/- sd, plotted at x = 0
  X = c(0, 0, 1:(length(L[[1]]) - 2))

  plot(X, L[[1]], xlab="sample", ylab="maximum Error", main=t,
       type="o", col=Col[1], ylim=c(0,M), pch=22)
  i = 2
  while (i <= length(L)) {
    lines(X, L[[i]], type="o", col=Col[i], pch=22)
    i = i + 1
  }
  dev.off()
}

Col = c("red", "blue", "black", "yellow", "green", "orange", "gray")
L1 = compare(50, 5000)
L2 = compare(50, 5000, "uniform")


The FastMMWR package shines here: if you want to check whether or not your algorithm can afford to lose a little accuracy in exchange for a faster multiplication, this package provides the tools to make that decision. Knowledge is a good thing (sorry Martha, I know it is your trademark catch phrase).