Fast Matrix Multiplication R Package



My work experience taught me one thing: people use a lot of different languages and environments; they thrive where one application can be done in different ways. In previous projects, I wrote fast kernels in C and  then wrote different interfaces: R, Java, Python, and Perl.  The top reason is that the basic functionality did not change, during the years we have changed the environments and the languages.  So to keep up with the new products, to keep a single repository of the kernels,  and finally to keep to a minimum the code maintenance, we spent quite some time learning how to write interfaces. Fortunately, most languages are very friendly to C kernels.

R is one of my preferred languages and interpreter. It is easy to use. It is a functional language. It has “lotsa”  of statistical tools. Friendly environment in creating packages and to create your own C kernels. Today, I can say I accomplished the creation of the first FastMMWR package: making possible to use FastMMW algorithms on top of GotoBLAS2 on my A8 AMD machine.

R and the R-interface is simple to create and to use. However, the standard copy of data  from R to C and vice versa, is a overwhelming: the speed up obtained by FastMMW algorithm is almost obliterated by the cost of preparing the input and the outputs. This is because I wanted to use a R wrapper. It will be more efficient to use the .C() interface directly without copy of the data.

But there is a silver lining to have the complete set of FastMMW algorithms and their R interface: R uses double and double complex matrices and thus the regular matrix multiplication  C <- A %*% B is basically of these two types. We could speed up the process by casting double to single and double-complex to single complex. This down cast can be done using the R copy interface, and then call the appropriate function.

If FastMMWR will not really shine for performance, why bother to have an interface and a package?

Because we can create error analysis tests for FastMMW and create beautiful graph in a matter of minutes:








For example, let us consider the Figure above.  We create 50 examples of matrices and B from a uniform distribution between 0 ans 1. The matrices have size 5000×5000. We create positive matrices in double precision and we compute the product reference in double precision. Then, we cast the input in single precision and performed the same product but in single precision (SGEMM). We ran also Winograd algorithm (SWGEMM), Strassen algorithm (SSGEMM) and our fastest algorithm (SBGEMM) where B stands for Bodrato, Marco (the original creator of the algorithm). All FastMMW algorithms have a single recursion because the break even point for this architecture is 3000×3000.

So our reference is in double precision and the comparables are four single precision algorithms. For this architecture, Winograd’s and Strassen’s algorithms have the same error on an average.  In my experience, this is new and interesting because in previous experiments Winograd’s algorithms had always better maximum error.  On average, the fast algorithms for one recursion has a multiplicative error factor of 1.5 only. The theoretical factor for Strassen is 3 and for Winograd is 4. We loose half of a decimal point.







If instead of creating positive matrices, we create matrices from the normal distribution with average 0 and variance 1, the plot is different from before but overall the error is as expected: that is, Strassen’s has smaller maximum error than Winograd’s algorithm. However, the Strassen’s has a multiplicative factor of 1.5 or less. Winograd’s factor is about 3.

The amazing thing is the speed in writing the test, collecting the results, and plotting the results. This is the code:

compare <- function(n,MM,type="normal") {

  Col =c("red", "blue", "black", "yellow", "green","orange","gray")

  dr = vector("numeric",n)
  dw = vector("numeric",n)
  db = vector("numeric",n)
  ds = vector("numeric",n)
  sr = vector("numeric",n)
  sw = vector("numeric",n)
  sb = vector("numeric",n)

  i = 1
  while (i<=n) {
    as = 1.0
    bs =1.0

    if (type =="normal")  {
       AS = array(rnorm(MM*MM),dim=c(MM,MM))
       BS = array(rnorm(MM*MM),dim=c(MM,MM))
    else {
      AS = array(runif(MM*MM,0,1),dim=c(MM,MM))
      BS = array(runif(MM*MM,0,1),dim=c(MM,MM))

    print (c(dim(AS),dim(BS)))

    CD =AS%*%BS

    CS = s_mm_leaf_computationR(as,AS,bs,BS)
    sr[i] =  max(abs(CD-CS))

    CS = s_wmpipeR(as,AS,bs,BS)
    sw[i] =  max(abs(CD-CS))

    CS = s_BMOWR_PIPER(as,AS,bs,BS)
    sb[i] =  max(abs(CD-CS))

    CDC = s_smpipeR(as,AS,bs,BS)
    dw[i] = max(abs(CD-CDC))  

    i = i +1

  L = list(sr,sw,sb,dw)
  i = 1
  while (i<=length(L)) {
    average = sum(L[[i]])/length(L[[i]])
    var = sqrt(sum((L[[i]]-average)^2)/length(L[[i]]))

    T = c(average+var,average-var,L[[i]])
    L[[i]] = T

    print (c(i,average,var))
    i = i + 1



plotL <- function(L,l,Col,t,f) {

  png(filename=f, height=512, width=728, bg="white")

  i =1
  M = 0
  while (i<length(L)) {
    M =max(c(M,L[[i]]))
    i =i+1    

  X = c(0,0,1:(length(L[[1]])-2))

  i = 1
  plot(X,L[[i]],xlab="sample", ylab="maximum Error", main=t, type="o", col=Col[i], ylim=c(0,M),pch=22) <- lm(L[[i]] ~ X)
  i = 2
  while (i<=length(L)) {
    lines(X,L[[i]], type="o", col=Col[i], ylim=c(0,M),pch=22) <- lm(L[[i]] ~ X)
    i = i +1   


L1 = compare(50,5000)
L2 = compare(50,5000,"uniform")


The FastMMWR package shines here: if you want to check whether or not you algorithm can afford to loose a little of accuracy by means of a faster algorithm, this package provides the tools to make such a decision. Knowledge is a good thing (sorry Martha, I know it is your trademark catch phrase).