Blue duo Ellesmere vs. Red duo Fiji

To compare the two cards, we measure the performance of each single GPU and of different ways to fire up the computation. We take the following sequence of four operations. This is a re-edit of the post, basically with better and more consistent performance plots.

 C0 += A*B;
 C1 += A*B;
 C2 += A*B;
 C3 += A*B;

GPUs 0 and 1 are the Ellesmere pair and GPUs 2 and 3 are the Fiji pair. We show performance plots for the firing sequences 0-0-0-0, 1-1-1-1, and 0-1-0-1 for the blue Ellesmere, and 2-2-2-2, 3-3-3-3, and 2-3-2-3 for the red Fiji. No data is re-used in this experiment, so all performance numbers account for the memory movement, although you can imagine that A and B do not really need to go back and forth. We also show a parallel computation 0-1-2-3.
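The firing sequences can be sketched as a simple dispatch schedule. A minimal sketch in Python, assuming a hypothetical gemm(gpu_id, op_index) callback that enqueues one C_i += A*B on the given device (the names are mine, not clBLAS's):

```python
def fire(schedule, gemm):
    """Run C0..C3 += A*B, sending operation i to GPU schedule[i]."""
    for i, gpu in enumerate(schedule):
        gemm(gpu, i)  # enqueue C_i += A*B on device `gpu`

# Example: record which device would run each of the four accumulations
# for the alternating blue-pair sequence 0-1-0-1.
log = []
fire([0, 1, 0, 1], lambda gpu, i: log.append((i, gpu)))
```

The sequences 0-0-0-0 or 2-3-2-3 are just different `schedule` lists; the real experiment of course enqueues actual clBLAS GEMM calls instead of the toy callback.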

The code is the clBLAS code generated for the Fiji architecture. No new code was generated for the Ellesmere; I hope this slack will be addressed in the future, because clBLAS is now a “closed” project and I could not “autoGEMM” it. clBLAS for AMD GPUs is a fundamental stepping stone for so many applications that I hope we are going to see a Lazarus. Matrix sizes are in thousands (mille, or kilo).

For single precision, we see that having 16GB per GPU pays off for larger problem sizes, but not as much as I thought. In single precision, we have a small window between 20K and 25K; for single complex, the window is smaller, 15K-20K, and the effect even smaller. For larger or smaller problems, Fiji is superior. In single precision, a 16Kx16Kx16K problem fits in the 4GB memory, and for larger problems we use a Winograd algorithm (we could use a faster one) and we can see that we cope well. For problems larger than 30K, we need more than 16GB, and thus Ellesmere will lose further ground.

We can see that in single precision the peak performance for the Fiji system is 6TFLOPS and for Ellesmere it is 4TFLOPS. There is a practical 20-40% peak performance gap. All considered, this is pretty good.

Let us check the Double precision computation.

Now the Fiji is always better, even when we need software help. The peak performance gap is negligible: 400×2 + 300×2 GFLOPS ~ 1.4TFLOPS achievable, and we achieve 1.25TFLOPS, about 15% less.

Using the frequency and the number of compute units, we know that Fiji has 64 units at 1GHz while Ellesmere has 36 units at 1.23GHz, so on paper the Fiji should be about 44% faster. If we consider Watts, the two GPUs should have the same ops/Joule, which means that the running cost is the same.
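The on-paper gap is simple arithmetic (units × clock); a quick sanity check:

```python
# Back-of-the-envelope peak ratio from compute units x clock frequency.
fiji = 64 * 1.0          # 64 units at 1 GHz
ellesmere = 36 * 1.23    # 36 units at 1.23 GHz

ratio = fiji / ellesmere  # ~1.445, i.e., roughly 44-45% faster on paper
```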

So if you have the time to wait for the result, the blue Ellesmere is a good choice and, for certain problem sizes, it is even faster (considering that the extra memory allows more work to stay on the card). For single precision it is more efficient for a set of problem sizes (same speed, less power), and in general it copes performance-wise with Fiji just fine. Alas, I cannot measure the power consumption, and I assume the GPUs will consume the peak Watts. The blue card is quiet and efficient; it is a long card, and it has a smaller current draw that helps your budget. The air cooling solution penalizes one of the GPUs, and the card gets quite a bit hotter at the exhaust end, especially during double precision computations.

If we plug in an extra PSU and unleash all GPUs at the same time, we can achieve an empirical peak performance of 7-8 TFLOPS in single precision and 350GFLOPS in double-complex precision. Considering that we should have about 10TFLOPS (6+4) of achievable peak performance, we are short-changed of 2 TFLOPS. At this time, I will not linger on the subject. Double precision performance is consistent and additive: the double precision plot shows a peak at 1.2 TFLOPS (0.8 Fiji + 0.4 Ellesmere).

This is 2017, and I can match the performance achieved in 2004 by Apple using a G4 with the AltiVec velocity engine (which Google search consistently likes better).

I hope to see clBLAS for the newer generations of GPUs (ROCm has only single precision; that is, it does not have a full BLAS). I also hope to see a way to communicate across the GPU memories, with that capability available at the OpenCL level. These two wishes would make these dual-GPU cards so much more appealing. In the future, I will consider other OpenCL implementations of BLAS. Better code can do miracles.






Fiji ProDuo vs Hawaii 295×2: clBLAS preliminary performance.

What can you do with a few TeraFLOPS? I have got a beautiful (if a little noisy) Radeon 295×2. I plugged it into my second PCI port and ran a few experiments using clBLAS; that was my last post (below). Of course, the main limitation of my system is not the GPU, it is everything else. The 295×2 has a peak performance of 11 TFLOPS; however, I know that the data have to come from somewhere and the results often cannot stay tucked in the GPU. In my tests, I assume the data come from main memory and have to go back to it. I show that, of the 11TFLOPS, I can achieve 3TFLOPS using single precision. This is quite something: considering that the card consumes about 500W and I did not buy it to play games, I thought I had a little supercomputer that I could keep beside my monitor.

Recently I added the ProDuo card, based on the Fiji architecture, to my GPU arsenal. There are a few advantages (without considering the cost) versus the 295×2. The packaging allows a smaller card without an external fan (quieter), similar water cooling, reduced power consumption, and more compute units (from 80 to 120). All of these goodies use the same 28nm technology, so this is pure reorganization thanks to the high-bandwidth memory setup. In practice, the ProDuo is clearly a better card and it should replace the 295×2. Right?

[Plots: single precision and single-complex precision]
Due to my limited PSU (1KW), because my setup limits any card in the first slot, and because clBLAS has a new setup, I decided to re-run the experiments using the 295×2 and the ProDuo in the second slot (in OpenCL terms, the card takes the place of GPU1 and GPU2). These cards are designed for single precision computations: the ProDuo has a peak performance of 16TFLOPS and the 295×2 has 11 TFLOPS (sorry, I tend to repeat myself). The new clBLAS provides better kernels, and you can see that the 295×2 achieves 5TFLOPS and the ProDuo about 6TFLOPS. A good thing: I spent some time redesigning the experiments and re-ran the tests once again. Once again, the bottleneck is my system and the way I feed the data, but you can see that having two GPUs in the card allows a 2x speed-up (over a single GPU).

A note: the plots above and the ones that follow will have missing points, due to a few quirks in OpenCL that we are working on. Second, the cards have no modifications; they come directly from the box to the bench.

The 295×2 is still an awesome card, considering that the difference is 0.9 TFLOPS. On the other side, the ProDuo is 20% faster and 30% more energy efficient. I can actually plug 2 ProDuos into my system without any further change, but I cannot plug in the 295×2 and the ProDuo together. But what comes next is even more interesting.

[Plots: double precision and double-complex precision]

Yep. In double precision the ProDuo stays behind. Coming from general-purpose CPUs and their benchmarking, I expect a factor-of-two penalty going from single to double precision. Here, we can see that the factor is about 5. The Hawaii card can reach the 1 TFLOPS mark, which sounds good; Fiji has a 0.5 TFLOPS upper limit. So the 0.9 TFLOPS loss in single precision is a 0.4 TFLOPS gain in double precision. Indeed, life is the stage for a variety of compromises. In this case, I am not really sure whether it is an architectural difference or a matter of kernel deployment. We will have to investigate, but it will require some help and effort.

But for me the most interesting part comes now: I bought an extra PSU, and now I can feed electricity to both cards on the same board.


In the next post, I will have a PCI extension so that I can put the big cards into the 2nd and 3rd slots. I will lose bandwidth, but I should get the full potential from the Hawaii and Fiji GPUs. Currently the two Hawaii are a little constrained and I get the performance of a single GPU. With the extension, I should be able to see 5 GPUs (the Turks on the first slot, two Hawaii, and two Fiji). The system allows a three-way CrossFire.

[Plots: single precision and single-complex precision]

Now we have a heterogeneous system; in practice, we have 3 effective GPUs. The current experiments do not balance the workload as a function of each GPU's throughput, and thus the plots could be better, higher.

We can see that we earned yet another 1TFLOPS in single precision. The good news is that, even in my system, the ratio between problem size and computation time is such that more hardware provides better performance, and I can show it. Also, the introduction of more GPUs shows that the computation time becomes linear (the communication is the bottleneck). If I could unleash the fourth GPU, I would likely see little improvement. But for double precision the curves are a little different.

[Plots: double precision and double-complex precision]

The three GPUs (Hawaii in position 1, Fiji in positions 2 and 3) provide a scalable solution, but it is not always the best. The beauty of these plots is their complexity: considering the problem size and the configurations available, the best solution is not always straightforward.

The Future and motivations: 
At this time, my research is twofold:

First, I am investigating the field of deep learning for feature selection in advertising (yeah, my real job), and GPUs seem to be the hardware of choice, so I wanted to have a test bed close by my real bed. These new systems promise, and deliver, unprecedented performance.

Second, with the coming of age of AutoGemm in clBLAS, we start having a self-tuning BLAS for GPUs, and an open-source one at that; this is an opportunity to re-evaluate kernels written using Strassen's algorithm. In a system like mine, Strassen's algorithm can be really appreciated only if it is done at the kernel level: the computation performance plot is too steep to take advantage of a divide-and-conquer (from the CPU) approach.

Clearness, contents, contributions, and rejections

I have seen different facets of the presentation of new ideas. This is my collection of thoughts about a few hurdles thrown at any naive contributor of original content in the scientific field.

Point A: I am reading an old booklet, An Introduction to Mathematics by Alfred North Whitehead (1910); you can find it for a penny on Amazon, but some consider it among the top 50 books of scientific literature. There is a paragraph about mathematical clearness I shall quote here:

“…. In the second place it is necessary for research. It makes for clearness of thought, and thence for boldness of thought and for fertility in trying new combinations of ideas. When the initial statements are vague and slipshod, at every subsequent stage of thought common sense has to step in to limit applications and to explain meanings. Now in creative thought common sense is a bad master. Its sole criterion for judgement is that the new ideas shall look like the old ones. In other words it can only act by suppressing originality.”

This is the real reason for clarity, and it shows the hard consequences. I like that mathematical clearness is the foundation of creative thinking and not just a notational game: if I do not have to think all the time about the meaning of terms and ideas, my creative side will be able to tackle new ideas. This also requires some flexibility from the reader: if I write u and explain that it is a unit of measure, please let it be. If the reader (you) is more familiar with the term b instead, mine is just a shorthand. If my notation has to be exactly like yours to make sense to you, it sounds more like forcing me to conform to your common sense, and thus it is even less likely that I will convince you of a result that is truly original and is not by you.

Point B: I am a reviewer for a journal, and recently I rejected a paper. The authors presented two algorithms, A and B; they described them in fairly good detail and chose to present experiments only for one, B. Algorithm A did not have experiments because it was not appealing; it was not appealing because its implementation was poorly parallel w.r.t. B: one processor was doing 1/4 of the work no matter how many processors were used. In my review, I presented a different algorithm: that particular 1/4 of the work can be made highly parallel, the parallel algorithm is elegant and short to describe, and there is code for it. My concern was with the contents and what was missing: a better algorithm A. Strangely, if the authors had given me only B, I would never have bothered with it.

Sincerely, I did not fully understand the paper; that was not my reason for the rejection. What I did, unconsciously and consciously, was to focus on the contents and check whether B was better than A. Clearness should provide that, because it was the main goal of the paper. The authors may think that I am an idiot, that their paper is just fine, and they may be writing a rebuttal to me and the editor to reconsider the publication. If you are small potatoes, being reviewed sucks. If you are small potatoes and you are reviewing, you may get carried away.

Point C: At the beginning of 2013, I had been working on a project for a few months. In the second half of the year, I tried to collect my thoughts and results into a paper and submit it for publication. The next post will be about the contents of this publication. If you follow my posts, you may know that I am an experimental person; that is, observations and experiments are the first part of my work. Nowadays, I often communicate these results during the investigation to researchers who worked on the same subject, especially if my work may shed a different light on their work. Mostly I want to know if I did something obscenely wrong. Other researchers go to great lengths to avoid sharing or paying their dues (as B.B. King would say). Sometimes I even understand the reasons for not sharing, but I understand the reasons for sharing better.

For me, mathematical clearness is an attempt to organize and write down a process that is often scruffy, based on induction and sometimes intuition. Sometimes I start writing during the experiments, to guide me in finding the right questions and the right answers. More often the writing comes afterwards, when I have a good enough answer. The experiments are driven by curiosity: the desire to know and the disappointment of remaining an idiot (i.e., one who ignores). Though the writing is hard, I like seeing the white and empty paper becoming an article: an original contribution describing, in the best way I could, what I did, so I can share it. Without considering the publication process, this may take months. With the publication process, it may take years.

I submitted this work to an old and good journal where I have already published, but I knew the process would be tough. I believed, and still believe, it was the right course. During the preparation of the work, I thought I had the blessing of my predecessors (or at least of those who matter to the topic and to me). In fact, the paper was rejected. The reviews were mostly aimed at the writing and at how the writing cluttered the understanding of the material. In practice, clarity is the responsibility of the authors: a paper cannot be published if it is not clear. I rewrote the paper, wrote a detailed rebuttal, and submitted it somewhere else; it will be published shortly. This paper proves better bounds on the error of FastMMW: the last such result was published 40 years ago. Yep, creative thinking in action.

Conclusions: With experience and knowledge, I hope to reach the mathematical clearness that will help my successors and convince my contemporaries.






Self Installing Package: Fast Matrix Multiplication

In the last two weeks, I have been working on practical ways to improve the distribution of the FastMMW library: to make my code less of a research project and more like a software package that anyone can download, build, install, and use.

I am almost done testing an installation process written in Python. I chose Python because I am lazy and it is simple. I just finished writing a Python package so that anyone can use FastMMW; it provides a simple example and pictures, see the previous post.

I tested the installation process on my A8 machine, which is a 4-core single-processor system running Ubuntu 10.04. I am now running the installation on one of my multiprocessor systems, with two dual-core processors (Opteron), running Fedora 9, just to test the process on a completely different system: boy, it is always needed.

Per se, the installation process is not really interesting. The library comes with an INSTALL.txt file where I present the process step by step; I wanted to make the process somewhat automatic. For some architectures with multi-core multiprocessor systems, where cores are hardware threads, I recommend a little imagination and testing: do it manually! It is very instructive, and you will get a feeling for how the system distributes the computation by using the command “mpstat -P ALL”.

Let me not digress,  back to the installation process.

The idea is to create 4 libraries, one for each basic type: float, double, complex float, and complex double. Each library is built independently of the others, and this allows different optimizations.

Matrix addition optimization

MA is a two-dimensional loop code: the inner loop follows the inner dimension (i.e., columns for Fortran, Python, and R matrices, rows for C matrices). The computation has spatial locality, and it is trivial to parallelize (the outer loop). The first step is to measure the performance with different numbers of threads and different unrollings of the inner loop (exploiting register allocation).

I chose unrollings of U = [4, 8, 16, 32, 64], tried in order. With everything left alone but the unrolling, I expect performance to increase until register spill reduces it. For example, if I measure a drop in performance between 16 and 32, I will not test 64.

I then choose a thread allocation. For example, if I have 4 cores [0,1,2,3], I test the following configurations: L = {[0], [0,1], [0,1,2], [0,1,2,3]}.

When I have found the best configuration L and U, I estimate the break-even point where fast algorithms are as fast as Goto's or Whaley's. Let us call this estimate B.
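The search over U and L above can be sketched as follows. Here measure is a stand-in for the real timing routine of the MA kernel (it would return GFLOPS for a given thread count and unrolling); the toy model at the bottom is purely illustrative:

```python
def tune(cores, measure, unrolls=(4, 8, 16, 32, 64)):
    """Search thread counts and unroll factors; stop unrolling on a drop."""
    best, best_perf = None, 0.0
    for t in range(1, cores + 1):          # L = [0], [0,1], ..., [0..cores-1]
        prev = 0.0
        for u in unrolls:                  # tried in order
            perf = measure(t, u)
            if perf < prev:                # register spill: do not go further
                break
            prev = perf
            if perf > best_perf:
                best, best_perf = (t, u), perf
    return best, best_perf

# Toy model: performance grows with threads, peaks at unroll 16, then spills.
model = lambda t, u: t * (u if u <= 16 else 16 / u)
result = tune(4, model)
```

With this model, the search settles on all 4 cores and an unrolling of 16, and never evaluates unroll 64 because 32 already drops.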

Break even evaluation

With a first estimate B, I start a linear search for the problem size where Winograd's algorithm is as fast as the best GEMM. Once I have found the break-even point, I can create the library. I also create binary code with which the user can test the real performance.
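The linear search can be sketched like this; the two cost models are made-up stand-ins for real measurements (one classical GEMM, one Winograd recursion doing 7/8 of the multiplications plus an additive overhead for the extra matrix additions):

```python
def break_even(t_gemm, t_fast, start, step=100, limit=20000):
    """Walk the problem size up from `start` until the fast algorithm wins."""
    n = start
    while n <= limit and t_fast(n) >= t_gemm(n):
        n += step
    return n

# Illustrative cost models; the overhead constant is made up.
gemm = lambda n: n**3
fast = lambda n: 0.875 * n**3 + 7.5e6 * n

n0 = break_even(gemm, fast, start=1000)
```

Everything below n0 goes to the best GEMM, everything above to Winograd's algorithm; the estimate B simply gives the search a good starting point.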

A note of wisdom

The main advantage of breaking the installation up per type is the opportunity to have different L, U, and B as a function of the basic GEMM installed. In fact, on my A8 machine GotoBLAS has better performance than ATLAS, and somehow the configurations are different; in particular, L and U (the number of cores and the unrolling used for MA) are different for GotoBLAS and ATLAS.

This is no good: it means that the BLAS package installed affects the performance in subtle ways that are not really apparent. This is good: the little voice inside my head is always saying, “I told you there is something interesting even here”.

Once again, it seems I have concluded this post with one further open question. Good research often does feed itself.

If you would like to help test this new framework, please contact me; I will follow up after Oct 1, 2012.







Fast Matrix Multiplication Python package

After working on an R package, I wanted to work a little longer and see what it would take to write a Python package. I have to say it was fun, and actually easier than I expected and remembered. The process is mostly automatic, and without writing the manuals, descriptions, comments, and all the good stuff they teach in software engineering classes, it is pretty straightforward.

The R and Python packages share the same C interface, but each has its own way of using it. Using the numpy array object and its ability to represent the different types of arrays (float, double, complex float, and complex double), the interface is quite simple and quite nimble. We can pass large matrices by reference, without copies.

There are a few bugs to take care of, but I could make it run a simple example and run the usual error comparison, as done in the previous post. With positive matrices, and using double precision as the reference, I could compare the error committed:








The first thing I noticed was the striking difference from the picture presented in the previous post. Here, each algorithm has its own range without intersection: Strassen's algorithm has the worst maximum error (as expected), and the two Winograd algorithms present different errors. In contrast, in the R exercise, all the fast algorithms clustered in the same area with no great distinction. Using the same random number generator, but in the interval [-1,1], this is the plot:

That is very similar to the R exercise.
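The metric behind these comparisons takes only a few lines of numpy. A minimal sketch using smaller matrices, with plain single-precision multiplication as the comparable (this is the measurement, not the FastMMW kernels themselves):

```python
import numpy as np

# Positive matrices, as in the experiments above (smaller size for brevity).
rng = np.random.default_rng(0)
A = rng.uniform(0, 1, (500, 500))
B = rng.uniform(0, 1, (500, 500))

ref = A @ B                          # double-precision reference product
single = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

err = np.max(np.abs(ref - single))   # maximum absolute error, as plotted
```

Each fast algorithm gets the same treatment: compute its single-precision result, subtract the double-precision reference, and keep the maximum absolute entry.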

Having two different environments (actually three: R, Python, and C) will make it easier to investigate the nuances of the algorithms' numerical properties. Just playing with these, I already have questions about the code generators and the effect of the differing results. Experiments are an integral part of science, and good experiments always beg more questions.

I will clean up the code and make these modifications and interfaces available in the next distribution.



Fast Matrix Multiplication R Package



My work experience taught me one thing: people use a lot of different languages and environments; they thrive where one application can be done in different ways. In previous projects, I wrote fast kernels in C and then wrote different interfaces: R, Java, Python, and Perl. The top reason is that the basic functionality did not change, while over the years we changed the environments and the languages. So, to keep up with the new products, to keep a single repository of kernels, and finally to keep code maintenance to a minimum, we spent quite some time learning how to write interfaces. Fortunately, most languages are very friendly to C kernels.

R is one of my preferred languages and interpreters. It is easy to use. It is a functional language. It has “lotsa” statistical tools. It is a friendly environment for creating packages and for plugging in your own C kernels. Today, I can say I accomplished the creation of the first FastMMWR package: making it possible to use the FastMMW algorithms on top of GotoBLAS2 on my A8 AMD machine.

The R interface is simple to create and to use. However, the standard copying of data from R to C and vice versa is overwhelming: the speed-up obtained by the FastMMW algorithms is almost obliterated by the cost of preparing the inputs and the outputs. This is because I wanted to use an R wrapper; it would be more efficient to use the .C() interface directly, without copying the data.

But there is a silver lining to having the complete set of FastMMW algorithms and their R interface: R uses double and double-complex matrices, and thus the regular matrix multiplication C <- A %*% B is basically of these two types. We can speed up the process by casting double to single and double-complex to single-complex. This down-cast can be done using the R copy interface, before calling the appropriate function.

If FastMMWR does not really shine for performance, why bother with an interface and a package?

Because we can create error analysis tests for FastMMW and create beautiful graphs in a matter of minutes:








For example, let us consider the figure above. We create 50 pairs of matrices A and B from a uniform distribution between 0 and 1. The matrices have size 5000×5000. We create positive matrices in double precision and compute the reference product in double precision. Then, we cast the inputs to single precision and perform the same product in single precision (SGEMM). We also ran Winograd's algorithm (SWGEMM), Strassen's algorithm (SSGEMM), and our fastest algorithm (SBGEMM, where B stands for Bodrato, Marco, the original creator of the algorithm). All FastMMW algorithms use a single recursion, because the break-even point for this architecture is 3000×3000.

So our reference is in double precision and the comparables are four single-precision algorithms. For this architecture, Winograd's and Strassen's algorithms have the same error on average. In my experience, this is new and interesting, because in previous experiments Winograd's algorithms always had a better maximum error. On average, the fast algorithms with one recursion have a multiplicative error factor of only 1.5. The theoretical factor for Strassen is 3 and for Winograd is 4. We lose half of a decimal digit.







If, instead of creating positive matrices, we create matrices from the normal distribution with mean 0 and variance 1, the plot is different from before, but overall the error is as expected: that is, Strassen's has a smaller maximum error than Winograd's algorithm. Moreover, Strassen's has a multiplicative factor of 1.5 or less, while Winograd's factor is about 3.

The amazing thing is the speed in writing the test, collecting the results, and plotting them. This is the code:

compare <- function(n, MM, type="normal") {

  sr = vector("numeric", n)   # SGEMM
  sw = vector("numeric", n)   # Winograd
  sb = vector("numeric", n)   # Bodrato
  dw = vector("numeric", n)   # Strassen

  i = 1
  while (i <= n) {
    as = 1.0
    bs = 1.0

    if (type == "normal") {
      AS = array(rnorm(MM*MM), dim=c(MM,MM))
      BS = array(rnorm(MM*MM), dim=c(MM,MM))
    } else {
      AS = array(runif(MM*MM,0,1), dim=c(MM,MM))
      BS = array(runif(MM*MM,0,1), dim=c(MM,MM))
    }

    print(c(dim(AS), dim(BS)))

    CD = AS %*% BS                               # double-precision reference

    CS = s_mm_leaf_computationR(as, AS, bs, BS)  # SGEMM
    sr[i] = max(abs(CD - CS))

    CS = s_wmpipeR(as, AS, bs, BS)               # Winograd (SWGEMM)
    sw[i] = max(abs(CD - CS))

    CS = s_BMOWR_PIPER(as, AS, bs, BS)           # Bodrato (SBGEMM)
    sb[i] = max(abs(CD - CS))

    CDC = s_smpipeR(as, AS, bs, BS)              # Strassen (SSGEMM)
    dw[i] = max(abs(CD - CDC))

    i = i + 1
  }

  L = list(sr, sw, sb, dw)
  i = 1
  while (i <= length(L)) {
    average = sum(L[[i]])/length(L[[i]])
    var = sqrt(sum((L[[i]]-average)^2)/length(L[[i]]))

    T = c(average+var, average-var, L[[i]])      # prepend mean +/- deviation
    L[[i]] = T

    print(c(i, average, var))
    i = i + 1
  }
  L
}

plotL <- function(L, l, Col, t, f) {

  png(filename=f, height=512, width=728, bg="white")

  i = 1
  M = 0
  while (i <= length(L)) {
    M = max(c(M, L[[i]]))
    i = i + 1
  }

  X = c(0, 0, 1:(length(L[[1]])-2))

  i = 1
  plot(X, L[[i]], xlab="sample", ylab="maximum Error", main=t, type="o",
       col=Col[i], ylim=c(0,M), pch=22)
  i = 2
  while (i <= length(L)) {
    lines(X, L[[i]], type="o", col=Col[i], pch=22)
    i = i + 1
  }

  dev.off()
}

Col = c("red", "blue", "black", "yellow", "green", "orange", "gray")
L1 = compare(50, 5000)
L2 = compare(50, 5000, "uniform")


The FastMMWR package shines here: if you want to check whether or not your algorithm can afford to lose a little accuracy in exchange for a faster algorithm, this package provides the tools to make such a decision. Knowledge is a good thing (sorry, Martha, I know it is your trademark catchphrase).

OpenCL + APU + GPU + Fast Matrix Multiply

Next month, I will be going to Bellevue to present a short work about matrix multiplication and heterogeneous computing at the AMD Fusion Developer Symposium. I had better prepare my presentation, because it will be recorded and made available. I just posted the paper online; search my bio or Google the title above.

I have tested several architectures during these last seven years, since I started working on Winograd's fast matrix multiplication (fastMM). Nowadays, it is the age of GPUs as computational accelerators, and I did not have much hands-on experience. So, to learn something new, last year at this time I started considering working on APUs. In fact, during a conversation with Matteo Frigo, the topic of processors with cores and a GPU came up: the so-called APU. At least, AMD calls them APUs. The underlying question was whether or not fast MM could be used for APUs. The short answer is yes, but this is not my point for today.

In my previous posts, I talked about the recent work on MM and GPUs. Winograd's algorithm can be applied to GPUs, and it is only a question of problem size. It may also be beneficial for error analysis; yes, you read me. But APUs are hybrid, and they are not a separate unit with astronomical GFLOPS: you may use the CPUs (e.g., 4) or the GPU. The A8 processor I had the luck to test provides this dichotomy, and each part may provide 100 GFLOPS; if you add an equivalent external GPU, as I did, you will have a system with about 200 GFLOPS peak performance for matrix multiplication.

This may look like a wimpy system (I have received this comment already). In fact, to publish anything today you may have to use a GPU with 1TFLOPS peak performance; otherwise, it is not supercomputing. I have a different opinion, which is based on the economy of the system: my build has the same performance as a Cell processor, with the same cost as a PS3 PlayStation and the same power consumption, and it is quiet enough that I can write this post using this build while listening to Ravel's Bolero. Think about a supercomputing centre without the annoying whistle, air conditioning, and electric bill. The washing machine is making more noise.

From the beginning, one year ago, I built a system (a new thing for me), I learned a new framework, I looked into the code written for ATI's GPUs, and finally I started looking at how I could write fast MM for this parallel system. Having the GFLOPS/$ of a system built on a Cell, there is no need to brag about performance. An APU can take advantage of the software written for x86_64 architectures, with which I am very familiar, and of the style of programming for GPUs. Differently from a Cell, I do not have to write new code for this, and I can reuse the optimal code already written. I used OpenCL to glue everything together. It required some time due to the learning curve, but if I could do it, any younger developer will be able to, actually faster.

Once I was ready, my interest and curiosity moved from the development of fast MM (i.e., Winograd's) to understanding the new challenges of such a system. Especially, I was very curious about using all these computational units: the CPUs, the internal GPU, and the external GPU. I wanted to write code that exploits the system but is easy to maintain, in case I plug out any of the chips.

The conclusions of the paper read as follows:

4.4. Conclusions
In our system, the APU provides a software solution using only CPUs that can achieve 90GFLOPS (GotoBLAS). If we would like to improve performance by just working on a different and fast algorithm, we can achieve 120 GFLOPS. If we take advantage of both GPUs, we can achieve sustainable performance of about 160 GFLOPS (and a peak performance of 200 GFLOPS). This is a first attempt in putting together a OpenCL solution for the implementation of MM using hybrid parallel
systems. The heterogeneous system presents interesting challenges and, thanks to the OpenCL API, ultimately a flexible and powerful environment

The communications to and from memory are the bottleneck of the system. In theory, that is, without communication costs, we could use the CPUs, the internal GPU, and the external GPU concurrently. Once communication costs are included, the utilisation of the system is lower.
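A toy cost model makes this concrete: effective throughput is the FLOP count divided by compute time plus transfer time. The peak and bus numbers below are illustrative, not measurements from my system:

```python
def effective_gflops(n, peak_gflops, bus_gb_s):
    """Effective GFLOPS for an n x n GEMM whose data crosses the bus."""
    flops = 2 * n**3                 # multiply-add count for GEMM
    bytes_moved = 3 * n * n * 4     # A and B in, C out, single precision
    t = flops / (peak_gflops * 1e9) + bytes_moved / (bus_gb_s * 1e9)
    return flops / t / 1e9

# Larger problems amortise the transfers, so utilisation improves with n.
lo = effective_gflops(1000,  peak_gflops=200, bus_gb_s=8)
hi = effective_gflops(16000, peak_gflops=200, bus_gb_s=8)
```

The model never reaches the 200 GFLOPS peak, and small problems sit well below it, which is the qualitative behaviour the sustained 160 GFLOPS figure reflects.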

Please take a look at the paper, OpenCL+APU+GPU+Fast Matrix Multiply, or Google it.