I found the symposium refreshing and I had a lot of fun. My presentation slides and the paper are available here on this site if you like; also, the talk was recorded and is available on demand through the digital symposium framework.
It was fun because, in the experience pavilion, I could see things done instead of just talked about. The food was good. The coffee was plentiful. The people were interesting and sharp. The technical sessions may have less academic appeal than, let’s say, the famous conferences. But you could feel that there is a sincere attempt to create a community around these new machines and the tools to develop software for them.
I personally liked it, even though I attended for just two days. Unfortunately, I could not stay longer.
About Matrix Multiplication, there were three technical sessions, mine included: mostly GPU-related talks with incredible performance. My session was the only one that addressed all computational engines and presented a little about Fast Matrix Multiplication. I am happy that I refrained from presenting yet another implementation of FastMMW for this architecture. My main goal was to make a case for an algorithmic solution for heterogeneous computing of MM. Presenting Fast MM would have been a step too far. Let us learn to walk before we run.
An attendee was kind enough to share his opinion that Winograd’s algorithm is not worth pursuing, because of its performance and because of its numerical instability. First, performance: the new GPUs have a single multiply-add operation in hardware, which halves the number of operations issued and makes any fast algorithm less appealing. Second, the instability was not really explained, but it was firmly embedded in the person’s mind.
In response to the performance caveat: in the past, I worked with architectures where multiply-add instructions were sped up in hardware by avoiding register file access between the multiplication and the addition, in practice taking fewer cycles for the operation and supposedly improving accuracy. On those machines, I applied Winograd successfully. Matt Badin currently has an implementation for an NVIDIA GPU with fused multiply-add where Fast MM is applicable. I would say that the benefit is architecture dependent, but a hardware multiply-add is not by itself a good reason to dismiss fast algorithms.
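To make the performance argument a little more concrete, here is a rough instruction-count sketch. It assumes the Strassen-Winograd variant with 7 recursive products and 15 matrix additions per recursion step, it charges the classic kernel n^3 fused multiply-adds, and it ignores memory traffic and everything else a real GPU cares about. The function names are mine, for illustration only.

```python
def classic_instructions(n, fma=True):
    """Arithmetic instructions for a classic n x n matrix multiply.

    With a fused multiply-add, each of the n^3 inner-loop steps is one
    instruction; without it, we count n^3 multiplies and n^3 - n^2 adds.
    """
    if fma:
        return n ** 3
    return n ** 3 + (n ** 3 - n ** 2)


def winograd_instructions(n, levels, fma=True):
    """Instruction count for `levels` recursion steps of Winograd's variant.

    Assumption: each step does 7 half-size products plus 15 additions of
    (n/2) x (n/2) matrices, then falls back to the classic kernel.
    """
    if levels == 0 or n % 2 != 0:
        return classic_instructions(n, fma)
    half = n // 2
    return 7 * winograd_instructions(half, levels - 1, fma) + 15 * half * half


n = 1024
for levels in range(4):
    ratio = winograd_instructions(n, levels) / classic_instructions(n)
    print(f"levels={levels}  fast/classic instruction ratio = {ratio:.3f}")
```

Under these assumptions, one recursion step on a 1024 x 1024 problem still issues roughly 12% fewer arithmetic instructions than the classic algorithm even when every multiply-add is fused. The fused instruction halves the classic count, but it halves the count inside the seven recursive products as well. Whether that saving survives the extra additions and data movement is exactly the architecture-dependent part.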
In response to the instability caveat: if you want to ignore the work I did, just Google “fast matrix multiply is stable” and argue with Prof. Demmel and, before him, with Prof. Higham. Please, leave me alone. However, let me ask this question: did anyone measure the error of the GPU implementation of MM?
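For anyone who wants to take that question literally, here is a minimal sketch of what measuring the error could look like. It is not a GPU experiment: it is plain Python that emulates single precision by rounding every intermediate result through a 4-byte float, and it uses a double precision product as the reference. The helper names (`to_single`, `matmul`, `max_abs_error`) and the matrix size are my assumptions for illustration.

```python
import random
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matmul(a, b, rnd=lambda x: x):
    """Classic triple-loop product; `rnd` is applied after every operation."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc = rnd(acc + rnd(a[i][p] * b[p][j]))
            c[i][j] = acc
    return c


def max_abs_error(c, ref):
    """Largest entry-wise absolute difference between two matrices."""
    return max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


random.seed(0)
n = 32
# Inputs are already single precision values, so both runs see the same data.
a = [[to_single(random.uniform(-1, 1)) for _ in range(n)] for _ in range(n)]
b = [[to_single(random.uniform(-1, 1)) for _ in range(n)] for _ in range(n)]

reference = matmul(a, b)               # double precision accumulation
single = matmul(a, b, rnd=to_single)   # emulated single precision
print("max absolute error:", max_abs_error(single, reference))
```

The same comparison applies to any implementation, classic or fast, CPU or GPU: compute the product the way the implementation does, compute a higher precision reference on the same inputs, and report the largest difference. Without that number for the classic GPU kernel, there is nothing to compare a fast algorithm's error against.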