# Scoring as Sparse Matrix Multiplication

This is my first time working on sparse matrix multiplication and I must say that the project was quite something and it is worth sharing it. Let us do this in a top-down fashion. Let us start by describing the problem.

Scoring

Consider we are watching a YouTube’s video about the 2014 Kawasaki ZX14R (I like this one in particular https://www.youtube.com/watch?v=QGwlVdKpS1s). At the beginning of the video it appears a banner ad or a 30 sec pre-runner. We may click it, kill it or let it be. The Ad is tailored to our history, our  preferences,  where we live  or the time of the day; it is also tailored to the campaign that YouTube is paid to deliver.

Before the delivery of the ad, this can be a bid request and as such a third party may bid and win it, taking control of the ad contents that is ultimately delivered. The same awesome ZX14R will attract different viewers that will be matched to different ads thus to different campaigns.

Consider that the bid is simply summarized by a set of xs: for example, the time of the day, the browser type, the IP network. These Xs are called features and, if we are logged in YouTube, they can be quite complex and they can be very personal (age, income, gender, purchase of a ZX14R).

YouTube may have thousands of campaign running concurrently and probably we may fit the audience targeting of about 10 to 20 campaign at any time. Each campaign may compete with several others in bidding for our attention.  We are summarized by xs, the campaign is summarized by betas. Assume we have one campaign, say campaign 1, and one bid request, we can quantify the match by a probability function

In particular, the relation is quite interesting and well known as logistic regression

This is a dot product where the element in the betas matches exactly the right element in the x. If we are targeted by multiple campaigns, we may represent this a vectors

Which is a matrix-by-Vector computation.  Now there are thousands of us watching this video or something like this at any time.

This is a sparse matrix multiplication. Each row of betas represents a model, which is associated with a campaign. Each model is composed of different dimensions or attributes and each bid, each x, may intersect only partially with the model. Thus the computations is sparse. The result matrix is a probability measure of why the i-th campaign should be matched with the j-th bid. A threshold can be applied to create discrete values eventually and then we can argue about an approach in breaking ties.

So far so good, this is after all matrix multiplication: at least the last computation of this process. Before the MM computation, there must be the computation of the Betas.
This is beyond the scope of this post (and this site). I will share   reference and it is fascinating the theory and the practice of building these models: I loved this statistical and machine learning part  of the project.

The core of the project is the ability to perform the sparse matrix multiplication fast enough that we can make an intelligent matching between campaign and user in the allocated time to submit a bid for the impression or impressions. The decision has to be made in the order of milliseconds per bid to make sure that we can watch our video as soon as possible without long waiting time because the ad does not load.

The latency is important but the throughput is key: YouTube may have thousands of campaigns running and millions of users connected at any time. Of course, they will count the money to the bank, but this is the basic requirement for any Ad Targeting company in the market.

Three facets: the machines, the models, and performance

For this project, we wanted to use the Intel Phi. This is the first generation PCA connected system based on 50 something cores, with 4 threads each for a total of 200 cores. The cores are designed for this architecture from scratch and based on Pentium architecture.  The peak performance is in the 4TFLOPS ball park. We had one card  without fan for about 5K USD (yes you read it right). Because this guy has a 300Watts consumption at full load, we had to stick a fan to force cool air into the card in such a way we could actually use it. We do not show the artisan job.

To develop the code, we built a system based on Westemere CPUs. The picture above shows the 2-processor system with 12 real core plus 12 other multithreaded cores (24). We added as well a AMD GPU (another 4 TFLOPS)  and we wanted to test what system was the most efficient for our production system.

This built has a cost of 5K USD.  We install CentOS ans we had a 9TFLOPS machine literally under my desk in office. The Fan noise was acceptable but I did not keep it always on.

We then purchased the license for the Intel compiler (1K USD) and finally we were ready to code for the Phi.  The one advantage of this configuration is that we can write normal code that we can test on the 2 processor system, scale it up to 24 cores. Then we need to use the same code and scale it to 200 for the phi. There is no specific code for the Intel System, the compiler, or better Matteo Frigo’s legacy, takes the C code and create different executables.

Per se, the code is not that interesting and we can skip it.

The models: the construction of the betas.

All impressions delivered by the company reside(d) in a dedicated DB. I wanted to build as many models as I could. I could work with about 120 campaign and thus I could build as many models. I deployed the machine above and two other smaller ones: in practice I could investigate and build 18 models  at any time (10 + 4 +4). For this process multithreads were detrimental.

This process could take as long as one day.

Performance:

There are two type of performance: speed and accuracy. I can clearly say that the speed was disappointing.

We collected 26Million bids and used 102 Models. We moved all models and impressions into the Phi and scored them. We did the same on the Westemeres. After all the efforts the two systems had the same throughput. Considering that the impressions would have to move through  the PCA, the latency is not really comparable

The Full story is available in the following paper: ctr.