library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box(), and the %>% pipe

Music Recommendations at Scale with Spark

Spotify uses a methodology to recommend songs to its listeners which differs from the methods studied in class in two aspects:

- Data is implicit
- The recommendation has been stated as an optimization problem

Explicit and Implicit data

At a high level, Spotify’s case might seem similar to those already looked at in class around movie recommendation. For those recommenders we built matrices with each combination of users and the movies they have rated. To obtain this data, users have to proactively rate movies, say on a scale of 1 to 5. Because users provide the rating themselves, this is known as explicit data. The data below is shown as an example of this type of data.

The MovieLens dataset is a good example of this kind of data.

# Read the first 40 MovieLens ratings and name the columns
data <- read.csv("movieLens.csv", header = FALSE)[1:40, ]
colnames(data) <- c('user_id', 'item_id', 'rating', 'timestamp')
# Display the ratings in a scrollable table
data %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "400px")
user_id item_id rating timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
291 1042 4 874834944
234 1184 2 892079237
119 392 4 886176814
167 486 4 892738452
299 144 4 877881320
291 118 2 874833878
308 1 4 887736532
95 546 2 879196566
38 95 5 892430094
102 768 2 883748450
63 277 4 875747401
160 234 5 876861185
50 246 3 877052329
301 98 4 882075827
225 193 4 879539727
290 88 4 880731963
97 194 3 884238860
157 274 4 886890835
181 1081 1 878962623
278 603 5 891295330
276 796 1 874791932
7 32 4 891350932

From this data we can build the matrix. The matrix will have all the users and the ratings given by them to certain movies. But Spotify doesn’t ask its users to rate songs. Instead it tracks which songs a user has listened to. So the matrix is composed of zeros and ones, denoting whether the user has or hasn’t listened to a particular song.
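As a small illustration, the sketch below builds such a 0/1 matrix from a made-up play log (the plays data frame and its column names are hypothetical, not Spotify’s actual schema):

library(dplyr)
library(tidyr)

# Hypothetical play counts: one row per user/song pair
plays <- data.frame(
  user_id = c(1, 1, 2, 2, 3),
  song_id = c(101, 102, 101, 103, 102),
  n_plays = c(5, 1, 2, 7, 3)
)

# Binary implicit-feedback matrix: 1 = the user has listened to the song, 0 = never
plays %>%
  mutate(listened = 1L) %>%
  select(user_id, song_id, listened) %>%
  pivot_wider(names_from = song_id, values_from = listened,
              values_fill = 0L, names_prefix = "song_")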

But if we try to use any of the techniques already studied, such as collaborative filtering, we wouldn’t be able to account for the number of times a user listened to a song, which is really what plays the role of the rating for that song. If a user listens to a song many times, that is equivalent to it having a high rating. For this reason Spotify has proposed a different approach to recommendations.

Optimization problem

In contrast with many machine learning techniques such as linear regression, logistic regression, and neural networks, the recommendation algorithms we have seen were not stated as optimization problems. In linear regression, for example, we are looking to minimize the error between a line and our data, which has not been the case in the recommenders so far. But what if we stated it this way? This is what Spotify has done.
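For reference, this is the kind of objective linear regression minimizes over the line’s parameters:

$$
\min_{\beta_0,\,\beta_1} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
$$

Spotify applies the same idea, stating the recommendation task as a function to be minimized.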

What they have done is represent the original matrix as the product of two lower-rank factor matrices: one containing a latent vector for each user, the other a latent vector for each song.
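In the usual matrix-factorization notation (a general form, not a formula quoted from the talk), this reads:

$$
R \approx X^{\top} Y, \qquad r_{ui} \approx x_u^{\top} y_i
$$

where $x_u$ is the latent vector for user $u$, $y_i$ is the latent vector for song $i$, and both vectors are much shorter than the number of users or songs.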

Now that we have approximated the matrix by user vectors and song vectors, the task at hand is to find the vectors that best approximate the matrix. To be clear, the same statement can be made for explicit data with ratings.

So what is left is to determine a way to find these vectors. To do this, an error function is defined to quantify the difference between any given vectors and the matrix. To find the vectors, this error has to be minimized.

This is similar to many ML techniques. And as in such techniques, regularization is added to avoid overfitting.
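For explicit ratings, that regularized error would look something like the standard least-squares objective below (a generic form; the talk does not spell out the exact formula):

$$
\min_{X,\,Y} \; \sum_{(u,i)\ \text{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2
\;+\; \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

where $\lambda$ controls how strongly the user and song vectors are penalized for growing large.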

Andrew Ng’s machine learning course is almost a classic and explains how to handle regularization very well: https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN

But stated as is, if the original matrix contains only 0s and 1s, we would be missing the rating implied by the number of times a user listens to a song. For this reason they added a weight.
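A common way to write this weighted version is the implicit-feedback formulation of Hu, Koren and Volinsky, on which approaches like this are typically based (again a standard form, not quoted from the talk): a confidence weight grows with the play count, so heavily played songs pull the approximation harder.

$$
\min_{X,\,Y} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
\;+\; \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad c_{ui} = 1 + \alpha\, n_{ui}
$$

where $p_{ui}$ is the 0/1 indicator that user $u$ listened to song $i$, $n_{ui}$ is the play count, and $\alpha$ is a tuning parameter.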

Alternating least squares

So how do they calculate both vectors? It turns out they almost use brute force. Basically they fix one of the two sets of vectors, solve the least-squares regression, and calculate the other one. Then the resulting vectors are fixed and the first set is recalculated, again by solving the regression problem. This is done back and forth until a minimum is found.
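A toy version of this alternation on a small dense ratings matrix, ignoring the confidence weights and treating every entry as observed for brevity (a sketch of the idea, not Spotify’s production code):

set.seed(42)
R <- matrix(sample(0:5, 20, replace = TRUE), nrow = 4)  # 4 users x 5 songs
k <- 2        # number of latent factors
lambda <- 0.1 # regularization strength

X <- matrix(rnorm(4 * k), nrow = k)  # user factors, one column per user
Y <- matrix(rnorm(5 * k), nrow = k)  # song factors, one column per song

for (iter in 1:20) {
  # Fix Y and solve a regularized least-squares problem for each user vector
  for (u in 1:ncol(X)) {
    X[, u] <- solve(Y %*% t(Y) + lambda * diag(k), Y %*% R[u, ])
  }
  # Fix X and solve for each song vector
  for (i in 1:ncol(Y)) {
    Y[, i] <- solve(X %*% t(X) + lambda * diag(k), X %*% R[, i])
  }
}

R_hat <- t(X) %*% Y  # reconstructed ratings; compare against R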

The presenter shows how this was done first using Hadoop and then moving to Spark. The biggest difference was moving from a disk-based system, Hadoop, to an in-memory one, Spark, where data can be kept in a cache between iterations. The presenter doesn’t go into the details, but the minimum can be found with techniques such as gradient descent.
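For reference, Spark’s MLlib exposes this implicit-feedback ALS directly, and it can be reached from R through sparklyr along the following lines (a sketch, assuming sparklyr’s ml_als() interface and a Spark table plays_tbl with user, song, and play-count columns; the table and column names are made up):

library(sparklyr)

sc <- spark_connect(master = "local")

# plays_tbl: a Spark DataFrame with columns user_id, song_id, n_plays (hypothetical)
als_model <- ml_als(
  plays_tbl,
  rating_col     = "n_plays",  # play counts feed the confidence weights
  user_col       = "user_id",
  item_col       = "song_id",
  rank           = 10,         # number of latent factors
  reg_param      = 0.1,        # regularization (lambda)
  implicit_prefs = TRUE,       # treat the data as implicit feedback
  alpha          = 40          # how quickly confidence grows with play count
)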