```r
library(kableExtra)
```
Spotify uses a methodology to recommend songs to its listeners that differs from the methods studied in class in two aspects:

- The data is implicit
- The recommendation has been stated as an optimization problem
At a high level, Spotify’s case might seem similar to those already looked at in class around movie recommendation. For those recommenders we built matrices with each combination of users and the movies they have rated. To obtain this data, users have to proactively rate movies, say on a scale of 1 to 5. Because users provide the rating themselves, this is known as explicit data. The data shown below is an example of this type of data.
The MovieLens dataset is a good example of this kind of data.
```r
# Load a sample of the MovieLens ratings (long format: one row per rating)
# and display it as a scrollable table.
data <- read.csv("movieLens.csv", header = FALSE)[1:40, ]
colnames(data) <- c('user_id', 'item_id', 'rating', 'timestamp')
data %>% kable() %>% kable_styling() %>% scroll_box(width = "800px", height = "400px")
```
| user_id | item_id | rating | timestamp |
|---|---|---|---|
| 196 | 242 | 3 | 881250949 |
| 186 | 302 | 3 | 891717742 |
| 22 | 377 | 1 | 878887116 |
| 244 | 51 | 2 | 880606923 |
| 166 | 346 | 1 | 886397596 |
| 298 | 474 | 4 | 884182806 |
| 115 | 265 | 2 | 881171488 |
| 253 | 465 | 5 | 891628467 |
| 305 | 451 | 3 | 886324817 |
| 6 | 86 | 3 | 883603013 |
| 62 | 257 | 2 | 879372434 |
| 286 | 1014 | 5 | 879781125 |
| 200 | 222 | 5 | 876042340 |
| 210 | 40 | 3 | 891035994 |
| 224 | 29 | 3 | 888104457 |
| 303 | 785 | 3 | 879485318 |
| 122 | 387 | 5 | 879270459 |
| 194 | 274 | 2 | 879539794 |
| 291 | 1042 | 4 | 874834944 |
| 234 | 1184 | 2 | 892079237 |
| 119 | 392 | 4 | 886176814 |
| 167 | 486 | 4 | 892738452 |
| 299 | 144 | 4 | 877881320 |
| 291 | 118 | 2 | 874833878 |
| 308 | 1 | 4 | 887736532 |
| 95 | 546 | 2 | 879196566 |
| 38 | 95 | 5 | 892430094 |
| 102 | 768 | 2 | 883748450 |
| 63 | 277 | 4 | 875747401 |
| 160 | 234 | 5 | 876861185 |
| 50 | 246 | 3 | 877052329 |
| 301 | 98 | 4 | 882075827 |
| 225 | 193 | 4 | 879539727 |
| 290 | 88 | 4 | 880731963 |
| 97 | 194 | 3 | 884238860 |
| 157 | 274 | 4 | 886890835 |
| 181 | 1081 | 1 | 878962623 |
| 278 | 603 | 5 | 891295330 |
| 276 | 796 | 1 | 874791932 |
| 7 | 32 | 4 | 891350932 |
From this data we can build the matrix. The matrix will have all the users and the ratings given by them to certain movies. But Spotify doesn’t ask its users to rate songs. Instead it tracks which songs a user has listened to. So the matrix is composed of zeros and ones, denoting whether the user has or hasn’t listened to a particular song.
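As a rough sketch in base R (using only the `data` frame loaded above, nothing from the talk), the long table of ratings can be pivoted into such a user-by-item matrix and then collapsed into the zero/one implicit version:

```r
# Pivot the long ratings table into a users x items matrix (NA = not rated).
users <- sort(unique(data$user_id))
items <- sort(unique(data$item_id))
rating_matrix <- matrix(NA_real_, nrow = length(users), ncol = length(items),
                        dimnames = list(users, items))
rating_matrix[cbind(match(data$user_id, users),
                    match(data$item_id, items))] <- data$rating

# Implicit version: 1 if the user interacted with the item, 0 otherwise.
implicit_matrix <- ifelse(is.na(rating_matrix), 0, 1)
```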
But if we try to use any of the techniques already studied, such as collaborative filtering, we wouldn’t be able to account for the number of times a user listened to a song, which is really what stands in for the rating of that song. If a user listens to a song many times, that is equivalent to giving it a high rating. For this reason Spotify has proposed a different approach to recommendations.
In contrast with many machine learning techniques such as linear regression, logistic regression, and neural networks, recommendation algorithms have not been stated as optimization problems. In linear regression, for example, we are minimizing the error between a line and our data, which has not been the case in the recommenders seen so far. But what if we stated recommendation this way? This is what Spotify has done.
What they have done is represent the original matrix as, approximately, the product of two lower-rank matrices: one built from user vectors, the other from song vectors.
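In matrix-factorization terms (the notation here is mine, not from the talk), each user $u$ gets a latent vector $x_u$ and each song $i$ a latent vector $y_i$, and every entry of the matrix is approximated by their inner product:

$$
r_{ui} \approx x_u^{\top} y_i, \qquad x_u, y_i \in \mathbb{R}^{f}
$$

where $f$, the number of latent factors, is much smaller than the number of users or songs.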
Now that we have approximated the matrix by a set of user vectors and a set of song vectors, the task at hand is to find the vectors that best approximate the matrix. To be clear, the same statement can be made for explicit data with ratings.
So what is left is to determine a way to find these vectors. To do this, an error function is defined to quantify the difference between any given set of vectors and the matrix. To find the vectors, this error has to be minimized.
This is similar to many ML techniques. And as in those techniques, regularization is added to avoid overfitting.
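Written out (again in my own notation), the regularized squared-error objective is:

$$
\min_{x_*,\, y_*} \; \sum_{u,i} \left(r_{ui} - x_u^{\top} y_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)
$$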
Andrew Ng’s machine learning course is almost a classic and explains how to handle regularization very well: https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN
But stated as is, if the original matrix contains only zeros and ones, we would be missing the “rating” given by the number of times a user listens to a song. For this reason they added a weight to each term.
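A standard way to write this weighted objective for implicit feedback (this is the formulation from Hu, Koren and Volinsky’s implicit-feedback paper, which matches the description in the talk) is to keep a zero/one preference $p_{ui}$ and multiply each term by a confidence $c_{ui}$ that grows with the play count $r_{ui}$:

$$
\min_{x_*,\, y_*} \; \sum_{u,i} c_{ui}\left(p_{ui} - x_u^{\top} y_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right), \qquad c_{ui} = 1 + \alpha\, r_{ui}
$$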
So how do they calculate both sets of vectors? It turns out they almost use brute force to do this. Basically they fix one set of vectors, say the song vectors, solve the least-squares regression, and calculate the other set. Then the resulting vectors are fixed and the first set is calculated again by solving the regression problem. This is done back and forth until a minimum is found.
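Below is a toy sketch of this alternating scheme in R, using the `implicit_matrix` built earlier. It is unweighted and treats every entry of the matrix as observed, so it only illustrates the back-and-forth idea, not Spotify’s production algorithm:

```r
set.seed(42)
M      <- implicit_matrix   # users x songs matrix of zeros and ones from above
f      <- 2                 # number of latent factors
lambda <- 0.1               # regularization strength

X <- matrix(rnorm(nrow(M) * f), nrow(M), f)   # one row (vector) per user
Y <- matrix(rnorm(ncol(M) * f), ncol(M), f)   # one row (vector) per song

for (iter in 1:20) {
  # Fix the song vectors Y and solve a regularized least-squares problem for X.
  X <- t(solve(t(Y) %*% Y + lambda * diag(f), t(Y) %*% t(M)))
  # Fix the user vectors X and solve the same kind of problem for Y.
  Y <- t(solve(t(X) %*% X + lambda * diag(f), t(X) %*% M))
}

# Approximation of the original matrix; high values for songs a user hasn't heard
# are candidates to recommend.
M_hat <- X %*% t(Y)
```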
The presenter shows how this was done first using Hadoop and then moving to Spark. The biggest difference is moving from a disk-based system, Hadoop, to an in-memory one, Spark, where data can be placed in a cache. The presenter doesn’t go into details, but the minimum can also be found with techniques such as gradient descent.
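For reference, Spark’s MLlib ships an ALS implementation with an implicit-feedback mode, which can be called from R through sparklyr. The sketch below assumes a local Spark installation and a hypothetical `plays_df` data frame with `user_id`, `song_id` and `plays` (play count) columns; it is only meant to show where this algorithm lives in Spark, not how Spotify runs it:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# plays_df is hypothetical: one row per (user, song) with the number of listens.
plays_tbl <- copy_to(sc, plays_df, "plays", overwrite = TRUE)

# implicit_prefs = TRUE switches ALS to the weighted implicit-feedback objective
# sketched above; alpha scales the confidence given to play counts.
als_model <- ml_als(
  plays_tbl,
  rating_col     = "plays",
  user_col       = "user_id",
  item_col       = "song_id",
  implicit_prefs = TRUE,
  alpha          = 40,
  rank           = 10,
  reg_param      = 0.1
)
```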