After watching the Spark conference video on how Spotify uses Spark, I became interested in how ALS (alternating least squares) can be used to make predictions for users.
From the video, prior experience, and new research, I found several advantages to ALS:
ALS works by using matrix factorization. Matrix factorization takes a matrix \(A_{m \times n}\) and finds two other matrices, \(U_{m \times k}\) and \(P_{k \times n}\), whose product approximately equals \(A\).
This is done by first initializing \(U\) and \(P\) with random values, then holding one constant while updating the other and comparing the product \(UP\) to the known values in \(A\).
Once the updates no longer lead to further improvements, the process is stopped and the final matrix, \(A' = UP\), is compared to the existing values in \(A\).
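To make the alternating updates concrete, here is a minimal sketch in base R. It is not the recommenderlab implementation or the Spark one from the talk. It assumes the common regularized least-squares formulation, where missing ratings are `NA` and each update is a small ridge regression over the observed entries only. The function name `als_sketch`, its arguments, and the toy matrix are all made up for illustration.

```r
# Minimal ALS sketch (illustrative only; names and defaults are hypothetical).
# A: ratings matrix with NA for missing entries
# k: number of latent factors, lambda: ridge penalty (an assumption here)
als_sketch <- function(A, k = 2, lambda = 0.1, n_iter = 20, seed = 1234) {
  set.seed(seed)
  m <- nrow(A)
  n <- ncol(A)
  observed <- !is.na(A)

  # Random starting values for both factors
  U <- matrix(runif(m * k), nrow = m, ncol = k)  # m x k
  P <- matrix(runif(k * n), nrow = k, ncol = n)  # k x n

  # Squared error on observed entries plus the ridge penalty
  cost <- function() {
    sum((A - U %*% P)[observed]^2) + lambda * (sum(U^2) + sum(P^2))
  }
  costs <- cost()

  for (iter in seq_len(n_iter)) {
    # Step 1: hold P constant, solve for each row of U
    for (i in seq_len(m)) {
      idx <- which(observed[i, ])
      if (length(idx) == 0) next
      Pi <- P[, idx, drop = FALSE]
      U[i, ] <- solve(Pi %*% t(Pi) + lambda * diag(k), Pi %*% A[i, idx])
    }
    # Step 2: hold U constant, solve for each column of P
    for (j in seq_len(n)) {
      idx <- which(observed[, j])
      if (length(idx) == 0) next
      Uj <- U[idx, , drop = FALSE]
      P[, j] <- solve(t(Uj) %*% Uj + lambda * diag(k), t(Uj) %*% A[idx, j])
    }
    costs <- c(costs, cost())
  }
  list(U = U, P = P, approx = U %*% P, costs = costs)
}

# Toy 4 x 4 ratings matrix with a few missing values (made-up data)
A <- matrix(c( 5,  4, NA,  1,
               4, NA,  1,  1,
               1,  1, NA,  5,
              NA,  1,  5,  4), nrow = 4, byrow = TRUE)
fit <- als_sketch(A, k = 2)
round(fit$approx, 2)  # A' = UP, including guesses for the NA cells
fit$costs             # cost after each full iteration
```

Because each step solves its sub-problem exactly, the cost in `fit$costs` should never increase from one iteration to the next. In practice you would stop once the decrease becomes negligible rather than run a fixed number of iterations.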
In the video, much of the talk was dedicated to problems with implementing this with Spark.
This is very interesting but beyond the scope of this discussion. I was going to show my own PySpark implementation,
but it turns out Spark did not like my computer. Instead I’m using R’s recommenderlab.
This is a plug-and-play library that does it all for you. Full disclosure, the code below is copied verbatim from this website. I included it as an example.
library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
data(MovieLense)
scheme <- evaluationScheme(MovieLense, method="split", train=0.9, given=-5, goodRating=4)
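# Train one algorithm on the training split and report its prediction error
accuracy_table <- function(scheme, algorithm, parameter){
  # Fit the recommender on the training users
  r <- Recommender(getData(scheme, "train"), algorithm, parameter = parameter)
  # Predict ratings for test users from their "known" ratings
  p <- predict(r, getData(scheme, "known"), type="ratings")
  # Compare predictions against the held-out "unknown" ratings
  acc_list <- calcPredictionAccuracy(p, getData(scheme, "unknown"))
  total_list <- c(algorithm =algorithm, acc_list)
  total_list <- total_list[sapply(total_list, function(x) !is.null(x))]
  return(data.frame(as.list(total_list)))
}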
table_random <- accuracy_table(scheme, algorithm = "RANDOM", parameter = NULL)
table_ubcf <- accuracy_table(scheme, algorithm = "UBCF", parameter = list(nn=50))
table_ibcf <- accuracy_table(scheme, algorithm = "IBCF", parameter = list(k=50))
table_pop <- accuracy_table(scheme, algorithm = "POPULAR", parameter = NULL)
table_ALS_1 <- accuracy_table(scheme, algorithm = "ALS",
                              parameter = list( normalize=NULL, lambda=0.1, n_factors=200,
                                                n_iterations=10, seed = 1234, verbose = TRUE))
## Used parameters:
## normalize = NULL
## lambda = 0.1
## n_factors = 200
## n_iterations = 10
## min_item_nr = 1
## seed = 1234
## verbose = TRUE
## [1] "0th iteration: cost function = 234567.802886753"
## [1] "1th iteration, step 1: cost function = 219201.419286132"
## [1] "1th iteration, step 2: cost function = 209173.640547462"
## [1] "2th iteration, step 1: cost function = 200499.060344742"
## [1] "2th iteration, step 2: cost function = 193980.29851724"
## [1] "3th iteration, step 1: cost function = 188151.683064291"
## [1] "3th iteration, step 2: cost function = 183569.54912126"
## [1] "4th iteration, step 1: cost function = 179278.335628184"
## [1] "4th iteration, step 2: cost function = 174936.790473661"
## [1] "5th iteration, step 1: cost function = 169734.421170487"
## [1] "5th iteration, step 2: cost function = 165096.595903737"
## [1] "6th iteration, step 1: cost function = 161320.261699575"
## [1] "6th iteration, step 2: cost function = 157742.500282274"
## [1] "7th iteration, step 1: cost function = 154681.297374595"
## [1] "7th iteration, step 2: cost function = 152102.965757727"
## [1] "8th iteration, step 1: cost function = 150044.960735819"
## [1] "8th iteration, step 2: cost function = 148390.160518515"
## [1] "9th iteration, step 1: cost function = 147033.14083349"
## [1] "9th iteration, step 2: cost function = 145918.817904675"
## [1] "10th iteration, step 1: cost function = 144973.288106876"
## [1] "10th iteration, step 2: cost function = 144181.90082624"
rbind(table_random, table_pop, table_ubcf, table_ibcf, table_ALS_1)
##   algorithm              RMSE               MSE               MAE
## 1    RANDOM  1.35221499402814  1.82848539007453  1.05056156601822
## 2   POPULAR   1.0089118662917   1.0179031539442 0.795988545380187
## 3      UBCF  1.09697369627405  1.20335129031716 0.890989007984894
## 4      IBCF  1.39990648238184  1.95973815941471 0.985714285714286
## 5       ALS 0.951419431307476 0.905198934269441 0.757327141291289
It is interesting to see that the most accurate method is in fact ALS, which has the lowest RMSE, MSE and MAE of the five algorithms.
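For readers less familiar with the three error columns, here is a small sketch of how they are usually defined, using made-up rating vectors rather than MovieLense data. I am assuming recommenderlab follows these standard definitions. The table is at least consistent with them, since each MSE is the square of the matching RMSE (for ALS, 0.9514 squared is about 0.9052).

```r
# Hypothetical actual and predicted ratings (made-up values for illustration)
actual    <- c(4, 3, 5, 2, 1)
predicted <- c(3.5, 3, 4, 2.5, 2)

err  <- predicted - actual
mse  <- mean(err^2)     # mean squared error
rmse <- sqrt(mse)       # root mean squared error
mae  <- mean(abs(err))  # mean absolute error

c(RMSE = rmse, MSE = mse, MAE = mae)
```

RMSE and MSE penalize large misses more heavily than MAE does, so an algorithm can rank differently depending on which column you read.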
Also, I am now going to spend the rest of my day trying to reinstall Spark.