Introduction

The goal of this assignment is to adapt one of my recommendation systems to work with Apache Spark and compare the performance with my previous iteration. We will use MovieLense recommendation systems from project 2 and compare it with the model built on Apache Spark.

load libraries

## -- Attaching packages ------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: arules
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy
## 
## Attaching package: 'sparklyr'
## The following object is masked from 'package:purrr':
## 
##     invoke

Dataset

We will use the MovieLense dataset. The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data set contains about 100,000 ratings (1-5) from 943 users on 1664 movies.

load dataset

data("MovieLense")

Recommender System

For project 2, we built item-based collaborative filter and user-based collaborative filter models. However, sparklyr library does not offer such algorithms. Therefore, we will implement ALS method for comparing results.

I will use the k-fold method to split the data. Then fit the ALS model and make prediction. Lastly, I will extract accuracy of the model to compare with the model built on Spark.

# start 


# split dataset using k-fold method
set.seed(100)
scheme <- MovieLense %>% 
  evaluationScheme(method = "cross", k = 5, given = 15, goodRating = 3)
tic()
# fit the model
als_model <- Recommender(getData(scheme, "train"), method = "ALS")

t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)

tic()
# make prediction
prediction <- predict(als_model, getData(scheme, "known"), type = "ratings")

# end 
t <- toc(quiet = TRUE)
prediction_time <- round(t$toc - t$tic, 2)


# get accuracy score
evaluation <- calcPredictionAccuracy(prediction, getData(scheme, "unknown"))

Spark Recommender System

We will build a model on Apache Spark. Apache Spark is open source, general-purpose distributed computing engine used for processing and analyzing a large amount of data.

We will open a connection to spark and then copy the dataset over. Sparklyr package accepts only data frame and numeri variables so we will convert them to numeric types. We will split the data into training and testing data and the fit the model. Finally, we will make predictions and copy the results to R.

set.seed(100)
sdf_MovieLense <- MovieLense %>% 
  as(. , "data.frame") %>% 
  mutate(user = as.numeric(user),
         item = as.numeric(item)) 

sc <- spark_connect(master = "local")


# copy data
sdf_rating_matrix <- sdf_copy_to(sc, sdf_MovieLense, "sdf_rating_matrix", overwrite = TRUE)

# split dataset into training and testing 

partitioned <- sdf_rating_matrix %>% 
  sdf_random_split(training = 0.8, testing = 0.2)

# fit the model
tic()
sdf_als_model <- ml_als(partitioned$training)

t <- toc(quiet = TRUE)
spark_train_time <- round(t$toc - t$tic, 2)

tic()
# prediction
spark_prediction <- ml_transform(sdf_als_model, partitioned$testing) %>% collect()

# end 
t <- toc(quiet = TRUE)
spark_prediction_time <- round(t$toc - t$tic, 2)

# disconnect
spark_disconnect(sc)

Model Evaluation

To evaluate the performances of the models, we will compare the RMSE of the models. We will also compare the prediction and model building time on both envrinoments.

# rmse for both models
rmse1 <-  evaluation[[1]]
rmse_spark <- RMSE(spark_prediction$rating, spark_prediction$prediction)

 
kable(cbind(rmse1, rmse_spark), col.names = c("Recommenderlab", "sparklyr")) %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(c("RMSE" = 2))
RMSE
Recommenderlab sparklyr
0.9833423 0.9246919
time <- c(train_time, spark_train_time )
time <- rbind(time,c(prediction_time, spark_prediction_time))

rownames(time)<- c("Train Time", "Prediction Time")
colnames(time) <- c("recommenderlab", "sparklyr")
 
kable(time) %>% 
  kable_styling("striped", full_width = F) 
recommenderlab sparklyr
Train Time 0.10 14.25
Prediction Time 110.87 3.67

Conclusion

For this assignment, we used a dataset of 100 thousand ratings and built two recommender systems. We built one using Recommederlab library and another using sparklyr, a distributed general-purpose cluster-computing framework. We used the same data set to build the model. Based on the table above we see that RMSE score for the model built on Spark is lower than the recommendar lab model. Fitting the model using recommederlab is alot faster than spark. However, prediction time on spark is alot faster. For a real time system, like a costumer facing site, an implementation using spark might be the preffered over recommenderlab because of sparks fast prediction time and better RMSE.