The goal of this project is to give you practice beginning to work with a distributed recommender system. It is sufficient for this assignment to build out your application on a single node.
Adapt one of your recommendation systems to work with Apache Spark and compare the performance with your previous iteration. Consider the efficiency of the system and the added complexity of using Spark. You may complete the assignment using PySpark (Python), SparkR (R), sparklyr (R), or Scala.
Please include in your conclusion: For your given recommender system’s data, algorithm(s), and (envisioned) implementation, at what point would you see moving to a distributed platform such as Spark becoming necessary?
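# recommenderlab supplies the data and the ALS baseline; sparklyr is the R
# interface to Apache Spark; the rest are data-wrangling and display helpers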
library(recommenderlab)
library(sparklyr)
library(dplyr)
library(tidyr)
library(kableExtra)
library(data.table)
library(ggplot2)
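# MovieLense: the MovieLens 100k ratings data shipped with recommenderlab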
data("MovieLense")
Create the training and test data sets
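# 80/20 split over users: each test user reveals "given" ratings (three fewer
# than the smallest user profile, so every user qualifies) and the rest are
# withheld for evaluation; a rating of 3 or above counts as a good rating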
eval_movies <- evaluationScheme(data=MovieLense, method="split", train=.80,
given=min(rowCounts(MovieLense))-3, goodRating = 3)
Build an Alternating Least Squares (ALS) recommender system using recommenderlab, capturing system run times at the start and end of the model build and the prediction
m1_s <- Sys.time()
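# fit ALS on the training users, then predict ratings from the test users'
# "known" items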
rlab_m <- Recommender(getData(eval_movies, "train"), method = "ALS")
rlab_pred <- predict(rlab_m, getData(eval_movies, "known"), type = "ratings")
m1_f <- Sys.time()
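# named vector of RMSE, MSE, and MAE against the withheld "unknown" ratings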
m1_RMSE <- calcPredictionAccuracy(rlab_pred, getData(eval_movies, "unknown"))
Build an Alternating Least Squares (ALS) recommender system using Spark, capturing system run times at the start and end of the model build and the prediction
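# connect to a local, single-node Spark instance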
spark_c <- spark_connect(master = "local")
# Spark's ALS needs numeric user and item IDs, so build a title -> ID lookup
# from the row order of MovieLenseMeta
movie_id <- data.frame(item = MovieLenseMeta$title, title_ID = as.numeric(row.names(MovieLenseMeta)))
# convert the training split into a long (user, item, rating) data frame
s_train <- getData.frame(getData(eval_movies, "train"))
# note: as.numeric() on a factor returns its level codes rather than the
# original labels; as.numeric(as.character(s_train$user)) would keep the
# true user IDs
s_train$user <- as.numeric(s_train$user)
s_train <- left_join(s_train, movie_id, by = "item")
spark_train <- copy_to(spark_c, s_train, overwrite = TRUE)
# repeat for the test users' "known" ratings
s_test <- getData.frame(getData(eval_movies, "known"))
s_test$user <- as.numeric(s_test$user)
s_test <- left_join(s_test, movie_id, by = "item")
spark_test <- copy_to(spark_c, s_test, overwrite = TRUE)
glimpse(spark_train)
## Rows: ??
## Columns: 4
## Database: spark_connection
## $ user <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ item <chr> "Contact (1997)", "Kull the Conqueror (1997)", "Full Monty...
## $ rating <dbl> 4, 2, 4, 3, 3, 4, 3, 2, 3, 2, 4, 4, 4, 3, 5, 5, 5, 1, 3, 3...
## $ title_ID <dbl> 258, 266, 269, 270, 271, 272, 286, 288, 289, 292, 294, 300...
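sparklyr prints ?? for the row count because the table lives in Spark and is only evaluated lazily; sdf_nrow() (a standard sparklyr helper, shown here as an optional sanity check) would force an exact count:
sdf_nrow(spark_train)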
m2_s <- Sys.time()
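# fit Spark ALS (5 iterations) on the training table, score the test table,
# and collect() the predictions back into R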
spark_m <- ml_als(spark_train, rating_col = "rating", user_col = "user", item_col = "title_ID", max_iter = 5)
spark_pred <- ml_transform(spark_m, spark_test) %>% collect()
m2_f <- Sys.time()
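As an aside, the RMSE could also be computed inside Spark rather than collecting the predictions into R. The sketch below was not part of the timed run; ml_regression_evaluator() and ml_als()'s cold_start_strategy argument are standard sparklyr features, with "drop" removing NaN predictions for users or items unseen during training.
# sketch: Spark-side RMSE with cold-start predictions dropped
spark_m2 <- ml_als(spark_train, rating_col = "rating", user_col = "user",
                   item_col = "title_ID", max_iter = 5,
                   cold_start_strategy = "drop")
ml_transform(spark_m2, spark_test) %>%
  ml_regression_evaluator(label_col = "rating", metric_name = "rmse")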
Comparing the RMSE and runtimes of each, Spark trains and predicts more than four times faster here, though at the cost of a somewhat higher RMSE. For massive datasets it is imperative to run on a distributed system, as it is not only more time-efficient but also more cost-effective.
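# gather RMSE and elapsed runtime for the two approaches into one table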
df <- data.frame(RMSE = c(m1_RMSE[1],RMSE(spark_pred$rating, spark_pred$prediction)),
Runtime = c(m1_f-m1_s, m2_f-m2_s))
rownames(df) <- c("Recommender Lab", "Spark")
df
## RMSE Runtime
## Recommender Lab 0.9977414 76.86428 secs
## Spark 1.2334842 17.16736 secs
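# shut down the local Spark session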
spark_disconnect(spark_c)