The goal of this project is to give you practice beginning to work with a distributed recommender system. It is sufficient for this assignment to build out your application on a single node.
Adapt one of your recommendation systems to work with Apache Spark and compare the performance with your previous iteration. Consider the efficiency of the system and the added complexity of using Spark. You may complete the assignment using PySpark (Python), SparkR (R), sparklyr (R), or Scala.
Please include in your conclusion: For your given recommender system’s data, algorithm(s), and (envisioned) implementation, at what point would you see moving to a distributed platform such as Spark becoming necessary?
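# recommenderlab supplies the data and the ALS baseline; sparklyr is the R
# interface to Apache Spark; the rest are data-wrangling and display helpers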
library(recommenderlab)
library(sparklyr)
library(dplyr)
library(tidyr)
library(kableExtra)
library(data.table)
library(ggplot2)
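# MovieLense: the MovieLens 100k ratings data shipped with recommenderlab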
data("MovieLense")
Create the training and test data sets
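# 80/20 split over users: each test user reveals "given" ratings (three fewer
# than the smallest user profile, so every user qualifies) and the rest are
# withheld for evaluation; a rating of 3 or above counts as a good rating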
eval_movies <- evaluationScheme(data=MovieLense, method="split", train=.80,
given=min(rowCounts(MovieLense))-3, goodRating = 3)
Build an Alternating Least Squares (ALS) recommender system using recommenderlab, capturing system run times at the start and end of the model build and the prediction
m1_s <- Sys.time()
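# fit ALS on the training users, then predict ratings from the test users'
# "known" items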
rlab_m <- Recommender(getData(eval_movies, "train"), method = "ALS")
rlab_pred <- predict(rlab_m, getData(eval_movies, "known"), type = "ratings")
m1_f <- Sys.time()
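# named vector of RMSE, MSE, and MAE against the withheld "unknown" ratings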
m1_RMSE <- calcPredictionAccuracy(rlab_pred, getData(eval_movies, "unknown"))
Build an Alternating Least Squares (ALS) recommender system using Spark, capturing system run times at the start and end of the model build and the prediction
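# connect to a local, single-node Spark instance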
spark_c <- spark_connect(master = "local")
# Spark's ALS needs numeric user and item IDs, so build a title -> ID lookup
# from the row order of MovieLenseMeta
movie_id <- data.frame(item = MovieLenseMeta$title, title_ID = as.numeric(row.names(MovieLenseMeta)))
# convert the training split into a long (user, item, rating) data frame
s_train <- getData.frame(getData(eval_movies, "train"))
# note: as.numeric() on a factor returns its level codes rather than the
# original labels; as.numeric(as.character(s_train$user)) would keep the
# true user IDs
s_train$user <- as.numeric(s_train$user)
s_train <- left_join(s_train, movie_id, by = "item")
spark_train <- copy_to(spark_c, s_train, overwrite = TRUE)
# repeat for the test users' "known" ratings
s_test <- getData.frame(getData(eval_movies, "known"))
s_test$user <- as.numeric(s_test$user)
s_test <- left_join(s_test, movie_id, by = "item")
spark_test <- copy_to(spark_c, s_test, overwrite = TRUE)
glimpse(spark_train)
## Rows: ??
## Columns: 4
## Database: spark_connection
## $ user <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ item <chr> "Contact (1997)", "Kull the Conqueror (1997)", "Full Monty...
## $ rating <dbl> 4, 2, 4, 3, 3, 4, 3, 2, 3, 2, 4, 4, 4, 3, 5, 5, 5, 1, 3, 3...
## $ title_ID <dbl> 258, 266, 269, 270, 271, 272, 286, 288, 289, 292, 294, 300...
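sparklyr prints ?? for the row count because the table lives in Spark and is only evaluated lazily; sdf_nrow() (a standard sparklyr helper, shown here as an optional sanity check) would force an exact count:
sdf_nrow(spark_train)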
m2_s <- Sys.time()
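# fit Spark ALS (5 iterations) on the training table, score the test table,
# and collect() the predictions back into R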
spark_m <- ml_als(spark_train, rating_col = "rating", user_col = "user", item_col = "title_ID", max_iter = 5)
spark_pred <- ml_transform(spark_m, spark_test) %>% collect()
m2_f <- Sys.time()
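As an aside, the RMSE could also be computed inside Spark rather than collecting the predictions into R. The sketch below was not part of the timed run; ml_regression_evaluator() and ml_als()'s cold_start_strategy argument are standard sparklyr features, with "drop" removing NaN predictions for users or items unseen during training.
# sketch: Spark-side RMSE with cold-start predictions dropped
spark_m2 <- ml_als(spark_train, rating_col = "rating", user_col = "user",
                   item_col = "title_ID", max_iter = 5,
                   cold_start_strategy = "drop")
ml_transform(spark_m2, spark_test) %>%
  ml_regression_evaluator(label_col = "rating", metric_name = "rmse")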
Comparing the RMSE and runtimes of each, Spark trains and predicts more than four times faster here, though at the cost of a somewhat higher RMSE. For massive datasets it is imperative to run on a distributed system, as it is not only more time-efficient but also more cost-effective.
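# gather RMSE and elapsed runtime for the two approaches into one table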
df <- data.frame(RMSE = c(m1_RMSE[1],RMSE(spark_pred$rating, spark_pred$prediction)),
Runtime = c(m1_f-m1_s, m2_f-m2_s))
rownames(df) <- c("Recommender Lab", "Spark")
df
## RMSE Runtime
## Recommender Lab 0.9977414 76.86428 secs
## Spark 1.2334842 17.16736 secs
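# shut down the local Spark session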
spark_disconnect(spark_c)