The goal of this assignment is to adapt one of the previously built recommendation systems to work with Apache Spark and to compare its performance with the earlier iteration, considering both the efficiency of the system and the added complexity that Spark introduces.
Data set: MovieLens. Source: https://grouplens.org/datasets/movielens/
An ALS model will be built and predictions will be made using the recommenderlab package and Spark (via sparklyr).
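The code below assumes the following packages are attached (dplyr and tidyr for reshaping, recommenderlab and sparklyr for the two implementations, tictoc for the timings):
library(dplyr)
library(tidyr)
library(recommenderlab)
library(sparklyr)
library(tictoc)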
# reading data
ratings <- read.csv("https://raw.githubusercontent.com/olgashiligin/DATA_612/master/project_5/ratings.csv")
# transforming to a wide format (one row per user, one column per movie)
data <- ratings %>%
  select(movieId, userId, rating) %>%
  spread(movieId, rating)
# converting the data set into a realRatingMatrix (dropping the userId column)
movie_matrix <- as(as.matrix(data[, -1]), "realRatingMatrix")
# splitting the data into train and test sets (5 ratings per test user are given to the model)
esf <- evaluationScheme(movie_matrix, method = "split", train = 0.9, given = 5, goodRating = 3)
train <- getData(esf, "train")
test_known <- getData(esf, "known")      # the 5 given ratings per test user
test_unknown <- getData(esf, "unknown")  # held-out ratings for evaluation
# training ALS model
tic()
final_model <- Recommender(train, method = "ALS")
train_time <- toc(quiet = TRUE)
# making top-10 predictions for the test users (from their known ratings)
tic()
final_prediction <- predict(final_model, test_known, n = 10, type = "topNList")
predict_time <- toc(quiet = TRUE)
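Timing aside, accuracy on the held-out ratings can be checked with recommenderlab's calcPredictionAccuracy; a minimal sketch, predicting full ratings rather than top-N lists:
pred_ratings <- predict(final_model, test_known, type = "ratings")
calcPredictionAccuracy(pred_ratings, test_unknown)  # reports RMSE, MSE and MAE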
# installing Spark locally (one-time step)
# spark_install()
# connecting to a local Spark instance
s_con <- spark_connect(master = "local")
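For larger data sets the connection can be tuned through spark_config(); a sketch, where the memory value and the s_con_big name are illustrative:
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "4G"  # illustrative value; adjust to the machine
s_con_big <- spark_connect(master = "local", config = conf)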
# Split for training and testing (75%/25%)
spark_df <- ratings
smp_size <- floor(0.75 * nrow(spark_df))
# setting the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(spark_df)), size = smp_size)
train <- spark_df[train_ind, ]
test <- spark_df[-train_ind, ]
# moving data frames to Spark
spark_train <- sdf_copy_to(s_con, train, "train_ratings", overwrite = TRUE)
spark_test <- sdf_copy_to(s_con, test, "test_ratings", overwrite = TRUE)
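The split above happens in R before the copy; for data that is already big, it is cheaper to copy once and split inside Spark. A sketch using sdf_random_split (the spark_train2/spark_test2 names are illustrative):
ratings_tbl <- sdf_copy_to(s_con, ratings, "ratings_all", overwrite = TRUE)
splits <- sdf_random_split(ratings_tbl, training = 0.75, test = 0.25, seed = 123)
spark_train2 <- splits$training
spark_test2 <- splits$test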
# building ALS model
tic()
model <- ml_als(spark_train,
                max_iter = 5,
                nonnegative = TRUE,
                rating_col = "rating",
                user_col = "userId",
                item_col = "movieId")
train_time_spark <- toc(quiet = TRUE)
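The call above leaves the latent rank and regularization at their defaults; a sketch of the hyperparameters that typically matter (the values shown are illustrative, not tuned for this data):
model_tuned <- ml_als(spark_train,
                      rank = 10,                     # number of latent factors
                      reg_param = 0.1,               # regularization strength
                      max_iter = 10,
                      nonnegative = TRUE,
                      cold_start_strategy = "drop",  # drop NaN predictions for unseen users/items
                      rating_col = "rating",
                      user_col = "userId",
                      item_col = "movieId")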
# predicting ratings on the training set (ml_predict is lazy; printing below forces the computation)
tic()
sparkPred <- ml_predict(model, spark_train)
head(sparkPred)
## # Source: spark<?> [?? x 5]
## userId movieId rating timestamp prediction
## <int> <int> <dbl> <int> <dbl>
## 1 273 12 1 835860711 2.06
## 2 294 12 1 966597190 1.53
## 3 492 12 3 863976249 3.10
## 4 599 12 1.5 1519181787 1.63
## 5 380 12 4 1493668065 2.62
## 6 217 12 3 955945336 1.98
predict_time_spark <- toc(quiet = TRUE)
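For an accuracy check comparable to the recommenderlab evaluation, the held-out test set can be scored with Spark's regression evaluator; a sketch, assuming NaN cold-start predictions are filtered out first:
test_pred <- ml_predict(model, spark_test) %>%
  filter(!is.nan(prediction))  # users/movies unseen in training predict NaN by default
ml_regression_evaluator(test_pred,
                        label_col = "rating",
                        prediction_col = "prediction",
                        metric_name = "rmse")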
# top 10 movies recommended for each user
ml_recommend(model, type = "items", n = 10)
## # Source: spark<?> [?? x 4]
## userId recommendations movieId rating
## <int> <list> <int> <dbl>
## 1 70 <list [2]> 96004 5.64
## 2 70 <list [2]> 33649 5.51
## 3 70 <list [2]> 8477 5.35
## 4 70 <list [2]> 7025 5.20
## 5 70 <list [2]> 171495 5.20
## 6 70 <list [2]> 7748 5.19
## 7 70 <list [2]> 7767 5.15
## 8 70 <list [2]> 27156 5.09
## 9 70 <list [2]> 6442 5.04
## 10 70 <list [2]> 95780 5.01
## # … with more rows
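The recommendations above live in Spark; to use them in R (e.g. to join on movie titles), they can be pulled down with collect(), as in this sketch:
top10 <- ml_recommend(model, type = "items", n = 10) %>%
  select(userId, movieId, rating) %>%
  collect()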
# Disconnect
spark_disconnect(s_con)
## NULL
# comparing elapsed times (in seconds) for the two implementations
m1 <- cbind(train = train_time$toc - train_time$tic,
            predict = predict_time$toc - predict_time$tic)
m2 <- cbind(train = train_time_spark$toc - train_time_spark$tic,
            predict = predict_time_spark$toc - predict_time_spark$tic)
timing <- rbind(m1, m2)
rownames(timing) <- c("RecommenderLab", "Spark")
timing
## train predict
## RecommenderLab 0.009 176.773
## Spark 3.858 1.614
Spark was somewhat slower at the training stage (3.9 s vs. 0.01 s) but dramatically faster at prediction (1.6 s vs. 176.8 s). The lopsided recommenderlab numbers suggest that its ALS Recommender defers the actual factorization until predict() is called, so nearly all of the work lands in the prediction step. The Spark implementation is also harder to set up (installation, connection, moving data back and forth), so this added complexity should be weighed against the size of the data: for a small data set like this one Spark brings little benefit, but for data too large to process on a single machine it becomes essential.