The goal of this project is to practice working with a distributed recommender system in Spark and then compare its performance with that of a non-distributed recommender system. The comparison considers both the efficiency of each system and the added complexity of using Spark.
# Load required packages
library(recommenderlab)
library(tidyverse)
library(sparklyr)
library(reshape2)
library(ggpubr)
library(psych)
library(purrr)
library(tictoc)
Both the movies and ratings datasets are taken from https://grouplens.org/datasets/movielens/latest/. Two versions of these datasets are available; the small version is chosen here because of the limited computing power available on my laptop.
# Load movies and ratings datasets
movies <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-612/master/movies.csv")
ratings <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-612/master/ratings.csv")
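Alternatively, the same files can be fetched straight from the official GroupLens archive rather than the GitHub mirror. A minimal sketch, assuming the ml-latest-small download URL is still current:
# Download and unpack the official ml-latest-small archive (URL assumed current)
tmp <- tempfile(fileext = ".zip")
download.file("https://files.grouplens.org/datasets/movielens/ml-latest-small.zip", tmp)
unzip(tmp, exdir = tempdir())
movies <- read.csv(file.path(tempdir(), "ml-latest-small", "movies.csv"))
ratings <- read.csv(file.path(tempdir(), "ml-latest-small", "ratings.csv"))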
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
head(ratings)
## userId movieId rating timestamp
## 1 1 1 4 964982703
## 2 1 3 4 964981247
## 3 1 6 4 964982224
## 4 1 47 5 964983815
## 5 1 50 5 964982931
## 6 1 70 3 964982400
The movies dataset contains 3 columns and 9,742 observations. The ratings dataset contains 4 columns and 100,836 observations.
We can see that the mean of the rating variable is 3.5, the standard deviation is 1.04, and the distribution is slightly left-skewed.
# Summary of movies and ratings datasets
str(movies)
## 'data.frame': 9742 obs. of 3 variables:
## $ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : Factor w/ 9737 levels "'71 (2014)","'burbs, The (1989)",..: 8895 4662 3676 9250 2979 3859 7348 8834 8159 3544 ...
## $ genres : Factor w/ 951 levels "(no genres listed)",..: 352 418 733 688 635 261 733 400 2 134 ...
str(ratings)
## 'data.frame': 100836 obs. of 4 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : int 1 3 6 47 50 70 101 110 151 157 ...
## $ rating : num 4 4 4 5 5 3 5 4 5 5 ...
## $ timestamp: int 964982703 964981247 964982224 964983815 964982931 964982400 964980868 964982176 964984041 964984100 ...
# Statistical summary of rating variable
describe(ratings$rating)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100836 3.5 1.04 3.5 3.57 0.74 0.5 5 4.5 -0.64 0.12 0
First of all, we have to convert the raw dataset into a matrix format that can be used for building recommendation systems through the recommenderlab package.
# Convert to rating matrix
ratings_matrix <- dcast(ratings, userId~movieId, value.var = "rating", na.rm = FALSE)
# remove userId column
ratings_matrix <- as.matrix(ratings_matrix[,-1])
# Convert rating matrix into a recommenderlab sparse matrix
ratings_matrix <- as(ratings_matrix, "realRatingMatrix")
ratings_matrix
## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
Each row of the ratings_matrix corresponds to a user, and each column corresponds to a movie id. There are 610 x 9724 = 5,931,640 possible user-movie combinations, so a dense matrix would require 5,931,640 cells. Since not every user has watched every movie and there are only 100,836 ratings, this matrix is sparse.
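This sparsity can be verified directly; a quick sketch using recommenderlab's accessors (the ratio is just 100,836 / 5,931,640):
# Fraction of the matrix that actually contains a rating (about 1.7%)
nratings(ratings_matrix) / (nrow(ratings_matrix) * ncol(ratings_matrix))
## [1] 0.01699968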
# Convert the ratings matrix into a vector
vec_ratings <- as.vector(ratings_matrix@data)
# Unique ratings
unique(vec_ratings)
## [1] 4.0 0.0 4.5 2.5 3.5 3.0 5.0 0.5 2.0 1.5 1.0
# Count the occurrences for each rating
table_ratings <- table(vec_ratings)
table_ratings
## vec_ratings
## 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5
## 5830804 1370 2811 1791 7551 5550 20047 13136 26818 8551
## 5
## 13211
Since a rating equal to 0 represents a missing value in the matrix, we remove all zeros before building a frequency plot to visualize the distribution of the ratings.
# Remove zero rating and convert the vector to factor
vec_ratings <- vec_ratings[vec_ratings != 0] %>% factor()
# Visualize through qplot
qplot(vec_ratings, fill = I("steelblue")) +
ggtitle("Distribution of the Ratings") +
labs(x = "Ratings")
# Search for the top 10 most viewed movies
most_views <- colCounts(ratings_matrix) %>% melt()
# Use the matrix column names (the actual movie ids), not sequential row ids
most_views <- tibble::rownames_to_column(most_views, "movieId") %>%
mutate(movieId = as.integer(movieId)) %>%
rename(count = value) %>%
merge(movies, by = "movieId") %>%
top_n(count, n = 10)
# Visualize the top 10 most viewed movies
ggplot(most_views, aes(x = reorder(title, count), y = count)) +
geom_bar(stat = "identity", fill = "lightblue") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
ggtitle("Top 10 Most Viewed Movies") +
theme(axis.title.x = element_blank())
# Average rating for each movie
avg_ratings_mv <- colMeans(ratings_matrix)
# Average rating for each user
avg_ratings_us <- rowMeans(ratings_matrix)
# Visualize the distribution of the average movie rating
avg1 <- qplot(avg_ratings_mv, binwidth = 0.1) +
ggtitle("Average Movie Rating Distribution") +
labs(x = 'Average Rating', y = 'Frequency')
# Visualize the distribution of the average user rating
avg2 <- qplot(avg_ratings_us, binwidth = 0.1) +
ggtitle("Average User Rating Distribution") +
labs(x = 'Average Rating', y = 'Frequency')
figure <- ggarrange(avg1, avg2, ncol = 1, nrow = 2)
figure
Both plots above show that some movies have only a few ratings and some users have rated only a few movies. For building recommendation systems, we don't want to take these movies and users into account, as their ratings might be biased. To remove the least-watched movies and least-active users, we can set a minimum threshold, for example, 50.
# Keep only users and movies with more than 50 ratings each
ratings_matrix <- ratings_matrix[rowCounts(ratings_matrix) > 50, colCounts(ratings_matrix) > 50]
# Average rating for each movie
avg_ratings_mv2 <- colMeans(ratings_matrix)
# Average rating for each user
avg_ratings_us2 <- rowMeans(ratings_matrix)
# Visualize the distribution of the average movie rating
avg3 <- qplot(avg_ratings_mv2, binwidth = 0.1) +
ggtitle("Average Movie Rating Distribution") +
labs(x = 'Average Rating', y = 'Frequency')
# Visualize the distribution of the average user rating
avg4 <- qplot(avg_ratings_us2, binwidth = 0.1) +
ggtitle("Average User Rating Distribution") +
labs(x = 'Average Rating', y = 'Frequency')
figure2 <- ggarrange(avg1, avg2, avg3, avg4,
labels = c("A", "B", "C", "D"),
ncol = 2, nrow = 2)
figure2
The effect of removing those potentially biased ratings on the distribution is obvious. From the figure above, we can see that the curves are much narrower and have less variance than before.
We will build the recommender models using the split method, which randomly assigns a predefined proportion of the users to the training set and all others to the test set. For this project, we allocate 80% of the dataset to the training set and 20% to the test set. Ten ratings per user are given to the recommender to make predictions, and the remaining ratings are held out for computing prediction accuracy.
evaluation <- evaluationScheme(ratings_matrix, method = "split", train = 0.8, given = 10)
evaluation
## Evaluation scheme with 10 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: NA
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
train <- getData(evaluation, "train")
train
## 302 x 436 rating matrix of class 'realRatingMatrix' with 28874 ratings.
test_known <- getData(evaluation, "known")
test_known
## 76 x 436 rating matrix of class 'realRatingMatrix' with 760 ratings.
test_unknown <- getData(evaluation, "unknown")
test_unknown
## 76 x 436 rating matrix of class 'realRatingMatrix' with 6580 ratings.
Create a recommender based on the Alternating Least Squares (ALS) method.
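For reference, ALS learns a low-rank factorization of the rating matrix: each user $u$ gets a factor vector $x_u$ and each movie $i$ a factor vector $y_i$, and the algorithm alternates between solving for the user factors with the item factors fixed and vice versa. The standard regularized objective (the textbook formulation, not anything specific to this report) is

$$\min_{X,\,Y} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - x_u^{T} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$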
set.seed(123)
# Create an ALS recommender using the training data
tic()
rec_als <- Recommender(data = train, method = "ALS")
train_time_rec <- toc(quiet = TRUE)
# Predict ratings for the test users from their known ratings (type = "ratings")
tic()
pred_als_acr <- predict(object = rec_als, newdata = test_known, type = "ratings")
predict_time_rec <- toc(quiet = TRUE)
# Predict the top-n recommendation lists for the test users from their known ratings
tic()
pred_als_n <- predict(object = rec_als, newdata = test_known, n = 5)
top_n_time_rec <- toc(quiet = TRUE)
As in the data exploration above, we apply the same filtering before building the Spark model: movies with only a few ratings and users who have rated only a few movies might bias the results, so we again use a minimum threshold of 50.
# Connect to your Spark cluster
spark_conn <- spark_connect(master = "local")
# Copy ratings matrix to Spark
ratings_tbl <- copy_to(spark_conn, ratings, overwrite=TRUE)
# Remove timestamp column
ratings_tbl <- ratings_tbl %>% select(-timestamp)
# Remove users and movies with 50 or fewer ratings
ratings_tbl <- ratings_tbl %>%
group_by(userId) %>%
mutate(count = n()) %>%
filter(count > 50)
ratings_tbl <- ratings_tbl %>%
select(-count) %>%
group_by(movieId) %>%
mutate(count = n()) %>%
filter(count > 50)
ratings_tbl <- ratings_tbl %>% select(-count)
# Split the dataset into training and test set
partitions <- ratings_tbl %>% sdf_random_split(training = 0.8, test = 0.2, seed = 123)
set.seed(456)
# Train the ALS model
tic()
als_model <- ml_als(partitions$training, rating_col = "rating", user_col = "userId", item_col = "movieId")
train_time_sp <- toc(quiet = TRUE)
# Predict rating using test set
tic()
als_pred <- ml_predict(als_model, partitions$test)
predict_time_sp <- toc(quiet = TRUE)
# Return the top 5 recommendations
tic()
als_rec <- ml_recommend(als_model, type = "items", n = 5) %>% select(-recommendations)
top_n_time_sp <- toc(quiet = TRUE)
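A side note on ml_als(): the call above relies on Spark's default hyperparameters. The sketch below spells out the main ones purely as an illustration (rank, reg_param, and max_iter are shown at their documented defaults; cold_start_strategy defaults to "nan"). In particular, cold_start_strategy = "drop" would make Spark drop NaN predictions for users or movies that land only in the test split, which we instead filter out manually below.
# Hypothetical, more explicit version of the training call above
als_model <- ml_als(partitions$training,
rating_col = "rating", user_col = "userId", item_col = "movieId",
rank = 10, # number of latent factors
reg_param = 0.1, # regularization strength (lambda)
max_iter = 10, # number of ALS iterations
cold_start_strategy = "drop") # drop NaN predictions for unseen users/movies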
Evaluate the accuracy based on ratings for the Recommenderlab ALS model and Spark ALS model.
# Evaluate the accuracy for the Recommenderlab ALS model
acr_als <- calcPredictionAccuracy(pred_als_acr, test_unknown)
# Remove NaN predictions for users/movies that appear only in the test split
als_pred <- als_pred %>% filter(!is.na(prediction))
# Evaluate the accuracy for the Spark ALS model
spark_mae <- als_pred %>%
mutate(error = abs(rating - prediction)) %>%
summarize(mean(error))
spark_mse <- als_pred %>%
mutate(error = (rating - prediction)^2) %>%
summarize(mean(error))
spark_rmse <- als_pred %>%
mutate(error = (rating - prediction)^2) %>%
summarize(sqrt(mean(error)))
# Combine the RMSE, MSE, and MAE for both models
acr <- rbind("Recommenderlab ALS" = acr_als,
"Spark ALS" = data.frame(RMSE = spark_rmse, MSE = spark_mse, MAE = spark_mae))
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
# Update column names to RMSE, MSE, and MAE
colnames(acr) <- c("RMSE", "MSE", "MAE")
acr
## RMSE MSE MAE
## Recommenderlab ALS 0.9001441 0.8102594 0.7045567
## Spark ALS 0.8165666 0.6667811 0.6350207
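As an aside, the same error metrics can be computed inside Spark itself using sparklyr's regression evaluator instead of the manual dplyr summaries above; a minimal sketch, reusing the filtered als_pred from the evaluation step:
# Compute RMSE, MSE, and MAE with Spark's built-in regression evaluator
ml_regression_evaluator(als_pred, label_col = "rating", metric_name = "rmse")
ml_regression_evaluator(als_pred, label_col = "rating", metric_name = "mse")
ml_regression_evaluator(als_pred, label_col = "rating", metric_name = "mae")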
Evaluate the running time for both recommender systems.
# Set up data frame for running time performance
runtime <- data.frame(Method=character(), Training=double(), Predicting=double(), Top_N=double())
# Combine the running time for the Recommenderlab ALS model
runtime <- rbind(runtime, data.frame(Method = "Recommenderlab",
Training = round(train_time_rec$toc - train_time_rec$tic, 2),
Predicting = round(predict_time_rec$toc - predict_time_rec$tic, 2),
Top_N = round(top_n_time_rec$toc - top_n_time_rec$tic, 2)))
# Combine the running time for the Spark ALS model
runtime <- rbind(runtime, data.frame(Method = "Spark",
Training = round(train_time_sp$toc - train_time_sp$tic, 2),
Predicting = round(predict_time_sp$toc - predict_time_sp$tic, 2),
Top_N = round(top_n_time_sp$toc - top_n_time_sp$tic, 2)))
# Remove row names
rownames(runtime) <- NULL
runtime
## Method Training Predicting Top_N
## 1 Recommenderlab 0.11 71.58 69.75
## 2 Spark 125.64 0.81 73.68
From the evaluation results, we can see that the Spark ALS recommender system has lower RMSE, MSE, and MAE values than the Recommenderlab ALS recommender system. However, the running time is significantly higher for Spark when training the model and much lower when predicting ratings. For the top-N recommendations, Recommenderlab again beats Spark: Spark took longer to produce the top 5 recommendations.
The distributed system could be much slower in training because the ALS algorithm repeatedly passes over the training data to refine the model parameters until an acceptably low error is reached, and each pass incurs Spark's distribution overhead. On the other hand, since the data fits in a single machine's memory, Recommenderlab is the winner in the training process. Spark is much faster in predicting, however, because it can cache intermediate computation results in memory rather than dumping them to disk, which would involve much slower disk I/O.
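sparklyr exposes this caching explicitly: a registered Spark table can be pinned to, and later released from, executor memory by name. A small sketch, assuming the "ratings" table registered by copy_to() earlier in the session:
# Pin the copied ratings table in Spark's memory, then release it when done
tbl_cache(spark_conn, "ratings")
tbl_uncache(spark_conn, "ratings")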
From a performance perspective, Spark is better, which seems to be the great advantage of distributed platforms. When we are working with massive amounts of data, for example, data centralized from many different collection points, we can consider using a distributed platform. If we are not dealing with big datasets, then a distributed platform such as Spark may not be preferred due to its implementation complexity and cost. During this project, we also ran into issues getting Spark to run on our local machine. We got a variety of errors, such as "local host: 8880 did not respond". We spent hours troubleshooting this error without success and ended up using a different computer to make it work. This illustrates some of Spark's complexity, which depends on the programming language and IDE being used. Considering the performance measures, however, Spark seems a good choice to work with.
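Finally, one housekeeping step the run above omits: close the local Spark connection once the evaluation is complete.
# Disconnect from the local Spark cluster
spark_disconnect(spark_conn)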