1 Project Goal

The goal for the final project is to build out a recommender system using a large dataset (e.g. 1M+ ratings or 10k+ users, 10k+ items). If you would like to use one of the datasets you have already worked with, you should add a uniqle element or incorporate additional data. The overall goal, however, will be to produce quality recommendations by extracting insights fom a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! Make a five-minute presentation of your system in our final meetup on Thursday.

2 Introduction

In this project, we are going to build out a recommender system using different algorithms for movie recommendations by using MovieLens datasets, which can be found at [https://grouplens.org/datasets/movielens/latest/] or [http://grouplens.org/datasets/]. This MovieLens dataset is different from the MovieLense dataset we used in project 4. We will implement User-Based Collaborative Filtering (UBCF) model, Item-Based Collaborative Filtering (IBCF) model, singular value decomposition (SVD) model, alternating least square (ALS) model, and Spark ALS model to our datasets and compare their performance.

2.1 Note

To develop an efficient program of this project in PC environment but yet to effectively demonstrate building recommender systems using R studio, we will be covering two MovieLens datasets. A relatively smaller MovieLens dataset of 100k+ observations will be used when building recommender systems using the package Recommenderlab in the first section. However to meet the size requirement of data in this project, a larger MovieLens dataset with 27M+ ratings that is shrinked to around 12,000 users and 12,000 movies will be used when building a recommender system using sparklyr in the second section.

3 Load Library

library(tidyverse)
library(sparklyr)
library(recommenderlab)
library(knitr)
library(kableExtra)
library(scales)
library(gplots)

set.seed(0)

4 Build Model in RecommenderLab

4.1 Import Small MovieLens Dataset

4.1.1 Read data from CSV

movie <- read.csv("https://raw.githubusercontent.com/shirley-wong/Data-612/master/Project-Final/ml-latest-small/movies.csv")
ratings <- read.csv("https://raw.githubusercontent.com/shirley-wong/Data-612/master/Project-Final/ml-latest-small/ratings.csv")

4.2 Data Exploration

The MovieLens dataset we are going to use for the this section contains 100,836 ratings and 3,683 tag applications across 9,742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018, and generated as a dataset on September 26, 2018. All users had rated at least 20 movies. Each user is represented by an ID and no other information is provided. All ratings are made on a 5-star scale with half-star increments (0.5 star - 5.0 stars). From the explorations below, this dataset is quite sparse.

4.2.1 Data: Ratings

Here is a glimpse of the ratings dataset.

We can take a look at the ratings dataset in both long format and wide format below.

head(ratings, 5)
glimpse(ratings)
## Rows: 100,836
## Columns: 4
## $ userId    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ movieId   <int> 1, 3, 6, 47, 50, 70, 101, 110, 151, 157, 163, 216, 223, 2...
## $ rating    <dbl> 4, 4, 4, 5, 5, 3, 5, 4, 5, 5, 5, 5, 3, 5, 4, 5, 3, 3, 5, ...
## $ timestamp <int> 964982703, 964981247, 964982224, 964983815, 964982931, 96...
ratings %>% 
  select(userId, movieId, rating) %>%
  spread(key = movieId, value = rating) %>%
  head(10)

The dataset is very sparse as observed from the heapmap of a subset of the dataset shown below.

ratings %>% 
  select(userId, movieId, rating) %>%
  spread(key = movieId, value = rating) %>%
  filter(userId %in% c(1:50)) %>%
  select(c(1:50)) %>%
  column_to_rownames('userId') %>%
  as.matrix() %>%
  heatmap.2(trace = 'none',
            dendrogram = 'none',
            density.info = 'none',
            Rowv = FALSE,
            Colv = FALSE,
            col = colorRampPalette(c('grey','deeppink4'))(n=299))
heatmap of partial dataset

heatmap of partial dataset

In this dataset, users rated at least 20 movies. Most of them rated less than 60 movies.

ratingmat <- ratings %>% 
  select(userId, movieId, rating) %>%
  spread(key = movieId, value = rating) %>%
  select(-userId) %>%
  as.matrix() %>%
  as('realRatingMatrix')

ratingmat %>%
  rowCounts() %>%
  data.frame(Row_Count = .) %>%
  ggplot(aes(x=Row_Count)) +
  geom_histogram(fill = 'deeppink4', color = 'grey', bins = 50) +
  #geom_text(stat= 'count', vjust=-0.5)+
  scale_x_continuous(limits = c(0,500),breaks = seq(0,500,20)) +
  labs(title = 'Histogram: Number of Ratings per User')

Most movies are rated by no more than 5 users.

ratingmat %>%
  colCounts() %>%
  data.frame(Col_Count = .) %>%
  ggplot(aes(x=Col_Count)) +
  geom_histogram(fill = 'deeppink4', color = 'grey', bins = 50) +
  #geom_text(stat= 'count', vjust=-0.5)+
  scale_x_continuous(limits = c(0,50),breaks = seq(0,50,5)) +
  labs(title = "Histogram: Number of Ratings per Movie")

#remove local variables, clear space for the downstream process
#rm(movies, ratings)

4.2.2 Data: Movie

Here is a glimpse of the movie dataset.

It contains movie IDs, movie names and genres.

head(movie)
glimpse(movie)
## Rows: 9,742
## Columns: 3
## $ movieId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
## $ title   <fct> "Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (19...
## $ genres  <fct> Adventure|Animation|Children|Comedy|Fantasy, Adventure|Chil...

4.2.3 User Similarity

We can see the similarity between users by looking at the heatmap. White color represents no data.

similarity_users <- similarity(ratingmat[1:25, ], 
                               method = "pearson", 
                               which = "users")

image(as.matrix(similarity_users), main = "User Similarity")

4.2.4 Item Similarity

We can see the similarity between items by looking at the heatmap. White color represents no data.

similarity_items <- similarity(ratingmat[, 1:25], method =
                                 "pearson", which = "items")
image(as.matrix(similarity_items), main = "Item Similarity")

4.3 Select Sample Data

Select data that has rowcount > 50 and colcount > 100 as our sample dataset.

ratings_movies <- ratingmat[rowCounts(ratingmat) > 50,
                             colCounts(ratingmat) > 100]

This is a non-normalized heatmap of our sample dataset at quantile 0.95.

min_movies <- quantile(rowCounts(ratings_movies), 0.95)
min_users <- quantile(colCounts(ratings_movies), 0.95)

image(ratingmat[rowCounts(ratingmat) > min_movies,
colCounts(ratingmat) > min_users], main = "Non-Normalized Heatmap")

This is a normalized heatmap of our sample dataset at quantile 0.95.

rating_movie_normalize <- normalize(ratings_movies)

min_movies <- quantile(rowCounts(rating_movie_normalize), 0.95)
min_users <- quantile(colCounts(rating_movie_normalize), 0.95)

image(rating_movie_normalize[rowCounts(rating_movie_normalize) > min_movies,
colCounts(rating_movie_normalize) > min_users], main = "Normalized Heatmap")

4.4 Build Models

We are going to build models using the library RecommenderLab with different algorithms. We will implement User-Based Collaborative Filtering (UBCF) model, Item-Based Collaborative Filtering (IBCF) model, singular value decomposition (SVD) model and alternating least square (ALS) model, and compare their performance.

4.4.1 Train-Test Split

eval_sets <- evaluationScheme(data = ratings_movies, method = "split", train = 0.8, given = 4, goodRating = 3, k = 1)

4.4.2 UBCF Models

We will evaluate three models of User-Based Collaborative Filtering (UBCF) algorithm by using the recommenderlab package with mean-centering normalization technique and three similarity measures (Pearson correlation, Euclidean distance and Cosine distance).

ubcf_models <- list(
  ubcf_prs_center = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
  ubcf_euc_center = list(name = "UBCF", param = list(method = "euclidean", normalize = "center")),
  ubcf_cos_center = list(name = "UBCF", param = list(method = "cosine", normalize = "center"))
)

ubcf_eval_results <- evaluate(eval_sets, 
                              method = ubcf_models, 
                              n = seq(10, 100, 10)
                              )
## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/0.08sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.09sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.07sec]

The results of the three UBCF models are plotted below in ROC curve and Precision-Recall.

UBCF model with Pearson correlation performs the best among the three models.

plot(ubcf_eval_results, annotate = TRUE, legend="topleft")
title("UBCF_ROC Curve")

plot(ubcf_eval_results, "prec/rec", annotate = TRUE, legend="bottomleft")
title("UBCF_Precision-Recall")

4.4.3 IBCF Models

We will evaluate three models of Item-Based Collaborative Filtering (IBCF) algorithm by using the recommenderlab package with mean-centering normalization technique and three similarity measures (Pearson correlation, Euclidean distance and Cosine distance).

ibcf_models <- list(
  ibcf_prs_center = list(name = "IBCF", param = list(method = "pearson", normalize = "center")),
  ibcf_euc_center = list(name = "IBCF", param = list(method = "euclidean", normalize = "center")),
  ibcf_cos_center = list(name = "IBCF", param = list(method = "cosine", normalize = "center")),
  ibcf_cos_NULL = list(name = "IBCF", param = list(method = "cosine", normalize = NULL))
)
ibcf_eval_results <- evaluate(eval_sets, 
                              method = ibcf_models, 
                              n = seq(10, 100, 10)
                              )
## IBCF run fold/sample [model time/prediction time]
##   1  [0.05sec/0.03sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.04sec/0.02sec]

The results of the three IBCF models are plotted below in ROC curve and Precision-Recall.

IBCF model with Euclidean distance performs the best among the three models.

plot(ibcf_eval_results, annotate = TRUE, legend="topleft")
title("IBCF_ROC Curve")

plot(ibcf_eval_results, "prec/rec", annotate = TRUE, legend="bottomleft")
title("IBCF_Precision-Recall")

4.4.4 SVD Models

We will evaluate three models of Singular Value Decomposition (SVD) algorithm by using the recommenderlab package with non-normalization, mean-centering normalization, z-score normalization technique.

svd_models <- list(
  svd_center = list(name = "SVD", param = list(normalize = "center")),
  svd_z = list(name = "SVD", param = list(normalize = "Z-score")),
  svd = list(name = 'SVD', param = list(normalize = NULL))
)
svd_eval_results <- evaluate(x = eval_sets, 
                                 method = svd_models, 
                                 n = seq(10, 100, 10))
## SVD run fold/sample [model time/prediction time]
##   1  [0sec/0.03sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.03sec/0sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec]

The results of the three SVD models are plotted below in ROC curve and Precision-Recall.

SVD model without normalization performs the best among the three models.

plot(svd_eval_results, annotate = TRUE, legend="topleft")
title("SVD_ROC Curve")

plot(svd_eval_results, "prec/rec", annotate = TRUE, legend="bottomleft")
title("SVD_Precision-Recall")

4.4.5 ALS Models

We will evaluate three models of alternating least square (ALS) algorithm by using the recommenderlab package with non-normalization, mean-centering normalization, z-score normalization technique.

als_models <- list(
  als_center = list(name = "ALS", param = list(normalize = "center")),
  als_z = list(name = "ALS", param = list(normalize = "Z-score")),
  als = list(name = 'ALS', param = list(normalize = NULL))
)
als_eval_results <- evaluate(x = eval_sets, 
                                 method = als_models, 
                                 n = seq(10, 100, 10))
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/3.83sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/3.75sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/3.34sec]

The results of the three ALS models are plotted below in ROC curve and Precision-Recall.

ALS model without normalization performs the best among the three models.

plot(als_eval_results, annotate = TRUE, legend="topleft")
title("ALS_ROC Curve")

plot(als_eval_results, "prec/rec", annotate = TRUE, legend="bottomleft")
title("ALS_Precision-Recall")

4.4.6 Metrics

We are going to study the error metrics of the best model of each algorithm and compare their performances.

train <- getData(eval_sets, 'train')
known <- getData(eval_sets, 'known')
unknown <- getData(eval_sets, 'unknown')

4.4.6.1 UBCF

# UBCF Model
UBCF_train <- Recommender(getData(eval_sets, "train"), "IBCF", parameter = list(method = "pearson", normalize = "center"))

# Accuracy Metrics of UBCF Model
UBCF_pred <- predict(UBCF_train,getData(eval_sets,'known'), type = 'ratings')
UBCF_error <- calcPredictionAccuracy(UBCF_pred, getData(eval_sets, "unknown"))
UBCF_error
##     RMSE      MSE      MAE 
## 1.070182 1.145289 0.787501

4.4.6.2 IBCF

#IBCF Model
IBCF_train <- Recommender(getData(eval_sets, "train"), "IBCF", parameter = list(method = "Euclidean", normalize = "center"))

# Accuracy Metrics of IBCF Model
IBCF_pred <- predict(IBCF_train,getData(eval_sets,'known'), type = 'ratings')
IBCF_error <- calcPredictionAccuracy(IBCF_pred, getData(eval_sets, "unknown"))
IBCF_error
##      RMSE       MSE       MAE 
## 1.0390103 1.0795425 0.7464071

4.4.6.3 SVD

#SVD Model
SVD_train <- Recommender(getData(eval_sets, "train"), "SVD")

# Accuracy Metrics of SVD Model
SVD_pred <- predict(SVD_train,getData(eval_sets,'known'), type = 'ratings')
SVD_error <- calcPredictionAccuracy(SVD_pred, getData(eval_sets, "unknown"))
SVD_error
##      RMSE       MSE       MAE 
## 0.9574121 0.9166379 0.7334407

4.4.6.4 ALS

#als Model
ALS_train <- Recommender(getData(eval_sets, "train"), "ALS")

# Accuracy Metrics of ALS Model
ALS_pred <- predict(ALS_train,getData(eval_sets,'known'), type = 'ratings')
ALS_error <- calcPredictionAccuracy(ALS_pred, getData(eval_sets, "unknown"))
ALS_error
##      RMSE       MSE       MAE 
## 0.9363810 0.8768093 0.7215331

4.4.7 Conclusion

By comparing the metrics, it shows that the original non-normalized alternating least square (ALS) model performs the best by having the lowest RMSE value among all our models.

rbind(UBCF_error, IBCF_error, SVD_error, ALS_error) %>%
  data.frame() %>%
  rownames_to_column('Model') %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE) %>%
  add_header_above(c('Metrics Comparison'=4))
Metrics Comparison
Model RMSE MSE MAE
ALS_error 0.9363810 0.8768093 0.7215331
SVD_error 0.9574121 0.9166379 0.7334407
IBCF_error 1.0390103 1.0795425 0.7464071
UBCF_error 1.0701815 1.1452885 0.7875010
# remove variables, clear space for downstream process
rm(list=ls())

5 Build Model Using Spark

From the section above, we have concluded that ALS model performs the best by comparing with UBCF, IBCF and SVD model. In this second section, we are going to build a recommender system using the library sparklyr with alternating least square (ALS) model.

5.1 Create Local Spark Connection

Config Spark local server. Set 50% of our system(PC)'s accessible memory to Spark.

conf <- spark_config()
conf$spark.memory.fraction <- 0.5

sc <- spark_connect(master = 'local', config = conf)

5.2 Import Large MovieLens Dataset

5.2.1 Data Exploration

The large MovieLens dataset contains 27,753,444 ratings and 1,108,997 tag applications across 58,098 movies. These data were created by 283,228 users between January 09, 1995 and September 26, 2018, and generated as a dataset on September 26, 2018. All users had rated at least 1 movies. Each user is represented by an ID and no other information is provided. All ratings are made on a 5-star scale with half-star increments (0.5 star - 5.0 stars).

As the dataset is relatively too large which will overload our system's memory, to meet the requirement of project but also make the it executable in PC, the ratings dataset is shrinked to around 12,000+ users and 12,000+ movies with over 1 million ratings only for our study.

#set local file path
path = 'C://Users//HR//Desktop//DATA612//ml-latest//'

movies <- read.csv(str_c(path,'movies.csv'), stringsAsFactors = FALSE) %>% 
  filter(movieId %in% c(1:70000)) %>%
  #remove non graphical chacters to avoid errors when copying data to Spark
  mutate(title = title %>% str_replace_all('[^[:space:][:alnum:][:punct:]]','')) %>%
  rename(item = movieId)

ratings <- read.csv(str_c(path, 'ratings.csv')) %>% 
  filter(userId %in% c(1:13000)) %>%
  filter(movieId %in% c(1:70000)) %>%
  rename(user = userId, item = movieId)
ratings %>%
  select(user, item, rating) %>%
  spread(key=item, value = rating) %>%
  select(-user) %>%
  as.matrix() %>%
  as('realRatingMatrix')
## 12924 x 12057 rating matrix of class 'realRatingMatrix' with 1151103 ratings.

5.3 Copy Data to Spark

Copy the datasets movies and ratings to Spark. Note that [user IDs] and [movie IDs] are renamed to [user] and [item] respectively because Spark Recommender takes [user] and [item] as default arguments.

sdf_movies <- sdf_copy_to(sc, movies, 'movies', overwrite = TRUE)
sdf_ratings <- sdf_copy_to(sc, ratings, 'ratings', overwrite = TRUE)

# remove variables to clear space
rm(movies, ratings)

5.4 Train-Test-Split

split the dataset into training set (80%) and testing set (20%).

movie_split <- sdf_ratings %>%
  sdf_random_split(training = 0.8, testing = 0.2)

movie_train <- movie_split$training
movie_test <- movie_split$testing

5.5 Train Model

Train an ALS recommendation model in Spark using function ml_als.

model_formula = rating ~ user + item
rec_als_spark <- ml_als(movie_train, model_formula, max_iter = 5)

5.6 Make Prediction

Predict ratings using function ml_predict.

predict_spark <- ml_predict(rec_als_spark, movie_test)

5.7 Calculate RMSE

Calculate RMSE of the Spark recommendation model using the testing set.

The RMSE value is about 0.86, which is very low. Our Spark recommender system has great performance.

rmse_spark <- predict_spark %>% 
  filter(!isnan(prediction)) %>%
  summarise((rating - prediction)^2 %>% mean() %>% sqrt()) %>%
  collect() %>%
  as.numeric()
print(str_c('RMSE of Model Built in Spark: ',rmse_spark %>% as.character()))
## [1] "RMSE of Model Built in Spark: 0.853591536526088"

5.8 Make Top 10 Recomendation for Each User

Create top 10 item recommendations for all users. Showing the top 10 movie recommendations for the first 5 users below as an example.

rec_result <- ml_recommend(rec_als_spark, type = 'item', 10) %>% 
  left_join(sdf_movies %>% select(item, title), by = c('item')) %>%
  select(user, title) %>%
  group_by(user) %>%
  mutate(rank = rank(title)) %>%
  mutate(rank = paste('Recommendation', rank %>% as.character())) %>%
  collect() %>%
  select(user, rank, title) %>%
  rename(User = user, Rec_ID = rank, `Recommended Movies` = title)
  #spread(key = rank, value = title) %>%
  #arrange(user) %>%
  #kable() %>%
  #kable_styling(bootstrap_options = c('bordered', 'striped'), full_width = TRUE) %>%
  #scroll_box(height = '400px')

rec_result %>%
  spread(key = Rec_ID, value = `Recommended Movies`) %>%
  select(-`Recommendation 10`, `Recommendation 10`) %>%
  filter(User %in% 1:10)

5.9 Disconnect to Spark

Disconnect our R from Spark.

spark_disconnect(sc)

5.10 Recommender UI with Shiny

Using output from the recommender system built in Spark, we can use Shiny to design a simple user interface.

ui <- fluidPage(
  titlePanel("Movie Recommender"),
  sidebarLayout(
    sidebarPanel(
      textInput("txtInput", "Input User ID (Range from 1:12000):"),
      #selectInput('slctInput','Or select User ID from Dropdown Menu',choices = rec_result %>% select(User) %>% as.vector())
    ),
    mainPanel(
      paste("Movies that you might be interested in:"),
      tableOutput('tblOutput')
    )
  )
)

server <- shinyServer(function(input,output){
  output$tblOutput <- renderTable({
    recommendations <- rec_result %>% filter(User == input$txtInput) %>% select(Rec_ID, `Recommended Movies`)
  })
})

shinyApp(ui = ui, server = server)