Introduction

In this final project, we continue to use the MovieLens dataset, this time with 1 million ratings. In past projects we implemented several Collaborative Filtering (CF) recommender systems: user-to-user, item-to-item, matrix factorization with singular value decomposition (SVD), and a hybrid model. For the final project, we extract additional data and add a Content-Based recommender system along with Spark ALS and H2O GLM models. We will again build the most popular collaborative filtering recommender systems, pick the one that performs best, optimize it over different values of a numeric parameter, and then compare the first user's recommendations across the different recommender systems.

Load R Packages

# Load required packages
library(recommenderlab)
library(data.table)
library(sparklyr)
library(rsparkling)
library(kableExtra)
library(gridExtra)
library(tidyverse)
library(reshape2)
library(tictoc)
library(tidyr)
library(purrr)
library(dplyr)
library(knitr)
library(proxy)
library(Hmisc)
library(grid)
library(plyr)
library(h2o)

Load Data

Both the movies and ratings datasets are taken from https://grouplens.org/datasets/movielens/1m/. The ratings dataset has 1,000,209 ratings from 6,040 users on 3,706 movies. The ratings were collected by the GroupLens organization and released in 2003, so our movie recommendations will mostly be '90s movies.

# Load movies and ratings datasets
movies <- fread("https://raw.githubusercontent.com/SieSiongWong/DATA-612/master/movies.csv")

ratings <- fread("https://raw.githubusercontent.com/SieSiongWong/DATA-612/master/ratings_1m.csv")

head(movies)
##    movieId                              title
## 1:       1                   Toy Story (1995)
## 2:       2                     Jumanji (1995)
## 3:       3            Grumpier Old Men (1995)
## 4:       4           Waiting to Exhale (1995)
## 5:       5 Father of the Bride Part II (1995)
## 6:       6                        Heat (1995)
##                                         genres
## 1: Adventure|Animation|Children|Comedy|Fantasy
## 2:                  Adventure|Children|Fantasy
## 3:                              Comedy|Romance
## 4:                        Comedy|Drama|Romance
## 5:                                      Comedy
## 6:                       Action|Crime|Thriller
head(ratings)
##    userId movieId rating
## 1:      1    1193      5
## 2:      1     661      3
## 3:      1     914      3
## 4:      1    3408      4
## 5:      1    2355      5
## 6:      1    1197      3

Data Exploration & Preparation

Statistic Summary

The movies dataset contains 3 columns and 9,742 observations. The ratings dataset contains 3 columns and 1,000,209 observations.

We can see that the mean of the rating variable is 3.582, about 57% of the ratings are a 4 or a 5, and 26% are a 3.

# Summary of movies and ratings datasets
str(movies)
## Classes 'data.table' and 'data.frame':   9742 obs. of  3 variables:
##  $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title  : chr  "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
##  $ genres : chr  "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(ratings)
## Classes 'data.table' and 'data.frame':   1000209 obs. of  3 variables:
##  $ userId : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId: int  1193 661 914 3408 2355 1197 1287 2804 594 919 ...
##  $ rating : int  5 3 3 4 5 3 5 5 4 4 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Statistical summary of rating variable
describe(ratings$rating)
## ratings$rating 
##        n  missing distinct     Info     Mean      Gmd 
##  1000209        0        5    0.927    3.582    1.219 
## 
## lowest : 1 2 3 4 5, highest: 1 2 3 4 5
##                                              
## Value           1      2      3      4      5
## Frequency   56174 107557 261197 348971 226310
## Proportion  0.056  0.108  0.261  0.349  0.226

Matrix Conversion

First of all, we have to convert the raw dataset into a matrix format that can be used for building recommender systems with the recommenderlab package.

# Convert to rating matrix
ratings_matrix <- dcast(ratings, userId~movieId, value.var = "rating", na.rm = FALSE)
  
# Remove user Id column
ratings_matrix <- as.matrix(ratings_matrix[,-1])
  
# Convert rating matrix into a recommenderlab sparse matrix
ratings_matrix <- as(ratings_matrix, "realRatingMatrix")

ratings_matrix
## 6040 x 3706 rating matrix of class 'realRatingMatrix' with 1000209 ratings.

Each row of the ratings_matrix corresponds to a user, and each column corresponds to a movie ID. There are 6040 x 3706 = 22,384,240 possible user-movie combinations, so the full matrix requires 22,384,240 cells. Since not every user has watched every movie, only 1,000,209 of those cells hold a rating, which makes the matrix sparse.
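As a quick sanity check, the sparsity can be computed directly (a minimal sketch; getRatings() from recommenderlab returns the vector of stored ratings):

# Fraction of cells that actually hold a rating:
# 1,000,209 / 22,384,240 is roughly 0.045
length(getRatings(ratings_matrix)) / (nrow(ratings_matrix) * ncol(ratings_matrix))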

Exploring the Values of the Rating

# Convert the ratings matrix into a vector
vec_ratings <- as.vector(ratings_matrix@data)

# Unique ratings
unique(vec_ratings)
## [1] 5 0 4 3 2 1
# Count the occurrences for each rating
table_ratings <- table(vec_ratings)

table_ratings
## vec_ratings
##        0        1        2        3        4        5 
## 21384031    56174   107557   261197   348971   226310

A rating equal to 0 represents a missing value in the matrix, so we remove all of them before building a frequency plot to visualize the ratings distribution. The plot shows that the distribution is left-skewed, with most ratings at 3 or above.

# Remove zero rating and convert the vector to factor
vec_ratings <- vec_ratings[vec_ratings != 0] %>% factor()

# Visualize through qplot
qplot(vec_ratings, fill = I("steelblue")) + 
  ggtitle("Distribution of the Ratings") + 
  labs(x = "Ratings")

Explore Most Viewed Movies

# Search for the top 10 most viewed movies
most_views <- colCounts(ratings_matrix) %>% melt()

most_views <- tibble::rowid_to_column(most_views, "movieId") 
names(most_views)[2] <- 'count' 
most_views <- most_views %>% 
  merge(movies, by = "movieId") %>% 
  top_n(count, n = 10)

# Visualize the top 10 most viewed movies
ggplot(most_views, aes(x = reorder(title, count), y = count, fill = 'lightblue')) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x =element_text(angle = 60, hjust = 1)) + 
  ggtitle("Top 10 Most Viewed Movies") + 
  theme(legend.position = "none", axis.title.x = element_blank())

Explore the Average Ratings

# Average rating for each movie
avg_ratings_mv <- colMeans(ratings_matrix)

# Average rating for each user
avg_ratings_us <- rowMeans(ratings_matrix)

# Visualize the distribution of the average movie rating
avg1 <- qplot(avg_ratings_mv) + 
  stat_bin(binwidth = 0.1) +
  ggtitle("Average Movie Rating Distribution") + 
  labs(x = 'Average Rating', y = 'Frequency') 

# Visualize the distribution of the average user rating
avg2 <- qplot(avg_ratings_us) + 
  stat_bin(binwidth = 0.1) +
  ggtitle("Average User Rating Distribution") + 
  labs(x = 'Average Rating', y = 'Frequency') 

# Compare the average rating distribution plots
grid.arrange(avg1, avg2, nrow = 1)

Both plots above show that some movies have only a few ratings and some users have rated only a few movies. For building recommender systems, we don't want to take these movies and users into account, as their ratings might be biased. To remove the least-watched movies and least-active users, we can set a minimum threshold, for example, 50 ratings.

# Keep only users and movies with more than 50 ratings
ratings_matrix <- ratings_matrix[rowCounts(ratings_matrix) > 50, colCounts(ratings_matrix) > 50]

# Average rating for each movie
avg_ratings_mv2 <- colMeans(ratings_matrix)

# Average rating for each user
avg_ratings_us2 <- rowMeans(ratings_matrix)

# Visualize the distribution of the average movie rating
avg3 <- qplot(avg_ratings_mv2) + 
  stat_bin(binwidth = 0.1) +
  ggtitle("Average Movie Rating Distribution") + 
  labs(x = 'Average Rating', y = 'Frequency')

# Visualize the distribution of the average user rating
avg4 <- qplot(avg_ratings_us2) + 
  stat_bin(binwidth = 0.1) +
  ggtitle("Average User Rating Distribution") + 
  labs(x = 'Average Rating', y = 'Frequency')

# Compare the average rating distribution plots
grid.arrange(arrangeGrob(avg1, avg2, ncol = 1, top=textGrob("Before")), arrangeGrob(avg3, avg4, ncol = 1, top=textGrob("After")), ncol = 2)

The effect of removing these potentially biased ratings on the distribution is obvious: in the figure above, the curves are much narrower and show less variance than before.
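A quick numeric check of that narrowing (a sketch reusing the average-rating vectors computed above; na.rm = TRUE guards against movies left with no ratings):

# Variance of the average movie rating, before vs. after filtering
c(before = var(avg_ratings_mv, na.rm = TRUE),
  after  = var(avg_ratings_mv2, na.rm = TRUE))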

Non-Distributed System

Recommenderlab

Find the Best Method

Based on previous projects, we observed that the UBCF, SVDF, and ALS methods perform well, giving lower RMSE and higher AUC for both the ROC and Precision-Recall curves. So, for the final project we'll focus on these 3 methods, evaluate which one provides the highest accuracy, and then optimize it further by choosing the best-fitting hyperparameter value.

# Setup the evaluation scheme
evaluation <- evaluationScheme(ratings_matrix, 
                                 method     = "cross", 
                                 k          = 5, 
                                 train      = 0.8, 
                                 given      = 10,
                                 goodRating = 3
                               )

evaluation
## Evaluation scheme with 10 items given
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=3.000000
## Data set: 4247 x 2499 rating matrix of class 'realRatingMatrix' with 918946 ratings.
# Set up list of algorithms
algorithms <- list(
  "User-Based CF"               = list(name  = "UBCF", parameter = list(method = "cosine", nn = 25)),
  "Funk SVD"                    = list(name  = "SVDF", parameter = list(k = 10)),
  "Alternating Least Squares"   = list(name  = "ALS")
                  )

# Estimate the models with top N recommendation lists
results <- evaluate(evaluation, 
                    algorithms, 
                    type  = "topNList", 
                    n     = c(1, 3, 5, 10, 15, 20)
                   )
## UBCF run fold/sample [model time/prediction time]
##   1  [0.098sec/106.699sec] 
##   2  [0.096sec/106.026sec] 
##   3  [0.09sec/105.447sec] 
##   4  [0.093sec/105.941sec] 
##   5  [0.087sec/105.788sec] 
## SVDF run fold/sample [model time/prediction time]
##   1  [2212.022sec/389.348sec] 
##   2  [2195.77sec/389.937sec] 
##   3  [2186.808sec/391.692sec] 
##   4  [2205.083sec/384.719sec] 
##   5  [2191.173sec/382.589sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0.001sec/1129.171sec] 
##   2  [0.001sec/1161.458sec] 
##   3  [0.001sec/1117.905sec] 
##   4  [0.001sec/1133.27sec] 
##   5  [0.001sec/1140.663sec]
results
## List of evaluation results for 3 recommenders:
## Evaluation results for 5 folds/samples using method 'UBCF'.
## Evaluation results for 5 folds/samples using method 'SVDF'.
## Evaluation results for 5 folds/samples using method 'ALS'.
# Create a function to get the average of precision, recall, TPR, FPR
avg_cf_matrix <- function(results) {
  avg <- results %>%
    getConfusionMatrix() %>%
    as.list()
  as.data.frame(Reduce("+", avg) / length(avg)) %>%
    mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
    select('n', 'precision', 'recall', 'TPR', 'FPR')
}

# Use map() to iterate the averaging function across all three models
results_tbl <- results %>% map(avg_cf_matrix) %>% enframe() %>% unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(value)`
results_tbl
## # A tibble: 18 x 6
##    name                          n precision   recall      TPR      FPR
##    <chr>                     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
##  1 User-Based CF                 1    0.0646 0.000387 0.000387 0.000404
##  2 User-Based CF                 3    0.0595 0.00101  0.00101  0.00122 
##  3 User-Based CF                 5    0.0592 0.00164  0.00164  0.00203 
##  4 User-Based CF                10    0.0602 0.00340  0.00340  0.00406 
##  5 User-Based CF                15    0.0620 0.00527  0.00527  0.00608 
##  6 User-Based CF                20    0.0632 0.00710  0.00710  0.00809 
##  7 Funk SVD                      1    0.376  0.00290  0.00290  0.000266
##  8 Funk SVD                      3    0.344  0.00772  0.00772  0.000838
##  9 Funk SVD                      5    0.330  0.0121   0.0121   0.00143 
## 10 Funk SVD                     10    0.303  0.0218   0.0218   0.00297 
## 11 Funk SVD                     15    0.287  0.0307   0.0307   0.00457 
## 12 Funk SVD                     20    0.274  0.0389   0.0389   0.00621 
## 13 Alternating Least Squares     1    0.249  0.00186  0.00186  0.000322
## 14 Alternating Least Squares     3    0.291  0.00650  0.00650  0.000910
## 15 Alternating Least Squares     5    0.297  0.0111   0.0111   0.00150 
## 16 Alternating Least Squares    10    0.291  0.0217   0.0217   0.00304 
## 17 Alternating Least Squares    15    0.283  0.0315   0.0315   0.00460 
## 18 Alternating Least Squares    20    0.274  0.0404   0.0404   0.00621
# Plot ROC curves for each model
results_tbl %>%
  ggplot(aes(FPR, TPR, color = fct_reorder2(as.factor(name), FPR, TPR))) +
  geom_line() +
  geom_label(aes(label = n))  +
  labs(title = "ROC Curves", color = "Model") +
  theme_grey(base_size = 14)

# Plot Precision-Recall curves for each model
results_tbl %>%
  ggplot(aes(recall, precision, color = fct_reorder2(as.factor(name), recall, precision))) +
  geom_line() +
  geom_label(aes(label = n))  +
  labs(title = "Precision-Recall Curves", colour = "Model") +
  theme_grey(base_size = 14)

Optimize the Best Method

Optimize the n_factors value, the number of latent factors, for the ALS method. The default value of n_factors is 10.

# Default parameter values for the ALS method
rec <-  recommenderRegistry$get_entries(dataType = "realRatingMatrix")
rec$ALS_realRatingMatrix$parameters
## $normalize
## NULL
## 
## $lambda
## [1] 0.1
## 
## $n_factors
## [1] 10
## 
## $n_iterations
## [1] 10
## 
## $min_item_nr
## [1] 1
## 
## $seed
## NULL
# Randomly sample 100 users to speed up parameter optimization
ratings_matrix_opt <- ratings_matrix[sample(nrow(ratings_matrix), 100), ]

# Set up a new evaluation scheme for parameter optimization
evaluation_opt <- evaluationScheme(ratings_matrix_opt, 
                                   method     = "cross", 
                                   k          = 5, 
                                   train      = 0.8, 
                                   given      = 10,
                                   goodRating = 3
                                   )

# Set the n_factors values to test, ranging from 1 to 20
nf <- c(1, 3, 5, 10, 15, 20)

# Using lapply to define a list of models to evaluate
als_models <- lapply(nf, function(n){
                                     list(name = "ALS", 
                                          param = list(n_factors = n))
                                     })

names(als_models) <- paste0("ALS_nf_", nf)

list_results <- evaluate(evaluation_opt, 
                         method = als_models, 
                         n = c(1, 3, 5, 10, 15, 20)
                         )
## ALS run fold/sample [model time/prediction time]
##   1  [0.001sec/23.527sec] 
##   2  [0.001sec/23.847sec] 
##   3  [0.001sec/23.459sec] 
##   4  [0sec/23.104sec] 
##   5  [0.001sec/23.332sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0.001sec/25.246sec] 
##   2  [0.008sec/25.789sec] 
##   3  [0.006sec/25.409sec] 
##   4  [0.006sec/25.087sec] 
##   5  [0.006sec/24.868sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0.006sec/25.264sec] 
##   2  [0.006sec/25.496sec] 
##   3  [0.006sec/24.917sec] 
##   4  [0.007sec/24.984sec] 
##   5  [0.007sec/24.988sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0.007sec/33.319sec] 
##   2  [0.006sec/33.426sec] 
##   3  [0.006sec/25.121sec] 
##   4  [0.007sec/32.757sec] 
##   5  [0.006sec/32.71sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0.006sec/38.957sec] 
##   2  [0.006sec/39.538sec] 
##   3  [0.006sec/38.509sec] 
##   4  [0.006sec/36.071sec] 
##   5  [0.007sec/39.329sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0.006sec/216.627sec] 
##   2  [0.007sec/221.253sec] 
##   3  [0.006sec/214.294sec] 
##   4  [0.006sec/213.062sec] 
##   5  [0.008sec/215.316sec]
# Plot ROC curve
plot(list_results, annotate = 1, legend = "topleft") 
title("ROC Curve")

# Plot Precision-Recall curve
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-Recall")

We can see that n_factors = 20 gives the largest AUC. So, 20 is the best-performing n_factors value, which we'll use to build the final ALS recommender system.
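As an optional numeric companion to the curves, recommenderlab's avg() returns the confusion matrix averaged across folds for a single result (a sketch):

# Averaged confusion matrix across the 5 folds for n_factors = 20
avg(list_results[["ALS_nf_20"]])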

ALS

To be consistent with the distributed system, we will be using the splitting method to prepare the data to evaluate the ALS model with n_factors = 20. We allocate 80% of the dataset to the training set and 20% to the test set. 10 ratings per user will be given to the recommender to make predictions and the other ratings are held out for computing prediction accuracy.

evaluation <- evaluationScheme(ratings_matrix, method = "split", train = 0.8, given = 10)

evaluation
## Evaluation scheme with 10 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: NA
## Data set: 4247 x 2499 rating matrix of class 'realRatingMatrix' with 918946 ratings.
train <- getData(evaluation, "train")
train
## 3397 x 2499 rating matrix of class 'realRatingMatrix' with 727994 ratings.
test_known <- getData(evaluation, "known")
test_known
## 850 x 2499 rating matrix of class 'realRatingMatrix' with 8500 ratings.
test_unknown <- getData(evaluation, "unknown")
test_unknown
## 850 x 2499 rating matrix of class 'realRatingMatrix' with 182452 ratings.

Create a recommender based on Alternating Least Squares (ALS) method with number of latent factors equal to 20.

set.seed(123)

# Create an ALS recommender using the training data
tic()
rec_als <- Recommender(data = train, method = "ALS", parameter=list(n_factors = 20))
train_time_rec <- toc(quiet = TRUE)

# Create predictions for the test items using known ratings with type as ratings
tic()
pred_als_acr <- predict(object = rec_als, newdata = test_known, type = "ratings")
predict_time_rec <- toc(quiet = TRUE)

# Create predictions for the test items using known ratings with type as top n recommendation list
tic()
pred_als_n <- predict(object = rec_als, newdata = test_known, n = 5)
top_n_time_rec <- toc(quiet = TRUE)

Exploring the Recommender Model on the Test Set

Below are the top 5 recommendations for the first user. You may notice that fewer than 5 movies are shown, because some of the movie IDs do not exist in the movies dataset.

# Recommendations for the first user.
first_user_rec <- pred_als_n@items[1:1] %>% data.frame()
colnames(first_user_rec) <- c("movieId")
first_user_rec <- first_user_rec %>%
  merge(movies, by = "movieId") %>% 
  select(-movieId)

first_user_rec
##                                title                          genres
## 1                  Love Jones (1997)                         Romance
## 2 Back to the Future Part III (1990) Adventure|Comedy|Sci-Fi|Western
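A note on the missing movies: pred_als_n@items holds column positions in the filtered rating matrix rather than actual movie IDs, so merging those positions on movieId can silently drop rows. A more robust mapping (a sketch, offered as an alternative rather than a change to the results above) goes through the itemLabels slot, which stores the original movie IDs:

# Map item indices to real movie IDs via the itemLabels slot (sketch)
first_user_ids <- as.integer(pred_als_n@itemLabels[pred_als_n@items[[1]]])
data.frame(movieId = first_user_ids) %>%
  merge(movies, by = "movieId") %>%
  select(title, genres)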

Distribution of the number of recommended movies

# Define a matrix with the recommendations to the test set users
rec_matrix <- as.matrix(data.frame(pred_als_n@items))

# Define a vector with all recommendations
num_of_items <- factor(table(rec_matrix))

# Visualize the distribution of the number of recommended movies
qplot(num_of_items) + 
  ggtitle("Distribution of the Number of Recommended Movies") + 
  labs(x = "Number of Times Recommended") + 
  theme(axis.text.x = element_text(angle=45))

From the distribution, we can see that most movies have been recommended many times, while a few have been recommended only a few times.

Top 10 most recommended movies

# Top 10 most recommended movies
top10_rec <- num_of_items %>% data.frame()
top10_rec <- cbind(movieId = rownames(top10_rec), top10_rec)
rownames(top10_rec) <- 1:nrow(top10_rec)
colnames(top10_rec)[2] <- "count"
top10_rec <- top10_rec %>% 
  mutate_if(is.factor, ~ as.integer(levels(.x))[.x]) %>%
  merge(movies, by = "movieId") %>%
  top_n(count, n = 10)

top10_rec <- top10_rec[order(top10_rec$count, decreasing = TRUE),] %>% 
  select(title:genres)

top10_rec
##                                           title                genres
## 3                       Boys on the Side (1995)          Comedy|Drama
## 8                      Cold Comfort Farm (1995)                Comedy
## 4                             Paper, The (1994)          Comedy|Drama
## 6                                   Rudy (1993)                 Drama
## 2                               Mallrats (1995)        Comedy|Romance
## 10 Rosencrantz and Guildenstern Are Dead (1990)          Comedy|Drama
## 1                             To Die For (1995) Comedy|Drama|Thriller
## 9                              Barb Wire (1996)         Action|Sci-Fi
## 7             Terminator 2: Judgment Day (1991)         Action|Sci-Fi
## 5         What's Love Got to Do with It? (1993)         Drama|Musical

Content Based Recommender System

We'll build a basic content-based recommender system using movie genres only.

Data Preparation

# Map movie Id for ratings and movies dataset
movies_new <- data.frame(movieId = unique(ratings$movieId)) %>% merge(movies, by = "movieId")
ratings_new <- merge(ratings, movies, by = "movieId") %>% select(-c(title, genres))

# Convert to data frame
genres <- as.data.frame(movies_new$genres, stringsAsFactors = FALSE)

# Split the genres for each row and transpose
genres_2 <- as.data.frame(tstrsplit(genres[, 1], '[|]', type.convert = TRUE), stringsAsFactors = FALSE)

# Name the columns from 1 to 7
colnames(genres_2) <- c(1:7)

# Create a matrix with one column per unique genre, indicating whether each genre is present in each movie

## Find unique genres
genre_list <- str_c(c(movies$genres),collapse = ',') 
genre_list <- gsub("\\|", ",", genre_list)
genre_list <- unique(strsplit(genre_list, ",")[[1]])
genre_list
##  [1] "Adventure"          "Animation"          "Children"          
##  [4] "Comedy"             "Fantasy"            "Romance"           
##  [7] "Drama"              "Action"             "Crime"             
## [10] "Thriller"           "Horror"             "Mystery"           
## [13] "Sci-Fi"             "War"                "Musical"           
## [16] "Documentary"        "IMAX"               "Western"           
## [19] "Film-Noir"          "(no genres listed)"
# Empty matrix
genre_matrix <- matrix(0, length(movies_new$movieId) + 1, length(genre_list))

# Set first row to genre list
genre_matrix[1, ] <- genre_list

# Set column names to genre list
colnames(genre_matrix) <- genre_list 
 
# Iterate through matrix
for (i in 1:nrow(genres_2)) {
 for (c in 1:ncol(genres_2)) {
 genmat_col = which(genre_matrix[1, ] == genres_2[i, c])
 genre_matrix[i + 1, genmat_col] <- 1
 }
}
 
# Convert into dataframe

## Remove first row (genre list)
genre_matrix_2 <- as.data.frame(genre_matrix[-1, ], stringsAsFactors = FALSE)

## Convert from characters to integers
for (c in 1:ncol(genre_matrix_2)) {
  genre_matrix_2[, c] <- as.integer(genre_matrix_2[, c])
} 

head(genre_matrix_2)
##   Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 1         1         1        1      1       1       0     0      0     0
## 2         1         0        1      0       1       0     0      0     0
## 3         0         0        0      1       0       1     0      0     0
## 4         0         0        0      1       0       1     1      0     0
## 5         0         0        0      1       0       0     0      0     0
## 6         0         0        0      0       0       0     0      1     1
##   Thriller Horror Mystery Sci-Fi War Musical Documentary IMAX Western
## 1        0      0       0      0   0       0           0    0       0
## 2        0      0       0      0   0       0           0    0       0
## 3        0      0       0      0   0       0           0    0       0
## 4        0      0       0      0   0       0           0    0       0
## 5        0      0       0      0   0       0           0    0       0
## 6        1      0       0      0   0       0           0    0       0
##   Film-Noir (no genres listed)
## 1         0                  0
## 2         0                  0
## 3         0                  0
## 4         0                  0
## 5         0                  0
## 6         0                  0

Create a User Profile Matrix

tic()

# Convert the ratings into a binary format, where ratings of 4 and 5 are mapped to 1, and ratings of 3 and below are mapped to -1

binary_ratings <- ratings_new
binary_ratings <- binary_ratings %>% 
  mutate(rating = ifelse(rating==4|rating==5, 1,
                               ifelse(rating==1|rating==2|rating==3, -1, NA)))


# Transform from a long format to a wide format
binary_ratings_2 <- dcast(binary_ratings, movieId~userId, value.var = "rating", na.rm = FALSE)

# Convert NA to 0
binary_ratings_2[is.na(binary_ratings_2)] <- 0

# Remove movie Id column
binary_ratings_2 = binary_ratings_2[, -1]

# Calculate dot product of the movie genre matrix and the binary ratings matrix
x <- as.matrix(genre_matrix_2)
y <- as.matrix(binary_ratings_2)
result = t(x) %*% y


# Convert to binary scale
result <- ifelse(result > 0, 1, 0)

prof_mtx_time <- toc(quiet = TRUE)
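As a sanity check on the matrix algebra: x is movies x genres and y is movies x users, so t(x) %*% y yields a genres x users profile matrix, one column per user (a sketch):

# Dimension check: t(x) is genres x movies, y is movies x users,
# so the user profile matrix `result` is genres x users
dim(t(x)); dim(y); dim(result)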

Calculate Jaccard Distance

tic()

# First user's profile
result_2 <- result[, 1]

# Calculate Jaccard Distance to measure the similarity between user profiles and the movie genre matrix
sim_mtx <- rbind.data.frame(result_2, genre_matrix_2)
sim_mtx <- data.frame(lapply(sim_mtx, function(x){as.integer(x)}))
sim_results <- dist(sim_mtx, method = "Jaccard")
sim_results <- as.data.frame(as.matrix(sim_results[1:nrow(binary_ratings_2)]))
rows <- which(sim_results == min(sim_results))

sim_dist_time <- toc(quiet = TRUE)
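To make the metric concrete, here is a toy example on two made-up binary genre vectors; the Jaccard distance is one minus the intersection size over the union size:

# Toy Jaccard distance on two invented binary vectors
a <- c(1, 0, 1, 1)
b <- c(1, 1, 0, 1)
1 - sum(a & b) / sum(a | b)  # intersection = 2, union = 4, distance = 0.5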

Recommendations

# Movies recommended to the first user
movies_rec <- movies_new[rows, ]

kable(movies_rec, format = "html", row.names = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
movieId title genres
1907 Mulan (1998) Adventure|Animation|Children|Comedy|Drama|Musical|Romance

Distributed System

Spark

Data Preparation

Following the data exploration and preparation above, we do the same thing here: remove movies that have only a few ratings and users who have rated only a few movies, since these ratings might be biased. Again, we set a minimum threshold of 50 ratings.

# Connect to your Spark cluster
spark_conn <- spark_connect(master = "local")

# Copy ratings matrix to Spark
ratings_tbl <- copy_to(spark_conn, ratings, overwrite=TRUE)

# Keep only users and movies with more than 50 ratings
ratings_tbl <- ratings_tbl %>% 
  group_by(userId) %>% 
  dplyr::mutate(count = n()) %>% 
  filter(count > 50)
## Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## Please use the `.add` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
ratings_tbl <- ratings_tbl %>% 
  select(-count) %>% 
  group_by(movieId) %>% 
  dplyr::mutate(count = n()) %>% 
  filter(count >50)
ratings_tbl <- ratings_tbl %>% select(-count)

Split Dataset

# Split the dataset into training and test set
partitions <- ratings_tbl %>% sdf_random_split(training = 0.8, test = 0.2, seed = 123)
## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
## Please use `eval_tidy()` with a data mask instead.
## This warning is displayed once per session.
## Warning: `overscope_clean()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.

ALS

set.seed(456)

# Train the ALS model
tic()
als_model <- ml_als(partitions$training, rating_col = "rating", user_col = "userId", item_col = "movieId")
train_time_sp <- toc(quiet = TRUE)

# Predict rating using test set
tic()
als_pred <- ml_predict(als_model, partitions$test)
predict_time_sp <- toc(quiet = TRUE)

# Return the top 5 recommendations
tic()
als_rec <- ml_recommend(als_model, type = "item", 5) %>% select(-recommendations)
top_n_time_sp <- toc(quiet =TRUE)

Top 5 Recommendations for the 1st User

# Recommendations for the first user.
first_user_sp <- als_rec %>% 
  filter(userId==13) %>% 
  select(-c(userId, rating)) %>% 
  merge(movies, by = "movieId") %>% 
  select(-movieId)

first_user_sp
##                                                                            title
## 1                                      Star Wars: Episode IV - A New Hope (1977)
## 2                                               Shawshank Redemption, The (1994)
## 3                                                        Schindler's List (1993)
## 4 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
## 5                                                        Sixth Sense, The (1999)
##                    genres
## 1 Action|Adventure|Sci-Fi
## 2             Crime|Drama
## 3               Drama|War
## 4        Action|Adventure
## 5    Drama|Horror|Mystery

H2O

Data Preparation

Connect to H2O using the rsparkling package.

# Specify the H2O configuration needed to start and run the H2O-3 cluster
h2oConf <- H2OConf()

# Create H2O Context
hc <- H2OContext.getOrCreate(h2oConf)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         8 seconds 471 milliseconds 
##     H2O cluster timezone:       Etc/UTC 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.6 
##     H2O cluster version age:    15 days  
##     H2O cluster name:           sparkling-water-rstudio_local-1594907310140 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.78 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          172.31.35.234 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, Amazon S3, Sparkling Water REST API Extensions, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.0 (2019-04-26) 
## 
## Reference class object of class "H2OContext"
## Field "jhc":
## <jobj[94]>
##   org.apache.spark.h2o.H2OContext
##   
## Sparkling Water Context:
##  * Sparkling Water Version: 3.30.0.6-1-2.4
##  * H2O name: sparkling-water-rstudio_local-1594907310140
##  * cluster size: 1
##  * list of used nodes:
##   (executorId, host, port)
##   ------------------------
##   (0,172.31.35.234,54321)
##   ------------------------
## 
##   Open H2O Flow in browser: http://localhost:54323 (CMD + click in Mac OSX)
## 
## 

We use the same Spark DataFrame and convert it to an H2O frame.

# Convert the ratings_tbl Spark DataFrame into an H2O frame
ratings_tbl_hf <- hc$asH2OFrame(ratings_tbl)

Split Dataset

# Split the dataset into training and test set
splits_2 <- h2o.splitFrame(ratings_tbl_hf, ratios = 0.8, seed = 123)

GLM

We are going to build a GLM model that is able to predict how well a user will like a movie they haven't seen.

y = "rating"
x = setdiff(names(ratings_tbl_hf), y)

# Train the model
tic()

glm_model <-  h2o.glm(x = x, 
                       y = y, 
                       training_frame = splits_2[[1]], 
                       validation_frame = splits_2[[2]], 
                       lambda_search = TRUE, 
                       seed = 455)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
train_time_h2o <- toc(quiet =TRUE)

# Get top level summary information on our model
summary(glm_model)
## Model Details:
## ==============
## 
## H2ORegressionModel: glm
## Model Key:  GLM_model_R_1594907336110_1 
## GLM Model: summary
##     family     link                               regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.03036 )
##                                                                 lambda_search
## 1 nlambda = 100, lambda.max = 0.1345, lambda.min = 0.03036, lambda.1se = -1.0
##   number_of_predictors_total number_of_active_predictors
## 1                          2                           1
##   number_of_iterations  training_frame
## 1                   17 RTMP_sid_bfc1_1
## 
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  1.229628
## RMSE:  1.108886
## MAE:  0.9259446
## RMSLE:  0.2835432
## Mean Residual Deviance :  1.229628
## R^2 :  0.003459827
## Null Deviance :905653
## Null D.o.F. :733977
## Residual Deviance :902519.6
## Residual D.o.F. :733976
## AIC :2234667
## 
## 
## H2ORegressionMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  1.226579
## RMSE:  1.10751
## MAE:  0.9245265
## RMSLE:  0.2832008
## Mean Residual Deviance :  1.226579
## R^2 :  0.003648224
## Null Deviance :225385.5
## Null D.o.F. :183080
## Residual Deviance :224563.3
## Residual D.o.F. :183079
## AIC :556957.8
## 
## 
## 
## 
## Scoring History: 
##              timestamp   duration iteration lambda predictors
## 1  2020-07-16 13:49:09  0.000 sec         1  .13E0          1
## 2  2020-07-16 13:49:09  0.157 sec         2  .12E0          2
## 3  2020-07-16 13:49:09  0.164 sec         3  .11E0          2
## 4  2020-07-16 13:49:09  0.172 sec         4   .1E0          2
## 5  2020-07-16 13:49:09  0.180 sec         5 .93E-1          2
## 6  2020-07-16 13:49:09  0.187 sec         6 .84E-1          2
## 7  2020-07-16 13:49:09  0.193 sec         7 .77E-1          2
## 8  2020-07-16 13:49:09  0.201 sec         8  .7E-1          2
## 9  2020-07-16 13:49:09  0.208 sec         9 .64E-1          2
## 10 2020-07-16 13:49:09  0.213 sec        10 .58E-1          2
## 11 2020-07-16 13:49:09  0.220 sec        11 .53E-1          2
## 12 2020-07-16 13:49:09  0.226 sec        12 .48E-1          2
## 13 2020-07-16 13:49:09  0.232 sec        13 .44E-1          2
## 14 2020-07-16 13:49:09  0.240 sec        14  .4E-1          2
## 15 2020-07-16 13:49:09  0.246 sec        15 .37E-1          2
## 16 2020-07-16 13:49:09  0.252 sec        16 .33E-1          2
## 17 2020-07-16 13:49:09  0.259 sec        17  .3E-1          2
##    deviance_train deviance_test
## 1           1.234         1.231
## 2           1.233         1.230
## 3           1.233         1.230
## 4           1.232         1.229
## 5           1.232         1.229
## 6           1.231         1.228
## 7           1.231         1.228
## 8           1.231         1.228
## 9           1.230         1.227
## 10          1.230         1.227
## 11          1.230         1.227
## 12          1.230         1.227
## 13          1.230         1.227
## 14          1.230         1.227
## 15          1.230         1.227
## 16          1.230         1.227
## 17          1.230         1.227
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
##   variable relative_importance scaled_importance percentage
## 1  movieId          0.05129975                 1          1
## 2   userId          0.00000000                 0          0
# Performance on the test dataset
glm_perf <- h2o.performance(glm_model, valid = T)
glm_perf
## H2ORegressionMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  1.226579
## RMSE:  1.10751
## MAE:  0.9245265
## RMSLE:  0.2832008
## Mean Residual Deviance :  1.226579
## R^2 :  0.003648224
## Null Deviance :225385.5
## Null D.o.F. :183080
## Residual Deviance :224563.3
## Residual D.o.F. :183079
## AIC :556957.8
# Generate predictions on the test set
tic()

glm_pred <- h2o.predict(glm_model, newdata = splits_2[[2]])
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
predict_time_h2o <- toc(quiet =TRUE)

Exploring the Model Performance

We can see from the plots that the predicted ratings mostly fall between 3.3 and 3.8, so the model does not predict well. The model can be further improved by tuning parameters such as alpha and lambda, as sketched below.
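As a sketch of that tuning (the alpha grid values here are arbitrary choices, not values from this project), h2o.grid can search over alpha while lambda_search finds lambda for each model:

# Hypothetical alpha grid for the GLM; lambda still found by lambda_search
glm_grid <- h2o.grid("glm",
                     x = x, y = y,
                     training_frame   = splits_2[[1]],
                     validation_frame = splits_2[[2]],
                     hyper_params     = list(alpha = c(0, 0.25, 0.5, 0.75, 1)),
                     lambda_search    = TRUE,
                     seed             = 455)
h2o.getGrid(glm_grid@grid_id, sort_by = "rmse")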

# Convert from H2O Frame to Spark Data Frame
predicted <- hc$asSparkFrame(glm_pred)

# Extract the true 'rating' values from our test dataset
actual <- hc$asSparkFrame(splits_2[[2]]) %>% 
  select(rating) %>%
  collect() %>%
  `[[`("rating")

# Produce a data frame housing the predicted + actual 'rating' values
predict_actual <- data.frame(predicted = predicted, actual = actual)
names(predict_actual) <- c("predicted", "actual")

# Plot predicted vs. actual values
point_plot <- ggplot(predict_actual, aes(x = actual, y = predicted)) +
  geom_point() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(
    x = "Actual Rating",
    y = "Predicted Rating",
    title = "Predicted vs. Actual Movies Rating")

# Plot predicted rating distribution
dist_plot <- qplot(predict_actual$predicted) + 
  ggtitle("Predicted Rating Distribution") + 
  labs(x = 'Predicted Rating')

grid.arrange(point_plot, dist_plot, nrow = 1)

Evaluate & Compare

Evaluate the Accuracy

Evaluate the accuracy based on ratings for the Recommenderlab ALS model, Spark ALS model, and H2O GLM model.
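For reference, writing r_i for a held-out rating, r̂_i for its prediction, and N for the number of test ratings, the three metrics are:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|r_i - \hat{r}_i\right|, \qquad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(r_i - \hat{r}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$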

# Evaluate the accuracy for the Recommenderlab ALS model
acr_als <- calcPredictionAccuracy(pred_als_acr, test_unknown)

# Remove NaN values due to dataset splitting in Spark
als_pred <- als_pred %>% filter(!is.na(prediction)) 

# Evaluate the accuracy for the Spark ALS model
spark_mae <- als_pred  %>% 
  data.frame() %>% 
  mutate(error = abs(rating - prediction)) %>% 
  summarize(mean(error))

spark_mse <- als_pred  %>% 
  data.frame() %>% 
  mutate(error = (rating - prediction)^2) %>% 
  summarize(mean(error))
  
spark_rmse <- als_pred  %>% 
  data.frame() %>% 
  mutate(error = (rating - prediction)^2) %>% 
  summarize(sqrt(mean(error)))

Spark_ALS <- data.frame(RMSE = spark_rmse, MSE = spark_mse, MAE = spark_mae)
colnames(Spark_ALS) <- c("RMSE", "MSE", "MAE")

# Evaluate the accuracy for the H2O GLM model
h2o_mse <- glm_perf@metrics$MSE %>% data.frame()

h2o_rmse <- glm_perf@metrics$RMSE %>% data.frame()
  
h2o_mae <- glm_perf@metrics$mae %>% data.frame()

H2O_GLM <- data.frame(RMSE = h2o_rmse, MSE = h2o_mse, MAE = h2o_mae)
colnames(H2O_GLM) <- c("RMSE", "MSE", "MAE")

# Combine the RMSE, MSE, and MAE for both models
acr <- rbind("Recommenderlab ALS" = acr_als, 
             "Spark ALS" = Spark_ALS,
             "H2O GLM" = H2O_GLM)

# Update column names to RMSE, MSE, and MAE
colnames(acr) <- c("RMSE", "MSE", "MAE")

kable(acr, format = "html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
RMSE MSE MAE
Recommenderlab ALS 0.9490180 0.9006351 0.7529266
Spark ALS 0.8654353 0.7489783 0.6941973
H2O GLM 1.1075102 1.2265789 0.9245265

Evaluate the Performance

Evaluate the running time for each recommender system: Recommenderlab, Content-Based, Spark, and H2O.

# Set up data frame for running time performance
runtime <- data.frame(Method=character(), Training=double(), Predicting=double(), Top_N=double())
runtime_2 <- data.frame(Method=character(), Profile=double(), Similarity=double())

# Combine the running time for the Recommenderlab ALS model
runtime <- rbind(runtime, data.frame(Method  = "Recommenderlab",
                                     Training = round(train_time_rec$toc - train_time_rec$tic, 2),
                                     Predicting = round(predict_time_rec$toc - predict_time_rec$tic, 2),
                                     Top_N = round(top_n_time_rec$toc - top_n_time_rec$tic, 2)))

# Combine the running time for the Content-Based model.
runtime_2 <- rbind(runtime_2, data.frame(Method  = "Content-Based",
                                     Profile = round(prof_mtx_time$toc - prof_mtx_time$tic, 2),
                                     Similarity = round(sim_dist_time$toc - sim_dist_time$tic, 2)))


# Combine the running time for the Spark ALS model
runtime<- rbind(runtime, data.frame(Method  = "Spark",
                                    Training = round(train_time_sp$toc - train_time_sp$tic, 2),
                                    Predicting = round(predict_time_sp$toc - predict_time_sp$tic, 2),
                                    Top_N = round(top_n_time_sp$toc - top_n_time_sp$tic, 2)))

# Combine the running time for the H2O GLM model
runtime<- rbind(runtime, data.frame(Method  = "H2O",
                                    Training = round(train_time_h2o$toc - train_time_h2o$tic, 2),
                                    Predicting = round(predict_time_h2o$toc - predict_time_h2o$tic, 2),
                                    Top_N = NA))


# Remove row names
rownames(runtime) <- NULL
rownames(runtime_2) <- NULL

kable(runtime, format = "html", row.names = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
Method Training Predicting Top_N
Recommenderlab 6544.65 412.90 409.13
Spark 7.04 0.07 3.65
H2O 2.04 1.06 NA
kable(runtime_2, format = "html", row.names = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
Method Profile Similarity
Content-Based 6.06 0.66

Evaluate the 1st User

first_user_all <- cbind(first_user_rec$title, movies_rec$title, first_user_sp$title)
## Warning in cbind(first_user_rec$title, movies_rec$title,
## first_user_sp$title): number of rows of result is not a multiple of vector
## length (arg 1)
colnames(first_user_all) <- c("Recommenderlab", "Content-Based", "Spark")

kable(first_user_all, format = "html", row.names = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
Recommenderlab Content-Based Spark
Love Jones (1997) Mulan (1998) Star Wars: Episode IV - A New Hope (1977)
Back to the Future Part III (1990) Mulan (1998) Shawshank Redemption, The (1994)
Love Jones (1997) Mulan (1998) Schindler’s List (1993)
Back to the Future Part III (1990) Mulan (1998) Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Love Jones (1997) Mulan (1998) Sixth Sense, The (1999)

Conclusion

We have created 4 different recommender systems, distributed and non-distributed:

* Recommenderlab, ALS method
* Content-Based, movie genres
* Spark, ALS method
* H2O, GLM method

From the results, we can see that the Spark ALS recommender system performs best in terms of accuracy and speed. Accuracy could be improved further by optimizing the algorithm's hyperparameters, such as tuning the number of latent factors to find the best-fitting value, as sketched below. Because Spark is designed for speed, operating both in memory and on disk and rapidly re-running repeated queries, training the model is extremely fast even on big data. That is the biggest advantage of using a distributed data processing engine to build a recommender system.
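A minimal sketch of that tuning in sparklyr (the rank value is an arbitrary choice for illustration): raise the number of latent factors via ml_als()'s rank argument and compare RMSE on the test set.

# Hypothetical retrain with more latent factors, then RMSE on the test set
als_model_20 <- ml_als(partitions$training, rating_col = "rating",
                       user_col = "userId", item_col = "movieId", rank = 20)
ml_predict(als_model_20, partitions$test) %>%
  filter(!is.na(prediction)) %>%
  ml_regression_evaluator(label_col = "rating", metric_name = "rmse")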

Training the H2O GLM recommender system is also very fast, and slightly faster than the Spark ALS recommender system. The combination of H2O's fast, scalable machine learning algorithms with the capabilities of Spark through Sparkling Water could be what makes it faster than the Spark ALS recommender system. However, its prediction accuracy is not as good.

The Recommenderlab ALS-based recommender system took the longest to run. Although model training itself is nearly instantaneous (recommenderlab's ALS defers computation to prediction time), predicting ratings and producing top-N lists each took roughly 7 minutes on a million ratings, even though we used an AWS EC2 T2 instance with 8 vCPUs and 32 GB of memory, the highest level of computing power in that instance type. Needless to say, evaluating several Recommenderlab algorithms together takes far longer, and in fact it did: running the multiple algorithms (UBCF, SVDF, ALS) together took more than an hour. The running time could probably be improved with the G4 instance type, which is designed to accelerate computing for machine learning inference in applications like recommender systems. The accuracy produced by this recommender system is acceptable.

The content-based recommender engine is also very fast at creating user profiles and calculating Jaccard distances. The algorithms and methods used are simple and straightforward, which is likely why it is fast. But you can see that the recommendations produced for the first user contain only one movie: the algorithm keeps only the minimum distance, and more recommendations could be produced by accepting a range of small distances, as sketched below. Furthermore, the content-based recommender system could be improved to include other attributes such as the movie overview, actors, directors, and keywords, combined with methods like Term Frequency-Inverse Document Frequency (TF-IDF) to produce highly relevant recommendations.
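For example, a minimal sketch that widens the cutoff from the single minimum distance to the five smallest Jaccard distances, reusing sim_results and movies_new from above:

# Take the five movies closest to the first user's profile
rows_top5 <- order(sim_results[, 1])[1:5]
movies_new[rows_top5, ]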
