1 Project Goal

The goal for the final project is to build out a recommender system using a large dataset (e.g. 1M+ ratings or 10k+ users, 10k+ items). If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data. The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! Make a five-minute presentation of your system at our final meetup on Thursday.

2 Introduction

In this project, we build a movie recommender system with several algorithms using the MovieLens datasets, which can be found at [https://grouplens.org/datasets/movielens/latest/] or [http://grouplens.org/datasets/]. These MovieLens datasets are different from the MovieLense dataset we used in project 4. We will apply a User-Based Collaborative Filtering (UBCF) model, an Item-Based Collaborative Filtering (IBCF) model, a singular value decomposition (SVD) model, an alternating least squares (ALS) model, and a Spark ALS model to our datasets and compare their performance.

2.1 Note

To keep this project runnable on a PC while still demonstrating how to build recommender systems in RStudio, we use two MovieLens datasets. A smaller MovieLens dataset of 100k+ ratings is used to build recommender systems with the recommenderlab package in the first section. To meet the data size requirement of this project, a larger MovieLens dataset with 27M+ ratings, reduced to around 12,000 users and 12,000 movies, is used to build a recommender system with sparklyr in the second section.

4 Build Model in recommenderlab

4.2 Data Exploration

The MovieLens dataset we use in this section contains 100,836 ratings and 3,683 tag applications across 9,742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018, and generated as a dataset on September 26, 2018. All users had rated at least 20 movies. Each user is represented by an ID and no other information is provided. All ratings are made on a 5-star scale with half-star increments (0.5 stars to 5.0 stars). As the explorations below show, this dataset is quite sparse.
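The sparsity can be estimated directly from the counts quoted above. The short sketch below assumes that all 9,742 movies could appear as columns of the user-by-movie rating matrix; the variable names are ours.

```r
# Counts taken from the dataset description above
n_ratings <- 100836
n_users   <- 610
n_movies  <- 9742

# Share of user-movie cells that actually hold a rating
density  <- n_ratings / (n_users * n_movies)
sparsity <- 1 - density

cat(sprintf("Filled cells: %.2f%%, empty cells: %.2f%%\n",
            100 * density, 100 * sparsity))
```

Under this assumption, fewer than 2% of the cells in the rating matrix are filled.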

4.2.1 Data: Ratings

Here is a glimpse of the ratings dataset.

We can take a look at the ratings dataset in both long format and wide format below.

## Observations: 100,836
## Variables: 4
## $ userId    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ movieId   <int> 1, 3, 6, 47, 50, 70, 101, 110, 151, 157, 163, 216, 2...
## $ rating    <dbl> 4, 4, 4, 5, 5, 3, 5, 4, 5, 5, 5, 5, 3, 5, 4, 5, 3, 3...
## $ timestamp <int> 964982703, 964981247, 964982224, 964983815, 96498293...
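The long-to-wide conversion can be sketched with recommenderlab's coercion methods. The toy data frame below is hypothetical and only mirrors the column layout shown above; it is not the project's actual ratings file.

```r
library(recommenderlab)

# Toy long-format ratings (hypothetical values), one row per user-movie pair
ratings_long <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(1, 3, 1, 6, 3),
  rating  = c(4, 4, 5, 3, 2.5)
)

# Coerce (user, item, rating) triples to a sparse user x movie rating matrix
rating_matrix <- as(ratings_long, "realRatingMatrix")
rating_matrix

# Wide view: rows are users, columns are movies, NA marks an unrated movie
as(rating_matrix, "matrix")
```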

The dataset is very sparse, as the heatmap of a subset of the dataset below shows.

heatmap of partial dataset


In this dataset, every user rated at least 20 movies. Most of them rated fewer than 60 movies.

Most movies are rated by no more than 5 users.

4.2.2 Data: Movie

Here is a glimpse of the movie dataset.

It contains movie IDs, movie names and genres.

## Observations: 9,742
## Variables: 3
## $ movieId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ title   <fct> "Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Me...
## $ genres  <fct> Adventure|Animation|Children|Comedy|Fantasy, Adventure...

4.2.3 User Similarity

The heatmap shows the similarity between users. White cells represent missing data.

4.2.4 Item Similarity

The heatmap shows the similarity between items. White cells represent missing data.
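As a minimal sketch of how such similarity heatmaps can be produced, the example below uses recommenderlab's similarity() function on a small toy matrix. The ratings and the choice of cosine similarity are our assumptions; the original heatmaps may have used a different subset or measure.

```r
library(recommenderlab)

# Toy ratings (hypothetical values); NA marks an unrated movie
m <- matrix(
  c( 5,  4, NA,  1,
     4,  5,  2, NA,
     1, NA,  5,  4,
    NA,  2,  4,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)
r <- as(m, "realRatingMatrix")

# Pairwise similarity between rows (users) and between columns (items)
user_sim <- as.matrix(similarity(r, method = "cosine", which = "users"))
item_sim <- as.matrix(similarity(r, method = "cosine", which = "items"))

# Draw each similarity matrix as a heatmap
image(user_sim, main = "User similarity")
image(item_sim, main = "Item similarity")
```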

4.4 Build Models

We build models using the recommenderlab library with different algorithms. We implement a User-Based Collaborative Filtering (UBCF) model, an Item-Based Collaborative Filtering (IBCF) model, a singular value decomposition (SVD) model and an alternating least squares (ALS) model, and compare their performance.

4.4.2 UBCF Models

We evaluate three User-Based Collaborative Filtering (UBCF) models built with the recommenderlab package, each using mean-centering normalization and one of three similarity measures (Pearson correlation, Euclidean distance and cosine distance).

## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/0.04sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.06sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.03sec]

The results of the three UBCF models are plotted below as ROC and precision-recall curves.

The UBCF model with Pearson correlation performs the best among the three models.
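A comparison of this kind can be sketched with recommenderlab's evaluationScheme() and evaluate() functions. This is a sketch under stated assumptions, not the exact code behind the output above: it uses the MovieLense data bundled with recommenderlab as a stand-in, and the split ratio, given, goodRating and list lengths n are illustrative values.

```r
library(recommenderlab)

# Stand-in data: MovieLense ships with recommenderlab (not the project's dataset)
data(MovieLense)
ratings <- MovieLense[rowCounts(MovieLense) > 50, ]

# Hold out 20% of users; for each test user, 15 ratings are given to the model
set.seed(1)
scheme <- evaluationScheme(ratings, method = "split", train = 0.8,
                           given = 15, goodRating = 4)

# Same algorithm and normalization, three different similarity measures
ubcf_models <- list(
  UBCF_pearson   = list(name = "UBCF",
                        param = list(normalize = "center", method = "pearson")),
  UBCF_euclidean = list(name = "UBCF",
                        param = list(normalize = "center", method = "euclidean")),
  UBCF_cosine    = list(name = "UBCF",
                        param = list(normalize = "center", method = "cosine"))
)

# Evaluate top-N lists of several lengths for each model
ubcf_results <- evaluate(scheme, ubcf_models, type = "topNList",
                         n = c(1, 5, 10, 20, 30))

plot(ubcf_results, annotate = TRUE, legend = "topleft")  # ROC curves
plot(ubcf_results, "prec/rec", annotate = TRUE)          # precision-recall curves
```

The same pattern applies to the IBCF, SVD and ALS comparisons by changing name and the parameter list.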

4.4.3 IBCF Models

We evaluate three Item-Based Collaborative Filtering (IBCF) models built with the recommenderlab package, each using mean-centering normalization and one of three similarity measures (Pearson correlation, Euclidean distance and cosine distance).

## IBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0.02sec]

The results of the three IBCF models are plotted below as ROC and precision-recall curves.

The IBCF model with Euclidean distance performs the best among the three models.

4.4.4 SVD Models

We evaluate three Singular Value Decomposition (SVD) models built with the recommenderlab package: one without normalization, one with mean-centering normalization and one with z-score normalization.

## SVD run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.02sec/0sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec]

The results of the three SVD models are plotted below as ROC and precision-recall curves.

The SVD model without normalization performs the best among the three models.
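To make the normalization options concrete, the sketch below applies recommenderlab's normalize() function to a toy matrix. The ratings are hypothetical; "center" subtracts each user's mean rating, and "Z-score" additionally divides by the user's standard deviation.

```r
library(recommenderlab)

# Toy ratings (hypothetical values); NA marks an unrated movie
m <- matrix(
  c(5,  4, NA, 1,
    4, NA,  2, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("u1", "u2"), paste0("m", 1:4))
)
r <- as(m, "realRatingMatrix")

# Mean-centering: each user's ratings are shifted to have mean 0
centered <- normalize(r, method = "center")

# Z-score: centered ratings are also scaled by the user's standard deviation
zscored <- normalize(r, method = "Z-score")

as(centered, "matrix")
as(zscored, "matrix")
```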

4.4.5 ALS Models

We evaluate three alternating least squares (ALS) models built with the recommenderlab package: one without normalization, one with mean-centering normalization and one with z-score normalization.

## ALS run fold/sample [model time/prediction time]
##   1  [0sec/2.36sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/2.39sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/2.39sec]

The results of the three ALS models are plotted below as ROC and precision-recall curves.

The ALS model without normalization performs the best among the three models.

4.4.6 Metrics

We now study the error metrics (RMSE, MSE and MAE) of the best model from each algorithm and compare their performance.
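Error metrics of this kind can be obtained with recommenderlab's calcPredictionAccuracy() function, which returns RMSE, MSE and MAE for predicted ratings. The sketch below is illustrative only: it uses the bundled MovieLense data as a stand-in, default algorithm parameters rather than the tuned models above, and a hypothetical helper named error_for().

```r
library(recommenderlab)

# Stand-in data: MovieLense ships with recommenderlab (not the project's dataset)
data(MovieLense)
ratings <- MovieLense[rowCounts(MovieLense) > 50, ]

set.seed(1)
scheme <- evaluationScheme(ratings, method = "split", train = 0.8,
                           given = 15, goodRating = 4)

# Hypothetical helper: train on the training users, predict the withheld
# ratings of the test users, and compare them with the true values
error_for <- function(method) {
  model <- Recommender(getData(scheme, "train"), method = method)
  pred  <- predict(model, getData(scheme, "known"), type = "ratings")
  calcPredictionAccuracy(pred, getData(scheme, "unknown"))
}

metrics <- rbind(
  ALS_error  = error_for("ALS"),
  SVD_error  = error_for("SVD"),
  IBCF_error = error_for("IBCF"),
  UBCF_error = error_for("UBCF")
)

# Sort so that the model with the lowest RMSE comes first
metrics[order(metrics[, "RMSE"]), ]
```

Since MSE is the square of RMSE, the two columns always rank the models identically; MAE can differ because it penalizes large errors less heavily.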

4.4.7 Conclusion

Comparing the metrics shows that the non-normalized alternating least squares (ALS) model performs the best, with the lowest RMSE among all our models.

Metrics Comparison

Model        RMSE       MSE        MAE
ALS_error    0.8946856  0.8004623  0.6858065
SVD_error    0.9308717  0.8665221  0.7042577
IBCF_error   1.0076847  1.0154284  0.7302223
UBCF_error   1.0593320  1.1221843  0.7790488

5 Build Model Using Spark

In the section above, we concluded that the ALS model performs better than the UBCF, IBCF and SVD models. In this second section, we build a recommender system with an alternating least squares (ALS) model using the sparklyr library.

5.1 Create Local Spark Connection

Configure a local Spark server and allocate 50% of the PC's available memory to Spark.
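A minimal sketch of such a connection with sparklyr is shown below. The "8G" value is a hypothetical example (half of a 16 GB machine) and should be adapted to the actual system; it also assumes that a local Spark installation is available, for example one installed with spark_install().

```r
library(sparklyr)

# Start from the default configuration and raise the driver memory.
# "8G" is a hypothetical value: roughly 50% of a 16 GB machine.
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "8G"

# Connect to a local Spark instance using this configuration
sc <- spark_connect(master = "local", config = conf)
```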

5.2 Import Large MovieLens Dataset

5.2.1 Data Exploration

The large MovieLens dataset contains 27,753,444 ratings and 1,108,997 tag applications across 58,098 movies. These data were created by 283,228 users between January 09, 1995 and September 26, 2018, and generated as a dataset on September 26, 2018. All users had rated at least 1 movie. Each user is represented by an ID and no other information is provided. All ratings are made on a 5-star scale with half-star increments (0.5 stars to 5.0 stars).

Because the full dataset is too large for our system's memory, we reduce the ratings dataset to roughly 12,900 users and 12,100 movies with about 1.15 million ratings. This keeps the project executable on a PC while still meeting the project's size requirement.

## 12924 x 12057 rating matrix of class 'realRatingMatrix' with 1151103 ratings.

5.3 Copy Data to Spark

Copy the movies and ratings datasets to Spark. Note that [user IDs] and [movie IDs] are renamed to [user] and [item] respectively, because ml_als uses [user] and [item] as its default column names.

5.5 Train Model

Train an ALS recommendation model in Spark using the ml_als function.

5.6 Make Prediction

Predict ratings using the ml_predict function.

5.7 Calculate RMSE

Calculate RMSE of the Spark recommendation model using the testing set.

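The RMSE is about 0.86 on the 0.5 to 5 star rating scale, slightly lower than the 0.89 of the best recommenderlab model in the first section. The two values were computed on different datasets, so the comparison is only indicative.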

## [1] "RMSE of Model Built in Spark: 0.859867755595935"

5.8 Make Top 10 Recommendations for Each User

Create the top 10 item recommendations for all users. The top 10 movie recommendations for the first 5 users are shown below as an example.
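Sections 5.3 through 5.8 can be sketched end to end as follows. This is a sketch under stated assumptions rather than the project's actual code: the ratings are small, randomly generated and hypothetical (so the resulting RMSE carries no meaning), and the split ratio, rank, reg_param and max_iter values are illustrative. It assumes a local Spark installation.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Synthetic ratings (hypothetical): 50 users x 40 movies, half of the cells rated
set.seed(42)
grid    <- expand.grid(userId = 1:50, movieId = 1:40)
ratings <- grid[sample(nrow(grid), 1000), ]
ratings$rating <- sample(seq(0.5, 5, by = 0.5), 1000, replace = TRUE)

# Rename the ID columns to the default names used by ml_als(), then copy to Spark
ratings_renamed <- rename(ratings, user = userId, item = movieId)
ratings_tbl <- copy_to(sc, ratings_renamed, name = "ratings", overwrite = TRUE)

# Split into training and testing sets
splits <- sdf_random_split(ratings_tbl, training = 0.8, testing = 0.2, seed = 42)

# Train the ALS model; "drop" removes predictions for unseen users or items
model <- ml_als(splits$training, rating ~ user + item,
                rank = 10, reg_param = 0.1, max_iter = 10,
                cold_start_strategy = "drop")

# Predict ratings for the testing set and compute the RMSE locally
predictions <- collect(ml_predict(model, splits$testing))
rmse <- sqrt(mean((predictions$rating - predictions$prediction)^2))
print(paste("RMSE of Model Built in Spark:", rmse))

# Top 10 item recommendations per user; show the first 5 users
top10 <- ml_recommend(model, type = "items", n = 10) %>%
  select(user, item, rating) %>%
  filter(user <= 5) %>%
  arrange(user, desc(rating)) %>%
  collect()
top10

spark_disconnect(sc)
```

Computing the RMSE on collected predictions keeps the formula explicit; for large test sets, an evaluator that runs inside Spark avoids pulling the predictions into R.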

5.9 Disconnect from Spark

Disconnect the R session from Spark.