1 Objective

The goal of this assignment is to practice working with accuracy and other recommender system metrics. Tasks include the following:

  1. Compare the accuracy of at least two recommender system algorithms against an offline dataset.
  2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.
  3. Compare and report on any change in accuracy before and after the change made in #2.
  4. As part of the textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how a reasonable online evaluation environment could be designed.

2 Data source and approach

For this project, we use the MovieLense 100k dataset of movie ratings, a dataset I have not used in prior projects. It includes about 100,000 ratings of 1,664 movies from 943 users, with ratings ranging from 1 (worst) to 5 (best).

We use the recommenderlab package to develop and test our recommender system using different recommendation algorithms. The MovieLense 100k dataset is included with recommenderlab.

library(tidyverse)
library(recommenderlab)
library(knitr)

3 Data preparation

We start by loading the data and preparing it for the recommender system. We see that the full ratings matrix includes over 99 thousand ratings, out of a total of 1.57 million user-movie pairs. This implies that the ratings matrix is 94% sparse.

# load data
data("MovieLense")
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
# examine the ratings matrix
head(colnames(MovieLense), 10)
##  [1] "Toy Story (1995)"                                    
##  [2] "GoldenEye (1995)"                                    
##  [3] "Four Rooms (1995)"                                   
##  [4] "Get Shorty (1995)"                                   
##  [5] "Copycat (1995)"                                      
##  [6] "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"
##  [7] "Twelve Monkeys (1995)"                               
##  [8] "Babe (1995)"                                         
##  [9] "Dead Man Walking (1995)"                             
## [10] "Richard III (1995)"
getRatingMatrix(MovieLense)[1:10, 1:30] 
## 10 x 30 sparse Matrix of class "dgCMatrix"
##                                                               
## 1  5 3 4 3 3 5 4 1 5 3 2 5 5 5 5 5 3 4 5 4 1 4 4 3 4 3 2 4 1 3
## 2  4 . . . . . . . . 2 . . 4 4 . . . . 3 . . . . . 4 . . . . .
## 3  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 4  . . . . . . . . . . 4 . . . . . . . . . . . . . . . . . . .
## 5  4 3 . . . . . . . . . . . . . . 4 . . . 3 . . 4 3 . . . 4 .
## 6  4 . . . . . 2 4 4 . . 4 2 5 3 . . . 4 . 3 3 4 . . . . 2 . .
## 7  . . . 5 . . 5 5 5 4 3 5 . . . . . . . . . 5 3 . 3 . 4 5 3 .
## 8  . . . . . . 3 . . . 3 . . . . . . . . . . 5 . . . . . . . .
## 9  . . . . . 5 4 . . . . . . . . . . . . . . . . . . . . . . .
## 10 4 . . 4 . . 4 . 4 . 4 5 3 . . 4 . . . . . 5 5 . . . . . . .
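
The sparsity quoted above can be verified directly from the matrix dimensions and the number of stored ratings (a quick check, not part of the original workflow):

# verify the sparsity of the full ratings matrix
n_ratings <- sum(rowCounts(MovieLense))          # number of observed ratings
n_pairs   <- nrow(MovieLense) * ncol(MovieLense) # total user-movie pairs
1 - n_ratings / n_pairs                          # ~0.94, i.e. about 94% sparse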

Let’s take a look at the distribution of ratings per user and ratings per movie. Both distributions are right-skewed: most users and movies have relatively few ratings, with long tails of heavy raters and frequently rated movies to the right.

# distribution of ratings per user
hist(rowCounts(MovieLense), breaks = 40, main = "Distribution of Rating Count per User - Raw")

summary(rowCounts(MovieLense))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0    32.0    64.0   105.4   147.5   735.0
# distribution of ratings per movie
hist(colCounts(MovieLense), breaks = 40, main = "Distribution of Rating Count per Movie - Raw")

summary(colCounts(MovieLense))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   27.00   59.73   80.00  583.00

To make the model development process run efficiently on my PC, we remove a portion of the users and movies with low rating counts. Specifically, we filter the dataset to include only:

  • Users who have rated at least 50 movies
  • Movies that have been rated by at least 100 users.

The resulting ratings matrix contains 55,832 ratings from 565 users and 336 movies; the matrix is almost 71% sparse.

# drop users & movies with low rating counts
r <- MovieLense[rowCounts(MovieLense) >= 50, colCounts(MovieLense) >= 100]
r
## 565 x 336 rating matrix of class 'realRatingMatrix' with 55832 ratings.

Note that the reduced distributions of ratings per user and per movie still include users with fewer than 50 ratings and movies with fewer than 100 ratings. This arises because we filtered users and movies simultaneously: some of the dropped ratings had contributed to the original rating counts of the surviving users and movies.

# distribution of ratings per user
hist(rowCounts(r), breaks = 40, main = "Distribution of Rating Count per User - Reduced")

summary(rowCounts(r))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   54.00   88.00   98.82  131.00  269.00
# distribution of ratings per movie
hist(colCounts(r), breaks = 40, main = "Distribution of Rating Count per Movie - Reduced")

summary(colCounts(r))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    53.0   113.0   148.0   166.2   204.0   460.0
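
As a direct check on the note above, we can count how many surviving users and movies now fall below the original thresholds (a quick check on the reduced matrix r):

# surviving users/movies that dropped below the original thresholds
sum(rowCounts(r) < 50)   # users now holding fewer than 50 ratings
sum(colCounts(r) < 100)  # movies now holding fewer than 100 ratings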

4 Recommender models

4.1 Development

For our analysis of ratings accuracy, we develop several recommender models based on the following algorithms:

  • RANDOM: randomly chosen items
  • SVD: singular value decomposition method
  • SVDF: singular value decomposition method with stochastic gradient descent optimization (Funk)
  • ALS: alternating least squares algorithm for latent factors.

Note that the SVD, SVDF, and ALS algorithms each depend on one or more calculation parameters. Normally we would undertake an optimization exercise to determine the parameter values that achieve the lowest error rate. However, since the goal of this project is to compare accuracy metrics, we will not do this optimization, and instead we will rely on the default parameters for the various algorithms.
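
For reference, the defaults for each method can be inspected through the recommenderlab registry, and alternative values could be supplied via the param element of the algorithm list if tuning were performed (a minimal sketch; the k value shown in the comment is illustrative only):

# inspect the default parameters for a method, e.g. SVD
recommenderRegistry$get_entry("SVD", dataType = "realRatingMatrix")
# if tuning were performed, non-default values could be passed like:
# "SVD" = list(name = "SVD", param = list(k = 20))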

For development of the recommender models, we use the following evaluation scheme:

  • 90/10 split of the ratings data into training and holdout / test samples.
  • For each user in the holdout sample, the model is given 15 of that user’s ratings (“given-15 protocol”) and then predicts the remaining ratings.
  • For predicted ratings, a rating of 3.5 or higher will be classified as a good rating.

From the data preparation process earlier, recall that in the reduced ratings dataset the minimum number of ratings per user is 18 and the median / mean is 88 / 99; so every test user has at least 15 ratings to give the model, plus additional held-out ratings against which prediction accuracy can be evaluated.

# evaluation scheme
scheme <- evaluationScheme(r, method = "split", train = 0.9, given = 15, goodRating = 3.5)
scheme
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: >=3.500000
## Data set: 565 x 336 rating matrix of class 'realRatingMatrix' with 55832 ratings.
# algorithms; use default parameters
algorithms <- list(
    "Random" = list(name="RANDOM", param = NULL),
    "SVD" = list(name="SVD", param = NULL),
    "SVDF" = list(name="SVDF", param = NULL),
    "ALS" = list(name="ALS", param = NULL)
)

4.2 Evaluation (Deliverable #1)

To evaluate the prediction accuracy of the various recommender models, we train the models on the training set and then use the test set to predict ratings (via the given-15 protocol). We measure the accuracy of predicted versus actual ratings using the following error metrics (a manual computation is sketched after the list):

  • RMSE: root mean squared error
  • MSE: mean squared error
  • MAE: mean absolute error.
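
The following is a minimal sketch of how these metrics are defined in terms of prediction errors; recommenderlab’s calcPredictionAccuracy reports the same three quantities for the held-out ratings:

# manual definitions of the error metrics, computed over user-movie pairs
# for which both a prediction and an actual rating exist
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2, na.rm = TRUE))
mse  <- function(pred, actual) mean((pred - actual)^2, na.rm = TRUE)
mae  <- function(pred, actual) mean(abs(pred - actual), na.rm = TRUE)
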
# run algorithms for predicted ratings
results <- evaluate(scheme, algorithms, type = "ratings")
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.02sec/0.1sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.08sec/0.01sec] 
## SVDF run fold/sample [model time/prediction time]
##   1  [21.76sec/2.08sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/11.35sec]
plot(results)
title(main = "Model Error: Predicted vs. Actual Ratings")

As expected, the SVD, SVDF, and ALS methods are all more accurate than randomly chosen ratings. Interestingly, while the SVDF method has slightly lower error metrics than the ALS method, it takes substantially more time to estimate the model (21.8 sec vs. ~0 sec) but less time to predict ratings (2.1 sec vs. 11.4 sec).

Finally, note again that we have not optimized the parameters for these algorithms; the results above rely on the default parameters.

5 Hybrid recommender model

Based on the algorithms implemented above, the recommender models could be used to generate top-N movie recommendations for a given user, based on the highest predicted ratings among movies the user hasn’t rated yet. As we saw in a prior project comparing the UBCF (user-based collaborative filtering), IBCF (item-based collaborative filtering), and POPULAR algorithms, the recommended movies tend to be dominated by the most popular titles. This results from the strong correlation between ratings and popularity: highly rated movies tend to be the most popular movies, and hence are recommended more often regardless of the algorithm. To counteract this effect, we add diversity to the recommended movies by creating a hybrid model.

5.1 Development (Deliverable #2)

Let’s consider increasing the diversity of recommendations by adding an element of chance to the model. We do this by creating a hybrid model that builds on the most accurate algorithm above (SVDF) while adding diversity through the RANDOM algorithm. Specifically, we create a hybrid recommender based on a 50/50 weighting of the SVDF and RANDOM algorithms.

## use hybrid of SVDF and RANDOM
recom_svdf <- Recommender(getData(scheme, "train"), "SVDF")
recom_svdf
## Recommender of type 'SVDF' for 'realRatingMatrix' 
## learned using 508 users.
recom_rand <- Recommender(getData(scheme, "train"), "RANDOM")
recom_rand
## Recommender of type 'RANDOM' for 'realRatingMatrix' 
## learned using 508 users.
recom_hyb <- HybridRecommender(
    recom_svdf,
    recom_rand,
    weights = c(0.5, 0.5)
)
recom_hyb
## Recommender of type 'HYBRID' for 'ratingMatrix' 
## learned using NA users.
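
As an aside, the hybrid model could also be used to produce the top-N lists discussed at the start of this section (a minimal sketch for a few test users; not part of the accuracy evaluation that follows):

# top-5 recommendations for the first three test users from the hybrid model
top5 <- predict(recom_hyb, getData(scheme, "known")[1:3], type = "topNList", n = 5)
as(top5, "list")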

5.2 Evaluation (Deliverable #3)

For the hybrid model, we use the same evaluation scheme as before (90/10 training / test split with given-15 protocol) to predict ratings in the test set and then evaluate prediction accuracy.

pred_svdf <- predict(recom_svdf, getData(scheme, "known"), type = "ratings")
pred_svdf
## 57 x 336 rating matrix of class 'realRatingMatrix' with 18297 ratings.
pred_rand <- predict(recom_rand, getData(scheme, "known"), type = "ratings")
pred_rand
## 57 x 336 rating matrix of class 'realRatingMatrix' with 18297 ratings.
pred_hyb <- predict(recom_hyb, getData(scheme, "known"), type = "ratings")
pred_hyb
## 57 x 336 rating matrix of class 'realRatingMatrix' with 18297 ratings.
error <- rbind(
    SVDF = calcPredictionAccuracy(pred_svdf, getData(scheme, "unknown")),
    RANDOM = calcPredictionAccuracy(pred_rand, getData(scheme, "unknown")),
    HYBRID = calcPredictionAccuracy(pred_hyb, getData(scheme, "unknown"))
    )
error %>% round(3) %>% kable(caption = "Error Metrics for Hybrid Model of SVDF & RANDOM")
Error Metrics for Hybrid Model of SVDF & RANDOM

           RMSE    MSE    MAE
SVDF      0.888  0.789  0.697
RANDOM    1.300  1.691  1.027
HYBRID    1.014  1.029  0.807

Judging by the error metrics, the prediction accuracy of the hybrid recommender falls between the SVDF and RANDOM models, though closer to the SVDF model. With the 50/50 blend of the SVDF and RANDOM algorithms, the RMSE of the hybrid model increased by 14% (from 0.888 to 1.014) relative to the SVDF model. The hope is that, at the cost of this modest deterioration in rating accuracy, movie viewers benefit from greater diversity in recommendations, particularly exposure to less popular movies.
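
A natural follow-up (not performed here) would be to sweep the RANDOM weight and observe how accuracy degrades as more diversity is injected; a minimal sketch using the objects defined above:

# hypothetical sweep of the RANDOM weight in the hybrid model
random_weights <- c(0.25, 0.50, 0.75)
rmse_by_weight <- sapply(random_weights, function(w) {
    hyb  <- HybridRecommender(recom_svdf, recom_rand, weights = c(1 - w, w))
    pred <- predict(hyb, getData(scheme, "known"), type = "ratings")
    calcPredictionAccuracy(pred, getData(scheme, "unknown"))["RMSE"]
})
data.frame(random_weight = random_weights, RMSE = rmse_by_weight)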

6 Conclusions (Deliverable #4)

Key findings from this project include:

  • Of the algorithms tested (RANDOM, SVD, SVDF, and ALS), SVDF had the highest prediction accuracy when measured in terms of error metrics such as RMSE, MSE, and MAE. However, it is important to keep in mind that we didn’t optimize calculation parameters for the various algorithms, and consequently, our conclusions rely on the default parameters.

  • It was interesting that ALS was only slightly less accurate than SVDF, yet it was much faster to train on this dataset (though its prediction step took longer). For a business application in a production environment, it may be preferable to use a faster, computationally cheaper algorithm with slightly worse accuracy rather than the most accurate algorithm.

  • Properties of different recommendation algorithms can be blended using an ensemble approach. In this project, we created a hybrid recommender using a 50/50 blend of the SVDF and RANDOM algorithms, in order to add diversity to the model outputs (for instance, recommendations that include less popular movies).

  • Error metrics of the hybrid model fell between those of the SVDF and RANDOM models. Compared to the pure SVDF model, the RMSE of the hybrid model increased by 14% when diversity from the RANDOM model was added. As a practical matter, the RMSE of 0.9 from the SVDF model and 1.0 from the hybrid model are likely indistinguishable from a user’s perspective, in which case the benefit of greater diversity in recommendations outweighs the cost of slightly lower accuracy.

Finally, in an online evaluation environment, a primary objective should be to evaluate the effectiveness of the recommendations these models produce. In this project, we measured accuracy by a model’s ability to predict user ratings. However, it would be enlightening (and commercially valuable) to see how effective the models are at generating actionable movie recommendations. For instance, an experiment could be run in a Netflix-style environment, where top-5 lists of recommended movies produced by different recommender models are shown to randomly selected samples of users. The click-through rate, or the intensity of user interaction (from ignoring the recommendation, to viewing a movie description or trailer, to watching part or all of the movie), could then be measured over a subsequent time period. Based on the results of the experiment, we could determine which algorithms produce recommendations that are most effective at triggering a positive user response.

This points to the difference between generating accurate ratings and generating effective recommendations, and shifts the focus to model development that enhances the user experience and promotes greater user engagement and satisfaction. Such a development focus would have direct commercial benefits, including lower attrition rates and higher revenue potential across the customer base.
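
As an illustration only, the arm-level comparison described above might be summarized along these lines, assuming an interaction log with hypothetical columns model_arm, clicked, and minutes_watched (no such data exist in this project):

# hypothetical helper for an online A/B test: summarise each model arm's
# click-through rate and viewing time from a logged interaction table
# (column names are assumptions for illustration, not real data)
summarise_arms <- function(log_df) {
    log_df %>%
        group_by(model_arm) %>%
        summarise(click_through_rate = mean(clicked),
                  avg_minutes_watched = mean(minutes_watched),
                  .groups = "drop")
}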