Data 612 Project 4 | Accuracy and Beyond

Assignment Instructions

  1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.

  2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

  3. Compare and report on any change in accuracy before and after you’ve made the change in #2.

  4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.

Introduction

For my previous projects I used the MovieLense dataset, so for this project I will be using the Jester joke rating dataset (http://eigentaste.berkeley.edu/dataset). The recommenderlab package ships with Jester5k, a thinned-down sample of the original dataset, and that is the version I will use here.
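Before loading the data, the packages used throughout this report need to be attached. A minimal setup chunk, assuming recommenderlab, ggplot2, knitr, and kableExtra are installed:

library(recommenderlab)  # Jester5k data, Recommender(), evaluationScheme(), etc.
library(ggplot2)         # qplot() for the average-ratings plot
library(knitr)           # kable() for the result tables
library(kableExtra)      # kable_styling(); also re-exports the %>% pipe used below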

Initial Data Analysis

Load the Jester5k dataset

set.seed(150)
data(Jester5k)
show(Jester5k)
## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.

The following image displays the first 25 users’ ratings of jokes 1 through 40.

image(Jester5k[1:25, 1:40], main = 'Jester Ratings Sample')

The red cells represent positively rated jokes, the blue cells represent negatively rated jokes, and the white cells represent jokes without ratings.

Distribution of ratings

hist(getRatings(Jester5k),
     main = 'Distribution of Ratings',
     ylab = 'Frequency',
     xlab = 'Rating',
     col = 'Tomato')

The ratings distribution tells us that the majority of ratings in our subset are positive. However, comparatively few ratings approach the maximum value of 10.
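The positive skew can be confirmed directly by computing the share of observed ratings that are greater than zero (a quick sketch using the same getRatings() vector):

# Proportion of all observed ratings that are positive
mean(getRatings(Jester5k) > 0)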

Average rating for each joke.

mean_rating <- colMeans(Jester5k)
quantile(mean_rating)
##         0%        25%        50%        75%       100% 
## -3.8970909 -0.2883998  0.9772413  1.8560353  3.5660392
qplot(mean_rating,
     main = 'Distribution of Average Ratings',
     xlab = 'Average Rating')

A look at the per-joke average ratings shows that the median average rating is roughly 1, with half of the jokes averaging between about -0.3 and 1.9.
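We can also pull up the text of the best and worst rated jokes. This sketch assumes JesterJokes (the joke-text vector that ships alongside Jester5k in recommenderlab) is ordered to match the columns of the rating matrix:

# Joke text for the highest and lowest average-rated jokes
# (assumes JesterJokes[i] corresponds to column i of Jester5k)
JesterJokes[which.max(mean_rating)]
JesterJokes[which.min(mean_rating)]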

Recommender Algorithms

In this section, I am going to explore the performance of the UBCF and IBCF recommender algorithms. Additionally, in order to satisfy requirement #2 of the assignment (introducing novelty), I will utilize the RANDOM model.

set.seed(1)
# Define an evaluation scheme and pass it to all 3 models:
# an 80/20 train/test split, 10 ratings per test user given to the models,
# and ratings of 1 or higher counted as "good".
evaluation <- evaluationScheme(Jester5k, method = 'split', train = 0.8, k = 1, given = 10, goodRating = 1)
user_based_model <- Recommender(getData(evaluation, 'train'), 'UBCF')
item_based_model <- Recommender(getData(evaluation, 'train'), 'IBCF')
random_model <- Recommender(getData(evaluation, 'train'), 'RANDOM')

Model Prediction Accuracy

Using recommenderlab’s prediction functions, we can calculate the accuracy of each model.

user_predict <- predict(user_based_model, getData(evaluation, 'known'), type = 'ratings')
item_predict <- predict(item_based_model, getData(evaluation, 'known'), type = 'ratings')
random_predict <- predict(random_model, getData(evaluation, 'known'), type = 'ratings')

results <- rbind(
  'User Model Accuracy' = calcPredictionAccuracy(user_predict, getData(evaluation, 'unknown')), 
  'Item Model Accuracy' = calcPredictionAccuracy(item_predict, getData(evaluation, 'unknown')),
  'Random Model Accuracy' = calcPredictionAccuracy(random_predict, getData(evaluation, 'unknown'))
)
knitr::kable(results, format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'))
                          RMSE       MSE      MAE
User Model Accuracy   4.583632  21.00968 3.595929
Item Model Accuracy   5.453080  29.73608 4.201270
Random Model Accuracy 6.256039  39.13802 4.825248

As reflected in the above table, the least accurate model is the RANDOM model, whereas the most accurate model is the UBCF model. This is not surprising, as recommendations based on previous ratings will be more accurate than those picked at random.

models <- list('User Based Model' = list(name = 'UBCF', param = list(nn = 50)),
               'Item Based Model' = list(name = 'IBCF', param = list(k = 50)),
               'Random Model' = list(name = 'RANDOM', param = NULL))

results <- evaluate(evaluation, models, type = 'topNList', n = c(1, 5, 10, 20, 30, 50))
## UBCF run fold/sample [model time/prediction time]
##   1  [0.046sec/3.66sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.211sec/0.211sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.003sec/0.172sec]
results
## List of evaluation results for 3 recommenders:
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'IBCF'.
## Evaluation results for 1 folds/samples using method 'RANDOM'.

Analysis

Next, we can explore the precision/recall and ROC curves for all 3 models. Again, the UBCF model outperforms the other models on both measures.

Precision Recall

plot(results, 'prec/rec', annotate = 1)
title('Precision Recall')

ROC Curve

plot(results, 'ROC', annotate = 1)
title('ROC Curve')

Introducing Novelty

To introduce novelty into the equation, we will create a hybrid recommendation model that combines the RANDOM model with the top-performing UBCF model, and then measure the hybrid model's accuracy.

# Create the hybrid model.
hybrid_model <- HybridRecommender(user_based_model, random_model, weights = c(0.8, 0.2))

# Generate rating predictions for the test users so we can measure accuracy.
hybrid_predict <- predict(hybrid_model, getData(evaluation, 'known'), type = 'ratings')

hybrid_results <- calcPredictionAccuracy(hybrid_predict, getData(evaluation, 'unknown')) 
knitr::kable(hybrid_results, format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'))
Metric      Value
RMSE     4.648264
MSE     21.606355
MAE      3.652655

Model RMSE Comparison

Finally, we will place the RMSE of the original UBCF model side by side with that of the new hybrid model.

user_results <- calcPredictionAccuracy(user_predict, getData(evaluation, 'unknown'))
comparison <- rbind(
  'User Model RMSE' = user_results['RMSE'], 
  'Hybrid Model RMSE' = hybrid_results['RMSE']
)

knitr::kable(comparison, format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'))
                       RMSE
User Model RMSE    4.583632
Hybrid Model RMSE  4.648264

As reflected in the above table, the RMSE is higher for the hybrid model than for the pure UBCF model, so the UBCF model remains the more accurate of the two. This is the expected trade-off: mixing in random recommendations adds novelty at a small cost in accuracy.
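Since the goal of the hybrid model is novelty rather than accuracy, it is also worth checking how much the random component actually changes what gets recommended. A minimal sketch, assuming predict() on the hybrid recommender supports topNList output, compares the top-10 lists from the two models for the same test users:

# Top-10 recommendation lists from the pure UBCF model and the hybrid model
ubcf_top   <- predict(user_based_model, getData(evaluation, 'known'), type = 'topNList', n = 10)
hybrid_top <- predict(hybrid_model, getData(evaluation, 'known'), type = 'topNList', n = 10)

# Average per-user overlap between the two lists (1 = identical, 0 = disjoint)
ubcf_list   <- as(ubcf_top, 'list')
hybrid_list <- as(hybrid_top, 'list')
mean(mapply(function(a, b) length(intersect(a, b)) / 10, ubcf_list, hybrid_list))

A lower average overlap would indicate that more novel (randomly injected) jokes are reaching the top of users' lists.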

Conclusion

Offline evaluations of recommender systems are static: the dataset used for testing does not change. Online evaluations, by contrast, are dynamic, since new users and ratings are added to the data in real time.

Because of this dynamic nature, online evaluations make it possible to monitor user interactions as they happen (such as click-through rates and actions taken on recommended items), and thus provide a more realistic picture of how the system performs in practice.

If this particular evaluation were conducted in an online environment, we could bring additional data into the analysis. For example, if we could monitor how user ratings and click-through behavior change in response to the modifications we applied to the models (such as adding random recommendations), we could judge which changes were effective and fine-tune the system accordingly. A reasonable design would be an A/B test that assigns each user to either the existing UBCF recommender or the new hybrid recommender, then compares click-through rate, ratings given to recommended jokes, and return visits between the two groups.
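As a hypothetical illustration of such an A/B test, each user could be assigned deterministically to one of the two recommenders so that they always see the same arm of the experiment. The helper below is an assumption for illustration only, using a simple character-code hash of the user ID:

# Hypothetical bucketing helper: map a user ID to an experiment arm
assign_arm <- function(user_id, arms = c('ubcf_control', 'hybrid_treatment')) {
  bucket <- sum(utf8ToInt(as.character(user_id))) %% length(arms)
  arms[bucket + 1]
}

assign_arm('u100')  # always returns the same arm for this user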

Stephen Haslett

6/28/2020