Introduction

In this first assignment, we’ll attempt to predict ratings with very little information. We’ll first look at just raw averages across all (training dataset) users. We’ll then account for “bias” by normalizing across users and across items.

We’ll be working with ratings in a user-item matrix, where each rating may be (1) assigned to a training dataset, (2) assigned to a test dataset, or (3) missing.

Data Source

The data was sourced from: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings_small.csv The data set contains 100 ratings from 700 users. It constitutes a subset of the ratings available in the Full MovieLens dataset. Ratings are on the scale 0.5 to 5. For the purspose of this project, we extracted a 10 X 10 subset of the dataframe.

Motivation

This system will recommend movies to individuals based on the Global Baseline Predictors methodology.

Data Processing

Raw average and RMSE

RMSE-test set

## [1] 1.179329

Root mean square error measures how much error there is between the predicted values (in this case, the mean) and the actual values. The lower the value, the better the fit. Both the training and test set have high RSME, indicating that the mean is not a great way to impute missing values. Additionally, we can see that out test RSME is higher than the training RSME, which intuitively makes sense – the mean is based on the values in the training set, so we should expect that the error is lower on the training set.

Bias Calculation

##         247         253         355         358         410         431 
## -0.17857143 -0.21428571 -0.15079365  0.57142857 -0.14285714  0.27142857 
##         448         514         525         654 
## -0.09523810  0.07142857 -0.82857143  0.40476190
##           1        2571         260         296         318         356 
## -0.42857143  0.35714286  0.47142857 -0.01190476  0.57142857 -0.17857143 
##         480         527         589         593 
## -0.34523810 -0.62857143  0.07142857 -0.01190476

Baseline Predictors for each user-item combination

Creating test matrix

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Baseline Predictor RMSE

## [1] 0.6787828
## [1] 1.205466

Summary

The purpose of incorporating user and movie bias into the predictions is to introduce variation in the ratings that are user-movie specific. We expect that by introducing this bias, the overall error in both training and testing will decrease because our predictions should (in theory) be more specialized.

However, our results show otherwise – when incorporating bias into the predictions, the RMSE is lower on the training set, but higher on the test set. A closer look at the data shows us the culprit –

## [1] 1
## [1] 4.27381
## [1] 3.27381

We can see that the error from this user-item pairing is particularly high because the actual value and prediction are very different. If this were a recommendation system, the user would probably not be happy with the recommendation.

This brings up an important point – although bias helps to create more customized recommendations, it still can produce inaccurate recommendations if there is wide variation in the user/movie rankings.