In this first assignment, we’ll attempt to predict ratings with very little information. We’ll first look at just raw averages across all (training dataset) users. We’ll then account for “bias” by normalizing across users and across items.
We’ll be working with ratings in a user-item matrix, where each rating may be (1) assigned to a training dataset, (2) assigned to a test dataset, or (3) missing.
The data was sourced from https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings_small.csv. The data set contains 100,000 ratings from 700 users and constitutes a subset of the ratings available in the full MovieLens dataset. Ratings are on a scale of 0.5 to 5. For the purpose of this project, we extracted a 10 x 10 subset of the dataframe.
This system will recommend movies to individuals based on the Global Baseline Predictors methodology.
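Throughout, the global baseline prediction for user $u$ and item $i$ is the global training mean adjusted by a user bias and an item bias. As a reference (notation is mine; $\mu$, $b_u$, and $b_i$ correspond to the raw_avg, user_bias, and item_bias values computed below):

$$\hat{r}_{ui} = \mu + b_u + b_i$$

The code below also relies on a few tidyverse packages whose setup chunk is not shown here; a minimal, assumed version would be:

library(dplyr)   # group_by(), summarize(), inner_join(), sample_n()
library(tidyr)   # spread()
library(tibble)  # column_to_rownames()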
set.seed(200)
rawData <- data.frame(read.csv("ratings_small.csv"))
# top 10 movies
top10 <- rawData %>%
  group_by(movieId) %>%
  summarize(TOTAL_RATINGS = n()) %>%
  arrange(desc(TOTAL_RATINGS)) %>%
  mutate(RECORD_ID = as.numeric(rownames(.))) %>%
  filter(RECORD_ID <= 10)
# all users that have at least 1 of the top 10 movies
dataSubset <- inner_join(rawData, top10, by=c('movieId'='movieId'))
# user IDs to be used in final dataset
sampledData <- dataSubset %>%
  group_by(userId) %>%
  summarize(MOVIE_COUNT = n()) %>%
  filter(MOVIE_COUNT >= 6) %>%
  group_by(MOVIE_COUNT) %>%
  sample_n(2)
# final data set in long format
finalData <- inner_join(dataSubset, sampledData, by=c("userId" = "userId")) %>%
  select(userId, movieId, rating)
# final data matrix
finalData.df <- spread(finalData,movieId,rating)
head(finalData.df)
## userId 1 260 296 318 356 480 527 589 593 2571
## 1 247 4.0 5.0 3.0 NA 5.0 4.0 5.0 4 5.0 4.0
## 2 253 4.5 4.5 4.0 5 5.0 4.0 NA 4 3.5 NA
## 3 355 3.0 5.0 4.5 5 5.0 3.5 4.0 4 4.5 5.0
## 4 358 5.0 5.0 5.0 5 5.0 NA 5.0 5 5.0 5.0
## 5 410 5.0 4.5 1.0 5 5.0 5.0 0.5 NA NA 5.0
## 6 431 4.5 5.0 NA 5 4.5 NA 4.5 NA NA 4.5
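The code that follows refers to trainingSet and testSet, which are not constructed in the excerpt shown. A minimal sketch of one possible split of the long-format finalData (an assumption for completeness, not necessarily the split used to produce the outputs below):

# Hold out roughly 20% of the observed ratings as a test set (assumed split)
test_idx    <- sample(nrow(finalData), size = round(0.2 * nrow(finalData)))
testSet     <- finalData[test_idx, ]
trainingSet <- finalData[-test_idx, ]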
# take the mean of the training dataset; every entry gets the same prediction
raw_avg <- mean(trainingSet$rating, na.rm = TRUE)
raw_avg
## [1] 4.428571
RMSE <- function(data, data_avg){
  sqrt(mean((data - data_avg)^2, na.rm = TRUE))
}
trainRMSE <- RMSE(trainingSet$rating, raw_avg)
trainRMSE
## [1] 0.8463608
testRMSE <- RMSE(testSet$rating, raw_avg)  # assumed call (not shown in the original), mirroring the training computation
testRMSE
## [1] 1.179329
Root mean square error measures how much error there is between the predicted values (in this case, the mean) and the actual values. The lower the value, the better the fit. Both the training and test sets have a high RMSE, indicating that the mean alone is not a great way to impute missing values. Additionally, the test RMSE is higher than the training RMSE, which intuitively makes sense: the mean is based on the values in the training set, so we should expect the error to be lower on the training set.
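For reference, the RMSE computed by the function above is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\bigl(r_k - \hat{r}_k\bigr)^2}$$

where $r_k$ are the observed ratings, $\hat{r}_k$ the predictions (here the single value raw_avg), and $n$ the number of non-missing ratings.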
# convert training set to user-item matrix
train_matrix <- trainingSet %>%
  spread(movieId, rating) %>%
  column_to_rownames(var = "userId")
# calculate bias for each user and each movie
user_bias <- rowMeans(train_matrix, na.rm = TRUE) - raw_avg
item_bias <- colMeans(train_matrix, na.rm = TRUE) - raw_avg
user_bias
## 247 253 355 358 410 431
## -0.17857143 -0.21428571 -0.15079365 0.57142857 -0.14285714 0.27142857
## 448 514 525 654
## -0.09523810 0.07142857 -0.82857143 0.40476190
item_bias
## 1 2571 260 296 318 356
## -0.42857143 0.35714286 0.47142857 -0.01190476 0.57142857 -0.17857143
## 480 527 589 593
## -0.34523810 -0.62857143 0.07142857 -0.01190476
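These values follow the simple bias definitions used in the code above (a reference in the notation introduced earlier): each user's bias is that user's mean training rating minus the global mean, and likewise for each item:

$$b_u = \bar{r}_{u\cdot} - \mu, \qquad b_i = \bar{r}_{\cdot i} - \mu$$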
# Applying baseline predictor function to each user-item combination of train_matrix (10x10)
blp <- function(x, y) x + y + raw_avg
baseline_predictor <- data.frame(
  sapply(1:10, function(i) sapply(1:10, function(j) blp(user_bias[j], item_bias[i]))),
  row.names = rownames(train_matrix))
names(baseline_predictor) <- names(train_matrix)
# Clipping max and min values
baseline_predictor[baseline_predictor > 5] <- 5
baseline_predictor[baseline_predictor < 0.5] <- 0.5
# Spreading testSet and moving userId to row names
test_matrix <- testSet %>%
  spread(movieId, rating) %>%
  column_to_rownames(var = "userId")
# Applying baseline predictor function to each user-item combination of test_matrix (10x8)
baseline_predictor_test <- data.frame(
  t(sapply(rownames(test_matrix),
           function(i) sapply(names(test_matrix),
                              function(j) blp(user_bias[i], item_bias[j])))))
names(baseline_predictor_test) <- names(test_matrix)
# Clipping max and min values
baseline_predictor_test[baseline_predictor_test > 5] <- 5
baseline_predictor_test[baseline_predictor_test < 0.5] <- 0.5
# Quick check of whether baseline predictions were created consistently across train_matrix and test_matrix
baseline_predictor_test$`356` == baseline_predictor$`356`
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Creating a function to calculate RMSE on each matrix using the baseline predictions
predictor_RMSE <- function(matrix, predictions) {
  sqrt(sum((matrix - predictions)^2, na.rm = TRUE) / sum(!is.na(matrix)))
}
# Baseline predictor RMSE for train_matrix and test_matrix
(train_predictor_RMSE <- predictor_RMSE(train_matrix, baseline_predictor))
## [1] 0.6787828
(test_predictor_RMSE <- predictor_RMSE(test_matrix, baseline_predictor_test))  # assumed call; the original line is not shown
## [1] 1.205466
The purpose of incorporating user and movie bias into the predictions is to introduce user- and movie-specific variation into the predicted ratings. We expect that introducing this bias will decrease the overall error on both the training and test sets, because our predictions should (in theory) be more specialized.
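For example, using the bias values printed above (rounded): for movie 260 ($b_i \approx 0.471$), user 358 ($b_u \approx 0.571$) gets $4.429 + 0.571 + 0.471 \approx 5.47$, which is clipped to the maximum rating of 5, while user 247 ($b_u \approx -0.179$) gets $4.429 - 0.179 + 0.471 \approx 4.72$, so the same movie receives different predictions for different users.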
However, our results show otherwise: when incorporating bias into the predictions, the RMSE is lower on the training set but higher on the test set. A closer look at the data reveals the culprit, a single user-item pair whose actual rating, baseline prediction, and absolute error are shown below:
## [1] 1
## [1] 4.27381
## [1] 3.27381
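The printed values match user 410's rating of movie 296 in the sample shown earlier. As a sketch of how this pair could be inspected (the original lookup code is not shown, and we assume this rating was assigned to the test split):

# Worst-offending user-item pair; indices inferred from the printed values above
test_matrix["410", "296"]                                            # actual rating
baseline_predictor_test["410", "296"]                                # global baseline prediction
baseline_predictor_test["410", "296"] - test_matrix["410", "296"]    # error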
We can see that the error from this user-item pairing is particularly high because the actual value and prediction are very different. If this were a recommendation system, the user would probably not be happy with the recommendation.
This brings up an important point: although bias helps to create more customized recommendations, it can still produce inaccurate recommendations if there is wide variation in the user/movie ratings.