In this first assignment, we’ll attempt to predict ratings with very little information. We’ll first look at just raw averages across all (training dataset) users. We’ll then account for “bias” by normalizing across users and across items.
We’ll be working with ratings in a user-item matrix, where each rating may be (1) assigned to a training dataset, (2) assigned to a test dataset, or (3) missing.
The data was sourced from https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings_small.csv. The data set contains 100,000 ratings from 700 users and constitutes a subset of the ratings available in the full MovieLens dataset. Ratings are on a scale of 0.5 to 5. For the purpose of this project, we extracted a 10 x 10 subset of the dataframe.
This system will recommend movies to individuals based on the Global Baseline Predictors methodology.
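Throughout, the global baseline prediction for user $u$ and item $i$ is the global training mean adjusted by a user bias and an item bias. As a reference (notation is mine; $\mu$, $b_u$, and $b_i$ correspond to the raw_avg, user_bias, and item_bias values computed below):

$$\hat{r}_{ui} = \mu + b_u + b_i$$

The code below also relies on a few tidyverse packages whose setup chunk is not shown here; a minimal, assumed version would be:

library(dplyr)   # group_by(), summarize(), inner_join(), sample_n()
library(tidyr)   # spread()
library(tibble)  # column_to_rownames()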
set.seed(200)
rawData <- data.frame(read.csv("ratings_small.csv"))
# top 10 movies
top10 <- rawData %>%
  group_by(movieId) %>%
  summarize(TOTAL_RATINGS = n()) %>%
  arrange(desc(TOTAL_RATINGS)) %>%
  mutate(RECORD_ID = as.numeric(rownames(.))) %>%
  filter(RECORD_ID <= 10)
# all users that have at least 1 of the top 10 movies
dataSubset <- inner_join(rawData, top10, by=c('movieId'='movieId'))
# user IDs to be used in final dataset
sampledData <- dataSubset %>%
  group_by(userId) %>%
  summarize(MOVIE_COUNT = n()) %>%
  filter(MOVIE_COUNT >= 6) %>%
  group_by(MOVIE_COUNT) %>%
  sample_n(2)
# final data set in long format
finalData <- inner_join(dataSubset, sampledData, by=c("userId" = "userId")) %>%
  select(userId, movieId, rating)
# final data matrix
finalData.df <- spread(finalData,movieId,rating)
head(finalData.df)
## userId 1 260 296 318 356 480 527 589 593 2571
## 1 247 4.0 5.0 3.0 NA 5.0 4.0 5.0 4 5.0 4.0
## 2 253 4.5 4.5 4.0 5 5.0 4.0 NA 4 3.5 NA
## 3 355 3.0 5.0 4.5 5 5.0 3.5 4.0 4 4.5 5.0
## 4 358 5.0 5.0 5.0 5 5.0 NA 5.0 5 5.0 5.0
## 5 410 5.0 4.5 1.0 5 5.0 5.0 0.5 NA NA 5.0
## 6 431 4.5 5.0 NA 5 4.5 NA 4.5 NA NA 4.5
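The code that follows refers to trainingSet and testSet, which are not constructed in the excerpt shown. A minimal sketch of one possible split of the long-format finalData (an assumption for completeness, not necessarily the split used to produce the outputs below):

# Hold out roughly 20% of the observed ratings as a test set (assumed split)
test_idx    <- sample(nrow(finalData), size = round(0.2 * nrow(finalData)))
testSet     <- finalData[test_idx, ]
trainingSet <- finalData[-test_idx, ]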
# take the mean of the training dataset; every entry gets the same prediction
raw_avg <- mean(trainingSet$rating, na.rm = TRUE)
raw_avg
## [1] 4.428571
RMSE <- function(data, data_avg){
  sqrt(mean((data - data_avg)^2, na.rm = TRUE))
}
trainRMSE <- RMSE(trainingSet$rating, raw_avg)
trainRMSE
## [1] 0.8463608
testRMSE <- RMSE(testSet$rating, raw_avg)  # assumed call (not shown in the original), mirroring the training computation
testRMSE
## [1] 1.179329
Root mean square error measures how much error there is between the predicted values (in this case, the mean) and the actual values. The lower the value, the better the fit. Both the training and test sets have a high RMSE, indicating that the mean alone is not a great way to impute missing values. Additionally, the test RMSE is higher than the training RMSE, which intuitively makes sense: the mean is based on the values in the training set, so we should expect the error to be lower on the training set.
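For reference, the RMSE computed by the function above is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\bigl(r_k - \hat{r}_k\bigr)^2}$$

where $r_k$ are the observed ratings, $\hat{r}_k$ the predictions (here the single value raw_avg), and $n$ the number of non-missing ratings.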
# convert training set to user-item matrix
train_matrix <- trainingSet %>%
  spread(movieId, rating) %>%
  column_to_rownames(var = "userId")
# calculate bias for each user and each movie
user_bias <- rowMeans(train_matrix, na.rm = TRUE) - raw_avg
item_bias <- colMeans(train_matrix, na.rm = TRUE) - raw_avg
user_bias
## 247 253 355 358 410 431
## -0.17857143 -0.21428571 -0.15079365 0.57142857 -0.14285714 0.27142857
## 448 514 525 654
## -0.09523810 0.07142857 -0.82857143 0.40476190
item_bias
## 1 2571 260 296 318 356
## -0.42857143 0.35714286 0.47142857 -0.01190476 0.57142857 -0.17857143
## 480 527 589 593
## -0.34523810 -0.62857143 0.07142857 -0.01190476
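These values follow the simple bias definitions used in the code above (a reference in the notation introduced earlier): each user's bias is that user's mean training rating minus the global mean, and likewise for each item:

$$b_u = \bar{r}_{u\cdot} - \mu, \qquad b_i = \bar{r}_{\cdot i} - \mu$$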
# Applying baseline predictor function to each user-item combination of train_matrix (10x10)
blp <- function(x, y) x + y + raw_avg
baseline_predictor <- data.frame(
  sapply(1:10, function(i) sapply(1:10, function(j) blp(user_bias[j], item_bias[i]))),
  row.names = rownames(train_matrix))
names(baseline_predictor) <- names(train_matrix)
# Clipping max and min values
baseline_predictor[baseline_predictor > 5] <- 5
baseline_predictor[baseline_predictor < 0.5] <- 0.5
# Spreading testSet and moving userId to row names
test_matrix <- testSet %>%
  spread(movieId, rating) %>%
  column_to_rownames(var = "userId")
# Applying baseline predictor function to each user-item combination of test_matrix (10x8)
baseline_predictor_test <- data.frame(
  t(sapply(rownames(test_matrix),
           function(i) sapply(names(test_matrix),
                              function(j) blp(user_bias[i], item_bias[j])))))
names(baseline_predictor_test) <- names(test_matrix)
# Clipping max and min values
baseline_predictor_test[baseline_predictor_test > 5] <- 5
baseline_predictor_test[baseline_predictor_test < 0.5] <- 0.5
# Quick check of whether baseline predictions were created consistently across train_matrix and test_matrix
baseline_predictor_test$`356` == baseline_predictor$`356`
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Creating a function to calculate RMSE on each matrix using the baseline predictions
predictor_RMSE <- function(matrix, predictions) {
  sqrt(sum((matrix - predictions)^2, na.rm = TRUE) / sum(!is.na(matrix)))
}
# Baseline predictor RMSE for train_matrix and test_matrix
(train_predictor_RMSE <- predictor_RMSE(train_matrix, baseline_predictor))
## [1] 0.6787828
(test_predictor_RMSE <- predictor_RMSE(test_matrix, baseline_predictor_test))  # assumed call; the original line is not shown
## [1] 1.205466
The purpose of incorporating user and movie bias into the predictions is to introduce user- and movie-specific variation into the predicted ratings. We expect that introducing this bias will decrease the overall error on both the training and test sets, because our predictions should (in theory) be more specialized.
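For example, using the bias values printed above (rounded): for movie 260 ($b_i \approx 0.471$), user 358 ($b_u \approx 0.571$) gets $4.429 + 0.571 + 0.471 \approx 5.47$, which is clipped to the maximum rating of 5, while user 247 ($b_u \approx -0.179$) gets $4.429 - 0.179 + 0.471 \approx 4.72$, so the same movie receives different predictions for different users.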
However, our results show otherwise: when incorporating bias into the predictions, the RMSE is lower on the training set but higher on the test set. A closer look at the data reveals the culprit, a single user-item pair whose actual rating, baseline prediction, and absolute error are shown below:
## [1] 1
## [1] 4.27381
## [1] 3.27381
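The printed values match user 410's rating of movie 296 in the sample shown earlier. As a sketch of how this pair could be inspected (the original lookup code is not shown, and we assume this rating was assigned to the test split):

# Worst-offending user-item pair; indices inferred from the printed values above
test_matrix["410", "296"]                                            # actual rating
baseline_predictor_test["410", "296"]                                # global baseline prediction
baseline_predictor_test["410", "296"] - test_matrix["410", "296"]    # error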
We can see that the error from this user-item pairing is particularly high because the actual value and prediction are very different. If this were a recommendation system, the user would probably not be happy with the recommendation.
This brings up an important point: although bias helps to create more customized recommendations, it can still produce inaccurate recommendations if there is wide variation in the user/movie ratings.