library(kableExtra)

Introduction

In this project we will build a baseline recommender using a dataset with movie ratings. Users, raters in our dataset, as well as items, or movies, will both be normalized to account for “bias”. Once the recommender is built, the performance of the same will be asses using the RMSE metric.

Data

The data used in the project consist of ratings of 10 popular movies by 7 individuals. Data was collected by interviewing family and friends.

data <- as.matrix(read.csv("data.csv", header = TRUE,row.names = 1))
kable(data)
Forrest.Gump Titanic Saving.Private.Ryan ET Ferris.Bueller Batman..The.Dark.Knight Joker Rocketman Bohemian.Rhapsody Wonder.Woman..2017.
Jennie 5 4.0 3 5 NA 2.5 NA 4.5 5 NA
Tatiana 5 5.0 3 5 NA 2.5 NA 4.0 NA 3
Sue 5 3.0 3 1 1 4.0 5.0 4.0 5 4
Nicole 5 NA NA 4 5 NA NA NA 4 3
Peter 5 3.5 3 3 5 4.0 3.5 4.0 4 4
Samantha 4 3.0 NA 5 5 5.0 3.0 NA 4 3

Splitting the data

A test dataset was selected from the data

testData<-data[NA,NA]
rownames(testData)<-rownames(data)
colnames(testData)<-colnames(data)
i<-0
while(i<(0.3*sum(!is.na(data)))) {
  row<-sample(1:nrow(data),1)
  col<-sample(1:ncol(data),1)
  if(!is.na(data[row,col])) {
    i<-i+1
    testData[row,col]<-data[row,col]
  }
}
kable(testData)
Forrest.Gump Titanic Saving.Private.Ryan ET Ferris.Bueller Batman..The.Dark.Knight Joker Rocketman Bohemian.Rhapsody Wonder.Woman..2017.
Jennie 5 4 NA 5 NA NA NA NA NA NA
Tatiana NA NA 3 NA NA NA NA NA NA NA
Sue NA NA NA 1 1 NA 5.0 NA NA NA
Nicole 5 NA NA NA NA NA NA NA NA 3
Peter NA NA 3 NA 5 NA 3.5 NA NA NA
Samantha 4 NA NA NA NA NA NA NA NA 3

The training dataset is derived from extracting the test set from the original data

testDataTemp<-testData
testDataTemp[!is.na(testDataTemp)]<-0
testDataTemp[is.na(testDataTemp)]<-1
trainingData<-testDataTemp*data
trainingData[trainingData==0]<-NA
kable(trainingData)
Forrest.Gump Titanic Saving.Private.Ryan ET Ferris.Bueller Batman..The.Dark.Knight Joker Rocketman Bohemian.Rhapsody Wonder.Woman..2017.
Jennie NA NA 3 NA NA 2.5 NA 4.5 5 NA
Tatiana 5 5.0 NA 5 NA 2.5 NA 4.0 NA 3
Sue 5 3.0 3 NA NA 4.0 NA 4.0 5 4
Nicole NA NA NA 4 5 NA NA NA 4 NA
Peter 5 3.5 NA 3 NA 4.0 NA 4.0 4 4
Samantha NA 3.0 NA 5 5 5.0 3 NA 4 NA

Raw average model

The raw average model simple predict all item/user combinations to be equal to the overall mean of the training data. To do this we calculate the mean of our training set.

rawAverage<-mean(trainingData,na.rm = TRUE)
averageData<-data
averageData<-apply(averageData,1:2,function(x) rawAverage)
kable(averageData)
Forrest.Gump Titanic Saving.Private.Ryan ET Ferris.Bueller Batman..The.Dark.Knight Joker Rocketman Bohemian.Rhapsody Wonder.Woman..2017.
Jennie 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303
Tatiana 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303
Sue 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303
Nicole 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303
Peter 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303
Samantha 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303 4.030303

Training RMSE

Error is calculated using the training matrix.

trainingRMSEData<-trainingData
trainingRMSEData<-apply(trainingRMSEData,1:2,function(x) (x-rawAverage)^2)
trainingRMSE<-(mean(trainingRMSEData,na.rm = TRUE))^0.5
trainingRMSE
## [1] 0.834297

Test RMSE

Error is calculated using the test matrix.

testRMSEData<-testData
testRMSEData<-apply(testRMSEData,1:2,function(x) (x-rawAverage)^2)
testRMSE<-(mean(testRMSEData,na.rm = TRUE))^0.5
testRMSE
## [1] 1.403979

As expected the training RMSE is lower than the test.

Baseline Predictor

Using our training data, we calculate the bias for each user and each item.

userBias<-rowMeans(trainingData,na.rm = TRUE)-rawAverage
kable(userBias)
x
Jennie -0.2803030
Tatiana 0.0530303
Sue -0.0303030
Nicole 0.3030303
Peter -0.1017316
Samantha 0.1363636
movieBias<-colMeans(trainingData,na.rm = TRUE)-rawAverage
kable(movieBias)
x
Forrest.Gump 0.9696970
Titanic -0.4053030
Saving.Private.Ryan -1.0303030
ET 0.2196970
Ferris.Bueller 0.9696970
Batman..The.Dark.Knight -0.4303030
Joker -1.0303030
Rocketman 0.0946970
Bohemian.Rhapsody 0.3696970
Wonder.Woman..2017. -0.3636364
baselinePredictor<-data
for(i in 1:nrow(data)) {
  for(j in 1:ncol(data))  {
    baselinePredictor[i,j]<-rawAverage+userBias[i]+movieBias[j] 
    if(baselinePredictor[i,j]>5) baselinePredictor[i,j]<-5
    if(baselinePredictor[i,j]<0) baselinePredictor[i,j]<-0
  }
}
kable(baselinePredictor)
Forrest.Gump Titanic Saving.Private.Ryan ET Ferris.Bueller Batman..The.Dark.Knight Joker Rocketman Bohemian.Rhapsody Wonder.Woman..2017.
Jennie 4.719697 3.344697 2.719697 3.969697 4.719697 3.319697 2.719697 3.844697 4.119697 3.386364
Tatiana 5.000000 3.678030 3.053030 4.303030 5.000000 3.653030 3.053030 4.178030 4.453030 3.719697
Sue 4.969697 3.594697 2.969697 4.219697 4.969697 3.569697 2.969697 4.094697 4.369697 3.636364
Nicole 5.000000 3.928030 3.303030 4.553030 5.000000 3.903030 3.303030 4.428030 4.703030 3.969697
Peter 4.898268 3.523268 2.898268 4.148268 4.898268 3.498268 2.898268 4.023268 4.298268 3.564935
Samantha 5.000000 3.761364 3.136364 4.386364 5.000000 3.736364 3.136364 4.261364 4.536364 3.803030

Training RMSE

Error is calculated using the training matrix.

trainingRMSEDataBaseline<-trainingData
for(i in 1:nrow(trainingRMSEDataBaseline)) {
  for(j in 1:ncol(trainingRMSEDataBaseline))  {
    trainingRMSEDataBaseline[i,j]<-(trainingRMSEDataBaseline[i,j]-baselinePredictor[i,j])^2
  }
}
trainingRMSEBaseline<-(mean(trainingRMSEDataBaseline,na.rm = TRUE))^0.5
trainingRMSEBaseline
## [1] 0.6195254

Test RMSE

Error is calculated using the test matrix.

testRMSEDataBaseline<-testData
for(i in 1:nrow(testRMSEDataBaseline)) {
  for(j in 1:ncol(testRMSEDataBaseline))  {
    testRMSEDataBaseline[i,j]<-(testRMSEDataBaseline[i,j]-baselinePredictor[i,j])^2
  }
}
testRMSEBaseline<-(mean(testRMSEDataBaseline,na.rm = TRUE))^0.5
testRMSEBaseline
## [1] 1.576328

Summary

The baseline predictor shows much better results than the simple raw average model. Although this is a small dataset, so data variance and how the random training and test datasets are defined will have a large effect in the results, we can consistently see how training RMSE for the baseline predictor is lower than the raw average. The tests RMSE for both are much closer. This should be a better measure of the higher quality of the baseline predictor. In this small dataset it is somewhat harder to see the improvement of one model to the other. Using larger datasets should provide a larger contrast.