library(kableExtra)
In this project we will build a baseline recommender using a dataset with movie ratings. Users, raters in our dataset, as well as items, or movies, will both be normalized to account for “bias”. Once the recommender is built, the performance of the same will be asses using the RMSE metric.
The data used in the project consist of ratings of 10 popular movies by 7 individuals. Data was collected by interviewing family and friends.
data <- as.matrix(read.csv("data.csv", header = TRUE,row.names = 1))
kable(data)
| Forrest.Gump | Titanic | Saving.Private.Ryan | ET | Ferris.Bueller | Batman..The.Dark.Knight | Joker | Rocketman | Bohemian.Rhapsody | Wonder.Woman..2017. | |
|---|---|---|---|---|---|---|---|---|---|---|
| Jennie | 5 | 4.0 | 3 | 5 | NA | 2.5 | NA | 4.5 | 5 | NA |
| Tatiana | 5 | 5.0 | 3 | 5 | NA | 2.5 | NA | 4.0 | NA | 3 |
| Sue | 5 | 3.0 | 3 | 1 | 1 | 4.0 | 5.0 | 4.0 | 5 | 4 |
| Nicole | 5 | NA | NA | 4 | 5 | NA | NA | NA | 4 | 3 |
| Peter | 5 | 3.5 | 3 | 3 | 5 | 4.0 | 3.5 | 4.0 | 4 | 4 |
| Samantha | 4 | 3.0 | NA | 5 | 5 | 5.0 | 3.0 | NA | 4 | 3 |
A test dataset was selected from the data
testData<-data[NA,NA]
rownames(testData)<-rownames(data)
colnames(testData)<-colnames(data)
i<-0
while(i<(0.3*sum(!is.na(data)))) {
row<-sample(1:nrow(data),1)
col<-sample(1:ncol(data),1)
if(!is.na(data[row,col])) {
i<-i+1
testData[row,col]<-data[row,col]
}
}
kable(testData)
| Forrest.Gump | Titanic | Saving.Private.Ryan | ET | Ferris.Bueller | Batman..The.Dark.Knight | Joker | Rocketman | Bohemian.Rhapsody | Wonder.Woman..2017. | |
|---|---|---|---|---|---|---|---|---|---|---|
| Jennie | 5 | 4 | NA | 5 | NA | NA | NA | NA | NA | NA |
| Tatiana | NA | NA | 3 | NA | NA | NA | NA | NA | NA | NA |
| Sue | NA | NA | NA | 1 | 1 | NA | 5.0 | NA | NA | NA |
| Nicole | 5 | NA | NA | NA | NA | NA | NA | NA | NA | 3 |
| Peter | NA | NA | 3 | NA | 5 | NA | 3.5 | NA | NA | NA |
| Samantha | 4 | NA | NA | NA | NA | NA | NA | NA | NA | 3 |
The training dataset is derived from extracting the test set from the original data
testDataTemp<-testData
testDataTemp[!is.na(testDataTemp)]<-0
testDataTemp[is.na(testDataTemp)]<-1
trainingData<-testDataTemp*data
trainingData[trainingData==0]<-NA
kable(trainingData)
| Forrest.Gump | Titanic | Saving.Private.Ryan | ET | Ferris.Bueller | Batman..The.Dark.Knight | Joker | Rocketman | Bohemian.Rhapsody | Wonder.Woman..2017. | |
|---|---|---|---|---|---|---|---|---|---|---|
| Jennie | NA | NA | 3 | NA | NA | 2.5 | NA | 4.5 | 5 | NA |
| Tatiana | 5 | 5.0 | NA | 5 | NA | 2.5 | NA | 4.0 | NA | 3 |
| Sue | 5 | 3.0 | 3 | NA | NA | 4.0 | NA | 4.0 | 5 | 4 |
| Nicole | NA | NA | NA | 4 | 5 | NA | NA | NA | 4 | NA |
| Peter | 5 | 3.5 | NA | 3 | NA | 4.0 | NA | 4.0 | 4 | 4 |
| Samantha | NA | 3.0 | NA | 5 | 5 | 5.0 | 3 | NA | 4 | NA |
The raw average model simple predict all item/user combinations to be equal to the overall mean of the training data. To do this we calculate the mean of our training set.
rawAverage<-mean(trainingData,na.rm = TRUE)
averageData<-data
averageData<-apply(averageData,1:2,function(x) rawAverage)
kable(averageData)
| Forrest.Gump | Titanic | Saving.Private.Ryan | ET | Ferris.Bueller | Batman..The.Dark.Knight | Joker | Rocketman | Bohemian.Rhapsody | Wonder.Woman..2017. | |
|---|---|---|---|---|---|---|---|---|---|---|
| Jennie | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 |
| Tatiana | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 |
| Sue | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 |
| Nicole | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 |
| Peter | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 |
| Samantha | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 | 4.030303 |
Error is calculated using the training matrix.
trainingRMSEData<-trainingData
trainingRMSEData<-apply(trainingRMSEData,1:2,function(x) (x-rawAverage)^2)
trainingRMSE<-(mean(trainingRMSEData,na.rm = TRUE))^0.5
trainingRMSE
## [1] 0.834297
Error is calculated using the test matrix.
testRMSEData<-testData
testRMSEData<-apply(testRMSEData,1:2,function(x) (x-rawAverage)^2)
testRMSE<-(mean(testRMSEData,na.rm = TRUE))^0.5
testRMSE
## [1] 1.403979
As expected the training RMSE is lower than the test.
Using our training data, we calculate the bias for each user and each item.
userBias<-rowMeans(trainingData,na.rm = TRUE)-rawAverage
kable(userBias)
| x | |
|---|---|
| Jennie | -0.2803030 |
| Tatiana | 0.0530303 |
| Sue | -0.0303030 |
| Nicole | 0.3030303 |
| Peter | -0.1017316 |
| Samantha | 0.1363636 |
movieBias<-colMeans(trainingData,na.rm = TRUE)-rawAverage
kable(movieBias)
| x | |
|---|---|
| Forrest.Gump | 0.9696970 |
| Titanic | -0.4053030 |
| Saving.Private.Ryan | -1.0303030 |
| ET | 0.2196970 |
| Ferris.Bueller | 0.9696970 |
| Batman..The.Dark.Knight | -0.4303030 |
| Joker | -1.0303030 |
| Rocketman | 0.0946970 |
| Bohemian.Rhapsody | 0.3696970 |
| Wonder.Woman..2017. | -0.3636364 |
baselinePredictor<-data
for(i in 1:nrow(data)) {
for(j in 1:ncol(data)) {
baselinePredictor[i,j]<-rawAverage+userBias[i]+movieBias[j]
if(baselinePredictor[i,j]>5) baselinePredictor[i,j]<-5
if(baselinePredictor[i,j]<0) baselinePredictor[i,j]<-0
}
}
kable(baselinePredictor)
| Forrest.Gump | Titanic | Saving.Private.Ryan | ET | Ferris.Bueller | Batman..The.Dark.Knight | Joker | Rocketman | Bohemian.Rhapsody | Wonder.Woman..2017. | |
|---|---|---|---|---|---|---|---|---|---|---|
| Jennie | 4.719697 | 3.344697 | 2.719697 | 3.969697 | 4.719697 | 3.319697 | 2.719697 | 3.844697 | 4.119697 | 3.386364 |
| Tatiana | 5.000000 | 3.678030 | 3.053030 | 4.303030 | 5.000000 | 3.653030 | 3.053030 | 4.178030 | 4.453030 | 3.719697 |
| Sue | 4.969697 | 3.594697 | 2.969697 | 4.219697 | 4.969697 | 3.569697 | 2.969697 | 4.094697 | 4.369697 | 3.636364 |
| Nicole | 5.000000 | 3.928030 | 3.303030 | 4.553030 | 5.000000 | 3.903030 | 3.303030 | 4.428030 | 4.703030 | 3.969697 |
| Peter | 4.898268 | 3.523268 | 2.898268 | 4.148268 | 4.898268 | 3.498268 | 2.898268 | 4.023268 | 4.298268 | 3.564935 |
| Samantha | 5.000000 | 3.761364 | 3.136364 | 4.386364 | 5.000000 | 3.736364 | 3.136364 | 4.261364 | 4.536364 | 3.803030 |
Error is calculated using the training matrix.
trainingRMSEDataBaseline<-trainingData
for(i in 1:nrow(trainingRMSEDataBaseline)) {
for(j in 1:ncol(trainingRMSEDataBaseline)) {
trainingRMSEDataBaseline[i,j]<-(trainingRMSEDataBaseline[i,j]-baselinePredictor[i,j])^2
}
}
trainingRMSEBaseline<-(mean(trainingRMSEDataBaseline,na.rm = TRUE))^0.5
trainingRMSEBaseline
## [1] 0.6195254
Error is calculated using the test matrix.
testRMSEDataBaseline<-testData
for(i in 1:nrow(testRMSEDataBaseline)) {
for(j in 1:ncol(testRMSEDataBaseline)) {
testRMSEDataBaseline[i,j]<-(testRMSEDataBaseline[i,j]-baselinePredictor[i,j])^2
}
}
testRMSEBaseline<-(mean(testRMSEDataBaseline,na.rm = TRUE))^0.5
testRMSEBaseline
## [1] 1.576328
The baseline predictor shows much better results than the simple raw average model. Although this is a small dataset, so data variance and how the random training and test datasets are defined will have a large effect in the results, we can consistently see how training RMSE for the baseline predictor is lower than the raw average. The tests RMSE for both are much closer. This should be a better measure of the higher quality of the baseline predictor. In this small dataset it is somewhat harder to see the improvement of one model to the other. Using larger datasets should provide a larger contrast.