Introduction

In this project we build a baseline recommender from a dataset of movie ratings. Both users (the raters in our dataset) and items (the movies) are adjusted to account for “bias”. Once the recommender is built, its performance is assessed using the RMSE metric.

Data

The data used in the project consist of ratings of 10 popular movies by 6 individuals. The data were collected by interviewing family and friends.

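library(knitr)  # provides kable() for printing tables
# Read the ratings; the first column holds the rater names
data <- as.matrix(read.csv("data.csv", header = TRUE, row.names = 1))
kable(data)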

	Forrest.Gump	Titanic	Saving.Private.Ryan	ET	Ferris.Bueller	Batman..The.Dark.Knight	Joker	Rocketman	Bohemian.Rhapsody	Wonder.Woman..2017.
Jennie	5	4.0	3	5	NA	2.5	NA	4.5	5	NA
Tatiana	5	5.0	3	5	NA	2.5	NA	4.0	NA	3
Sue	5	3.0	3	1	1	4.0	5.0	4.0	5	4
Nicole	5	NA	NA	4	5	NA	NA	NA	4	3
Peter	5	3.5	3	3	5	4.0	3.5	4.0	4	4
Samantha	4	3.0	NA	5	5	5.0	3.0	NA	4	3

Splitting the data

A test set of roughly 30% of the observed ratings is selected at random from the data.

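# All-NA matrix with the same shape as data; the dimnames are restored below
testData<-data[NA,NA]
rownames(testData)<-rownames(data)
colnames(testData)<-colnames(data)
# Draw random cells until about 30% of the observed ratings have been picked.
# Note: no seed is set, so every run gives a different split, and a cell that
# is drawn twice is counted twice, so the test set can fall short of the target.
i<-0
while(i<(0.3*sum(!is.na(data)))) {
  row<-sample(1:nrow(data),1)
  col<-sample(1:ncol(data),1)
  # Only cells that hold a rating can go into the test set
  if(!is.na(data[row,col])) {
    i<-i+1
    testData[row,col]<-data[row,col]
  }
}
kable(testData)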

	Forrest.Gump	Titanic	Saving.Private.Ryan	ET	Ferris.Bueller	Batman..The.Dark.Knight	Joker	Rocketman	Bohemian.Rhapsody	Wonder.Woman..2017.
Jennie	5	4	NA	5	NA	NA	NA	NA	NA	NA
Tatiana	NA	NA	3	NA	NA	NA	NA	NA	NA	NA
Sue	NA	NA	NA	1	1	NA	5.0	NA	NA	NA
Nicole	5	NA	NA	NA	NA	NA	NA	NA	NA	3
Peter	NA	NA	3	NA	5	NA	3.5	NA	NA	NA
Samantha	4	NA	NA	NA	NA	NA	NA	NA	NA	3
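The random split above can also be done by sampling the observed cells without replacement, which guarantees the size of the test set and makes the run repeatable. The following is only a sketch under stated assumptions: the `ratings` matrix is a hypothetical toy example rather than the project data, the seed value is arbitrary, and rounding the 30% target up with `ceiling()` is a choice made here, not something the original code specifies.

```r
# Toy ratings matrix (hypothetical values, not the project's data)
set.seed(42)  # fixed seed so the split is reproducible
ratings <- matrix(c(5, 4, NA, 3,
                    NA, 2, 5, 4,
                    1, NA, 3, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"),
                                  c("m1", "m2", "m3", "m4")))

observed <- which(!is.na(ratings))        # linear indices of rated cells
nTest <- ceiling(0.3 * length(observed))  # target size of the test set
# sample.int() draws without replacement by default, so no cell repeats
testIdx <- observed[sample.int(length(observed), nTest)]

testSet <- ratings
testSet[setdiff(seq_along(ratings), testIdx)] <- NA  # keep only held-out cells
trainSet <- ratings
trainSet[testIdx] <- NA                              # hide held-out cells
```

With this approach every observed rating ends up in exactly one of `trainSet` or `testSet`.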

The training dataset is derived by removing the test set entries from the original data.

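# Start from the full data and blank out every cell that went to the test set.
# Masking directly (instead of multiplying by a 0/1 matrix and turning zeros
# into NA) also keeps a genuine rating of 0 from being dropped.
trainingData<-data
trainingData[!is.na(testData)]<-NA
kable(trainingData)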

	Forrest.Gump	Titanic	Saving.Private.Ryan	ET	Ferris.Bueller	Batman..The.Dark.Knight	Joker	Rocketman	Bohemian.Rhapsody	Wonder.Woman..2017.
Jennie	NA	NA	3	NA	NA	2.5	NA	4.5	5	NA
Tatiana	5	5.0	NA	5	NA	2.5	NA	4.0	NA	3
Sue	5	3.0	3	NA	NA	4.0	NA	4.0	5	4
Nicole	NA	NA	NA	4	5	NA	NA	NA	4	NA
Peter	5	3.5	NA	3	NA	4.0	NA	4.0	4	4
Samantha	NA	3.0	NA	5	5	5.0	3	NA	4	NA

Raw average model

The raw average model simply predicts every user/item combination to be equal to the overall mean of the training data. To do this, we calculate the mean of our training set.

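# Overall mean of all observed training ratings
rawAverage<-mean(trainingData,na.rm = TRUE)
# Predict that same value for every user/movie cell
averageData<-data
averageData<-apply(averageData,1:2,function(x) rawAverage)
kable(averageData)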

	Forrest.Gump	Titanic	Saving.Private.Ryan	ET	Ferris.Bueller	Batman..The.Dark.Knight	Joker	Rocketman	Bohemian.Rhapsody	Wonder.Woman..2017.
Jennie	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303
Tatiana	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303
Sue	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303
Nicole	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303
Peter	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303
Samantha	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303	4.030303
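As a small illustration of the raw average model, the same constant matrix can be built directly with `matrix()`. This is a sketch on a hypothetical two-user matrix named `train`; the names `train`, `mu` and `pred` are assumptions made for the example and do not appear in the project code.

```r
# Hypothetical training matrix; NA marks unrated or held-out cells
train <- matrix(c(5, NA, 3,
                  4, 2, NA),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("u1", "u2"), c("m1", "m2", "m3")))

# Raw average over the observed ratings only: (5 + 3 + 4 + 2) / 4 = 3.5
mu <- mean(train, na.rm = TRUE)

# Every prediction is the same constant, whatever the user or movie
pred <- matrix(mu, nrow = nrow(train), ncol = ncol(train),
               dimnames = dimnames(train))
```

Note that the NA cells are ignored when computing the mean but still receive a prediction.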

Training RMSE

Error is calculated using the training matrix.

trainingRMSEData<-trainingData
trainingRMSEData<-apply(trainingRMSEData,1:2,function(x) (x-rawAverage)^2)
trainingRMSE<-(mean(trainingRMSEData,na.rm = TRUE))^0.5
trainingRMSE

## [1] 0.834297
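The RMSE computation is repeated several times in this report, so it may be convenient to wrap it in a helper. The function below is a hypothetical `rmse()` helper, not part of the original code; it assumes that `actual` is a ratings matrix with NA for missing cells and that `predicted` is either a single number or a matrix of the same shape.

```r
# Root mean squared error over the cells where a rating is observed
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Toy example: three observed ratings compared with a constant prediction of 4
actual <- matrix(c(5, NA, 3, 4), nrow = 2)
rmse(actual, 4)  # errors are 1, -1 and 0, so the result is sqrt(2/3)
```

Under these assumptions, `rmse(trainingData, rawAverage)` would compute the same quantity as the code above.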

Test RMSE

Error is calculated using the test matrix.

testRMSEData<-testData
testRMSEData<-apply(testRMSEData,1:2,function(x) (x-rawAverage)^2)
testRMSE<-(mean(testRMSEData,na.rm = TRUE))^0.5
testRMSE

## [1] 1.403979

As expected, the training RMSE is lower than the test RMSE.

Baseline Predictor

Using our training data, we calculate the bias for each user and each item.

userBias<-rowMeans(trainingData,na.rm = TRUE)-rawAverage
kable(userBias)

	x
Jennie	-0.2803030
Tatiana	0.0530303
Sue	-0.0303030
Nicole	0.3030303
Peter	-0.1017316
Samantha	0.1363636

movieBias<-colMeans(trainingData,na.rm = TRUE)-rawAverage
kable(movieBias)

	x
Forrest.Gump	0.9696970
Titanic	-0.4053030
Saving.Private.Ryan	-1.0303030
ET	0.2196970
Ferris.Bueller	0.9696970
Batman..The.Dark.Knight	-0.4303030
Joker	-1.0303030
Rocketman	0.0946970
Bohemian.Rhapsody	0.3696970
Wonder.Woman..2017.	-0.3636364
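The two bias vectors combine with the raw average into a single prediction per user/movie pair. The notation below is an assumption introduced for illustration, since the report gives only the R code: $\mu$ stands for `rawAverage`, $b_u$ for an entry of `userBias`, $b_i$ for an entry of `movieBias`, $\bar{r}_{u\cdot}$ and $\bar{r}_{\cdot i}$ for the training means of a user's and a movie's ratings, and $\hat{r}_{ui}$ for the predicted rating. The last line works through one pair using the values printed above.

```latex
\begin{align*}
b_u &= \bar{r}_{u\cdot} - \mu, \qquad b_i = \bar{r}_{\cdot i} - \mu, \\
\hat{r}_{ui} &= \mu + b_u + b_i, \\
\hat{r}_{\text{Jennie},\,\text{Forrest.Gump}} &= 4.030303 + (-0.280303) + 0.969697 = 4.719697 .
\end{align*}
```

Predictions that fall outside the rating range are clipped afterwards, as in the loop that follows.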

baselinePredictor<-data
for(i in 1:nrow(data)) {
  for(j in 1:ncol(data))  {
    # Prediction = overall mean + user bias + movie bias
    baselinePredictor[i,j]<-rawAverage+userBias[i]+movieBias[j]
    # Clip predictions to the 0 to 5 range
    if(baselinePredictor[i,j]>5) baselinePredictor[i,j]<-5
    if(baselinePredictor[i,j]<0) baselinePredictor[i,j]<-0
  }
}
kable(baselinePredictor)

	Forrest.Gump	Titanic	Saving.Private.Ryan	ET	Ferris.Bueller	Batman..The.Dark.Knight	Joker	Rocketman	Bohemian.Rhapsody	Wonder.Woman..2017.
Jennie	4.719697	3.344697	2.719697	3.969697	4.719697	3.319697	2.719697	3.844697	4.119697	3.386364
Tatiana	5.000000	3.678030	3.053030	4.303030	5.000000	3.653030	3.053030	4.178030	4.453030	3.719697
Sue	4.969697	3.594697	2.969697	4.219697	4.969697	3.569697	2.969697	4.094697	4.369697	3.636364
Nicole	5.000000	3.928030	3.303030	4.553030	5.000000	3.903030	3.303030	4.428030	4.703030	3.969697
Peter	4.898268	3.523268	2.898268	4.148268	4.898268	3.498268	2.898268	4.023268	4.298268	3.564935
Samantha	5.000000	3.761364	3.136364	4.386364	5.000000	3.736364	3.136364	4.261364	4.536364	3.803030
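The double loop used to fill the prediction matrix can also be written without explicit loops. The sketch below is an alternative formulation, not the author's code: it uses `outer()` to add every user bias to every movie bias, then `pmax()` and `pmin()` to clip to the 0 to 5 range. To keep it self-contained it copies only two users and two movies from the bias tables above; the names `mu` and `pred` are introduced here for the example.

```r
mu <- 4.030303  # raw average reported above
userBias <- c(Jennie = -0.2803030, Tatiana = 0.0530303)
movieBias <- c(Forrest.Gump = 0.9696970, Titanic = -0.4053030)

# outer() builds the users-by-movies matrix of userBias[u] + movieBias[i]
pred <- mu + outer(userBias, movieBias, "+")

# Clip to the 0 to 5 range; pmax()/pmin() keep the matrix shape and names
pred <- pmin(pmax(pred, 0), 5)
pred
```

The Tatiana / Forrest.Gump cell exceeds 5 before clipping, which is why it shows as exactly 5 in the table above.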

Training RMSE

Error is calculated using the training matrix.

trainingRMSEDataBaseline<-trainingData
for(i in 1:nrow(trainingRMSEDataBaseline)) {
  for(j in 1:ncol(trainingRMSEDataBaseline))  {
    trainingRMSEDataBaseline[i,j]<-(trainingRMSEDataBaseline[i,j]-baselinePredictor[i,j])^2
  }
}
trainingRMSEBaseline<-(mean(trainingRMSEDataBaseline,na.rm = TRUE))^0.5
trainingRMSEBaseline

## [1] 0.6195254

Test RMSE

Error is calculated using the test matrix.

testRMSEDataBaseline<-testData
for(i in 1:nrow(testRMSEDataBaseline)) {
  for(j in 1:ncol(testRMSEDataBaseline))  {
    testRMSEDataBaseline[i,j]<-(testRMSEDataBaseline[i,j]-baselinePredictor[i,j])^2
  }
}
testRMSEBaseline<-(mean(testRMSEDataBaseline,na.rm = TRUE))^0.5
testRMSEBaseline

## [1] 1.576328

Summary

On the training data, the baseline predictor shows much better results than the simple raw average model: its RMSE is 0.620 against 0.834. Because this is a small dataset, the variance in the data and the way the random training and test sets are drawn have a large effect on the results, but the training RMSE of the baseline predictor is consistently lower than that of the raw average. The test RMSEs of the two models are much closer (1.576 for the baseline predictor and 1.404 for the raw average in this run, so the baseline does not improve on the raw average for this particular split). The test RMSE should be the better measure of model quality, and with so few ratings it is hard to see the improvement of one model over the other. Larger datasets should provide a clearer contrast.
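To compare the models side by side, the four RMSE values reported in this run can be collected in one table. This is only a convenience sketch: the `results` data frame and its column names are made up for the example, and the numbers are copied from the outputs above rather than recomputed.

```r
# RMSE values as printed earlier in this report (single random split)
results <- data.frame(
  model = c("raw average", "baseline"),
  trainRMSE = c(0.834297, 0.6195254),
  testRMSE = c(1.403979, 1.576328)
)

# Gap between test and training error, a rough indicator of overfitting
results$gap <- results$testRMSE - results$trainRMSE
results
```

Because no seed is set in the split, a different run would give different values in every column.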

CUNY DATA 612 Project 1 | Global Baseline Predictors and RMSE

Peter Kowalchuk

2/2/2020
