Dataset

This dataset contains ratings for some of the new movies.

Ratings are on a scale of 1 to 10.

Some of the ratinsg are from actual sites and others were made up.

Below table displays user movie ratings.

ratings <- read.table('Reviews.txt',header = T, sep = ',', na.strings = T)
kable_styling (kable(ratings),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer Birds.of.Prey Bad.Boys.For.Life Star.Wars Dolittle Jumanji..The.Next.Level The.Gentlemen X1917 Knives.Out
User01 8.0 6.0 10.0 6.0 6.0 8.0 2.0 NA
User02 7.5 7.5 7.5 5.0 2.5 7.5 2.5 7.5
User03 7.5 7.5 8.8 5.0 NA 8.8 NA NA
User04 7.5 6.0 9.0 4.0 NA NA NA NA
User05 7.5 8.8 10.0 6.3 3.8 8.8 NA NA
User06 7.5 7.5 7.5 7.5 NA 10.0 5.0 6.3
User07 7.0 9.0 8.0 5.0 4.0 8.0 2.0 NA
User08 7.0 6.0 7.0 4.0 4.0 9.0 6.0 8.0
User09 6.7 8.5 9.5 8.0 7.1 8.9 NA NA
User10 6.3 6.3 7.5 NA NA 10.0 6.3 10.0
User11 6.3 7.5 10.0 7.5 NA 8.8 NA NA

Below table displays average ratings by user Based on the table below, we see that User10 gives higher ratings and User02 gives lower ratings.

reviewers <- as.character(ratings$Reviewer)
movies <- as.character( colnames(ratings)[-1])
ratingsByUser <- cbind(reviewers,rowMeans(ratings[-1], na.rm = T))
kable_styling (kable(ratingsByUser,col.names = c("User",'Rating')),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
User Rating
User01 6.57142857142857
User02 5.9375
User03 7.52
User04 6.625
User05 7.53333333333333
User06 7.32857142857143
User07 6.14285714285714
User08 6.375
User09 8.11666666666667
User10 7.73333333333333
User11 8.02

Below table display average ratings by movie. Based on the table below, we see that Star.Wars is the highest rated movie and X1917 is the lowest rated movie.

ratinsByMovie <- cbind(movies, colMeans(ratings[-1], na.rm = T))
rownames(ratinsByMovie) <- c()
kable_styling (kable(ratinsByMovie,  col.names = c("Movie","Rating")),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Movie Rating
Birds.of.Prey 7.16363636363636
Bad.Boys.For.Life 7.32727272727273
Star.Wars 8.61818181818182
Dolittle 5.83
Jumanji..The.Next.Level 4.56666666666667
The.Gentlemen 8.78
X1917 3.96666666666667
Knives.Out 7.95

Wide to long format.

Here we are converting the data from wide format to long format.

ratings <- gather(ratings,key = Movie, value = Rating, -Reviewer)

Test and Train data

Split the data into triaing and test data. We are using 70% of the data for training and 30% for testing.

Below table display some of the data from the test and train data.

# Creating train and test using sample.int() function
set.seed(86) 

sample <- sample.int(n = nrow(ratings), 
                     size = floor(.70 * nrow(ratings)), # Selecting 70% of data
                     replace = F)
 
train <- ratings[sample, ]
test  <- ratings[-sample, ]

kable_styling (kable(head(train)),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer Movie Rating
67 User01 X1917 2.0
29 User07 Star.Wars 8.0
1 User01 Birds.of.Prey 8.0
87 User10 Knives.Out 10.0
49 User05 Jumanji..The.Next.Level 3.8
2 User02 Birds.of.Prey 7.5
kable_styling (kable(head(test)),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer Movie Rating
6 User06 Birds.of.Prey 7.5
10 User10 Birds.of.Prey 6.3
11 User11 Birds.of.Prey 6.3
13 User02 Bad.Boys.For.Life 7.5
15 User04 Bad.Boys.For.Life 6.0
16 User05 Bad.Boys.For.Life 8.8

Raw average

raw_avg = sum(train$Rating, na.rm = TRUE) / length(which(!is.na(train$Rating)))

Raw average for the movie reviews dataset is 6.8708333

raw_avg
## [1] 6.870833

Raw Average RMSE

We see from the table below the test RMSE is alot better than Train RMSE. This might be because the data is random and we only have small set of reviews.

train_rmse <- sqrt(sum((train$Rating[!is.na(train$Rating)] - raw_avg)^2) /
                         length(which(!is.na(train$Rating))))
test_rmse <- sqrt(sum((test$Rating[!is.na(test$Rating)] - raw_avg)^2) /
                         length(which(!is.na(train$Rating))))

kable_styling (kable(cbind(train_rmse, test_rmse ),  
                        col.names = c("Train RMSE","Test RMSE")),
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Train RMSE Test RMSE
2.045519 1.176856

User Bias

Below table display user review biases. User u5 on average gives more favorable reviews and u4 is a harsh reviewer.

train_data = train[!is.na(train$Rating),]
test_data = test[!is.na(test$Rating),]
reviewer_bias = tapply(train_data$Rating, train_data$Reviewer, function(x) {(sum(x)/length(x))-raw_avg})

reviewer_bias_display<- cbind( names(reviewer_bias), reviewer_bias)

row.names(reviewer_bias_display)<- c()
kable_styling (kable(reviewer_bias_display ,  
                     col.names = c("User", "Bias")),
               bootstrap_options = c("striped", "hover", "condensed", "responsive"))
User Bias
User01 -0.299404761904762
User02 -1.37083333333333
User03 0.649166666666667
User04 1.37916666666667
User05 -0.270833333333334
User06 0.229166666666667
User07 -0.72797619047619
User08 -0.470833333333333
User09 0.804166666666666
User10 1.57916666666667
User11 0.629166666666666

Movie Bias

Below table displays movie biases. Star Wars has been reviewed better than other movies and 1917 has gotten least favorable reviews.

movie_bias <- tapply(train_data$Rating, train_data$Movie, function(x) {(sum(x)/length(x))-raw_avg})

movie_bias_display<- cbind( names(movie_bias), movie_bias)

row.names(movie_bias_display)<- c()

kable_styling (kable(movie_bias_display ,  
                     col.names = c("Movie", "Bias")),
               bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Movie Bias
Bad.Boys.For.Life 0.629166666666666
Birds.of.Prey 0.466666666666667
Dolittle -0.708333333333334
Jumanji..The.Next.Level -2.30416666666667
Knives.Out 1.07916666666667
Star.Wars 1.78916666666667
The.Gentlemen 1.75416666666667
X1917 -3.67083333333333

Train Baseline

Below table displays train data baseline

MovieBias <- apply(train_data,1, function(x)  as.numeric(movie_bias[x['Movie']])) 
ReviewerBias <- apply(train_data,1, function(x) as.numeric(reviewer_bias[x['Reviewer']])) 

train_data  <- cbind(train_data , MovieBias, ReviewerBias)


# Max value for review 10 
Baseline  <- apply(train_data,1, 
                   function(x) as.numeric(x['ReviewerBias']) +  as.numeric(x['MovieBias']) +raw_avg ) 

# data usesd for calculations below
train_data  <- cbind(train_data, raw_avg , Baseline)

# display data
train_data_display <- train_data

#remove column names
row.names( train_data_display) <- c()

kable_styling (kable(train_data_display ),
               bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer Movie Rating MovieBias ReviewerBias raw_avg Baseline
User01 X1917 2.0 -3.6708333 -0.2994048 6.870833 2.900595
User07 Star.Wars 8.0 1.7891667 -0.7279762 6.870833 7.932024
User01 Birds.of.Prey 8.0 0.4666667 -0.2994048 6.870833 7.038095
User10 Knives.Out 10.0 1.0791667 1.5791667 6.870833 9.529167
User05 Jumanji..The.Next.Level 3.8 -2.3041667 -0.2708333 6.870833 4.295833
User02 Birds.of.Prey 7.5 0.4666667 -1.3708333 6.870833 5.966667
User05 Dolittle 6.3 -0.7083333 -0.2708333 6.870833 5.891667
User04 Star.Wars 9.0 1.7891667 1.3791667 6.870833 10.039167
User02 Jumanji..The.Next.Level 2.5 -2.3041667 -1.3708333 6.870833 3.195833
User11 Bad.Boys.For.Life 7.5 0.6291667 0.6291667 6.870833 8.129167
User08 The.Gentlemen 9.0 1.7541667 -0.4708333 6.870833 8.154167
User09 Dolittle 8.0 -0.7083333 0.8041667 6.870833 6.966667
User02 The.Gentlemen 7.5 1.7541667 -1.3708333 6.870833 7.254167
User09 Birds.of.Prey 6.7 0.4666667 0.8041667 6.870833 8.141667
User09 Jumanji..The.Next.Level 7.1 -2.3041667 0.8041667 6.870833 5.370833
User07 Jumanji..The.Next.Level 4.0 -2.3041667 -0.7279762 6.870833 3.838690
User09 The.Gentlemen 8.9 1.7541667 0.8041667 6.870833 9.429167
User05 Birds.of.Prey 7.5 0.4666667 -0.2708333 6.870833 7.066667
User08 Knives.Out 8.0 1.0791667 -0.4708333 6.870833 7.479167
User01 Jumanji..The.Next.Level 6.0 -2.3041667 -0.2994048 6.870833 4.267262
User02 Knives.Out 7.5 1.0791667 -1.3708333 6.870833 6.579167
User02 X1917 2.5 -3.6708333 -1.3708333 6.870833 1.829167
User10 Star.Wars 7.5 1.7891667 1.5791667 6.870833 10.239167
User01 Dolittle 6.0 -0.7083333 -0.2994048 6.870833 5.863095
User08 Jumanji..The.Next.Level 4.0 -2.3041667 -0.4708333 6.870833 4.095833
User04 Birds.of.Prey 7.5 0.4666667 1.3791667 6.870833 8.716667
User07 X1917 2.0 -3.6708333 -0.7279762 6.870833 2.472024
User07 Bad.Boys.For.Life 9.0 0.6291667 -0.7279762 6.870833 6.772024
User01 The.Gentlemen 8.0 1.7541667 -0.2994048 6.870833 8.325595
User07 Dolittle 5.0 -0.7083333 -0.7279762 6.870833 5.434524
User03 Dolittle 5.0 -0.7083333 0.6491667 6.870833 6.811667
User10 X1917 6.3 -3.6708333 1.5791667 6.870833 4.779167
User07 Birds.of.Prey 7.0 0.4666667 -0.7279762 6.870833 6.609524
User07 The.Gentlemen 8.0 1.7541667 -0.7279762 6.870833 7.897024
User06 Dolittle 7.5 -0.7083333 0.2291667 6.870833 6.391667
User03 The.Gentlemen 8.8 1.7541667 0.6491667 6.870833 9.274167
User05 The.Gentlemen 8.8 1.7541667 -0.2708333 6.870833 8.354167
User03 Birds.of.Prey 7.5 0.4666667 0.6491667 6.870833 7.986667
User06 Bad.Boys.For.Life 7.5 0.6291667 0.2291667 6.870833 7.729167
User06 Knives.Out 6.3 1.0791667 0.2291667 6.870833 8.179167
User08 Dolittle 4.0 -0.7083333 -0.4708333 6.870833 5.691667
User11 Dolittle 7.5 -0.7083333 0.6291667 6.870833 6.791667
User01 Star.Wars 10.0 1.7891667 -0.2994048 6.870833 8.360595
User03 Star.Wars 8.8 1.7891667 0.6491667 6.870833 9.309167
User03 Bad.Boys.For.Life 7.5 0.6291667 0.6491667 6.870833 8.149167
User10 The.Gentlemen 10.0 1.7541667 1.5791667 6.870833 10.204167
User08 Birds.of.Prey 7.0 0.4666667 -0.4708333 6.870833 6.866667
User01 Bad.Boys.For.Life 6.0 0.6291667 -0.2994048 6.870833 7.200595

Test Baseline

Below table displays Test data baseline.

MovieBias <- apply(test_data,1, function(x)  as.numeric(movie_bias[x['Movie']])) 
ReviewerBias <- apply(test_data,1, function(x) as.numeric(reviewer_bias[x['Reviewer']])) 

test_data <- cbind(test_data , MovieBias, ReviewerBias)

# Max value for review 10 
Baseline  <- apply(test_data,1, 
                   function(x) as.numeric(x['ReviewerBias']) +  as.numeric(x['MovieBias']) +raw_avg )

test_data  <- cbind(test_data, raw_avg , Baseline)
test_data_display <- test_data
row.names( test_data_display) <- c()
kable_styling (kable(test_data_display ),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer Movie Rating MovieBias ReviewerBias raw_avg Baseline
User06 Birds.of.Prey 7.5 0.4666667 0.2291667 6.870833 7.566667
User10 Birds.of.Prey 6.3 0.4666667 1.5791667 6.870833 8.916667
User11 Birds.of.Prey 6.3 0.4666667 0.6291667 6.870833 7.966667
User02 Bad.Boys.For.Life 7.5 0.6291667 -1.3708333 6.870833 6.129167
User04 Bad.Boys.For.Life 6.0 0.6291667 1.3791667 6.870833 8.879167
User05 Bad.Boys.For.Life 8.8 0.6291667 -0.2708333 6.870833 7.229167
User08 Bad.Boys.For.Life 6.0 0.6291667 -0.4708333 6.870833 7.029167
User09 Bad.Boys.For.Life 8.5 0.6291667 0.8041667 6.870833 8.304167
User10 Bad.Boys.For.Life 6.3 0.6291667 1.5791667 6.870833 9.079167
User02 Star.Wars 7.5 1.7891667 -1.3708333 6.870833 7.289167
User05 Star.Wars 10.0 1.7891667 -0.2708333 6.870833 8.389167
User06 Star.Wars 7.5 1.7891667 0.2291667 6.870833 8.889167
User08 Star.Wars 7.0 1.7891667 -0.4708333 6.870833 8.189167
User09 Star.Wars 9.5 1.7891667 0.8041667 6.870833 9.464167
User11 Star.Wars 10.0 1.7891667 0.6291667 6.870833 9.289167
User02 Dolittle 5.0 -0.7083333 -1.3708333 6.870833 4.791667
User04 Dolittle 4.0 -0.7083333 1.3791667 6.870833 7.541667
User06 The.Gentlemen 10.0 1.7541667 0.2291667 6.870833 8.854167
User11 The.Gentlemen 8.8 1.7541667 0.6291667 6.870833 9.254167
User06 X1917 5.0 -3.6708333 0.2291667 6.870833 3.429167
User08 X1917 6.0 -3.6708333 -0.4708333 6.870833 2.729167

Baseline Predictor

Below table displays Raw Average and Baseline Predictor for test and train dataset. For the training dataset, we see that baseline predictor is significantly better(around 48.81%)than raw averages. We donโ€™t see the same improvements for the test dataset; we believe this is the result of using random data and small dataset.

train_baseLine_pred <- sqrt(sum((train_data$Rating[!is.na(train_data$Rating)] - 
                               train_data$Baseline[!is.na(train_data$Rating)])^2) /
                          length(which(!is.na(train_data$Rating))))

test_baseLine_pred <- sqrt(sum((test_data$Rating[!is.na(test_data$Rating)] - 
                              test_data$Baseline[!is.na(test_data$Rating)])^2) /
                         length(which(!is.na(test_data$Rating))))

train_impro <- (1 -(train_baseLine_pred /train_rmse)) *100
test_impro <- (1 -(test_baseLine_pred /test_rmse)) *100

summary_display <-  cbind(train_rmse,  train_baseLine_pred, train_impro)
summary_display <-  rbind(summary_display, c(test_rmse, test_baseLine_pred, test_impro ))

colnames(summary_display) <-c("Raw Average ", "Baseline Predictor", "Improvement %")
row.names(summary_display) <-c("Train ", "Test")
kable_styling (kable(summary_display),
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Raw Average Baseline Predictor Improvement %
Train 2.045519 1.047019 48.81403
Test 1.176856 1.757275 -49.31948