This dataset contains ratings for some of the new movies.
Ratings are on a scale of 1 to 10.
Some of the ratinsg are from actual sites and others were made up.
Below table displays user movie ratings.
ratings <- read.table('Reviews.txt',header = T, sep = ',', na.strings = T)
kable_styling (kable(ratings),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer | Birds.of.Prey | Bad.Boys.For.Life | Star.Wars | Dolittle | Jumanji..The.Next.Level | The.Gentlemen | X1917 | Knives.Out |
---|---|---|---|---|---|---|---|---|
User01 | 8.0 | 6.0 | 10.0 | 6.0 | 6.0 | 8.0 | 2.0 | NA |
User02 | 7.5 | 7.5 | 7.5 | 5.0 | 2.5 | 7.5 | 2.5 | 7.5 |
User03 | 7.5 | 7.5 | 8.8 | 5.0 | NA | 8.8 | NA | NA |
User04 | 7.5 | 6.0 | 9.0 | 4.0 | NA | NA | NA | NA |
User05 | 7.5 | 8.8 | 10.0 | 6.3 | 3.8 | 8.8 | NA | NA |
User06 | 7.5 | 7.5 | 7.5 | 7.5 | NA | 10.0 | 5.0 | 6.3 |
User07 | 7.0 | 9.0 | 8.0 | 5.0 | 4.0 | 8.0 | 2.0 | NA |
User08 | 7.0 | 6.0 | 7.0 | 4.0 | 4.0 | 9.0 | 6.0 | 8.0 |
User09 | 6.7 | 8.5 | 9.5 | 8.0 | 7.1 | 8.9 | NA | NA |
User10 | 6.3 | 6.3 | 7.5 | NA | NA | 10.0 | 6.3 | 10.0 |
User11 | 6.3 | 7.5 | 10.0 | 7.5 | NA | 8.8 | NA | NA |
Below table displays average ratings by user Based on the table below, we see that User10 gives higher ratings and User02 gives lower ratings.
reviewers <- as.character(ratings$Reviewer)
movies <- as.character( colnames(ratings)[-1])
ratingsByUser <- cbind(reviewers,rowMeans(ratings[-1], na.rm = T))
kable_styling (kable(ratingsByUser,col.names = c("User",'Rating')),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
User | Rating |
---|---|
User01 | 6.57142857142857 |
User02 | 5.9375 |
User03 | 7.52 |
User04 | 6.625 |
User05 | 7.53333333333333 |
User06 | 7.32857142857143 |
User07 | 6.14285714285714 |
User08 | 6.375 |
User09 | 8.11666666666667 |
User10 | 7.73333333333333 |
User11 | 8.02 |
Below table display average ratings by movie. Based on the table below, we see that Star.Wars is the highest rated movie and X1917 is the lowest rated movie.
ratinsByMovie <- cbind(movies, colMeans(ratings[-1], na.rm = T))
rownames(ratinsByMovie) <- c()
kable_styling (kable(ratinsByMovie, col.names = c("Movie","Rating")),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Movie | Rating |
---|---|
Birds.of.Prey | 7.16363636363636 |
Bad.Boys.For.Life | 7.32727272727273 |
Star.Wars | 8.61818181818182 |
Dolittle | 5.83 |
Jumanji..The.Next.Level | 4.56666666666667 |
The.Gentlemen | 8.78 |
X1917 | 3.96666666666667 |
Knives.Out | 7.95 |
Here we are converting the data from wide format to long format.
ratings <- gather(ratings,key = Movie, value = Rating, -Reviewer)
Split the data into triaing and test data. We are using 70% of the data for training and 30% for testing.
Below table display some of the data from the test and train data.
# Creating train and test using sample.int() function
set.seed(86)
sample <- sample.int(n = nrow(ratings),
size = floor(.70 * nrow(ratings)), # Selecting 70% of data
replace = F)
train <- ratings[sample, ]
test <- ratings[-sample, ]
kable_styling (kable(head(train)),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer | Movie | Rating | |
---|---|---|---|
67 | User01 | X1917 | 2.0 |
29 | User07 | Star.Wars | 8.0 |
1 | User01 | Birds.of.Prey | 8.0 |
87 | User10 | Knives.Out | 10.0 |
49 | User05 | Jumanji..The.Next.Level | 3.8 |
2 | User02 | Birds.of.Prey | 7.5 |
kable_styling (kable(head(test)),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer | Movie | Rating | |
---|---|---|---|
6 | User06 | Birds.of.Prey | 7.5 |
10 | User10 | Birds.of.Prey | 6.3 |
11 | User11 | Birds.of.Prey | 6.3 |
13 | User02 | Bad.Boys.For.Life | 7.5 |
15 | User04 | Bad.Boys.For.Life | 6.0 |
16 | User05 | Bad.Boys.For.Life | 8.8 |
raw_avg = sum(train$Rating, na.rm = TRUE) / length(which(!is.na(train$Rating)))
Raw average for the movie reviews dataset is 6.8708333
raw_avg
## [1] 6.870833
We see from the table below the test RMSE is alot better than Train RMSE. This might be because the data is random and we only have small set of reviews.
train_rmse <- sqrt(sum((train$Rating[!is.na(train$Rating)] - raw_avg)^2) /
length(which(!is.na(train$Rating))))
test_rmse <- sqrt(sum((test$Rating[!is.na(test$Rating)] - raw_avg)^2) /
length(which(!is.na(train$Rating))))
kable_styling (kable(cbind(train_rmse, test_rmse ),
col.names = c("Train RMSE","Test RMSE")),
bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Train RMSE | Test RMSE |
---|---|
2.045519 | 1.176856 |
Below table display user review biases. User u5 on average gives more favorable reviews and u4 is a harsh reviewer.
train_data = train[!is.na(train$Rating),]
test_data = test[!is.na(test$Rating),]
reviewer_bias = tapply(train_data$Rating, train_data$Reviewer, function(x) {(sum(x)/length(x))-raw_avg})
reviewer_bias_display<- cbind( names(reviewer_bias), reviewer_bias)
row.names(reviewer_bias_display)<- c()
kable_styling (kable(reviewer_bias_display ,
col.names = c("User", "Bias")),
bootstrap_options = c("striped", "hover", "condensed", "responsive"))
User | Bias |
---|---|
User01 | -0.299404761904762 |
User02 | -1.37083333333333 |
User03 | 0.649166666666667 |
User04 | 1.37916666666667 |
User05 | -0.270833333333334 |
User06 | 0.229166666666667 |
User07 | -0.72797619047619 |
User08 | -0.470833333333333 |
User09 | 0.804166666666666 |
User10 | 1.57916666666667 |
User11 | 0.629166666666666 |
Below table displays movie biases. Star Wars has been reviewed better than other movies and 1917 has gotten least favorable reviews.
movie_bias <- tapply(train_data$Rating, train_data$Movie, function(x) {(sum(x)/length(x))-raw_avg})
movie_bias_display<- cbind( names(movie_bias), movie_bias)
row.names(movie_bias_display)<- c()
kable_styling (kable(movie_bias_display ,
col.names = c("Movie", "Bias")),
bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Movie | Bias |
---|---|
Bad.Boys.For.Life | 0.629166666666666 |
Birds.of.Prey | 0.466666666666667 |
Dolittle | -0.708333333333334 |
Jumanji..The.Next.Level | -2.30416666666667 |
Knives.Out | 1.07916666666667 |
Star.Wars | 1.78916666666667 |
The.Gentlemen | 1.75416666666667 |
X1917 | -3.67083333333333 |
Below table displays train data baseline
MovieBias <- apply(train_data,1, function(x) as.numeric(movie_bias[x['Movie']]))
ReviewerBias <- apply(train_data,1, function(x) as.numeric(reviewer_bias[x['Reviewer']]))
train_data <- cbind(train_data , MovieBias, ReviewerBias)
# Max value for review 10
Baseline <- apply(train_data,1,
function(x) as.numeric(x['ReviewerBias']) + as.numeric(x['MovieBias']) +raw_avg )
# data usesd for calculations below
train_data <- cbind(train_data, raw_avg , Baseline)
# display data
train_data_display <- train_data
#remove column names
row.names( train_data_display) <- c()
kable_styling (kable(train_data_display ),
bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer | Movie | Rating | MovieBias | ReviewerBias | raw_avg | Baseline |
---|---|---|---|---|---|---|
User01 | X1917 | 2.0 | -3.6708333 | -0.2994048 | 6.870833 | 2.900595 |
User07 | Star.Wars | 8.0 | 1.7891667 | -0.7279762 | 6.870833 | 7.932024 |
User01 | Birds.of.Prey | 8.0 | 0.4666667 | -0.2994048 | 6.870833 | 7.038095 |
User10 | Knives.Out | 10.0 | 1.0791667 | 1.5791667 | 6.870833 | 9.529167 |
User05 | Jumanji..The.Next.Level | 3.8 | -2.3041667 | -0.2708333 | 6.870833 | 4.295833 |
User02 | Birds.of.Prey | 7.5 | 0.4666667 | -1.3708333 | 6.870833 | 5.966667 |
User05 | Dolittle | 6.3 | -0.7083333 | -0.2708333 | 6.870833 | 5.891667 |
User04 | Star.Wars | 9.0 | 1.7891667 | 1.3791667 | 6.870833 | 10.039167 |
User02 | Jumanji..The.Next.Level | 2.5 | -2.3041667 | -1.3708333 | 6.870833 | 3.195833 |
User11 | Bad.Boys.For.Life | 7.5 | 0.6291667 | 0.6291667 | 6.870833 | 8.129167 |
User08 | The.Gentlemen | 9.0 | 1.7541667 | -0.4708333 | 6.870833 | 8.154167 |
User09 | Dolittle | 8.0 | -0.7083333 | 0.8041667 | 6.870833 | 6.966667 |
User02 | The.Gentlemen | 7.5 | 1.7541667 | -1.3708333 | 6.870833 | 7.254167 |
User09 | Birds.of.Prey | 6.7 | 0.4666667 | 0.8041667 | 6.870833 | 8.141667 |
User09 | Jumanji..The.Next.Level | 7.1 | -2.3041667 | 0.8041667 | 6.870833 | 5.370833 |
User07 | Jumanji..The.Next.Level | 4.0 | -2.3041667 | -0.7279762 | 6.870833 | 3.838690 |
User09 | The.Gentlemen | 8.9 | 1.7541667 | 0.8041667 | 6.870833 | 9.429167 |
User05 | Birds.of.Prey | 7.5 | 0.4666667 | -0.2708333 | 6.870833 | 7.066667 |
User08 | Knives.Out | 8.0 | 1.0791667 | -0.4708333 | 6.870833 | 7.479167 |
User01 | Jumanji..The.Next.Level | 6.0 | -2.3041667 | -0.2994048 | 6.870833 | 4.267262 |
User02 | Knives.Out | 7.5 | 1.0791667 | -1.3708333 | 6.870833 | 6.579167 |
User02 | X1917 | 2.5 | -3.6708333 | -1.3708333 | 6.870833 | 1.829167 |
User10 | Star.Wars | 7.5 | 1.7891667 | 1.5791667 | 6.870833 | 10.239167 |
User01 | Dolittle | 6.0 | -0.7083333 | -0.2994048 | 6.870833 | 5.863095 |
User08 | Jumanji..The.Next.Level | 4.0 | -2.3041667 | -0.4708333 | 6.870833 | 4.095833 |
User04 | Birds.of.Prey | 7.5 | 0.4666667 | 1.3791667 | 6.870833 | 8.716667 |
User07 | X1917 | 2.0 | -3.6708333 | -0.7279762 | 6.870833 | 2.472024 |
User07 | Bad.Boys.For.Life | 9.0 | 0.6291667 | -0.7279762 | 6.870833 | 6.772024 |
User01 | The.Gentlemen | 8.0 | 1.7541667 | -0.2994048 | 6.870833 | 8.325595 |
User07 | Dolittle | 5.0 | -0.7083333 | -0.7279762 | 6.870833 | 5.434524 |
User03 | Dolittle | 5.0 | -0.7083333 | 0.6491667 | 6.870833 | 6.811667 |
User10 | X1917 | 6.3 | -3.6708333 | 1.5791667 | 6.870833 | 4.779167 |
User07 | Birds.of.Prey | 7.0 | 0.4666667 | -0.7279762 | 6.870833 | 6.609524 |
User07 | The.Gentlemen | 8.0 | 1.7541667 | -0.7279762 | 6.870833 | 7.897024 |
User06 | Dolittle | 7.5 | -0.7083333 | 0.2291667 | 6.870833 | 6.391667 |
User03 | The.Gentlemen | 8.8 | 1.7541667 | 0.6491667 | 6.870833 | 9.274167 |
User05 | The.Gentlemen | 8.8 | 1.7541667 | -0.2708333 | 6.870833 | 8.354167 |
User03 | Birds.of.Prey | 7.5 | 0.4666667 | 0.6491667 | 6.870833 | 7.986667 |
User06 | Bad.Boys.For.Life | 7.5 | 0.6291667 | 0.2291667 | 6.870833 | 7.729167 |
User06 | Knives.Out | 6.3 | 1.0791667 | 0.2291667 | 6.870833 | 8.179167 |
User08 | Dolittle | 4.0 | -0.7083333 | -0.4708333 | 6.870833 | 5.691667 |
User11 | Dolittle | 7.5 | -0.7083333 | 0.6291667 | 6.870833 | 6.791667 |
User01 | Star.Wars | 10.0 | 1.7891667 | -0.2994048 | 6.870833 | 8.360595 |
User03 | Star.Wars | 8.8 | 1.7891667 | 0.6491667 | 6.870833 | 9.309167 |
User03 | Bad.Boys.For.Life | 7.5 | 0.6291667 | 0.6491667 | 6.870833 | 8.149167 |
User10 | The.Gentlemen | 10.0 | 1.7541667 | 1.5791667 | 6.870833 | 10.204167 |
User08 | Birds.of.Prey | 7.0 | 0.4666667 | -0.4708333 | 6.870833 | 6.866667 |
User01 | Bad.Boys.For.Life | 6.0 | 0.6291667 | -0.2994048 | 6.870833 | 7.200595 |
Below table displays Test data baseline.
MovieBias <- apply(test_data,1, function(x) as.numeric(movie_bias[x['Movie']]))
ReviewerBias <- apply(test_data,1, function(x) as.numeric(reviewer_bias[x['Reviewer']]))
test_data <- cbind(test_data , MovieBias, ReviewerBias)
# Max value for review 10
Baseline <- apply(test_data,1,
function(x) as.numeric(x['ReviewerBias']) + as.numeric(x['MovieBias']) +raw_avg )
test_data <- cbind(test_data, raw_avg , Baseline)
test_data_display <- test_data
row.names( test_data_display) <- c()
kable_styling (kable(test_data_display ),bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Reviewer | Movie | Rating | MovieBias | ReviewerBias | raw_avg | Baseline |
---|---|---|---|---|---|---|
User06 | Birds.of.Prey | 7.5 | 0.4666667 | 0.2291667 | 6.870833 | 7.566667 |
User10 | Birds.of.Prey | 6.3 | 0.4666667 | 1.5791667 | 6.870833 | 8.916667 |
User11 | Birds.of.Prey | 6.3 | 0.4666667 | 0.6291667 | 6.870833 | 7.966667 |
User02 | Bad.Boys.For.Life | 7.5 | 0.6291667 | -1.3708333 | 6.870833 | 6.129167 |
User04 | Bad.Boys.For.Life | 6.0 | 0.6291667 | 1.3791667 | 6.870833 | 8.879167 |
User05 | Bad.Boys.For.Life | 8.8 | 0.6291667 | -0.2708333 | 6.870833 | 7.229167 |
User08 | Bad.Boys.For.Life | 6.0 | 0.6291667 | -0.4708333 | 6.870833 | 7.029167 |
User09 | Bad.Boys.For.Life | 8.5 | 0.6291667 | 0.8041667 | 6.870833 | 8.304167 |
User10 | Bad.Boys.For.Life | 6.3 | 0.6291667 | 1.5791667 | 6.870833 | 9.079167 |
User02 | Star.Wars | 7.5 | 1.7891667 | -1.3708333 | 6.870833 | 7.289167 |
User05 | Star.Wars | 10.0 | 1.7891667 | -0.2708333 | 6.870833 | 8.389167 |
User06 | Star.Wars | 7.5 | 1.7891667 | 0.2291667 | 6.870833 | 8.889167 |
User08 | Star.Wars | 7.0 | 1.7891667 | -0.4708333 | 6.870833 | 8.189167 |
User09 | Star.Wars | 9.5 | 1.7891667 | 0.8041667 | 6.870833 | 9.464167 |
User11 | Star.Wars | 10.0 | 1.7891667 | 0.6291667 | 6.870833 | 9.289167 |
User02 | Dolittle | 5.0 | -0.7083333 | -1.3708333 | 6.870833 | 4.791667 |
User04 | Dolittle | 4.0 | -0.7083333 | 1.3791667 | 6.870833 | 7.541667 |
User06 | The.Gentlemen | 10.0 | 1.7541667 | 0.2291667 | 6.870833 | 8.854167 |
User11 | The.Gentlemen | 8.8 | 1.7541667 | 0.6291667 | 6.870833 | 9.254167 |
User06 | X1917 | 5.0 | -3.6708333 | 0.2291667 | 6.870833 | 3.429167 |
User08 | X1917 | 6.0 | -3.6708333 | -0.4708333 | 6.870833 | 2.729167 |
Below table displays Raw Average and Baseline Predictor for test and train dataset. For the training dataset, we see that baseline predictor is significantly better(around 48.81%)than raw averages. We donโt see the same improvements for the test dataset; we believe this is the result of using random data and small dataset.
train_baseLine_pred <- sqrt(sum((train_data$Rating[!is.na(train_data$Rating)] -
train_data$Baseline[!is.na(train_data$Rating)])^2) /
length(which(!is.na(train_data$Rating))))
test_baseLine_pred <- sqrt(sum((test_data$Rating[!is.na(test_data$Rating)] -
test_data$Baseline[!is.na(test_data$Rating)])^2) /
length(which(!is.na(test_data$Rating))))
train_impro <- (1 -(train_baseLine_pred /train_rmse)) *100
test_impro <- (1 -(test_baseLine_pred /test_rmse)) *100
summary_display <- cbind(train_rmse, train_baseLine_pred, train_impro)
summary_display <- rbind(summary_display, c(test_rmse, test_baseLine_pred, test_impro ))
colnames(summary_display) <-c("Raw Average ", "Baseline Predictor", "Improvement %")
row.names(summary_display) <-c("Train ", "Test")
kable_styling (kable(summary_display),
bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Raw Average | Baseline Predictor | Improvement % | |
---|---|---|---|
Train | 2.045519 | 1.047019 | 48.81403 |
Test | 1.176856 | 1.757275 | -49.31948 |