Project 1

Dataset

This dataset contains ratings for some of the new movies.

Ratings are on a scale of 1 to 10.

Some of the ratinsg are from actual sites and others were made up.

Below table displays user movie ratings.

ratings <- read.table('Reviews.txt',header = T, sep = ',', na.strings = T)
kable_styling (kable(ratings),bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Reviewer	Birds.of.Prey	Bad.Boys.For.Life	Star.Wars	Dolittle	Jumanji..The.Next.Level	The.Gentlemen	X1917	Knives.Out
User01	8.0	6.0	10.0	6.0	6.0	8.0	2.0	NA
User02	7.5	7.5	7.5	5.0	2.5	7.5	2.5	7.5
User03	7.5	7.5	8.8	5.0	NA	8.8	NA	NA
User04	7.5	6.0	9.0	4.0	NA	NA	NA	NA
User05	7.5	8.8	10.0	6.3	3.8	8.8	NA	NA
User06	7.5	7.5	7.5	7.5	NA	10.0	5.0	6.3
User07	7.0	9.0	8.0	5.0	4.0	8.0	2.0	NA
User08	7.0	6.0	7.0	4.0	4.0	9.0	6.0	8.0
User09	6.7	8.5	9.5	8.0	7.1	8.9	NA	NA
User10	6.3	6.3	7.5	NA	NA	10.0	6.3	10.0
User11	6.3	7.5	10.0	7.5	NA	8.8	NA	NA

Below table displays average ratings by user Based on the table below, we see that User10 gives higher ratings and User02 gives lower ratings.

reviewers <- as.character(ratings$Reviewer)
movies <- as.character( colnames(ratings)[-1])
ratingsByUser <- cbind(reviewers,rowMeans(ratings[-1], na.rm = T))
kable_styling (kable(ratingsByUser,col.names = c("User",'Rating')),bootstrap_options = c("striped", "hover", "condensed", "responsive"))

User	Rating
User01	6.57142857142857
User02	5.9375
User03	7.52
User04	6.625
User05	7.53333333333333
User06	7.32857142857143
User07	6.14285714285714
User08	6.375
User09	8.11666666666667
User10	7.73333333333333
User11	8.02

Below table display average ratings by movie. Based on the table below, we see that Star.Wars is the highest rated movie and X1917 is the lowest rated movie.

ratinsByMovie <- cbind(movies, colMeans(ratings[-1], na.rm = T))
rownames(ratinsByMovie) <- c()
kable_styling (kable(ratinsByMovie,  col.names = c("Movie","Rating")),bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Movie	Rating
Birds.of.Prey	7.16363636363636
Bad.Boys.For.Life	7.32727272727273
Star.Wars	8.61818181818182
Dolittle	5.83
Jumanji..The.Next.Level	4.56666666666667
The.Gentlemen	8.78
X1917	3.96666666666667
Knives.Out	7.95

Wide to long format.

Here we are converting the data from wide format to long format.

ratings <- gather(ratings,key = Movie, value = Rating, -Reviewer)

Test and Train data

Split the data into triaing and test data. We are using 70% of the data for training and 30% for testing.

Below table display some of the data from the test and train data.

# Creating train and test using sample.int() function
set.seed(86) 

sample <- sample.int(n = nrow(ratings), 
                     size = floor(.70 * nrow(ratings)), # Selecting 70% of data
                     replace = F)
 
train <- ratings[sample, ]
test  <- ratings[-sample, ]

kable_styling (kable(head(train)),bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Reviewer	Movie	Rating
67	User01	X1917	2.0
29	User07	Star.Wars	8.0
1	User01	Birds.of.Prey	8.0
87	User10	Knives.Out	10.0
49	User05	Jumanji..The.Next.Level	3.8
2	User02	Birds.of.Prey	7.5

kable_styling (kable(head(test)),bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Reviewer	Movie	Rating
6	User06	Birds.of.Prey	7.5
10	User10	Birds.of.Prey	6.3
11	User11	Birds.of.Prey	6.3
13	User02	Bad.Boys.For.Life	7.5
15	User04	Bad.Boys.For.Life	6.0
16	User05	Bad.Boys.For.Life	8.8

Raw average

raw_avg = sum(train$Rating, na.rm = TRUE) / length(which(!is.na(train$Rating)))

Raw average for the movie reviews dataset is 6.8708333

raw_avg

## [1] 6.870833

Raw Average RMSE

We see from the table below the test RMSE is alot better than Train RMSE. This might be because the data is random and we only have small set of reviews.

train_rmse <- sqrt(sum((train$Rating[!is.na(train$Rating)] - raw_avg)^2) /
                         length(which(!is.na(train$Rating))))
test_rmse <- sqrt(sum((test$Rating[!is.na(test$Rating)] - raw_avg)^2) /
                         length(which(!is.na(train$Rating))))

kable_styling (kable(cbind(train_rmse, test_rmse ),  
                        col.names = c("Train RMSE","Test RMSE")),
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Train RMSE	Test RMSE
2.045519	1.176856

User Bias

Below table display user review biases. User u5 on average gives more favorable reviews and u4 is a harsh reviewer.

train_data = train[!is.na(train$Rating),]
test_data = test[!is.na(test$Rating),]
reviewer_bias = tapply(train_data$Rating, train_data$Reviewer, function(x) {(sum(x)/length(x))-raw_avg})

reviewer_bias_display<- cbind( names(reviewer_bias), reviewer_bias)

row.names(reviewer_bias_display)<- c()
kable_styling (kable(reviewer_bias_display ,  
                     col.names = c("User", "Bias")),
               bootstrap_options = c("striped", "hover", "condensed", "responsive"))

User	Bias
User01	-0.299404761904762
User02	-1.37083333333333
User03	0.649166666666667
User04	1.37916666666667
User05	-0.270833333333334
User06	0.229166666666667
User07	-0.72797619047619
User08	-0.470833333333333
User09	0.804166666666666
User10	1.57916666666667
User11	0.629166666666666

Movie Bias

Below table displays movie biases. Star Wars has been reviewed better than other movies and 1917 has gotten least favorable reviews.

movie_bias <- tapply(train_data$Rating, train_data$Movie, function(x) {(sum(x)/length(x))-raw_avg})

movie_bias_display<- cbind( names(movie_bias), movie_bias)

row.names(movie_bias_display)<- c()

kable_styling (kable(movie_bias_display ,  
                     col.names = c("Movie", "Bias")),
               bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Movie	Bias
Bad.Boys.For.Life	0.629166666666666
Birds.of.Prey	0.466666666666667
Dolittle	-0.708333333333334
Jumanji..The.Next.Level	-2.30416666666667
Knives.Out	1.07916666666667
Star.Wars	1.78916666666667
The.Gentlemen	1.75416666666667
X1917	-3.67083333333333

Train Baseline

Below table displays train data baseline

MovieBias <- apply(train_data,1, function(x)  as.numeric(movie_bias[x['Movie']])) 
ReviewerBias <- apply(train_data,1, function(x) as.numeric(reviewer_bias[x['Reviewer']])) 

train_data  <- cbind(train_data , MovieBias, ReviewerBias)


# Max value for review 10 
Baseline  <- apply(train_data,1, 
                   function(x) as.numeric(x['ReviewerBias']) +  as.numeric(x['MovieBias']) +raw_avg ) 

# data usesd for calculations below
train_data  <- cbind(train_data, raw_avg , Baseline)

# display data
train_data_display <- train_data

#remove column names
row.names( train_data_display) <- c()

kable_styling (kable(train_data_display ),
               bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Reviewer	Movie	Rating	MovieBias	ReviewerBias	raw_avg	Baseline
User01	X1917	2.0	-3.6708333	-0.2994048	6.870833	2.900595
User07	Star.Wars	8.0	1.7891667	-0.7279762	6.870833	7.932024
User01	Birds.of.Prey	8.0	0.4666667	-0.2994048	6.870833	7.038095
User10	Knives.Out	10.0	1.0791667	1.5791667	6.870833	9.529167
User05	Jumanji..The.Next.Level	3.8	-2.3041667	-0.2708333	6.870833	4.295833
User02	Birds.of.Prey	7.5	0.4666667	-1.3708333	6.870833	5.966667
User05	Dolittle	6.3	-0.7083333	-0.2708333	6.870833	5.891667
User04	Star.Wars	9.0	1.7891667	1.3791667	6.870833	10.039167
User02	Jumanji..The.Next.Level	2.5	-2.3041667	-1.3708333	6.870833	3.195833
User11	Bad.Boys.For.Life	7.5	0.6291667	0.6291667	6.870833	8.129167
User08	The.Gentlemen	9.0	1.7541667	-0.4708333	6.870833	8.154167
User09	Dolittle	8.0	-0.7083333	0.8041667	6.870833	6.966667
User02	The.Gentlemen	7.5	1.7541667	-1.3708333	6.870833	7.254167
User09	Birds.of.Prey	6.7	0.4666667	0.8041667	6.870833	8.141667
User09	Jumanji..The.Next.Level	7.1	-2.3041667	0.8041667	6.870833	5.370833
User07	Jumanji..The.Next.Level	4.0	-2.3041667	-0.7279762	6.870833	3.838690
User09	The.Gentlemen	8.9	1.7541667	0.8041667	6.870833	9.429167
User05	Birds.of.Prey	7.5	0.4666667	-0.2708333	6.870833	7.066667
User08	Knives.Out	8.0	1.0791667	-0.4708333	6.870833	7.479167
User01	Jumanji..The.Next.Level	6.0	-2.3041667	-0.2994048	6.870833	4.267262
User02	Knives.Out	7.5	1.0791667	-1.3708333	6.870833	6.579167
User02	X1917	2.5	-3.6708333	-1.3708333	6.870833	1.829167
User10	Star.Wars	7.5	1.7891667	1.5791667	6.870833	10.239167
User01	Dolittle	6.0	-0.7083333	-0.2994048	6.870833	5.863095
User08	Jumanji..The.Next.Level	4.0	-2.3041667	-0.4708333	6.870833	4.095833
User04	Birds.of.Prey	7.5	0.4666667	1.3791667	6.870833	8.716667
User07	X1917	2.0	-3.6708333	-0.7279762	6.870833	2.472024
User07	Bad.Boys.For.Life	9.0	0.6291667	-0.7279762	6.870833	6.772024
User01	The.Gentlemen	8.0	1.7541667	-0.2994048	6.870833	8.325595
User07	Dolittle	5.0	-0.7083333	-0.7279762	6.870833	5.434524
User03	Dolittle	5.0	-0.7083333	0.6491667	6.870833	6.811667
User10	X1917	6.3	-3.6708333	1.5791667	6.870833	4.779167
User07	Birds.of.Prey	7.0	0.4666667	-0.7279762	6.870833	6.609524
User07	The.Gentlemen	8.0	1.7541667	-0.7279762	6.870833	7.897024
User06	Dolittle	7.5	-0.7083333	0.2291667	6.870833	6.391667
User03	The.Gentlemen	8.8	1.7541667	0.6491667	6.870833	9.274167
User05	The.Gentlemen	8.8	1.7541667	-0.2708333	6.870833	8.354167
User03	Birds.of.Prey	7.5	0.4666667	0.6491667	6.870833	7.986667
User06	Bad.Boys.For.Life	7.5	0.6291667	0.2291667	6.870833	7.729167
User06	Knives.Out	6.3	1.0791667	0.2291667	6.870833	8.179167
User08	Dolittle	4.0	-0.7083333	-0.4708333	6.870833	5.691667
User11	Dolittle	7.5	-0.7083333	0.6291667	6.870833	6.791667
User01	Star.Wars	10.0	1.7891667	-0.2994048	6.870833	8.360595
User03	Star.Wars	8.8	1.7891667	0.6491667	6.870833	9.309167
User03	Bad.Boys.For.Life	7.5	0.6291667	0.6491667	6.870833	8.149167
User10	The.Gentlemen	10.0	1.7541667	1.5791667	6.870833	10.204167
User08	Birds.of.Prey	7.0	0.4666667	-0.4708333	6.870833	6.866667
User01	Bad.Boys.For.Life	6.0	0.6291667	-0.2994048	6.870833	7.200595

Test Baseline

Below table displays Test data baseline.

MovieBias <- apply(test_data,1, function(x)  as.numeric(movie_bias[x['Movie']])) 
ReviewerBias <- apply(test_data,1, function(x) as.numeric(reviewer_bias[x['Reviewer']])) 

test_data <- cbind(test_data , MovieBias, ReviewerBias)

# Max value for review 10 
Baseline  <- apply(test_data,1, 
                   function(x) as.numeric(x['ReviewerBias']) +  as.numeric(x['MovieBias']) +raw_avg )

test_data  <- cbind(test_data, raw_avg , Baseline)
test_data_display <- test_data
row.names( test_data_display) <- c()
kable_styling (kable(test_data_display ),bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Reviewer	Movie	Rating	MovieBias	ReviewerBias	raw_avg	Baseline
User06	Birds.of.Prey	7.5	0.4666667	0.2291667	6.870833	7.566667
User10	Birds.of.Prey	6.3	0.4666667	1.5791667	6.870833	8.916667
User11	Birds.of.Prey	6.3	0.4666667	0.6291667	6.870833	7.966667
User02	Bad.Boys.For.Life	7.5	0.6291667	-1.3708333	6.870833	6.129167
User04	Bad.Boys.For.Life	6.0	0.6291667	1.3791667	6.870833	8.879167
User05	Bad.Boys.For.Life	8.8	0.6291667	-0.2708333	6.870833	7.229167
User08	Bad.Boys.For.Life	6.0	0.6291667	-0.4708333	6.870833	7.029167
User09	Bad.Boys.For.Life	8.5	0.6291667	0.8041667	6.870833	8.304167
User10	Bad.Boys.For.Life	6.3	0.6291667	1.5791667	6.870833	9.079167
User02	Star.Wars	7.5	1.7891667	-1.3708333	6.870833	7.289167
User05	Star.Wars	10.0	1.7891667	-0.2708333	6.870833	8.389167
User06	Star.Wars	7.5	1.7891667	0.2291667	6.870833	8.889167
User08	Star.Wars	7.0	1.7891667	-0.4708333	6.870833	8.189167
User09	Star.Wars	9.5	1.7891667	0.8041667	6.870833	9.464167
User11	Star.Wars	10.0	1.7891667	0.6291667	6.870833	9.289167
User02	Dolittle	5.0	-0.7083333	-1.3708333	6.870833	4.791667
User04	Dolittle	4.0	-0.7083333	1.3791667	6.870833	7.541667
User06	The.Gentlemen	10.0	1.7541667	0.2291667	6.870833	8.854167
User11	The.Gentlemen	8.8	1.7541667	0.6291667	6.870833	9.254167
User06	X1917	5.0	-3.6708333	0.2291667	6.870833	3.429167
User08	X1917	6.0	-3.6708333	-0.4708333	6.870833	2.729167

Baseline Predictor

Below table displays Raw Average and Baseline Predictor for test and train dataset. For the training dataset, we see that baseline predictor is significantly better(around 48.81%)than raw averages. We don’t see the same improvements for the test dataset; we believe this is the result of using random data and small dataset.

train_baseLine_pred <- sqrt(sum((train_data$Rating[!is.na(train_data$Rating)] - 
                               train_data$Baseline[!is.na(train_data$Rating)])^2) /
                          length(which(!is.na(train_data$Rating))))

test_baseLine_pred <- sqrt(sum((test_data$Rating[!is.na(test_data$Rating)] - 
                              test_data$Baseline[!is.na(test_data$Rating)])^2) /
                         length(which(!is.na(test_data$Rating))))

train_impro <- (1 -(train_baseLine_pred /train_rmse)) *100
test_impro <- (1 -(test_baseLine_pred /test_rmse)) *100

summary_display <-  cbind(train_rmse,  train_baseLine_pred, train_impro)
summary_display <-  rbind(summary_display, c(test_rmse, test_baseLine_pred, test_impro ))

colnames(summary_display) <-c("Raw Average ", "Baseline Predictor", "Improvement %")
row.names(summary_display) <-c("Train ", "Test")
kable_styling (kable(summary_display),
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Raw Average	Baseline Predictor	Improvement %
Train	2.045519	1.047019	48.81403
Test	1.176856	1.757275	-49.31948