Introduction
The recommender system I would like to implement recommends movies. A lot of work has already been done in this area, which makes it a convenient domain for gathering data and benchmarking results and models. In this case I grab data from metacritic.com, since it is one of the few sites that delivers numeric ratings on a 0-100 scale across multiple reviewers.
# Required libraries
library(caTools) # Train/test Split
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
Data Set
For this project, I wanted to create a small data set based on real reviews. Using reviews from metacritic.com, which assigns numeric values to its reviews (on a 0 to 100 scale), I recorded ratings for some recent movies. I quickly realized that not all the selected reviewers review all the selected movies, so this project will have a lot of NA values to handle.
We start by importing a preloaded CSV file available in my GitHub repository and showing the resulting data frame.
# Data import
data <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/movies.csv")
colnames(data) <- gsub("ï..Reviewers", "Reviewers", colnames(data))
data
## Reviewers Aladdin John.Wick.3 Avengers..End.Game
## 1 The Guardian 80 60 100
## 2 San Francisco Chronicle 75 75 75
## 3 USA Today 75 75 88
## 4 CNN 75 60 90
## 5 Chicago Sun-Times 75 88 100
## 6 New York Post 75 75 75
## 7 Rolling Stone 70 90 80
## 8 Variety 70 60 70
## 9 IGN 67 85 95
## 10 Boston Globe 63 63 75
## 11 Washington Post 63 75 100
## Pokemon.detective Brightburn Booksmart The.Hustle The.Intruder
## 1 60 60 80 20 NA
## 2 50 25 75 25 75
## 3 50 NA 88 NA NA
## 4 40 NA NA NA NA
## 5 63 38 88 NA NA
## 6 75 NA 100 50 63
## 7 50 40 80 20 NA
## 8 40 40 90 60 80
## 9 80 71 89 NA NA
## 10 NA NA 100 63 100
## 11 75 NA 88 NA NA
The next step is dividing the data frame into training and testing sets; to keep the split random, I use premade routines for this purpose.
Most manipulations and calculations are done with the tidyverse: the data frame is converted to long form and then split into training and testing sets with a 0.75 split ratio.
# Convert to long form
split_data <- data %>% gather(key = Movie, value = Rating, -Reviewers)
# Randomly split all ratings for training and testing sets
set.seed(50)
split <- sample.split(split_data$Rating, SplitRatio = 0.75)
# Prepare training set
train_data <- split_data
train_data$Rating[!split] <- NA
print("Training Dataset")
## [1] "Training Dataset"
head(train_data)
## Reviewers Movie Rating
## 1 The Guardian Aladdin NA
## 2 San Francisco Chronicle Aladdin NA
## 3 USA Today Aladdin NA
## 4 CNN Aladdin 75
## 5 Chicago Sun-Times Aladdin 75
## 6 New York Post Aladdin 75
# Prepare testing set
test_data <- split_data
test_data$Rating[split] <- NA
print("Test Dataset")
## [1] "Test Dataset"
head(test_data)
## Reviewers Movie Rating
## 1 The Guardian Aladdin 80
## 2 San Francisco Chronicle Aladdin 75
## 3 USA Today Aladdin 75
## 4 CNN Aladdin NA
## 5 Chicago Sun-Times Aladdin NA
## 6 New York Post Aladdin NA
Now that we have two different, randomly chosen datasets, we can move on to the RMSE calculations.
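The code below computes the RMSE inline over the observed (non-NA) ratings. As a reference, a small helper function (hypothetical, not part of the original script) performing the same computation might look like this:
# Hypothetical helper mirroring the inline RMSE calculations below:
# the error is computed only over observed (non-NA) rating/prediction pairs.
rmse <- function(actual, predicted) {
  ok <- !is.na(actual) & !is.na(predicted)
  sqrt(mean((actual[ok] - predicted[ok])^2))
}
# e.g. rmse(train_data$Rating, rep(raw_avg, nrow(train_data)))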
# Get raw average
raw_avg <- sum(train_data$Rating, na.rm = TRUE) / length(which(!is.na(train_data$Rating)))
# Calculate RMSE for raw average
rmse_raw_train <- sqrt(sum((train_data$Rating[!is.na(train_data$Rating)] - raw_avg)^2) /
length(which(!is.na(train_data$Rating))))
rmse_raw_train
## [1] 20.63277
rmse_raw_test <- sqrt(sum((test_data$Rating[!is.na(test_data$Rating)] - raw_avg)^2) /
length(which(!is.na(test_data$Rating))))
rmse_raw_test
## [1] 16.34963
We can see the RMSE values are quite large, which is expected for such a small sample with so many missing values.
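The next step builds a simple baseline predictor: each predicted rating is the raw average adjusted by a reviewer bias and a movie bias, where each bias is that reviewer's (or movie's) mean training rating minus the raw average. This is exactly what the dplyr pipeline below computes. With $\mu$ the raw average, $I_u$ the movies rated by reviewer $u$, and $U_i$ the reviewers who rated movie $i$:

$$\hat{r}_{ui} = \mu + b_u + b_i, \qquad b_u = \frac{1}{|I_u|}\sum_{i \in I_u}(r_{ui} - \mu), \qquad b_i = \frac{1}{|U_i|}\sum_{u \in U_i}(r_{ui} - \mu)$$

For example, IGN's observed training ratings (67, 85, 95, 80, 71, 89) average about 81.17, giving a reviewer bias of roughly 81.17 - 69.02 ≈ 12.15, which matches the ReviewersBias column below.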
# Get Reviewers and Movie biases
Reviewers_bias <- train_data %>% filter(!is.na(Rating)) %>%
group_by(Reviewers) %>%
summarise(sum = sum(Rating), count = n()) %>%
mutate(bias = sum/count-raw_avg) %>%
select(Reviewers, ReviewersBias = bias)
ReviewersBias<-Reviewers_bias$ReviewersBias
Movie_bias <- train_data %>% filter(!is.na(Rating)) %>%
group_by(Movie) %>%
summarise(sum = sum(Rating), count = n()) %>%
mutate(bias = sum/count-raw_avg) %>%
select(Movie, MovieBias = bias)
MovieBias<-Movie_bias$MovieBias
train_data <- train_data %>% left_join(Reviewers_bias, by = "Reviewers") %>%
left_join(Movie_bias, by = "Movie") %>%
mutate(RawAvg = raw_avg) %>%
mutate(Baseline = RawAvg + ReviewersBias + MovieBias)
train_data
## Reviewers Movie Rating ReviewersBias
## 1 The Guardian Aladdin NA -9.0192308
## 2 San Francisco Chronicle Aladdin NA -14.8525641
## 3 USA Today Aladdin NA 6.2307692
## 4 CNN Aladdin 75 -2.7692308
## 5 Chicago Sun-Times Aladdin 75 3.7807692
## 6 New York Post Aladdin 75 -0.2692308
## 7 Rolling Stone Aladdin 70 -5.6858974
## 8 Variety Aladdin NA -6.5192308
## 9 IGN Aladdin 67 12.1474359
## 10 Boston Globe Aladdin NA 11.1807692
## 11 Washington Post Aladdin 63 6.2307692
## 12 The Guardian John.Wick.3 60 -9.0192308
## 13 San Francisco Chronicle John.Wick.3 75 -14.8525641
## 14 USA Today John.Wick.3 75 6.2307692
## 15 CNN John.Wick.3 60 -2.7692308
## 16 Chicago Sun-Times John.Wick.3 88 3.7807692
## 17 New York Post John.Wick.3 75 -0.2692308
## 18 Rolling Stone John.Wick.3 90 -5.6858974
## 19 Variety John.Wick.3 60 -6.5192308
## 20 IGN John.Wick.3 85 12.1474359
## 21 Boston Globe John.Wick.3 63 11.1807692
## 22 Washington Post John.Wick.3 75 6.2307692
## 23 The Guardian Avengers..End.Game 100 -9.0192308
## 24 San Francisco Chronicle Avengers..End.Game NA -14.8525641
## 25 USA Today Avengers..End.Game 88 6.2307692
## 26 CNN Avengers..End.Game 90 -2.7692308
## 27 Chicago Sun-Times Avengers..End.Game 100 3.7807692
## 28 New York Post Avengers..End.Game NA -0.2692308
## 29 Rolling Stone Avengers..End.Game 80 -5.6858974
## 30 Variety Avengers..End.Game 70 -6.5192308
## 31 IGN Avengers..End.Game 95 12.1474359
## 32 Boston Globe Avengers..End.Game 75 11.1807692
## 33 Washington Post Avengers..End.Game NA 6.2307692
## 34 The Guardian Pokemon.detective NA -9.0192308
## 35 San Francisco Chronicle Pokemon.detective 50 -14.8525641
## 36 USA Today Pokemon.detective 50 6.2307692
## 37 CNN Pokemon.detective 40 -2.7692308
## 38 Chicago Sun-Times Pokemon.detective 63 3.7807692
## 39 New York Post Pokemon.detective 75 -0.2692308
## 40 Rolling Stone Pokemon.detective NA -5.6858974
## 41 Variety Pokemon.detective 40 -6.5192308
## 42 IGN Pokemon.detective 80 12.1474359
## 43 Boston Globe Pokemon.detective NA 11.1807692
## 44 Washington Post Pokemon.detective 75 6.2307692
## 45 The Guardian Brightburn 60 -9.0192308
## 46 San Francisco Chronicle Brightburn 25 -14.8525641
## 47 USA Today Brightburn NA 6.2307692
## 48 CNN Brightburn NA -2.7692308
## 49 Chicago Sun-Times Brightburn 38 3.7807692
## 50 New York Post Brightburn NA -0.2692308
## 51 Rolling Stone Brightburn 40 -5.6858974
## 52 Variety Brightburn NA -6.5192308
## 53 IGN Brightburn 71 12.1474359
## 54 Boston Globe Brightburn NA 11.1807692
## 55 Washington Post Brightburn NA 6.2307692
## 56 The Guardian Booksmart NA -9.0192308
## 57 San Francisco Chronicle Booksmart 75 -14.8525641
## 58 USA Today Booksmart 88 6.2307692
## 59 CNN Booksmart NA -2.7692308
## 60 Chicago Sun-Times Booksmart NA 3.7807692
## 61 New York Post Booksmart NA -0.2692308
## 62 Rolling Stone Booksmart 80 -5.6858974
## 63 Variety Booksmart NA -6.5192308
## 64 IGN Booksmart 89 12.1474359
## 65 Boston Globe Booksmart 100 11.1807692
## 66 Washington Post Booksmart 88 6.2307692
## 67 The Guardian The.Hustle 20 -9.0192308
## 68 San Francisco Chronicle The.Hustle 25 -14.8525641
## 69 USA Today The.Hustle NA 6.2307692
## 70 CNN The.Hustle NA -2.7692308
## 71 Chicago Sun-Times The.Hustle NA 3.7807692
## 72 New York Post The.Hustle 50 -0.2692308
## 73 Rolling Stone The.Hustle 20 -5.6858974
## 74 Variety The.Hustle NA -6.5192308
## 75 IGN The.Hustle NA 12.1474359
## 76 Boston Globe The.Hustle 63 11.1807692
## 77 Washington Post The.Hustle NA 6.2307692
## 78 The Guardian The.Intruder NA -9.0192308
## 79 San Francisco Chronicle The.Intruder 75 -14.8525641
## 80 USA Today The.Intruder NA 6.2307692
## 81 CNN The.Intruder NA -2.7692308
## 82 Chicago Sun-Times The.Intruder NA 3.7807692
## 83 New York Post The.Intruder NA -0.2692308
## 84 Rolling Stone The.Intruder NA -5.6858974
## 85 Variety The.Intruder 80 -6.5192308
## 86 IGN The.Intruder NA 12.1474359
## 87 Boston Globe The.Intruder 100 11.1807692
## 88 Washington Post The.Intruder NA 6.2307692
## MovieBias RawAvg Baseline
## 1 1.814103 69.01923 61.81410
## 2 1.814103 69.01923 55.98077
## 3 1.814103 69.01923 77.06410
## 4 1.814103 69.01923 68.06410
## 5 1.814103 69.01923 74.61410
## 6 1.814103 69.01923 70.56410
## 7 1.814103 69.01923 65.14744
## 8 1.814103 69.01923 64.31410
## 9 1.814103 69.01923 82.98077
## 10 1.814103 69.01923 82.01410
## 11 1.814103 69.01923 77.06410
## 12 4.253497 69.01923 64.25350
## 13 4.253497 69.01923 58.42016
## 14 4.253497 69.01923 79.50350
## 15 4.253497 69.01923 70.50350
## 16 4.253497 69.01923 77.05350
## 17 4.253497 69.01923 73.00350
## 18 4.253497 69.01923 67.58683
## 19 4.253497 69.01923 66.75350
## 20 4.253497 69.01923 85.42016
## 21 4.253497 69.01923 84.45350
## 22 4.253497 69.01923 79.50350
## 23 18.230769 69.01923 78.23077
## 24 18.230769 69.01923 72.39744
## 25 18.230769 69.01923 93.48077
## 26 18.230769 69.01923 84.48077
## 27 18.230769 69.01923 91.03077
## 28 18.230769 69.01923 86.98077
## 29 18.230769 69.01923 81.56410
## 30 18.230769 69.01923 80.73077
## 31 18.230769 69.01923 99.39744
## 32 18.230769 69.01923 98.43077
## 33 18.230769 69.01923 93.48077
## 34 -9.894231 69.01923 50.10577
## 35 -9.894231 69.01923 44.27244
## 36 -9.894231 69.01923 65.35577
## 37 -9.894231 69.01923 56.35577
## 38 -9.894231 69.01923 62.90577
## 39 -9.894231 69.01923 58.85577
## 40 -9.894231 69.01923 53.43910
## 41 -9.894231 69.01923 52.60577
## 42 -9.894231 69.01923 71.27244
## 43 -9.894231 69.01923 70.30577
## 44 -9.894231 69.01923 65.35577
## 45 -22.219231 69.01923 37.78077
## 46 -22.219231 69.01923 31.94744
## 47 -22.219231 69.01923 53.03077
## 48 -22.219231 69.01923 44.03077
## 49 -22.219231 69.01923 50.58077
## 50 -22.219231 69.01923 46.53077
## 51 -22.219231 69.01923 41.11410
## 52 -22.219231 69.01923 40.28077
## 53 -22.219231 69.01923 58.94744
## 54 -22.219231 69.01923 57.98077
## 55 -22.219231 69.01923 53.03077
## 56 17.647436 69.01923 77.64744
## 57 17.647436 69.01923 71.81410
## 58 17.647436 69.01923 92.89744
## 59 17.647436 69.01923 83.89744
## 60 17.647436 69.01923 90.44744
## 61 17.647436 69.01923 86.39744
## 62 17.647436 69.01923 80.98077
## 63 17.647436 69.01923 80.14744
## 64 17.647436 69.01923 98.81410
## 65 17.647436 69.01923 97.84744
## 66 17.647436 69.01923 92.89744
## 67 -33.419231 69.01923 26.58077
## 68 -33.419231 69.01923 20.74744
## 69 -33.419231 69.01923 41.83077
## 70 -33.419231 69.01923 32.83077
## 71 -33.419231 69.01923 39.38077
## 72 -33.419231 69.01923 35.33077
## 73 -33.419231 69.01923 29.91410
## 74 -33.419231 69.01923 29.08077
## 75 -33.419231 69.01923 47.74744
## 76 -33.419231 69.01923 46.78077
## 77 -33.419231 69.01923 41.83077
## 78 15.980769 69.01923 75.98077
## 79 15.980769 69.01923 70.14744
## 80 15.980769 69.01923 91.23077
## 81 15.980769 69.01923 82.23077
## 82 15.980769 69.01923 88.78077
## 83 15.980769 69.01923 84.73077
## 84 15.980769 69.01923 79.31410
## 85 15.980769 69.01923 78.48077
## 86 15.980769 69.01923 97.14744
## 87 15.980769 69.01923 96.18077
## 88 15.980769 69.01923 91.23077
test_data <- test_data %>% left_join(Reviewers_bias, by = "Reviewers") %>%
left_join(Movie_bias, by = "Movie") %>%
mutate(RawAvg = raw_avg) %>%
mutate(Baseline = RawAvg + ReviewersBias + MovieBias)
test_data
## Reviewers Movie Rating ReviewersBias
## 1 The Guardian Aladdin 80 -9.0192308
## 2 San Francisco Chronicle Aladdin 75 -14.8525641
## 3 USA Today Aladdin 75 6.2307692
## 4 CNN Aladdin NA -2.7692308
## 5 Chicago Sun-Times Aladdin NA 3.7807692
## 6 New York Post Aladdin NA -0.2692308
## 7 Rolling Stone Aladdin NA -5.6858974
## 8 Variety Aladdin 70 -6.5192308
## 9 IGN Aladdin NA 12.1474359
## 10 Boston Globe Aladdin 63 11.1807692
## 11 Washington Post Aladdin NA 6.2307692
## 12 The Guardian John.Wick.3 NA -9.0192308
## 13 San Francisco Chronicle John.Wick.3 NA -14.8525641
## 14 USA Today John.Wick.3 NA 6.2307692
## 15 CNN John.Wick.3 NA -2.7692308
## 16 Chicago Sun-Times John.Wick.3 NA 3.7807692
## 17 New York Post John.Wick.3 NA -0.2692308
## 18 Rolling Stone John.Wick.3 NA -5.6858974
## 19 Variety John.Wick.3 NA -6.5192308
## 20 IGN John.Wick.3 NA 12.1474359
## 21 Boston Globe John.Wick.3 NA 11.1807692
## 22 Washington Post John.Wick.3 NA 6.2307692
## 23 The Guardian Avengers..End.Game NA -9.0192308
## 24 San Francisco Chronicle Avengers..End.Game 75 -14.8525641
## 25 USA Today Avengers..End.Game NA 6.2307692
## 26 CNN Avengers..End.Game NA -2.7692308
## 27 Chicago Sun-Times Avengers..End.Game NA 3.7807692
## 28 New York Post Avengers..End.Game 75 -0.2692308
## 29 Rolling Stone Avengers..End.Game NA -5.6858974
## 30 Variety Avengers..End.Game NA -6.5192308
## 31 IGN Avengers..End.Game NA 12.1474359
## 32 Boston Globe Avengers..End.Game NA 11.1807692
## 33 Washington Post Avengers..End.Game 100 6.2307692
## 34 The Guardian Pokemon.detective 60 -9.0192308
## 35 San Francisco Chronicle Pokemon.detective NA -14.8525641
## 36 USA Today Pokemon.detective NA 6.2307692
## 37 CNN Pokemon.detective NA -2.7692308
## 38 Chicago Sun-Times Pokemon.detective NA 3.7807692
## 39 New York Post Pokemon.detective NA -0.2692308
## 40 Rolling Stone Pokemon.detective 50 -5.6858974
## 41 Variety Pokemon.detective NA -6.5192308
## 42 IGN Pokemon.detective NA 12.1474359
## 43 Boston Globe Pokemon.detective NA 11.1807692
## 44 Washington Post Pokemon.detective NA 6.2307692
## 45 The Guardian Brightburn NA -9.0192308
## 46 San Francisco Chronicle Brightburn NA -14.8525641
## 47 USA Today Brightburn NA 6.2307692
## 48 CNN Brightburn NA -2.7692308
## 49 Chicago Sun-Times Brightburn NA 3.7807692
## 50 New York Post Brightburn NA -0.2692308
## 51 Rolling Stone Brightburn NA -5.6858974
## 52 Variety Brightburn 40 -6.5192308
## 53 IGN Brightburn NA 12.1474359
## 54 Boston Globe Brightburn NA 11.1807692
## 55 Washington Post Brightburn NA 6.2307692
## 56 The Guardian Booksmart 80 -9.0192308
## 57 San Francisco Chronicle Booksmart NA -14.8525641
## 58 USA Today Booksmart NA 6.2307692
## 59 CNN Booksmart NA -2.7692308
## 60 Chicago Sun-Times Booksmart 88 3.7807692
## 61 New York Post Booksmart 100 -0.2692308
## 62 Rolling Stone Booksmart NA -5.6858974
## 63 Variety Booksmart 90 -6.5192308
## 64 IGN Booksmart NA 12.1474359
## 65 Boston Globe Booksmart NA 11.1807692
## 66 Washington Post Booksmart NA 6.2307692
## 67 The Guardian The.Hustle NA -9.0192308
## 68 San Francisco Chronicle The.Hustle NA -14.8525641
## 69 USA Today The.Hustle NA 6.2307692
## 70 CNN The.Hustle NA -2.7692308
## 71 Chicago Sun-Times The.Hustle NA 3.7807692
## 72 New York Post The.Hustle NA -0.2692308
## 73 Rolling Stone The.Hustle NA -5.6858974
## 74 Variety The.Hustle 60 -6.5192308
## 75 IGN The.Hustle NA 12.1474359
## 76 Boston Globe The.Hustle NA 11.1807692
## 77 Washington Post The.Hustle NA 6.2307692
## 78 The Guardian The.Intruder NA -9.0192308
## 79 San Francisco Chronicle The.Intruder NA -14.8525641
## 80 USA Today The.Intruder NA 6.2307692
## 81 CNN The.Intruder NA -2.7692308
## 82 Chicago Sun-Times The.Intruder NA 3.7807692
## 83 New York Post The.Intruder 63 -0.2692308
## 84 Rolling Stone The.Intruder NA -5.6858974
## 85 Variety The.Intruder NA -6.5192308
## 86 IGN The.Intruder NA 12.1474359
## 87 Boston Globe The.Intruder NA 11.1807692
## 88 Washington Post The.Intruder NA 6.2307692
## MovieBias RawAvg Baseline
## 1 1.814103 69.01923 61.81410
## 2 1.814103 69.01923 55.98077
## 3 1.814103 69.01923 77.06410
## 4 1.814103 69.01923 68.06410
## 5 1.814103 69.01923 74.61410
## 6 1.814103 69.01923 70.56410
## 7 1.814103 69.01923 65.14744
## 8 1.814103 69.01923 64.31410
## 9 1.814103 69.01923 82.98077
## 10 1.814103 69.01923 82.01410
## 11 1.814103 69.01923 77.06410
## 12 4.253497 69.01923 64.25350
## 13 4.253497 69.01923 58.42016
## 14 4.253497 69.01923 79.50350
## 15 4.253497 69.01923 70.50350
## 16 4.253497 69.01923 77.05350
## 17 4.253497 69.01923 73.00350
## 18 4.253497 69.01923 67.58683
## 19 4.253497 69.01923 66.75350
## 20 4.253497 69.01923 85.42016
## 21 4.253497 69.01923 84.45350
## 22 4.253497 69.01923 79.50350
## 23 18.230769 69.01923 78.23077
## 24 18.230769 69.01923 72.39744
## 25 18.230769 69.01923 93.48077
## 26 18.230769 69.01923 84.48077
## 27 18.230769 69.01923 91.03077
## 28 18.230769 69.01923 86.98077
## 29 18.230769 69.01923 81.56410
## 30 18.230769 69.01923 80.73077
## 31 18.230769 69.01923 99.39744
## 32 18.230769 69.01923 98.43077
## 33 18.230769 69.01923 93.48077
## 34 -9.894231 69.01923 50.10577
## 35 -9.894231 69.01923 44.27244
## 36 -9.894231 69.01923 65.35577
## 37 -9.894231 69.01923 56.35577
## 38 -9.894231 69.01923 62.90577
## 39 -9.894231 69.01923 58.85577
## 40 -9.894231 69.01923 53.43910
## 41 -9.894231 69.01923 52.60577
## 42 -9.894231 69.01923 71.27244
## 43 -9.894231 69.01923 70.30577
## 44 -9.894231 69.01923 65.35577
## 45 -22.219231 69.01923 37.78077
## 46 -22.219231 69.01923 31.94744
## 47 -22.219231 69.01923 53.03077
## 48 -22.219231 69.01923 44.03077
## 49 -22.219231 69.01923 50.58077
## 50 -22.219231 69.01923 46.53077
## 51 -22.219231 69.01923 41.11410
## 52 -22.219231 69.01923 40.28077
## 53 -22.219231 69.01923 58.94744
## 54 -22.219231 69.01923 57.98077
## 55 -22.219231 69.01923 53.03077
## 56 17.647436 69.01923 77.64744
## 57 17.647436 69.01923 71.81410
## 58 17.647436 69.01923 92.89744
## 59 17.647436 69.01923 83.89744
## 60 17.647436 69.01923 90.44744
## 61 17.647436 69.01923 86.39744
## 62 17.647436 69.01923 80.98077
## 63 17.647436 69.01923 80.14744
## 64 17.647436 69.01923 98.81410
## 65 17.647436 69.01923 97.84744
## 66 17.647436 69.01923 92.89744
## 67 -33.419231 69.01923 26.58077
## 68 -33.419231 69.01923 20.74744
## 69 -33.419231 69.01923 41.83077
## 70 -33.419231 69.01923 32.83077
## 71 -33.419231 69.01923 39.38077
## 72 -33.419231 69.01923 35.33077
## 73 -33.419231 69.01923 29.91410
## 74 -33.419231 69.01923 29.08077
## 75 -33.419231 69.01923 47.74744
## 76 -33.419231 69.01923 46.78077
## 77 -33.419231 69.01923 41.83077
## 78 15.980769 69.01923 75.98077
## 79 15.980769 69.01923 70.14744
## 80 15.980769 69.01923 91.23077
## 81 15.980769 69.01923 82.23077
## 82 15.980769 69.01923 88.78077
## 83 15.980769 69.01923 84.73077
## 84 15.980769 69.01923 79.31410
## 85 15.980769 69.01923 78.48077
## 86 15.980769 69.01923 97.14744
## 87 15.980769 69.01923 96.18077
## 88 15.980769 69.01923 91.23077
# Calculate RMSE for baseline predictors
rmse_base_train <- sqrt(sum((train_data$Rating[!is.na(train_data$Rating)] -
train_data$Baseline[!is.na(train_data$Rating)])^2) /
length(which(!is.na(train_data$Rating))))
rmse_base_test <- sqrt(sum((test_data$Rating[!is.na(test_data$Rating)] -
test_data$Baseline[!is.na(test_data$Rating)])^2) /
length(which(!is.na(test_data$Rating))))
You can tell the largest biases are driven by the sparsely rated rows and columns, i.e., cases where a reviewer didn't review the movie and the bias is estimated from only a few ratings. Continuing the RMSE calculation with this bias information for the baseline ratings, we can see the RMSE for the new baseline predictions below.
rmse_base_train
## [1] 10.947
rmse_base_test
## [1] 13.53655
The table below shows the RMSE values for the training and testing sets, for both the raw average and the baseline predictor.
| Predictor | RMSE |
|---|---|
| Training: Raw Average | 20.63277 |
| Training: Baseline Predictor | 10.94700 |
| Testing: Raw Average | 16.34963 |
| Testing: Baseline Predictor | 13.53655 |
We can see that the baseline predictor improves the RMSE on both the training and test datasets (from about 20.6 to 10.9 on training, and from about 16.3 to 13.5 on testing). Even though the dataset is small and, you could say, incomplete, it gave us enough information to visualize the movie and reviewer biases and apply them in the model. Since the training and test datasets are randomly generated with every execution, I'm reporting on a single snapshot.
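Because the reported numbers depend on the random split, a natural follow-up would be to average the test RMSE over several splits. A minimal sketch, assuming the same `split_data`, caTools, and dplyr setup as above (the function name and the choice of 20 repetitions are arbitrary):
# Hypothetical extension: average the baseline predictor's test RMSE over
# several random 75/25 splits instead of reporting a single snapshot.
baseline_test_rmse <- function(long_df, seed) {
  set.seed(seed)
  split <- sample.split(long_df$Rating, SplitRatio = 0.75)
  train <- long_df; train$Rating[!split] <- NA
  test  <- long_df; test$Rating[split]  <- NA
  mu <- mean(train$Rating, na.rm = TRUE)                     # raw average
  r_bias <- train %>% filter(!is.na(Rating)) %>%
    group_by(Reviewers) %>% summarise(ReviewersBias = mean(Rating) - mu)
  m_bias <- train %>% filter(!is.na(Rating)) %>%
    group_by(Movie) %>% summarise(MovieBias = mean(Rating) - mu)
  test <- test %>% left_join(r_bias, by = "Reviewers") %>%
    left_join(m_bias, by = "Movie") %>%
    mutate(Baseline = mu + ReviewersBias + MovieBias)
  # pairs whose reviewer or movie has no training ratings are skipped via na.rm
  sqrt(mean((test$Rating - test$Baseline)^2, na.rm = TRUE))
}
mean(sapply(1:20, function(s) baseline_test_rmse(split_data, s)))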