Introduction

The recommender system I would like to implement recommends movies. A lot of work has been done in this area, so it’s a good one for gathering data and benchmarking results and models. In this case I grab data from metacritic.com, since it is one of the few sites that delivers numeric ratings on a 0-100 scale across multiple reviewers.

# Required libraries
library(caTools)  # Train/test Split
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Data Set

For this project, I wanted to create a small data set based on real reviews. Using reviews from metacritic.com, which assigns numeric values to its reviews (on a 0 to 100 scale), I recorded some ratings for recent movies. I quickly realized that not all the selected reviewers review all the selected movies, so the data set has a lot of NA values that I’ll have to handle.

We start by importing a preloaded CSV file available on my GitHub and showing the resulting data frame.

# Data import
data <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/movies.csv")  
colnames(data) <- gsub("ï..Reviewers", "Reviewers", colnames(data))
data
##                  Reviewers Aladdin John.Wick.3 Avengers..End.Game
## 1             The Guardian      80          60                100
## 2  San Francisco Chronicle      75          75                 75
## 3                USA Today      75          75                 88
## 4                      CNN      75          60                 90
## 5        Chicago Sun-Times      75          88                100
## 6            New York Post      75          75                 75
## 7            Rolling Stone      70          90                 80
## 8                  Variety      70          60                 70
## 9                      IGN      67          85                 95
## 10            Boston Globe      63          63                 75
## 11         Washington Post      63          75                100
##    Pokemon.detective Brightburn Booksmart The.Hustle The.Intruder
## 1                 60         60        80         20           NA
## 2                 50         25        75         25           75
## 3                 50         NA        88         NA           NA
## 4                 40         NA        NA         NA           NA
## 5                 63         38        88         NA           NA
## 6                 75         NA       100         50           63
## 7                 50         40        80         20           NA
## 8                 40         40        90         60           80
## 9                 80         71        89         NA           NA
## 10                NA         NA       100         63          100
## 11                75         NA        88         NA           NA
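
The `gsub` call above strips the stray `ï..` prefix that a UTF-8 byte order mark can leave on the first column name. As an alternative, base R can drop the mark while reading. The snippet below is a minimal sketch on a toy temporary file (the file contents and the name `toy` are made up for illustration), not the import used in this report.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (toy data)
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Reviewers,Aladdin\nCNN,75\n")), tmp)

# fileEncoding = "UTF-8-BOM" asks R to discard the mark on read
toy <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
colnames(toy)
unlink(tmp)
```

If this behaves as expected on your platform, the first column name comes back clean and the `gsub` step becomes unnecessary.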

The next step is dividing the data frame into training and test datasets. To keep the split random, I’ll use routines made for this purpose.

Training/Testing Split

Most manipulations and calculations were done using tidyverse. The data frame was converted to long form and split into training and testing sets based on a 0.75 split ratio.

# Convert to long form
split_data <- data %>% gather(key = Movie, value = Rating, -Reviewers)

# Randomly split all ratings for training and testing sets
set.seed(50)
split <- sample.split(split_data$Rating, SplitRatio = 0.75)

# Prepare training set
train_data <- split_data
train_data$Rating[!split] <- NA
print("Training Dataset")
## [1] "Training Dataset"
head(train_data)
##                 Reviewers   Movie Rating
## 1            The Guardian Aladdin     NA
## 2 San Francisco Chronicle Aladdin     NA
## 3               USA Today Aladdin     NA
## 4                     CNN Aladdin     75
## 5       Chicago Sun-Times Aladdin     75
## 6           New York Post Aladdin     75
# Prepare testing set
test_data <- split_data
test_data$Rating[split] <- NA
print("Test Dataset")
## [1] "Test Dataset"
head(test_data)
##                 Reviewers   Movie Rating
## 1            The Guardian Aladdin     80
## 2 San Francisco Chronicle Aladdin     75
## 3               USA Today Aladdin     75
## 4                     CNN Aladdin     NA
## 5       Chicago Sun-Times Aladdin     NA
## 6           New York Post Aladdin     NA
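
Because both sets are built by masking the same long data frame, every observed rating should end up in exactly one of them. The check below is a small sketch of that property on toy ratings; it uses base R `sample` in place of `sample.split`, and the names `ratings` and `in_train` are assumptions for illustration.

```r
set.seed(50)
ratings <- c(80, 75, NA, 75, 60, 88, NA, 90, 100, 63, 70, 40)

# Logical mask: TRUE marks a position kept for training (75% of positions)
n <- length(ratings)
in_train <- seq_len(n) %in% sample(n, size = round(0.75 * n))

# Mask the complementary positions, as done for train_data and test_data
train <- ifelse(in_train, ratings, NA)
test  <- ifelse(in_train, NA, ratings)

# No rating appears in both sets, and none is lost
sum(!is.na(train) & !is.na(test))                                # 0
sum(!is.na(train)) + sum(!is.na(test)) == sum(!is.na(ratings))   # TRUE
```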

Now that we have two randomly chosen datasets, we can move on to the RMSE calculations.

# Get raw average
raw_avg <- sum(train_data$Rating, na.rm = TRUE) / length(which(!is.na(train_data$Rating)))

# Calculate RMSE for raw average
rmse_raw_train <- sqrt(sum((train_data$Rating[!is.na(train_data$Rating)] - raw_avg)^2) /
                         length(which(!is.na(train_data$Rating))))
rmse_raw_train
## [1] 20.63277
rmse_raw_test <- sqrt(sum((test_data$Rating[!is.na(test_data$Rating)] - raw_avg)^2) /
                        length(which(!is.na(test_data$Rating))))
rmse_raw_test
## [1] 16.34963

We can see the RMSE values are quite large, which is expected in such a small sample with so many empty values.
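
The RMSE expression is repeated several times in this report, so it could be wrapped in a small helper. This is a minimal sketch; the function name `rmse` and the toy vector are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: root-mean-square error over the observed ratings.
# `predicted` may be a single number (e.g. the raw average) or a vector.
rmse <- function(actual, predicted) {
  err <- actual - predicted          # NA wherever a rating is missing
  sqrt(mean(err^2, na.rm = TRUE))    # missing ratings are skipped
}

# Toy ratings with one missing value
ratings <- c(80, 60, NA, 100)
avg <- mean(ratings, na.rm = TRUE)   # 80
rmse(ratings, avg)                   # about 16.33
```

With a helper like this, `rmse(train_data$Rating, raw_avg)` should give the same value as the longer expression used for `rmse_raw_train`.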

Baseline Predictors

# Get Reviewers and Movie biases
Reviewers_bias <- train_data %>% filter(!is.na(Rating)) %>% 
  group_by(Reviewers) %>%
  summarise(sum = sum(Rating), count = n()) %>% 
  mutate(bias = sum/count-raw_avg) %>%
  select(Reviewers, ReviewersBias = bias)
ReviewersBias<-Reviewers_bias$ReviewersBias

Movie_bias <- train_data %>% filter(!is.na(Rating)) %>% 
  group_by(Movie) %>%
  summarise(sum = sum(Rating), count = n()) %>% 
  mutate(bias = sum/count-raw_avg) %>%
  select(Movie, MovieBias = bias)
MovieBias<-Movie_bias$MovieBias
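
A reviewer's bias is that reviewer's mean training rating minus the raw average, and a movie's bias is defined the same way per movie. The sketch below shows the same idea in base R on a toy data frame (the data and the lowercase names are made up for illustration).

```r
# Toy long-form ratings with one missing value
toy <- data.frame(
  Reviewers = c("A", "A", "B", "B", "C"),
  Movie     = c("M1", "M2", "M1", "M2", "M1"),
  Rating    = c(80, 60, 70, NA, 90)
)

raw_avg <- mean(toy$Rating, na.rm = TRUE)   # 75

# Group mean minus the raw average, ignoring missing ratings
reviewer_bias <- tapply(toy$Rating, toy$Reviewers, mean, na.rm = TRUE) - raw_avg
movie_bias    <- tapply(toy$Rating, toy$Movie, mean, na.rm = TRUE) - raw_avg

reviewer_bias   # A: -5, B: -5, C: 15
movie_bias      # M1: 5, M2: -15
```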



train_data <- train_data %>% left_join(Reviewers_bias, by = "Reviewers") %>%
  left_join(Movie_bias, by = "Movie") %>%
  mutate(RawAvg = raw_avg) %>%
  mutate(Baseline = RawAvg + ReviewersBias + MovieBias)
train_data
##                  Reviewers              Movie Rating ReviewersBias
## 1             The Guardian            Aladdin     NA    -9.0192308
## 2  San Francisco Chronicle            Aladdin     NA   -14.8525641
## 3                USA Today            Aladdin     NA     6.2307692
## 4                      CNN            Aladdin     75    -2.7692308
## 5        Chicago Sun-Times            Aladdin     75     3.7807692
## 6            New York Post            Aladdin     75    -0.2692308
## 7            Rolling Stone            Aladdin     70    -5.6858974
## 8                  Variety            Aladdin     NA    -6.5192308
## 9                      IGN            Aladdin     67    12.1474359
## 10            Boston Globe            Aladdin     NA    11.1807692
## 11         Washington Post            Aladdin     63     6.2307692
## 12            The Guardian        John.Wick.3     60    -9.0192308
## 13 San Francisco Chronicle        John.Wick.3     75   -14.8525641
## 14               USA Today        John.Wick.3     75     6.2307692
## 15                     CNN        John.Wick.3     60    -2.7692308
## 16       Chicago Sun-Times        John.Wick.3     88     3.7807692
## 17           New York Post        John.Wick.3     75    -0.2692308
## 18           Rolling Stone        John.Wick.3     90    -5.6858974
## 19                 Variety        John.Wick.3     60    -6.5192308
## 20                     IGN        John.Wick.3     85    12.1474359
## 21            Boston Globe        John.Wick.3     63    11.1807692
## 22         Washington Post        John.Wick.3     75     6.2307692
## 23            The Guardian Avengers..End.Game    100    -9.0192308
## 24 San Francisco Chronicle Avengers..End.Game     NA   -14.8525641
## 25               USA Today Avengers..End.Game     88     6.2307692
## 26                     CNN Avengers..End.Game     90    -2.7692308
## 27       Chicago Sun-Times Avengers..End.Game    100     3.7807692
## 28           New York Post Avengers..End.Game     NA    -0.2692308
## 29           Rolling Stone Avengers..End.Game     80    -5.6858974
## 30                 Variety Avengers..End.Game     70    -6.5192308
## 31                     IGN Avengers..End.Game     95    12.1474359
## 32            Boston Globe Avengers..End.Game     75    11.1807692
## 33         Washington Post Avengers..End.Game     NA     6.2307692
## 34            The Guardian  Pokemon.detective     NA    -9.0192308
## 35 San Francisco Chronicle  Pokemon.detective     50   -14.8525641
## 36               USA Today  Pokemon.detective     50     6.2307692
## 37                     CNN  Pokemon.detective     40    -2.7692308
## 38       Chicago Sun-Times  Pokemon.detective     63     3.7807692
## 39           New York Post  Pokemon.detective     75    -0.2692308
## 40           Rolling Stone  Pokemon.detective     NA    -5.6858974
## 41                 Variety  Pokemon.detective     40    -6.5192308
## 42                     IGN  Pokemon.detective     80    12.1474359
## 43            Boston Globe  Pokemon.detective     NA    11.1807692
## 44         Washington Post  Pokemon.detective     75     6.2307692
## 45            The Guardian         Brightburn     60    -9.0192308
## 46 San Francisco Chronicle         Brightburn     25   -14.8525641
## 47               USA Today         Brightburn     NA     6.2307692
## 48                     CNN         Brightburn     NA    -2.7692308
## 49       Chicago Sun-Times         Brightburn     38     3.7807692
## 50           New York Post         Brightburn     NA    -0.2692308
## 51           Rolling Stone         Brightburn     40    -5.6858974
## 52                 Variety         Brightburn     NA    -6.5192308
## 53                     IGN         Brightburn     71    12.1474359
## 54            Boston Globe         Brightburn     NA    11.1807692
## 55         Washington Post         Brightburn     NA     6.2307692
## 56            The Guardian          Booksmart     NA    -9.0192308
## 57 San Francisco Chronicle          Booksmart     75   -14.8525641
## 58               USA Today          Booksmart     88     6.2307692
## 59                     CNN          Booksmart     NA    -2.7692308
## 60       Chicago Sun-Times          Booksmart     NA     3.7807692
## 61           New York Post          Booksmart     NA    -0.2692308
## 62           Rolling Stone          Booksmart     80    -5.6858974
## 63                 Variety          Booksmart     NA    -6.5192308
## 64                     IGN          Booksmart     89    12.1474359
## 65            Boston Globe          Booksmart    100    11.1807692
## 66         Washington Post          Booksmart     88     6.2307692
## 67            The Guardian         The.Hustle     20    -9.0192308
## 68 San Francisco Chronicle         The.Hustle     25   -14.8525641
## 69               USA Today         The.Hustle     NA     6.2307692
## 70                     CNN         The.Hustle     NA    -2.7692308
## 71       Chicago Sun-Times         The.Hustle     NA     3.7807692
## 72           New York Post         The.Hustle     50    -0.2692308
## 73           Rolling Stone         The.Hustle     20    -5.6858974
## 74                 Variety         The.Hustle     NA    -6.5192308
## 75                     IGN         The.Hustle     NA    12.1474359
## 76            Boston Globe         The.Hustle     63    11.1807692
## 77         Washington Post         The.Hustle     NA     6.2307692
## 78            The Guardian       The.Intruder     NA    -9.0192308
## 79 San Francisco Chronicle       The.Intruder     75   -14.8525641
## 80               USA Today       The.Intruder     NA     6.2307692
## 81                     CNN       The.Intruder     NA    -2.7692308
## 82       Chicago Sun-Times       The.Intruder     NA     3.7807692
## 83           New York Post       The.Intruder     NA    -0.2692308
## 84           Rolling Stone       The.Intruder     NA    -5.6858974
## 85                 Variety       The.Intruder     80    -6.5192308
## 86                     IGN       The.Intruder     NA    12.1474359
## 87            Boston Globe       The.Intruder    100    11.1807692
## 88         Washington Post       The.Intruder     NA     6.2307692
##     MovieBias   RawAvg Baseline
## 1    1.814103 69.01923 61.81410
## 2    1.814103 69.01923 55.98077
## 3    1.814103 69.01923 77.06410
## 4    1.814103 69.01923 68.06410
## 5    1.814103 69.01923 74.61410
## 6    1.814103 69.01923 70.56410
## 7    1.814103 69.01923 65.14744
## 8    1.814103 69.01923 64.31410
## 9    1.814103 69.01923 82.98077
## 10   1.814103 69.01923 82.01410
## 11   1.814103 69.01923 77.06410
## 12   4.253497 69.01923 64.25350
## 13   4.253497 69.01923 58.42016
## 14   4.253497 69.01923 79.50350
## 15   4.253497 69.01923 70.50350
## 16   4.253497 69.01923 77.05350
## 17   4.253497 69.01923 73.00350
## 18   4.253497 69.01923 67.58683
## 19   4.253497 69.01923 66.75350
## 20   4.253497 69.01923 85.42016
## 21   4.253497 69.01923 84.45350
## 22   4.253497 69.01923 79.50350
## 23  18.230769 69.01923 78.23077
## 24  18.230769 69.01923 72.39744
## 25  18.230769 69.01923 93.48077
## 26  18.230769 69.01923 84.48077
## 27  18.230769 69.01923 91.03077
## 28  18.230769 69.01923 86.98077
## 29  18.230769 69.01923 81.56410
## 30  18.230769 69.01923 80.73077
## 31  18.230769 69.01923 99.39744
## 32  18.230769 69.01923 98.43077
## 33  18.230769 69.01923 93.48077
## 34  -9.894231 69.01923 50.10577
## 35  -9.894231 69.01923 44.27244
## 36  -9.894231 69.01923 65.35577
## 37  -9.894231 69.01923 56.35577
## 38  -9.894231 69.01923 62.90577
## 39  -9.894231 69.01923 58.85577
## 40  -9.894231 69.01923 53.43910
## 41  -9.894231 69.01923 52.60577
## 42  -9.894231 69.01923 71.27244
## 43  -9.894231 69.01923 70.30577
## 44  -9.894231 69.01923 65.35577
## 45 -22.219231 69.01923 37.78077
## 46 -22.219231 69.01923 31.94744
## 47 -22.219231 69.01923 53.03077
## 48 -22.219231 69.01923 44.03077
## 49 -22.219231 69.01923 50.58077
## 50 -22.219231 69.01923 46.53077
## 51 -22.219231 69.01923 41.11410
## 52 -22.219231 69.01923 40.28077
## 53 -22.219231 69.01923 58.94744
## 54 -22.219231 69.01923 57.98077
## 55 -22.219231 69.01923 53.03077
## 56  17.647436 69.01923 77.64744
## 57  17.647436 69.01923 71.81410
## 58  17.647436 69.01923 92.89744
## 59  17.647436 69.01923 83.89744
## 60  17.647436 69.01923 90.44744
## 61  17.647436 69.01923 86.39744
## 62  17.647436 69.01923 80.98077
## 63  17.647436 69.01923 80.14744
## 64  17.647436 69.01923 98.81410
## 65  17.647436 69.01923 97.84744
## 66  17.647436 69.01923 92.89744
## 67 -33.419231 69.01923 26.58077
## 68 -33.419231 69.01923 20.74744
## 69 -33.419231 69.01923 41.83077
## 70 -33.419231 69.01923 32.83077
## 71 -33.419231 69.01923 39.38077
## 72 -33.419231 69.01923 35.33077
## 73 -33.419231 69.01923 29.91410
## 74 -33.419231 69.01923 29.08077
## 75 -33.419231 69.01923 47.74744
## 76 -33.419231 69.01923 46.78077
## 77 -33.419231 69.01923 41.83077
## 78  15.980769 69.01923 75.98077
## 79  15.980769 69.01923 70.14744
## 80  15.980769 69.01923 91.23077
## 81  15.980769 69.01923 82.23077
## 82  15.980769 69.01923 88.78077
## 83  15.980769 69.01923 84.73077
## 84  15.980769 69.01923 79.31410
## 85  15.980769 69.01923 78.48077
## 86  15.980769 69.01923 97.14744
## 87  15.980769 69.01923 96.18077
## 88  15.980769 69.01923 91.23077
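
Each Baseline value in the output above is the sum of three terms: the raw average, the reviewer bias and the movie bias. The snippet below re-derives row 4 (CNN, Aladdin) from the printed values. The `clip_rating` helper is a hypothetical addition, not something this report applies: it only illustrates how predictions could be kept inside the 0-100 scale if a sum ever fell outside it.

```r
# Values copied from row 4 of the training output (CNN, Aladdin)
raw_avg       <- 69.01923
reviewer_bias <- -2.7692308
movie_bias    <- 1.814103

baseline <- raw_avg + reviewer_bias + movie_bias
round(baseline, 4)   # 68.0641

# Hypothetical guard: keep predictions inside the 0-100 rating scale
clip_rating <- function(x, lo = 0, hi = 100) pmin(pmax(x, lo), hi)
clip_rating(c(-5, 50, 104))   # 0 50 100
```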
test_data <- test_data %>% left_join(Reviewers_bias, by = "Reviewers") %>%
  left_join(Movie_bias, by = "Movie") %>%
  mutate(RawAvg = raw_avg) %>%
  mutate(Baseline = RawAvg + ReviewersBias + MovieBias)
test_data
##                  Reviewers              Movie Rating ReviewersBias
## 1             The Guardian            Aladdin     80    -9.0192308
## 2  San Francisco Chronicle            Aladdin     75   -14.8525641
## 3                USA Today            Aladdin     75     6.2307692
## 4                      CNN            Aladdin     NA    -2.7692308
## 5        Chicago Sun-Times            Aladdin     NA     3.7807692
## 6            New York Post            Aladdin     NA    -0.2692308
## 7            Rolling Stone            Aladdin     NA    -5.6858974
## 8                  Variety            Aladdin     70    -6.5192308
## 9                      IGN            Aladdin     NA    12.1474359
## 10            Boston Globe            Aladdin     63    11.1807692
## 11         Washington Post            Aladdin     NA     6.2307692
## 12            The Guardian        John.Wick.3     NA    -9.0192308
## 13 San Francisco Chronicle        John.Wick.3     NA   -14.8525641
## 14               USA Today        John.Wick.3     NA     6.2307692
## 15                     CNN        John.Wick.3     NA    -2.7692308
## 16       Chicago Sun-Times        John.Wick.3     NA     3.7807692
## 17           New York Post        John.Wick.3     NA    -0.2692308
## 18           Rolling Stone        John.Wick.3     NA    -5.6858974
## 19                 Variety        John.Wick.3     NA    -6.5192308
## 20                     IGN        John.Wick.3     NA    12.1474359
## 21            Boston Globe        John.Wick.3     NA    11.1807692
## 22         Washington Post        John.Wick.3     NA     6.2307692
## 23            The Guardian Avengers..End.Game     NA    -9.0192308
## 24 San Francisco Chronicle Avengers..End.Game     75   -14.8525641
## 25               USA Today Avengers..End.Game     NA     6.2307692
## 26                     CNN Avengers..End.Game     NA    -2.7692308
## 27       Chicago Sun-Times Avengers..End.Game     NA     3.7807692
## 28           New York Post Avengers..End.Game     75    -0.2692308
## 29           Rolling Stone Avengers..End.Game     NA    -5.6858974
## 30                 Variety Avengers..End.Game     NA    -6.5192308
## 31                     IGN Avengers..End.Game     NA    12.1474359
## 32            Boston Globe Avengers..End.Game     NA    11.1807692
## 33         Washington Post Avengers..End.Game    100     6.2307692
## 34            The Guardian  Pokemon.detective     60    -9.0192308
## 35 San Francisco Chronicle  Pokemon.detective     NA   -14.8525641
## 36               USA Today  Pokemon.detective     NA     6.2307692
## 37                     CNN  Pokemon.detective     NA    -2.7692308
## 38       Chicago Sun-Times  Pokemon.detective     NA     3.7807692
## 39           New York Post  Pokemon.detective     NA    -0.2692308
## 40           Rolling Stone  Pokemon.detective     50    -5.6858974
## 41                 Variety  Pokemon.detective     NA    -6.5192308
## 42                     IGN  Pokemon.detective     NA    12.1474359
## 43            Boston Globe  Pokemon.detective     NA    11.1807692
## 44         Washington Post  Pokemon.detective     NA     6.2307692
## 45            The Guardian         Brightburn     NA    -9.0192308
## 46 San Francisco Chronicle         Brightburn     NA   -14.8525641
## 47               USA Today         Brightburn     NA     6.2307692
## 48                     CNN         Brightburn     NA    -2.7692308
## 49       Chicago Sun-Times         Brightburn     NA     3.7807692
## 50           New York Post         Brightburn     NA    -0.2692308
## 51           Rolling Stone         Brightburn     NA    -5.6858974
## 52                 Variety         Brightburn     40    -6.5192308
## 53                     IGN         Brightburn     NA    12.1474359
## 54            Boston Globe         Brightburn     NA    11.1807692
## 55         Washington Post         Brightburn     NA     6.2307692
## 56            The Guardian          Booksmart     80    -9.0192308
## 57 San Francisco Chronicle          Booksmart     NA   -14.8525641
## 58               USA Today          Booksmart     NA     6.2307692
## 59                     CNN          Booksmart     NA    -2.7692308
## 60       Chicago Sun-Times          Booksmart     88     3.7807692
## 61           New York Post          Booksmart    100    -0.2692308
## 62           Rolling Stone          Booksmart     NA    -5.6858974
## 63                 Variety          Booksmart     90    -6.5192308
## 64                     IGN          Booksmart     NA    12.1474359
## 65            Boston Globe          Booksmart     NA    11.1807692
## 66         Washington Post          Booksmart     NA     6.2307692
## 67            The Guardian         The.Hustle     NA    -9.0192308
## 68 San Francisco Chronicle         The.Hustle     NA   -14.8525641
## 69               USA Today         The.Hustle     NA     6.2307692
## 70                     CNN         The.Hustle     NA    -2.7692308
## 71       Chicago Sun-Times         The.Hustle     NA     3.7807692
## 72           New York Post         The.Hustle     NA    -0.2692308
## 73           Rolling Stone         The.Hustle     NA    -5.6858974
## 74                 Variety         The.Hustle     60    -6.5192308
## 75                     IGN         The.Hustle     NA    12.1474359
## 76            Boston Globe         The.Hustle     NA    11.1807692
## 77         Washington Post         The.Hustle     NA     6.2307692
## 78            The Guardian       The.Intruder     NA    -9.0192308
## 79 San Francisco Chronicle       The.Intruder     NA   -14.8525641
## 80               USA Today       The.Intruder     NA     6.2307692
## 81                     CNN       The.Intruder     NA    -2.7692308
## 82       Chicago Sun-Times       The.Intruder     NA     3.7807692
## 83           New York Post       The.Intruder     63    -0.2692308
## 84           Rolling Stone       The.Intruder     NA    -5.6858974
## 85                 Variety       The.Intruder     NA    -6.5192308
## 86                     IGN       The.Intruder     NA    12.1474359
## 87            Boston Globe       The.Intruder     NA    11.1807692
## 88         Washington Post       The.Intruder     NA     6.2307692
##     MovieBias   RawAvg Baseline
## 1    1.814103 69.01923 61.81410
## 2    1.814103 69.01923 55.98077
## 3    1.814103 69.01923 77.06410
## 4    1.814103 69.01923 68.06410
## 5    1.814103 69.01923 74.61410
## 6    1.814103 69.01923 70.56410
## 7    1.814103 69.01923 65.14744
## 8    1.814103 69.01923 64.31410
## 9    1.814103 69.01923 82.98077
## 10   1.814103 69.01923 82.01410
## 11   1.814103 69.01923 77.06410
## 12   4.253497 69.01923 64.25350
## 13   4.253497 69.01923 58.42016
## 14   4.253497 69.01923 79.50350
## 15   4.253497 69.01923 70.50350
## 16   4.253497 69.01923 77.05350
## 17   4.253497 69.01923 73.00350
## 18   4.253497 69.01923 67.58683
## 19   4.253497 69.01923 66.75350
## 20   4.253497 69.01923 85.42016
## 21   4.253497 69.01923 84.45350
## 22   4.253497 69.01923 79.50350
## 23  18.230769 69.01923 78.23077
## 24  18.230769 69.01923 72.39744
## 25  18.230769 69.01923 93.48077
## 26  18.230769 69.01923 84.48077
## 27  18.230769 69.01923 91.03077
## 28  18.230769 69.01923 86.98077
## 29  18.230769 69.01923 81.56410
## 30  18.230769 69.01923 80.73077
## 31  18.230769 69.01923 99.39744
## 32  18.230769 69.01923 98.43077
## 33  18.230769 69.01923 93.48077
## 34  -9.894231 69.01923 50.10577
## 35  -9.894231 69.01923 44.27244
## 36  -9.894231 69.01923 65.35577
## 37  -9.894231 69.01923 56.35577
## 38  -9.894231 69.01923 62.90577
## 39  -9.894231 69.01923 58.85577
## 40  -9.894231 69.01923 53.43910
## 41  -9.894231 69.01923 52.60577
## 42  -9.894231 69.01923 71.27244
## 43  -9.894231 69.01923 70.30577
## 44  -9.894231 69.01923 65.35577
## 45 -22.219231 69.01923 37.78077
## 46 -22.219231 69.01923 31.94744
## 47 -22.219231 69.01923 53.03077
## 48 -22.219231 69.01923 44.03077
## 49 -22.219231 69.01923 50.58077
## 50 -22.219231 69.01923 46.53077
## 51 -22.219231 69.01923 41.11410
## 52 -22.219231 69.01923 40.28077
## 53 -22.219231 69.01923 58.94744
## 54 -22.219231 69.01923 57.98077
## 55 -22.219231 69.01923 53.03077
## 56  17.647436 69.01923 77.64744
## 57  17.647436 69.01923 71.81410
## 58  17.647436 69.01923 92.89744
## 59  17.647436 69.01923 83.89744
## 60  17.647436 69.01923 90.44744
## 61  17.647436 69.01923 86.39744
## 62  17.647436 69.01923 80.98077
## 63  17.647436 69.01923 80.14744
## 64  17.647436 69.01923 98.81410
## 65  17.647436 69.01923 97.84744
## 66  17.647436 69.01923 92.89744
## 67 -33.419231 69.01923 26.58077
## 68 -33.419231 69.01923 20.74744
## 69 -33.419231 69.01923 41.83077
## 70 -33.419231 69.01923 32.83077
## 71 -33.419231 69.01923 39.38077
## 72 -33.419231 69.01923 35.33077
## 73 -33.419231 69.01923 29.91410
## 74 -33.419231 69.01923 29.08077
## 75 -33.419231 69.01923 47.74744
## 76 -33.419231 69.01923 46.78077
## 77 -33.419231 69.01923 41.83077
## 78  15.980769 69.01923 75.98077
## 79  15.980769 69.01923 70.14744
## 80  15.980769 69.01923 91.23077
## 81  15.980769 69.01923 82.23077
## 82  15.980769 69.01923 88.78077
## 83  15.980769 69.01923 84.73077
## 84  15.980769 69.01923 79.31410
## 85  15.980769 69.01923 78.48077
## 86  15.980769 69.01923 97.14744
## 87  15.980769 69.01923 96.18077
## 88  15.980769 69.01923 91.23077
# Calculate RMSE for baseline predictors

rmse_base_train <- sqrt(sum((train_data$Rating[!is.na(train_data$Rating)] - 
                               train_data$Baseline[!is.na(train_data$Rating)])^2) /
                          length(which(!is.na(train_data$Rating))))

rmse_base_test <- sqrt(sum((test_data$Rating[!is.na(test_data$Rating)] - 
                              test_data$Baseline[!is.na(test_data$Rating)])^2) /
                         length(which(!is.na(test_data$Rating))))

RMSE and Summary

The largest biases appear to be driven by the NA entries, that is, by reviewers who didn’t review a given movie. Continuing the RMSE calculation with the bias information, the RMSE values for the new baseline ratings are shown below.

rmse_base_train
## [1] 10.947
rmse_base_test
## [1] 13.53655

This table shows RMSE values for training and testing sets and for raw average and baseline predictors.

                                   RMSE
Training: Raw Average          20.63277
Training: Baseline Predictor   10.94700
Testing: Raw Average           16.34963
Testing: Baseline Predictor    13.53655
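
To put the table in relative terms, the reduction in RMSE can be computed directly from the four reported values. This is a small sketch; the data frame name `rmse_table` and its column names are assumptions for illustration.

```r
# RMSE values copied from the table above
rmse_table <- data.frame(
  Set        = c("Training", "Testing"),
  RawAverage = c(20.63277, 16.34963),
  Baseline   = c(10.94700, 13.53655)
)

# Percentage reduction in RMSE when moving from raw average to baseline
rmse_table$ImprovementPct <-
  round(100 * (1 - rmse_table$Baseline / rmse_table$RawAverage), 1)
rmse_table$ImprovementPct   # 46.9 17.2
```

The reduction is much larger on the training set than on the testing set, which is plausible here because the biases are estimated from the training ratings only.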

We can see that the baseline predictors improved RMSE on both the training and test datasets. Even though the dataset is small, and you could say incomplete, it gave us enough information to visualize the movie and reviewer biases and apply them in the model. Since the training and test datasets are randomly generated, the figures reported here are a snapshot of a single split (the one produced with set.seed(50)).