Background

The purpose of this assignment is to predict ratings in two ways: 1) using only the raw average across all users and items, and 2) accounting for “bias” by adjusting for how far each user and each item tends to sit above or below that average. The challenge lies in splitting one dataset into training and testing subsets and in working around missing entries.

The data for building this recommender system can be found at http://nifty.stanford.edu/2011/craig-book-recommendations/ and was originally gathered with the intention of recommending books to high school seniors based on 55 novels and ratings from 86 students in Canada. The ratings for each book are given as: (-5: Hated, -3: Didn’t Like, 0: Haven’t Read, 1: Neutral, 3: Liked, 5: Loved).

Data input

Loading libraries

library(splitstackshape)
library(dplyr)
library(caTools)
library(Amelia)
library(hydroGOF)
library(knitr)

Importing/Formatting datasets

The datasets for the books and the ratings come from two different files, so after importing them, they have to be combined into one dataframe. The books dataset was originally one column formatted as (Author, Title) and therefore had to be split into two columns (Author and Title), as seen below. This was originally done with the intention of enhancing the recommendation system in case an author was included more than once. However, each author turned out to be a unique entry, so it seemed more pertinent to work with the titles alone.

# grab the urls (there are 2: ratings and book titles)
url1 = "http://nifty.stanford.edu/2011/craig-book-recommendations/ratings.txt"
url2 = "http://nifty.stanford.edu/2011/craig-book-recommendations/books.txt"

# read in the titles and move the authors into another column
books = data.frame(read.delim(url2, header=F, sep="\t", stringsAsFactors = F))
books = cSplit(books, "V1", ",")
kable(head(books))
V1_1 V1_2
Douglas Adams The Hitchhiker’s Guide To The Galaxy
Richard Adams Watership Down
Mitch Albom The Five People You Meet in Heaven
Laurie Halse Anderson Speak
Maya Angelou I Know Why the Caged Bird Sings
Jay Asher Thirteen Reasons Why

Directly importing the ratings dataset from the website resulted in some awkward formatting: the odd rows held the name of the reader followed by NA values, and the even rows held the actual ratings for the reader named in the previous row.

# read in the ratings
ratings = data.frame(read.delim(url1, header=F, sep=" ", stringsAsFactors = F))
kable(head(ratings))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56
Ben NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 0 0 0 0 0 0 1 0 1 -3 5 0 0 0 5 5 0 0 0 0 5 0 0 0 0 0 0 0 0 1 3 0 1 0 -5 0 0 5 5 0 5 5 5 0 5 5 0 0 0 5 5 5 5 -5 NA
Moose NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 5 0 0 0 0 3 0 0 1 0 5 3 0 5 0 3 3 5 0 0 0 0 0 5 0 0 0 0 0 3 5 0 0 0 0 0 5 -3 0 0 0 5 0 0 0 0 0 0 5 5 0 3 0 0 NA
Reuven NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 -5 0 0 0 0 -3 -5 0 1 -5 5 0 1 0 1 -3 1 -5 0 0 0 0 0 0 3 0 0 0 0 -5 1 0 1 0 -5 0 3 -3 3 0 1 5 1 0 0 0 0 0 1 3 1 5 1 3 NA

To correct this, the even rows were extracted and the names in the odd rows were converted into a new column and attached to the ratings.

# Split the ratings into odd and even rows
odd = ratings %>% dplyr::filter(row_number() %% 2 == 1)
odd = within(odd,  consumer <- paste(V1, V2, sep=" "))
even = ratings %>% dplyr::filter(row_number() %% 2 == 0)
colnames(even) = books$V1_2

# final dataset
rate = data.frame(cbind(odd$consumer, even))[,1:56]
names(rate)[names(rate) == "odd.consumer"] = "reader"
kable(head(rate))
reader The.Hitchhiker.s.Guide.To.The.Galaxy Watership.Down The.Five.People.You.Meet.in.Heaven Speak I.Know.Why.the.Caged.Bird.Sings Thirteen.Reasons.Why Foundation.Series The.Sisterhood.of.the.Travelling.Pants A.Great.and.Terrible.Beauty The.Da.Vinci.Code The.Princess.Diaries Ender.s.Game The.Hunt.for.Red.October The.Hunger.Games The.Great.Gatsby Ranger.s.Apprentice.Series Inkheart Neuromancer Lord.of.the.Flies The.Princess.Bride Dinotopia..A.Land.Apart.from.Time Far.North Practical.Magic Brave.New.World The.Summer.Tree Flowers.For.Algernon Owl.in.Love Naruto Bleach..graphic.novel. Kiss.the.Dust To.Kill.a.Mockingbird The.Lion.the.Witch.and.the.Wardrobe The.Bourne.Series Life.of.Pi Breathless Twilight.Series Sabriel Nineteen.Eighty.Four..1984. Eragon Hatchet My.Sister.s.Keeper The.Golden.Compass Harry.Potter.Series Holes Shonen.Jump.Series The.Shadow.Club Bone.Series Maus..A.Survivor.s.Tale The.Joy.Luck.Club The.Lord.of.the.Rings The.Hobbit Shattered The.War.Of.The.Worlds Dealing.with.Dragons The.Chrysalids
Ben 5 0 0 0 0 0 0 1 0 1 -3 5 0 0 0 5 5 0 0 0 0 5 0 0 0 0 0 0 0 0 1 3 0 1 0 -5 0 0 5 5 0 5 5 5 0 5 5 0 0 0 5 5 5 5 -5
Moose 5 5 0 0 0 0 3 0 0 1 0 5 3 0 5 0 3 3 5 0 0 0 0 0 5 0 0 0 0 0 3 5 0 0 0 0 0 5 -3 0 0 0 5 0 0 0 0 0 0 5 5 0 3 0 0
Reuven 5 -5 0 0 0 0 -3 -5 0 1 -5 5 0 1 0 1 -3 1 -5 0 0 0 0 0 0 3 0 0 0 0 -5 1 0 1 0 -5 0 3 -3 3 0 1 5 1 0 0 0 0 0 1 3 1 5 1 3
Cust1 3 3 5 0 0 0 3 0 0 3 0 3 0 0 0 0 0 3 0 5 0 0 0 1 3 1 0 0 0 0 0 3 0 3 0 0 0 1 3 0 0 3 3 0 0 0 5 0 0 3 1 0 0 0 0
Cust2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 3 1 0 0 0 3 0 0 0 3 0 3 3 5 0 3 0 3
Francois 3 3 5 0 0 0 3 0 0 3 0 3 0 0 0 0 0 3 0 5 0 0 0 1 3 1 0 0 0 0 0 3 0 3 0 0 0 1 3 0 0 3 3 0 0 0 5 0 0 3 1 0 0 0 0
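
The odd/even split above can also be sketched in base R. This is only an illustration on a made-up four-row data frame: `raw`, `names_only`, `scores` and `tidy` are hypothetical names, not objects from the analysis, and the sketch assumes names always sit on odd rows with the matching ratings on the following even rows.

```r
# Toy stand-in for the imported file: names on odd rows, ratings on even rows
raw <- data.frame(V1 = c("Ann", "5", "Bob", "-3"),
                  V2 = c(NA, "0", NA, "1"),
                  stringsAsFactors = FALSE)

is_odd     <- seq_len(nrow(raw)) %% 2 == 1
names_only <- raw$V1[is_odd]                  # reader names
scores     <- raw[!is_odd, , drop = FALSE]    # rating rows

# Convert every rating column to numeric and attach the names
scores[] <- lapply(scores, as.numeric)
tidy <- data.frame(reader = names_only, scores, row.names = NULL)
tidy
```

Converting the rating columns at this stage would also avoid carrying character columns into later steps.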

For ease of understanding, the ratings were recoded from the original scale of [-5, 5] to a scale of [1, 5] (-5 to 1, -3 to 2, 1 to 3, 3 to 4, with 5 unchanged). Where a zero previously meant “haven’t read”, it was replaced with NA to represent “no rating”.

rate[rate == 0] = NA
rate[rate == 3] = 4
rate[rate == 1] = 3
rate[rate == -5] = 1
rate[rate == -3] = 2
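
The sequence of assignments above depends on its order: 3 has to become 4 before 1 becomes 3, otherwise values would be recoded twice. As a hedged alternative, a named lookup vector does the whole mapping in one step. The names `recode_map`, `old` and `new` below are illustrative only.

```r
# Original scale -> 1-5 scale; 0 ("haven't read") becomes NA
recode_map <- c("-5" = 1, "-3" = 2, "0" = NA, "1" = 3, "3" = 4, "5" = 5)

old <- c(5, 0, -3, 1, 3, -5)

# Look each old value up by its character label
new <- unname(recode_map[as.character(old)])
new
```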

Just looking at the structure of the dataset, two books’ ratings (Hitchhiker’s Guide and Watership Down) were input as character variables. They were subsequently changed to numerical values.

# str() prints the structure itself and returns NULL, so it is called directly
str(rate)
## 'data.frame':    86 obs. of  56 variables:
##  $ reader                                : Factor w/ 85 levels "Albus Dumbledore",..: 6 59 67 15 17 32 44 39 18 19 ...
##  $ The.Hitchhiker.s.Guide.To.The.Galaxy  : chr  "5" "5" "5" "4" ...
##  $ Watership.Down                        : chr  NA "5" "1" "4" ...
##  $ The.Five.People.You.Meet.in.Heaven    : num  NA NA NA 5 NA 5 NA NA NA NA ...
##  $ Speak                                 : num  NA NA NA NA NA NA NA NA 2 NA ...
##  $ I.Know.Why.the.Caged.Bird.Sings       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Thirteen.Reasons.Why                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Foundation.Series                     : num  NA 4 2 4 NA 4 4 NA NA NA ...
##  $ The.Sisterhood.of.the.Travelling.Pants: num  3 NA 1 NA NA NA NA NA NA 5 ...
##  $ A.Great.and.Terrible.Beauty           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ The.Da.Vinci.Code                     : num  3 3 3 4 NA 4 NA NA NA NA ...
##  $ The.Princess.Diaries                  : num  2 NA 1 NA NA NA NA NA NA NA ...
##  $ Ender.s.Game                          : num  5 5 5 4 NA 4 NA NA NA NA ...
##  $ The.Hunt.for.Red.October              : num  NA 4 NA NA NA NA NA NA NA NA ...
##  $ The.Hunger.Games                      : num  NA NA 3 NA NA NA NA NA NA NA ...
##  $ The.Great.Gatsby                      : num  NA 5 NA NA NA NA 3 NA NA NA ...
##  $ Ranger.s.Apprentice.Series            : num  5 NA 3 NA NA NA NA NA NA NA ...
##  $ Inkheart                              : num  5 4 2 NA NA NA NA 3 4 NA ...
##  $ Neuromancer                           : num  NA 4 3 4 NA 4 NA NA NA NA ...
##  $ Lord.of.the.Flies                     : num  NA 5 1 NA 3 NA NA NA NA NA ...
##  $ The.Princess.Bride                    : num  NA NA NA 5 NA 5 NA NA NA NA ...
##  $ Dinotopia..A.Land.Apart.from.Time     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Far.North                             : num  5 NA NA NA NA NA NA NA NA NA ...
##  $ Practical.Magic                       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Brave.New.World                       : num  NA NA NA 3 NA 3 4 NA NA NA ...
##  $ The.Summer.Tree                       : num  NA 5 NA 4 NA 4 NA NA NA NA ...
##  $ Flowers.For.Algernon                  : num  NA NA 4 3 NA 3 4 NA NA NA ...
##  $ Owl.in.Love                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Naruto                                : num  NA NA NA NA NA NA NA 5 5 NA ...
##  $ Bleach..graphic.novel.                : num  NA NA NA NA NA NA NA NA 5 NA ...
##  $ Kiss.the.Dust                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ To.Kill.a.Mockingbird                 : num  3 4 1 NA NA NA NA NA NA NA ...
##  $ The.Lion.the.Witch.and.the.Wardrobe   : num  4 5 3 4 5 4 5 3 NA NA ...
##  $ The.Bourne.Series                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Life.of.Pi                            : num  3 NA 3 4 NA 4 5 NA NA NA ...
##  $ Breathless                            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Twilight.Series                       : num  1 NA 1 NA NA NA NA 3 NA 5 ...
##  $ Sabriel                               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Nineteen.Eighty.Four..1984.           : num  NA 5 4 3 4 3 NA NA NA NA ...
##  $ Eragon                                : num  5 2 2 4 3 4 NA 5 3 1 ...
##  $ Hatchet                               : num  5 NA 4 NA NA NA NA NA 3 NA ...
##  $ My.Sister.s.Keeper                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ The.Golden.Compass                    : num  5 NA 3 4 NA 4 NA NA 2 1 ...
##  $ Harry.Potter.Series                   : num  5 5 5 4 4 4 NA 5 5 5 ...
##  $ Holes                                 : num  5 NA 3 NA NA NA NA 4 4 NA ...
##  $ Shonen.Jump.Series                    : num  NA NA NA NA NA NA NA 5 NA NA ...
##  $ The.Shadow.Club                       : num  5 NA NA NA NA NA NA NA NA NA ...
##  $ Bone.Series                           : num  5 NA NA 5 4 5 NA NA 3 NA ...
##  $ Maus..A.Survivor.s.Tale               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ The.Joy.Luck.Club                     : num  NA NA NA NA 4 NA 4 NA NA NA ...
##  $ The.Lord.of.the.Rings                 : num  NA 5 3 4 4 4 4 5 NA NA ...
##  $ The.Hobbit                            : num  5 5 4 3 5 3 4 4 3 2 ...
##  $ Shattered                             : num  5 NA 3 NA NA NA NA NA NA NA ...
##  $ The.War.Of.The.Worlds                 : num  5 4 5 NA 4 NA NA 4 NA NA ...
##  $ Dealing.with.Dragons                  : num  5 NA 3 NA NA NA NA NA NA NA ...
##  $ The.Chrysalids                        : num  1 NA 4 NA 4 NA NA 1 NA 2 ...
rate$The.Hitchhiker.s.Guide.To.The.Galaxy = as.numeric(rate$The.Hitchhiker.s.Guide.To.The.Galaxy)
rate$Watership.Down = as.numeric(rate$Watership.Down)
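
Rather than converting columns one at a time, every rating column can be converted in a single step. The sketch below uses a small made-up data frame (`df` and `rating_cols` are hypothetical names) and assumes the only non-rating column is `reader`.

```r
df <- data.frame(reader = c("Ann", "Bob"),
                 book_a = c("5", NA),     # ratings read in as text
                 book_b = c(4, 2),
                 stringsAsFactors = FALSE)

# Every column except the reader name holds ratings
rating_cols <- setdiff(names(df), "reader")
df[rating_cols] <- lapply(df[rating_cols], as.numeric)

sapply(df[rating_cols], is.numeric)
```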

Downsizing dataset

The data has 86 readers and 55 novels. To narrow down the dataset, the first thing to look at was the number of missing values. The plot below shows that more than half of the books have too many NA entries to make accurate recommendations. In fact, no reader has rated every book, making it necessary to decide which readers and books to retain based on their number of missing values.

# Look at missing values
missmap(rate, main = "Missing values vs Observed")
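
Alongside the missingness map, the same information can be summarised numerically. This is a minimal sketch on a toy matrix; `m`, `na_per_book`, `na_per_reader` and `share_missing` are illustrative names.

```r
m <- matrix(c(5, NA, 3,
              NA, NA, 4,
              2,  1, NA), nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("b1", "b2", "b3")))

na_per_book   <- colSums(is.na(m))   # missing ratings per book
na_per_reader <- rowSums(is.na(m))   # missing ratings per reader
share_missing <- mean(is.na(m))      # overall fraction of missing cells
```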

The next step was to sort the readers by their number of NA values, from fewest to most, and retain the 10 most active readers. From there, the 10 books with the fewest NA values among those readers were also retained, creating a 10 by 10 matrix of the most involved readers and the most reviewed novels.

# Count na values for readers
rate$na_count = apply(rate, 1, function(x){ sum(is.na(x))})
# Ordering na values and include only the top ten
rates = head(rate[order(rate$na_count),],10)
# Create new row for na count of books
rates[11,] = colSums(is.na(rates))
## Warning in `[<-.factor`(`*tmp*`, iseq, value = 0): invalid factor level, NA
## generated
# Sort based on book na count
rates_sort = rates[,order(rates[11,])]
# Select only ten books
final_rates = rates_sort[-11, -c(2,10,13:56)]
kable(final_rates)
The.Hitchhiker.s.Guide.To.The.Galaxy The.Da.Vinci.Code Lord.of.the.Flies To.Kill.a.Mockingbird The.Golden.Compass Harry.Potter.Series The.Hobbit The.War.Of.The.Worlds The.Sisterhood.of.the.Travelling.Pants The.Princess.Diaries reader
78 1 1 5 4 NA NA 5 3 3 1 Tony
3 5 3 1 1 3 5 4 5 1 1 Reuven
12 3 5 4 NA 5 5 NA 4 5 5 Cust6
16 4 2 4 4 4 4 4 4 NA NA andrew
76 4 5 3 5 4 5 5 5 2 2 Tiffany
1 5 3 NA 3 5 5 5 5 3 2 Ben
14 4 5 5 5 4 5 5 5 5 4 Cust8
22 4 4 4 5 5 4 5 NA NA NA joe
72 5 4 4 5 5 5 5 3 5 3 Claire
84 4 NA 3 2 5 3 5 2 2 3 James
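
The same two-stage selection can be sketched without appending a helper row of counts to the data, which is what triggers the factor warning above. The toy matrix, the seed and the names `k_readers`, `k_books`, `sub` and `dense` are assumptions made for illustration only.

```r
set.seed(1)  # arbitrary seed, only to make the toy data repeatable
m <- matrix(sample(c(1:5, NA), 30, replace = TRUE), nrow = 6,
            dimnames = list(paste0("reader", 1:6), paste0("book", 1:5)))

k_readers <- 3
k_books   <- 2

# Keep the readers with the fewest missing ratings ...
top_readers <- order(rowSums(is.na(m)))[seq_len(k_readers)]
sub <- m[top_readers, , drop = FALSE]

# ... then, within that subset, the books with the fewest missing ratings
top_books <- order(colSums(is.na(sub)))[seq_len(k_books)]
dense <- sub[, top_books, drop = FALSE]
dim(dense)
```

Selecting by name or position from the counts also avoids hard-coding column numbers, which would need updating if the data changed.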

Splitting into training and testing sets

To create and compare the predicted ratings of the books, a training and a test set were created. Ten values, one per reader and one per book, were selected and removed from the dataset to form the test set. The training set is the same dataset with those extracted values replaced by NAs.

# Selected testing values
samples = rbind(c(1,10), c(2,9), c(3,8), c(4,7), c(5,6),
                c(6,5), c(7,4), c(8,3), c(9,2), c(10,1))

# Train set
train = final_rates
train[samples] = NA
kable(train)
The.Hitchhiker.s.Guide.To.The.Galaxy The.Da.Vinci.Code Lord.of.the.Flies To.Kill.a.Mockingbird The.Golden.Compass Harry.Potter.Series The.Hobbit The.War.Of.The.Worlds The.Sisterhood.of.the.Travelling.Pants The.Princess.Diaries reader
78 1 1 5 4 NA NA 5 3 3 NA Tony
3 5 3 1 1 3 5 4 5 NA 1 Reuven
12 3 5 4 NA 5 5 NA NA 5 5 Cust6
16 4 2 4 4 4 4 NA 4 NA NA andrew
76 4 5 3 5 4 NA 5 5 2 2 Tiffany
1 5 3 NA 3 NA 5 5 5 3 2 Ben
14 4 5 5 NA 4 5 5 5 5 4 Cust8
22 4 4 NA 5 5 4 5 NA NA NA joe
72 5 NA 4 5 5 5 5 3 5 3 Claire
84 NA NA 3 2 5 3 5 2 2 3 James
# Test set
test = as.numeric(final_rates[samples])
test
##  [1] 1 1 4 4 5 5 5 4 4 4
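
The hold-out step relies on indexing with a two-column matrix, where each row is one (row, column) position. A minimal sketch on toy data (`ratings`, `held_out`, `test_vals` and `train_vals` are hypothetical names):

```r
ratings <- matrix(1:9, nrow = 3,
                  dimnames = list(c("r1", "r2", "r3"), c("b1", "b2", "b3")))

# Each row of `held_out` is one (row, column) position to hide
held_out <- rbind(c(1, 3), c(2, 2), c(3, 1))

test_vals  <- ratings[held_out]    # values at those positions
train_vals <- ratings
train_vals[held_out] <- NA         # same positions blanked in the training copy
```

Because the positions are (row, column) pairs, any object indexed with them later needs the same row and column layout as the object they were drawn from.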

Tasks:

1. Raw Average Approach

Raw Average

The raw average of the training set is the mean of all observed ratings across every user-item (reader-book) combination. Missing values are excluded from both the sum and the count rather than being converted to 0 or another placeholder, so when every item has NA entries, as in this dataset, the calculation must work around them.

# Calculate Raw Average Rating for user-item combination
raw_train = round(sum(colSums(train[,-11], na.rm = T))/(sum(colSums(!is.na(train[,-11])))), 3)
raw_train
## [1] 3.899
raw_test = round(mean(test), 3)
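
The sum-over-count expression above is equivalent to taking the mean of the rating matrix with missing values dropped. A small sketch on toy data (the names are illustrative; a data frame would first need `as.matrix()`):

```r
m <- matrix(c(5, NA, 3,
              4,  2, NA), nrow = 2, byrow = TRUE)

# Long form: total of observed ratings divided by their count
avg_long  <- sum(colSums(m, na.rm = TRUE)) / sum(colSums(!is.na(m)))

# Short form: mean over all cells, skipping NA
avg_short <- mean(m, na.rm = TRUE)

all.equal(avg_long, avg_short)
```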

RMSE

The Root Mean Square Error (RMSE) for the raw average rating is the square root of the average of the squared differences between the observed ratings and the raw average: \(RMSE = \sqrt{\frac{\sum (train - rawAvg)^2}{n}}\), where n is the number of observed (non-missing) ratings. The RMSE for the training and test sets is calculated below; for the test set, the held-out ratings are compared against the training set’s raw average. Lower values of RMSE indicate better fit.

# Calculate RMSE of raw average
matrix_RMSE = function(matrix){
  matrix = select_if(matrix, is.numeric)
  # matrix mean
  matrix_mean = sum(colSums(matrix, na.rm = T))/(sum(colSums(!is.na(matrix))))
  # square difference of error
  matrix_rme = sum(colSums((matrix-matrix_mean)^2, na.rm = T))/(sum(colSums(!is.na(matrix))))
  # RMSE
  rmse = round(sqrt(matrix_rme),3)
  return(rmse)
}

train_RMSE = matrix_RMSE(train)
print(paste("Train set RMSE: ", train_RMSE))
## [1] "Train set RMSE:  1.239"
test_RMSE = round(sqrt(mean((test - raw_train)^2, na.rm =TRUE)), 3)
print(paste("Test set RMSE: ", test_RMSE))
## [1] "Test set RMSE:  1.432"
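
The RMSE formula can also be wrapped in a small general-purpose helper. The function name `rmse` and its arguments are assumptions for this sketch; the ratings are the held-out test values printed earlier, and 3.899 is the raw training average.

```r
# RMSE between observed values and predictions, ignoring missing pairs
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2, na.rm = TRUE))
}

obs <- c(1, 1, 4, 4, 5, 5, 5, 4, 4, 4)   # the held-out test ratings
round(rmse(obs, 3.899), 3)
```

Evaluated on these values, the helper should agree with the test set RMSE of 1.432 reported above.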

2. Baseline Approach

Bias

Though the majority of the ratings in this dataset are 4 and 5 (as seen visually and through the raw average score of approximately 3.9), there are bound to be certain readers that are harsh judges and others that are generous. Some books may also have been perceived as more entertaining than others. To account for this, the bias of each user (reader) and item (book) is calculated below. The biases are calculated from the training data only; the test set is excluded.

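# Average deviation from the raw training average, per book or per reader.
# item = TRUE: column (book) means; item = FALSE: row (reader) means.
bias = function(matrix, item){
  # use the data passed in, keeping only the numeric rating columns
  matrix = select_if(matrix, is.numeric)
  if (item){
    bias = round((colSums(matrix, na.rm = T)/colSums(!is.na(matrix)))-raw_train, 2)
  } else{
    bias = round((rowSums(matrix, na.rm = T)/rowSums(!is.na(matrix)))-raw_train, 2)
  }
  return(data.frame(bias))
}

# Naming note: bias(train, T) yields the per-book biases and bias(train, F)
# the per-reader biases, so user_bias holds book values and item_bias reader values.
user_bias = bias(train, T)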
user_bias
##                                         bias
## The.Hitchhiker.s.Guide.To.The.Galaxy   -0.01
## The.Da.Vinci.Code                      -0.40
## Lord.of.the.Flies                      -0.27
## To.Kill.a.Mockingbird                  -0.27
## The.Golden.Compass                      0.48
## Harry.Potter.Series                     0.60
## The.Hobbit                              0.98
## The.War.Of.The.Worlds                   0.10
## The.Sisterhood.of.the.Travelling.Pants -0.33
## The.Princess.Diaries                   -1.04
item_bias = bias(train, F)
cbind(train$reader,item_bias)
##    train$reader  bias
## 78        Tony  -0.76
## 3       Reuven  -0.79
## 12       Cust6   0.67
## 16      andrew  -0.18
## 76     Tiffany  -0.01
## 1          Ben  -0.02
## 14       Cust8   0.77
## 22         joe   0.60
## 72      Claire   0.55
## 84       James  -0.77
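
As a cross-check on the two bias tables, the same quantities can be sketched with `rowMeans()` and `colMeans()`, which make the reader/book direction explicit. The toy matrix and the names `mu`, `reader_bias` and `book_bias` are illustrative only.

```r
m <- matrix(c(5, 4, NA,
              3, NA, 1,
              4, 2,  2), nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("b1", "b2", "b3")))

mu          <- mean(m, na.rm = TRUE)            # raw average
reader_bias <- rowMeans(m, na.rm = TRUE) - mu   # one value per reader (row)
book_bias   <- colMeans(m, na.rm = TRUE) - mu   # one value per book (column)
```

A positive bias marks a generous reader or a well-liked book, and a negative bias the opposite.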

Baseline Predictors

The baseline predictors estimate the rating for every user-item combination by adding the user bias and the item bias to the raw average: \(\text{Baseline Predictor} = \text{Raw Average} + \text{User Bias} + \text{Item Bias}\). Because some of the predictors fell outside the [1, 5] rating range, any value above 5 or below 1 was clipped to the nearest limit.

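# Empty matrix: one row per book (rows of user_bias), one column per reader (rows of item_bias)
baseline = matrix(NA, nrow = dim(user_bias)[1], ncol = dim(item_bias)[1])

# Reader biases as a single row, computed once outside the loop
item = t(as.matrix(item_bias))

# Using the raw average and the biases, calculate baseline predictors row by row
for (i in 1:dim(user_bias)[1]){
  baseline[i, ] = round(user_bias[i,] + item + raw_train, 2)
}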

rownames(baseline) = rownames(user_bias)
colnames(baseline) = train$reader

# Upper and lower prediction limits adjustment
baseline[baseline > 5] = 5
baseline[baseline < 1] = 1
kable(baseline)
Tony Reuven Cust6 andrew Tiffany Ben Cust8 joe Claire James
The.Hitchhiker.s.Guide.To.The.Galaxy 3.13 3.10 4.56 3.71 3.88 3.87 4.66 4.49 4.44 3.12
The.Da.Vinci.Code 2.74 2.71 4.17 3.32 3.49 3.48 4.27 4.10 4.05 2.73
Lord.of.the.Flies 2.87 2.84 4.30 3.45 3.62 3.61 4.40 4.23 4.18 2.86
To.Kill.a.Mockingbird 2.87 2.84 4.30 3.45 3.62 3.61 4.40 4.23 4.18 2.86
The.Golden.Compass 3.62 3.59 5.00 4.20 4.37 4.36 5.00 4.98 4.93 3.61
Harry.Potter.Series 3.74 3.71 5.00 4.32 4.49 4.48 5.00 5.00 5.00 3.73
The.Hobbit 4.12 4.09 5.00 4.70 4.87 4.86 5.00 5.00 5.00 4.11
The.War.Of.The.Worlds 3.24 3.21 4.67 3.82 3.99 3.98 4.77 4.60 4.55 3.23
The.Sisterhood.of.the.Travelling.Pants 2.81 2.78 4.24 3.39 3.56 3.55 4.34 4.17 4.12 2.80
The.Princess.Diaries 2.10 2.07 3.53 2.68 2.85 2.84 3.63 3.46 3.41 2.09
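
The loop and the two clipping assignments can be condensed with `outer()`, `pmin()` and `pmax()`. This is a hedged sketch with made-up bias values, not the figures from the analysis; it keeps readers in rows and books in columns, and all names are illustrative.

```r
mu          <- 3.9                                   # illustrative raw average
reader_bias <- c(r1 = -0.8, r2 = 0.7)
book_bias   <- c(b1 = 1.0, b2 = -1.0, b3 = 0.5)

# outer() adds every reader bias to every book bias: readers x books
pred <- mu + outer(reader_bias, book_bias, "+")

# Clamp to the valid rating range
pred <- pmin(pmax(pred, 1), 5)
pred
```

Keeping predictions in the same reader-by-book layout as the ratings means the two can be compared cell by cell without further rearranging.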

One thing to note was that the baseline predictor matrix (10x10) contains only 9 unique rows (a 9x10 matrix once duplicates are removed). On closer inspection, the baseline predictors for “To Kill a Mockingbird” and “Lord of the Flies” were identical. This seems odd since, in the training dataset, these two novels did not receive similar ratings from the same users. However, both books had the same bias value (-0.27), and because a baseline predictor depends on a book only through its bias, identical biases produce identical rows.

Baseline RMSE

To determine the performance of the baseline predictors (especially in comparison to the raw average) the RMSE for baseline predictors was calculated for both the training and test set. This was done by taking the square root of the average of the squared differences between the training set’s (and test set’s) values and the user-item baseline predictors.

# Calculate rmse for baseline for train and test
train_base_rmse = round(sqrt(sum((train[,-11] - baseline)^2, na.rm=TRUE) / length(train[,-11][!is.na(train[,-11])])), 3)
print(paste("Training set Baseline RMSE: ", train_base_rmse))
## [1] "Training set Baseline RMSE:  1.223"
test_base = baseline[samples]
test_base_rmse = round(sqrt(sum((test - test_base)^2) / length(test)), 3)
print(paste("Test set Baseline RMSE: ", test_base_rmse))
## [1] "Test set Baseline RMSE:  1.425"
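
Subtraction and matrix indexing in R are positional, and the `baseline` table above is laid out with books as rows and readers as columns, while `train` and the `samples` pairs are reader by book. It may therefore be worth confirming that both objects share one orientation before comparing them. The sketch below shows one way to align by name on toy data; `obs`, `pred_bk`, `pred`, `held` and `rmse_aligned` are hypothetical names and the values are made up.

```r
# Toy ratings: readers in rows, books in columns
obs <- matrix(c(5, NA,
                2,  4), nrow = 2, byrow = TRUE,
              dimnames = list(c("r1", "r2"), c("b1", "b2")))

# Toy predictions stored the other way round: books in rows, readers in columns
pred_bk <- matrix(c(4.5, 2.5,
                    3.0, 3.5), nrow = 2, byrow = TRUE,
                  dimnames = list(c("b1", "b2"), c("r1", "r2")))

# Transpose and reorder by name so cell [reader, book] matches in both
pred <- t(pred_bk)[rownames(obs), colnames(obs)]

rmse_aligned <- sqrt(mean((obs - pred)^2, na.rm = TRUE))

# Held-out cells given as (reader, book) pairs now index the aligned matrix directly
held <- rbind(c(1, 2), c(2, 1))
pred[held]
```

With square matrices a mismatch in orientation raises no error, so a name-based alignment like this is a cheap safeguard.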

Summary

A comparison table was created to better see the difference between the results of the Raw Average approach and the baseline approach. This was done for both the training and the test set.

# Summarize results
# percent improvement
train_imp = round((1-(train_base_rmse/matrix_RMSE(train)))*100, 2)
test_imp = round((1-(test_base_rmse/test_RMSE))*100, 2)

Raw_Average = c(raw_train, raw_test) 
RMSE = c(train_RMSE, test_RMSE) 
Baseline_RMSE = c(train_base_rmse, test_base_rmse)
Improvement_Percent = c(train_imp, test_imp) 

results = data.frame(Raw_Average, RMSE, Baseline_RMSE, Improvement_Percent)
row.names(results) = c("Training Set", "Test Set")
kable(results)
              Raw_Average   RMSE    Baseline_RMSE   Improvement_Percent
Training Set  3.899         1.239   1.223           1.29
Test Set      3.700         1.432   1.425           0.49
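
The improvement column is the relative reduction in RMSE, (1 - baseline RMSE / raw average RMSE) x 100. A small sketch that recomputes it from the figures reported in the table (`pct_improvement` is a hypothetical helper name):

```r
# Percent reduction in RMSE when moving from the old to the new predictor
pct_improvement <- function(rmse_old, rmse_new) {
  round((1 - rmse_new / rmse_old) * 100, 2)
}

pct_improvement(1.239, 1.223)   # training set
pct_improvement(1.432, 1.425)   # test set
```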

Training vs Testing Set

Comparing the raw averages (3.899 for the training set and 3.700 for the test set), the test set can be deemed reasonably representative of the training set. We would therefore expect the RMSE comparisons between the raw-average and baseline approaches to lead to similar conclusions in both sets. This is the case for the raw average RMSE: the values for the training and test sets are similar (about 1.2 and 1.4, respectively), with the test set showing a slightly worse fit (lower RMSE values indicate a better fit). The same holds for the baseline results, with the training set having an RMSE of about 1.2 and the test set about 1.4. Overall, the training and test sets give similar results.

Raw Average vs Baseline Performance

In both the training and the test set, the RMSE for the raw average was slightly higher than the baseline RMSE. The values were very close, however: the improvement was 1.29% for the training set, roughly 2.6 times the improvement for the test set (0.49%). This suggests that baseline predictors are a better method for predicting the rating a particular user will give a book, but not by much. The small gain may be attributable to the narrow range of rating values (1 to 5), to the small sample (10 books and 10 readers), or to the fact that the selected sample consisted primarily of ratings of 3 and above, so the recommender system did not have a wide enough range of negative ratings to distinguish clearly between the two approaches.