The purpose of this assignment is to predict ratings by 1) looking at just the raw average across all users and items and 2) accounting for “bias” by normalizing across users and across items. The challenge lies in splitting a single dataset into training and testing subsets and in working around missing entries.
The data for building this recommender system can be found at http://nifty.stanford.edu/2011/craig-book-recommendations/ and was originally gathered with the intention of recommending books to high school seniors, based on 55 novels and ratings from 86 students in Canada. The ratings for each book are given as: (-5: Hated, -3: Didn’t Like, 0: Haven’t Read, 1: Neutral, 3: Liked, 5: Loved)
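For quick reference, the original rating codes can be kept as a named lookup vector (a small illustrative sketch; the object name rating_labels is not part of the assignment):
# original rating codes used in ratings.txt (illustrative lookup)
rating_labels = c("-5" = "Hated", "-3" = "Didn't Like", "0" = "Haven't Read",
                  "1" = "Neutral", "3" = "Liked", "5" = "Loved")
rating_labels["3"]   # e.g. look up what a rating of 3 means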
library(splitstackshape)
library(dplyr)
library(caTools)
library(Amelia)
library(hydroGOF)
library(knitr)
The datasets for the books and the ratings come from two different sources, so after importing them, they have to be combined into one dataframe. The books dataset was originally one column formatted as (Author, Title) and therefore had to be split into two columns (Author and Title), as seen below. This was originally done with the intention of enhancing the recommendation system in case an author appeared more than once; however, each author is a unique entry, so it seemed more pertinent to just work with the titles.
# grab the urls (there are 2: ratings and book titles)
url1 = "http://nifty.stanford.edu/2011/craig-book-recommendations/ratings.txt"
url2 = "http://nifty.stanford.edu/2011/craig-book-recommendations/books.txt"
# read in the titles and move the authors into another column
books = data.frame(read.delim(url2, header=F, sep="\t", stringsAsFactors = F))
books = cSplit(books, "V1", ",")
kable(head(books))
| V1_1 | V1_2 |
|---|---|
| Douglas Adams | The Hitchhiker’s Guide To The Galaxy |
| Richard Adams | Watership Down |
| Mitch Albom | The Five People You Meet in Heaven |
| Laurie Halse Anderson | Speak |
| Maya Angelou | I Know Why the Caged Bird Sings |
| Jay Asher | Thirteen Reasons Why |
Directly importing the ratings dataset from the website resulted in some awkward formatting: the odd rows contained only the name of the reader followed by NA values, while the even rows held that reader’s actual ratings.
# read in the ratings
ratings = data.frame(read.delim(url1, header=F, sep=" ", stringsAsFactors = F))
kable(head(ratings))
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | V41 | V42 | V43 | V44 | V45 | V46 | V47 | V48 | V49 | V50 | V51 | V52 | V53 | V54 | V55 | V56 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ben | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | -3 | 5 | 0 | 0 | 0 | 5 | 5 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | -5 | 0 | 0 | 5 | 5 | 0 | 5 | 5 | 5 | 0 | 5 | 5 | 0 | 0 | 0 | 5 | 5 | 5 | 5 | -5 | NA |
| Moose | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 5 | 5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 5 | 3 | 0 | 5 | 0 | 3 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | -3 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 5 | 0 | 3 | 0 | 0 | NA |
| Reuven | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 5 | -5 | 0 | 0 | 0 | 0 | -3 | -5 | 0 | 1 | -5 | 5 | 0 | 1 | 0 | 1 | -3 | 1 | -5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | -5 | 1 | 0 | 1 | 0 | -5 | 0 | 3 | -3 | 3 | 0 | 1 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 5 | 1 | 3 | NA |
To correct this, the even rows were extracted and the names in the odd rows were converted into a new column and attached to the ratings.
# Split the ratings into odd and even rows
odd = ratings %>% dplyr::filter(row_number() %% 2 == 1)
odd = within(odd, consumer <- paste(V1, V2, sep=" "))
even = ratings %>% dplyr::filter(row_number() %% 2 == 0)
colnames(even) = books$V1_2
# final dataset
rate = data.frame(cbind(odd$consumer, even))[,1:56]
names(rate)[names(rate) == "odd.consumer"] = "reader"
kable(head(rate))
| reader | The.Hitchhiker.s.Guide.To.The.Galaxy | Watership.Down | The.Five.People.You.Meet.in.Heaven | Speak | I.Know.Why.the.Caged.Bird.Sings | Thirteen.Reasons.Why | Foundation.Series | The.Sisterhood.of.the.Travelling.Pants | A.Great.and.Terrible.Beauty | The.Da.Vinci.Code | The.Princess.Diaries | Ender.s.Game | The.Hunt.for.Red.October | The.Hunger.Games | The.Great.Gatsby | Ranger.s.Apprentice.Series | Inkheart | Neuromancer | Lord.of.the.Flies | The.Princess.Bride | Dinotopia..A.Land.Apart.from.Time | Far.North | Practical.Magic | Brave.New.World | The.Summer.Tree | Flowers.For.Algernon | Owl.in.Love | Naruto | Bleach..graphic.novel. | Kiss.the.Dust | To.Kill.a.Mockingbird | The.Lion.the.Witch.and.the.Wardrobe | The.Bourne.Series | Life.of.Pi | Breathless | Twilight.Series | Sabriel | Nineteen.Eighty.Four..1984. | Eragon | Hatchet | My.Sister.s.Keeper | The.Golden.Compass | Harry.Potter.Series | Holes | Shonen.Jump.Series | The.Shadow.Club | Bone.Series | Maus..A.Survivor.s.Tale | The.Joy.Luck.Club | The.Lord.of.the.Rings | The.Hobbit | Shattered | The.War.Of.The.Worlds | Dealing.with.Dragons | The.Chrysalids |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ben | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | -3 | 5 | 0 | 0 | 0 | 5 | 5 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | -5 | 0 | 0 | 5 | 5 | 0 | 5 | 5 | 5 | 0 | 5 | 5 | 0 | 0 | 0 | 5 | 5 | 5 | 5 | -5 |
| Moose | 5 | 5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 5 | 3 | 0 | 5 | 0 | 3 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | -3 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 5 | 0 | 3 | 0 | 0 |
| Reuven | 5 | -5 | 0 | 0 | 0 | 0 | -3 | -5 | 0 | 1 | -5 | 5 | 0 | 1 | 0 | 1 | -3 | 1 | -5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | -5 | 1 | 0 | 1 | 0 | -5 | 0 | 3 | -3 | 3 | 0 | 1 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 5 | 1 | 3 |
| Cust1 | 3 | 3 | 5 | 0 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 5 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 5 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 |
| Cust2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 | 3 | 3 | 5 | 0 | 3 | 0 | 3 |
| Francois | 3 | 3 | 5 | 0 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 5 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 5 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 |
To make the dataset more intuitive, the ratings were remapped from the [-5, 5] scale to [1, 5], and the zero values that previously meant “not read” were replaced with “NA” to represent “no rating”.
# convert "not read" (0) to NA, then remap {-5, -3, 1, 3, 5} to {1, 2, 3, 4, 5}
# (3 is remapped to 4 before 1 is remapped to 3, so the new 3s are not re-promoted)
rate[rate == 0] = NA
rate[rate == 3] = 4
rate[rate == 1] = 3
rate[rate == -5] = 1
rate[rate == -3] = 2
Looking at the structure of the dataset, two books’ ratings (Hitchhiker’s Guide and Watership Down) had been read in as character variables. They were subsequently converted to numeric values.
str(rate)
## 'data.frame': 86 obs. of 56 variables:
## $ reader : Factor w/ 85 levels "Albus Dumbledore",..: 6 59 67 15 17 32 44 39 18 19 ...
## $ The.Hitchhiker.s.Guide.To.The.Galaxy : chr "5" "5" "5" "4" ...
## $ Watership.Down : chr NA "5" "1" "4" ...
## $ The.Five.People.You.Meet.in.Heaven : num NA NA NA 5 NA 5 NA NA NA NA ...
## $ Speak : num NA NA NA NA NA NA NA NA 2 NA ...
## $ I.Know.Why.the.Caged.Bird.Sings : num NA NA NA NA NA NA NA NA NA NA ...
## $ Thirteen.Reasons.Why : num NA NA NA NA NA NA NA NA NA NA ...
## $ Foundation.Series : num NA 4 2 4 NA 4 4 NA NA NA ...
## $ The.Sisterhood.of.the.Travelling.Pants: num 3 NA 1 NA NA NA NA NA NA 5 ...
## $ A.Great.and.Terrible.Beauty : num NA NA NA NA NA NA NA NA NA NA ...
## $ The.Da.Vinci.Code : num 3 3 3 4 NA 4 NA NA NA NA ...
## $ The.Princess.Diaries : num 2 NA 1 NA NA NA NA NA NA NA ...
## $ Ender.s.Game : num 5 5 5 4 NA 4 NA NA NA NA ...
## $ The.Hunt.for.Red.October : num NA 4 NA NA NA NA NA NA NA NA ...
## $ The.Hunger.Games : num NA NA 3 NA NA NA NA NA NA NA ...
## $ The.Great.Gatsby : num NA 5 NA NA NA NA 3 NA NA NA ...
## $ Ranger.s.Apprentice.Series : num 5 NA 3 NA NA NA NA NA NA NA ...
## $ Inkheart : num 5 4 2 NA NA NA NA 3 4 NA ...
## $ Neuromancer : num NA 4 3 4 NA 4 NA NA NA NA ...
## $ Lord.of.the.Flies : num NA 5 1 NA 3 NA NA NA NA NA ...
## $ The.Princess.Bride : num NA NA NA 5 NA 5 NA NA NA NA ...
## $ Dinotopia..A.Land.Apart.from.Time : num NA NA NA NA NA NA NA NA NA NA ...
## $ Far.North : num 5 NA NA NA NA NA NA NA NA NA ...
## $ Practical.Magic : num NA NA NA NA NA NA NA NA NA NA ...
## $ Brave.New.World : num NA NA NA 3 NA 3 4 NA NA NA ...
## $ The.Summer.Tree : num NA 5 NA 4 NA 4 NA NA NA NA ...
## $ Flowers.For.Algernon : num NA NA 4 3 NA 3 4 NA NA NA ...
## $ Owl.in.Love : num NA NA NA NA NA NA NA NA NA NA ...
## $ Naruto : num NA NA NA NA NA NA NA 5 5 NA ...
## $ Bleach..graphic.novel. : num NA NA NA NA NA NA NA NA 5 NA ...
## $ Kiss.the.Dust : num NA NA NA NA NA NA NA NA NA NA ...
## $ To.Kill.a.Mockingbird : num 3 4 1 NA NA NA NA NA NA NA ...
## $ The.Lion.the.Witch.and.the.Wardrobe : num 4 5 3 4 5 4 5 3 NA NA ...
## $ The.Bourne.Series : num NA NA NA NA NA NA NA NA NA NA ...
## $ Life.of.Pi : num 3 NA 3 4 NA 4 5 NA NA NA ...
## $ Breathless : int NA NA NA NA NA NA NA NA NA NA ...
## $ Twilight.Series : num 1 NA 1 NA NA NA NA 3 NA 5 ...
## $ Sabriel : num NA NA NA NA NA NA NA NA NA NA ...
## $ Nineteen.Eighty.Four..1984. : num NA 5 4 3 4 3 NA NA NA NA ...
## $ Eragon : num 5 2 2 4 3 4 NA 5 3 1 ...
## $ Hatchet : num 5 NA 4 NA NA NA NA NA 3 NA ...
## $ My.Sister.s.Keeper : num NA NA NA NA NA NA NA NA NA NA ...
## $ The.Golden.Compass : num 5 NA 3 4 NA 4 NA NA 2 1 ...
## $ Harry.Potter.Series : num 5 5 5 4 4 4 NA 5 5 5 ...
## $ Holes : num 5 NA 3 NA NA NA NA 4 4 NA ...
## $ Shonen.Jump.Series : num NA NA NA NA NA NA NA 5 NA NA ...
## $ The.Shadow.Club : num 5 NA NA NA NA NA NA NA NA NA ...
## $ Bone.Series : num 5 NA NA 5 4 5 NA NA 3 NA ...
## $ Maus..A.Survivor.s.Tale : num NA NA NA NA NA NA NA NA NA NA ...
## $ The.Joy.Luck.Club : num NA NA NA NA 4 NA 4 NA NA NA ...
## $ The.Lord.of.the.Rings : num NA 5 3 4 4 4 4 5 NA NA ...
## $ The.Hobbit : num 5 5 4 3 5 3 4 4 3 2 ...
## $ Shattered : num 5 NA 3 NA NA NA NA NA NA NA ...
## $ The.War.Of.The.Worlds : num 5 4 5 NA 4 NA NA 4 NA NA ...
## $ Dealing.with.Dragons : num 5 NA 3 NA NA NA NA NA NA NA ...
## $ The.Chrysalids : num 1 NA 4 NA 4 NA NA 1 NA 2 ...
rate$The.Hitchhiker.s.Guide.To.The.Galaxy = as.numeric(rate$The.Hitchhiker.s.Guide.To.The.Galaxy)
rate$Watership.Down = as.numeric(rate$Watership.Down)
The data has 86 readers and 55 novels. To narrow down the dataset, the first step was to look at the number of missing values. The plot below shows that more than half of the books have too many NA entries to support accurate recommendations. In fact, every reader has at least one unrated book, making it necessary to decide which variables to retain based on their number of missing values.
# Look at missing values
missmap(rate, main = "Missing values vs Observed")
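Beyond the visual map, the share of missing ratings per book can be quantified directly (a small sketch reusing the rate dataframe from above):
# proportion of NA ratings for each book, most-rated books first
na_share = sort(colMeans(is.na(rate[,-1])))
head(na_share, 10)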
The next step was to order the readers by their number of NA values, from fewest to most, and retain the 10 most active readers. From there, the 10 books with the fewest NA values were also retained, creating a 10 by 10 matrix of the most involved readers and the most reviewed novels.
# Count na values for readers
rate$na_count = apply(rate, 1, function(x){ sum(is.na(x))})
# Ordering na values and include only the top ten
rates = head(rate[order(rate$na_count),],10)
# Create new row for na count of books
rates[11,] = colSums(is.na(rates))
## Warning in `[<-.factor`(`*tmp*`, iseq, value = 0): invalid factor level, NA
## generated
# Sort based on book na count
rates_sort = rates[,order(rates[11,])]
# Select only ten books
final_rates = rates_sort[-11, -c(2,10,13:56)]
kable(final_rates)
| | The.Hitchhiker.s.Guide.To.The.Galaxy | The.Da.Vinci.Code | Lord.of.the.Flies | To.Kill.a.Mockingbird | The.Golden.Compass | Harry.Potter.Series | The.Hobbit | The.War.Of.The.Worlds | The.Sisterhood.of.the.Travelling.Pants | The.Princess.Diaries | reader |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 1 | 1 | 5 | 4 | NA | NA | 5 | 3 | 3 | 1 | Tony |
| 3 | 5 | 3 | 1 | 1 | 3 | 5 | 4 | 5 | 1 | 1 | Reuven |
| 12 | 3 | 5 | 4 | NA | 5 | 5 | NA | 4 | 5 | 5 | Cust6 |
| 16 | 4 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | NA | NA | andrew |
| 76 | 4 | 5 | 3 | 5 | 4 | 5 | 5 | 5 | 2 | 2 | Tiffany |
| 1 | 5 | 3 | NA | 3 | 5 | 5 | 5 | 5 | 3 | 2 | Ben |
| 14 | 4 | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 5 | 4 | Cust8 |
| 22 | 4 | 4 | 4 | 5 | 5 | 4 | 5 | NA | NA | NA | joe |
| 72 | 5 | 4 | 4 | 5 | 5 | 5 | 5 | 3 | 5 | 3 | Claire |
| 84 | 4 | NA | 3 | 2 | 5 | 3 | 5 | 2 | 2 | 3 | James |
To create and compare the predicted ratings of the books, a training and a test set were created. Ten values were selected and removed from the dataset to form the test set; the training set was the same dataset with those extracted values replaced by NAs.
# (row, column) positions of the ten ratings held out for testing
samples = rbind(c(1,10), c(2,9), c(3,8), c(4,7), c(5,6),
c(6,5), c(7,4), c(8,3), c(9,2), c(10,1))
# Train set: the full table with the held-out cells blanked out
# (indexing with a two-column matrix selects those (row, column) cells)
train = final_rates
train[samples] = NA
kable(train)
| | The.Hitchhiker.s.Guide.To.The.Galaxy | The.Da.Vinci.Code | Lord.of.the.Flies | To.Kill.a.Mockingbird | The.Golden.Compass | Harry.Potter.Series | The.Hobbit | The.War.Of.The.Worlds | The.Sisterhood.of.the.Travelling.Pants | The.Princess.Diaries | reader |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 1 | 1 | 5 | 4 | NA | NA | 5 | 3 | 3 | NA | Tony |
| 3 | 5 | 3 | 1 | 1 | 3 | 5 | 4 | 5 | NA | 1 | Reuven |
| 12 | 3 | 5 | 4 | NA | 5 | 5 | NA | NA | 5 | 5 | Cust6 |
| 16 | 4 | 2 | 4 | 4 | 4 | 4 | NA | 4 | NA | NA | andrew |
| 76 | 4 | 5 | 3 | 5 | 4 | NA | 5 | 5 | 2 | 2 | Tiffany |
| 1 | 5 | 3 | NA | 3 | NA | 5 | 5 | 5 | 3 | 2 | Ben |
| 14 | 4 | 5 | 5 | NA | 4 | 5 | 5 | 5 | 5 | 4 | Cust8 |
| 22 | 4 | 4 | NA | 5 | 5 | 4 | 5 | NA | NA | NA | joe |
| 72 | 5 | NA | 4 | 5 | 5 | 5 | 5 | 3 | 5 | 3 | Claire |
| 84 | NA | NA | 3 | 2 | 5 | 3 | 5 | 2 | 2 | 3 | James |
# Test set
test = as.numeric(final_rates[samples])
test
## [1] 1 1 4 4 5 5 5 4 4 4
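As a sanity check (a minimal sketch using the objects above), the split can be verified by counting missing entries: since all ten held-out ratings were observed, the training set should contain exactly ten more NAs than the full 10 by 10 table.
# additional NAs introduced by blanking out the test cells (should be 10)
sum(is.na(train[,-11])) - sum(is.na(final_rates[,-11]))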
The raw average of the training set is the mean of all observed ratings and serves as the prediction for every user-item (reader-book) combination: \(rawAvg = \frac{\Sigma(train)}{n}\), where the sum and the count \(n\) run over the observed entries only. Missing values are excluded rather than converted to 0, so (since every item in this dataset has NAs) they have to be worked around when summing and counting.
# Calculate Raw Average Rating for user-item combination
raw_train = round(sum(colSums(train[,-11], na.rm = T))/(sum(colSums(!is.na(train[,-11])))), 3)
raw_train
## [1] 3.899
raw_test = round(mean(test), 3)
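As a quick cross-check (a small sketch reusing the objects above), the same raw average can be obtained more directly with mean() over the matrix of training ratings; it should reproduce the 3.899 computed above.
# equivalent computation: mean of all observed (non-NA) training ratings
round(mean(as.matrix(train[,-11]), na.rm = TRUE), 3)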
The Root Mean Square Error (RMSE) for the raw average rating is the square root of the average of the squared differences between the observed values and the raw average: \(RMSE =\sqrt{\frac{\Sigma(train - rawAvg)^2}{n}}\). The RMSE for both the training and test sets is calculated below; lower values of RMSE indicate a better fit.
# Calculate RMSE of raw average
matrix_RMSE = function(matrix){
  matrix = select_if(matrix, is.numeric)
  # matrix mean (raw average of all observed ratings)
  matrix_mean = sum(colSums(matrix, na.rm = T))/(sum(colSums(!is.na(matrix))))
  # mean squared difference between each observed rating and the matrix mean
  matrix_mse = sum(colSums((matrix-matrix_mean)^2, na.rm = T))/(sum(colSums(!is.na(matrix))))
  # RMSE
  rmse = round(sqrt(matrix_mse),3)
  return(rmse)
}
train_RMSE = matrix_RMSE(train)
print(paste("Train set RMSE: ", train_RMSE))
## [1] "Train set RMSE: 1.239"
test_RMSE = round(sqrt(mean((test - raw_train)^2, na.rm =TRUE)), 3)
print(paste("Test set RMSE: ", test_RMSE))
## [1] "Test set RMSE: 1.432"
Though the majority of the ratings in this dataset are 4s and 5s (as seen visually and through the raw average of approximately 3.9), there are bound to be certain readers who are harsh judges and others who are generous. Some books may also be perceived as more entertaining than others. To account for this, the bias of each user (reader) and each item (book) is calculated below. These biases are computed from the training data only; the held-out test values are excluded.
# bias: difference between a book's (column) or reader's (row) average rating
# and the raw average of the training set
bias = function(matrix, item){
  matrix = select_if(matrix, is.numeric)
  if (item == T){
    # item (book) bias: per-column average minus the raw average
    bias = round((colSums(matrix, na.rm = T)/colSums(!is.na(matrix)))-raw_train, 2)
  } else{
    # user (reader) bias: per-row average minus the raw average
    bias = round((rowSums(matrix, na.rm = T)/rowSums(!is.na(matrix)))-raw_train, 2)
  }
  return(data.frame(bias))
}
# item (book) biases
item_bias = bias(train, T)
item_bias
## bias
## The.Hitchhiker.s.Guide.To.The.Galaxy -0.01
## The.Da.Vinci.Code -0.40
## Lord.of.the.Flies -0.27
## To.Kill.a.Mockingbird -0.27
## The.Golden.Compass 0.48
## Harry.Potter.Series 0.60
## The.Hobbit 0.98
## The.War.Of.The.Worlds 0.10
## The.Sisterhood.of.the.Travelling.Pants -0.33
## The.Princess.Diaries -1.04
# user (reader) biases, shown alongside each reader's name
user_bias = bias(train, F)
cbind(train$reader, user_bias)
## train$reader bias
## 78 Tony -0.76
## 3 Reuven -0.79
## 12 Cust6 0.67
## 16 andrew -0.18
## 76 Tiffany -0.01
## 1 Ben -0.02
## 14 Cust8 0.77
## 22 joe 0.60
## 72 Claire 0.55
## 84 James -0.77
The baseline predictors combine the raw average with the user and item biases to better estimate the value of every user-item combination: \(Baseline\ Predictor = Raw\ Average + User\ Bias + Item\ Bias\). Because some of the predictors exceeded the [1, 5] rating range, any values above or below these limits were clamped to the highest (5) or lowest (1) value, respectively.
# Empty matrix (rows = books, columns = readers)
baseline = matrix(, nrow = dim(item_bias)[1], ncol = dim(user_bias)[1])
# Using the raw average and both biases, calculate the baseline predictors
for (i in 1:dim(item_bias)[1]){
  users = t(as.matrix(user_bias))
  baseline[i, ] = round(item_bias[i,] + users + raw_train, 2)
}
rownames(baseline) = rownames(item_bias)
colnames(baseline) = train$reader
# Upper and lower prediction limits adjustment
baseline[baseline > 5] = 5
baseline[baseline < 1] = 1
kable(baseline)
| | Tony | Reuven | Cust6 | andrew | Tiffany | Ben | Cust8 | joe | Claire | James |
|---|---|---|---|---|---|---|---|---|---|---|
| The.Hitchhiker.s.Guide.To.The.Galaxy | 3.13 | 3.10 | 4.56 | 3.71 | 3.88 | 3.87 | 4.66 | 4.49 | 4.44 | 3.12 |
| The.Da.Vinci.Code | 2.74 | 2.71 | 4.17 | 3.32 | 3.49 | 3.48 | 4.27 | 4.10 | 4.05 | 2.73 |
| Lord.of.the.Flies | 2.87 | 2.84 | 4.30 | 3.45 | 3.62 | 3.61 | 4.40 | 4.23 | 4.18 | 2.86 |
| To.Kill.a.Mockingbird | 2.87 | 2.84 | 4.30 | 3.45 | 3.62 | 3.61 | 4.40 | 4.23 | 4.18 | 2.86 |
| The.Golden.Compass | 3.62 | 3.59 | 5.00 | 4.20 | 4.37 | 4.36 | 5.00 | 4.98 | 4.93 | 3.61 |
| Harry.Potter.Series | 3.74 | 3.71 | 5.00 | 4.32 | 4.49 | 4.48 | 5.00 | 5.00 | 5.00 | 3.73 |
| The.Hobbit | 4.12 | 4.09 | 5.00 | 4.70 | 4.87 | 4.86 | 5.00 | 5.00 | 5.00 | 4.11 |
| The.War.Of.The.Worlds | 3.24 | 3.21 | 4.67 | 3.82 | 3.99 | 3.98 | 4.77 | 4.60 | 4.55 | 3.23 |
| The.Sisterhood.of.the.Travelling.Pants | 2.81 | 2.78 | 4.24 | 3.39 | 3.56 | 3.55 | 4.34 | 4.17 | 4.12 | 2.80 |
| The.Princess.Diaries | 2.10 | 2.07 | 3.53 | 2.68 | 2.85 | 2.84 | 3.63 | 3.46 | 3.41 | 2.09 |
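As a spot check of how these predictors are assembled, consider Tony and The Hobbit: the raw average plus The Hobbit’s item bias and Tony’s reader bias gives \(3.899 + 0.98 - 0.76 \approx 4.12\), which matches the corresponding entry in the table above.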
One thing to note is that the dimensions of the baseline predictor matrix (10x10) differ from those of its unique rows (a 9x10 matrix). On closer inspection, the baseline predictors for “To Kill a Mockingbird” and “Lord of the Flies” are identical. This seems odd since, in the training dataset, these two novels did not receive similar ratings from similar readers; however, both books ended up with the same item bias, which produces identical rows of predictors.
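The duplication can be confirmed directly (a minimal sketch using the baseline matrix from above):
# the two books share an item bias of -0.27, so their baseline rows are identical
identical(baseline["To.Kill.a.Mockingbird", ], baseline["Lord.of.the.Flies", ])
# unique() therefore keeps only 9 of the 10 rows
dim(unique(baseline))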
To determine the performance of the baseline predictors (especially in comparison to the raw average), the RMSE for the baseline predictors was calculated for both the training and test sets. This was done by taking the square root of the average of the squared differences between the training set’s (and test set’s) values and the user-item baseline predictors.
# Calculate rmse for baseline for train and test
train_base_rmse = round(sqrt(sum((train[,-11] - baseline)^2, na.rm=TRUE) / length(train[,-11][!is.na(train[,-11])])), 3)
print(paste("Training set Baseline RMSE: ", train_base_rmse))
## [1] "Training set Baseline RMSE: 1.223"
test_base = baseline[samples]
test_base_rmse = round(sqrt(sum((test - test_base)^2) / length(test)), 3)
print(paste("Test set Baseline RMSE: ", test_base_rmse))
## [1] "Test set Baseline RMSE: 1.425"
A comparison table was created to better see the difference between the results of the Raw Average approach and the baseline approach. This was done for both the training and the test set.
# Summarize results
# percent improvement
train_imp = round((1-(train_base_rmse/matrix_RMSE(train)))*100, 2)
test_imp = round((1-(test_base_rmse/test_RMSE))*100, 2)
Raw_Average = c(raw_train, raw_test)
RMSE = c(train_RMSE, test_RMSE)
Baseline_RMSE = c(train_base_rmse, test_base_rmse)
Improvement_Percent = c(train_imp, test_imp)
results = data.frame(Raw_Average, RMSE, Baseline_RMSE, Improvement_Percent)
row.names(results) = c("Training Set", "Test Set")
kable(results)
| | Raw_Average | RMSE | Baseline_RMSE | Improvement_Percent |
|---|---|---|---|---|
| Training Set | 3.899 | 1.239 | 1.223 | 1.29 |
| Test Set | 3.700 | 1.432 | 1.425 | 0.49 |
Training vs Testing Set
Comparing the raw average scores (3.899 vs. 3.700), the testing set can be deemed reasonably representative of the training set. From that alone, we can expect the comparisons between the RMSE values for the raw-average and baseline approaches to lead to similar conclusions. This is the case for the raw average RMSE: the values for the training and testing sets are similar (roughly 1.2 and 1.4, respectively), with the testing set having a slightly worse fit (lower RMSE values indicate a better fit). The same holds for the baseline results, with the training set having an RMSE of about 1.2 and the test set a value of about 1.4. Overall, the training and testing sets produce similar results.
Raw Average vs Baseline Performance
In both the training and the test set, the RMSE for the raw average was slightly higher than for the baseline predictors, but the values were close: the percent improvement was 1.29% for the training set, more than two-and-a-half times the 0.49% improvement for the test set. This shows that the baseline predictors are a better method for predicting how a particular reader will rate a book, but not by much. The small margin can be attributed to the narrow rating range (1 to 5), to the small sample size (10 books and 10 readers), or to the fact that the selected sample consisted primarily of ratings of 3 and above, so the recommender system did not see a wide enough range of negative ratings to better distinguish the two approaches.