The purpose of this assignment is to predict ratings by 1) looking at just the raw average across all users and items and 2) accounting for “bias” by normalizing across users and across items. The challenge lies in splitting a single dataset into training and testing subsets and in working around missing entries.
The data for building this recommender system can be found at http://nifty.stanford.edu/2011/craig-book-recommendations/ and was originally gathered with the intention of recommending books to high school seniors, based on 55 novels and ratings from 86 students in Canada. The ratings for each book are given as: (-5: Hated, -3: Didn’t Like, 0: Haven’t Read, 1: Neutral, 3: Liked, 5: Loved)
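For quick reference, the original rating codes can be kept as a named lookup vector (a small illustrative sketch; the object name rating_labels is not part of the assignment):
# original rating codes used in ratings.txt (illustrative lookup)
rating_labels = c("-5" = "Hated", "-3" = "Didn't Like", "0" = "Haven't Read",
                  "1" = "Neutral", "3" = "Liked", "5" = "Loved")
rating_labels["3"]   # e.g. look up what a rating of 3 means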
library(splitstackshape)
library(dplyr)
library(caTools)
library(Amelia)
library(hydroGOF)
library(knitr)
The datasets for the books and the ratings come from two different sources, so after importing them, they have to be combined into one dataframe. The books dataset was originally one column formatted as (Author, Title) and therefore had to be split into two columns (Author and Title), as seen below. This was originally done with the intention of enhancing the recommendation system in case an author appeared more than once; however, each author is a unique entry, so it seemed more pertinent to just work with the titles.
# grab the urls (there are 2: ratings and book titles)
url1 = "http://nifty.stanford.edu/2011/craig-book-recommendations/ratings.txt"
url2 = "http://nifty.stanford.edu/2011/craig-book-recommendations/books.txt"
# read in the titles and move the authors into another column
books = data.frame(read.delim(url2, header=F, sep="\t", stringsAsFactors = F))
books = cSplit(books, "V1", ",")
kable(head(books))
| V1_1 | V1_2 |
|---|---|
| Douglas Adams | The Hitchhiker’s Guide To The Galaxy |
| Richard Adams | Watership Down |
| Mitch Albom | The Five People You Meet in Heaven |
| Laurie Halse Anderson | Speak |
| Maya Angelou | I Know Why the Caged Bird Sings |
| Jay Asher | Thirteen Reasons Why |
Directly importing the ratings dataset from the website resulted in some awkward formatting: the odd rows contained only the name of the reader followed by NA values, while the even rows held that reader’s actual ratings.
# read in the ratings
ratings = data.frame(read.delim(url1, header=F, sep=" ", stringsAsFactors = F))
kable(head(ratings))
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | V41 | V42 | V43 | V44 | V45 | V46 | V47 | V48 | V49 | V50 | V51 | V52 | V53 | V54 | V55 | V56 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ben | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | -3 | 5 | 0 | 0 | 0 | 5 | 5 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | -5 | 0 | 0 | 5 | 5 | 0 | 5 | 5 | 5 | 0 | 5 | 5 | 0 | 0 | 0 | 5 | 5 | 5 | 5 | -5 | NA |
| Moose | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 5 | 5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 5 | 3 | 0 | 5 | 0 | 3 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | -3 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 5 | 0 | 3 | 0 | 0 | NA |
| Reuven | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| 5 | -5 | 0 | 0 | 0 | 0 | -3 | -5 | 0 | 1 | -5 | 5 | 0 | 1 | 0 | 1 | -3 | 1 | -5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | -5 | 1 | 0 | 1 | 0 | -5 | 0 | 3 | -3 | 3 | 0 | 1 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 5 | 1 | 3 | NA |
To correct this, the even rows were extracted and the names in the odd rows were converted into a new column and attached to the ratings.
# Split the ratings into odd and even rows
odd = ratings %>% dplyr::filter(row_number() %% 2 == 1)
odd = within(odd, consumer <- paste(V1, V2, sep=" "))
even = ratings %>% dplyr::filter(row_number() %% 2 == 0)
colnames(even) = books$V1_2
# final dataset
rate = data.frame(cbind(odd$consumer, even))[,1:56]
names(rate)[names(rate) == "odd.consumer"] = "reader"
kable(head(rate))
| reader | The.Hitchhiker.s.Guide.To.The.Galaxy | Watership.Down | The.Five.People.You.Meet.in.Heaven | Speak | I.Know.Why.the.Caged.Bird.Sings | Thirteen.Reasons.Why | Foundation.Series | The.Sisterhood.of.the.Travelling.Pants | A.Great.and.Terrible.Beauty | The.Da.Vinci.Code | The.Princess.Diaries | Ender.s.Game | The.Hunt.for.Red.October | The.Hunger.Games | The.Great.Gatsby | Ranger.s.Apprentice.Series | Inkheart | Neuromancer | Lord.of.the.Flies | The.Princess.Bride | Dinotopia..A.Land.Apart.from.Time | Far.North | Practical.Magic | Brave.New.World | The.Summer.Tree | Flowers.For.Algernon | Owl.in.Love | Naruto | Bleach..graphic.novel. | Kiss.the.Dust | To.Kill.a.Mockingbird | The.Lion.the.Witch.and.the.Wardrobe | The.Bourne.Series | Life.of.Pi | Breathless | Twilight.Series | Sabriel | Nineteen.Eighty.Four..1984. | Eragon | Hatchet | My.Sister.s.Keeper | The.Golden.Compass | Harry.Potter.Series | Holes | Shonen.Jump.Series | The.Shadow.Club | Bone.Series | Maus..A.Survivor.s.Tale | The.Joy.Luck.Club | The.Lord.of.the.Rings | The.Hobbit | Shattered | The.War.Of.The.Worlds | Dealing.with.Dragons | The.Chrysalids |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ben | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | -3 | 5 | 0 | 0 | 0 | 5 | 5 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | -5 | 0 | 0 | 5 | 5 | 0 | 5 | 5 | 5 | 0 | 5 | 5 | 0 | 0 | 0 | 5 | 5 | 5 | 5 | -5 |
| Moose | 5 | 5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 5 | 3 | 0 | 5 | 0 | 3 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | -3 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 5 | 0 | 3 | 0 | 0 |
| Reuven | 5 | -5 | 0 | 0 | 0 | 0 | -3 | -5 | 0 | 1 | -5 | 5 | 0 | 1 | 0 | 1 | -3 | 1 | -5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | -5 | 1 | 0 | 1 | 0 | -5 | 0 | 3 | -3 | 3 | 0 | 1 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 5 | 1 | 3 |
| Cust1 | 3 | 3 | 5 | 0 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 5 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 5 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 |
| Cust2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 | 3 | 3 | 5 | 0 | 3 | 0 | 3 |
| Francois | 3 | 3 | 5 | 0 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 5 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 5 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 |
To make the dataset more intuitive, the ratings were remapped from the [-5, 5] scale to [1, 5], and the zero values that previously meant “not read” were replaced with “NA” to represent “no rating”.
# convert "not read" (0) to NA, then remap {-5, -3, 1, 3, 5} to {1, 2, 3, 4, 5}
# (3 is remapped to 4 before 1 is remapped to 3, so the new 3s are not re-promoted)
rate[rate == 0] = NA
rate[rate == 3] = 4
rate[rate == 1] = 3
rate[rate == -5] = 1
rate[rate == -3] = 2
Looking at the structure of the dataset, two books’ ratings (Hitchhiker’s Guide and Watership Down) had been read in as character variables. They were subsequently converted to numeric values.
str(rate)
## 'data.frame': 86 obs. of 56 variables:
## $ reader : Factor w/ 85 levels "Albus Dumbledore",..: 6 59 67 15 17 32 44 39 18 19 ...
## $ The.Hitchhiker.s.Guide.To.The.Galaxy : chr "5" "5" "5" "4" ...
## $ Watership.Down : chr NA "5" "1" "4" ...
## $ The.Five.People.You.Meet.in.Heaven : num NA NA NA 5 NA 5 NA NA NA NA ...
## $ Speak : num NA NA NA NA NA NA NA NA 2 NA ...
## $ I.Know.Why.the.Caged.Bird.Sings : num NA NA NA NA NA NA NA NA NA NA ...
## $ Thirteen.Reasons.Why : num NA NA NA NA NA NA NA NA NA NA ...
## $ Foundation.Series : num NA 4 2 4 NA 4 4 NA NA NA ...
## $ The.Sisterhood.of.the.Travelling.Pants: num 3 NA 1 NA NA NA NA NA NA 5 ...
## $ A.Great.and.Terrible.Beauty : num NA NA NA NA NA NA NA NA NA NA ...
## $ The.Da.Vinci.Code : num 3 3 3 4 NA 4 NA NA NA NA ...
## $ The.Princess.Diaries : num 2 NA 1 NA NA NA NA NA NA NA ...
## $ Ender.s.Game : num 5 5 5 4 NA 4 NA NA NA NA ...
## $ The.Hunt.for.Red.October : num NA 4 NA NA NA NA NA NA NA NA ...
## $ The.Hunger.Games : num NA NA 3 NA NA NA NA NA NA NA ...
## $ The.Great.Gatsby : num NA 5 NA NA NA NA 3 NA NA NA ...
## $ Ranger.s.Apprentice.Series : num 5 NA 3 NA NA NA NA NA NA NA ...
## $ Inkheart : num 5 4 2 NA NA NA NA 3 4 NA ...
## $ Neuromancer : num NA 4 3 4 NA 4 NA NA NA NA ...
## $ Lord.of.the.Flies : num NA 5 1 NA 3 NA NA NA NA NA ...
## $ The.Princess.Bride : num NA NA NA 5 NA 5 NA NA NA NA ...
## $ Dinotopia..A.Land.Apart.from.Time : num NA NA NA NA NA NA NA NA NA NA ...
## $ Far.North : num 5 NA NA NA NA NA NA NA NA NA ...
## $ Practical.Magic : num NA NA NA NA NA NA NA NA NA NA ...
## $ Brave.New.World : num NA NA NA 3 NA 3 4 NA NA NA ...
## $ The.Summer.Tree : num NA 5 NA 4 NA 4 NA NA NA NA ...
## $ Flowers.For.Algernon : num NA NA 4 3 NA 3 4 NA NA NA ...
## $ Owl.in.Love : num NA NA NA NA NA NA NA NA NA NA ...
## $ Naruto : num NA NA NA NA NA NA NA 5 5 NA ...
## $ Bleach..graphic.novel. : num NA NA NA NA NA NA NA NA 5 NA ...
## $ Kiss.the.Dust : num NA NA NA NA NA NA NA NA NA NA ...
## $ To.Kill.a.Mockingbird : num 3 4 1 NA NA NA NA NA NA NA ...
## $ The.Lion.the.Witch.and.the.Wardrobe : num 4 5 3 4 5 4 5 3 NA NA ...
## $ The.Bourne.Series : num NA NA NA NA NA NA NA NA NA NA ...
## $ Life.of.Pi : num 3 NA 3 4 NA 4 5 NA NA NA ...
## $ Breathless : int NA NA NA NA NA NA NA NA NA NA ...
## $ Twilight.Series : num 1 NA 1 NA NA NA NA 3 NA 5 ...
## $ Sabriel : num NA NA NA NA NA NA NA NA NA NA ...
## $ Nineteen.Eighty.Four..1984. : num NA 5 4 3 4 3 NA NA NA NA ...
## $ Eragon : num 5 2 2 4 3 4 NA 5 3 1 ...
## $ Hatchet : num 5 NA 4 NA NA NA NA NA 3 NA ...
## $ My.Sister.s.Keeper : num NA NA NA NA NA NA NA NA NA NA ...
## $ The.Golden.Compass : num 5 NA 3 4 NA 4 NA NA 2 1 ...
## $ Harry.Potter.Series : num 5 5 5 4 4 4 NA 5 5 5 ...
## $ Holes : num 5 NA 3 NA NA NA NA 4 4 NA ...
## $ Shonen.Jump.Series : num NA NA NA NA NA NA NA 5 NA NA ...
## $ The.Shadow.Club : num 5 NA NA NA NA NA NA NA NA NA ...
## $ Bone.Series : num 5 NA NA 5 4 5 NA NA 3 NA ...
## $ Maus..A.Survivor.s.Tale : num NA NA NA NA NA NA NA NA NA NA ...
## $ The.Joy.Luck.Club : num NA NA NA NA 4 NA 4 NA NA NA ...
## $ The.Lord.of.the.Rings : num NA 5 3 4 4 4 4 5 NA NA ...
## $ The.Hobbit : num 5 5 4 3 5 3 4 4 3 2 ...
## $ Shattered : num 5 NA 3 NA NA NA NA NA NA NA ...
## $ The.War.Of.The.Worlds : num 5 4 5 NA 4 NA NA 4 NA NA ...
## $ Dealing.with.Dragons : num 5 NA 3 NA NA NA NA NA NA NA ...
## $ The.Chrysalids : num 1 NA 4 NA 4 NA NA 1 NA 2 ...
rate$The.Hitchhiker.s.Guide.To.The.Galaxy = as.numeric(rate$The.Hitchhiker.s.Guide.To.The.Galaxy)
rate$Watership.Down = as.numeric(rate$Watership.Down)
The data has 86 readers and 55 novels. To narrow down the dataset, the first step was to look at the number of missing values. The plot below shows that more than half of the books have too many NA entries to support accurate recommendations. In fact, every reader has at least one unrated book, making it necessary to decide which variables to retain based on their number of missing values.
# Look at missing values
missmap(rate, main = "Missing values vs Observed")
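Beyond the visual map, the share of missing ratings per book can be quantified directly (a small sketch reusing the rate dataframe from above):
# proportion of NA ratings for each book, most-rated books first
na_share = sort(colMeans(is.na(rate[,-1])))
head(na_share, 10)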
The next step was to order the readers by their number of NA values, from fewest to most, and retain the 10 most active readers. From there, the 10 books with the fewest NA values were also retained, creating a 10 by 10 matrix of the most involved readers and the most reviewed novels.
# Count na values for readers
rate$na_count = apply(rate, 1, function(x){ sum(is.na(x))})
# Ordering na values and include only the top ten
rates = head(rate[order(rate$na_count),],10)
# Create new row for na count of books
rates[11,] = colSums(is.na(rates))
## Warning in `[<-.factor`(`*tmp*`, iseq, value = 0): invalid factor level, NA
## generated
# Sort based on book na count
rates_sort = rates[,order(rates[11,])]
# Select only ten books
final_rates = rates_sort[-11, -c(2,10,13:56)]
kable(final_rates)
| | The.Hitchhiker.s.Guide.To.The.Galaxy | The.Da.Vinci.Code | Lord.of.the.Flies | To.Kill.a.Mockingbird | The.Golden.Compass | Harry.Potter.Series | The.Hobbit | The.War.Of.The.Worlds | The.Sisterhood.of.the.Travelling.Pants | The.Princess.Diaries | reader |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 1 | 1 | 5 | 4 | NA | NA | 5 | 3 | 3 | 1 | Tony |
| 3 | 5 | 3 | 1 | 1 | 3 | 5 | 4 | 5 | 1 | 1 | Reuven |
| 12 | 3 | 5 | 4 | NA | 5 | 5 | NA | 4 | 5 | 5 | Cust6 |
| 16 | 4 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | NA | NA | andrew |
| 76 | 4 | 5 | 3 | 5 | 4 | 5 | 5 | 5 | 2 | 2 | Tiffany |
| 1 | 5 | 3 | NA | 3 | 5 | 5 | 5 | 5 | 3 | 2 | Ben |
| 14 | 4 | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 5 | 4 | Cust8 |
| 22 | 4 | 4 | 4 | 5 | 5 | 4 | 5 | NA | NA | NA | joe |
| 72 | 5 | 4 | 4 | 5 | 5 | 5 | 5 | 3 | 5 | 3 | Claire |
| 84 | 4 | NA | 3 | 2 | 5 | 3 | 5 | 2 | 2 | 3 | James |
To create and compare the predicted ratings of the books, a training and a test set were created. Ten values were selected and removed from the dataset to form the test set; the training set was the same dataset with those extracted values replaced by NAs.
# (row, column) positions of the ten ratings held out for testing
samples = rbind(c(1,10), c(2,9), c(3,8), c(4,7), c(5,6),
c(6,5), c(7,4), c(8,3), c(9,2), c(10,1))
# Train set: the full table with the held-out cells blanked out
# (indexing with a two-column matrix selects those (row, column) cells)
train = final_rates
train[samples] = NA
kable(train)
| | The.Hitchhiker.s.Guide.To.The.Galaxy | The.Da.Vinci.Code | Lord.of.the.Flies | To.Kill.a.Mockingbird | The.Golden.Compass | Harry.Potter.Series | The.Hobbit | The.War.Of.The.Worlds | The.Sisterhood.of.the.Travelling.Pants | The.Princess.Diaries | reader |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 1 | 1 | 5 | 4 | NA | NA | 5 | 3 | 3 | NA | Tony |
| 3 | 5 | 3 | 1 | 1 | 3 | 5 | 4 | 5 | NA | 1 | Reuven |
| 12 | 3 | 5 | 4 | NA | 5 | 5 | NA | NA | 5 | 5 | Cust6 |
| 16 | 4 | 2 | 4 | 4 | 4 | 4 | NA | 4 | NA | NA | andrew |
| 76 | 4 | 5 | 3 | 5 | 4 | NA | 5 | 5 | 2 | 2 | Tiffany |
| 1 | 5 | 3 | NA | 3 | NA | 5 | 5 | 5 | 3 | 2 | Ben |
| 14 | 4 | 5 | 5 | NA | 4 | 5 | 5 | 5 | 5 | 4 | Cust8 |
| 22 | 4 | 4 | NA | 5 | 5 | 4 | 5 | NA | NA | NA | joe |
| 72 | 5 | NA | 4 | 5 | 5 | 5 | 5 | 3 | 5 | 3 | Claire |
| 84 | NA | NA | 3 | 2 | 5 | 3 | 5 | 2 | 2 | 3 | James |
# Test set
test = as.numeric(final_rates[samples])
test
## [1] 1 1 4 4 5 5 5 4 4 4
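As a sanity check (a minimal sketch using the objects above), the split can be verified by counting missing entries: since all ten held-out ratings were observed, the training set should contain exactly ten more NAs than the full 10 by 10 table.
# additional NAs introduced by blanking out the test cells (should be 10)
sum(is.na(train[,-11])) - sum(is.na(final_rates[,-11]))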
The raw average of the training set is the mean of all observed ratings and serves as the prediction for every user-item (reader-book) combination: \(rawAvg = \frac{\Sigma(train)}{n}\), where the sum and the count \(n\) run over the observed entries only. Missing values are excluded rather than converted to 0, so (since every item in this dataset has NAs) they have to be worked around when summing and counting.
# Calculate Raw Average Rating for user-item combination
raw_train = round(sum(colSums(train[,-11], na.rm = T))/(sum(colSums(!is.na(train[,-11])))), 3)
raw_train
## [1] 3.899
raw_test = round(mean(test), 3)
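As a quick cross-check (a small sketch reusing the objects above), the same raw average can be obtained more directly with mean() over the matrix of training ratings; it should reproduce the 3.899 computed above.
# equivalent computation: mean of all observed (non-NA) training ratings
round(mean(as.matrix(train[,-11]), na.rm = TRUE), 3)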
The Root Mean Square Error (RMSE) for the raw average rating is the square root of the average of the squared differences between the observed values and the raw average: \(RMSE =\sqrt{\frac{\Sigma(train - rawAvg)^2}{n}}\). The RMSE for both the training and test sets is calculated below; lower values of RMSE indicate a better fit.
# Calculate RMSE of raw average
matrix_RMSE = function(matrix){
  matrix = select_if(matrix, is.numeric)
  # matrix mean (raw average of all observed ratings)
  matrix_mean = sum(colSums(matrix, na.rm = T))/(sum(colSums(!is.na(matrix))))
  # mean squared difference between each observed rating and the matrix mean
  matrix_mse = sum(colSums((matrix-matrix_mean)^2, na.rm = T))/(sum(colSums(!is.na(matrix))))
  # RMSE
  rmse = round(sqrt(matrix_mse),3)
  return(rmse)
}
train_RMSE = matrix_RMSE(train)
print(paste("Train set RMSE: ", train_RMSE))
## [1] "Train set RMSE: 1.239"
test_RMSE = round(sqrt(mean((test - raw_train)^2, na.rm =TRUE)), 3)
print(paste("Test set RMSE: ", test_RMSE))
## [1] "Test set RMSE: 1.432"
Though the majority of the ratings in this dataset are 4s and 5s (as seen visually and through the raw average of approximately 3.9), there are bound to be certain readers who are harsh judges and others who are generous. Some books may also be perceived as more entertaining than others. To account for this, the bias of each user (reader) and each item (book) is calculated below. These biases are computed from the training data only; the held-out test values are excluded.
# bias: difference between a book's (column) or reader's (row) average rating
# and the raw average of the training set
bias = function(matrix, item){
  matrix = select_if(matrix, is.numeric)
  if (item == T){
    # item (book) bias: per-column average minus the raw average
    bias = round((colSums(matrix, na.rm = T)/colSums(!is.na(matrix)))-raw_train, 2)
  } else{
    # user (reader) bias: per-row average minus the raw average
    bias = round((rowSums(matrix, na.rm = T)/rowSums(!is.na(matrix)))-raw_train, 2)
  }
  return(data.frame(bias))
}
# item (book) biases
item_bias = bias(train, T)
item_bias
## bias
## The.Hitchhiker.s.Guide.To.The.Galaxy -0.01
## The.Da.Vinci.Code -0.40
## Lord.of.the.Flies -0.27
## To.Kill.a.Mockingbird -0.27
## The.Golden.Compass 0.48
## Harry.Potter.Series 0.60
## The.Hobbit 0.98
## The.War.Of.The.Worlds 0.10
## The.Sisterhood.of.the.Travelling.Pants -0.33
## The.Princess.Diaries -1.04
# user (reader) biases, shown alongside each reader's name
user_bias = bias(train, F)
cbind(train$reader, user_bias)
## train$reader bias
## 78 Tony -0.76
## 3 Reuven -0.79
## 12 Cust6 0.67
## 16 andrew -0.18
## 76 Tiffany -0.01
## 1 Ben -0.02
## 14 Cust8 0.77
## 22 joe 0.60
## 72 Claire 0.55
## 84 James -0.77
The baseline predictors combine the raw average with the user and item biases to better estimate the value of every user-item combination: \(Baseline\ Predictor = Raw\ Average + User\ Bias + Item\ Bias\). Because some of the predictors exceeded the [1, 5] rating range, any values above or below these limits were clamped to the highest (5) or lowest (1) value, respectively.
# Empty matrix (rows = books, columns = readers)
baseline = matrix(, nrow = dim(item_bias)[1], ncol = dim(user_bias)[1])
# Using the raw average and both biases, calculate the baseline predictors
for (i in 1:dim(item_bias)[1]){
  users = t(as.matrix(user_bias))
  baseline[i, ] = round(item_bias[i,] + users + raw_train, 2)
}
rownames(baseline) = rownames(item_bias)
colnames(baseline) = train$reader
# Upper and lower prediction limits adjustment
baseline[baseline > 5] = 5
baseline[baseline < 1] = 1
kable(baseline)
| | Tony | Reuven | Cust6 | andrew | Tiffany | Ben | Cust8 | joe | Claire | James |
|---|---|---|---|---|---|---|---|---|---|---|
| The.Hitchhiker.s.Guide.To.The.Galaxy | 3.13 | 3.10 | 4.56 | 3.71 | 3.88 | 3.87 | 4.66 | 4.49 | 4.44 | 3.12 |
| The.Da.Vinci.Code | 2.74 | 2.71 | 4.17 | 3.32 | 3.49 | 3.48 | 4.27 | 4.10 | 4.05 | 2.73 |
| Lord.of.the.Flies | 2.87 | 2.84 | 4.30 | 3.45 | 3.62 | 3.61 | 4.40 | 4.23 | 4.18 | 2.86 |
| To.Kill.a.Mockingbird | 2.87 | 2.84 | 4.30 | 3.45 | 3.62 | 3.61 | 4.40 | 4.23 | 4.18 | 2.86 |
| The.Golden.Compass | 3.62 | 3.59 | 5.00 | 4.20 | 4.37 | 4.36 | 5.00 | 4.98 | 4.93 | 3.61 |
| Harry.Potter.Series | 3.74 | 3.71 | 5.00 | 4.32 | 4.49 | 4.48 | 5.00 | 5.00 | 5.00 | 3.73 |
| The.Hobbit | 4.12 | 4.09 | 5.00 | 4.70 | 4.87 | 4.86 | 5.00 | 5.00 | 5.00 | 4.11 |
| The.War.Of.The.Worlds | 3.24 | 3.21 | 4.67 | 3.82 | 3.99 | 3.98 | 4.77 | 4.60 | 4.55 | 3.23 |
| The.Sisterhood.of.the.Travelling.Pants | 2.81 | 2.78 | 4.24 | 3.39 | 3.56 | 3.55 | 4.34 | 4.17 | 4.12 | 2.80 |
| The.Princess.Diaries | 2.10 | 2.07 | 3.53 | 2.68 | 2.85 | 2.84 | 3.63 | 3.46 | 3.41 | 2.09 |
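As a spot check of how these predictors are assembled, consider Tony and The Hobbit: the raw average plus The Hobbit’s item bias and Tony’s reader bias gives \(3.899 + 0.98 - 0.76 \approx 4.12\), which matches the corresponding entry in the table above.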
One thing to note is that the dimensions of the baseline predictor matrix (10x10) differ from those of its unique rows (a 9x10 matrix). On closer inspection, the baseline predictors for “To Kill a Mockingbird” and “Lord of the Flies” are identical. This seems odd since, in the training dataset, these two novels did not receive similar ratings from similar readers; however, both books ended up with the same item bias, which produces identical rows of predictors.
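The duplication can be confirmed directly (a minimal sketch using the baseline matrix from above):
# the two books share an item bias of -0.27, so their baseline rows are identical
identical(baseline["To.Kill.a.Mockingbird", ], baseline["Lord.of.the.Flies", ])
# unique() therefore keeps only 9 of the 10 rows
dim(unique(baseline))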
To determine the performance of the baseline predictors (especially in comparison to the raw average), the RMSE for the baseline predictors was calculated for both the training and test sets. This was done by taking the square root of the average of the squared differences between the training set’s (and test set’s) values and the user-item baseline predictors.
# Calculate rmse for baseline for train and test
train_base_rmse = round(sqrt(sum((train[,-11] - baseline)^2, na.rm=TRUE) / length(train[,-11][!is.na(train[,-11])])), 3)
print(paste("Training set Baseline RMSE: ", train_base_rmse))
## [1] "Training set Baseline RMSE: 1.223"
test_base = baseline[samples]
test_base_rmse = round(sqrt(sum((test - test_base)^2) / length(test)), 3)
print(paste("Test set Baseline RMSE: ", test_base_rmse))
## [1] "Test set Baseline RMSE: 1.425"
A comparison table was created to better see the difference between the results of the Raw Average approach and the baseline approach. This was done for both the training and the test set.
# Summarize results
# percent improvement
train_imp = round((1-(train_base_rmse/matrix_RMSE(train)))*100, 2)
test_imp = round((1-(test_base_rmse/test_RMSE))*100, 2)
Raw_Average = c(raw_train, raw_test)
RMSE = c(train_RMSE, test_RMSE)
Baseline_RMSE = c(train_base_rmse, test_base_rmse)
Improvement_Percent = c(train_imp, test_imp)
results = data.frame(Raw_Average, RMSE, Baseline_RMSE, Improvement_Percent)
row.names(results) = c("Training Set", "Test Set")
kable(results)
| | Raw_Average | RMSE | Baseline_RMSE | Improvement_Percent |
|---|---|---|---|---|
| Training Set | 3.899 | 1.239 | 1.223 | 1.29 |
| Test Set | 3.700 | 1.432 | 1.425 | 0.49 |
Training vs Testing Set
Comparing the raw average scores (3.899 vs. 3.700), the testing set can be deemed reasonably representative of the training set. From that alone, we can expect the comparisons between the RMSE values for the raw-average and baseline approaches to lead to similar conclusions. This is the case for the raw average RMSE: the values for the training and testing sets are similar (roughly 1.2 and 1.4, respectively), with the testing set having a slightly worse fit (lower RMSE values indicate a better fit). The same holds for the baseline results, with the training set having an RMSE of about 1.2 and the test set a value of about 1.4. Overall, the training and testing sets produce similar results.
Raw Average vs Baseline Performance
In both the training and the test set, the RMSE for the raw average was slightly higher than for the baseline predictors, but the values were close: the percent improvement was 1.29% for the training set, more than two-and-a-half times the 0.49% improvement for the test set. This shows that the baseline predictors are a better method for predicting how a particular reader will rate a book, but not by much. The small margin can be attributed to the narrow rating range (1 to 5), to the small sample size (10 books and 10 readers), or to the fact that the selected sample consisted primarily of ratings of 3 and above, so the recommender system did not see a wide enough range of negative ratings to better distinguish the two approaches.