Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”
The system recommends books to readers based on the ratings submitted by other users.
Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.
This is a toy dataset stored in JSON format. Five users have rated five books on a scale of 1 to 5, and some of the ratings are missing.
Load your data into (for example) an R or dataframe, a Python dictionary or list of lists, (or another data structure of your choosing). From there, create a user-item matrix. If you choose to work with a large dataset, you’re encouraged to also create a small, relatively dense “user-item” matrix as a subset so that you can hand-verify your calculations.
library(tidyverse)
library(jsonlite)
library(knitr)
library(RCurl)
json.url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/books_rating.json"
json.file <- getURLContent(json.url)
json.df <- as.data.frame(fromJSON(json.file[[1]]))
colnames(json.df) <- str_replace(colnames(json.df),"books\\.", "")
colnames(json.df) <- str_replace(colnames(json.df),"\\.", " ")
dim(json.df)
## [1] 5 6
rownames(json.df) <- json.df$Title
book.df <- select(json.df,User1:User5)
kable(book.df)
Book | User1 | User2 | User3 | User4 | User5 |
---|---|---|---|---|---|
How to Win Friends & Influence People | 4.5 | 3 | 3.5 | 5 | NA |
Change or Die The Three Keys to Change at Work and in Life | 4 | 2 | 3.5 | 4.5 | 4 |
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | 4.5 | NA | 4 | 4.5 | 4 |
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | 5 | 3.5 | NA | 4.5 | 5 |
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google | NA | 3 | 4.5 | NA | 4.5 |
bk.matrix <- data.matrix(book.df)
kable(bk.matrix)
Book | User1 | User2 | User3 | User4 | User5 |
---|---|---|---|---|---|
How to Win Friends & Influence People | 4.5 | 3.0 | 3.5 | 5.0 | NA |
Change or Die The Three Keys to Change at Work and in Life | 4.0 | 2.0 | 3.5 | 4.5 | 4.0 |
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | 4.5 | NA | 4.0 | 4.5 | 4.0 |
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | 5.0 | 3.5 | NA | 4.5 | 5.0 |
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google | NA | 3.0 | 4.5 | NA | 4.5 |
Break your ratings into separate training and test datasets.
test.rating.cell <- rbind(c(3,1), c(2,2), c(2,3), c(3,4), c(5,5))
test.data.values <- as.numeric(bk.matrix[test.rating.cell])
test.data.values
## [1] 4.5 2.0 3.5 4.5 4.5
train.dataset <- bk.matrix
train.dataset[test.rating.cell] <- NA
kable(train.dataset)
Book | User1 | User2 | User3 | User4 | User5 |
---|---|---|---|---|---|
How to Win Friends & Influence People | 4.5 | 3.0 | 3.5 | 5.0 | NA |
Change or Die The Three Keys to Change at Work and in Life | 4.0 | NA | NA | 4.5 | 4 |
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | NA | NA | 4.0 | NA | 4 |
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | 5.0 | 3.5 | NA | 4.5 | 5 |
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google | NA | 3.0 | 4.5 | NA | NA |
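The held-out cells above were picked by hand. As an alternative, the same split can be drawn at random from the observed ratings. The following is a minimal sketch, assuming the ratings are typed in from the table above; the names `ratings`, `observed`, `test_idx`, and the seed value are illustrative and not part of the original analysis.

```r
# Ratings copied from the user-item matrix above (rows = books, columns = users).
ratings <- matrix(
  c(4.5, 3.0, 3.5, 5.0, NA,
    4.0, 2.0, 3.5, 4.5, 4.0,
    4.5, NA,  4.0, 4.5, 4.0,
    5.0, 3.5, NA,  4.5, 5.0,
    NA,  3.0, 4.5, NA,  4.5),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Book", 1:5), paste0("User", 1:5))
)

set.seed(42)  # hypothetical seed, only for reproducibility

# (row, col) positions of every observed rating
observed <- which(!is.na(ratings), arr.ind = TRUE)

# Hold out five observed cells at random
test_idx <- observed[sample(nrow(observed), 5), , drop = FALSE]
test_values <- ratings[test_idx]

# The training matrix is the full matrix with the held-out cells blanked
train <- ratings
train[test_idx] <- NA
```

With a matrix this small, a random split can leave a book or a user with very few training ratings, which is one reason to choose the held-out cells by hand as done above.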
Using your training data, calculate the raw average (mean) rating for every user-item combination.
train_mean <- mean(as.numeric(as.matrix(train.dataset)), na.rm=TRUE)
train_mean
## [1] 4.133333
train.err <- train.dataset - train_mean
kable(train.err)
Book | User1 | User2 | User3 | User4 | User5 |
---|---|---|---|---|---|
How to Win Friends & Influence People | 0.3666667 | -1.1333333 | -0.6333333 | 0.8666667 | NA |
Change or Die The Three Keys to Change at Work and in Life | -0.1333333 | NA | NA | 0.3666667 | -0.1333333 |
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | NA | NA | -0.1333333 | NA | -0.1333333 |
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | 0.8666667 | -0.6333333 | NA | 0.3666667 | 0.8666667 |
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google | NA | -1.1333333 | 0.3666667 | NA | NA |
Calculate the RMSE for raw average for both your training data and your test data.
train_rmse <- sqrt(mean((train.err^2),na.rm = TRUE))
train_rmse
## [1] 0.644636
A function to calculate RMSE
rmseCal <- function(actual, predict){
val <- (actual-predict)^2
sqrt(mean(val))
}
test_rmse <- rmseCal(test.data.values,train_mean)
test_rmse
## [1] 1.034945
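For reference, the quantity that rmseCal computes can be written out by hand. Using the five held-out ratings (4.5, 2.0, 3.5, 4.5, 4.5) and the raw training average of 4.1333 as the prediction for every cell, the test RMSE works out as follows:

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(r_i - \hat{r}_i\right)^2}
\]
\[
\mathrm{RMSE}_{\text{test}}
= \sqrt{\frac{0.3667^2 + (-2.1333)^2 + (-0.6333)^2 + 0.3667^2 + 0.3667^2}{5}}
= \sqrt{\frac{5.3556}{5}} \approx 1.0349
\]
```

Most of the error comes from a single cell, the held-out rating of 2.0, which sits more than two points below the raw average.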
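Using your training data, calculate the bias for each user and each item.
The user bias and book bias are the column means and row means of the ratings minus the raw average. Note that the code below takes those means over bk.matrix, the full rating matrix, so the held-out test ratings also contribute to the biases; a strictly training-only version would use train.dataset instead.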
user_bias <- colMeans(bk.matrix, na.rm=TRUE) - train_mean
kable(user_bias)
User | Bias |
---|---|
User1 | 0.3666667 |
User2 | -1.2583333 |
User3 | -0.2583333 |
User4 | 0.4916667 |
User5 | 0.2416667 |
book_bias <- rowMeans(bk.matrix, na.rm=TRUE) - train_mean
kable(book_bias)
Book | Bias |
---|---|
How to Win Friends & Influence People | -0.1333333 |
Change or Die The Three Keys to Change at Work and in Life | -0.5333333 |
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | 0.1166667 |
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | 0.3666667 |
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google | -0.1333333 |
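To estimate the biases from the training split alone, the same row and column means can be taken over the training matrix. The following is a minimal sketch, assuming the training ratings are typed in from the training table above; the short labels Book1 to Book5 and the names `train`, `mu`, `user_bias_train`, and `book_bias_train` are illustrative and do not appear in the original code.

```r
# Training matrix copied from the table above (rows = books, columns = users).
train <- matrix(
  c(4.5, 3.0, 3.5, 5.0, NA,
    4.0, NA,  NA,  4.5, 4.0,
    NA,  NA,  4.0, NA,  4.0,
    5.0, 3.5, NA,  4.5, 5.0,
    NA,  3.0, 4.5, NA,  NA),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Book", 1:5), paste0("User", 1:5))
)

# Raw average over the observed training ratings
mu <- mean(train, na.rm = TRUE)

# Bias = mean rating of that user (column) or book (row) minus the raw average
user_bias_train <- colMeans(train, na.rm = TRUE) - mu
book_bias_train <- rowMeans(train, na.rm = TRUE) - mu

round(user_bias_train, 3)
round(book_bias_train, 3)
```

Because some books keep only two training ratings, these training-only biases rest on very little data and can differ noticeably from the values computed on the full matrix.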
From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
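The baseline predictor for each user-book pair is the raw average plus the user bias plus the book bias. Note that the loop below fills one row per user (row i holds user i's predictions for the five books), so the values in baseline are arranged users by books, while the row and column names assigned afterwards label the rows as books and the columns as users. The printed table, and any cell lookups against baseline, should be read with that transposition in mind.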
baseline <- as.data.frame(matrix(nrow=5, ncol=5))
for (i in 1:length(user_bias)){
row <- c(book_bias + user_bias[i] + train_mean)
baseline[i, ] <- row
}
# Cap predictions at the maximum possible rating of 5
baseline[baseline > 5] <- 5
rownames(baseline) <- json.df$Title
colnames(baseline) <- colnames(book.df)
kable(round(baseline, 2))
Book | User1 | User2 | User3 | User4 | User5 |
---|---|---|---|---|---|
How to Win Friends & Influence People | 4.37 | 3.97 | 4.62 | 4.87 | 4.37 |
Change or Die The Three Keys to Change at Work and in Life | 2.74 | 2.34 | 2.99 | 3.24 | 2.74 |
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | 3.74 | 3.34 | 3.99 | 4.24 | 3.74 |
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | 4.49 | 4.09 | 4.74 | 4.99 | 4.49 |
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google | 4.24 | 3.84 | 4.49 | 4.74 | 4.24 |
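One way to build the baseline with books in rows and users in columns, matching the layout of the rating matrix, is outer(). The following is a minimal sketch, assuming the raw average and the bias values printed above are typed in by hand; the short labels Book1 to Book5, the names `mu` and `baseline_mat`, and the lower clip at 1 are illustrative choices, not part of the original code.

```r
# Values copied from the outputs above
mu <- 4.133333
user_bias <- c(User1 = 0.3666667, User2 = -1.2583333, User3 = -0.2583333,
               User4 = 0.4916667, User5 = 0.2416667)
book_bias <- c(Book1 = -0.1333333, Book2 = -0.5333333, Book3 = 0.1166667,
               Book4 = 0.3666667, Book5 = -0.1333333)

# outer() adds every book bias to every user bias:
# entry [b, u] = book_bias[b] + user_bias[u], so rows = books, columns = users
baseline_mat <- mu + outer(book_bias, user_bias, "+")

# Keep predictions inside the 1-to-5 rating scale
baseline_mat <- pmin(pmax(baseline_mat, 1), 5)

round(baseline_mat, 2)
```

In this layout, a (row, column) lookup such as baseline_mat["Book3", "User1"] refers to the same book and user as the corresponding cell of the rating matrix.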
Calculate the RMSE for the baseline predictors for both your training data and your test data.
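# Note: the sum runs over the 15 non-missing cells of train.dataset, but the
# denominator counts the 20 non-missing cells of book.df (the full data).
train_bias_RMSE <- sqrt(sum((train.dataset - baseline)^2, na.rm=TRUE) / length(book.df[!is.na(book.df)]))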
train_bias_RMSE
## [1] 0.6652459
test_baseline <- baseline[test.rating.cell]
test_baseline
## [1] 3.741667 2.341667 2.991667 4.241667 4.241667
test_bias_RMSE <- sqrt(sum((test.data.values - test_baseline)^2) / length(test.data.values))
test_bias_RMSE
## [1] 0.4655493
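When a rating matrix contains missing cells, a convenient way to keep the numerator and denominator of the RMSE in step is to average the squared errors with na.rm = TRUE, so that only cells with an observed rating are counted. The following is a minimal sketch; the function name `rmse_observed` and the small example matrices are hypothetical and are not taken from the dataset above.

```r
# RMSE over the cells where an actual rating exists.
# Both inputs must have the same shape and the same orientation.
rmse_observed <- function(actual, predicted) {
  stopifnot(all(dim(actual) == dim(predicted)))
  sq_err <- (actual - predicted)^2
  # mean(..., na.rm = TRUE) divides by the number of non-missing squared errors
  sqrt(mean(sq_err, na.rm = TRUE))
}

# Tiny illustrative example: three observed ratings, one missing
actual    <- matrix(c(4, NA, 3, 5), nrow = 2)
predicted <- matrix(c(3.5, 4, 3, 4), nrow = 2)
result <- rmse_observed(actual, predicted)
result
```

The dimension check guards against comparing a books-by-users matrix with a users-by-books one of a different shape, although for a square matrix such as the 5x5 case here the orientation still has to be verified by eye.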
Summarize your results.
# Percent reduction in test RMSE, baseline predictor vs. raw average (positive = lower RMSE)
(1-(test_bias_RMSE/test_rmse)) * 100
## [1] 55.017
# Percent reduction in training RMSE, baseline predictor vs. raw average (negative = higher RMSE)
(1-(train_bias_RMSE/train_rmse)) * 100
## [1] -3.197137
Comparing the two methods, the raw average and the baseline (bias) predictors, the test RMSE falls from 1.035 to 0.466, a reduction of about 55%, whereas the training RMSE rises from 0.645 to 0.665, an increase of about 3.2%. One likely reason for this inconsistency is the size of the toy dataset: a 5x5 matrix with only five held-out ratings makes both figures unstable. The comparison would likely be more meaningful on a larger dataset.
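The two percentages come from the relative change in RMSE when moving from the raw average to the baseline predictor. Written out with the values reported above:

```latex
\[
\left(1 - \frac{\mathrm{RMSE}_{\text{baseline}}}{\mathrm{RMSE}_{\text{raw}}}\right) \times 100
\]
\[
\text{test: } \left(1 - \frac{0.4655493}{1.034945}\right) \times 100 \approx 55.02
\qquad
\text{train: } \left(1 - \frac{0.6652459}{0.644636}\right) \times 100 \approx -3.20
\]
```

A positive value means the baseline predictor has a lower RMSE than the raw average, and a negative value means it has a higher one.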