1 Recommender System

Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”

This system recommends books to readers based on the ratings submitted by users.

2 Dataset Description

Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.

This is a toy dataset of book ratings from five users across five books. Users rate books on a scale of 1 to 5, and some ratings are missing. The dataset is stored in JSON format.

3 Dataset Ingestion

Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists (or another data structure of your choosing). From there, create a user-item matrix. If you choose to work with a large dataset, you’re encouraged to also create a small, relatively dense “user-item” matrix as a subset so that you can hand-verify your calculations.

library(tidyverse)
library(jsonlite)
library(knitr)
library(RCurl)

# Download the toy ratings file and parse the JSON into a data frame
json.url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/books_rating.json"
json.file <- getURLContent(json.url)
json.df <- as.data.frame(fromJSON(json.file[[1]]))
# Drop the "books." prefix from column names, then replace the first remaining dot with a space
colnames(json.df) <- str_replace(colnames(json.df),"books\\.", "")
colnames(json.df) <- str_replace(colnames(json.df),"\\.", " ")
dim(json.df)

## [1] 5 6

rownames(json.df) <- json.df$Title
book.df <- select(json.df,User1:User5)
kable(book.df)

	User1	User2	User3	User4	User5
How to Win Friends & Influence People	4.5	3	3.5	5	NA
Change or Die The Three Keys to Change at Work and in Life	4	2	3.5	4.5	4
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	4.5	NA	4	4.5	4
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	5	3.5	NA	4.5	5
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google	NA	3	4.5	NA	4.5

bk.matrix <- data.matrix(book.df)
kable(bk.matrix)

	User1	User2	User3	User4	User5
How to Win Friends & Influence People	4.5	3.0	3.5	5.0	NA
Change or Die The Three Keys to Change at Work and in Life	4.0	2.0	3.5	4.5	4.0
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	4.5	NA	4.0	4.5	4.0
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	5.0	3.5	NA	4.5	5.0
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google	NA	3.0	4.5	NA	4.5
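
For hand verification without a network connection, the same matrix can be typed in directly. This is only a sketch: the values are copied from the table above, while the shortened book labels and the names `ratings`, `n_missing`, and `n_observed` are illustrative and not part of the original code.

```r
# Hypothetical short labels standing in for the full book titles
books <- c("Win Friends", "Change or Die", "R for Data Science",
           "DS for Business", "The Four")
users <- paste0("User", 1:5)

# Ratings copied from the table above; rows are books, columns are users
ratings <- matrix(
  c(4.5, 3.0, 3.5, 5.0, NA,
    4.0, 2.0, 3.5, 4.5, 4.0,
    4.5, NA,  4.0, 4.5, 4.0,
    5.0, 3.5, NA,  4.5, 5.0,
    NA,  3.0, 4.5, NA,  4.5),
  nrow = 5, byrow = TRUE,
  dimnames = list(books, users)
)

# Count the missing and observed ratings
n_missing  <- sum(is.na(ratings))
n_observed <- sum(!is.na(ratings))
```

Counting the missing and observed cells is a quick check that the matrix was entered as intended before any averages are computed.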

4 Build Train and Test Dataset

Break your ratings into separate training and test datasets.

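# Each row is a (book row, user column) index of a rating held out for testing
test.rating.cell <- rbind(c(3,1), c(2,2), c(2,3), c(3,4), c(5,5))
# Indexing a matrix with a two-column matrix picks out exactly those cells
test.data.values <- as.numeric(bk.matrix[test.rating.cell])
test.data.values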

## [1] 4.5 2.0 3.5 4.5 4.5

train.dataset <- bk.matrix
train.dataset[test.rating.cell] <- NA
kable(train.dataset)

	User1	User2	User3	User4	User5
How to Win Friends & Influence People	4.5	3.0	3.5	5.0	NA
Change or Die The Three Keys to Change at Work and in Life	4.0	NA	NA	4.5	4
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	NA	NA	4.0	NA	4
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	5.0	3.5	NA	4.5	5
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google	NA	3.0	4.5	NA	NA
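
The held-out cells above were chosen by hand. As an alternative sketch, the same kind of split can be drawn at random from the observed ratings. The seed value and the names `ratings`, `observed`, `held_out`, `test_values`, and `train` are assumptions made for this illustration.

```r
# Ratings copied from the user-item table; rows are books, columns are users
ratings <- matrix(
  c(4.5, 3.0, 3.5, 5.0, NA,
    4.0, 2.0, 3.5, 4.5, 4.0,
    4.5, NA,  4.0, 4.5, 4.0,
    5.0, 3.5, NA,  4.5, 5.0,
    NA,  3.0, 4.5, NA,  4.5),
  nrow = 5, byrow = TRUE
)

set.seed(643)  # arbitrary seed so the split is repeatable

# (row, col) positions of every observed rating
observed <- which(!is.na(ratings), arr.ind = TRUE)

# Hold out 5 of the observed cells for testing
held_out <- observed[sample(nrow(observed), 5), , drop = FALSE]
test_values <- ratings[held_out]

# The training matrix is the original with the held-out cells blanked
train <- ratings
train[held_out] <- NA
```

On a matrix this small, a random split can leave a book or user with very few training ratings, which is one reason a hand-picked split may be preferable here.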

5 Raw Average and RMSE Calculation

Using your training data, calculate the raw average (mean) rating for every user-item combination.

train_mean <- mean(as.numeric(as.matrix(train.dataset)), na.rm=TRUE)
train_mean

## [1] 4.133333

train.err <- train.dataset - train_mean
kable(train.err)

	User1	User2	User3	User4	User5
How to Win Friends & Influence People	0.3666667	-1.1333333	-0.6333333	0.8666667	NA
Change or Die The Three Keys to Change at Work and in Life	-0.1333333	NA	NA	0.3666667	-0.1333333
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	NA	NA	-0.1333333	NA	-0.1333333
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	0.8666667	-0.6333333	NA	0.3666667	0.8666667
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google	NA	-1.1333333	0.3666667	NA	NA

Calculate the RMSE for raw average for both your training data and your test data.

train_rmse <- sqrt(mean((train.err^2),na.rm = TRUE))
train_rmse

## [1] 0.644636

A function to calculate RMSE

rmseCal <- function(actual, predict){
  val <- (actual-predict)^2
  sqrt(mean(val))
}

test_rmse <- rmseCal(test.data.values,train_mean)
test_rmse

## [1] 1.034945
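
The two values above can be checked by hand. In the sketch below, the symbols are notation introduced here rather than taken from the original: $T$ is the set of 15 observed training cells, $S$ is whichever set is being scored, $r_{iu}$ is the rating of book $i$ by user $u$, and $\mu$ is the raw training average.

```latex
\begin{align*}
\mu &= \frac{1}{|T|}\sum_{(i,u)\in T} r_{iu} = \frac{62}{15} \approx 4.1333 \\
\mathrm{RMSE}(S) &= \sqrt{\frac{1}{|S|}\sum_{(i,u)\in S}\left(r_{iu}-\mu\right)^{2}} \\
\mathrm{RMSE}_{\text{train}} &\approx \sqrt{\frac{6.2333}{15}} \approx 0.6446 \\
\mathrm{RMSE}_{\text{test}} &\approx \sqrt{\frac{5.3556}{5}} \approx 1.0349
\end{align*}
```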

6 Bias

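Using your training data, calculate the bias for each user and each item.

The user bias and book bias below are the column and row means of the full ratings matrix (bk.matrix) minus the training mean, so the held-out test ratings also contribute to the biases.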

user_bias <- colMeans(bk.matrix, na.rm=TRUE) - train_mean
kable(user_bias)

	x
User1	0.3666667
User2	-1.2583333
User3	-0.2583333
User4	0.4916667
User5	0.2416667

book_bias <- rowMeans(bk.matrix, na.rm=TRUE) - train_mean
kable(book_bias)

	x
How to Win Friends & Influence People	-0.1333333
Change or Die The Three Keys to Change at Work and in Life	-0.5333333
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	0.1166667
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	0.3666667
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google	-0.1333333
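
The prompt asks for biases computed from the training data. A minimal sketch of that variant is shown here, assuming the training matrix from section 4. The names `train`, `user_bias_train`, and `book_bias_train` are illustrative and do not appear in the original code.

```r
# Training matrix copied from section 4; rows are books, columns are users
train <- matrix(
  c(4.5, 3.0, 3.5, 5.0, NA,
    4.0, NA,  NA,  4.5, 4.0,
    NA,  NA,  4.0, NA,  4.0,
    5.0, 3.5, NA,  4.5, 5.0,
    NA,  3.0, 4.5, NA,  NA),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Book", 1:5), paste0("User", 1:5))
)

# Raw average over the observed training ratings
train_mean <- mean(train, na.rm = TRUE)

# Bias = mean of a user's (or book's) training ratings minus the raw average
user_bias_train <- colMeans(train, na.rm = TRUE) - train_mean
book_bias_train <- rowMeans(train, na.rm = TRUE) - train_mean
```

With so few ratings per row and column, removing the test cells shifts several biases noticeably, so the two variants need not agree.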

7 Baseline Predictor

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
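
Written out, the baseline predictor adds the two biases to the raw average and caps the result at the top of the rating scale. The symbols below are notation introduced for this sketch: $\mu$ is the raw training average, $b_u$ the user bias, $b_i$ the book bias, and $\hat{r}_{iu}$ the predicted rating. The worked line uses the values reported above for User2 and "Change or Die".

```latex
\begin{align*}
\hat{r}_{iu} &= \min\left(5,\ \mu + b_u + b_i\right) \\
\hat{r}_{\text{Change or Die},\,\text{User2}} &= 4.1333 + (-1.2583) + (-0.5333) \approx 2.3417
\end{align*}
```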

Calculating the baseline predictor from the raw average, user bias, and book bias.

baseline <- as.data.frame(matrix(nrow=5, ncol=5))

# Column i holds user i's prediction for every book, so rows are books
# and columns are users, matching the layout of bk.matrix
for (i in 1:length(user_bias)){
  baseline[, i] <- book_bias + user_bias[i] + train_mean
}

# Cap predictions at the maximum possible rating of 5
baseline[baseline > 5] <- 5

rownames(baseline) <- json.df$Title
colnames(baseline) <- colnames(book.df)
kable(round(baseline, 2))

	User1	User2	User3	User4	User5
How to Win Friends & Influence People	4.37	2.74	3.74	4.49	4.24
Change or Die The Three Keys to Change at Work and in Life	3.97	2.34	3.34	4.09	3.84
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	4.62	2.99	3.99	4.74	4.49
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	4.87	3.24	4.24	4.99	4.74
The Four: The Hidden DNA of Amazon, Apple, Facebook, and Google	4.37	2.74	3.74	4.49	4.24

Calculate the RMSE for the baseline predictors for both your training data and your test data.

# Average the squared error over the ratings present in the training set
train_bias_RMSE <- sqrt(sum((train.dataset - baseline)^2, na.rm=TRUE) / sum(!is.na(train.dataset)))
train_bias_RMSE

## [1] 0.3531603

test_baseline <- baseline[test.rating.cell]
test_baseline

## [1] 4.616667 2.341667 3.341667 4.741667 4.241667

test_bias_RMSE <- sqrt(sum((test.data.values - test_baseline)^2) / length(test.data.values))
test_bias_RMSE

## [1] 0.2368778

8 Summary

Summarize your results.

# % reduction in test RMSE relative to the raw average (negative means an increase)
(1-(test_bias_RMSE/test_rmse)) * 100

## [1] 77.11204

# % reduction in train RMSE relative to the raw average (negative means an increase)
(1-(train_bias_RMSE/train_rmse)) * 100

## [1] 45.21554

Comparing the two methods, the raw average and the baseline predictor with user and book biases, the baseline predictor lowers RMSE by about 77.1% on the test set (1.035 to 0.237) and by about 45.2% on the training set (0.645 to 0.353). These figures come from a small 5x5 toy matrix, and the biases were computed from the full ratings matrix, so the test improvement is likely optimistic. A larger dataset would give a more reliable comparison.
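
The whole comparison can also be recomputed from the ratings table in a few vectorized lines. The following is a sketch under stated assumptions, not the original code: it hard-codes the ratings, reuses the same held-out cells, follows the choice of computing biases from the full matrix, and uses `outer()` so that books stay in rows and users in columns. The names `ratings`, `test_cells`, `rmse`, and `pct_reduction` are illustrative.

```r
# Ratings copied from the user-item table; rows are books, columns are users
ratings <- matrix(
  c(4.5, 3.0, 3.5, 5.0, NA,
    4.0, 2.0, 3.5, 4.5, 4.0,
    4.5, NA,  4.0, 4.5, 4.0,
    5.0, 3.5, NA,  4.5, 5.0,
    NA,  3.0, 4.5, NA,  4.5),
  nrow = 5, byrow = TRUE
)

# Same held-out cells as section 4: (book row, user column)
test_cells  <- rbind(c(3, 1), c(2, 2), c(2, 3), c(3, 4), c(5, 5))
test_values <- ratings[test_cells]
train <- ratings
train[test_cells] <- NA

# RMSE over the cells that are present (NA cells are skipped)
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

train_mean <- mean(train, na.rm = TRUE)

# Biases taken from the full matrix, as in section 6
user_bias <- colMeans(ratings, na.rm = TRUE) - train_mean
book_bias <- rowMeans(ratings, na.rm = TRUE) - train_mean

# outer() puts book i in row i and user j in column j; cap at the maximum rating
baseline <- pmin(outer(book_bias, user_bias, "+") + train_mean, 5)

raw_rmse <- c(train = rmse(train, train_mean),
              test  = rmse(test_values, train_mean))
baseline_rmse <- c(train = rmse(train, baseline),
                   test  = rmse(test_values, baseline[test_cells]))

# Percent reduction in RMSE relative to the raw average (negative = worse)
pct_reduction <- (1 - baseline_rmse / raw_rmse) * 100
```

Because `outer(book_bias, user_bias, "+")` fixes which bias indexes rows and which indexes columns, the prediction matrix can be compared cell by cell with the ratings matrix without any relabeling.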

DATA643_Project1

Niteen Kumar

June 6, 2018