The recommender system we build in this assignment recommends a product to a consumer on the business's own website, based on product ratings previously collected from other consumers. The use case is that, as a business, we want the shortest possible purchase funnel in order to increase conversion. For example, Colgate allows users to rate its products on its own site. Similarly, Colgate sells the same products on Amazon and Walmart (https://www.amazon.com/s?k=colgate+whitening+toothpaste&crid=33FXM4TZIJEDB&sprefix=colgate+white%2Caps%2C145&ref=nb_sb_ss_i_1_13 or https://www.walmart.com/search/?query=colgate%20toothpaste&typeahead=colgate), where shoppers also leave ratings. We can leverage these ratings data to recommend products within the brand-owned e-commerce website.
We will pick one brand and several of its products to create our dataset, and we assume that the review ratings are collected across different websites by a third-party reviews application such as Power Reviews and then provided to us.
# creating the dataset
retail <- c("colgate", "walmart", "amazon", "cvs", "rite_aid", "duane_read")
products <- c("zero", "pepermint", "optic_white", "advanced", "stain_fighter", "charcoal")
zero <- c(5,4,4,5,3,2)
pepermint <- c(3,4,5,2,5,4)
optic_white <- c(5,5,5,3,3,NA)
advanced <- c(2,3,3,NA,5,NA)
stain_fighter <- c(3,NA,NA,5,3,4)
charcoal <- c(1,3,2,2,NA,3)
ratings <- data.frame(zero, pepermint, optic_white, advanced, stain_fighter, charcoal)
rownames(ratings) <- retail
ratings
##            zero pepermint optic_white advanced stain_fighter charcoal
## colgate       5         3           5        2             3        1
## walmart       4         4           5        3            NA        3
## amazon        4         5           5        3            NA        2
## cvs           5         2           3       NA             5        2
## rite_aid      3         5           3        5             3       NA
## duane_read    2         4          NA       NA             4        3
Now that we have our dataset, we select a sample of ratings to hold out and create our test and train sets.
# setting up test data
samples <- rbind(c(2,3), c(2,1), c(1,3), c(1,6), c(6,6), c(3,2))
test <- as.numeric(ratings[samples])
test
## [1] 5 4 5 1 3 5
#setting up train data
train <- ratings
train[samples] <- NA
train
##            zero pepermint optic_white advanced stain_fighter charcoal
## colgate       5         3          NA        2             3       NA
## walmart      NA         4          NA        3            NA        3
## amazon        4        NA           5        3            NA        2
## cvs           5         2           3       NA             5        2
## rite_aid      3         5           3        5             3       NA
## duane_read    2         4          NA       NA             4       NA
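The held-out cells above are picked by hand. As a hedged alternative, the sketch below draws the same number of test cells at random from the rated (non-NA) cells. The seed value and the names `ratings_m`, `rated`, `holdout`, `test_v` and `train_m` are assumptions introduced for illustration; they are not part of the original analysis.

```r
# Full ratings table as a numeric matrix (rows: retailers, columns: products).
ratings_m <- cbind(
  zero          = c(5, 4, 4, 5, 3, 2),
  pepermint     = c(3, 4, 5, 2, 5, 4),
  optic_white   = c(5, 5, 5, 3, 3, NA),
  advanced      = c(2, 3, 3, NA, 5, NA),
  stain_fighter = c(3, NA, NA, 5, 3, 4),
  charcoal      = c(1, 3, 2, 2, NA, 3)
)
rownames(ratings_m) <- c("colgate", "walmart", "amazon", "cvs", "rite_aid", "duane_read")

set.seed(42)  # assumed seed, only there to make the draw repeatable

# (row, column) positions of every cell that actually holds a rating
rated <- which(!is.na(ratings_m), arr.ind = TRUE)

# hold out 6 rated cells at random
holdout <- rated[sample(nrow(rated), 6), , drop = FALSE]

test_v <- ratings_m[holdout]   # held-out ratings
train_m <- ratings_m
train_m[holdout] <- NA         # hide them from the training set
```

With a table this small, a random draw can leave a retailer or a product with no training ratings at all, which would make its bias undefined, so it is worth checking `rowSums(!is.na(train_m))` and `colSums(!is.na(train_m))` after the split.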
Now that the ratings are split into separate training and test datasets, we can calculate the raw average (mean) rating across all observed ratings in each set.
# raw average of the training set
train_average <- mean(as.matrix(train), na.rm = TRUE) # na.rm = TRUE makes sure we don't count the NA values
train_average
## [1] 3.458333
# average of test values
test_average <- mean(test)
test_average
## [1] 3.833333
Next, we calculate the RMSE of the raw average for both the training and test data. The RMSE is the square root of the average of the squared differences between the observed ratings and the raw average of the training set. In other words, we take each error (the difference between the actual and the predicted value), square it so that negative and positive errors do not cancel out, add up the squared errors, divide by the number of non-missing values, and take the square root of the result.
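Written out, the quantity looks as follows. The symbols are notation introduced here rather than taken from the original text: $K$ is the set of observed (retailer, product) cells, $r_{ui}$ is the rating in cell $(u, i)$, and $\mu$ is the raw average of the training set.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|K|} \sum_{(u,i) \in K} \left(r_{ui} - \mu\right)^2}
$$

For the training set, the 24 observed ratings have squared errors that sum to about 27.958, so $\mathrm{RMSE} = \sqrt{27.958 / 24} \approx 1.079$. The test set uses the same $\mu$ from the training set, with $K$ replaced by the 6 held-out cells.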
# RMSE for train
# average the squared errors over the observed (non-NA) cells only
train_RMSE <- sqrt(mean((as.matrix(train) - train_average)^2, na.rm = TRUE))
train_RMSE
## [1] 1.079319
# RMSE for test
test_RMSE <- sqrt(sum(((test - train_average)^2), na.rm = TRUE) / length(test[!is.na(test)]))
test_RMSE
## [1] 1.509806
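Both RMSE computations above have the same shape, so they can be wrapped in one small function. The following is a minimal sketch, assuming numeric vectors or matrices as input. The helper name `rmse` and the objects `train_m`, `test_v` and `mu` are hypothetical and do not appear in the original code.

```r
# Hypothetical helper: RMSE between observed and predicted values,
# ignoring cells where the observed value is missing.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Training ratings copied from the train table above
# (rows: retailers; columns: zero, pepermint, optic_white,
#  advanced, stain_fighter, charcoal).
train_m <- rbind(
  colgate    = c(5, 3, NA, 2, 3, NA),
  walmart    = c(NA, 4, NA, 3, NA, 3),
  amazon     = c(4, NA, 5, 3, NA, 2),
  cvs        = c(5, 2, 3, NA, 5, 2),
  rite_aid   = c(3, 5, 3, 5, 3, NA),
  duane_read = c(2, 4, NA, NA, 4, NA)
)
test_v <- c(5, 4, 5, 1, 3, 5)  # held-out ratings

mu <- mean(train_m, na.rm = TRUE)  # raw average of the training set

rmse(train_m, mu)  # raw-average RMSE on the training set
rmse(test_v, mu)   # raw-average RMSE on the test set
```

Because `mean(..., na.rm = TRUE)` divides by the number of non-missing squared errors, the denominator always matches the cells that were actually summed.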
Next, we find the bias of the consumers (grouped by the retailer where they leave their reviews) and the bias of each product.
# product bias: column means minus the raw average
product_bias <- colMeans(train, na.rm = TRUE) - train_average
# consumer (retailer) bias: row means minus the raw average
retailer_bias <- rowMeans(train, na.rm = TRUE) - train_average
product_bias
##          zero     pepermint   optic_white      advanced stain_fighter
##     0.3416667     0.1416667     0.2083333    -0.2083333     0.2916667
##      charcoal
##    -1.1250000
retailer_bias
##     colgate     walmart      amazon         cvs    rite_aid  duane_read
## -0.20833333 -0.12500000  0.04166667 -0.05833333  0.34166667 -0.12500000
To build the baseline predictors, we combine, for every retailer-product pair, the raw average with the bias of the consumers at that retailer and the bias of the product: Baseline Predictor = Raw Average + Consumer (Retailer) Bias + Product Bias
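In symbols (notation introduced here for illustration: $\mu$ is the raw average of the training set, $b_u$ the bias of retailer $u$, and $b_i$ the bias of product $i$), the baseline predictor for one cell is:

$$
b_{ui} = \mu + b_u + b_i
$$

As a worked example with the values computed above, the Colgate site ($b_u = -0.208333$) and the Zero product ($b_i = 0.341667$) give:

$$
b_{\text{colgate},\,\text{zero}} = 3.458333 + (-0.208333) + 0.341667 = 3.591667
$$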
# set up the empty baseline matrix
baseline <- as.data.frame(matrix(nrow = 6, ncol = 6))
# fill one row per retailer: every product bias + that retailer's bias + raw average
for (i in seq_along(retailer_bias)) {
  baseline[i, ] <- product_bias + retailer_bias[i] + train_average
}
rownames(baseline) <- retail
colnames(baseline) <- colnames(ratings)
baseline
##                zero pepermint optic_white advanced stain_fighter charcoal
## colgate    3.591667  3.391667    3.458333 3.041667      3.541667 2.125000
## walmart    3.675000  3.475000    3.541667 3.125000      3.625000 2.208333
## amazon     3.841667  3.641667    3.708333 3.291667      3.791667 2.375000
## cvs        3.741667  3.541667    3.608333 3.191667      3.691667 2.275000
## rite_aid   4.141667  3.941667    4.008333 3.591667      4.091667 2.675000
## duane_read 3.675000  3.475000    3.541667 3.125000      3.625000 2.208333
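The row-by-row loop above can also be written without a loop. The sketch below is one possible vectorized form using base R's `outer()`. It works on a numeric matrix instead of a data frame, and the names `train_m`, `mu` and `baseline_m` are assumptions introduced for illustration.

```r
products <- c("zero", "pepermint", "optic_white", "advanced", "stain_fighter", "charcoal")

# Training ratings copied from the train table above.
train_m <- rbind(
  colgate    = c(5, 3, NA, 2, 3, NA),
  walmart    = c(NA, 4, NA, 3, NA, 3),
  amazon     = c(4, NA, 5, 3, NA, 2),
  cvs        = c(5, 2, 3, NA, 5, 2),
  rite_aid   = c(3, 5, 3, 5, 3, NA),
  duane_read = c(2, 4, NA, NA, 4, NA)
)
colnames(train_m) <- products

mu            <- mean(train_m, na.rm = TRUE)            # raw average
retailer_bias <- rowMeans(train_m, na.rm = TRUE) - mu   # one value per row
product_bias  <- colMeans(train_m, na.rm = TRUE) - mu   # one value per column

# outer() pairs every retailer bias with every product bias in one call;
# adding mu gives the full 6 x 6 table of baseline predictors.
baseline_m <- mu + outer(retailer_bias, product_bias, "+")

round(baseline_m, 6)
```

Because the bias vectors carry names, `outer()` labels the rows and columns of `baseline_m` automatically, so a single prediction can be read as `baseline_m["colgate", "zero"]`.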
# train baseline RMSE
# divide by the number of observed training ratings (24), not by the count in the full ratings table (30)
baseline_train_RMSE <- sqrt(sum((train - baseline)^2, na.rm = TRUE) / sum(!is.na(train)))
baseline_train_RMSE
## [1] 0.964089
# test baseline RMSE
baseline_test <- baseline[samples]
baseline_test_RMSE <- sqrt(sum((test - baseline_test)^2) / length(test))
baseline_test_RMSE
## [1] 1.179444
# Summarize table results
# percent change
train_change <- round((1 - (baseline_train_RMSE / train_RMSE)) * 100, 2)
test_change <- round((1 - (baseline_test_RMSE / test_RMSE)) * 100, 2)
Raw_Average <- c(train_average, test_average)
RMSE <- c(train_RMSE, test_RMSE)
Baseline_RMSE <- c(baseline_train_RMSE, baseline_test_RMSE)
Change_Percent <- c(train_change, test_change)
results <- data.frame(Raw_Average, RMSE, Baseline_RMSE, Change_Percent)
row.names(results) <- c("Training Set", "Test Set")
results
##              Raw_Average     RMSE Baseline_RMSE Change_Percent
## Training Set    3.458333 1.079319      0.964089          10.68
## Test Set        3.833333 1.509806      1.179444          21.88
Looking at the retailer-product interactions, consumers who shop at Duane Reade give some harsh ratings, or no ratings at all, especially for the “Colgate Zero” and “Charcoal” products. More generally, consumers who shop at different retailers rate the same products differently. This observation is the basis of the baseline predictor: we start with the raw average, then add a bias for the consumer (here, the retailer) and a bias for the product. We calculated both biases in order to build the baseline predictors. The baseline RMSE is lower than the raw-average RMSE on both sets: it falls from 1.08 to 0.96 on the training set (about an 11% improvement) and from 1.51 to 1.18 on the test set (about 22%). For both the training set and the test set, the baseline RMSE stays at roughly one rating point on the 1-to-5 scale.
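To connect the baseline predictors back to the stated goal of recommending a product, the sketch below picks, for each retailer, the product that has no rating in the full table and the highest baseline prediction. This selection rule is an assumption made for illustration, not a step of the original analysis, and the names `ratings_m`, `holdout`, `candidates` and `recommend` are hypothetical.

```r
# Full ratings table (rows: retailers, columns: products).
ratings_m <- cbind(
  zero          = c(5, 4, 4, 5, 3, 2),
  pepermint     = c(3, 4, 5, 2, 5, 4),
  optic_white   = c(5, 5, 5, 3, 3, NA),
  advanced      = c(2, 3, 3, NA, 5, NA),
  stain_fighter = c(3, NA, NA, 5, 3, 4),
  charcoal      = c(1, 3, 2, 2, NA, 3)
)
rownames(ratings_m) <- c("colgate", "walmart", "amazon", "cvs", "rite_aid", "duane_read")

# Same held-out (row, column) cells as in the text.
holdout <- rbind(c(2, 3), c(2, 1), c(1, 3), c(1, 6), c(6, 6), c(3, 2))
train_m <- ratings_m
train_m[holdout] <- NA

# Baseline predictors from the training set.
mu <- mean(train_m, na.rm = TRUE)
retailer_bias <- rowMeans(train_m, na.rm = TRUE) - mu
product_bias  <- colMeans(train_m, na.rm = TRUE) - mu
baseline_m <- mu + outer(retailer_bias, product_bias, "+")

# Keep predictions only for cells with no rating in the full table.
candidates <- baseline_m
candidates[!is.na(ratings_m)] <- NA

# Per retailer: name of the unrated product with the highest prediction,
# or NA when every product already has a rating.
recommend <- apply(candidates, 1, function(p) {
  if (all(is.na(p))) NA_character_ else names(which.max(p))
})
recommend
```

A retailer whose row is fully rated, such as the Colgate site here, gets `NA`, since there is nothing new to suggest under this rule.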