Global Baseline Predictors and RMSE

Data Collection

We pick one brand and several of its products to build our dataset. We assume the review ratings were collected from different retailer websites through a third-party reviews application such as PowerReviews and then provided to us.

# creating the dataset

retail <- c("colgate", "walmart", "amazon", "cvs", "rite_aid", "duane_read")

products <- c("zero", "pepermint", "optic_white", "advanced", "stain_fighter", "charcoal")

zero <- c(5,4,4,5,3,2)
pepermint <- c(3,4,5,2,5,4)
optic_white <- c(5,5,5,3,3,NA)
advanced <- c(2,3,3,NA,5,NA)
stain_fighter <- c(3,NA,NA,5,3,4)
charcoal <- c(1,3,2,2,NA,3)

ratings <- data.frame(zero, pepermint, optic_white, advanced, stain_fighter, charcoal)
rownames(ratings) <- retail
ratings

##            zero pepermint optic_white advanced stain_fighter charcoal
## colgate       5         3           5        2             3        1
## walmart       4         4           5        3            NA        3
## amazon        4         5           5        3            NA        2
## cvs           5         2           3       NA             5        2
## rite_aid      3         5           3        5             3       NA
## duane_read    2         4          NA       NA             4        3

Breaking up Test and Train Datasets

With the dataset in place, we hold out six hand-picked retailer-product cells as the test set and keep the remaining ratings as the training set.

# setting up test data: each row of samples is a (row, column) index into ratings
samples <- rbind(c(2,3), c(2,1), c(1,3), c(1,6), c(6,6), c(3,2))
test <- as.numeric(ratings[samples])
test

## [1] 5 4 5 1 3 5

#setting up train data
train <- ratings
train[samples] <- NA
train

##            zero pepermint optic_white advanced stain_fighter charcoal
## colgate       5         3          NA        2             3       NA
## walmart      NA         4          NA        3            NA        3
## amazon        4        NA           5        3            NA        2
## cvs           5         2           3       NA             5        2
## rite_aid      3         5           3        5             3       NA
## duane_read    2         4          NA       NA             4       NA
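
The held-out cells above are chosen by hand. As a minimal sketch of an alternative, the same kind of split could be drawn at random from the observed ratings. The seed value and the names ratings_m, observed, held_out, test_alt and train_alt are assumptions introduced only for this illustration; they are not part of the original analysis.

```r
# Rebuild the ratings as a plain numeric matrix so the sketch is self-contained
ratings_m <- cbind(
  zero          = c(5, 4, 4, 5, 3, 2),
  pepermint     = c(3, 4, 5, 2, 5, 4),
  optic_white   = c(5, 5, 5, 3, 3, NA),
  advanced      = c(2, 3, 3, NA, 5, NA),
  stain_fighter = c(3, NA, NA, 5, 3, 4),
  charcoal      = c(1, 3, 2, 2, NA, 3)
)

set.seed(42)  # arbitrary seed, assumed here only for reproducibility

# (row, column) positions of every observed (non-NA) rating
observed <- which(!is.na(ratings_m), arr.ind = TRUE)

# draw six observed cells at random to hold out as the test set
held_out <- observed[sample(nrow(observed), 6), , drop = FALSE]

test_alt <- ratings_m[held_out]   # held-out ratings
train_alt <- ratings_m
train_alt[held_out] <- NA         # blank the held-out cells in the training copy
```

Sampling only from the observed cells guarantees that the test set never contains a missing rating.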

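Calculating the Raw Average (Mean) Rating

Now that the ratings are split into separate training and test datasets, we can calculate the raw average (mean) rating of each set. The training average is the simplest possible predictor: it assigns the same rating to every retailer-product combination.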

# average of training set
train_average <- mean(as.matrix(train), na.rm = TRUE) # na.rm = TRUE so the NA values are not counted
train_average

## [1] 3.458333

# average of test values
test_average <- mean(test)
test_average

## [1] 3.833333

Calculating RMSE for Train and Test

Next, we calculate the RMSE of the raw average on both the training and test data. The RMSE is the square root of the average of the squared differences between the actual ratings and the predicted rating, which here is the raw training average. In other words, we take each error (the difference between the actual and predicted value), square it so that positive and negative errors do not cancel, sum the squared errors, divide by the number of non-missing ratings, and take the square root of the result.

# RMSE for train
train_SSE <- sum((train - train_average)^2, na.rm = TRUE) # sum of squared errors, skipping NA cells
train_RMSE <- sqrt(train_SSE / sum(!is.na(train))) # divide by the number of observed training ratings
train_RMSE

## [1] 1.079319

# RMSE for test
test_RMSE <- sqrt(sum(((test - train_average)^2), na.rm = TRUE) / length(test[!is.na(test)]))
test_RMSE

## [1] 1.509806
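
Since the same calculation is repeated for several predictors, it could be wrapped in a small helper. The function below is a sketch, not part of the original code; the name rmse and its arguments are hypothetical. It assumes the actual and predicted values are numeric and either the same shape or a single number.

```r
# Hypothetical helper: RMSE between actual and predicted values, ignoring NA cells.
# `actual` may be a vector, matrix or all-numeric data frame; `predicted` may be
# a single number or an object of the same shape.
rmse <- function(actual, predicted) {
  err <- as.matrix(actual - predicted)  # element-wise errors
  sqrt(mean(err^2, na.rm = TRUE))       # mean() with na.rm divides by the observed count
}

# Usage with the test ratings and the raw training average reported above
test <- c(5, 4, 5, 1, 3, 5)
train_average <- 3.458333
rmse(test, train_average)  # about 1.5098, in line with the test RMSE above
```

Because mean(..., na.rm = TRUE) divides by the number of non-missing values, the denominator always matches the data that was actually scored.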

Calculating Bias

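We now estimate two kinds of bias: the bias of consumers (those who leave reviews at a particular retailer) and the bias of each product. Each bias is the difference between the mean training rating of that retailer or product and the raw training average.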

# product bias: column means minus the raw average
product_bias <- colMeans(train, na.rm = TRUE) - train_average

# consumer (retailer) bias: row means minus the raw average
retailer_bias <- rowMeans(train, na.rm = TRUE) - train_average

product_bias

##          zero     pepermint   optic_white      advanced stain_fighter 
##     0.3416667     0.1416667     0.2083333    -0.2083333     0.2916667 
##      charcoal 
##    -1.1250000

retailer_bias

##     colgate     walmart      amazon         cvs    rite_aid  duane_read 
## -0.20833333 -0.12500000  0.04166667 -0.05833333  0.34166667 -0.12500000

Calculating Baseline Predictors

For the baseline predictors, we combine the raw average with the bias of the consumers at each retailer and the bias of each product, for every retailer-product pair. Baseline Predictor = Raw Average + Consumer (Retailer) Bias + Product Bias

# set up the empty baseline matrix
baseline <- as.data.frame(matrix(nrow=6, ncol=6))

# for each retailer, add its bias and the raw average to every product bias

for (i in seq_along(retailer_bias)){
  row <- c(product_bias + retailer_bias[i] + train_average)
  baseline[i, ] <- row
}

rownames(baseline) <- retail
colnames(baseline) <- colnames(ratings)
baseline

##                zero pepermint optic_white advanced stain_fighter charcoal
## colgate    3.591667  3.391667    3.458333 3.041667      3.541667 2.125000
## walmart    3.675000  3.475000    3.541667 3.125000      3.625000 2.208333
## amazon     3.841667  3.641667    3.708333 3.291667      3.791667 2.375000
## cvs        3.741667  3.541667    3.608333 3.191667      3.691667 2.275000
## rite_aid   4.141667  3.941667    4.008333 3.591667      4.091667 2.675000
## duane_read 3.675000  3.475000    3.541667 3.125000      3.625000 2.208333
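
The same table can also be built without a loop. The sketch below uses base R's outer() to add every retailer bias to every product bias; the bias values are copied from the output above, and the name baseline_alt is introduced only for this illustration.

```r
# Values copied from the printed results above
train_average <- 3.458333
product_bias <- c(zero = 0.3416667, pepermint = 0.1416667, optic_white = 0.2083333,
                  advanced = -0.2083333, stain_fighter = 0.2916667, charcoal = -1.125)
retailer_bias <- c(colgate = -0.20833333, walmart = -0.125, amazon = 0.04166667,
                   cvs = -0.05833333, rite_aid = 0.34166667, duane_read = -0.125)

# outer() forms every retailer + product sum; row and column names come from the vectors
baseline_alt <- train_average + outer(retailer_bias, product_bias, "+")

# Worked example for one cell: 3.458333 + (-0.208333) + 0.341667
round(baseline_alt["colgate", "zero"], 3)  # 3.592
```

The result is a 6 x 6 numeric matrix rather than a data frame, which is usually convenient for the matrix indexing used in the RMSE step.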

Calculating RMSE for the Baseline Predictor

#train baseline RMSE

baseline_train_RMSE <- sqrt(sum((train - baseline)^2, na.rm=TRUE) / sum(!is.na(train))) # divide by the 24 observed training ratings, not the 30 in ratings
baseline_train_RMSE

## [1] 0.964089

#test baseline RMSE
baseline_test <- baseline[samples]
baseline_test_RMSE <- sqrt(sum((test - baseline_test)^2) / length(test))
baseline_test_RMSE

## [1] 1.179444
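
To put the two test results side by side, the relative improvement of the baseline predictor over the raw average can be computed directly. This is a small sketch using the test RMSE values printed above; the variable names are introduced only for this illustration.

```r
# Test RMSE values copied from the output above
raw_test_rmse <- 1.509806       # raw training average used as the predictor
baseline_test_rmse <- 1.179444  # baseline predictor

# percentage reduction in test RMSE achieved by the baseline predictor
improvement <- (1 - baseline_test_rmse / raw_test_rmse) * 100
round(improvement, 1)  # 21.9
```

On this small held-out set, accounting for retailer and product bias lowers the test RMSE by roughly 22 percent compared with predicting the raw average everywhere.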

Anil Akyildirim

6/8/2020

Introduction