This is an R Markdown document for providing documentation for performing analysis of customer ratings and favourites to recommend the new / untried flavors.
To facilitate the testing our baseline recommendation we will be randomly dividing the icecream data into a training set that contains around 80% of the data and a test set that contains 20% of the data.
knitr::opts_chunk$set(message = FALSE, echo = TRUE)
# To load survey data from googlesheets
library(googlesheets)
# Library for loading CSV data
library(RCurl)
# Library for data tidying
library(tidyr)
# Library for data structure operations
library(dplyr)
library(knitr)
# Library for plotting
library(ggplot2)
# Library for data display in tabular format
library(DT)
library(pander)The YumYum IceCream Shop has created a survey for the regular kids to rate their flavors.
Here is the survey link:
https://docs.google.com/forms/d/e/1FAIpQLSeWvKMD8LiQOkBzJPB-4Sf_50rFYXLw5mVib7iMQCvin2GqxA/viewform
The responses from survey can be found here :
https://docs.google.com/spreadsheets/d/1WnoRC8ZPpie22xf_MJ0egfMUFG2zvO7g7FEftvNrrnQ/edit?usp=sharing
# Loading data from googlesheets, first finding the relevant sheet , reading the
# sheet and relevant worksheet
gs_ls()## # A tibble: 4 x 10
## sheet_title author perm version updated
## <chr> <chr> <chr> <chr> <dttm>
## 1 YumYum IceCream java.messagi<U+0085> rw new 2017-06-16 18:06:05
## 2 IS606 Fall 2016 Present<U+0085> jason.bryer rw new 2017-03-15 23:05:15
## 3 Untitled spreadsheet kumudini.bha<U+0085> rw new 2016-12-21 20:47:52
## 4 IS606 Spring 2016 Prese<U+0085> jason.bryer rw new 2016-05-22 05:29:14
## # ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
## # alternate <chr>, self <chr>, alt_key <chr>
icedata.url <- gs_title("YumYum IceCream")
icedata.csv <- gs_read_csv(ss = icedata.url, ws = "Form Responses 1")
# convert to data.frame
icedata <- as.data.frame(icedata.csv)
# Verifying records and variables
nrow(icedata)## [1] 7
ncol(icedata)## [1] 9
# datatable(icedata)icedataorig <- icedata
icedata <- icedata %>% select(-Timestamp, -Name)
colnames(icedata) <- c("NuttyBtrSc", "BlkForest", "Rainbow", "Casata", "Choco", "StrawBerry",
"Coconut")
rownames(icedata) <- icedataorig$Name
# library(recommenderlab)
library(dplyr)
# creating test and traing dataset by randomly excluding some of the rating items
# from icedata
temptrainice <- icedata
temptestice <- data.frame(matrix(NA, nrow = nrow(icedata), ncol = ncol(icedata)))
colnames(temptestice) <- colnames(icedata)
rownames(temptestice) <- rownames(icedata)
set.seed(101)
for (i in seq(1, 7)) {
xr <- sample(seq_len(nrow(icedata)), 1, replace = FALSE)
xc <- sample(seq_len(ncol(icedata)), 1, replace = FALSE)
# cat('\n xr ' , xr,' xc ' , xc)
temptestice[xr, xc] <- icedata[xr, xc]
temptrainice[xr, xc] <- NA
}
trainnew <- temptrainice
testnew <- temptestice
# Not sure, if this way of partitioning data into training and test should be ok
# when we need bias for each user and each item. IS there a sampling method,
# that would take samples randomly but ensuring each user/item are included.??
# randomobs <- sample(seq_len(nrow(icedata)), size = floor(0.8 * nrow(icedata)))
# trainnew <- icedata[randomobs,] testnew <- icedata[-randomobs,]pander(icedata)| NuttyBtrSc | BlkForest | Rainbow | Casata | Choco | StrawBerry | Coconut | |
|---|---|---|---|---|---|---|---|
| Molly | 3 | 5 | 1 | 4 | NA | 1 | NA |
| Adi | 4 | 1 | 4 | 1 | NA | 4 | NA |
| Tom | 3 | 5 | NA | NA | 5 | 2 | 4 |
| Pinky | 4 | 4 | NA | 4 | 4 | 5 | 5 |
| Dino | 1 | 1 | 1 | 1 | 2 | NA | 2 |
| Peter | NA | 2 | NA | 2 | 4 | 4 | 4 |
| Dolly | 4 | NA | 4 | 4 | 4 | 4 | 4 |
pander(trainnew)| NuttyBtrSc | BlkForest | Rainbow | Casata | Choco | StrawBerry | Coconut | |
|---|---|---|---|---|---|---|---|
| Molly | 3 | 5 | 1 | 4 | NA | 1 | NA |
| Adi | 4 | 1 | NA | 1 | NA | 4 | NA |
| Tom | NA | 5 | NA | NA | 5 | 2 | 4 |
| Pinky | 4 | 4 | NA | 4 | 4 | 5 | 5 |
| Dino | 1 | 1 | NA | NA | NA | NA | 2 |
| Peter | NA | 2 | NA | 2 | 4 | 4 | NA |
| Dolly | 4 | NA | 4 | 4 | NA | 4 | 4 |
pander(testnew)| NuttyBtrSc | BlkForest | Rainbow | Casata | Choco | StrawBerry | Coconut | |
|---|---|---|---|---|---|---|---|
| Molly | NA | NA | NA | NA | NA | NA | NA |
| Adi | NA | NA | 4 | NA | NA | NA | NA |
| Tom | 3 | NA | NA | NA | NA | NA | NA |
| Pinky | NA | NA | NA | NA | NA | NA | NA |
| Dino | NA | NA | 1 | 1 | 2 | NA | NA |
| Peter | NA | NA | NA | NA | NA | NA | 4 |
| Dolly | NA | NA | NA | NA | 4 | NA | NA |
# Function to calculate Raw Average
rawavg <- function(ds){
nr <- nrow(ds)
nc <- ncol(ds)
tot <- 0
cnt <- 0
for(i in seq(1,nr)) # for each row
{
for(j in seq(1,nc))
{
if (!(is.na(ds[i,j])))
{
# cat("\nelem : ", ds[i,j],"\n")
tot <- tot + ds[i,j]
cnt <- cnt + 1
}
}
}
# cat("\nTable Tot : ", tot,"\n")
# cat("Table Cnt : ","\n",cnt,"\n")
return (tot /cnt)
}
rawav <- rawavg(trainnew)
rawav## [1] 3.290323
The raw average is 3.2903226
# Function to calculate Root Mean Square Error (RMSE)
calcrmse <- function(ds, ravg){
nr <- nrow(ds)
nc <- ncol(ds)
tot <- 0
cnt <- 0
for(i in seq(1,nr)) # for each row
{
for(j in seq(1,nc))
{
if (!(is.na(ds[i,j])))
{
# cat("\nelem : ", ds[i,j],"\n")
diff <- ds[i,j] - ravg
tot <- tot + diff^2
cnt <- cnt + 1
}
}
}
# cat("\nTable Tot : ", tot,"\n")
# cat("Table Cnt : ","\n",cnt,"\n")
mse <- tot / cnt
rmse <- sqrt(mse)
return (rmse)
}
# RMSE for train data
rmse_train <- calcrmse(trainnew, rawav)
rmse_train## [1] 1.395697
# RMSE for test data
rmse_test <- calcrmse(testnew, rawav)
rmse_test## [1] 1.401596
The RMSE for training set is 1.3956973 and RMSE for test set is 1.4015961 We find very little difference is the RMSE for training and test sets.
# Function to calculate bias for child and flavor
calcbias <- function(ds, ravg) {
ds <- trainnew
nr <- nrow(ds)
nc <- ncol(ds)
tot <- 0
cnt <- 0
# Finding bias for the child
childavgrate <- rowMeans(ds[, c(1:7)], na.rm = TRUE)
childbias <- childavgrate - ravg
# Finding bias for the icecrea flavor
iceavgrate <- colMeans(ds[, c(1:7)], na.rm = TRUE)
icebias <- iceavgrate - ravg
# Forming bias table for child and icecream flavor
childtab <- data.frame(cbind(row.names(ds), as.numeric(childavgrate), as.numeric(childbias)),
stringsAsFactors = FALSE)
colnames(childtab) <- c("childname", "avgrate", "bias")
icetab <- data.frame(cbind(colnames(ds), as.numeric(iceavgrate), as.numeric(icebias)),
stringsAsFactors = FALSE)
colnames(icetab) <- c("flavor", "avgrate", "bias")
# return list of dataframe and bias tables
return(list(ds = ds, childtab = childtab, icetab = icetab))
}
listbias <- calcbias(trainnew, rawav)pander(listbias$childtab, caption = "Child Bias Table")| childname | avgrate | bias |
|---|---|---|
| Molly | 2.8 | -0.490322580645162 |
| Adi | 2.5 | -0.790322580645161 |
| Tom | 4 | 0.709677419354839 |
| Pinky | 4.33333333333333 | 1.04301075268817 |
| Dino | 1.33333333333333 | -1.95698924731183 |
| Peter | 3 | -0.290322580645161 |
| Dolly | 4 | 0.709677419354839 |
pander(listbias$icetab, caption = "IceCream Bias Table")| flavor | avgrate | bias |
|---|---|---|
| NuttyBtrSc | 3.2 | -0.0903225806451613 |
| BlkForest | 3 | -0.290322580645161 |
| Rainbow | 2.5 | -0.790322580645161 |
| Casata | 3 | -0.290322580645161 |
| Choco | 4.33333333333333 | 1.04301075268817 |
| StrawBerry | 3.33333333333333 | 0.043010752688172 |
| Coconut | 3.75 | 0.459677419354839 |
# Function to calculate Baseline Predictor for each child-flavor combination
calcbaselinepred <- function(ds) {
# initialize the new baseline dataframe
citab <- data.frame(matrix(NA, nrow = nrow(icedata), ncol = ncol(icedata)))
# Fetch raw average for the passed dataset
rawav <- rawavg(trainnew)
rawav
# Fetch bias for user and item for the dataset
lbias <- calcbias(trainnew, rawav)
chtab <- lbias$childtab
ictab <- lbias$icetab
chtab
ictab
# Iterate through user and item bias dataframes to find baseline for each
# user-item combination
for (c in seq(1, nrow(chtab))) {
for (i in seq(1, nrow(ictab))) {
# cat('\n chtab[c,3] :', c , ' ',as.numeric(chtab[c,]$bias) ,' ictab[i,3] ', i ,
# ' ', as.numeric(ictab[i,]$bias))
bline <- rawav + as.numeric(chtab[c, ]$bias) + as.numeric(ictab[i, ]$bias)
if (bline < 1) {
bline <- 1
} else if (bline > 5) {
bline <- 5
}
citab[c, i] <- bline
}
}
colnames(citab) <- ictab$flavor
rownames(citab) <- chtab$childname
return(citab = citab)
}childicecreambaseline <- calcbaselinepred(trainnew)The baseline predictors are presented in the comparison with training and test sets below in RMSE section.
# Function to calculate Root Mean Square Error (RMSE)
calcbaselinermse <- function(ds, dsbaseline){
# the dimensions of both dataframes are similar
nr <- nrow(ds)
nc <- ncol(ds)
tot <- 0
cnt <- 0
for(i in seq(1,nr)) # for each row
{
for(j in seq(1,nc))
{
if (!(is.na(ds[i,j])))
{
# cat("\nelem : ", ds[i,j],"\n")
diff <- ds[i,j] - dsbaseline[i,j]
tot <- tot + diff^2
cnt <- cnt + 1
}
}
}
# cat("\nTable Tot : ", tot,"\n")
# cat("Table Cnt : ","\n",cnt,"\n")
msebaseline <- tot / cnt
rmsebaseline <- sqrt(msebaseline)
return (rmsebaseline)
}par(mfrow = c(1, 2))
pander(trainnew)| NuttyBtrSc | BlkForest | Rainbow | Casata | Choco | StrawBerry | Coconut | |
|---|---|---|---|---|---|---|---|
| Molly | 3 | 5 | 1 | 4 | NA | 1 | NA |
| Adi | 4 | 1 | NA | 1 | NA | 4 | NA |
| Tom | NA | 5 | NA | NA | 5 | 2 | 4 |
| Pinky | 4 | 4 | NA | 4 | 4 | 5 | 5 |
| Dino | 1 | 1 | NA | NA | NA | NA | 2 |
| Peter | NA | 2 | NA | 2 | 4 | 4 | NA |
| Dolly | 4 | NA | 4 | 4 | NA | 4 | 4 |
pander(childicecreambaseline)| NuttyBtrSc | BlkForest | Rainbow | Casata | Choco | StrawBerry | Coconut | |
|---|---|---|---|---|---|---|---|
| Molly | 2.71 | 2.51 | 2.01 | 2.51 | 3.843 | 2.843 | 3.26 |
| Adi | 2.41 | 2.21 | 1.71 | 2.21 | 3.543 | 2.543 | 2.96 |
| Tom | 3.91 | 3.71 | 3.21 | 3.71 | 5 | 4.043 | 4.46 |
| Pinky | 4.243 | 4.043 | 3.543 | 4.043 | 5 | 4.376 | 4.793 |
| Dino | 1.243 | 1.043 | 1 | 1.043 | 2.376 | 1.376 | 1.793 |
| Peter | 2.91 | 2.71 | 2.21 | 2.71 | 4.043 | 3.043 | 3.46 |
| Dolly | 3.91 | 3.71 | 3.21 | 3.71 | 5 | 4.043 | 4.46 |
# RMSE for train data
rmsebaseline_train <- calcbaselinermse(trainnew, childicecreambaseline)
rmsebaseline_train## [1] 1.000087
par(mfrow = c(1, 2))
pander(testnew)| NuttyBtrSc | BlkForest | Rainbow | Casata | Choco | StrawBerry | Coconut | |
|---|---|---|---|---|---|---|---|
| Molly | NA | NA | NA | NA | NA | NA | NA |
| Adi | NA | NA | 4 | NA | NA | NA | NA |
| Tom | 3 | NA | NA | NA | NA | NA | NA |
| Pinky | NA | NA | NA | NA | NA | NA | NA |
| Dino | NA | NA | 1 | 1 | 2 | NA | NA |
| Peter | NA | NA | NA | NA | NA | NA | 4 |
| Dolly | NA | NA | NA | NA | 4 | NA | NA |
pander(childicecreambaseline)| NuttyBtrSc | BlkForest | Rainbow | Casata | Choco | StrawBerry | Coconut | |
|---|---|---|---|---|---|---|---|
| Molly | 2.71 | 2.51 | 2.01 | 2.51 | 3.843 | 2.843 | 3.26 |
| Adi | 2.41 | 2.21 | 1.71 | 2.21 | 3.543 | 2.543 | 2.96 |
| Tom | 3.91 | 3.71 | 3.21 | 3.71 | 5 | 4.043 | 4.46 |
| Pinky | 4.243 | 4.043 | 3.543 | 4.043 | 5 | 4.376 | 4.793 |
| Dino | 1.243 | 1.043 | 1 | 1.043 | 2.376 | 1.376 | 1.793 |
| Peter | 2.91 | 2.71 | 2.21 | 2.71 | 4.043 | 3.043 | 3.46 |
| Dolly | 3.91 | 3.71 | 3.21 | 3.71 | 5 | 4.043 | 4.46 |
# RMSE for test data
rmsebaseline_test <- calcbaselinermse(testnew, childicecreambaseline)
rmsebaseline_test## [1] 1.035686
We find an improvement in the RMSE in training as well as the test sets with the baseline predictions method. Hence it offers a better recommendation option.
We see the improvements as follows.
train_impr <- 1 - rmsebaseline_train/rmse_train
train_impr## [1] 0.2834502
test_impr <- 1 - rmsebaseline_test/rmse_test
test_impr## [1] 0.2610664
The % improvement of RMSE by baseline prediction method as comapred to previous calculated RMSE for training set 28.3450212 %
The % improvement of RMSE by baseline prediction method as comapred to previous calculated RMSE for test set 26.106644 %
We find that there is more or less equivalent improvement in the RMSE for both test and training set with the baseline prediction
Choco flavors to Molly and Adi, They are surely going to like it ! We are only hoping they will like the Coconut flavor though .
As for Tom, he may be ok for TuttiFruiti , but as we know he loves cake icecream he should surely go for Casata !
Pinky loves most of icecreams and we feel she will like the TuttiFruiti Rainbow at least above average
Dino is picky and he doesn’t seem to like cake icecreams, we feel he will like the Strawberry Flavour as he likes plain ones more. We hope so.!
Peter is more into single flavours as comapred to cake icecreams , but he hasnt tried the nutty ones, we hope he likes it about average at most the ButterScotch and TuttiFruiti, as he is not as much picky.
Dolly is all for icecreams and we are eager for her to grab the BlackForest cake icecream ’coz we know she will enjoy it!