Recommending New Cool Flavors For The Hot Season!


Summary

This R Markdown document walks through an analysis of customer ratings and favourites in order to recommend new or untried flavors.

To test our baseline recommendation, we randomly hold out a small set of individual ratings (child-flavor cells, roughly 20% of the observed ratings) as a test set; the remaining roughly 80% of the ratings form the training set.

knitr::opts_chunk$set(message = FALSE, echo = TRUE)

# To load survey data from googlesheets
library(googlesheets)
# Library for loading CSV data
library(RCurl)
# Library for data tidying
library(tidyr)
# Library for data structure operations
library(dplyr)
library(knitr)
# Library for plotting
library(ggplot2)
# Library for data display in tabular format
library(DT)
library(pander)

Loading The IceCream Survey Data

The YumYum IceCream Shop has created a survey for its regular kids to rate its flavors.

Here is the survey link:

https://docs.google.com/forms/d/e/1FAIpQLSeWvKMD8LiQOkBzJPB-4Sf_50rFYXLw5mVib7iMQCvin2GqxA/viewform

The responses from the survey can be found here:

https://docs.google.com/spreadsheets/d/1WnoRC8ZPpie22xf_MJ0egfMUFG2zvO7g7FEftvNrrnQ/edit?usp=sharing

# Load data from Google Sheets: list the available sheets, then read the relevant
# spreadsheet and worksheet

gs_ls()
## # A tibble: 4 x 10
##                sheet_title        author  perm version             updated
##                      <chr>         <chr> <chr>   <chr>              <dttm>
## 1          YumYum IceCream java.messagi…    rw     new 2017-06-16 18:06:05
## 2 IS606 Fall 2016 Present…   jason.bryer    rw     new 2017-03-15 23:05:15
## 3     Untitled spreadsheet kumudini.bha…    rw     new 2016-12-21 20:47:52
## 4 IS606 Spring 2016 Prese…   jason.bryer    rw     new 2016-05-22 05:29:14
## # ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
## #   alternate <chr>, self <chr>, alt_key <chr>
icedata.url <- gs_title("YumYum IceCream")
icedata.csv <- gs_read_csv(ss = icedata.url, ws = "Form Responses 1")

# convert to data.frame
icedata <- as.data.frame(icedata.csv)
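
If interactive googlesheets authentication is not available, an assumed fallback is to read the responses straight from the sheet's CSV export URL (this relies on the spreadsheet staying publicly readable and on the standard export?format=csv path built from the sheet key in the link above):

# Fallback (assumption): read the shared sheet via its CSV export URL
# icedata.export.url <- "https://docs.google.com/spreadsheets/d/1WnoRC8ZPpie22xf_MJ0egfMUFG2zvO7g7FEftvNrrnQ/export?format=csv"
# icedata <- read.csv(icedata.export.url, stringsAsFactors = FALSE, check.names = FALSE)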


# Verifying records and variables

nrow(icedata)
## [1] 7
ncol(icedata)
## [1] 9
# datatable(icedata)

Data Cleansing
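
We drop the Timestamp and Name columns, rename the remaining columns to short flavor codes, and index the rows by child name.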

icedataorig <- icedata
icedata <- icedata %>% select(-Timestamp, -Name)
colnames(icedata) <- c("NuttyBtrSc", "BlkForest", "Rainbow", "Casata", "Choco", "StrawBerry", 
    "Coconut")
rownames(icedata) <- icedataorig$Name

# library(recommenderlab)
library(dplyr)


# Create test and training datasets by randomly excluding some of the rating items
# from icedata

temptrainice <- icedata

temptestice <- data.frame(matrix(NA, nrow = nrow(icedata), ncol = ncol(icedata)))
colnames(temptestice) <- colnames(icedata)
rownames(temptestice) <- rownames(icedata)




set.seed(101)

for (i in seq(1, 7)) {
    xr <- sample(seq_len(nrow(icedata)), 1, replace = FALSE)
    xc <- sample(seq_len(ncol(icedata)), 1, replace = FALSE)
    # cat('\n xr ' , xr,' xc ' , xc)
    temptestice[xr, xc] <- icedata[xr, xc]
    temptrainice[xr, xc] <- NA
}
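
# Note: the row and column indices are drawn independently on each iteration, so a
# different seed could select the same cell twice or a cell that is already NA.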


trainnew <- temptrainice
testnew <- temptestice

# Not sure if this way of partitioning data into training and test sets is OK
# when we need a bias for each user and each item. Is there a sampling method
# that samples randomly while still ensuring each user/item is represented?
# One possible approach is sketched below.

# randomobs <- sample(seq_len(nrow(icedata)), size = floor(0.8 * nrow(icedata)))
# trainnew <- icedata[randomobs, ]
# testnew <- icedata[-randomobs, ]
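
One possible answer to the question above, offered only as a sketch and not as the method used in this report: instead of drawing cells completely at random, sample one observed (non-NA) rating from each child's row, so every child is guaranteed to appear in the test set (flavors may still be missed). The helper name stratified_holdout below is hypothetical.

# Sketch: stratified hold-out that masks one observed rating per child
stratified_holdout <- function(ratings, seed = 101) {
    set.seed(seed)
    heldout <- data.frame(row = integer(0), col = integer(0))
    for (r in seq_len(nrow(ratings))) {
        # columns this child has actually rated
        observed <- which(!is.na(ratings[r, ]))
        if (length(observed) > 0) {
            pick <- observed[sample.int(length(observed), 1)]
            heldout <- rbind(heldout, data.frame(row = r, col = pick))
        }
    }
    heldout
}
# holdout <- stratified_holdout(icedata)  # indices of ratings to move into the test set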

IceCream Survey Data

Original IceCream Survey Data

pander(icedata)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 3 5 1 4 NA 1 NA
Adi 4 1 4 1 NA 4 NA
Tom 3 5 NA NA 5 2 4
Pinky 4 4 NA 4 4 5 5
Dino 1 1 1 1 2 NA 2
Peter NA 2 NA 2 4 4 4
Dolly 4 NA 4 4 4 4 4

Training Set

pander(trainnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 3 5 1 4 NA 1 NA
Adi 4 1 NA 1 NA 4 NA
Tom NA 5 NA NA 5 2 4
Pinky 4 4 NA 4 4 5 5
Dino 1 1 NA NA NA NA 2
Peter NA 2 NA 2 4 4 NA
Dolly 4 NA 4 4 NA 4 4

Test Set

pander(testnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly NA NA NA NA NA NA NA
Adi NA NA 4 NA NA NA NA
Tom 3 NA NA NA NA NA NA
Pinky NA NA NA NA NA NA NA
Dino NA NA 1 1 2 NA NA
Peter NA NA NA NA NA NA 4
Dolly NA NA NA NA 4 NA NA

Metrics

Calculating Raw Average

# Function to calculate Raw Average
     
rawavg <- function(ds){ 
     nr <- nrow(ds)
     nc <- ncol(ds)
     
     tot <- 0
     cnt <- 0
     for(i in seq(1,nr))  # for each row
     {
          for(j in seq(1,nc))
          {
               if (!(is.na(ds[i,j])))
               {
               #     cat("\nelem : ", ds[i,j],"\n")
                    tot <- tot + ds[i,j]
                    cnt <- cnt + 1    
               }
          }
     }

     # cat("\nTable Tot : ", tot,"\n")
     # cat("Table Cnt : ","\n",cnt,"\n") 

     return (tot /cnt)

}


rawav <- rawavg(trainnew)
rawav
## [1] 3.290323

The raw average of the training set ratings is 3.2903226.
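
As a quick sanity check (not part of the functions above), the same raw average can be computed in one vectorized call over the non-missing ratings; it should reproduce the value reported above.

mean(as.matrix(trainnew), na.rm = TRUE)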

Calculate Root Mean Squared Error (RMSE)

# Function to calculate Root Mean Square Error (RMSE)
     

calcrmse <- function(ds, ravg){ 
     
     nr <- nrow(ds)
     nc <- ncol(ds)
     
     tot <- 0
     cnt <- 0
     for(i in seq(1,nr))  # for each row
     {
          for(j in seq(1,nc))
          {
               if (!(is.na(ds[i,j])))
               {
               #     cat("\nelem : ", ds[i,j],"\n")
                    diff <- ds[i,j] - ravg
                    
                    tot <- tot + diff^2
                    cnt <- cnt + 1    
               }
          }
     }

     # cat("\nTable Tot : ", tot,"\n")
     # cat("Table Cnt : ","\n",cnt,"\n") 

     mse <- tot / cnt
     rmse <- sqrt(mse)
     
     return (rmse)

}

# RMSE for train data
rmse_train <- calcrmse(trainnew, rawav)
rmse_train
## [1] 1.395697
# RMSE for test data
rmse_test <-  calcrmse(testnew, rawav)
rmse_test
## [1] 1.401596

The RMSE for the training set is 1.3956973 and the RMSE for the test set is 1.4015961. There is very little difference between the RMSE for the training and test sets.
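
The same RMSE can also be checked with a vectorized one-liner (a sketch, not one of the functions used here): subtract the raw average from every observed rating and take the square root of the mean squared difference. It should agree with rmse_train above.

sqrt(mean((as.matrix(trainnew) - rawav)^2, na.rm = TRUE))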


Calculate Bias

# Function to calculate bias for child and flavor


calcbias <- function(ds, ravg) {
    
    # Bias for each child: the child's average rating minus the raw average
    childavgrate <- rowMeans(ds, na.rm = TRUE)
    childbias <- childavgrate - ravg
    
    # Bias for each icecream flavor: the flavor's average rating minus the raw average
    iceavgrate <- colMeans(ds, na.rm = TRUE)
    icebias <- iceavgrate - ravg
    
    # Form the bias tables for child and icecream flavor
    childtab <- data.frame(cbind(row.names(ds), as.numeric(childavgrate), as.numeric(childbias)), 
        stringsAsFactors = FALSE)
    colnames(childtab) <- c("childname", "avgrate", "bias")
    
    icetab <- data.frame(cbind(colnames(ds), as.numeric(iceavgrate), as.numeric(icebias)), 
        stringsAsFactors = FALSE)
    colnames(icetab) <- c("flavor", "avgrate", "bias")
    
    # Return the dataset along with the child and flavor bias tables
    return(list(ds = ds, childtab = childtab, icetab = icetab))
    
}


listbias <- calcbias(trainnew, rawav)

Bias Tables

Child Bias Table

pander(listbias$childtab, caption = "Child Bias Table")
Child Bias Table
childname avgrate bias
Molly 2.8 -0.490322580645162
Adi 2.5 -0.790322580645161
Tom 4 0.709677419354839
Pinky 4.33333333333333 1.04301075268817
Dino 1.33333333333333 -1.95698924731183
Peter 3 -0.290322580645161
Dolly 4 0.709677419354839

IceCream Bias Table

pander(listbias$icetab, caption = "IceCream Bias Table")
IceCream Bias Table
flavor avgrate bias
NuttyBtrSc 3.2 -0.0903225806451613
BlkForest 3 -0.290322580645161
Rainbow 2.5 -0.790322580645161
Casata 3 -0.290322580645161
Choco 4.33333333333333 1.04301075268817
StrawBerry 3.33333333333333 0.043010752688172
Coconut 3.75 0.459677419354839

Calculate Baseline Predictions
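
The baseline predictor combines the raw average with the two biases: for each child-flavor pair, baseline = raw average + child bias + flavor bias, clipped to the 1-5 rating scale. As a minimal worked example using the listbias tables computed above, the prediction for Molly and Choco would be:

# Sketch: baseline prediction for a single child-flavor pair (Molly / Choco)
bline_molly_choco <- rawav +
    as.numeric(listbias$childtab[listbias$childtab$childname == "Molly", "bias"]) +
    as.numeric(listbias$icetab[listbias$icetab$flavor == "Choco", "bias"])
min(max(bline_molly_choco, 1), 5)  # clip to the valid rating range

This should correspond to the Molly/Choco entry of the baseline table shown later.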

# Function to calculate Baseline Predictor for each child-flavor combination



calcbaselinepred <- function(ds) {
    
    # Initialize the baseline dataframe, one cell per child-flavor combination
    citab <- data.frame(matrix(NA, nrow = nrow(ds), ncol = ncol(ds)))
    
    # Raw average for the passed dataset
    rawav <- rawavg(ds)
    
    # Child (user) and flavor (item) biases for the dataset
    lbias <- calcbias(ds, rawav)
    chtab <- lbias$childtab
    ictab <- lbias$icetab
    
    # Iterate through the user and item bias tables to find the baseline for each
    # user-item combination, clipping the result to the 1-5 rating scale
    for (c in seq(1, nrow(chtab))) {
        for (i in seq(1, nrow(ictab))) {
            
            bline <- rawav + as.numeric(chtab[c, ]$bias) + as.numeric(ictab[i, ]$bias)
            
            if (bline < 1) {
                bline <- 1
            } else if (bline > 5) {
                bline <- 5
            }
            
            citab[c, i] <- bline
        }
        
    }
    
    colnames(citab) <- ictab$flavor
    rownames(citab) <- chtab$childname
    return(citab)
}
childicecreambaseline <- calcbaselinepred(trainnew)

The baseline predictors are shown below alongside the training and test sets in the RMSE comparison section.


Calculate RMSE For Baseline Prediction

# Function to calculate Root Mean Square Error (RMSE)
     

calcbaselinermse <- function(ds, dsbaseline){ 
     
     # the dimensions of both dataframes are assumed to be the same
     
     nr <- nrow(ds)
     nc <- ncol(ds)
     
     tot <- 0
     cnt <- 0
     
     for(i in seq(1,nr))  # for each row
     {
          for(j in seq(1,nc))
          {
               if (!(is.na(ds[i,j])))
               {
               #     cat("\nelem : ", ds[i,j],"\n")
                    diff <- ds[i,j] - dsbaseline[i,j]
                    
                    tot <- tot + diff^2
                    cnt <- cnt + 1    
               }
          }
     }

     

     # cat("\nTable Tot : ", tot,"\n")
     # cat("Table Cnt : ","\n",cnt,"\n") 

     msebaseline <- tot / cnt
     rmsebaseline <- sqrt(msebaseline)
     
     return (rmsebaseline)
}

Training Set Vs Baseline Prediction

par(mfrow = c(1, 2))
pander(trainnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 3 5 1 4 NA 1 NA
Adi 4 1 NA 1 NA 4 NA
Tom NA 5 NA NA 5 2 4
Pinky 4 4 NA 4 4 5 5
Dino 1 1 NA NA NA NA 2
Peter NA 2 NA 2 4 4 NA
Dolly 4 NA 4 4 NA 4 4
pander(childicecreambaseline)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 2.71 2.51 2.01 2.51 3.843 2.843 3.26
Adi 2.41 2.21 1.71 2.21 3.543 2.543 2.96
Tom 3.91 3.71 3.21 3.71 5 4.043 4.46
Pinky 4.243 4.043 3.543 4.043 5 4.376 4.793
Dino 1.243 1.043 1 1.043 2.376 1.376 1.793
Peter 2.91 2.71 2.21 2.71 4.043 3.043 3.46
Dolly 3.91 3.71 3.21 3.71 5 4.043 4.46
# RMSE for train data
rmsebaseline_train <- calcbaselinermse(trainnew, childicecreambaseline)
rmsebaseline_train
## [1] 1.000087

Test Set Vs Baseline Prediction

par(mfrow = c(1, 2))
pander(testnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly NA NA NA NA NA NA NA
Adi NA NA 4 NA NA NA NA
Tom 3 NA NA NA NA NA NA
Pinky NA NA NA NA NA NA NA
Dino NA NA 1 1 2 NA NA
Peter NA NA NA NA NA NA 4
Dolly NA NA NA NA 4 NA NA
pander(childicecreambaseline)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 2.71 2.51 2.01 2.51 3.843 2.843 3.26
Adi 2.41 2.21 1.71 2.21 3.543 2.543 2.96
Tom 3.91 3.71 3.21 3.71 5 4.043 4.46
Pinky 4.243 4.043 3.543 4.043 5 4.376 4.793
Dino 1.243 1.043 1 1.043 2.376 1.376 1.793
Peter 2.91 2.71 2.21 2.71 4.043 3.043 3.46
Dolly 3.91 3.71 3.21 3.71 5 4.043 4.46
# RMSE for test data
rmsebaseline_test <- calcbaselinermse(testnew, childicecreambaseline)
rmsebaseline_test
## [1] 1.035686

Summarization

We find an improvement in the RMSE for both the training and the test sets with the baseline prediction method. Hence it offers a better recommendation option.

We see the improvements as follows.

train_impr <- 1 - rmsebaseline_train/rmse_train
train_impr
## [1] 0.2834502
test_impr <- 1 - rmsebaseline_test/rmse_test
test_impr
## [1] 0.2610664

The % improvement in RMSE with the baseline prediction method, compared to the previously calculated RMSE for the training set, is 28.3450212%.

The % improvement in RMSE with the baseline prediction method, compared to the previously calculated RMSE for the test set, is 26.106644%.

We find a roughly equivalent improvement in the RMSE for both the training and test sets with the baseline prediction.


YumYum Icecream Recommends !

  • Choco flavor to Molly and Adi; they are surely going to like it! We are only hoping they will like the Coconut flavor, though.

  • As for Tom, he may be OK with TuttiFruiti, but since we know he loves cake icecream, he should surely go for Casata!

  • Pinky loves most icecreams, and we feel she will rate the TuttiFruiti Rainbow at least above average.

  • Dino is picky and doesn't seem to like cake icecreams; we feel he will like the Strawberry flavour, as he prefers plain ones. We hope so!

  • Peter is more into single flavours than cake icecreams, but he hasn't tried the nutty ones; we hope he will like the ButterScotch and TuttiFruiti around average, as he is not overly picky.

  • Dolly is all for icecreams, and we are eager for her to grab the BlackForest cake icecream 'coz we know she will enjoy it!