Recommending New Cool Flavors For The Hot Season!


Summary

This R Markdown document walks through an analysis of customer ratings and favourites in order to recommend new or untried flavors.

To test our baseline recommendation, we randomly hold out a small set of individual ratings (child-flavor cells, roughly 20% of the observed ratings) as a test set; the remaining roughly 80% of the ratings form the training set.

knitr::opts_chunk$set(message = FALSE, echo = TRUE)

# To load survey data from googlesheets
library(googlesheets)
# Library for loading CSV data
library(RCurl)
# Library for data tidying
library(tidyr)
# Library for data structure operations
library(dplyr)
library(knitr)
# Library for plotting
library(ggplot2)
# Library for data display in tabular format
library(DT)
library(pander)

Loading The IceCream Survey Data

The YumYum IceCream Shop has created a survey for its regular kids to rate its flavors.

Here is the survey link:

https://docs.google.com/forms/d/e/1FAIpQLSeWvKMD8LiQOkBzJPB-4Sf_50rFYXLw5mVib7iMQCvin2GqxA/viewform

The responses from the survey can be found here:

https://docs.google.com/spreadsheets/d/1WnoRC8ZPpie22xf_MJ0egfMUFG2zvO7g7FEftvNrrnQ/edit?usp=sharing

# Load data from Google Sheets: list the available sheets, then read the relevant
# spreadsheet and worksheet

gs_ls()
## # A tibble: 4 x 10
##                sheet_title        author  perm version             updated
##                      <chr>         <chr> <chr>   <chr>              <dttm>
## 1          YumYum IceCream java.messagi…    rw     new 2017-06-16 18:06:05
## 2 IS606 Fall 2016 Present…   jason.bryer    rw     new 2017-03-15 23:05:15
## 3     Untitled spreadsheet kumudini.bha…    rw     new 2016-12-21 20:47:52
## 4 IS606 Spring 2016 Prese…   jason.bryer    rw     new 2016-05-22 05:29:14
## # ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
## #   alternate <chr>, self <chr>, alt_key <chr>
icedata.url <- gs_title("YumYum IceCream")
icedata.csv <- gs_read_csv(ss = icedata.url, ws = "Form Responses 1")

# convert to data.frame
icedata <- as.data.frame(icedata.csv)
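
If interactive googlesheets authentication is not available, an assumed fallback is to read the responses straight from the sheet's CSV export URL (this relies on the spreadsheet staying publicly readable and on the standard export?format=csv path built from the sheet key in the link above):

# Fallback (assumption): read the shared sheet via its CSV export URL
# icedata.export.url <- "https://docs.google.com/spreadsheets/d/1WnoRC8ZPpie22xf_MJ0egfMUFG2zvO7g7FEftvNrrnQ/export?format=csv"
# icedata <- read.csv(icedata.export.url, stringsAsFactors = FALSE, check.names = FALSE)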


# Verifying records and variables

nrow(icedata)
## [1] 7
ncol(icedata)
## [1] 9
# datatable(icedata)

Data Cleansing
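
We drop the Timestamp and Name columns, rename the remaining columns to short flavor codes, and index the rows by child name.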

icedataorig <- icedata
icedata <- icedata %>% select(-Timestamp, -Name)
colnames(icedata) <- c("NuttyBtrSc", "BlkForest", "Rainbow", "Casata", "Choco", "StrawBerry", 
    "Coconut")
rownames(icedata) <- icedataorig$Name

# library(recommenderlab)
library(dplyr)


# Create test and training datasets by randomly excluding some of the rating items
# from icedata

temptrainice <- icedata

temptestice <- data.frame(matrix(NA, nrow = nrow(icedata), ncol = ncol(icedata)))
colnames(temptestice) <- colnames(icedata)
rownames(temptestice) <- rownames(icedata)




set.seed(101)

for (i in seq(1, 7)) {
    xr <- sample(seq_len(nrow(icedata)), 1, replace = FALSE)
    xc <- sample(seq_len(ncol(icedata)), 1, replace = FALSE)
    # cat('\n xr ' , xr,' xc ' , xc)
    temptestice[xr, xc] <- icedata[xr, xc]
    temptrainice[xr, xc] <- NA
}
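
# Note: the row and column indices are drawn independently on each iteration, so a
# different seed could select the same cell twice or a cell that is already NA.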


trainnew <- temptrainice
testnew <- temptestice

# Not sure if this way of partitioning data into training and test sets is OK
# when we need a bias for each user and each item. Is there a sampling method
# that samples randomly while still ensuring each user/item is represented?
# One possible approach is sketched below.

# randomobs <- sample(seq_len(nrow(icedata)), size = floor(0.8 * nrow(icedata)))
# trainnew <- icedata[randomobs, ]
# testnew <- icedata[-randomobs, ]
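
One possible answer to the question above, offered only as a sketch and not as the method used in this report: instead of drawing cells completely at random, sample one observed (non-NA) rating from each child's row, so every child is guaranteed to appear in the test set (flavors may still be missed). The helper name stratified_holdout below is hypothetical.

# Sketch: stratified hold-out that masks one observed rating per child
stratified_holdout <- function(ratings, seed = 101) {
    set.seed(seed)
    heldout <- data.frame(row = integer(0), col = integer(0))
    for (r in seq_len(nrow(ratings))) {
        # columns this child has actually rated
        observed <- which(!is.na(ratings[r, ]))
        if (length(observed) > 0) {
            pick <- observed[sample.int(length(observed), 1)]
            heldout <- rbind(heldout, data.frame(row = r, col = pick))
        }
    }
    heldout
}
# holdout <- stratified_holdout(icedata)  # indices of ratings to move into the test set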

IceCream Survey Data

Original IceCream Survey Data

pander(icedata)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 3 5 1 4 NA 1 NA
Adi 4 1 4 1 NA 4 NA
Tom 3 5 NA NA 5 2 4
Pinky 4 4 NA 4 4 5 5
Dino 1 1 1 1 2 NA 2
Peter NA 2 NA 2 4 4 4
Dolly 4 NA 4 4 4 4 4

Training Set

pander(trainnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 3 5 1 4 NA 1 NA
Adi 4 1 NA 1 NA 4 NA
Tom NA 5 NA NA 5 2 4
Pinky 4 4 NA 4 4 5 5
Dino 1 1 NA NA NA NA 2
Peter NA 2 NA 2 4 4 NA
Dolly 4 NA 4 4 NA 4 4

Test Set

pander(testnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly NA NA NA NA NA NA NA
Adi NA NA 4 NA NA NA NA
Tom 3 NA NA NA NA NA NA
Pinky NA NA NA NA NA NA NA
Dino NA NA 1 1 2 NA NA
Peter NA NA NA NA NA NA 4
Dolly NA NA NA NA 4 NA NA

Metrics

Calculating Raw Average

# Function to calculate Raw Average
     
rawavg <- function(ds){ 
     nr <- nrow(ds)
     nc <- ncol(ds)
     
     tot <- 0
     cnt <- 0
     for(i in seq(1,nr))  # for each row
     {
          for(j in seq(1,nc))
          {
               if (!(is.na(ds[i,j])))
               {
               #     cat("\nelem : ", ds[i,j],"\n")
                    tot <- tot + ds[i,j]
                    cnt <- cnt + 1    
               }
          }
     }

     # cat("\nTable Tot : ", tot,"\n")
     # cat("Table Cnt : ","\n",cnt,"\n") 

     return (tot /cnt)

}


rawav <- rawavg(trainnew)
rawav
## [1] 3.290323

The raw average of the training set ratings is 3.2903226.
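
As a quick sanity check (not part of the functions above), the same raw average can be computed in one vectorized call over the non-missing ratings; it should reproduce the value reported above.

mean(as.matrix(trainnew), na.rm = TRUE)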

Calculate Root Mean Squared Error (RMSE)

# Function to calculate Root Mean Square Error (RMSE)
     

calcrmse <- function(ds, ravg){ 
     
     nr <- nrow(ds)
     nc <- ncol(ds)
     
     tot <- 0
     cnt <- 0
     for(i in seq(1,nr))  # for each row
     {
          for(j in seq(1,nc))
          {
               if (!(is.na(ds[i,j])))
               {
               #     cat("\nelem : ", ds[i,j],"\n")
                    diff <- ds[i,j] - ravg
                    
                    tot <- tot + diff^2
                    cnt <- cnt + 1    
               }
          }
     }

     # cat("\nTable Tot : ", tot,"\n")
     # cat("Table Cnt : ","\n",cnt,"\n") 

     mse <- tot / cnt
     rmse <- sqrt(mse)
     
     return (rmse)

}

# RMSE for train data
rmse_train <- calcrmse(trainnew, rawav)
rmse_train
## [1] 1.395697
# RMSE for test data
rmse_test <-  calcrmse(testnew, rawav)
rmse_test
## [1] 1.401596

The RMSE for the training set is 1.3956973 and the RMSE for the test set is 1.4015961. There is very little difference between the RMSE for the training and test sets.
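
The same RMSE can also be checked with a vectorized one-liner (a sketch, not one of the functions used here): subtract the raw average from every observed rating and take the square root of the mean squared difference. It should agree with rmse_train above.

sqrt(mean((as.matrix(trainnew) - rawav)^2, na.rm = TRUE))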


Calculate Bias

# Function to calculate bias for child and flavor


calcbias <- function(ds, ravg) {
    
    # Bias for each child: the child's average rating minus the raw average
    childavgrate <- rowMeans(ds, na.rm = TRUE)
    childbias <- childavgrate - ravg
    
    # Bias for each icecream flavor: the flavor's average rating minus the raw average
    iceavgrate <- colMeans(ds, na.rm = TRUE)
    icebias <- iceavgrate - ravg
    
    # Form the bias tables for child and icecream flavor
    childtab <- data.frame(cbind(row.names(ds), as.numeric(childavgrate), as.numeric(childbias)), 
        stringsAsFactors = FALSE)
    colnames(childtab) <- c("childname", "avgrate", "bias")
    
    icetab <- data.frame(cbind(colnames(ds), as.numeric(iceavgrate), as.numeric(icebias)), 
        stringsAsFactors = FALSE)
    colnames(icetab) <- c("flavor", "avgrate", "bias")
    
    # Return the dataset along with the child and flavor bias tables
    return(list(ds = ds, childtab = childtab, icetab = icetab))
    
}


listbias <- calcbias(trainnew, rawav)

Bias Tables

Child Bias Table

pander(listbias$childtab, caption = "Child Bias Table")
Child Bias Table
childname avgrate bias
Molly 2.8 -0.490322580645162
Adi 2.5 -0.790322580645161
Tom 4 0.709677419354839
Pinky 4.33333333333333 1.04301075268817
Dino 1.33333333333333 -1.95698924731183
Peter 3 -0.290322580645161
Dolly 4 0.709677419354839

IceCream Bias Table

pander(listbias$icetab, caption = "IceCream Bias Table")
IceCream Bias Table
flavor avgrate bias
NuttyBtrSc 3.2 -0.0903225806451613
BlkForest 3 -0.290322580645161
Rainbow 2.5 -0.790322580645161
Casata 3 -0.290322580645161
Choco 4.33333333333333 1.04301075268817
StrawBerry 3.33333333333333 0.043010752688172
Coconut 3.75 0.459677419354839

Calculate Baseline Predictions
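
The baseline predictor combines the raw average with the two biases: for each child-flavor pair, baseline = raw average + child bias + flavor bias, clipped to the 1-5 rating scale. As a minimal worked example using the listbias tables computed above, the prediction for Molly and Choco would be:

# Sketch: baseline prediction for a single child-flavor pair (Molly / Choco)
bline_molly_choco <- rawav +
    as.numeric(listbias$childtab[listbias$childtab$childname == "Molly", "bias"]) +
    as.numeric(listbias$icetab[listbias$icetab$flavor == "Choco", "bias"])
min(max(bline_molly_choco, 1), 5)  # clip to the valid rating range

This should correspond to the Molly/Choco entry of the baseline table shown later.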

# Function to calculate Baseline Predictor for each child-flavor combination



calcbaselinepred <- function(ds) {
    
    # Initialize the baseline dataframe, one cell per child-flavor combination
    citab <- data.frame(matrix(NA, nrow = nrow(ds), ncol = ncol(ds)))
    
    # Raw average for the passed dataset
    rawav <- rawavg(ds)
    
    # Child (user) and flavor (item) biases for the dataset
    lbias <- calcbias(ds, rawav)
    chtab <- lbias$childtab
    ictab <- lbias$icetab
    
    # Iterate through the user and item bias tables to find the baseline for each
    # user-item combination, clipping the result to the 1-5 rating scale
    for (c in seq(1, nrow(chtab))) {
        for (i in seq(1, nrow(ictab))) {
            
            bline <- rawav + as.numeric(chtab[c, ]$bias) + as.numeric(ictab[i, ]$bias)
            
            if (bline < 1) {
                bline <- 1
            } else if (bline > 5) {
                bline <- 5
            }
            
            citab[c, i] <- bline
        }
        
    }
    
    colnames(citab) <- ictab$flavor
    rownames(citab) <- chtab$childname
    return(citab)
}
childicecreambaseline <- calcbaselinepred(trainnew)

The baseline predictors are shown below alongside the training and test sets in the RMSE comparison section.


Calculate RMSE For Baseline Prediction

# Function to calculate Root Mean Square Error (RMSE)
     

calcbaselinermse <- function(ds, dsbaseline){ 
     
     # the dimensions of both dataframes are assumed to be the same
     
     nr <- nrow(ds)
     nc <- ncol(ds)
     
     tot <- 0
     cnt <- 0
     
     for(i in seq(1,nr))  # for each row
     {
          for(j in seq(1,nc))
          {
               if (!(is.na(ds[i,j])))
               {
               #     cat("\nelem : ", ds[i,j],"\n")
                    diff <- ds[i,j] - dsbaseline[i,j]
                    
                    tot <- tot + diff^2
                    cnt <- cnt + 1    
               }
          }
     }

     

     # cat("\nTable Tot : ", tot,"\n")
     # cat("Table Cnt : ","\n",cnt,"\n") 

     msebaseline <- tot / cnt
     rmsebaseline <- sqrt(msebaseline)
     
     return (rmsebaseline)
}

Training Set Vs Baseline Prediction

par(mfrow = c(1, 2))
pander(trainnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 3 5 1 4 NA 1 NA
Adi 4 1 NA 1 NA 4 NA
Tom NA 5 NA NA 5 2 4
Pinky 4 4 NA 4 4 5 5
Dino 1 1 NA NA NA NA 2
Peter NA 2 NA 2 4 4 NA
Dolly 4 NA 4 4 NA 4 4
pander(childicecreambaseline)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 2.71 2.51 2.01 2.51 3.843 2.843 3.26
Adi 2.41 2.21 1.71 2.21 3.543 2.543 2.96
Tom 3.91 3.71 3.21 3.71 5 4.043 4.46
Pinky 4.243 4.043 3.543 4.043 5 4.376 4.793
Dino 1.243 1.043 1 1.043 2.376 1.376 1.793
Peter 2.91 2.71 2.21 2.71 4.043 3.043 3.46
Dolly 3.91 3.71 3.21 3.71 5 4.043 4.46
# RMSE for train data
rmsebaseline_train <- calcbaselinermse(trainnew, childicecreambaseline)
rmsebaseline_train
## [1] 1.000087

Test Set Vs Baseline Prediction

par(mfrow = c(1, 2))
pander(testnew)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly NA NA NA NA NA NA NA
Adi NA NA 4 NA NA NA NA
Tom 3 NA NA NA NA NA NA
Pinky NA NA NA NA NA NA NA
Dino NA NA 1 1 2 NA NA
Peter NA NA NA NA NA NA 4
Dolly NA NA NA NA 4 NA NA
pander(childicecreambaseline)
  NuttyBtrSc BlkForest Rainbow Casata Choco StrawBerry Coconut
Molly 2.71 2.51 2.01 2.51 3.843 2.843 3.26
Adi 2.41 2.21 1.71 2.21 3.543 2.543 2.96
Tom 3.91 3.71 3.21 3.71 5 4.043 4.46
Pinky 4.243 4.043 3.543 4.043 5 4.376 4.793
Dino 1.243 1.043 1 1.043 2.376 1.376 1.793
Peter 2.91 2.71 2.21 2.71 4.043 3.043 3.46
Dolly 3.91 3.71 3.21 3.71 5 4.043 4.46
# RMSE for test data
rmsebaseline_test <- calcbaselinermse(testnew, childicecreambaseline)
rmsebaseline_test
## [1] 1.035686

Summarization

We find an improvement in the RMSE for both the training and the test sets with the baseline prediction method. Hence it offers a better recommendation option.

We see the improvements as follows.

train_impr <- 1 - rmsebaseline_train/rmse_train
train_impr
## [1] 0.2834502
test_impr <- 1 - rmsebaseline_test/rmse_test
test_impr
## [1] 0.2610664

The % improvement in RMSE with the baseline prediction method, compared to the previously calculated RMSE for the training set, is 28.3450212%.

The % improvement in RMSE with the baseline prediction method, compared to the previously calculated RMSE for the test set, is 26.106644%.

We find a roughly equivalent improvement in the RMSE for both the training and test sets with the baseline prediction.


YumYum Icecream Recommends !

  • Choco flavor to Molly and Adi; they are surely going to like it! We are only hoping they will like the Coconut flavor, though.

  • As for Tom, he may be OK with TuttiFruiti, but since we know he loves cake icecream, he should surely go for Casata!

  • Pinky loves most icecreams, and we feel she will rate the TuttiFruiti Rainbow at least above average.

  • Dino is picky and doesn't seem to like cake icecreams; we feel he will like the Strawberry flavour, as he prefers plain ones. We hope so!

  • Peter is more into single flavours than cake icecreams, but he hasn't tried the nutty ones; we hope he will like the ButterScotch and TuttiFruiti around average, as he is not overly picky.

  • Dolly is all for icecreams, and we are eager for her to grab the BlackForest cake icecream 'coz we know she will enjoy it!