This R Markdown document analyzes ice cream ratings collected from little kids and recommends new, untried flavors for them to try this summer.
To evaluate our baseline recommendations, we use k-fold cross-validation to compare several collaborative filtering techniques for recommendation.
knitr::opts_chunk$set(message = FALSE, echo = TRUE)
# To load survey data from googlesheets
suppressWarnings(suppressMessages(library(googlesheets)))
# Library for loading CSV data
library(RCurl)
# Library for data tidying
library(tidyr)
# Library for data structure operations
library(dplyr)
library(knitr)
# Library for plotting
library(ggplot2)
# Library for data display in tabular format
library(DT)
library(pander)
suppressWarnings(suppressMessages(library(recommenderlab)))

The YumYum IceCream Shop has created a survey for the regular kids to rate their flavors.
Here is the survey link:
https://docs.google.com/forms/d/e/1FAIpQLSdk2Xgop-XCcTXR2XEQW3pFV9l0e_VjBFMjTWvX1ttqK3fMZg/viewform
The survey responses can be found here:
https://docs.google.com/spreadsheets/d/1IKwsU5KjG6Y00Cg5F2ZDCUUYuqjPw8tG67ql3sDqIvc/edit?usp=sharing
# Loading data from googlesheets, first finding the relevant sheet , reading the
# sheet and relevant worksheet
gs_ls()

## # A tibble: 5 x 10
##   sheet_title               author        perm  version updated
##   <chr>                     <chr>         <chr> <chr>   <dttm>
## 1 YumYumSummer              java.messagi… rw    new     2017-06-18 21:43:37
## 2 YumYum IceCream           java.messagi… rw    new     2017-06-16 18:06:05
## 3 IS606 Fall 2016 Present…  jason.bryer   rw    new     2017-03-15 23:05:15
## 4 Untitled spreadsheet      kumudini.bha… rw    new     2016-12-21 20:47:52
## 5 IS606 Spring 2016 Prese…  jason.bryer   rw    new     2016-05-22 05:29:14
## # ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
## #   alternate <chr>, self <chr>, alt_key <chr>
icedata.url <- gs_title("YumYumSummer")
icedata.csv <- gs_read_csv(ss = icedata.url, ws = "Summer")
# convert to data.frame
icedata <- as.data.frame(icedata.csv)
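Since RCurl is loaded above, the same data could also be fetched without the googlesheets handshake, assuming the sheet permits anonymous access through Google's CSV export endpoint. A hedged sketch (the export URL pattern is a general Google Sheets convention, not taken from this project's code):

# Hypothetical fallback: pull the sheet as CSV over HTTPS with RCurl.
# Assumes the sheet is shared so that anonymous export is permitted.
csvtext <- getURL("https://docs.google.com/spreadsheets/d/1IKwsU5KjG6Y00Cg5F2ZDCUUYuqjPw8tG67ql3sDqIvc/export?format=csv")
icedata_alt <- read.csv(text = csvtext, stringsAsFactors = FALSE)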
# Verifying records and variables
nrow(icedata)

## [1] 14

ncol(icedata)

## [1] 27

# datatable(icedata)
icedataorig <- icedata
icedata <- icedata %>% select(-Timestamp, -Name)
# creating test and training datasets by randomly excluding some of the rating items
# from icedata
# class(icedata)
icemat <- as(as.matrix(icedata), "realRatingMatrix")
# class(icemat)
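Coercion to realRatingMatrix keeps only the non-missing entries, so empty survey cells become "not rated" rather than zero ratings. A tiny standalone illustration with toy data (not the survey):

# Toy example: NAs in a plain matrix become unstored ("not rated") entries
toy <- matrix(c(5, NA, 3, NA, 4, 1), nrow = 2,
    dimnames = list(c("kidA", "kidB"), c("flav1", "flav2", "flav3")))
toy_rrm <- as(toy, "realRatingMatrix")
getRatings(toy_rrm)  # the 4 stored ratings: 5, 3, 4, 1
rowCounts(toy_rrm)   # kidA rated 3 flavors, kidB rated 1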
icer <- nrow(icemat)
icec <- ncol(icemat)

Original IceCream Survey Data
datatable(icedata)

alike_kids <- similarity(icemat[1:icer, ], method = "cosine", which = "users")
datatable(as.matrix(alike_kids))
image(as.matrix(alike_kids), main = "HeatMap : Kids Comparable")

alike_ice <- similarity(icemat[, 1:icec], method = "cosine", which = "items")
datatable(as.matrix(alike_ice))
image(as.matrix(alike_ice), main = "HeatMap : IceCreams Comparable")

NuttyButterScotch and NuttyExpresso seem to be the most bought flavors.
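As a sanity check on similarity(), the cosine similarity between two kids can be computed by hand from their co-rated flavors. A minimal sketch (rows 1 and 2 are arbitrary; recommenderlab's internal missing-value handling may differ slightly):

# Hand-rolled cosine similarity between kid 1 and kid 2 over co-rated flavors
u1 <- as(icemat[1, ], "matrix")[1, ]
u2 <- as(icemat[2, ], "matrix")[1, ]
both <- !is.na(u1) & !is.na(u2)  # flavors rated by both kids
sum(u1[both] * u2[both]) / (sqrt(sum(u1[both]^2)) * sqrt(sum(u2[both]^2)))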
buys_each_icecream <- colCounts(icemat)
min_no_flavor_rated <- min(rowCounts(icemat))
# Then, we can sort the icecream buy counts in decreasing order:
table_buys <- data.frame(icecreamflavor = names(buys_each_icecream), buys = buys_each_icecream)
table_buys <- table_buys[order(table_buys$buys, decreasing = TRUE), ]
ggplot(table_buys[1:25, ], aes(x = icecreamflavor, y = buys, col = icecreamflavor,
    fill = icecreamflavor)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1)) + ggtitle("Number of buys of the icecream flavors")

The rating the kids give most often is 4, so on average happy kids rate generously. There are exceptions, though: kids like Dino and Toto rate very low on average.
vratings <- as.vector(icemat@data)  # in the sparse slot, 0 means "not rated"
vratings <- vratings[vratings != 0]  # keep only actual ratings
tabratings <- table(vratings)
pander(tabratings, caption = "Unique Ratings Frequency Table")

| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 24 | 19 | 29 | 55 | 22 |
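As a consistency check, these frequencies sum to the 149 stored ratings reported for the rating matrix:

sum(tabratings)  # 24 + 19 + 29 + 55 + 22 = 149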
qplot(factor(vratings), col = factor(vratings), fill = factor(vratings)) + ggtitle("Histogram Ratings")

We find that ChocolateAlmond has the highest average rating, followed by DiveInChocolate, NuttyExpresso and OrangeVanilla.
avgratings <- colMeans(icemat)  # recommenderlab's colMeans skips unrated (zero) entries
table_avgrate <- data.frame(icecreamflavor = names(avgratings), avgrate = avgratings)
ggplot(table_avgrate[1:25, ], aes(x = icecreamflavor, y = avgrate, col = icecreamflavor,
fill = icecreamflavor)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + ggtitle("Avg Rating IceCream Flavors")# Since the minimum flavors rated by the child is 6, we keep 5 as the items to
# keep for training
# we will define how many chunks we want. The argument is k (defined by n_fold
# below), like the number of repetitions
set.seed(101)
n_fold <- 5 # k value for k fold cross validation
items_to_keep <- 5 # Items to consider in training set
rating_threshold <- 3.5 # Considering a rating of 3.5 as good rating across all flavors.
eval_sets <- evaluationScheme(data = icemat, method = "cross-validation", k = n_fold,
given = items_to_keep, goodRating = rating_threshold)
eval_sets

## Evaluation scheme with 5 items given
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=3.500000
## Data set: 14 x 25 rating matrix of class 'realRatingMatrix' with 149 ratings.
evaltrain <- getData(eval_sets, "train") # training set
evalknown <- getData(eval_sets, "known") # known test set
evalunknown <- getData(eval_sets, "unknown") # unknown test set

The functions in the recommenderlab package automatically mean-center the data and calculate similarities using the specified method. Many functions in the package take inputs of class realRatingMatrix, which is why the raw ice cream rating data is used.
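As a quick sanity check on the scheme, the number of users in each piece can be inspected (exact counts depend on the fold):

# Users in each piece of the evaluation scheme (train vs. test fold)
sapply(list(train = evaltrain, known = evalknown, unknown = evalunknown), nrow)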
For each of the models built below we predict ratings on the known portion of the test set, ask for 3 recommended flavors, and score the predictions against the unknown portion to arrive at the RMSE for the model.
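The five model blocks below repeat the same fit/predict/score pattern. A small helper like the hypothetical fit_and_score below (our name, not part of recommenderlab) could express each comparison in one line; the explicit blocks are kept for readability:

# Hypothetical helper: fit on the training set, predict ratings for the known
# test ratings, and score against the held-out (unknown) ratings.
fit_and_score <- function(method, parameter = NULL) {
    model <- Recommender(data = evaltrain, method = method, parameter = parameter)
    pred <- predict(model, newdata = evalknown, type = "ratings")
    calcPredictionAccuracy(x = pred, data = evalunknown, byUser = FALSE)
}
# e.g. fit_and_score("UBCF", list(method = "cosine"))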
# First, let's prepare the data for validation, as shown in the previous section.
# Since the k-fold approach is the most accurate, we will use it here:
model_to_evaluate <- "IBCF"
model_parameters <- list(method = "Cosine")
model1_IBCF_cosine <- Recommender(data = evaltrain, method = model_to_evaluate, parameter = model_parameters)
items_to_recommend <- 3
model1_prediction <- predict(object = model1_IBCF_cosine, newdata = evalknown, n = items_to_recommend,
type = "ratings")
model1_accuracy <- calcPredictionAccuracy(x = model1_prediction, data = evalunknown,
byUser = FALSE)
model1_accuracy

##     RMSE      MSE      MAE
## 1.427599 2.038040 1.085901
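Conceptually, calcPredictionAccuracy derives the RMSE as the root mean squared difference between predicted and held-out ratings. A by-hand check (the value should match the output above, up to the handling of unpredicted entries):

# By-hand RMSE over the entries that are both predicted and held out
pred_m <- as(model1_prediction, "matrix")
act_m <- as(evalunknown, "matrix")
sqrt(mean((pred_m - act_m)^2, na.rm = TRUE))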
model_to_evaluate <- "IBCF"
model_parameters <- list(method = "pearson")
model2_IBCF_pearson <- Recommender(data = evaltrain, method = model_to_evaluate,
parameter = model_parameters)
items_to_recommend <- 3
model2_prediction <- predict(object = model2_IBCF_pearson, newdata = evalknown, n = items_to_recommend,
type = "ratings")
model2_accuracy <- calcPredictionAccuracy(x = model2_prediction, data = evalunknown,
byUser = FALSE)
model2_accuracy

##     RMSE      MSE      MAE
## 1.518935 2.307164 1.227525
model_to_evaluate <- "UBCF"
model_parameters <- list(method = "cosine")
model3_UBCF_cosine <- Recommender(data = evaltrain, method = model_to_evaluate, parameter = model_parameters)
items_to_recommend <- 3
model3_prediction <- predict(object = model3_UBCF_cosine, newdata = evalknown, n = items_to_recommend,
type = "ratings")
model3_accuracy <- calcPredictionAccuracy(x = model3_prediction, data = evalunknown,
byUser = FALSE) # byUser =FALSE for model level performance metrics
model3_accuracy

##      RMSE       MSE       MAE
## 1.1950031 1.4280325 0.9237501
model_to_evaluate <- "UBCF"
model_parameters <- list(method = "pearson")
model4_UBCF_pearson <- Recommender(data = evaltrain, method = model_to_evaluate,
parameter = model_parameters)
items_to_recommend <- 3
model4_prediction <- predict(object = model4_UBCF_pearson, newdata = evalknown, n = items_to_recommend,
type = "ratings")
model4_accuracy <- calcPredictionAccuracy(x = model4_prediction, data = evalunknown,
byUser = FALSE) # byUser =FALSE for model level performance metrics
model4_accuracy

## RMSE  MSE  MAE
## NaN NaN NaN
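The NaN metrics suggest the Pearson-based model could not score any test ratings: with only 5 given ratings per user, Pearson correlations between users are frequently undefined (for example, when a user's given ratings have zero variance), so the predictions come back missing. A quick way to inspect this:

# Fraction of missing entries in model 4's predicted rating matrix
# (known test items are also NA here, so this is an upper bound)
pred4 <- as(model4_prediction, "matrix")
mean(is.na(pred4))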
model_to_evaluate <- "POPULAR"
model_parameters <- list(method = "POPULAR")
model5_popular <- Recommender(data = evaltrain, method = model_to_evaluate, parameter = model_parameters)## Available parameter (with default values):
## normalize = center
## aggregationRatings = function (x, na.rm = FALSE, dims = 1, ...) standardGeneric("colMeans")
## aggregationPopularity = function (x, na.rm = FALSE, dims = 1, ...) standardGeneric("colSums")
## verbose = FALSE
items_to_recommend <- 3
model5_prediction <- predict(object = model5_popular, newdata = evalknown, n = items_to_recommend,
type = "ratings")
model5_accuracy <- calcPredictionAccuracy(x = model5_prediction, data = evalunknown,
byUser = FALSE) # byUser =FALSE for model level performance metrics
model5_accuracy

##     RMSE      MSE      MAE
## 1.261816 1.592179 1.038930
modelaccuracycomp <- rbind(model1_accuracy, model2_accuracy, model3_accuracy, model4_accuracy,
model5_accuracy) %>% round(2)
comptable <- modelaccuracycomp
rownames(comptable) <- c("IBCF_Cosine", "IBCF_Pearson", "UBCF_Cosine", "UBCF_Pearson",
"Popular")
colnames(comptable) <- c("RMSE", "MSE", "MAE")
pander(comptable, caption = "Model Comparison")

| | RMSE | MSE | MAE |
|---|---|---|---|
| IBCF_Cosine | 1.43 | 2.04 | 1.09 |
| IBCF_Pearson | 1.52 | 2.31 | 1.23 |
| UBCF_Cosine | 1.2 | 1.43 | 0.92 |
| UBCF_Pearson | NA | NA | NA |
| Popular | 1.26 | 1.59 | 1.04 |
# To evaluate several models at once, we define a list of them. We add RANDOM and
# POPULAR recommenders to the methods compared here
models_to_evaluate <- list(IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
IBCF_pearson = list(name = "IBCF", param = list(method = "pearson")), UBCF_cos = list(name = "UBCF",
param = list(method = "cosine")), UBCF_pearson = list(name = "UBCF", param = list(method = "pearson")),
random = list(name = "RANDOM", param = NULL), pop = list(name = "POPULAR", param = NULL))
# In order to evaluate the models properly, we need to test them while varying the
# number of recommended flavors, as follows
n_recommendations <- c(1, 3, 5, 7, 10, 12, 15)
list_results <- evaluate(x = eval_sets, method = models_to_evaluate, n = n_recommendations,
type = "topNList")
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")listerror <- evaluate(x = eval_sets, method = models_to_evaluate, type = "ratings")
modelcomp <- as.data.frame(sapply(avg(listerror), rbind))
modelcompnew <- as.data.frame(t(as.matrix(modelcomp)))
colnames(modelcompnew) <- c("RMSE", "MSE", "MAE")
pander(modelcompnew, caption = "Model Comparison Based On Varying Recommendation")

| | RMSE | MSE | MAE |
|---|---|---|---|
| IBCF_cos | 1.585 | 2.548 | 1.259 |
| IBCF_pearson | 1.635 | 2.708 | 1.226 |
| UBCF_cos | 0.9715 | 0.9934 | 0.7876 |
| UBCF_pearson | NA | NA | NA |
| random | 1.351 | 1.869 | 1.041 |
| pop | 1.13 | 1.294 | 0.9088 |
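Besides the ROC curve plotted above, recommenderlab can plot precision/recall for the same top-N evaluation runs, which is often more informative when only a handful of recommendations will actually be served. A minimal sketch:

# Precision/recall curves for the same evaluation runs
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-Recall")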
From the ROC Plot
The UBCF Cosine method clearly has the largest area under the curve.
From the RMSE Comparison Tables
Comparing the calculated accuracy measures, UBCF Cosine (User Based Collaborative Filtering with cosine similarity) gives the lowest RMSE, and hence we use it to recommend the next flavors for the kids to try. We use model 3 (model3_UBCF_cosine), built above, to get the recommendations.
# Getting Top 3 Recommendations for All The Kids
recomAll <- predict(model3_UBCF_cosine, icemat[1:14], n = 3)
recomAll

## Recommendations as 'topNList' with n = 3 for 14 users.
recdf <- as.data.frame(as(recomAll, "list"))
recdf <- as.data.frame(t(as.matrix(recdf)))
rownames(recdf) <- icedataorig$Name
colnames(recdf) <- c("Flavor 1", "Flavor 2", "Flavor 3")
pander(recdf, caption = "Special Flavors For You This Summer!!")

| | Flavor 1 | Flavor 2 | Flavor 3 |
|---|---|---|---|
| Molly | BlueberryCheesecake | Coconut | IcyBrownie |
| Adi | BlueberryCheesecake | Coconut | Kulfi |
| Tom | Coconut | NuttyButterScotch | BlueberryCheesecake |
| Pinky | Casata | DiveInChocolate | BlackForest |
| Dino | Casata | ChocolateAlmond | BlueberryCheesecake |
| Peter | BlueberryCheesecake | DiveInChocolate | Coconut |
| Dolly | Coconut | IcyBrownie | NuttyExpresso |
| Nina | Casata | BlackForest | ChocolateAlmond |
| Betty | Casata | BlackForest | DiveInChocolate |
| Polly | ChocolateAlmond | DiveInChocolate | IcyBrownie |
| Sunny | BlackForest | ChocolateAlmond | Coconut |
| Kitty | Kulfi | MochaShot | IrishCoffee |
| Toto | Kulfi | IcyBrownie | MochaShot |
| Riya | Coconut | Kulfi | MochaShot |
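The same model can also be queried for a single kid. A minimal sketch (row 1 of the rating matrix corresponds to Molly, the first survey respondent):

# Top-3 recommendations for one kid (row 1 of icemat)
one_kid <- predict(model3_UBCF_cosine, icemat[1, ], n = 3)
as(one_kid, "list")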