Packages and Themes Used

library(data.table)
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(stringr)
library(rattle)
library(corrplot)
library(reshape2)
library(nnet)
library(xlsx)
library(MASS)
library(colorRamps)
library(fpc)
library(factoextra)
library(cluster)
library(NbClust)
library(mclust)
library(arules)
library(arulesViz)
library(knitr)   # for kable(), used below to print the association-rule tables

my_theme <- theme_bw() +
  theme(axis.title = element_text(size = 14),
        plot.title = element_text(size = 20),
        axis.text  = element_text(size = 8))

my_theme_dark <- theme_dark() +
  theme(axis.title = element_text(size = 24),
        plot.title = element_text(size = 36),
        axis.text  = element_text(size = 16))

Data Source

The Cincinnati Zoo was founded in 1873 and officially opened in 1875. It is the second oldest zoo in the nation after the Philadelphia Zoo and serves over a million visitors each year. The goal of this analysis is to identify useful and/or hidden information in the data collected by the zoo and to study the buying and visiting behavior of zoo members. The data comes from two files: one with aggregated sales data of food items across the months Oct'10, Nov'10, Dec'10, Jan'11, Feb'11, and Mar'11, and another with individual transaction data of customer purchases across the same months and food items.

Data Reading and Preparation

food_by_month <- read.xlsx2("qry_Food_by_Month.xlsx", sheetName = "qry_Food_by_MonthEdited")
colnames(food_by_month) <- c("Product", "Oct", "Nov", "Dec", "Jan", "Feb", "Mar")
food_by_month[,c(2,3,4,5,6,7)] <- apply(food_by_month[,-c(1)],2,function(x) as.numeric(x))
# Replace missing monthly quantities with zero
food_by_month[, -1][is.na(food_by_month[, -1])] <- 0
food_by_month_long <- food_by_month %>% gather(Month, Quantity, Oct:Mar)
food_by_month_long$Month <- factor(food_by_month_long$Month, c("Oct","Nov","Dec","Jan","Feb","Mar"), ordered=TRUE)

food_by_month_scaled <- food_by_month
rownames(food_by_month_scaled) <- food_by_month[,1]
food_by_month_scaled[,-c(1)] <- scale(food_by_month[,-c(1)])


food_items <- read.csv("food_4_association_15s.csv")
food_items <- food_items[, -1]
colnames(food_items) <- gsub("Food", "", colnames(food_items))
food_items_t <- t(food_items)
food_items_t <- cbind(Food_Products = rownames(food_items_t), food_items_t)
food_items_t <- as.data.frame(food_items_t)
food_items_t[,-c(1)] <- apply(food_items_t[,-c(1)], 2, function(x) as.numeric(x))

Exploratory Data Analysis

parcoord(food_by_month[,-c(1)],var.label=TRUE)

The parallel coordinates plot shows that many food items move together, picking up sales in November and February.

time_series_plot <- food_by_month_long %>% 
  group_by(Month) %>% 
  summarise(Total = sum(Quantity, na.rm = TRUE)) %>% 
  ggplot(aes(x = Month, y = Total, group = 1)) + 
  geom_line() + 
  ggtitle("Sales Trend across Time") +
  my_theme
time_series_plot

The trend plot shows us that October has the highest sales and the sales plummet the most in January, gradually picking up as Spring approaches.

top_5_items_plot <- food_by_month_long %>% 
  group_by(Month) %>% 
  arrange(desc(Quantity)) %>% 
  slice(1:5) %>% 
  ggplot(aes(x = factor(Month), y = Quantity, fill = Product)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Top 5 Selling Items each Month") + 
  my_theme 
top_5_items_plot

The plot above shows that drinks and ice cream sell the most in November, probably because of the lingering warm weather. Similarly, December and January sell a lot of hot chocolate, since people enjoy a hot beverage on a cold day. Ice cream sales pick up again in March as the cold recedes. Among food items, slices of cheese and pepperoni pizza are favourites irrespective of season.

Cluster Analysis on Zoo Food Sales Data

Clustering is an exploratory data analysis technique used for discovering groups or patterns in a dataset. There are two standard clustering strategies: partitioning methods (e.g., k-means and PAM) and hierarchical clustering.
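
K-means is used as the main partitioning method below. For reference, a minimal sketch of the other partitioning method mentioned above, PAM (partitioning around medoids) from the cluster package, applied to the same scaled sales data, might look like this (k = 2 is an assumption made only to mirror the cluster count favoured later in the analysis):

# PAM (partitioning around medoids) on the scaled monthly sales; k = 2 is an
# assumption here, chosen to mirror the cluster count favoured later on.
pam_clusters <- pam(food_by_month_scaled[, -c(1)], k = 2, metric = "euclidean")
# Visualize the medoid-based clusters in the same style as the k-means plots
fviz_cluster(pam_clusters, geom = "point", ellipse.type = "convex") + theme_minimal()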

K-Means Clustering

K-means clustering is the simplest and most commonly used partitioning method for splitting a dataset into a set of k groups (i.e., clusters). It requires the analyst to specify the number of clusters to be generated. The number of clusters k can be determined in two ways: (1) using a descriptive criterion such as the within-cluster sum of squares (WSS), the silhouette coefficient, or Dunn's index, or (2) by comparison against a resampled or simulated reference of no clustering, as in the prediction strength and the gap statistic.

The within-cluster sum of squares is calculated as follows. First, k cluster centers are chosen (initially they can be random). Each observation is then assigned to its closest center, where the distance between an observation and a cluster is the Euclidean distance between the observation and the cluster center, and the centers are recomputed as the means of their assigned observations; this assignment and update step is iterated until the clusters stabilize. The sum of the squared distances of observations to their assigned cluster centers is the WSS. Applying the elbow method to the WSS curve below suggests 2 as the optimal number of clusters.

wss <- NULL
for (i in 1:12) wss[i] <- sum(kmeans(food_by_month_scaled[,-c(1)],centers=i)$withinss)
plot(1:12, wss, type="b", pch=19, xlab="Number of Clusters",ylab="Within groups sum of squares")
abline(v = 3, lty =2)

fviz_nbclust(food_by_month_scaled[,-c(1)], kmeans, method = "wss") +geom_vline(xintercept = 2, linetype = 2)

prediction.strength(food_by_month_scaled[,-c(1)], Gmin=2, Gmax=15, M=10,cutoff=0.8)
Prediction strength 
Clustering method:  kmeans 
Maximum number of clusters:  15 
Resampled data sets:  10 
Mean pred.str. for numbers of clusters:  1 0.9064832 0.5329637 0.4214886 0.4818452 0.3291941 0.2464502 0.2742857 0.2029762 0.1183333 0.1416667 0.1161111 0.08833333 0.025 0.05833333 
Cutoff value:  0.8 
Largest number of clusters better than cutoff:  2 

Silhouette coefficients near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. The procedure is similar to the elbow method: (1) run the clustering algorithm (e.g., k-means) for different values of k, say from 1 to 10; (2) for each k, calculate the average silhouette width of the observations (avg.sil); (3) plot avg.sil against the number of clusters k; the location of the maximum is taken as the appropriate number of clusters.

The silhouette index also gives us 2 clusters.

distance = dist(food_by_month_scaled[,-c(1)], method = "euclidean")
result = matrix(nrow = 15, ncol = 3)
result[1,1] <- 1
result[1,c(2,3)] <- 0
for (i in 2:15){
  cluster_result = kmeans(food_by_month_scaled[,-c(1)], i)
  clusterstat=cluster.stats(distance, cluster_result$cluster)
  result[i,1]=i
  result[i,2]=clusterstat$avg.silwidth
  result[i,3]=clusterstat$dunn   
}
plot(result[,c(1,2)], type="l", ylab = 'silhouette width', xlab = 'number of clusters', pch=19)
abline(v = result[which.max(result[,2]),1], lty = 2)

fviz_nbclust(food_by_month_scaled[,-c(1)], kmeans, method = "silhouette")

The Dunn index relates cluster compactness (the maximum distance between data points within a cluster) to cluster separation (the minimum distance between clusters). It serves as a measure for finding the right number of clusters in a dataset: the maximum value of the index indicates the best partitioning (the highest separation between clusters and the least spread within clusters). The plot below shows that the Dunn index gives 15 as the optimal number of clusters.

plot(result[,c(1,3)], type="l", ylab = 'dunn index', xlab = 'number of clusters',pch=19)
abline(v = result[which.max(result[,3]),1], lty = 2)

Gap Statistic

The disadvantage of the elbow and average silhouette methods is that they only measure a global clustering characteristic. A more sophisticated method is the gap statistic, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters. Let the distance d be the squared Euclidean distance and Wk be the pooled within-cluster sum of squares around the cluster means. The approach standardizes the graph of log(Wk) by comparing it with its expectation under an appropriate null reference distribution of the data. The estimate of the optimal number of clusters is the value of k for which log(Wk) falls the farthest below this reference curve. The reference dataset is generated using Monte Carlo simulation of the sampling process: for each variable in the dataset, values for the n points are drawn uniformly between that variable's minimum and maximum. The gap statistic shows 9 as the optimal number of clusters.

gap_stat <- clusGap(food_by_month_scaled[,-c(1)], FUN = kmeans, nstart = 25,K.max = 10, B = 50)
plot(gap_stat, frame = FALSE, xlab = "Number of clusters k")
abline(v = 4, lty = 2)

fviz_gap_stat(gap_stat)

The NbClust package, published by Charrad et al. (2014), provides 30 indices for determining the relevant number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying combinations of the number of clusters, distance measures, and clustering methods. An important advantage of NbClust is that the user can compute multiple indices simultaneously and determine the number of clusters in a single function call. The indices provided in the NbClust package include the gap statistic, the silhouette method, and 28 other indices described comprehensively in the original paper of Charrad et al. (2014). Here, the largest number of indices (5) vote for 2 as the optimal number of clusters, whereas 4 indices vote for 4 clusters.

nb <- NbClust(food_by_month_scaled[,-c(1)], distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index ="all")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 5 proposed 2 as the best number of clusters 
* 3 proposed 3 as the best number of clusters 
* 4 proposed 4 as the best number of clusters 
* 3 proposed 5 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 3 proposed 7 as the best number of clusters 
* 1 proposed 9 as the best number of clusters 
* 3 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 
fviz_nbclust(nb) + theme_minimal()
Among all indices: 
===================
* 2 proposed  0 as the best number of clusters
* 1 proposed  1 as the best number of clusters
* 5 proposed  2 as the best number of clusters
* 3 proposed  3 as the best number of clusters
* 4 proposed  4 as the best number of clusters
* 3 proposed  5 as the best number of clusters
* 1 proposed  6 as the best number of clusters
* 3 proposed  7 as the best number of clusters
* 1 proposed  9 as the best number of clusters
* 3 proposed  10 as the best number of clusters

Conclusion
=========================
* According to the majority rule, the best number of clusters is  2 .

Visualizing the Clusters

optimal_clusters_2 <- kmeans(food_by_month_scaled[,-c(1)],centers=2)
fviz_cluster(optimal_clusters_2, data = food_by_month_scaled[,-c(1)], geom = "point",stand = FALSE, frame.type = "norm")

fviz_cluster(optimal_clusters_2, data =  food_by_month_scaled[,-c(1)], ellipse.type = "convex")+theme_minimal()

optimal_clusters_4 <- kmeans(food_by_month_scaled[,-c(1)],centers=4)
fviz_cluster(optimal_clusters_4, data = food_by_month_scaled[,-c(1)], geom = "point",stand = FALSE, frame.type = "norm")

fviz_cluster(optimal_clusters_4, data =  food_by_month_scaled[,-c(1)], ellipse.type = "convex")+theme_minimal()

Hierarchical Clustering

Hierarchical clustering is an alternative to k-means clustering for identifying groups in the dataset. It does not require pre-specifying the number of clusters to be generated. The result is a tree-based representation of the observations called a dendrogram. It uses the pairwise distance matrix between observations as its clustering criterion. Hierarchical clustering can be divided into two main types: agglomerative and divisive.

Agglomerative clustering: also known as AGNES (Agglomerative Nesting), it works in a bottom-up manner. Each object is initially considered a single-element cluster (leaf). At each step of the algorithm, the two most similar clusters are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of one single big cluster (root). The result is a tree which can be plotted as a dendrogram.

Divisive hierarchical clustering: also known as DIANA (Divisive Analysis), it works in a top-down manner, the inverse order of AGNES. It begins with the root, in which all objects are included in a single cluster. At each iteration, the most heterogeneous cluster is divided into two. The process is repeated until every object is in its own cluster.

A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to compute the dissimilarity between clusters; a short comparison of some of them is sketched after this list.

1. Maximum or complete linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the largest (maximum) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
2. Minimum or single linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the smallest of these dissimilarities as the linkage criterion. It tends to produce long, "loose" clusters.
3. Mean or average linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the average of these dissimilarities as the distance between the two clusters.
4. Centroid linkage clustering: computes the dissimilarity between the centroid of cluster 1 (a mean vector of length p, the number of variables) and the centroid of cluster 2.
5. Ward's minimum variance method: minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.
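
Before settling on a linkage method, one quick way to compare them is the agglomerative coefficient reported by agnes() from the cluster package, where values closer to 1 suggest a stronger clustering structure. A minimal sketch (the set of method names compared here is an illustrative choice, not something prescribed by this analysis):

# Compare linkage methods by agglomerative coefficient (closer to 1 = stronger structure)
linkage_methods <- c("average", "single", "complete", "ward")
sapply(linkage_methods, function(m) agnes(food_by_month_scaled[, -c(1)], method = m)$ac)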

food_by_month_scaled <- food_by_month_scaled[,-c(1)]
distance_measure <- dist(food_by_month_scaled, method = "euclidean")

Elbow method for hierarchical clustering gives us 2 clusters

fviz_nbclust(food_by_month_scaled, hcut, method = "wss", hc_method="complete") +geom_vline(xintercept = 2, linetype = 2)

Silhouette method for hierarchical clustering gives us 4 clusters

fviz_nbclust(food_by_month_scaled, hcut, method = "silhouette", hc_method = "complete")

Gap Statistic gives 9 clusters for hierarchical clustering

gap_stat_hclust <- clusGap(food_by_month_scaled, FUN = hcut, K.max = 10, B = 50)
fviz_gap_stat(gap_stat_hclust)

A few clustering methods have been tried below.

1. Maximum or Complete Linkage Clustering

complete_hclust <- hclust(distance_measure, method = "complete")
plot(complete_hclust, cex = 0.6, hang = -1)
rect.hclust(complete_hclust, k = 2, border = 2:4)

2. Minimum or Single Linkage Clustering

single_hclust <- hclust(distance_measure, method = "single" )
plot(single_hclust, cex = 0.6, hang = -1)
rect.hclust(single_hclust, k = 2, border = 2:4)

3. Ward's Minimum Variance Clustering

ward_hclust <- hclust(distance_measure, method = "ward.D2" )
plot(ward_hclust, cex = 0.6, hang = -1)
rect.hclust(ward_hclust, k = 2, border = 2:4)

4. Divisive Clustering

divisive_clust <- diana(food_by_month_scaled)
pltree(divisive_clust, cex = 0.6, hang = -1,main = "Dendrogram of diana")
rect.hclust(as.hclust(divisive_clust), k = 2, border = 2:4)

Model Based Clustering

Traditional clustering methods such as hierarchical clustering and partitioning algorithms (k-means and others) are heuristic and are not based on formal models. An alternative is model-based clustering, in which the data are considered as coming from a distribution that is a mixture of two or more components (i.e., clusters). Each component k (i.e., group or cluster) is modeled by a normal (Gaussian) distribution characterized by a mean vector, a covariance matrix, and an associated mixing probability. Each point has a probability of belonging to each cluster. The model parameters can be estimated with the EM (Expectation-Maximization) algorithm, initialized by hierarchical model-based clustering. Each cluster k is centered at its mean, with increased density for points near the mean; the geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix. The key advantage of the model-based approach, compared to the standard clustering methods (k-means, hierarchical clustering, etc.), is that it suggests both the number of clusters and an appropriate model. Model-based clustering suggests 3 clusters.

mclust_result = Mclust(food_by_month_scaled)
summary(mclust_result)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust VEE (ellipsoidal, equal shape and orientation) model with 3 components:

 log.likelihood  n df       BIC       ICL
      -10.04923 55 43 -192.4138 -193.2773

Clustering table:
 1  2  3 
17 13 25 
plot(mclust_result)

fviz_cluster(mclust_result, frame.type = "norm", geom = "point")

fviz_cluster(mclust_result, data =  food_by_month_scaled, ellipse.type = "convex")+theme_minimal()

Fuzzy Clustering

In k-means or PAM clustering the data is divided into distinct clusters, and each element is assigned to exactly one cluster. This is also known as hard, or non-fuzzy, clustering. Fuzzy clustering, by contrast, is a form of soft clustering, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to its degree of belonging to each cluster. Points close to the center of a cluster may belong to it to a higher degree than points at the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value in [0, 1]. The fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. It was developed by Dunn in 1973, improved by Bezdek in 1981, and is frequently used in pattern recognition.

fuzzy_cluster <- fanny(food_by_month_scaled, 2, memb.exp = 2, metric = "euclidean", 
      stand = FALSE, maxit = 500)
fviz_cluster(fuzzy_cluster, frame.type = "norm",frame.level = 0.68)

fuzzy_cluster$membership
                                     [,1]       [,2]
Cheese                          0.7558224 0.24417764
Alchohol                        0.8111565 0.18884352
Bottled Water                   0.1834673 0.81653271
Burger                          0.8305382 0.16946181
Capri Sun                       0.9348599 0.06514014
Cheese Fries Basket             0.8986300 0.10137002
Cheeseburger Basket             0.2751497 0.72485028
Chicken Nugget Basket           0.7799857 0.22001435
Chicken Tender Basket           0.2572766 0.74272340
Chili Cheese Sandwich           0.9369916 0.06300840
Chips                           0.5345088 0.46549124
Coffee/HotTea                   0.6525562 0.34744381
Coney                           0.8189743 0.18102566
Diet Dr Pepper                  0.9495080 0.05049202
Diet Pepsi                      0.9358669 0.06413315
Dr Pepper                       0.9487428 0.05125721
Fish Basket                     0.9442882 0.05571180
Float                           0.9433120 0.05668802
French Fries Basket             0.2399619 0.76003813
Funnel Cake Topping             0.8014079 0.19859212
Gatorade                        0.2710607 0.72893934
Gourmet Cup                     0.7935908 0.20640925
Grilled Chicken Sandwich Basket 0.8533913 0.14660866
Hot Chocolate                   0.3964081 0.60359186
Hot Chocolate Souvenir          0.4486946 0.55130539
Hot Dog Basket                  0.3069078 0.69309220
IceCreamCone                    0.3816168 0.61838315
Iced Tea                        0.9361226 0.06387739
Icee                            0.7974893 0.20251067
Krazy Kritter                   0.2285233 0.77147666
Land Shark                      0.9339273 0.06607272
Medium Drink                    0.1558466 0.84415342
Milk                            0.7341856 0.26581438
Mountain Dew                    0.9412488 0.05875121
Nacho                           0.9447963 0.05520365
Pepsi                           0.9137052 0.08629476
Pink Lemonade                   0.9384886 0.06151136
Red River Hog                   0.9430311 0.05696891
Salad                           0.9184391 0.08156087
Sandwich Basket                 0.4918617 0.50813826
Shinerboch                      0.9376312 0.06236879
Siberian Chill                  0.9402878 0.05971217
Sierra Mist                     0.9467903 0.05320974
Small Drink                     0.1639031 0.83609693
Small/Large Fry                 0.9266281 0.07337192
Snack                           0.2296198 0.77038019
Soft Pretzel                    0.3491428 0.65085721
Soup                            0.9311814 0.06881863
Souvenir Drink                  0.2536995 0.74630051
Special                         0.9336584 0.06634157
Stella                          0.9394286 0.06057143
Veggie Burger Basket            0.9078197 0.09218031
Walking 3-Way                   0.9437671 0.05623295
Whole/Slice Cheese              0.2518961 0.74810388
Whole/Slice Pepp                0.1886052 0.81139484
corrplot(fuzzy_cluster$membership, is.corr = FALSE)

Association Mining

Association rule mining is primarily focused on finding frequent co-occurring associations among a collection of items. It is sometimes referred to as "Market Basket Analysis", since that was the original application area of association mining. The goal is to find associations of items that occur together more often than you would expect from a random sampling of all possibilities. Association rules analysis on the food data was done using the arules package in R. A minimum support of 0.3% together with a confidence of 50% was set. There were 19,066 transactions, defined by combining customer name with date of purchase.
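
The purchase file used here was pre-processed outside this script into the 0/1 incidence matrix read in earlier; purely as an illustration of how such transactions can be formed, a toy sketch with arules is shown below (the data frame and its column names are made up, not the actual file layout):

# Toy illustration only: combine customer name and purchase date into a
# transaction id, then split the purchased items by that id and coerce the
# resulting list of item vectors to an arules 'transactions' object.
raw_purchases <- data.frame(
  CustomerName = c("A. Smith", "A. Smith", "B. Jones"),
  PurchaseDate = c("2010-10-02", "2010-10-02", "2010-11-15"),
  Item         = c("Bottled.Water", "Slice.of.Cheese", "Hot.Chocolate"),
  stringsAsFactors = FALSE)
trans_id <- paste(raw_purchases$CustomerName, raw_purchases$PurchaseDate, sep = "_")
toy_trans <- as(split(raw_purchases$Item, trans_id), "transactions")
inspect(toy_trans)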

TransFood <- as(as.matrix(food_items), "transactions")

A summary of the transactions is given below-

summary(TransFood)
transactions as itemMatrix in sparse format with
 19066 rows (elements/itemsets/transactions) and
 118 columns (items) and a density of 0.02230966 

most frequent items:
  Bottled.Water Slice.of.Cheese    Medium.Drink     Small.Drink 
           3165            3071            2871            2768 
  Slice.of.Pepp         (Other) 
           2353           35964 

element (itemset/transaction) length distribution:
sizes
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
 197 5671 5174 3253 2128 1292  655  351  178   95   42   14    8    7    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.633   4.000  15.000 

includes extended item information - examples:
         labels
1    Add.Cheese
2          Beer
3 Bottled.Water

A plot of the most frequent items sold -

itemFrequencyPlot(TransFood, support = 0.1, cex.names=0.8)

Association rules are generated for this dataset using a support of 0.3% and a confidence of 50%. The rules with the highest confidence combine Bottled Water, Topping and Ice Cream Cone, and Cheeseburger, Chicken Tenders and French Fries Basket. The rules with the highest lift combine Hot Dog, Side of Cheese and Cheese Coney, and Small Pink Lemonade with Chicken Nugget Basket. Lift can be interpreted as follows: the probability of finding a Cheese Coney in a transaction that contains a Hot Dog and a Side of Cheese is about 26 times the overall probability of finding a Cheese Coney (lift = 26.17). Confidence can be interpreted as follows: 96.2% of the transactions containing a Cheeseburger and Chicken Tenders also contained a French Fries Basket.

basket_rules <- apriori(TransFood,parameter = list(sup = 0.003, conf = 0.5,target="rules"))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5   0.003      1
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 57 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[115 item(s), 19066 transaction(s)] done [0.01s].
sorting and recoding items ... [74 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [42 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
kable(inspect(head(sort (basket_rules, by="confidence", decreasing=TRUE))))
    lhs                  rhs                       support confidence      lift
[1] {Bottled.Water,                                                            
     Topping}         => {Ice.Cream.Cone}      0.004038603  1.0000000  8.959586
[2] {Topping}         => {Ice.Cream.Cone}      0.028584916  0.9981685  8.943177
[3] {Cheeseburger,                                                             
     Chicken.Tenders} => {French.Fries.Basket} 0.003933704  0.9615385  9.850990
[4] {Chicken.Tenders,                                                          
     Krazy.Kritter}   => {French.Fries.Basket} 0.005664534  0.9557522  9.791710
[5] {Cheese.Coney,                                                             
     Side.of.Cheese}  => {Hot.Dog}             0.004353299  0.9325843 21.816751
[6] {Side.of.Cheese}  => {Hot.Dog}             0.006293926  0.9230769 21.594337
kable(inspect(head(sort (basket_rules, by="lift", decreasing=TRUE))))
    lhs                                rhs                          support confidence     lift
[1] {Hot.Dog,                                                                                  
     Side.of.Cheese}                => {Cheese.Coney}           0.004353299  0.6916667 26.16531
[2] {Side.of.Cheese}                => {Cheese.Coney}           0.004667995  0.6846154 25.89857
[3] {Cheese.Coney,                                                                             
     Side.of.Cheese}                => {Hot.Dog}                0.004353299  0.9325843 21.81675
[4] {Side.of.Cheese}                => {Hot.Dog}                0.006293926  0.9230769 21.59434
[5] {Small.Pink.Lemonade}           => {Chicken.Nugget.Basket}  0.003304311  0.5887850 15.94570
[6] {Hot.Chocolate.Souvenir.Refill} => {Hot.Chocolate.Souvenir} 0.014948075  0.5588235 13.18630
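
As a sanity check on how these measures are defined, support, confidence and lift for a single rule can be recomputed directly from the incidence matrix the transactions were built from. A minimal sketch for the rule {Side.of.Cheese} => {Hot.Dog} (column names taken from the rule output above, assuming food_items holds 0/1 indicators):

# Recompute support, confidence and lift by hand for {Side.of.Cheese} => {Hot.Dog}
lhs <- food_items[, "Side.of.Cheese"] == 1   # transactions containing the LHS
rhs <- food_items[, "Hot.Dog"] == 1          # transactions containing the RHS
n   <- nrow(food_items)                      # total number of transactions
c(support    = sum(lhs & rhs) / n,                            # P(LHS and RHS)
  confidence = sum(lhs & rhs) / sum(lhs),                     # P(RHS | LHS)
  lift       = (sum(lhs & rhs) / sum(lhs)) / (sum(rhs) / n))  # confidence / P(RHS)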

Here’s a look at rules with more than 3 items in total (LHS plus RHS)

kable(inspect(subset(basket_rules, size(basket_rules)>3)))
    lhs                  rhs                   support confidence     lift
[1] {Krazy.Kritter,                                                       
     Medium.Drink,                                                        
     Slice.of.Pepp}   => {Slice.of.Cheese} 0.003251862  0.5535714 3.436794
[2] {Medium.Drink,                                                        
     Slice.of.Pepp,                                                       
     Small.Drink}     => {Slice.of.Cheese} 0.003146963  0.6000000 3.725041
[3] {Medium.Drink,                                                        
     Slice.of.Cheese,                                                     
     Small.Drink}     => {Slice.of.Pepp}   0.003146963  0.5172414 4.191128

Here’s a look at the rules with Bottled Water in the LHS and having a lift value greater than 5-

Bottled.WaterFood.lhs <- subset(basket_rules, subset = lhs %in% "Bottled.Water" & lift>5 )
kable(inspect(Bottled.WaterFood.lhs))
    lhs                  rhs                       support confidence     lift
[1] {Bottled.Water,                                                           
     Topping}         => {Ice.Cream.Cone}      0.004038603  1.0000000 8.959586
[2] {Add.Cheese,                                                              
     Bottled.Water}   => {Soft.Pretzel..3_39}  0.003828805  0.8021978 8.749830
[3] {Bottled.Water,                                                           
     Chicken.Tenders} => {French.Fries.Basket} 0.003461659  0.7586207 7.772091
[4] {Bottled.Water,                                                           
     Cheeseburger}    => {French.Fries.Basket} 0.003094514  0.7662338 7.850088

A scatter plot of the rules by their support, confidence and lift values is given below. Most of the rules have low support, and most rules with very low confidence also have low lift.

plot(basket_rules)

A graph of the top 10 rules by lift values -

plot(head(sort(basket_rules, by="lift"), 10), method = "graph")

A grouped matrix plot of the rules -

plot(basket_rules, method="grouped")

An attempt is made below to cluster the food items according to their purchase frequencies and patterns using k-means clustering.

food_items_t[,-c(1)] <- scale(food_items_t[,-c(1)])
food_items_t <- food_items_t[, colSums(is.na(food_items_t)) != nrow(food_items_t)]
food_product_clusters <- kmeans(food_items_t[,-c(1)],centers=3)
fviz_cluster(food_product_clusters, data = food_items_t[,-c(1)], geom = "point",stand = FALSE, frame.type = "norm")

fviz_cluster(food_product_clusters, data =  food_items_t[,-c(1)], ellipse.type = "convex")+   theme_minimal()

cluster_1 <- as.vector(food_items_t[food_product_clusters$cluster==1,1])
cluster_2 <- as.vector(food_items_t[food_product_clusters$cluster==2,1])
cluster_3 <- as.vector(food_items_t[food_product_clusters$cluster==3,1])

The clusters are as follows -

kable(cluster_1)
Bottled.Water
Medium.Drink
Slice.of.Cheese
Slice.of.Pepp
Small.Drink
kable(cluster_2)
Add.Cheese
Beer
Bowl.of.Chili.w.Cheese
Bowl.of.Chili
Brat.Sausage
Bud.Light
Burger.Basket
Burger
Buy.One.Get.One.Ice.Cream.Cone
Capri.Sun
Caramel.Apple.w.Topping
Caramel.Apple
Cheese.Coney
Cheese.Fries.Basket
Cheeseburger.Basket
Cheeseburger
Chicken.Nugget.Basket
Chicken.Nuggets
Chicken.Tender.Basket
Chicken.Tenders
Chili.Cheese.Sandwich
Chili
Chips
Coffee
Coney
Cookie
Cotton.Candy
Double.Cheeseburger
Fish.Basket
Fish.Sandwich
Float
French.Fries.Basket
Funnel.Cake.Topping
Funnel.Cake
Gatorade
Giraffe.Cone
Gourmet.Cup
Grilled.Chicken.Sandwich.Basket
Grilled.Chicken.Sandwich
Hot.Chocolate.Souvenir.Refill
Hot.Chocolate.Souvenir
Hot.Chocolate.w.Liquor.Shot
Hot.Chocolate
Hot.Dog.Basket
Hot.Dog
Hot.Tea
Icee.Buy.One.Get.One
Icee
Kettle.Chips
Krazy.Kritter
Land.Shark
Large.Fry
Margarita
Medium.Diet.Dr.Pepper
Medium.Diet.Pepsi
Medium.Dr.Pepper
Medium.Iced.Tea
Medium.Mountain.Dew
Medium.Pepsi
Medium.Pink.Lemonade
Medium.Sierra.Mist
Milk
Nacho
October.Fest
Popcorn
Red.River.Hog
Rice.Krispie.Treat
Salad
Sandwich.Basket
Sandwich
Shinerboch
Siberian.Chill
Side.of.Cheese
Small.Diet.Dr.Pepper
Small.Diet.Pepsi
Small.Dr.Pepper
Small.Fry
Small.Iced.Tea
Small.Mountain.Dew
Small.Pepsi
Small.Pink.Lemonade
Small.Sierra.Mist
Soft.Pretzel..3_39
Soft.Pretzel..3_89
Soup
Souvenir.Diet.Dr.Pepper.Refill
Souvenir.Diet.Dr.Pepper
Souvenir.Diet.Pepsi.Refill
Souvenir.Diet.Pepsi
Souvenir.Dr.Pepper.Refill
Souvenir.Dr.Pepper
Souvenir.Drink
Souvenir.Iced.Tea.Refill
Souvenir.Iced.Tea
Souvenir.Mountain.Dew.Refill
Souvenir.Mountain.Dew
Souvenir.Pepsi.Refill
Souvenir.Pepsi
Souvenir.Pink.Lemonade.Refill
Souvenir.Pink.Lemonade
Souvenir.Refill
Souvenir.Sierra.Mist.Refill
Souvenir.Sierra.Mist
Special
Stella
Topping
Vanilla.Latte
Veggie.Burger.Basket
Veggie.Burger
Walking.3.Way
Whole.Cheese
Whole.Pepp
kable(cluster_3)
Ice.Cream.Cone