In this assignment we work with the Grocery dataset and apply market basket analysis and cluster analysis to the data. The dataset includes:

  • 9835 transactions (observations)
  • Up to 32 items in any transaction
  • 169 items
path <- "C:/Users/Kevin/Google Drive/Documents/CUNY MSDS/DATA624/Week13/"
df <- read_csv(paste0(path, "GroceryDataSet.csv"), 
               col_names = FALSE,
               col_types = cols(.default = col_character()))
#str(df)
#glimpse(df)

Market basket analysis

We use the arules and arulesViz libraries to perform and visualize a market basket analysis of the data. From the item frequency plot below, we see that the 10 top items represented in the transaction data include:

  • Whole milk
  • Other vegetables
  • Rolls / buns
  • Soda
  • Yogurt
  • Bottled water
  • Root vegetables
  • Tropical fruit
  • Shopping bags
  • Sausage
# load & review data
trans <- read.transactions(paste0(path, "GroceryDataSet.csv"), 
                        format = "basket", 
                        header = FALSE, 
                        sep = ",")
summary(trans)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
#inspect(trans)

# top items
#itemFrequency(trans)
itemFrequencyPlot(trans, type = "relative", topN = 25, horiz = TRUE, 
                  main = "Relative item frequency: Top 25 items")

Now we use the apriori function to find association rules in the transaction data. After some experimentation, we set the parameters as follows in order to generate a reasonably large set of rules:

  • Minimum support: 0.01
  • Minimum confidence: 0.2.

The result of running the Apriori mining algorithm with these parameters is a set of 232 association rules, virtually all based on a total of 2 or 3 items.

# find rules: vary conf 0.5 = 15 rules, 0.4 = 62 rules, 0.3 = 125 rules, 0.2 = 232
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [232 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules)
## set of 232 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3 
##   1 151  80 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.341   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift            count       
##  Min.   :0.01007   Min.   :0.2006   Min.   :0.8991   Min.   :  99.0  
##  1st Qu.:0.01200   1st Qu.:0.2470   1st Qu.:1.4432   1st Qu.: 118.0  
##  Median :0.01490   Median :0.3170   Median :1.7277   Median : 146.5  
##  Mean   :0.02005   Mean   :0.3321   Mean   :1.7890   Mean   : 197.2  
##  3rd Qu.:0.02227   3rd Qu.:0.4033   3rd Qu.:2.0762   3rd Qu.: 219.0  
##  Max.   :0.25552   Max.   :0.5862   Max.   :3.2950   Max.   :2513.0  
## 
## mining info:
##   data ntransactions support confidence
##  trans          9835    0.01        0.2
#inspect(rules)

The top 15 rules based on lift are shown in the table below. We see that rules associated with “root vegetables” or “other vegetables” as consequent (RHS) dominate the top 10 rules, with lift ratios ranging from 2.7x to 3.3x.

# top 15 by lift
inspect(head(rules, n = 15, by = "lift"))
##      lhs                   rhs                     support confidence     lift count
## [1]  {citrus fruit,                                                                 
##       other vegetables} => {root vegetables}    0.01037112  0.3591549 3.295045   102
## [2]  {other vegetables,                                                             
##       yogurt}           => {whipped/sour cream} 0.01016777  0.2341920 3.267062   100
## [3]  {other vegetables,                                                             
##       tropical fruit}   => {root vegetables}    0.01230300  0.3427762 3.144780   121
## [4]  {beef}             => {root vegetables}    0.01738688  0.3313953 3.040367   171
## [5]  {citrus fruit,                                                                 
##       root vegetables}  => {other vegetables}   0.01037112  0.5862069 3.029608   102
## [6]  {root vegetables,                                                              
##       tropical fruit}   => {other vegetables}   0.01230300  0.5845411 3.020999   121
## [7]  {other vegetables,                                                             
##       whole milk}       => {root vegetables}    0.02318251  0.3097826 2.842082   228
## [8]  {curd,                                                                         
##       whole milk}       => {yogurt}             0.01006609  0.3852140 2.761356    99
## [9]  {other vegetables,                                                             
##       yogurt}           => {root vegetables}    0.01291307  0.2974239 2.728698   127
## [10] {other vegetables,                                                             
##       yogurt}           => {tropical fruit}     0.01230300  0.2833724 2.700550   121
## [11] {other vegetables,                                                             
##       root vegetables}  => {citrus fruit}       0.01037112  0.2188841 2.644626   102
## [12] {other vegetables,                                                             
##       rolls/buns}       => {root vegetables}    0.01220132  0.2863962 2.627525   120
## [13] {tropical fruit,                                                               
##       whole milk}       => {root vegetables}    0.01199797  0.2836538 2.602365   118
## [14] {rolls/buns,                                                                   
##       root vegetables}  => {other vegetables}   0.01220132  0.5020921 2.594890   120
## [15] {root vegetables,                                                              
##       yogurt}           => {other vegetables}   0.01291307  0.5000000 2.584078   127

For comparison we also show the top 15 rules based on confidence in the table below, where the confidence levels range from 0.5 to almost 0.6. In contrast to the top rules by lift, we see that the top rules by confidence are dominated by rules associated with “whole milk” as consequent (RHS). Interestingly, the top 2 rules by confidence are identical to rules #5 and #6 by lift, whereas there is no other overlap between the two tables:

  • {citrus fruit + root vegetables} \(\implies\) {other vegetables}
  • {root vegetables + tropical fruit} \(\implies\) {other vegetables}
# top 15 by confidence
inspect(head(rules, n = 15, by = "confidence"))
##      lhs                     rhs                   support confidence     lift count
## [1]  {citrus fruit,                                                                 
##       root vegetables}    => {other vegetables} 0.01037112  0.5862069 3.029608   102
## [2]  {root vegetables,                                                              
##       tropical fruit}     => {other vegetables} 0.01230300  0.5845411 3.020999   121
## [3]  {curd,                                                                         
##       yogurt}             => {whole milk}       0.01006609  0.5823529 2.279125    99
## [4]  {butter,                                                                       
##       other vegetables}   => {whole milk}       0.01148958  0.5736041 2.244885   113
## [5]  {root vegetables,                                                              
##       tropical fruit}     => {whole milk}       0.01199797  0.5700483 2.230969   118
## [6]  {root vegetables,                                                              
##       yogurt}             => {whole milk}       0.01453991  0.5629921 2.203354   143
## [7]  {domestic eggs,                                                                
##       other vegetables}   => {whole milk}       0.01230300  0.5525114 2.162336   121
## [8]  {whipped/sour cream,                                                           
##       yogurt}             => {whole milk}       0.01087951  0.5245098 2.052747   107
## [9]  {rolls/buns,                                                                   
##       root vegetables}    => {whole milk}       0.01270971  0.5230126 2.046888   125
## [10] {other vegetables,                                                             
##       pip fruit}          => {whole milk}       0.01352313  0.5175097 2.025351   133
## [11] {tropical fruit,                                                               
##       yogurt}             => {whole milk}       0.01514997  0.5173611 2.024770   149
## [12] {other vegetables,                                                             
##       yogurt}             => {whole milk}       0.02226741  0.5128806 2.007235   219
## [13] {other vegetables,                                                             
##       whipped/sour cream} => {whole milk}       0.01464159  0.5070423 1.984385   144
## [14] {rolls/buns,                                                                   
##       root vegetables}    => {other vegetables} 0.01220132  0.5020921 2.594890   120
## [15] {root vegetables,                                                              
##       yogurt}             => {other vegetables} 0.01291307  0.5000000 2.584078   127

We show several visualizations of the association rules including:

  • Scatter plot of all rules
  • Interactive graph chart of the top 10 rules by lift
  • Parallel coordinate plot of the top 10 rules by lift.
# scatter plot of all rules
plot(rules, jitter = 3)

# graph of top 10 rules
head(rules, n = 10, by = "lift") %>% 
    plot(method = "graph", engine = "htmlwidget")
# parallel coord plot of top 10 rules
head(rules, n = 10, by = "lift") %>%  
    plot(method = "paracoord")

Cluster analysis

Finally we run a simple cluster analysis, focusing on the items in the transaction data that have greater than 4% support. We follow the logic and code example given in the help file for the dissimilarity function in the arules library. First we compute the distance between items using the dissimilarity function; the distance measure in this case defaults to the “jaccard” measure. Then we apply hierarchical clustering to the dissimilarity matrix with the hclust function, using the “ward.D2” method.

The results of the cluster analysis are displayed in the dendrogram below. We see that “root vegetables”, “other vegetables”, and “whole milk” form a prominent cluster, consistent with the association rules observed earlier.

# follow logic of "dissimilarity" help file from "arules" package
# cluster analysis on items with support > 5%
trans2 <- trans[ , itemFrequency(trans) > 0.04]
d_jaccard <- dissimilarity(trans2, which = "items")
# plot dendrogram
plot(hclust(d_jaccard, method = "ward.D2"), 
     main = "Dendrogram for items", sub = "", xlab = "")