Association Analysis

Preliminary

We will use the arules package in the lesson that follows.

install.packages("arules")

Next, we load the arules package for use.

library(arules)

In the lesson that follows, we use the Groceries data set from the arules package, which contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The grocery store would like to find actionable, explainable and non-trivial rules based on Association Analysis of the data.

data("Groceries")

Data Exploration

We can explore the data using our typical data overview functions, including str() and summary(). The output of str() describes the structure of the data. Under @itemInfo, we see that there are 169 item labels, meaning 169 unique items in our transactions.

str(Groceries)

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  3 variables:
##   .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
##   .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
##   .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables
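As a rough illustration of how the sparse @data slot encodes baskets, the sketch below uses only the first ten values of the @p vector printed above. It assumes the usual column-compressed layout of an ngCMatrix, in which each column is one transaction (Dim is 169 items by 9835 transactions) and @p holds cumulative column pointers, so consecutive differences give the number of items in each transaction. The vector p_head is typed in by hand for illustration and is not part of the original lesson.

```r
# First ten column pointers, copied by hand from the str() output above
p_head <- c(0, 4, 7, 8, 12, 16, 21, 22, 27, 28)

# Differences between consecutive pointers = number of items per transaction
basket_sizes <- diff(p_head)
print(basket_sizes)
```

Under that assumption, the first six sizes (4, 3, 1, 4, 4, 5) agree with the six itemsets displayed later by inspect(head(Groceries)).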

We can use the length() function to obtain T, the total number of transactions in the dataset.

length(Groceries)

## [1] 9835

The summary() function gives us information including the support count for the most frequent items, the total number of transactions, the number of items and the frequency based on the number of items per transaction.

summary(Groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

By saving the summary() object, we can create a barplot of the lengths component, which gives us an idea of the frequency of transactions with n items.

test <- summary(Groceries)

We can use the test@lengths component to create the barplot. As shown, most of our transactions contain only a few items, with the highest frequency of transactions containing a single item.

barplot(height = test@lengths, 
        xlab = "Number of Items per Transaction",
        ylab = "Number of Transactions",
        cex.names = .5)
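To put a number on the pattern in the barplot, the length distribution printed by summary() can be re-entered by hand. The sketch below is plain base R and does not touch the Groceries object; the vectors sizes and counts are copied from the summary() output above, and the variable names are illustrative.

```r
# Transaction sizes and their frequencies, copied from summary(Groceries)
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

total       <- sum(counts)                      # total number of transactions
mean_size   <- sum(sizes * counts) / total      # average items per transaction
share_small <- sum(counts[sizes <= 3]) / total  # share with at most 3 items

print(c(total = total, mean_size = mean_size, share_small = share_small))
```

The counts add up to the 9835 transactions, the mean matches the 4.409 reported by summary(), and a little over half of all transactions contain three items or fewer.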

We can use the inspect() function to view the data as itemsets. We use the head() function to view the first 6 itemsets.

inspect(head(Groceries))

##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups}             
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee}                  
## [3] {whole milk}              
## [4] {pip fruit,               
##      yogurt,                  
##      cream cheese ,           
##      meat spreads}            
## [5] {other vegetables,        
##      whole milk,              
##      condensed milk,          
##      long life bakery product}
## [6] {whole milk,              
##      butter,                  
##      yogurt,                  
##      rice,                    
##      abrasive cleaner}

The itemFrequency() function from the arules package returns the support values (default) for the items.

We can set type = "absolute" to view the support count. Below, we view the support count, or the number of transactions containing the item, for the first item.

itemFrequency(x = Groceries, 
              type = "absolute")[1] # 1st item

## frankfurter 
##         580

We can set type = "relative" (default) to view the support value (support count / total number of transactions).

itemFrequency(x = Groceries, 
              type = "relative")[1] # 1st item

## frankfurter 
##  0.05897306
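To make the two definitions concrete, here is a small base R sketch that counts support by hand. The list baskets is a hand-typed copy of the six transactions shown earlier by inspect(head(Groceries)), so the resulting values describe only those six baskets, not the full data set. The last lines redo the frankfurter calculation from the numbers above. All object names are illustrative.

```r
# Six example baskets, typed in from the inspect(head(Groceries)) output
baskets <- list(
  c("citrus fruit", "semi-finished bread", "margarine", "ready soups"),
  c("tropical fruit", "yogurt", "coffee"),
  c("whole milk"),
  c("pip fruit", "yogurt", "cream cheese", "meat spreads"),
  c("other vegetables", "whole milk", "condensed milk",
    "long life bakery product"),
  c("whole milk", "butter", "yogurt", "rice", "abrasive cleaner")
)

# Support count: number of baskets containing each item
support_count <- table(unlist(lapply(baskets, unique)))

# Support: support count divided by the number of baskets
support <- support_count / length(baskets)
print(support[c("whole milk", "yogurt", "butter")])

# Same arithmetic for frankfurter in the full data set
frankfurter_support <- 580 / 9835
print(round(frankfurter_support, 8))
```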

We can view the support for the first 6 items using the head() function.

head(itemFrequency(x = Groceries))

##       frankfurter           sausage        liver loaf               ham 
##       0.058973055       0.093950178       0.005083884       0.026029487 
##              meat finished products 
##       0.025826131       0.006507372

The itemFrequencyPlot() function in the arules package allows us to view the item frequency as a barplot. Below, we restrict the plot to include only the top 10 most frequent items using the topN argument.

Support Count (type = "absolute")

itemFrequencyPlot(x = Groceries, 
                  type = "absolute", 
                  topN = 10)

Support (type = "relative", default)

itemFrequencyPlot(x = Groceries, 
                  type = "relative", 
                  topN = 10)

Association Analysis

We use the apriori() function in the arules package to perform Association Analysis. By default, the minimum support (minsup) is set to 0.1 (support = 0.1) and the minimum confidence (minconf) is set to 0.8 (confidence = 0.8). We will use support = 0.005 and confidence = 0.5. We set minlen = 2 to exclude rules with an empty left-hand side.

rules <- apriori(data = Groceries, 
                 parameter = list(target = "rules", 
                                  support = 0.005,
                                  confidence = 0.5, 
                                  minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [120 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
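The "Absolute minimum support count" line follows from the chosen threshold. Here is a short worked check, assuming the printed value is the support threshold times the number of transactions, truncated to a whole number. The variable names are illustrative.

```r
n_transactions <- 9835
min_support    <- 0.005

# Exact threshold expressed as a number of transactions
min_count_exact <- min_support * n_transactions
print(min_count_exact)

# Truncated value (assumed to be what the apriori() log reports)
print(floor(min_count_exact))

# Smallest whole number of transactions that reaches a support of 0.005
print(ceiling(min_count_exact))
```

Because counts are whole numbers, a rule needs to appear in at least 50 transactions to reach a support of 0.005, which is consistent with the minimum count of 50 reported by summary(rules).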

Printing the object by name tells us how many rules were created.

rules

## set of 120 rules

The summary() function gives us the distribution of rule lengths (the number of items on the left- and right-hand sides combined), as well as descriptive statistics for the support, confidence, coverage, lift and count of our rules.

summary(rules)

## set of 120 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4 
##  1 98 21 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.167   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.005084   Min.   :0.5000   Min.   :0.008134   Min.   :1.957  
##  1st Qu.:0.005669   1st Qu.:0.5181   1st Qu.:0.010142   1st Qu.:2.091  
##  Median :0.006202   Median :0.5445   Median :0.011490   Median :2.249  
##  Mean   :0.007344   Mean   :0.5537   Mean   :0.013404   Mean   :2.379  
##  3rd Qu.:0.007982   3rd Qu.:0.5762   3rd Qu.:0.014667   3rd Qu.:2.643  
##  Max.   :0.022267   Max.   :0.7000   Max.   :0.043416   Max.   :3.691  
##      count       
##  Min.   : 50.00  
##  1st Qu.: 55.75  
##  Median : 61.00  
##  Mean   : 72.22  
##  3rd Qu.: 78.50  
##  Max.   :219.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.005        0.5

We can use the inspect() and head() functions to view the first 6 rules.

inspect(head(rules))

##     lhs                                    rhs                support    
## [1] {baking powder}                     => {whole milk}       0.009252669
## [2] {other vegetables,oil}              => {whole milk}       0.005083884
## [3] {root vegetables,onions}            => {other vegetables} 0.005693950
## [4] {onions,whole milk}                 => {other vegetables} 0.006609049
## [5] {other vegetables,hygiene articles} => {whole milk}       0.005185562
## [6] {other vegetables,sugar}            => {whole milk}       0.006304016
##     confidence coverage    lift     count
## [1] 0.5229885  0.017691917 2.046793 91   
## [2] 0.5102041  0.009964413 1.996760 50   
## [3] 0.6021505  0.009456024 3.112008 56   
## [4] 0.5462185  0.012099644 2.822942 65   
## [5] 0.5425532  0.009557702 2.123363 51   
## [6] 0.5849057  0.010777834 2.289115 62
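The quality columns are related to one another, and the first rule, {baking powder} => {whole milk}, can be used to check this by hand. The sketch below assumes the standard definitions: support is the rule's count divided by the number of transactions, coverage is the support of the left-hand side, confidence is support divided by coverage, and lift is confidence divided by the support of the right-hand side. The whole milk count (2513) comes from summary(Groceries); the variable names are illustrative.

```r
n_transactions   <- 9835
count_rule       <- 91           # count column for rule [1]
coverage_lhs     <- 0.017691917  # coverage column: support of {baking powder}
count_whole_milk <- 2513         # from summary(Groceries)

support    <- count_rule / n_transactions
confidence <- support / coverage_lhs
lift       <- confidence / (count_whole_milk / n_transactions)

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

Up to rounding, the results reproduce the support, confidence and lift shown for rule [1].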

We can view the 10 rules with the highest support values sorted in decreasing order by using the sort() function and specifying by = "support".

inspect(head(sort(rules, by = "support", 
                  decreasing = TRUE), 
             n = 10))

##      lhs                                      rhs                support   
## [1]  {other vegetables,yogurt}             => {whole milk}       0.02226741
## [2]  {tropical fruit,yogurt}               => {whole milk}       0.01514997
## [3]  {other vegetables,whipped/sour cream} => {whole milk}       0.01464159
## [4]  {root vegetables,yogurt}              => {whole milk}       0.01453991
## [5]  {pip fruit,other vegetables}          => {whole milk}       0.01352313
## [6]  {root vegetables,yogurt}              => {other vegetables} 0.01291307
## [7]  {root vegetables,rolls/buns}          => {whole milk}       0.01270971
## [8]  {other vegetables,domestic eggs}      => {whole milk}       0.01230300
## [9]  {tropical fruit,root vegetables}      => {other vegetables} 0.01230300
## [10] {root vegetables,rolls/buns}          => {other vegetables} 0.01220132
##      confidence coverage   lift     count
## [1]  0.5128806  0.04341637 2.007235 219  
## [2]  0.5173611  0.02928317 2.024770 149  
## [3]  0.5070423  0.02887646 1.984385 144  
## [4]  0.5629921  0.02582613 2.203354 143  
## [5]  0.5175097  0.02613116 2.025351 133  
## [6]  0.5000000  0.02582613 2.584078 127  
## [7]  0.5230126  0.02430097 2.046888 125  
## [8]  0.5525114  0.02226741 2.162336 121  
## [9]  0.5845411  0.02104728 3.020999 121  
## [10] 0.5020921  0.02430097 2.594890 120

We can view the 10 rules with the highest confidence values sorted in decreasing order by using the sort() function and specifying by = "confidence".

inspect(head(sort(rules, 
                  by = "confidence", 
                  decreasing = TRUE), 
             n = 10))

##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {tropical fruit,                                                                            
##       root vegetables,                                                                           
##       yogurt}             => {whole milk}       0.005693950  0.7000000 0.008134215 2.739554    56
## [2]  {pip fruit,                                                                                 
##       root vegetables,                                                                           
##       other vegetables}   => {whole milk}       0.005490595  0.6750000 0.008134215 2.641713    54
## [3]  {butter,                                                                                    
##       whipped/sour cream} => {whole milk}       0.006710727  0.6600000 0.010167768 2.583008    66
## [4]  {pip fruit,                                                                                 
##       whipped/sour cream} => {whole milk}       0.005998983  0.6483516 0.009252669 2.537421    59
## [5]  {butter,                                                                                    
##       yogurt}             => {whole milk}       0.009354347  0.6388889 0.014641586 2.500387    92
## [6]  {root vegetables,                                                                           
##       butter}             => {whole milk}       0.008235892  0.6377953 0.012913066 2.496107    81
## [7]  {tropical fruit,                                                                            
##       curd}               => {whole milk}       0.006507372  0.6336634 0.010269446 2.479936    64
## [8]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [9]  {pip fruit,                                                                                 
##       other vegetables,                                                                          
##       yogurt}             => {whole milk}       0.005083884  0.6250000 0.008134215 2.446031    50
## [10] {pip fruit,                                                                                 
##       domestic eggs}      => {whole milk}       0.005388917  0.6235294 0.008642603 2.440275    53

Finally, we can view the 10 rules with the highest lift values sorted in decreasing order by using the sort() function and specifying by = "lift".

inspect(head(sort(rules, 
                  by = "lift", 
                  decreasing = TRUE), 
             n = 10))

##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {tropical fruit,                                                                            
##       curd}               => {yogurt}           0.005287239  0.5148515 0.010269446 3.690645    52
## [2]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [3]  {pip fruit,                                                                                 
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005490595  0.6136364 0.008947636 3.171368    54
## [4]  {pip fruit,                                                                                 
##       whipped/sour cream} => {other vegetables} 0.005592272  0.6043956 0.009252669 3.123610    55
## [5]  {root vegetables,                                                                           
##       onions}             => {other vegetables} 0.005693950  0.6021505 0.009456024 3.112008    56
## [6]  {citrus fruit,                                                                              
##       root vegetables}    => {other vegetables} 0.010371124  0.5862069 0.017691917 3.029608   102
## [7]  {tropical fruit,                                                                            
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.007015760  0.5847458 0.011997966 3.022057    69
## [8]  {tropical fruit,                                                                            
##       root vegetables}    => {other vegetables} 0.012302999  0.5845411 0.021047280 3.020999   121
## [9]  {butter,                                                                                    
##       whipped/sour cream} => {other vegetables} 0.005795628  0.5700000 0.010167768 2.945849    57
## [10] {tropical fruit,                                                                            
##       whipped/sour cream} => {other vegetables} 0.007829181  0.5661765 0.013828165 2.926088    77

We can also use the subset() function within the inspect() function to view rules meeting particular criteria, such as rules with lift values greater than 3.

inspect(subset(rules, lift > 3))

##     lhs                     rhs                    support confidence    coverage     lift count
## [1] {root vegetables,                                                                           
##      onions}             => {other vegetables} 0.005693950  0.6021505 0.009456024 3.112008    56
## [2] {tropical fruit,                                                                            
##      curd}               => {yogurt}           0.005287239  0.5148515 0.010269446 3.690645    52
## [3] {pip fruit,                                                                                 
##      whipped/sour cream} => {other vegetables} 0.005592272  0.6043956 0.009252669 3.123610    55
## [4] {citrus fruit,                                                                              
##      root vegetables}    => {other vegetables} 0.010371124  0.5862069 0.017691917 3.029608   102
## [5] {tropical fruit,                                                                            
##      root vegetables}    => {other vegetables} 0.012302999  0.5845411 0.021047280 3.020999   121
## [6] {pip fruit,                                                                                 
##      root vegetables,                                                                           
##      whole milk}         => {other vegetables} 0.005490595  0.6136364 0.008947636 3.171368    54
## [7] {citrus fruit,                                                                              
##      root vegetables,                                                                           
##      whole milk}         => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [8] {tropical fruit,                                                                            
##      root vegetables,                                                                           
##      whole milk}         => {other vegetables} 0.007015760  0.5847458 0.011997966 3.022057    69

We can also create subsets based on multiple criteria, such as the items that the rules include.

For instance, the grocery store may want to come up with promotions and advertising strategies for selling whole milk. We can create a subset in which the right-hand side (Y) includes whole milk and the lift value is greater than 2.5, meaning that whole milk appears together with the itemset X at least 2.5 times as often as would be expected if X and whole milk were independent.

wholemilk.rhs <- subset(rules, 
                        subset = rhs %in% "whole milk" & 
                                lift > 2.5)
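The logic of the two combined conditions can be mimicked in base R. The sketch below is only an illustration and does not use arules: the data frame top_conf is a hand-typed copy of the right-hand sides and lift values of the 10 highest-confidence rules shown earlier, and the same two conditions are applied with ordinary logical indexing.

```r
# rhs and lift for the 10 highest-confidence rules, copied from the output above
top_conf <- data.frame(
  rhs  = c("whole milk", "whole milk", "whole milk", "whole milk",
           "whole milk", "whole milk", "whole milk", "other vegetables",
           "whole milk", "whole milk"),
  lift = c(2.739554, 2.641713, 2.583008, 2.537421, 2.500387,
           2.496107, 2.479936, 3.273165, 2.446031, 2.440275),
  stringsAsFactors = FALSE
)

# Keep rows that satisfy both conditions at once
kept <- top_conf[top_conf$rhs == "whole milk" & top_conf$lift > 2.5, ]
print(kept)
```

The rule with other vegetables on the right-hand side is dropped despite its high lift, and the whole milk rules at or below a lift of 2.5 are dropped despite matching on the item, which leaves five rows.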

We can use the inspect() function on our subset to better understand the rules meeting our criteria. As shown, the five rules have similar lift values (about 2.50 to 2.74) and confidence values (about 0.64 to 0.70), and all have low support values (under 0.01).

inspect(wholemilk.rhs)

##     lhs                     rhs              support confidence    coverage     lift count
## [1] {butter,                                                                              
##      whipped/sour cream} => {whole milk} 0.006710727  0.6600000 0.010167768 2.583008    66
## [2] {butter,                                                                              
##      yogurt}             => {whole milk} 0.009354347  0.6388889 0.014641586 2.500387    92
## [3] {pip fruit,                                                                           
##      whipped/sour cream} => {whole milk} 0.005998983  0.6483516 0.009252669 2.537421    59
## [4] {pip fruit,                                                                           
##      root vegetables,                                                                     
##      other vegetables}   => {whole milk} 0.005490595  0.6750000 0.008134215 2.641713    54
## [5] {tropical fruit,                                                                      
##      root vegetables,                                                                     
##      yogurt}             => {whole milk} 0.005693950  0.7000000 0.008134215 2.739554    56

We can take a closer look at the rules with the highest lift values by sorting the subset in decreasing order of lift and isolating the first two observations of the sorted data.

inspect(head(sort(wholemilk.rhs, 
                  by = "lift", 
                  decreasing = TRUE))[1:2])

##     lhs                   rhs              support confidence    coverage     lift count
## [1] {tropical fruit,                                                                    
##      root vegetables,                                                                   
##      yogurt}           => {whole milk} 0.005693950  0.7000000 0.008134215 2.739554    56
## [2] {pip fruit,                                                                         
##      root vegetables,                                                                   
##      other vegetables} => {whole milk} 0.005490595  0.6750000 0.008134215 2.641713    54

Based on this, the grocery store may want to run promotions for fruits and vegetables with their whole milk, or adjust the store layout to accommodate the finding that people buy fruits and vegetables (and in the top rule, yogurt) and whole milk together.

We can take a closer look at another subset, based on the item being on the left-hand side (X). Here, we create a subset based on X containing yogurt and the lift value higher than 2.25.

yogurt.lhs <- subset(rules, 
                     subset = lhs %in% "yogurt" & 
                             lift > 2.25)

We can use the inspect() function on our subset to better understand the rules meeting our criteria.

inspect(yogurt.lhs)

##      lhs                        rhs                    support confidence    coverage     lift count
## [1]  {curd,                                                                                         
##       yogurt}                => {whole milk}       0.010066090  0.5823529 0.017285206 2.279125    99
## [2]  {butter,                                                                                       
##       yogurt}                => {whole milk}       0.009354347  0.6388889 0.014641586 2.500387    92
## [3]  {root vegetables,                                                                              
##       yogurt}                => {other vegetables} 0.012913066  0.5000000 0.025826131 2.584078   127
## [4]  {other vegetables,                                                                             
##       yogurt,                                                                                       
##       fruit/vegetable juice} => {whole milk}       0.005083884  0.6172840 0.008235892 2.415833    50
## [5]  {whole milk,                                                                                   
##       yogurt,                                                                                       
##       fruit/vegetable juice} => {other vegetables} 0.005083884  0.5376344 0.009456024 2.778578    50
## [6]  {whole milk,                                                                                   
##       yogurt,                                                                                       
##       whipped/sour cream}    => {other vegetables} 0.005592272  0.5140187 0.010879512 2.656529    55
## [7]  {pip fruit,                                                                                    
##       other vegetables,                                                                             
##       yogurt}                => {whole milk}       0.005083884  0.6250000 0.008134215 2.446031    50
## [8]  {pip fruit,                                                                                    
##       whole milk,                                                                                   
##       yogurt}                => {other vegetables} 0.005083884  0.5319149 0.009557702 2.749019    50
## [9]  {tropical fruit,                                                                               
##       root vegetables,                                                                              
##       yogurt}                => {whole milk}       0.005693950  0.7000000 0.008134215 2.739554    56
## [10] {tropical fruit,                                                                               
##       other vegetables,                                                                             
##       yogurt}                => {whole milk}       0.007625826  0.6198347 0.012302999 2.425816    75
## [11] {tropical fruit,                                                                               
##       whole milk,                                                                                   
##       yogurt}                => {other vegetables} 0.007625826  0.5033557 0.015149975 2.601421    75
## [12] {root vegetables,                                                                              
##       other vegetables,                                                                             
##       yogurt}                => {whole milk}       0.007829181  0.6062992 0.012913066 2.372842    77
## [13] {root vegetables,                                                                              
##       whole milk,                                                                                   
##       yogurt}                => {other vegetables} 0.007829181  0.5384615 0.014539908 2.782853    77

Next, we can take a closer look at the two rules with the highest lift values.

inspect(head(sort(yogurt.lhs, 
                  by = "lift", 
                  decreasing = TRUE))[1:2])

##     lhs                        rhs                    support confidence    coverage     lift count
## [1] {root vegetables,                                                                              
##      whole milk,                                                                                   
##      yogurt}                => {other vegetables} 0.007829181  0.5384615 0.014539908 2.782853    77
## [2] {whole milk,                                                                                   
##      yogurt,                                                                                       
##      fruit/vegetable juice} => {other vegetables} 0.005083884  0.5376344 0.009456024 2.778578    50

Based on the top two rules, whole milk and yogurt are being bought with fruits and vegetables (and fruit/vegetable juice, in the 2nd rule), with the top rule being {root vegetables, whole milk, yogurt} => {other vegetables}. This information can help the grocery store to market to customers meeting this profile. Based on the subsets, it is clear that these milk-based products (whole milk and yogurt) and fruit and vegetable products are purchased together, and the grocery store should tailor its marketing strategies to accommodate this finding.

Association Analysis

Dr. Chelsey Hill
