Exploratory Data Analysis

# Import data 

items_data <- read.transactions("/Users/dirkhartog/Desktop/CUNY_MSDS/DATA_624/GroceryDataSet (1).csv", sep = ",")

summary(items_data)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

There are 9,835 transactions with 169 unique products. The most frequent items included - Whole Milk - Other vegatables - Rolls/buns - Soda - Yogurt

2,159 transactions contain 1 item and 1 transaction contains 32 unique items. There a total of \(2^{169} - 1\) possible item sets. We can print off the first 5 in our data.

# List the first 5 transactions in the data 
inspect(items_data[1:5])

##     items                      
## [1] {citrus fruit,             
##      margarine,                
##      ready soups,              
##      semi-finished bread}      
## [2] {coffee,                   
##      tropical fruit,           
##      yogurt}                   
## [3] {whole milk}               
## [4] {cream cheese,             
##      meat spreads,             
##      pip fruit,                
##      yogurt}                   
## [5] {condensed milk,           
##      long life bakery product, 
##      other vegetables,         
##      whole milk}

The first five transactions are listed above. We can see that 3/5 top items are in these transaction along with several other items. We can calculate the support of each item to find the fraction of transactions within the data set that contain each item

# Calculate the support for the top five most frequent items 

itemFrequency(items_data[,c("whole milk", "other vegetables",
                            "rolls/buns", "soda", "yogurt")])

##       whole milk other vegetables       rolls/buns             soda 
##        0.2555160        0.1934926        0.1839349        0.1743772 
##           yogurt 
##        0.1395018

# Create a tibble with item frequency/support for all items 
item_frequency <- tibble(
  Items = names(itemFrequency(items_data)), Frequency = itemFrequency(items_data)
)

item_frequency <- item_frequency %>% arrange(desc(Frequency))

# Top 20 most frequent items
head(item_frequency, 20)

## # A tibble: 20 × 2
##    Items                 Frequency
##    <chr>                     <dbl>
##  1 whole milk               0.256 
##  2 other vegetables         0.193 
##  3 rolls/buns               0.184 
##  4 soda                     0.174 
##  5 yogurt                   0.140 
##  6 bottled water            0.111 
##  7 root vegetables          0.109 
##  8 tropical fruit           0.105 
##  9 shopping bags            0.0985
## 10 sausage                  0.0940
## 11 pastry                   0.0890
## 12 citrus fruit             0.0828
## 13 bottled beer             0.0805
## 14 newspapers               0.0798
## 15 canned beer              0.0777
## 16 pip fruit                0.0756
## 17 fruit/vegetable juice    0.0723
## 18 whipped/sour cream       0.0717
## 19 brown bread              0.0649
## 20 domestic eggs            0.0634

Finding Association Rules

The data contain 169 unique items which gives us \(3^{169} - 2{169+1} + 1\) = r3^169 - 2^(169+1) + 1 number of rules. Evaluating all of the rules would computationally challenging and may not be important. We will use the apriori function in the arules package to find the association rules based on certain criteria that determine (the support, confidence and minimum length of a rule. This alogorithm will look through all of the possible item sets and determine if they meet the minimum support critera that we set. This will only retrieve rules that show up in the data at the support threshold that we choose. For example if we were interested in the item sets that show up at least 5% of the time we would then need to set our support value at 0.05

One way of identifying the strength of a rule within a dataset is to consider the degree of certainty of each rule using confidence. Confidence is the ratio of the number of transactions that include both the antecedent and consequent to the number of transactions that include only the antecedent.

In our case we will se the support to 0.015 and confidence to 0.25 with a minimum rule length of 2

# Finding association rules 

item_rules <- apriori(items_data,
parameter = list(support = 0.015, 
                  confidence = 0.25, 
                  minlen = 2
))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.015      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 147 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [73 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [78 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(item_rules)

## set of 78 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 62 16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.205   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01505   Min.   :0.2537   Min.   :0.02928   Min.   :0.9932  
##  1st Qu.:0.01790   1st Qu.:0.3084   1st Qu.:0.05247   1st Qu.:1.5047  
##  Median :0.02191   Median :0.3546   Median :0.06121   Median :1.7400  
##  Mean   :0.02558   Mean   :0.3608   Mean   :0.07319   Mean   :1.7632  
##  3rd Qu.:0.02888   3rd Qu.:0.4056   3rd Qu.:0.08277   3rd Qu.:1.9427  
##  Max.   :0.07483   Max.   :0.5174   Max.   :0.25552   Max.   :3.0404  
##      count      
##  Min.   :148.0  
##  1st Qu.:176.0  
##  Median :215.5  
##  Mean   :251.6  
##  3rd Qu.:284.0  
##  Max.   :736.0  
## 
## mining info:
##        data ntransactions support confidence
##  items_data          9835   0.015       0.25
##                                                                                          call
##  apriori(data = items_data, parameter = list(support = 0.015, confidence = 0.25, minlen = 2))

# View the first 10 rules
inspect(item_rules[1:10])

##      lhs                    rhs                support    confidence coverage  
## [1]  {sugar}             => {whole milk}       0.01504830 0.4444444  0.03385867
## [2]  {cream cheese}      => {whole milk}       0.01647178 0.4153846  0.03965430
## [3]  {chicken}           => {other vegetables} 0.01789527 0.4170616  0.04290798
## [4]  {chicken}           => {whole milk}       0.01759024 0.4099526  0.04290798
## [5]  {white bread}       => {whole milk}       0.01708185 0.4057971  0.04209456
## [6]  {chocolate}         => {whole milk}       0.01667514 0.3360656  0.04961871
## [7]  {coffee}            => {whole milk}       0.01870869 0.3222417  0.05805796
## [8]  {frozen vegetables} => {other vegetables} 0.01779359 0.3699789  0.04809354
## [9]  {frozen vegetables} => {whole milk}       0.02043721 0.4249471  0.04809354
## [10] {beef}              => {root vegetables}  0.01738688 0.3313953  0.05246568
##      lift     count
## [1]  1.739400 148  
## [2]  1.625670 162  
## [3]  2.155439 176  
## [4]  1.604411 173  
## [5]  1.588147 168  
## [6]  1.315243 164  
## [7]  1.261141 184  
## [8]  1.912108 175  
## [9]  1.663094 201  
## [10] 3.040367 171

# Sort by lift 
item_rules %>% sort(by = "lift") %>%  head(10) %>% inspect()

##      lhs                               rhs                support    confidence
## [1]  {beef}                         => {root vegetables}  0.01738688 0.3313953 
## [2]  {other vegetables, whole milk} => {root vegetables}  0.02318251 0.3097826 
## [3]  {whole milk, yogurt}           => {tropical fruit}   0.01514997 0.2704174 
## [4]  {pip fruit}                    => {tropical fruit}   0.02043721 0.2701613 
## [5]  {tropical fruit, whole milk}   => {yogurt}           0.01514997 0.3581731 
## [6]  {root vegetables, whole milk}  => {other vegetables} 0.02318251 0.4740125 
## [7]  {curd}                         => {yogurt}           0.01728521 0.3244275 
## [8]  {root vegetables}              => {other vegetables} 0.04738180 0.4347015 
## [9]  {chicken}                      => {other vegetables} 0.01789527 0.4170616 
## [10] {other vegetables, whole milk} => {yogurt}           0.02226741 0.2975543 
##      coverage   lift     count
## [1]  0.05246568 3.040367 171  
## [2]  0.07483477 2.842082 228  
## [3]  0.05602440 2.577089 149  
## [4]  0.07564820 2.574648 201  
## [5]  0.04229792 2.567516 149  
## [6]  0.04890696 2.449770 228  
## [7]  0.05327911 2.325615 170  
## [8]  0.10899847 2.246605 466  
## [9]  0.04290798 2.155439 176  
## [10] 0.07483477 2.132979 219

When we sort the rules by lift we can see interesting findings. The first rule with the highest lift tells us that those who buy beef are 3x more likely to also purchase root vegetables. We can also see that many of the top 10 rules by lift include the whole milk yogurt, and other vegetables which were in the top five most frequently purchased items. Lets evaluate the top rules excluding these items to identify other rules that might be insightful.

item_rules %>% subset(!items %in% c("whole milk", "yogurt", "other vegetables")) %>% sort(by = "lift") %>%
inspect()

##     lhs                        rhs               support    confidence
## [1] {beef}                  => {root vegetables} 0.01738688 0.3313953 
## [2] {pip fruit}             => {tropical fruit}  0.02043721 0.2701613 
## [3] {frankfurter}           => {rolls/buns}      0.01921708 0.3258621 
## [4] {sausage}               => {rolls/buns}      0.03060498 0.3257576 
## [5] {bottled water}         => {soda}            0.02897814 0.2621895 
## [6] {sausage}               => {soda}            0.02430097 0.2586580 
## [7] {fruit/vegetable juice} => {soda}            0.01840366 0.2545710 
##     coverage   lift     count
## [1] 0.05246568 3.040367 171  
## [2] 0.07564820 2.574648 201  
## [3] 0.05897306 1.771616 189  
## [4] 0.09395018 1.771048 301  
## [5] 0.11052364 1.503577 285  
## [6] 0.09395018 1.483324 239  
## [7] 0.07229283 1.459887 181

Excluding some of the top items reveals possible pairings to take into account when setting up the items in the store, more specifcally the last four rules. Placing soda and rolls/buns near the sausage/frankfurters may lead to more purchases as these items are likely to be purhased together.

We can inspect the results sorted by confidence or suppor as well for further insights.

item_rules %>% 
  sort(by = "confidence") %>% head(10) %>% 
  inspect()

##      lhs                                    rhs                support   
## [1]  {tropical fruit, yogurt}            => {whole milk}       0.01514997
## [2]  {other vegetables, yogurt}          => {whole milk}       0.02226741
## [3]  {butter}                            => {whole milk}       0.02755465
## [4]  {curd}                              => {whole milk}       0.02613116
## [5]  {other vegetables, root vegetables} => {whole milk}       0.02318251
## [6]  {other vegetables, tropical fruit}  => {whole milk}       0.01708185
## [7]  {root vegetables, whole milk}       => {other vegetables} 0.02318251
## [8]  {domestic eggs}                     => {whole milk}       0.02999492
## [9]  {rolls/buns, yogurt}                => {whole milk}       0.01555669
## [10] {whipped/sour cream}                => {whole milk}       0.03223183
##      confidence coverage   lift     count
## [1]  0.5173611  0.02928317 2.024770 149  
## [2]  0.5128806  0.04341637 2.007235 219  
## [3]  0.4972477  0.05541434 1.946053 271  
## [4]  0.4904580  0.05327911 1.919481 257  
## [5]  0.4892704  0.04738180 1.914833 228  
## [6]  0.4759207  0.03589222 1.862587 168  
## [7]  0.4740125  0.04890696 2.449770 228  
## [8]  0.4727564  0.06344687 1.850203 295  
## [9]  0.4526627  0.03436706 1.771563 153  
## [10] 0.4496454  0.07168277 1.759754 317

Homework 10: Market Basket Analysis

Dirk Hartog

2025-05-02

Instructions

Exploratory Data Analysis

Finding Association Rules