Market Basket and Clusters

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore Market Basket Analysis.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.

Use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.

Acquire the data

Reading in data for market basket analysis requires the read.transactions function from the arules package. It creates a set of formal class transactions that can then be mined to extract association rules.

Examining the summary stats from the class transactions, there were 2159 transactions that included only 1 item and the largest single transaction included 32 items. Overall it appears that most transactions included from 1 to 12 items.

grocery <- read.transactions("C:/data/GroceryDataSet.csv", sep = ",")
summary(grocery)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Explore the data

Using the tidyverse package we can convert the class transactions to a tibble for closer examination.

grocery_frequency <- tibble(Items = names(itemFrequency(grocery)),
                            Frequency = itemFrequency(grocery))
head(grocery_frequency)
## # A tibble: 6 x 2
##   Items            Frequency
##   <chr>                <dbl>
## 1 abrasive cleaner  0.00356 
## 2 artif. sweetener  0.00325 
## 3 baby cosmetics    0.000610
## 4 baby food         0.000102
## 5 bags              0.000407
## 6 baking powder     0.0177

The tibble is easier to interact with, but it is not currently sorted numerically, only alphabetically by item. So let’s look at the top 20 items by purchasing frequency. Whole milk is the most frequently purchased item, followed by other vegetables, rolls/buns, soda and yogurt. It is also interesting that only one meat (sausage) makes the list of top 20 items purchased, while fruits/vegetables, breads and soda seem to be more popular.

The table below represents the support or actual frequency with which each item is purchased. For example, whole milk is purchased with actual frequency of 0.255 or 26% of the time. Stated differently, about 1 in 4 transactions (0.255 of the time) include whole milk.

grocery_frequency %>%
  arrange(desc(Frequency)) %>%
  slice(1:20)
## # A tibble: 20 x 2
##    Items                 Frequency
##    <chr>                     <dbl>
##  1 whole milk               0.256 
##  2 other vegetables         0.193 
##  3 rolls/buns               0.184 
##  4 soda                     0.174 
##  5 yogurt                   0.140 
##  6 bottled water            0.111 
##  7 root vegetables          0.109 
##  8 tropical fruit           0.105 
##  9 shopping bags            0.0985
## 10 sausage                  0.0940
## 11 pastry                   0.0890
## 12 citrus fruit             0.0828
## 13 bottled beer             0.0805
## 14 newspapers               0.0798
## 15 canned beer              0.0777
## 16 pip fruit                0.0756
## 17 fruit/vegetable juice    0.0723
## 18 whipped/sour cream       0.0717
## 19 brown bread              0.0649
## 20 domestic eggs            0.0634

Looking at the summary stats of the Frequencies, the median support appears to be 0.0104728 and the mean is 0.0260915. It would be good to use a number between the median and mean later when we need to designate a support threshold to generate the association rules.

grocery_frequency %>%
  select(Frequency) %>%
  summary()
##    Frequency        
##  Min.   :0.0001017  
##  1st Qu.:0.0038637  
##  Median :0.0104728  
##  Mean   :0.0260915  
##  3rd Qu.:0.0310117  
##  Max.   :0.2555160

Extract the rules

Association rules are generated with the apriori algorithm from the arules package.

But before we can calculate the rules, we need to calculate the parameters. In terms of parameter values, if we look for patterns in the data that happen at least five times a day in any given transaction set over the course of 30 days per period (an average # of days in a standard month).

The calculated support value is equal to 0.0152516522623284, which is between the median and mean as noted in the above summary of the Frequencies.

The confidence parameter is based on the conditional probability that an item is purchased given the prior items in the transaction. The higher the confidence for a rule the stronger the relationship given the conditional probabilities. Since the most frequently purchased item occurs with a support of 0.255, it is rational to set the confidence no higher than the most frequently purchased item. We set the confidence to 0.25.

Lastly, since transactions that contain only 1 item have no antecedent/consequent relationship to devise a rule for the purchase of the consequent item, we set the minimum length of a transaction to 2 items.

Now we are ready to extract the association rules from the data.

frequency_per_day <- 5
days_per_period <- 30
total_transactions <- length(grocery)

support_value <- (frequency_per_day * days_per_period)/total_transactions

grocery_rules <- apriori(grocery,
                         parameter = list(
                           support = support_value,
                           confidence = 0.25,
                           minlen = 2
                         ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime    support minlen
##        0.25    0.1    1 none FALSE            TRUE       5 0.01525165      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 150 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [71 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The apriori algorithm created 74 grocery rules using the parameters that we fed to it, and a summary of the rule stats includes information about the quality measures for support, confidence, coverage, lift and counts.

As a reminder support reflects the number of times an itemset appears in a basket of transactions, confidence reflects the strength of the relationship between the antecedent and consequent items in a transaction, coverage reflects the conditional probability that transactions include y given x, lift evaluates the strength of relationship between items within a given itemset, and counts are just counts.

summary(grocery_rules)
## set of 74 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 61 13 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.176   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01556   Min.   :0.2537   Min.   :0.03437   Min.   :0.9932  
##  1st Qu.:0.01871   1st Qu.:0.3084   1st Qu.:0.05328   1st Qu.:1.5038  
##  Median :0.02227   Median :0.3499   Median :0.06416   Median :1.6650  
##  Mean   :0.02614   Mean   :0.3588   Mean   :0.07496   Mean   :1.7381  
##  3rd Qu.:0.02895   3rd Qu.:0.4047   3rd Qu.:0.08277   3rd Qu.:1.9183  
##  Max.   :0.07483   Max.   :0.5129   Max.   :0.25552   Max.   :3.0404  
##      count      
##  Min.   :153.0  
##  1st Qu.:184.0  
##  Median :219.0  
##  Mean   :257.1  
##  3rd Qu.:284.8  
##  Max.   :736.0  
## 
## mining info:
##     data ntransactions    support confidence
##  grocery          9835 0.01525165       0.25
##                                                                                               call
##  apriori(data = grocery, parameter = list(support = support_value, confidence = 0.25, minlen = 2))

Assess the rules

Once the rules are created it is possible to view them using the inspect function. Sorting by the confidence parameter we can see the rules in descending order of strength. It is not surprising that 9 of the top 10 rules are predicting that whole milk will be purchased (rhs) given the items purchased with it (lhs), since whole milk is the most purchased item (1 in 4 transactions include it).

Explaining the first rule in terms of confidence, it is about twice as likely (0.5128806/100) that someone who buys other vegetables and yogurt will also buy whole milk.

grocery_rules %>%
  sort(by = "confidence") %>%
  head(n = 10) %>%
  inspect()
##      lhs                                    rhs                support   
## [1]  {other vegetables, yogurt}          => {whole milk}       0.02226741
## [2]  {butter}                            => {whole milk}       0.02755465
## [3]  {curd}                              => {whole milk}       0.02613116
## [4]  {other vegetables, root vegetables} => {whole milk}       0.02318251
## [5]  {other vegetables, tropical fruit}  => {whole milk}       0.01708185
## [6]  {root vegetables, whole milk}       => {other vegetables} 0.02318251
## [7]  {domestic eggs}                     => {whole milk}       0.02999492
## [8]  {rolls/buns, yogurt}                => {whole milk}       0.01555669
## [9]  {whipped/sour cream}                => {whole milk}       0.03223183
## [10] {root vegetables}                   => {whole milk}       0.04890696
##      confidence coverage   lift     count
## [1]  0.5128806  0.04341637 2.007235 219  
## [2]  0.4972477  0.05541434 1.946053 271  
## [3]  0.4904580  0.05327911 1.919481 257  
## [4]  0.4892704  0.04738180 1.914833 228  
## [5]  0.4759207  0.03589222 1.862587 168  
## [6]  0.4740125  0.04890696 2.449770 228  
## [7]  0.4727564  0.06344687 1.850203 295  
## [8]  0.4526627  0.03436706 1.771563 153  
## [9]  0.4496454  0.07168277 1.759754 317  
## [10] 0.4486940  0.10899847 1.756031 481

Looking at it another way, since confidence can be high at times just by chance, we can look at the rules in terms of lift to remove the influence of chance and examine more closely how each item in an itemset relates to the other items.

So when we look at the rules in terms of confidence, customers who purchase both other vegetables and yogurt are twice as likely to also purchase whole milk, but looking at the rules in terms of lift, rule #8 indicates that customers who purchase both other vegetables and whole milk are about twice as likely (2.132979) to also purchase yogurt.

grocery_rules %>%
  sort (by = "lift") %>%
  head(n = 10) %>%
  inspect()
##      lhs                               rhs                support    confidence
## [1]  {beef}                         => {root vegetables}  0.01738688 0.3313953 
## [2]  {other vegetables, whole milk} => {root vegetables}  0.02318251 0.3097826 
## [3]  {pip fruit}                    => {tropical fruit}   0.02043721 0.2701613 
## [4]  {root vegetables, whole milk}  => {other vegetables} 0.02318251 0.4740125 
## [5]  {curd}                         => {yogurt}           0.01728521 0.3244275 
## [6]  {root vegetables}              => {other vegetables} 0.04738180 0.4347015 
## [7]  {chicken}                      => {other vegetables} 0.01789527 0.4170616 
## [8]  {other vegetables, whole milk} => {yogurt}           0.02226741 0.2975543 
## [9]  {tropical fruit, whole milk}   => {other vegetables} 0.01708185 0.4038462 
## [10] {whipped/sour cream}           => {other vegetables} 0.02887646 0.4028369 
##      coverage   lift     count
## [1]  0.05246568 3.040367 171  
## [2]  0.07483477 2.842082 228  
## [3]  0.07564820 2.574648 201  
## [4]  0.04890696 2.449770 228  
## [5]  0.05327911 2.325615 170  
## [6]  0.10899847 2.246605 466  
## [7]  0.04290798 2.155439 176  
## [8]  0.07483477 2.132979 219  
## [9]  0.04229792 2.087140 168  
## [10] 0.07168277 2.081924 284

Since the rules seem to be confirming each other in terms of looking at them from different metric perspectives, perhaps clustering them visually would add additional insights to the analysis of the market basket.

Clustering

K-means clustering is one of the most commonly used clustering approaches, so let us see how it fares in this case.

First we need to extract all of the output from the grocery rules, which is an S4 object, into a dataframe using the as method to force the output into a dataframe class. This method combines the lhs and rhs into a single column called rules.

grocery_rules_df <- as(grocery_rules, "data.frame")
head(grocery_rules_df)
##                             rules    support confidence   coverage     lift
## 1  {cream cheese} => {whole milk} 0.01647178  0.4153846 0.03965430 1.625670
## 2 {chicken} => {other vegetables} 0.01789527  0.4170616 0.04290798 2.155439
## 3       {chicken} => {whole milk} 0.01759024  0.4099526 0.04290798 1.604411
## 4   {white bread} => {whole milk} 0.01708185  0.4057971 0.04209456 1.588147
## 5     {chocolate} => {whole milk} 0.01667514  0.3360656 0.04961871 1.315243
## 6        {coffee} => {whole milk} 0.01870869  0.3222417 0.05805796 1.261141
##   count
## 1   162
## 2   176
## 3   173
## 4   168
## 5   164
## 6   184

Next we need to decide which variables we will cluster around. We can group by lift and count to see what happens, so let’s take a look at the distributions of each of those metrics.

grocery_rules_df %>%
  select(lift, count) %>%
  summary()
##       lift            count      
##  Min.   :0.9932   Min.   :153.0  
##  1st Qu.:1.5038   1st Qu.:184.0  
##  Median :1.6650   Median :219.0  
##  Mean   :1.7381   Mean   :257.1  
##  3rd Qu.:1.9183   3rd Qu.:284.8  
##  Max.   :3.0404   Max.   :736.0

The ranges of each variable are quite different, so we need to scale them for the K-means process.

grocery_rules_scaled <- grocery_rules_df %>%
  select(lift, count) %>%
  scale()

head(grocery_rules_scaled)
##         lift      count
## 1 -0.3112089 -0.7984402
## 2  1.1555230 -0.6809257
## 3 -0.3700669 -0.7061074
## 4 -0.4150936 -0.7480768
## 5 -1.1706635 -0.7816524
## 6 -1.3204511 -0.6137746

Before clustering the data, it would be good to see how many clusters are recommended based on the data itself. There are three methods we can use to recommend the number of clusters, so we will try them all.

The first method is the Elbow method, which uses a measure of within-cluster sum of squares (WCSS) which is called by setting the method = "wss". This method measures the distance between each item in a cluster and the cluster’s centroid. The Elbow method suggests that 4 clusters are optimal.

fviz_nbclust(grocery_rules_scaled, kmeans, method = "wss")

The second method is the Average Silhouette Method which is called by setting the method = "silhouette". This method compares each item in a cluster with other items in the cluster as well as items in neighboring clusters. The Silhouette Method suggests that 2 clusters are optimal.

fviz_nbclust(grocery_rules_scaled, kmeans, method = "silhouette")

The third method is the Gap Statistic, which generates a random reference dataset and measures the difference in WCSS between the original and random datasets, which is called by setting the method = "gap_stat". The Gap Statistic method appears to show that 1 cluster is optimal, but since 1 cluster does not separate the data in any way, the second best choice appears to be 4 clusters.

fviz_nbclust(grocery_rules_scaled, kmeans, method = "gap_stat")

Since two methods suggested 4 clusters and one suggested 2 clusters, we can create two cluster groupings to visualize both ways.

First we will try two centers with 25 initial configurations. The K-means algorithm selected the best result with two clusters of sizes 66 and 8.

set.seed(101)

k_2 <- kmeans(grocery_rules_scaled, centers = 2, nstart = 25)
k_2$size
## [1] 66  8

It may be more useful to analyze the clusters visually using the factoextra package’s fviz_cluster function. Here we see that with only two clusters, cluster 1 has the highest overall lift but with low counts, while cluster 2 has the highest counts but with lower lift.

fviz_cluster(k_2, data = grocery_rules_scaled, repel = TRUE)

Last we will try four centers with 25 initial configurations. The K-means algorithm selected the best result with four clusters of sizes 29, 5, 32 and 8.

set.seed(102)

k_4 <- kmeans(grocery_rules_scaled, centers = 4, nstart = 25)
k_4$size
## [1]  5 32  8 29

We see that with four clusters, the original cluster 2 is unchanged (although now it is cluster 3), but the original cluster 1 with lower counts has been segmented more granularly into three separate clusters. The new cluster 1 still has the highest overall lift, while cluster 4 has average lift and cluster 2 has the lowest lift.

fviz_cluster(k_4, data = grocery_rules_scaled, repel = TRUE)