Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction listing the items that were purchased. A receipt is a record of what went into a customer's basket, hence the name Market Basket Analysis.
That is exactly what the Groceries Data Set contains: a collection of receipts, one per line. Each line is called a transaction, and each column within a row represents an item that was purchased.
Use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
Reading in data for market basket analysis requires the read.transactions function from the arules package. It creates an object of the formal (S4) class transactions that can then be mined for association rules.
Examining the summary stats of the transactions object, 2159 transactions included only 1 item, while the largest single transaction included 32 items. Overall, most transactions contained between 1 and 12 items.
library(arules)

grocery <- read.transactions("C:/data/GroceryDataSet.csv", sep = ",")
summary(grocery)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
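As a quick sanity check on those figures, the size distribution can also be pulled directly from the transactions object with arules' size function (a minimal sketch, assuming the grocery object above):

summary(size(grocery))    # Min 1, Median 3, Mean 4.409, Max 32
sum(size(grocery) == 1)   # 2159 single-item transactions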
Using the tidyverse package, we can convert the transactions object to a tibble for closer examination.
library(tidyverse)

grocery_frequency <- tibble(Items = names(itemFrequency(grocery)),
                            Frequency = itemFrequency(grocery))
head(grocery_frequency)
## # A tibble: 6 x 2
## Items Frequency
## <chr> <dbl>
## 1 abrasive cleaner 0.00356
## 2 artif. sweetener 0.00325
## 3 baby cosmetics 0.000610
## 4 baby food 0.000102
## 5 bags 0.000407
## 6 baking powder 0.0177
The tibble is easier to interact with, but it is currently sorted alphabetically by item rather than numerically. So let's look at the top 20 items by purchase frequency. Whole milk is the most frequently purchased item, followed by other vegetables, rolls/buns, soda and yogurt. Interestingly, only one meat (sausage) makes the top 20, while fruits/vegetables, breads and soda appear to be more popular.
The table below shows the support, i.e. the relative frequency with which each item is purchased. For example, whole milk has a support of 0.256, meaning that about 1 in 4 transactions (25.6% of them) include whole milk.
grocery_frequency %>%
  arrange(desc(Frequency)) %>%
  slice(1:20)
## # A tibble: 20 x 2
## Items Frequency
## <chr> <dbl>
## 1 whole milk 0.256
## 2 other vegetables 0.193
## 3 rolls/buns 0.184
## 4 soda 0.174
## 5 yogurt 0.140
## 6 bottled water 0.111
## 7 root vegetables 0.109
## 8 tropical fruit 0.105
## 9 shopping bags 0.0985
## 10 sausage 0.0940
## 11 pastry 0.0890
## 12 citrus fruit 0.0828
## 13 bottled beer 0.0805
## 14 newspapers 0.0798
## 15 canned beer 0.0777
## 16 pip fruit 0.0756
## 17 fruit/vegetable juice 0.0723
## 18 whipped/sour cream 0.0717
## 19 brown bread 0.0649
## 20 domestic eggs 0.0634
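Support is simply an item's absolute count divided by the total number of transactions. A minimal sketch confirming this for whole milk, using itemFrequency with type = "absolute":

# Relative frequency (support) = absolute count / number of transactions
itemFrequency(grocery, type = "absolute")["whole milk"]   # 2513
2513 / length(grocery)                                    # 0.2555160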
Looking at the summary stats of the frequencies, the median support is 0.0104728 and the mean is 0.0260915. A number between the median and the mean would make a sensible support threshold when we generate the association rules later.
grocery_frequency %>%
  select(Frequency) %>%
  summary()
## Frequency
## Min. :0.0001017
## 1st Qu.:0.0038637
## Median :0.0104728
## Mean :0.0260915
## 3rd Qu.:0.0310117
## Max. :0.2555160
Association rules are generated with the apriori
algorithm from the arules package.
But before we can mine the rules, we need to choose the parameters. For the support threshold, suppose we are only interested in patterns that occur at least five times a day over a 30-day period (roughly the number of days in a standard month). That works out to 150 transactions out of 9835, giving a support value of about 0.0153, which sits between the median and the mean noted in the frequency summary above.
The confidence parameter is based on the conditional probability that an item is purchased given the other items already in the transaction: the higher the confidence of a rule, the stronger the relationship. Since even the most frequently purchased item (whole milk) has a support of only 0.256, it makes little sense to demand a confidence threshold higher than that, so we set the confidence to 0.25.
Lastly, since a rule needs both an antecedent and a consequent, a single item cannot tell us anything about what else a customer will buy, so we set the minimum rule length (minlen) to 2 items.
Now we are ready to extract the association rules from the data.
frequency_per_day <- 5     # patterns occurring at least 5 times a day
days_per_period <- 30      # over a 30-day period
total_transactions <- length(grocery)

# Support threshold: 150 transactions out of 9835 (~0.0153)
support_value <- (frequency_per_day * days_per_period) / total_transactions

grocery_rules <- apriori(grocery,
                         parameter = list(
                           support = support_value,
                           confidence = 0.25,
                           minlen = 2
                         ))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.01525165 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 150
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [71 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [74 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
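One quick sanity check: the "Absolute minimum support count: 150" reported above is just our support threshold converted back into a transaction count.

# apriori's absolute minimum support count = support * ntransactions
support_value * length(grocery)   # 150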
The apriori algorithm created 74 grocery rules from the parameters we fed it, and the rule summary includes information about the quality measures: support, confidence, coverage, lift and count.
As a reminder: support is the proportion of transactions in which an itemset appears; confidence is the conditional probability of purchasing the consequent (rhs) given the antecedent (lhs); coverage is the support of the antecedent alone, i.e. how often the lhs appears at all; lift measures how much more often the lhs and rhs occur together than we would expect if they were independent; and count is the absolute number of transactions supporting the rule.
summary(grocery_rules)
## set of 74 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 61 13
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.176 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01556 Min. :0.2537 Min. :0.03437 Min. :0.9932
## 1st Qu.:0.01871 1st Qu.:0.3084 1st Qu.:0.05328 1st Qu.:1.5038
## Median :0.02227 Median :0.3499 Median :0.06416 Median :1.6650
## Mean :0.02614 Mean :0.3588 Mean :0.07496 Mean :1.7381
## 3rd Qu.:0.02895 3rd Qu.:0.4047 3rd Qu.:0.08277 3rd Qu.:1.9183
## Max. :0.07483 Max. :0.5129 Max. :0.25552 Max. :3.0404
## count
## Min. :153.0
## 1st Qu.:184.0
## Median :219.0
## Mean :257.1
## 3rd Qu.:284.8
## Max. :736.0
##
## mining info:
## data ntransactions support confidence
## grocery 9835 0.01525165 0.25
## call
## apriori(data = grocery, parameter = list(support = support_value, confidence = 0.25, minlen = 2))
Once the rules are created, it is possible to view them using the inspect function. Sorting by confidence, we can see the rules in descending order of strength. It is not surprising that 9 of the top 10 rules predict that whole milk will be purchased (rhs) given the items purchased with it (lhs), since whole milk is the most purchased item (1 in 4 transactions include it).
Explaining the first rule: a customer who buys other vegetables and yogurt goes on to buy whole milk about 51% of the time (confidence = 0.513), and with a lift of 2.0 is roughly twice as likely to buy whole milk as a randomly chosen customer.
grocery_rules %>%
  sort(by = "confidence") %>%
  head(n = 10) %>%
  inspect()
## lhs rhs support
## [1] {other vegetables, yogurt} => {whole milk} 0.02226741
## [2] {butter} => {whole milk} 0.02755465
## [3] {curd} => {whole milk} 0.02613116
## [4] {other vegetables, root vegetables} => {whole milk} 0.02318251
## [5] {other vegetables, tropical fruit} => {whole milk} 0.01708185
## [6] {root vegetables, whole milk} => {other vegetables} 0.02318251
## [7] {domestic eggs} => {whole milk} 0.02999492
## [8] {rolls/buns, yogurt} => {whole milk} 0.01555669
## [9] {whipped/sour cream} => {whole milk} 0.03223183
## [10] {root vegetables} => {whole milk} 0.04890696
## confidence coverage lift count
## [1] 0.5128806 0.04341637 2.007235 219
## [2] 0.4972477 0.05541434 1.946053 271
## [3] 0.4904580 0.05327911 1.919481 257
## [4] 0.4892704 0.04738180 1.914833 228
## [5] 0.4759207 0.03589222 1.862587 168
## [6] 0.4740125 0.04890696 2.449770 228
## [7] 0.4727564 0.06344687 1.850203 295
## [8] 0.4526627 0.03436706 1.771563 153
## [9] 0.4496454 0.07168277 1.759754 317
## [10] 0.4486940 0.10899847 1.756031 481
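As a sanity check, the quality measures for rule [1] can be reproduced from raw counts. A minimal sketch using only figures already reported above (2513 whole-milk baskets from the item frequency table, plus rule [1]'s count and coverage):

n      <- 9835                       # total transactions
n_both <- 219                        # baskets with {other vegetables, yogurt, whole milk}
n_lhs  <- round(0.04341637 * n)      # baskets with the lhs alone = 427
n_milk <- 2513                       # baskets with whole milk

n_both / n                           # support    = 0.02226741
n_both / n_lhs                       # confidence = 0.5128806
n_lhs / n                            # coverage   = 0.04341637
(n_both / n_lhs) / (n_milk / n)      # lift       = 2.007235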
Looking at it another way, confidence can be high simply because the consequent item is popular, so we can also rank the rules by lift, which corrects for the base frequency of the consequent and shows how strongly the items in an itemset are actually related to one another.
So while the confidence ranking tells us that customers who purchase both other vegetables and yogurt buy whole milk about half the time, the lift ranking tells us something different: rule #8 indicates that customers who purchase both other vegetables and whole milk are about twice as likely as the average customer (lift = 2.13) to also purchase yogurt.
grocery_rules %>%
  sort(by = "lift") %>%
  head(n = 10) %>%
  inspect()
## lhs rhs support confidence
## [1] {beef} => {root vegetables} 0.01738688 0.3313953
## [2] {other vegetables, whole milk} => {root vegetables} 0.02318251 0.3097826
## [3] {pip fruit} => {tropical fruit} 0.02043721 0.2701613
## [4] {root vegetables, whole milk} => {other vegetables} 0.02318251 0.4740125
## [5] {curd} => {yogurt} 0.01728521 0.3244275
## [6] {root vegetables} => {other vegetables} 0.04738180 0.4347015
## [7] {chicken} => {other vegetables} 0.01789527 0.4170616
## [8] {other vegetables, whole milk} => {yogurt} 0.02226741 0.2975543
## [9] {tropical fruit, whole milk} => {other vegetables} 0.01708185 0.4038462
## [10] {whipped/sour cream} => {other vegetables} 0.02887646 0.4028369
## coverage lift count
## [1] 0.05246568 3.040367 171
## [2] 0.07483477 2.842082 228
## [3] 0.07564820 2.574648 201
## [4] 0.04890696 2.449770 228
## [5] 0.05327911 2.325615 170
## [6] 0.10899847 2.246605 466
## [7] 0.04290798 2.155439 176
## [8] 0.07483477 2.132979 219
## [9] 0.04229792 2.087140 168
## [10] 0.07168277 2.081924 284
Since the two metric perspectives largely confirm one another, perhaps clustering the rules and visualizing them would add further insight into the market basket.
K-means clustering is one of the most commonly used clustering approaches, so let us see how it fares in this case.
First we need to extract the output from the grocery rules, which is an S4 object, into a data frame by coercing it with the as method. The coercion combines the lhs and rhs into a single column called rules.
grocery_rules_df <- as(grocery_rules, "data.frame")
head(grocery_rules_df)
## rules support confidence coverage lift
## 1 {cream cheese} => {whole milk} 0.01647178 0.4153846 0.03965430 1.625670
## 2 {chicken} => {other vegetables} 0.01789527 0.4170616 0.04290798 2.155439
## 3 {chicken} => {whole milk} 0.01759024 0.4099526 0.04290798 1.604411
## 4 {white bread} => {whole milk} 0.01708185 0.4057971 0.04209456 1.588147
## 5 {chocolate} => {whole milk} 0.01667514 0.3360656 0.04961871 1.315243
## 6 {coffee} => {whole milk} 0.01870869 0.3222417 0.05805796 1.261141
## count
## 1 162
## 2 176
## 3 173
## 4 168
## 5 164
## 6 184
Next we need to decide which variables to cluster on. Lift and count are reasonable candidates, so let's take a look at the distribution of each of those metrics.
grocery_rules_df %>%
  select(lift, count) %>%
  summary()
## lift count
## Min. :0.9932 Min. :153.0
## 1st Qu.:1.5038 1st Qu.:184.0
## Median :1.6650 Median :219.0
## Mean :1.7381 Mean :257.1
## 3rd Qu.:1.9183 3rd Qu.:284.8
## Max. :3.0404 Max. :736.0
The ranges of each variable are quite different, so we need to scale them for the K-means process.
grocery_rules_scaled <- grocery_rules_df %>%
  select(lift, count) %>%
  scale()
head(grocery_rules_scaled)
## lift count
## 1 -0.3112089 -0.7984402
## 2 1.1555230 -0.6809257
## 3 -0.3700669 -0.7061074
## 4 -0.4150936 -0.7480768
## 5 -1.1706635 -0.7816524
## 6 -1.3204511 -0.6137746
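As a reminder, scale() standardizes each column to mean 0 and standard deviation 1, i.e. (x - mean(x)) / sd(x). A minimal check of the first scaled lift value:

# Manual standardization should match grocery_rules_scaled[1, "lift"]
(grocery_rules_df$lift[1] - mean(grocery_rules_df$lift)) /
  sd(grocery_rules_df$lift)   # -0.3112089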
Before clustering the data, it would be good to see how many clusters the data itself suggests. There are three common methods for choosing the number of clusters, so we will try them all.
The first method is the Elbow method, invoked by setting method = "wss", which plots the within-cluster sum of squares (WCSS): the total squared distance between each point and its cluster's centroid. The Elbow method suggests that 4 clusters are optimal.
library(factoextra)
fviz_nbclust(grocery_rules_scaled, kmeans, method = "wss")
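Under the hood, the elbow plot is just the total within-cluster sum of squares computed for a range of k. An equivalent manual sketch, assuming the scaled matrix above:

# Manual elbow: total WCSS for k = 1..10 (what method = "wss" plots)
set.seed(100)
wss <- sapply(1:10, function(k) {
  kmeans(grocery_rules_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")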
The second method is the Average Silhouette method, invoked by setting method = "silhouette", which compares how similar each point is to its own cluster versus items in neighboring clusters. The Silhouette method suggests that 2 clusters are optimal.
fviz_nbclust(grocery_rules_scaled, kmeans, method = "silhouette")
The third method is the Gap Statistic, invoked by setting method = "gap_stat", which generates a random reference dataset and measures the difference in WCSS between the original and reference data. The Gap Statistic appears to favor 1 cluster, but since a single cluster does not separate the data at all, the second-best choice appears to be 4 clusters.
fviz_nbclust(grocery_rules_scaled, kmeans, method = "gap_stat")
Since two methods suggested 4 clusters and one suggested 2, we will create both cluster groupings and visualize each.

First we try two centers with 25 random initial configurations. The K-means algorithm keeps the best result, which here has two clusters of sizes 66 and 8.
set.seed(101)
k_2 <- kmeans(grocery_rules_scaled, centers = 2, nstart = 25)
k_2$size
## [1] 66 8
It may be more useful to analyze the clusters visually using the factoextra package's fviz_cluster function. Here we see that with only two clusters, cluster 1 has the highest overall lift but low counts, while cluster 2 has the highest counts but lower lift.
fviz_cluster(k_2, data = grocery_rules_scaled, repel = TRUE)
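To see which rules landed in the small, high-count cluster, we can attach the assignments back to the data frame. A sketch (note that K-means cluster labels are arbitrary, so we look up the size-8 cluster's label rather than hard-coding it):

# Attach the 2-cluster assignments and inspect the small, high-count cluster
grocery_rules_df$cluster <- k_2$cluster
high_count_label <- which.min(k_2$size)   # label of the size-8 cluster
grocery_rules_df %>%
  filter(cluster == high_count_label) %>%
  arrange(desc(count)) %>%
  select(rules, lift, count)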
Lastly, we try four centers with 25 initial configurations. The K-means algorithm keeps the best result, which here has four clusters of sizes 5, 32, 8 and 29.
set.seed(102)
k_4 <- kmeans(grocery_rules_scaled, centers = 4, nstart = 25)
k_4$size
## [1] 5 32 8 29
We see that with four clusters, the original high-count cluster 2 is unchanged (though it is now labeled cluster 3), while the original cluster 1 with its lower counts has been split more granularly into three separate clusters. The new cluster 1 still has the highest overall lift, cluster 4 has average lift, and cluster 2 has the lowest lift.
fviz_cluster(k_4, data = grocery_rules_scaled, repel = TRUE)