Imagine 10000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
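The summary that follows comes from a transactions object. As a minimal sketch of how such an object can be built with `read.transactions` from arules, the chunk below writes three toy receipts to a temporary file and reads them back. The toy receipts and the temporary path are assumptions for illustration; in practice the path would point at the attached data set. The name `df` matches the object used in the `apriori` call later in this report.

```r
library(arules)

# Toy stand-in for the attached file: one receipt per line, items separated by commas.
receipts <- c(
  "whole milk,rolls/buns",
  "whole milk,other vegetables,yogurt",
  "soda"
)
path <- tempfile(fileext = ".csv")  # hypothetical path; use the real data file here
writeLines(receipts, path)

# format = "basket" treats every line as one transaction
df <- read.transactions(path, format = "basket", sep = ",")

summary(df)                           # rows, columns, density, frequent items, basket sizes
itemFrequency(df, type = "absolute")  # how many receipts contain each item
```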
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Whole milk is the most frequently bought item (2513 transactions), followed by other vegetables (1903 transactions).
## Association Rules
The apriori function of the arules package is used to generate the association rules.
The parameter support is “defined as the proportion of transactions in the data set which contain the item set. For example the item set {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).”
The parameter confidence is the proportion of transactions containing the LHS that also contain the RHS. For example, a confidence of 0.55 for the rule {milk, bread} => {butter} means that 55% of the transactions containing milk and bread also contain butter.
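The two definitions above can be checked by hand. The sketch below uses five invented transactions, chosen so that {milk, bread} has the support of 2/5 quoted above; the basket contents and the helper `support()` are assumptions for illustration, and the resulting confidence is specific to this toy data.

```r
# Five toy transactions (invented for illustration, not the Groceries data)
baskets <- list(
  c("milk", "bread"),
  c("butter"),
  c("beer", "diapers"),
  c("milk", "bread", "butter"),
  c("bread")
)

# support(X): share of baskets that contain every item in X
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("milk", "bread")
rhs <- "butter"

supp_lhs   <- support(lhs)            # 2 of 5 baskets
supp_rule  <- support(c(lhs, rhs))    # 1 of 5 baskets
confidence <- supp_rule / supp_lhs    # share of {milk, bread} baskets that also have butter
lift       <- confidence / support(rhs)  # confidence relative to how common butter is overall

c(support = supp_rule, confidence = confidence, lift = lift)
```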
Use minlen = 2 so that the LHS (antecedent) is not empty. By default, apriori uses minlen = 1, which also allows rules with an empty LHS.
The association rules need to meet the minimum support and confidence values. The model below, with a support of 0.02 and a confidence of 0.3, generated 37 association rules.
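A call along the following lines produces a log like the one shown next. The parameter list is taken from the call recorded in the mining info further down. Using the `Groceries` data set that ships with arules as a stand-in for the attached file is an assumption; it has the same dimensions (9835 transactions, 169 items), but it has not been checked here that the two are identical.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the attached file.
data("Groceries")
df <- Groceries

# minlen = 2 rules out rules with an empty LHS
rules <- apriori(
  data = df,
  parameter = list(support = 0.02, confidence = 0.3, minlen = 2)
)

summary(rules)  # rule length distribution and quality measures
```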
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.02      2
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Below is a summary of the base model, with a support of 0.02 and a confidence of 0.3, applied to the data set.
## set of 37 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 32 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.135 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02003 Min. :0.3079 Min. :0.04342 Min. :1.205
## 1st Qu.:0.02318 1st Qu.:0.3483 1st Qu.:0.05765 1st Qu.:1.514
## Median :0.02755 Median :0.3868 Median :0.07229 Median :1.756
## Mean :0.03134 Mean :0.3915 Mean :0.08227 Mean :1.730
## 3rd Qu.:0.03325 3rd Qu.:0.4249 3rd Qu.:0.09395 3rd Qu.:1.915
## Max. :0.07483 Max. :0.5129 Max. :0.19349 Max. :2.842
## count
## Min. :197.0
## 1st Qu.:228.0
## Median :271.0
## Mean :308.3
## 3rd Qu.:327.0
## Max. :736.0
##
## mining info:
## data ntransactions support confidence
## df 9835 0.02 0.3
## call
## apriori(data = df, parameter = list(support = 0.02, confidence = 0.3, minlen = 2))
When many association rules satisfy the support and confidence constraints, lift can be used to filter or rank them further. A greater lift value indicates a stronger association, and a lift above 1 means that the LHS and RHS occur together more often than would be expected if they were independent.
The table below lists the top 10 rules by lift. The strongest association, with the highest lift value (2.84), is {other vegetables, whole milk} => {root vegetables}. At the bottom of the list, with a lift of 1.91, is {other vegetables, root vegetables} => {whole milk}.
##      lhs                                    rhs                support    confidence coverage   lift     count
## [1]  {other vegetables, whole milk}      => {root vegetables}  0.02318251 0.3097826  0.07483477 2.842082 228
## [2]  {root vegetables, whole milk}       => {other vegetables} 0.02318251 0.4740125  0.04890696 2.449770 228
## [3]  {root vegetables}                   => {other vegetables} 0.04738180 0.4347015  0.10899847 2.246605 466
## [4]  {whipped/sour cream}                => {other vegetables} 0.02887646 0.4028369  0.07168277 2.081924 284
## [5]  {whole milk, yogurt}                => {other vegetables} 0.02226741 0.3974592  0.05602440 2.054131 219
## [6]  {other vegetables, yogurt}          => {whole milk}       0.02226741 0.5128806  0.04341637 2.007235 219
## [7]  {butter}                            => {whole milk}       0.02755465 0.4972477  0.05541434 1.946053 271
## [8]  {pork}                              => {other vegetables} 0.02165735 0.3756614  0.05765125 1.941476 213
## [9]  {curd}                              => {whole milk}       0.02613116 0.4904580  0.05327911 1.919481 257
## [10] {other vegetables, root vegetables} => {whole milk}       0.02318251 0.4892704  0.04738180 1.914833 228
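The quality measures in the table are related to each other, which makes it possible to sanity-check a row by hand. The sketch below recomputes confidence and lift for rule [1] from the printed values, using the standard definitions confidence = support / coverage and lift = confidence / support(RHS). The support of {root vegetables} is read off as the coverage of rule [3], whose LHS is exactly that item set. The variable names are illustrative.

```r
# Values copied from the table above
supp_rule <- 0.02318251  # support of rule [1]
cov_rule  <- 0.07483477  # coverage of rule [1], i.e. support of its LHS
supp_rhs  <- 0.10899847  # support of {root vegetables}, taken from the coverage of rule [3]

confidence <- supp_rule / cov_rule   # should agree with the confidence column
lift       <- confidence / supp_rhs  # should agree with the lift column

round(c(confidence = confidence, lift = lift), 4)
```

Both values agree with the table up to rounding of the printed digits.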
### Plot of the 10 rules using the paracoord method.
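A parallel coordinates plot of the top rules can be drawn with the `plot` method from arulesViz. The sketch below is one way to do it, not necessarily the code behind the original figure: it assumes the arulesViz package is installed and again uses the built-in `Groceries` data as a stand-in for the attached file. The names `rules` and `top10` are illustrative.

```r
library(arules)
library(arulesViz)

# Assumption: the built-in Groceries data stands in for the attached file.
data("Groceries")
rules <- apriori(
  Groceries,
  parameter = list(support = 0.02, confidence = 0.3, minlen = 2)
)

# sort() on rules orders by the chosen measure, largest first by default
top10 <- head(sort(rules, by = "lift"), 10)
inspect(top10)

# Each line runs from the LHS items (left) to the RHS item (right)
plot(top10, method = "paracoord")
```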
## Cluster Analysis
Hierarchical cluster analysis is used to cluster the items that appear in more than 5% of transactions. Hierarchical clustering produces dendrograms, and the hclust function supports several linkage methods. We will look at the complete method and the Ward method.
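library(arules)  # provides itemFrequency() and dissimilarity()
# keep only the items that appear in more than 5% of transactions
trans2 <- df[ , itemFrequency(df) > 0.05]
# pairwise dissimilarity between items (Jaccard by default)
glist <- dissimilarity(trans2, which = "items")
# plot dendrogram (complete linkage)
hc1 <- hclust(glist, method = "complete")
plot(hc1, cex = 0.8, hang = 2)
The Ward dendrogram is separated into 7 clusters, each with a border drawn around it.
hc2 <- hclust(glist, method = "ward.D2")
plot(hc2, cex = 0.7)
rect.hclust(hc2, k = 7, border = 2:5)
A tanglegram compares the two dendrograms.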
library(dendextend)  # provides tanglegram(); note that it masks stats::cutree
# Compute 2 hierarchical clusterings
hc3 <- hclust(glist, method = "complete")
hc4 <- hclust(glist, method = "ward.D2")
# Create two dendrograms
dend1 <- as.dendrogram(hc3)
dend2 <- as.dendrogram(hc4)
# Draw the two trees side by side, with lines connecting matching labels
tanglegram(dend1, dend2)
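The dendrograms show the clusters visually. To get the cluster membership as data, the tree can be cut with `cutree`. The sketch below uses base R only, on a small invented item-by-receipt matrix, so it runs without the Groceries data; the matrix and object names are assumptions for illustration. `dist(..., method = "binary")` is base R's Jaccard-type distance for 0/1 data, and `stats::cutree` is called explicitly because dendextend masks `cutree`.

```r
# Toy incidence matrix: rows are items, columns are receipts (invented data)
m <- rbind(
  milk   = c(1, 1, 1, 0, 0, 0),
  yogurt = c(1, 1, 0, 0, 0, 0),
  soda   = c(0, 0, 0, 1, 1, 1),
  beer   = c(0, 0, 0, 1, 1, 0)
)

# Distance between two items: share of receipts containing exactly one of them,
# among the receipts containing at least one of them
d <- dist(m, method = "binary")

hc_complete <- hclust(d, method = "complete")
hc_ward     <- hclust(d, method = "ward.D2")

# Cut each tree into 2 groups and list the items in each group
groups_complete <- stats::cutree(hc_complete, k = 2)
groups_ward     <- stats::cutree(hc_ward, k = 2)
split(names(groups_ward), groups_ward)
```

On the real data, the same `cutree` call with `k = 7` applied to `hc2` would return the membership of the 7 clusters outlined by `rect.hclust`.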