This article uses association rules, an unsupervised learning technique, to find associations and dependencies between items purchased by customers in the Groceries dataset on Kaggle, which can be retrieved from https://www.kaggle.com/heeraldedhia/groceries-dataset.
groceries_data <- read.csv2("Groceries_dataset.csv", header=TRUE, sep=",")
sprintf("The original dataset consists of %s observations with %s features.", nrow(groceries_data), ncol(groceries_data))
## [1] "The original dataset consists of 38765 observations with 3 features."
library(arules)
library(arulesViz)
groceries_tran <- read.transactions("Groceries_dataset.csv", format="single", sep=",", cols=c("Member_number","itemDescription"), header=TRUE)
inspect(groceries_tran[1])
## items transactionID
## [1] {canned beer,
## hygiene articles,
## misc. beverages,
## pastry,
## pickled vegetables,
## salty snack,
## sausage,
## semi-finished bread,
## soda,
## whole milk,
## yogurt} 1000
groceries_tran_dt <- read.transactions("Groceries_dataset.csv", format="single", sep=",", cols=c("Date","itemDescription"), header=TRUE)
inspect(groceries_tran_dt[1])
## items transactionID
## [1] {berries,
## bottled beer,
## bottled water,
## brown bread,
## butter,
## candles,
## chocolate,
## citrus fruit,
## cleaner,
## coffee,
## curd,
## dishes,
## domestic eggs,
## flower (seeds),
## frozen potato products,
## frozen vegetables,
## hamburger meat,
## Instant food products,
## onions,
## other vegetables,
## sausage,
## shopping bags,
## sliced cheese,
## soda,
## specialty chocolate,
## tropical fruit,
## waffles,
## whipped/sour cream,
## whole milk,
## yogurt} 01-01-2014
sprintf("There are %s unique products purchased and a total of %s transactions recorded", dim(groceries_tran)[2], dim(groceries_tran)[1])
## [1] "There are 167 unique products purchased and a total of 3898 transactions recorded"
sprintf("There are a total of %s unique products purchased on %s transaction days recorded", dim(groceries_tran_dt)[2], dim(groceries_tran_dt)[1])
## [1] "There are a total of 167 unique products purchased on 728 transaction days recorded"
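The two transaction objects differ only in the column used to group rows into baskets. The base R sketch below illustrates that grouping on a tiny made-up data frame. The rows are hypothetical, and `split()` with `unique()` is an assumption chosen to mimic the idea of one basket per key; it is not how `read.transactions` is implemented.

```r
# Hypothetical purchase rows laid out like Groceries_dataset.csv
purchases <- data.frame(
  Member_number   = c(1000, 1000, 1001, 1000, 1001),
  Date            = c("01-01-2014", "02-01-2014", "01-01-2014", "02-01-2014", "03-01-2014"),
  itemDescription = c("soda", "yogurt", "soda", "whole milk", "pastry"),
  stringsAsFactors = FALSE
)

# One basket per customer: every item a member ever bought
by_member <- lapply(split(purchases$itemDescription, purchases$Member_number), unique)

# One basket per day: every item any customer bought on that date
by_date <- lapply(split(purchases$itemDescription, purchases$Date), unique)

print(by_member)
print(by_date)
```

Grouping by date pools purchases from unrelated customers, which is why the date-keyed baskets in this article are much larger than the customer-keyed ones.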
#Let’s proceed to analyse transactions by Customer Id
itemFrequency(groceries_tran[,163:167])
## white bread white wine whole milk yogurt zwieback
## 0.08876347 0.04412519 0.45818368 0.28296562 0.01539251
itemFrequencyPlot(groceries_tran, support = 0.10, main="Item Frequency")
itemFrequencyPlot(groceries_tran, topN = 15, main="Item Frequency")
itemFrequencyPlot(groceries_tran, type=c("absolute"), topN = 15, main="Item Frequency")
image(groceries_tran[1:50])
itemFrequency(groceries_tran[1:50,c(120:123,164:167)])
## rice roll products rolls/buns root vegetables white wine
## 0.02 0.02 0.52 0.30 0.04
## whole milk yogurt zwieback
## 0.46 0.38 0.00
Vertical lines of black points in the image mark products that occur in many transactions. Two items, rolls/buns and whole milk, are present in more than 40% of the first 50 transactions (52% and 46% respectively).
image(groceries_tran_dt[1:100])
Looking at transactions that occurred in the first 3 days of each month in 2014 and 2015, we can see that some items are purchased on almost every day while others are purchased only occasionally. A deeper dive into the occasionally purchased items might reveal some seasonality in the purchasing pattern.
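The first 100 date-keyed transactions cover the start of each month rather than the first 100 calendar days, probably because the transaction IDs are DD-MM-YYYY strings, which sort by day before month and year. The base R sketch below uses made-up dates to show the difference between text order and chronological order. It is an illustration only and does not inspect how arules orders its transaction IDs.

```r
# Hypothetical dates in the dataset's DD-MM-YYYY layout
dates <- c("15-03-2014", "01-01-2015", "02-01-2014", "01-02-2014", "01-01-2014")

# Sorting as plain strings compares the day field first
as_text <- sort(dates, method = "radix")

# Parsing to Date objects gives chronological order instead
parsed <- as.Date(dates, format = "%d-%m-%Y")
chronological <- format(sort(parsed), "%d-%m-%Y")

print(as_text)
print(chronological)
```

For a seasonality analysis, parsing the dates first would let the transactions be examined in calendar order.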
In association rule mining, frequent patterns are extracted in the form of X → Y rules and evaluated with two measures, support and confidence. Support is the fraction of all transactions in the dataset that contain both X and Y. Confidence is the fraction of transactions containing X that also contain Y.
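To make the two definitions concrete, here is a minimal base R sketch that computes them by hand on five made-up baskets. The baskets and the helper names `contains`, `support` and `confidence` are hypothetical and used only for illustration; the analysis itself relies on arules.

```r
# Five hypothetical baskets (not taken from the Groceries dataset)
baskets <- list(
  c("whole milk", "yogurt", "soda"),
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("soda", "rolls/buns"),
  c("yogurt", "rolls/buns")
)

# TRUE when a basket holds every item in `items`
contains <- function(basket, items) all(items %in% basket)

# Share of all baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, contains, logical(1), items = items))
}

# Share of baskets containing `lhs` that also contain `rhs`
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

s_xy <- support(c("yogurt", "whole milk"))   # 2 of 5 baskets
c_xy <- confidence("yogurt", "whole milk")   # 2 of the 3 yogurt baskets
print(c(support = s_xy, confidence = c_xy))
```

Note that confidence is not symmetric: swapping the antecedent and consequent changes the denominator, so {yogurt} → {whole milk} and {whole milk} → {yogurt} generally have different confidence values.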
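Let’s find rules that involve at least two products. For the customer-level transactions, the minimum support and confidence are set to 3% and 30% respectively. That means a rule must appear in at least 3% of all 3898 transactions, and its consequent must appear in at least 30% of the transactions where the antecedent item (or items) occurs. For the date-level transactions, where each basket pools a whole day of purchases, both thresholds are raised to 50% and rules are limited to four items. A sufficiently high confidence level indicates that the consequent item frequently accompanies the antecedent, although the lift reported alongside each rule is still needed to judge whether that co-occurrence exceeds what the consequent's overall frequency would produce.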
g_rules <- apriori(groceries_tran, parameter = list(support = 0.03, confidence = 0.3, minlen = 2))
g_dt_rules <- apriori(groceries_tran_dt, parameter = list(support = 0.5, confidence = 0.5, minlen = 2, maxlen=4))
summary(g_rules)
## set of 370 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 182 180 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 3.00 2.53 3.00 4.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.03002 Min. :0.3015 Min. :0.05080 Min. :0.9697
## 1st Qu.:0.03438 1st Qu.:0.3710 1st Qu.:0.07773 1st Qu.:1.1224
## Median :0.04233 Median :0.4321 Median :0.10056 Median :1.1753
## Mean :0.05206 Mean :0.4399 Mean :0.12342 Mean :1.1873
## 3rd Qu.:0.05817 3rd Qu.:0.5000 3rd Qu.:0.13982 3rd Qu.:1.2462
## Max. :0.19138 Max. :0.6569 Max. :0.45818 Max. :1.5547
## count
## Min. :117.0
## 1st Qu.:134.0
## Median :165.0
## Mean :202.9
## 3rd Qu.:226.8
## Max. :746.0
##
## mining info:
## data ntransactions support confidence
## groceries_tran 3898 0.03 0.3
## call
## apriori(data = groceries_tran, parameter = list(support = 0.03, confidence = 0.3, minlen = 2))
summary(g_dt_rules)
## set of 459 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 140 219 100
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.913 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.5000 Min. :0.5452 Min. :0.5206 Min. :0.9805
## 1st Qu.:0.5220 1st Qu.:0.7398 1st Qu.:0.6044 1st Qu.:0.9976
## Median :0.5522 Median :0.8518 Median :0.6854 Median :1.0032
## Mean :0.5757 Mean :0.8276 Mean :0.7078 Mean :1.0044
## 3rd Qu.:0.6044 3rd Qu.:0.9172 3rd Qu.:0.8159 3rd Qu.:1.0099
## Max. :0.8750 Max. :0.9772 Max. :0.9574 Max. :1.0474
## count
## Min. :364.0
## 1st Qu.:380.0
## Median :402.0
## Mean :419.1
## 3rd Qu.:440.0
## Max. :637.0
##
## mining info:
## data ntransactions support confidence
## groceries_tran_dt 728 0.5 0.5
## call
## apriori(data = groceries_tran_dt, parameter = list(support = 0.5, confidence = 0.5, minlen = 2, maxlen = 4))
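The smallest counts in the two summaries above (117 and 364) are consistent with the support thresholds: a rule needs at least support × (number of transactions) supporting baskets. A quick base R check, using only the figures printed in the summaries:

```r
# Minimum number of baskets a rule must appear in, given a support threshold
min_count <- function(support, n_transactions) ceiling(support * n_transactions)

min_member <- min_count(0.03, 3898)  # customer-level transactions
min_date   <- min_count(0.5, 728)    # date-level transactions

print(c(member = min_member, date = min_date))
```

A 50% threshold on the date-level data is demanding in relative terms, yet 459 rules still pass it, which reflects how many different items end up in each pooled daily basket.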
options(width = 250)
inspect(g_rules[366:370])
## lhs rhs support confidence coverage lift count
## [1] {other vegetables, rolls/buns, whole milk} => {yogurt} 0.03437660 0.4187500 0.08209338 1.479862 134
## [2] {other vegetables, rolls/buns, soda} => {whole milk} 0.03181119 0.6048780 0.05259107 1.320165 124
## [3] {rolls/buns, soda, whole milk} => {other vegetables} 0.03181119 0.4881890 0.06516162 1.296295 124
## [4] {other vegetables, soda, whole milk} => {rolls/buns} 0.03181119 0.4592593 0.06926629 1.313421 124
## [5] {other vegetables, rolls/buns, whole milk} => {soda} 0.03181119 0.3875000 0.08209338 1.236068 124
options(width = 250)
inspect(g_dt_rules[411:415])
## lhs rhs support confidence coverage lift count
## [1] {soda, whole milk, yogurt} => {root vegetables} 0.5068681 0.7515275 0.6744505 0.9947491 369
## [2] {rolls/buns, root vegetables, yogurt} => {other vegetables} 0.5137363 0.9166667 0.5604396 1.0020020 374
## [3] {other vegetables, root vegetables, yogurt} => {rolls/buns} 0.5137363 0.9055690 0.5673077 1.0158001 374
## [4] {other vegetables, rolls/buns, root vegetables} => {yogurt} 0.5137363 0.8237885 0.6236264 0.9995301 374
## [5] {other vegetables, rolls/buns, yogurt} => {root vegetables} 0.5137363 0.7586207 0.6771978 1.0041379 374
options(width = 250)
inspect(sort(g_rules, by = "support")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {other vegetables} => {whole milk} 0.1913802 0.5081744 0.3766034 1.109106 746
## [2] {whole milk} => {other vegetables} 0.1913802 0.4176932 0.4581837 1.109106 746
## [3] {rolls/buns} => {whole milk} 0.1785531 0.5106383 0.3496665 1.114484 696
## [4] {whole milk} => {rolls/buns} 0.1785531 0.3896976 0.4581837 1.114484 696
## [5] {soda} => {whole milk} 0.1511031 0.4819967 0.3134941 1.051973 589
options(width = 250)
inspect(sort(g_rules, by = "confidence")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660 0.6568627 0.05233453 1.433623 134
## [2] {bottled water, yogurt} => {whole milk} 0.04027707 0.6061776 0.06644433 1.323001 157
## [3] {bottled beer, rolls/buns} => {whole milk} 0.03822473 0.6056911 0.06310929 1.321939 149
## [4] {other vegetables, rolls/buns, soda} => {whole milk} 0.03181119 0.6048780 0.05259107 1.320165 124
## [5] {shopping bags, yogurt} => {whole milk} 0.03309389 0.6028037 0.05489995 1.315638 129
options(width = 250)
inspect(sort(g_rules, by = "lift")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {rolls/buns, yogurt} => {sausage} 0.03565931 0.3202765 0.11133915 1.554717 139
## [2] {rolls/buns, sausage} => {yogurt} 0.03565931 0.4330218 0.08234992 1.530298 139
## [3] {other vegetables, yogurt} => {sausage} 0.03719856 0.3091684 0.12031811 1.500795 145
## [4] {sausage, whole milk} => {yogurt} 0.04489482 0.4196643 0.10697794 1.483093 175
## [5] {other vegetables, rolls/buns, whole milk} => {yogurt} 0.03437660 0.4187500 0.08209338 1.479862 134
plot(g_rules)
plot(g_rules, shading="order", control=list(main="Two-key plot"))
plot(g_rules, method="paracoord", control=list(reorder=TRUE))
plot(g_rules, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{other vegetables,rolls/buns,yogurt}" "{rolls/buns,sausage}" "{rolls/buns,whole milk,yogurt}" "{other vegetables,whole milk,yogurt}" "{other vegetables,rolls/buns,whole milk}"
## [6] "{rolls/buns,yogurt}" "{bottled beer,rolls/buns}" "{sausage,yogurt}" "{other vegetables,rolls/buns,soda}" "{shopping bags,yogurt}"
## [11] "{other vegetables,soda,whole milk}" "{pastry,yogurt}" "{beef,other vegetables}" "{bottled water,yogurt}" "{sausage,whole milk}"
## [16] "{other vegetables,shopping bags}" "{frankfurter,whole milk}" "{other vegetables,yogurt}" "{rolls/buns,soda,whole milk}" "{pip fruit,yogurt}"
## [21] "{rolls/buns,shopping bags}" "{fruit/vegetable juice,whole milk}" "{other vegetables,pastry}" "{citrus fruit,yogurt}" "{domestic eggs,rolls/buns}"
## [26] "{butter,whole milk}" "{other vegetables,sausage}" "{shopping bags,whole milk}" "{brown bread,whole milk}" "{ham}"
## [31] "{canned beer,other vegetables}" "{beef,whole milk}" "{rolls/buns,whipped/sour cream}" "{bottled water,root vegetables}" "{newspapers,whole milk}"
## [36] "{shopping bags,soda}" "{fruit/vegetable juice,other vegetables}" "{pastry,rolls/buns}" "{brown bread,rolls/buns}" "{brown bread,other vegetables}"
## [41] "{bottled beer,whole milk}" "{soda,whole milk}" "{pip fruit,whole milk}" "{bottled water,rolls/buns}" "{other vegetables,whole milk}"
## [46] "{pastry,whole milk}" "{chocolate}" "{bottled beer,other vegetables}" "{root vegetables,yogurt}" "{rolls/buns,whole milk}"
## [51] "{sausage,soda}" "{domestic eggs,other vegetables}" "{whole milk,yogurt}" "{sugar}" "{bottled water,other vegetables}"
## [56] "{bottled water,whole milk}" "{other vegetables,soda}" "{frankfurter,rolls/buns}" "{whipped/sour cream,whole milk}" "{citrus fruit,whole milk}"
## [61] "{butter,other vegetables}" "{newspapers,rolls/buns}" "{soda,yogurt}" "{citrus fruit,rolls/buns}" "{canned beer,soda}"
## [66] "{other vegetables,rolls/buns}" "{citrus fruit,other vegetables}" "{domestic eggs,whole milk}" "{tropical fruit,yogurt}" "{rolls/buns,soda}"
## [71] "{other vegetables,pip fruit}" "{waffles}" "{canned beer,rolls/buns}" "{meat}" "{bottled water,soda}"
## [76] "{other vegetables,whipped/sour cream}" "{sausage}" "{UHT-milk}" "{frankfurter,other vegetables}" "{pip fruit,rolls/buns}"
## [81] "{pork,whole milk}" "{other vegetables,tropical fruit}" "{oil}" "{root vegetables,sausage}" "{other vegetables,root vegetables}"
## [86] "{hamburger meat}" "{root vegetables,whole milk}" "{rolls/buns,root vegetables}" "{curd}" "{onions}"
## [91] "{shopping bags}" "{other vegetables,pork}" "{fruit/vegetable juice}" "{rolls/buns,tropical fruit}" "{ice cream}"
## [96] "{napkins}" "{butter}" "{tropical fruit,whole milk}" "{salty snack}" "{frozen vegetables}"
## [101] "{pip fruit,soda}" "{root vegetables,soda}" "{berries}" "{white bread}" "{canned beer,whole milk}"
## [106] "{pip fruit}" "{frankfurter}" "{frozen meals}" "{bottled beer}" "{yogurt}"
## [111] "{cream cheese }" "{bottled water}" "{brown bread}" "{newspapers}" "{pastry}"
## [116] "{domestic eggs}" "{coffee}" "{rolls/buns}" "{margarine}" "{canned beer}"
## [121] "{beef}" "{whole milk}" "{citrus fruit}" "{dessert}" "{other vegetables}"
## [126] "{root vegetables}" "{pork}" "{chicken}" "{tropical fruit}" "{pastry,soda}"
## [131] "{whipped/sour cream}" "{citrus fruit,soda}" "{soda}" "{butter milk}" "{soda,tropical fruit}"
## Itemsets in Consequent (RHS)
## [1] "{soda}" "{other vegetables}" "{whole milk}" "{rolls/buns}" "{yogurt}" "{tropical fruit}" "{root vegetables}" "{sausage}"
plot(g_rules, method="group")
plot(g_rules, method="graph", max = 20)
yogurt_rules <- subset(g_rules, items %in% "yogurt")
yogurt_rules
## set of 85 rules
options(width = 250)
inspect(yogurt_rules[80:85])
## lhs rhs support confidence coverage lift count
## [1] {whole milk, yogurt} => {other vegetables} 0.07183171 0.4770017 0.15059005 1.266589 280
## [2] {other vegetables, whole milk} => {yogurt} 0.07183171 0.3753351 0.19138019 1.326434 280
## [3] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660 0.6568627 0.05233453 1.433623 134
## [4] {rolls/buns, whole milk, yogurt} => {other vegetables} 0.03437660 0.5214008 0.06593125 1.384482 134
## [5] {other vegetables, whole milk, yogurt} => {rolls/buns} 0.03437660 0.4785714 0.07183171 1.368651 134
## [6] {other vegetables, rolls/buns, whole milk} => {yogurt} 0.03437660 0.4187500 0.08209338 1.479862 134
Here we can observe that yogurt is often purchased alongside whole milk, other vegetables and rolls/buns.
options(width = 250)
rules.margarine <- apriori(data=groceries_tran, parameter=list(supp=0.006, conf=0.2),
appearance=list(default="lhs", rhs="margarine"), control=list(verbose=FALSE))
rules.margarine.byconf<-sort(rules.margarine, by="confidence", decreasing=TRUE)
inspect(head(rules.margarine.byconf,3))
## lhs rhs support confidence coverage lift count
## [1] {butter, frankfurter} => {margarine} 0.006413545 0.2941176 0.02180605 2.514190 25
## [2] {frankfurter, shopping bags} => {margarine} 0.006157004 0.2500000 0.02462801 2.137061 24
## [3] {other vegetables, shopping bags, yogurt} => {margarine} 0.006670087 0.2363636 0.02821960 2.020494 26
options(width = 150)
inspect(supportingTransactions(rules.margarine, groceries_tran)[1:3])
## items
## 1 {baking powder} => {margarine}
## 2 {curd,sausage} => {margarine}
## 3 {curd,root vegetables} => {margarine}
## transactionIDs
## 1 {1110,1765,1948,2601,2625,2676,3062,3180,3363,3773,3903,3931,3960,4106,4178,4199,4442,4563,4600,4635,4761,4812,4864,4872,4983}
## 2 {1309,1582,1741,1777,2171,2592,2625,2696,2794,3046,3180,3589,3827,3830,3899,3919,4312,4430,4433,4455,4573,4718,4761,4812,4966}
## 3 {1146,1234,1248,1747,2056,2601,3100,3138,3180,3289,3556,3818,3827,3830,3919,3925,3960,4113,4199,4312,4455,4485,4773,4835}
plot(rules.margarine)
options(width = 250)
rules.ice_cream <- apriori(data=groceries_tran_dt, parameter=list(supp=0.006, conf=0.2),
appearance=list(default="lhs", rhs="ice cream"), control=list(verbose=FALSE))
rules.ice_cream.byconf<-sort(rules.ice_cream, by="confidence", decreasing=TRUE)
inspect(head(rules.ice_cream.byconf,3))
## lhs rhs support confidence coverage lift count
## [1] {herbs, organic sausage} => {ice cream} 0.006868132 1 0.006868132 3.516908 5
## [2] {herbs, organic sausage, pip fruit} => {ice cream} 0.006868132 1 0.006868132 3.516908 5
## [3] {herbs, organic sausage, tropical fruit} => {ice cream} 0.006868132 1 0.006868132 3.516908 5
plot(rules.ice_cream)
g_items_freq <- eclat(groceries_tran, parameter=list(supp=0.03))
inspect(sort(g_items_freq, by = "support")[1:5])
## items support count
## [1] {whole milk} 0.4581837 1786
## [2] {other vegetables} 0.3766034 1468
## [3] {rolls/buns} 0.3496665 1363
## [4] {soda} 0.3134941 1222
## [5] {yogurt} 0.2829656 1103
Whole milk is the most frequent item, appearing in 1786 of the 3898 transactions.
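The item counts above, together with the count of 746 reported earlier for {other vegetables} => {whole milk}, are enough to recompute that rule's quality measures by hand. The sketch below assumes the standard definition of lift, confidence divided by the support of the consequent, which the article reports but does not define.

```r
n        <- 3898  # customer-level transactions
count_x  <- 1468  # baskets containing other vegetables
count_y  <- 1786  # baskets containing whole milk
count_xy <- 746   # baskets containing both

support_xy    <- count_xy / n                   # share of all baskets with both items
confidence_xy <- count_xy / count_x             # share of X baskets that also hold Y
lift_xy       <- confidence_xy / (count_y / n)  # confidence relative to Y's base rate

print(round(c(support = support_xy, confidence = confidence_xy, lift = lift_xy), 6))
```

These values should agree with the 0.1913802, 0.5081744 and 1.109106 printed for this rule. A lift only slightly above 1 means other vegetables raise the chance of whole milk by about 11% over its base rate, so the high confidence is driven mostly by how common whole milk is.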
options(width = 150)
summary(g_items_freq)
## set of 415 itemsets
##
## most frequent items:
## whole milk other vegetables rolls/buns soda yogurt (Other)
## 107 82 75 56 51 476
##
## element (itemset/transaction) length distribution:sizes
## 1 2 3 4
## 72 256 85 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.041 2.000 4.000
##
## summary of quality measures:
## support count
## Min. :0.03002 Min. : 117.0
## 1st Qu.:0.03527 1st Qu.: 137.5
## Median :0.04310 Median : 168.0
## Mean :0.05969 Mean : 232.7
## 3rd Qu.:0.06478 3rd Qu.: 252.5
## Max. :0.45818 Max. :1786.0
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support call
## groceries_tran 3898 0.03 eclat(data = groceries_tran, parameter = list(supp = 0.03))
g_freq_rules <- ruleInduction(g_items_freq, groceries_tran, confidence=0.3, minlen=2)
g_freq_rules
## set of 370 rules
options(width = 150)
inspect(sort(g_freq_rules, by = "support")[1:5])
## lhs rhs support confidence lift itemset
## [1] {whole milk} => {other vegetables} 0.1913802 0.4176932 1.109106 343
## [2] {other vegetables} => {whole milk} 0.1913802 0.5081744 1.109106 343
## [3] {whole milk} => {rolls/buns} 0.1785531 0.3896976 1.114484 341
## [4] {rolls/buns} => {whole milk} 0.1785531 0.5106383 1.114484 341
## [5] {whole milk} => {soda} 0.1511031 0.3297872 1.051973 337
options(width = 150)
inspect(sort(g_freq_rules, by = "confidence")[1:5])
## lhs rhs support confidence lift itemset
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660 0.6568627 1.433623 325
## [2] {bottled water, yogurt} => {whole milk} 0.04027707 0.6061776 1.323001 279
## [3] {bottled beer, rolls/buns} => {whole milk} 0.03822473 0.6056911 1.321939 167
## [4] {other vegetables, rolls/buns, soda} => {whole milk} 0.03181119 0.6048780 1.320165 333
## [5] {shopping bags, yogurt} => {whole milk} 0.03309389 0.6028037 1.315638 182
options(width = 150)
inspect(sort(g_freq_rules, by = "lift")[1:5])
## lhs rhs support confidence lift itemset
## [1] {rolls/buns, yogurt} => {sausage} 0.03565931 0.3202765 1.554717 263
## [2] {rolls/buns, sausage} => {yogurt} 0.03565931 0.4330218 1.530298 263
## [3] {other vegetables, yogurt} => {sausage} 0.03719856 0.3091684 1.500795 262
## [4] {sausage, whole milk} => {yogurt} 0.04489482 0.4196643 1.483093 261
## [5] {other vegetables, rolls/buns, whole milk} => {yogurt} 0.03437660 0.4187500 1.479862 325
plot(g_freq_rules, method="graph", max = 15)
We employed the Apriori and Eclat algorithms to find associations between items purchased by customers in the Groceries dataset. The most frequently purchased item in the dataset was whole milk. Apriori generated 370 rules at 3% support and 30% confidence. Eclat generated 415 frequent itemsets at the same support threshold, and rule induction on those itemsets again yielded 370 rules. Eclat is often reported to be faster than Apriori, although run times were not compared in this analysis. The date-level view also hinted at possible seasonality in the purchase pattern of items, which would need further analysis to confirm.