Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift. Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
library(arules)
library(arulesViz)
library(dplyr)
library(dendextend)
Read the transaction of the data set:
transactions <- read.transactions("GroceryDataSet.csv", format = 'basket', sep=',')
transactions
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
Summary of the data set
summary(transactions)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Top 10 items in the list which has most frequency
itemFrequencyPlot(transactions,topN=10,main="Grocery List")
Use apriori function from the arulez package to establish the rules.
arulez <- apriori(transactions, parameter = list(support = 0.02, confidence = 0.3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Summary of the rules
summary(arulez)
## set of 37 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 32 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.135 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02003 Min. :0.3079 Min. :0.04342 Min. :1.205
## 1st Qu.:0.02318 1st Qu.:0.3483 1st Qu.:0.05765 1st Qu.:1.514
## Median :0.02755 Median :0.3868 Median :0.07229 Median :1.756
## Mean :0.03134 Mean :0.3915 Mean :0.08227 Mean :1.730
## 3rd Qu.:0.03325 3rd Qu.:0.4249 3rd Qu.:0.09395 3rd Qu.:1.915
## Max. :0.07483 Max. :0.5129 Max. :0.19349 Max. :2.842
## count
## Min. :197.0
## 1st Qu.:228.0
## Median :271.0
## Mean :308.3
## 3rd Qu.:327.0
## Max. :736.0
##
## mining info:
## data ntransactions support confidence
## transactions 9835 0.02 0.3
## call
## apriori(data = transactions, parameter = list(support = 0.02, confidence = 0.3))
Inspect the head of the list sorted by lift
inspect(head(arulez, n = 10, by = "lift"))
## lhs rhs support
## [1] {other vegetables, whole milk} => {root vegetables} 0.02318251
## [2] {root vegetables, whole milk} => {other vegetables} 0.02318251
## [3] {root vegetables} => {other vegetables} 0.04738180
## [4] {whipped/sour cream} => {other vegetables} 0.02887646
## [5] {whole milk, yogurt} => {other vegetables} 0.02226741
## [6] {other vegetables, yogurt} => {whole milk} 0.02226741
## [7] {butter} => {whole milk} 0.02755465
## [8] {pork} => {other vegetables} 0.02165735
## [9] {curd} => {whole milk} 0.02613116
## [10] {other vegetables, root vegetables} => {whole milk} 0.02318251
## confidence coverage lift count
## [1] 0.3097826 0.07483477 2.842082 228
## [2] 0.4740125 0.04890696 2.449770 228
## [3] 0.4347015 0.10899847 2.246605 466
## [4] 0.4028369 0.07168277 2.081924 284
## [5] 0.3974592 0.05602440 2.054131 219
## [6] 0.5128806 0.04341637 2.007235 219
## [7] 0.4972477 0.05541434 1.946053 271
## [8] 0.3756614 0.05765125 1.941476 213
## [9] 0.4904580 0.05327911 1.919481 257
## [10] 0.4892704 0.04738180 1.914833 228
List sorted by the confidence:
inspect(head(arulez, n = 10, by = "confidence"))
## lhs rhs support
## [1] {other vegetables, yogurt} => {whole milk} 0.02226741
## [2] {butter} => {whole milk} 0.02755465
## [3] {curd} => {whole milk} 0.02613116
## [4] {other vegetables, root vegetables} => {whole milk} 0.02318251
## [5] {root vegetables, whole milk} => {other vegetables} 0.02318251
## [6] {domestic eggs} => {whole milk} 0.02999492
## [7] {whipped/sour cream} => {whole milk} 0.03223183
## [8] {root vegetables} => {whole milk} 0.04890696
## [9] {root vegetables} => {other vegetables} 0.04738180
## [10] {frozen vegetables} => {whole milk} 0.02043721
## confidence coverage lift count
## [1] 0.5128806 0.04341637 2.007235 219
## [2] 0.4972477 0.05541434 1.946053 271
## [3] 0.4904580 0.05327911 1.919481 257
## [4] 0.4892704 0.04738180 1.914833 228
## [5] 0.4740125 0.04890696 2.449770 228
## [6] 0.4727564 0.06344687 1.850203 295
## [7] 0.4496454 0.07168277 1.759754 317
## [8] 0.4486940 0.10899847 1.756031 481
## [9] 0.4347015 0.10899847 2.246605 466
## [10] 0.4249471 0.04809354 1.663094 201
Plot of the 37 rules.
plot(arulez, jitter = 5)
Plot of the 10 rules using the paracoord method:
head(arulez, n = 10, by = "confidence") %>%
plot(method = "paracoord")
head(arulez, n = 10, by = "confidence") %>%
plot(method = "graph", engine = "htmlwidget")
Hierarchical cluster analysis: it creates cluster dedograms.
Complete Method:
trans2 <- transactions[ , itemFrequency(transactions) > 0.05]
glist <- dissimilarity(trans2, which = "items")
# plot dendrogram
hc1 <- hclust(glist, method = "complete" )
plot(hc1, cex = 0.8, hang = 2)
Ward.D2 Method:
5 different clusters.
hc2 <- hclust(glist, method = "ward.D2" )
plot(hc2, cex = 0.7)
rect.hclust(hc2, k = 5, border = 2:5)
hc3 <- hclust(glist, method = "complete")
hc4 <- hclust(glist, method = "ward.D2")
dend1 <- as.dendrogram (hc3)
dend2 <- as.dendrogram (hc4)
tanglegram(dend1, dend2)