Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased, that is, what went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, one per line, with the items purchased. Each line is called a transaction, and each column in a row holds one item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
#install.packages("arules")
library("arules")
# Load and read the data
# Note: no sep argument is given, so each line is split on whitespace, not on commas
Data <- read.transactions("GroceryDataSet.csv")
## Warning in asMethod(object): removing duplicated items in transactions
Data
## transactions in sparse format with
## 9835 transactions (rows) and
## 8219 items (columns)
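The object holds 9835 transactions (rows) and 8219 items (columns). The item count is inflated: the file was read without a comma separator, so labels such as "(appetizer),,,," in the summary below are whitespace-delimited fragments of lines rather than individual products. For comparison, the Groceries data set bundled with arules has 169 items.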
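The effect of the separator can be checked on a tiny file. The following is a minimal sketch, not the code used for this report: the two baskets and the temporary file are made up for illustration, and it assumes the CSV is in basket format with comma-separated fields.

```r
library(arules)

# Two hypothetical baskets with multi-word item names, written to a temp file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("whole milk,citrus fruit", "whole milk,bottled water"), tmp)

# Default sep: fields are split on whitespace, so item names are broken apart.
by_space <- read.transactions(tmp, format = "basket")

# sep = ",": each comma-separated field becomes one item.
by_comma <- read.transactions(tmp, format = "basket", sep = ",")

itemLabels(by_space)  # fragments of the item names
itemLabels(by_comma)  # the item names as written
```

Reading GroceryDataSet.csv with sep = "," in the same way would presumably give clean item labels, but the outputs in this report were not regenerated with that setting.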
summary(Data)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 8219 columns (items) and a density of 0.0004422899
##
## most frequent items:
## vegetables,whole whole tropical other
## 940 717 482 460
## citrus (Other)
## 453 32700
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1380 2733 1774 1257 910 601 415 293 166 95 75 44 39 19 11 9
## 17 18 19 20 21 23
## 2 3 3 3 1 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.635 5.000 23.000
##
## includes extended item information - examples:
## labels
## 1 (appetizer),,,,,,,,,,,,,,,,,,,,,
## 2 (appetizer),,,,,,,,,,,,,,,,,,,,,,,
## 3 (appetizer),,,,,,,,,,,,,,,,,,,,,,,,
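The summary shows that the most frequent label is "vegetables,whole" with 940 occurrences, followed by "whole" with 717. These labels are fragments of comma-separated lines rather than product names, so the counts should be read with caution. Let’s see a visual using the item frequency plot.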
itemFrequencyPlot(Data, topN=20, type="relative")
The plot shows the relative frequency (the share of transactions) of the 20 most frequent items, which makes their distribution easier to compare than the raw counts in the summary.
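The numbers behind such a plot can also be pulled out directly. This is a small sketch that assumes the Groceries data set bundled with arules as a stand-in for the CSV; the variable name top_items is only illustrative.

```r
library(arules)

# Stand-in data: the Groceries transactions shipped with arules.
data("Groceries")

# Absolute counts per item, sorted so the most frequent items come first.
counts <- itemFrequency(Groceries, type = "absolute")
top_items <- head(sort(counts, decreasing = TRUE), 5)
top_items

# The same items as a share of all transactions.
top_items / length(Groceries)
```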
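Next we mine rules with the Apriori algorithm, which requires a minimum support and a minimum confidence. As a guide for the support threshold, we compute the relative support of an item set that appears in 42 (6 * 7) transactions.
min_support <- 6 * 7 / nrow(Data)
min_support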
## [1] 0.004270463
# Mine rules with Apriori: min support 0.004, min confidence 0.3, at most 3 items per rule
rules <- apriori(Data, parameter = list(supp = 0.004, conf = 0.3, maxlen=3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.004 1
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 39
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8219 item(s), 9835 transaction(s)] done [0.02s].
## sorting and recoding items ... [130 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## Warning in apriori(Data, parameter = list(supp = 0.004, conf = 0.3, maxlen
## = 3)): Mining stopped (maxlen reached). Only patterns up to a length of 3
## returned!
## done [0.00s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
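The relation between the relative support passed to apriori() and the absolute count in the log can be worked through by hand. The sketch below uses only figures printed above; that the count is obtained by truncating is an inference from the logged value of 39, not something the log states.

```r
n_transactions <- 9835   # number of transactions reported for Data

# Relative support of an item set seen in 42 transactions (the 6 * 7 above).
support_42 <- 42 / n_transactions
support_42

# Count implied by the threshold actually passed to apriori().
implied_count <- 0.004 * n_transactions
implied_count

# Truncating gives the "Absolute minimum support count" shown in the log.
floor(implied_count)
```

Because 0.004 is slightly below 42 / 9835, item sets with 39 to 41 occurrences also pass the support filter.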
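Inspect the rule set and the top 10 rules by lift. Lift compares how often the left- and right-hand sides occur together with how often they would if they were independent; values well above 1 indicate a strong positive association, not statistical significance. Because the item labels here are whitespace-split fragments, the strongest rules below mostly appear to reunite the pieces of multi-word item names, such as "shopping" and "bags".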
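For reference, the quality measures reported by arules can be written as follows for a rule X => Y over N transactions. The notation (T for the set of transactions, supp, conf) is introduced here for illustration and does not come from the assignment text.

$$
\begin{aligned}
\operatorname{supp}(X \Rightarrow Y) &= \frac{\lvert \{\, t \in T : X \cup Y \subseteq t \,\} \rvert}{N} \\
\operatorname{coverage}(X \Rightarrow Y) &= \operatorname{supp}(X) \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)} = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{aligned}
$$

The count column is the numerator of the support, that is, the number of transactions containing both sides of the rule.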
summary(rules)
## set of 35 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 28 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 2.0 2.0 2.2 2.0 3.0
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.004270 Min. :0.3000 Min. :0.004270 Min. : 3.139
## 1st Qu.:0.005084 1st Qu.:0.4126 1st Qu.:0.005948 1st Qu.: 7.115
## Median :0.007321 Median :0.6842 Median :0.014743 Median : 12.866
## Mean :0.011193 Mean :0.7004 Mean :0.019609 Mean : 26.182
## 3rd Qu.:0.014082 3rd Qu.:1.0000 3rd Qu.:0.035587 3rd Qu.: 26.726
## Max. :0.037417 Max. :1.0000 Max. :0.046772 Max. :196.700
## count
## Min. : 42.0
## 1st Qu.: 50.0
## Median : 72.0
## Mean :110.1
## 3rd Qu.:138.5
## Max. :368.0
##
## mining info:
## data ntransactions support confidence
## Data 9835 0.004 0.3
## call
## apriori(data = Data, parameter = list(supp = 0.004, conf = 0.3, maxlen = 3))
inspect(sort(rules, by = 'lift')[1:10])
## lhs
## [1] {bags,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} =>
## [2] {shopping} =>
## [3] {water,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} =>
## [4] {juice,long} =>
## [5] {juice,long} =>
## [6] {bakery} =>
## [7] {life} =>
## [8] {bakery, juice,long} =>
## [9] {juice,long, life} =>
## [10] {bakery, vegetables,whole} =>
## rhs support confidence coverage
## [1] {shopping} 0.004880529 0.96 0.005083884
## [2] {bags,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} 0.004880529 1.00 0.004880529
## [3] {bottled} 0.006812405 1.00 0.006812405
## [4] {bakery} 0.004270463 1.00 0.004270463
## [5] {life} 0.004270463 1.00 0.004270463
## [6] {life} 0.037417387 1.00 0.037417387
## [7] {bakery} 0.037417387 1.00 0.037417387
## [8] {life} 0.004270463 1.00 0.004270463
## [9] {bakery} 0.004270463 1.00 0.004270463
## [10] {life} 0.006609049 1.00 0.006609049
## lift count
## [1] 196.70000 48
## [2] 196.70000 48
## [3] 28.26149 67
## [4] 26.72554 42
## [5] 26.72554 42
## [6] 26.72554 368
## [7] 26.72554 368
## [8] 26.72554 42
## [9] 26.72554 42
## [10] 26.72554 65
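The top lift value can be reproduced by hand from the table. This sketch uses rule [1]; the counts of 50 and 48 for the two sides are back-derived from the coverage columns of rules [1] and [2] and are not printed directly in the output.

```r
# Figures for rule [1] of the lift table: {bags,...} => {shopping}
n          <- 9835   # transactions in Data
count_both <- 48     # "count" column: transactions containing lhs and rhs
count_lhs  <- 50     # back-derived from coverage of rule [1]: 0.005083884 * 9835
count_rhs  <- 48     # back-derived from coverage of rule [2]: 0.004880529 * 9835

support    <- count_both / n           # share of transactions with both sides
confidence <- count_both / count_lhs   # share of lhs transactions that also have rhs
lift       <- confidence / (count_rhs / n)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

The lift is large mainly because the right-hand side is rare: dividing a confidence near 1 by a support of about 0.005 gives a value near 200.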
# top 10 by confidence
inspect(head(rules, n = 10, by = "confidence"))
## lhs
## [1] {shopping} =>
## [2] {water,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} =>
## [3] {juice,long} =>
## [4] {juice,long} =>
## [5] {cream,cream} =>
## [6] {milk,cream} =>
## [7] {,frozen} =>
## [8] {cheese,cream} =>
## [9] {bakery} =>
## [10] {life} =>
## rhs support confidence coverage
## [1] {bags,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} 0.004880529 1 0.004880529
## [2] {bottled} 0.006812405 1 0.006812405
## [3] {bakery} 0.004270463 1 0.004270463
## [4] {life} 0.004270463 1 0.004270463
## [5] {cheese} 0.004473818 1 0.004473818
## [6] {cheese} 0.006914082 1 0.006914082
## [7] {cheese} 0.004677173 1 0.004677173
## [8] {cheese} 0.005287239 1 0.005287239
## [9] {life} 0.037417387 1 0.037417387
## [10] {bakery} 0.037417387 1 0.037417387
## lift count
## [1] 196.70000 48
## [2] 28.26149 67
## [3] 26.72554 42
## [4] 26.72554 42
## [5] 25.21795 44
## [6] 25.21795 68
## [7] 25.21795 46
## [8] 25.21795 52
## [9] 26.72554 368
## [10] 26.72554 368
Finally we run a simple cluster analysis, focusing on the items in the transaction data that have greater than 4% support. We follow the logic and code example given in the help file for the dissimilarity function in the arules library.
# Follow the example in the help file for dissimilarity() in the arules package
# Cluster analysis on items with support > 4%
Data1 <- Data[ , itemFrequency(Data) > 0.04]
d_jaccard <- dissimilarity(Data1, which = "items")
# plot dendrogram
plot(hclust(d_jaccard, method = "ward.D2"),
main = "Dendrogram for items", sub = "", xlab = "")
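To turn the dendrogram into explicit groups, the tree can be cut at a chosen number of clusters. This is a minimal sketch rather than part of the original analysis: it uses the Groceries data set bundled with arules as a stand-in for the CSV, and the choice of k = 4 is arbitrary.

```r
library(arules)

# Stand-in data: the Groceries transactions shipped with arules.
data("Groceries")

# Keep items with support above 4%, as in the analysis above.
frequent <- Groceries[, itemFrequency(Groceries) > 0.04]

# Jaccard dissimilarity between items (the default method), then Ward clustering.
d_items <- dissimilarity(frequent, which = "items")
hc <- hclust(d_items, method = "ward.D2")

# Cut the tree into k groups; k = 4 is an illustrative choice.
groups <- cutree(hc, k = 4)

# List the items that fall into each cluster.
split(names(groups), groups)
```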