Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like.
Market Basket Analysis
First we read in the transaction of the GroceryDataset.
transactions <- read.transactions("https://raw.githubusercontent.com/Luz917/data624hw10/main/GroceryDataSet.csv", format = 'basket', sep=',')
transactions## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
Summary of transactions.
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Top 10 Transactions
Here we can see the Top 10 items of the Grocery List.
arules package
We use the apriori function to get the rules.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Lets take a look at the rules.
- There are a set of 37 rules
## set of 37 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 32 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.135 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02003 Min. :0.3079 Min. :0.04342 Min. :1.205
## 1st Qu.:0.02318 1st Qu.:0.3483 1st Qu.:0.05765 1st Qu.:1.514
## Median :0.02755 Median :0.3868 Median :0.07229 Median :1.756
## Mean :0.03134 Mean :0.3915 Mean :0.08227 Mean :1.730
## 3rd Qu.:0.03325 3rd Qu.:0.4249 3rd Qu.:0.09395 3rd Qu.:1.915
## Max. :0.07483 Max. :0.5129 Max. :0.19349 Max. :2.842
## count
## Min. :197.0
## 1st Qu.:228.0
## Median :271.0
## Mean :308.3
## 3rd Qu.:327.0
## Max. :736.0
##
## mining info:
## data ntransactions support confidence
## transactions 9835 0.02 0.3
A look the list sorted by lift.
## lhs rhs support
## [1] {other vegetables,whole milk} => {root vegetables} 0.02318251
## [2] {root vegetables,whole milk} => {other vegetables} 0.02318251
## [3] {root vegetables} => {other vegetables} 0.04738180
## [4] {whipped/sour cream} => {other vegetables} 0.02887646
## [5] {whole milk,yogurt} => {other vegetables} 0.02226741
## [6] {other vegetables,yogurt} => {whole milk} 0.02226741
## [7] {butter} => {whole milk} 0.02755465
## [8] {pork} => {other vegetables} 0.02165735
## [9] {curd} => {whole milk} 0.02613116
## [10] {other vegetables,root vegetables} => {whole milk} 0.02318251
## confidence coverage lift count
## [1] 0.3097826 0.07483477 2.842082 228
## [2] 0.4740125 0.04890696 2.449770 228
## [3] 0.4347015 0.10899847 2.246605 466
## [4] 0.4028369 0.07168277 2.081924 284
## [5] 0.3974592 0.05602440 2.054131 219
## [6] 0.5128806 0.04341637 2.007235 219
## [7] 0.4972477 0.05541434 1.946053 271
## [8] 0.3756614 0.05765125 1.941476 213
## [9] 0.4904580 0.05327911 1.919481 257
## [10] 0.4892704 0.04738180 1.914833 228
A look at the list sorted by the confidence.
## lhs rhs support
## [1] {other vegetables,yogurt} => {whole milk} 0.02226741
## [2] {butter} => {whole milk} 0.02755465
## [3] {curd} => {whole milk} 0.02613116
## [4] {other vegetables,root vegetables} => {whole milk} 0.02318251
## [5] {root vegetables,whole milk} => {other vegetables} 0.02318251
## [6] {domestic eggs} => {whole milk} 0.02999492
## [7] {whipped/sour cream} => {whole milk} 0.03223183
## [8] {root vegetables} => {whole milk} 0.04890696
## [9] {root vegetables} => {other vegetables} 0.04738180
## [10] {frozen vegetables} => {whole milk} 0.02043721
## confidence coverage lift count
## [1] 0.5128806 0.04341637 2.007235 219
## [2] 0.4972477 0.05541434 1.946053 271
## [3] 0.4904580 0.05327911 1.919481 257
## [4] 0.4892704 0.04738180 1.914833 228
## [5] 0.4740125 0.04890696 2.449770 228
## [6] 0.4727564 0.06344687 1.850203 295
## [7] 0.4496454 0.07168277 1.759754 317
## [8] 0.4486940 0.10899847 1.756031 481
## [9] 0.4347015 0.10899847 2.246605 466
## [10] 0.4249471 0.04809354 1.663094 201
A plot of the 37 rules. You can see that the points are scattered.
Another plot of the 10 rules using the paracoord method.
Cluster analysis
Using the hierarchical cluster analysis to cluster the transactions. The hierarchical clusters creates cluster dedograms and it can include different methods in the hclust function. We will look at the complete method and the ward method.
Complete Method
trans2 <- transactions[ , itemFrequency(transactions) > 0.05]
glist <- dissimilarity(trans2, which = "items")
# plot dendrogram
hc1 <- hclust(glist, method = "complete" )
plot(hc1, cex = 0.8, hang = 2)Ward Method
- Serperated into 5 different clusters with borders around them.
hc2 <- hclust(glist, method = "ward.D2" )
plot(hc2, cex = 0.7)
rect.hclust(hc2, k = 5, border = 2:5)Tanglegram compares the two dendrograms.
library(dendextend)
# Compute 2 hierarchical clusterings
hc3 <- hclust(glist, method = "complete")
hc4 <- hclust(glist, method = "ward.D2")
# Create two dendrograms
dend1 <- as.dendrogram (hc3)
dend2 <- as.dendrogram (hc4)
tanglegram(dend1, dend2)You can see that it splits the words if its too long and it connects.
References
https://stackoverflow.com/questions/49628304/create-interactive-paracoord-plot-for-association-rules