Market Basket and Clusters
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift. Extra credit: do a simple cluster analysis on the data as well.
8.1. Recreate the simulated data from Exercise 7.2:
#Load Libraries
library(tidyverse)
library(arules)
library(arulesViz)
library(RColorBrewer)
library(kableExtra)
#Read transactions
grocery<-read.transactions('GroceryDataSet.csv', format = 'basket', header=FALSE, sep=',')
#inspect(grocery)
summary(grocery)## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The top five most frequent items are whole milk, other vegetables, roll/buns, soda and yogurt.
#Use RColorBrewer To Plot the Top 10 Items
library(RColorBrewer)
itemFrequencyPlot(grocery,topN=10,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")#Create rules
asoc_groc <- apriori(grocery, parameter = list(support=0.001,conf=0.8))## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(asoc_groc)## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.8000 Min. :0.001017 Min. : 3.131
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:0.001220 1st Qu.: 3.312
## Median :0.001220 Median :0.8462 Median :0.001322 Median : 3.588
## Mean :0.001247 Mean :0.8663 Mean :0.001449 Mean : 3.951
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:0.001627 3rd Qu.: 4.341
## Max. :0.003152 Max. :1.0000 Max. :0.003559 Max. :11.235
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :12.27
## 3rd Qu.:13.00
## Max. :31.00
##
## mining info:
## data ntransactions support confidence
## grocery 9835 0.001 0.8
## call
## apriori(data = grocery, parameter = list(support = 0.001, conf = 0.8))
Parameter Specification: min_sup=0.001 and min_confidence=0.8 values in a rule.
Total number of rules: The set of 410 rules
Distribution of rule length: A length of 4 items has the most rules: 229 and length of 6 items have the lowest number of rules:12
#Dig into the top 10 rules to report support, confidence and lift
inspect(head(asoc_groc, n=10, by='support'))## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## root vegetables,
## tropical fruit,
## whole milk} => {other vegetables} 0.003152008 0.8857143 0.003558719 4.577509 31
## [2] {curd,
## domestic eggs,
## other vegetables} => {whole milk} 0.002846975 0.8235294 0.003457041 3.223005 28
## [3] {curd,
## hamburger meat} => {whole milk} 0.002541942 0.8064516 0.003152008 3.156169 25
## [4] {herbs,
## rolls/buns} => {whole milk} 0.002440264 0.8000000 0.003050330 3.130919 24
## [5] {herbs,
## tropical fruit} => {whole milk} 0.002338587 0.8214286 0.002846975 3.214783 23
## [6] {citrus fruit,
## other vegetables,
## root vegetables,
## yogurt} => {whole milk} 0.002338587 0.8214286 0.002846975 3.214783 23
## [7] {butter,
## other vegetables,
## pork} => {whole milk} 0.002236909 0.8461538 0.002643620 3.311549 22
## [8] {rolls/buns,
## root vegetables,
## tropical fruit,
## yogurt} => {whole milk} 0.002236909 0.8148148 0.002745297 3.188899 22
## [9] {grapes,
## tropical fruit,
## whole milk} => {other vegetables} 0.002033554 0.8000000 0.002541942 4.134524 20
## [10] {fruit/vegetable juice,
## other vegetables,
## root vegetables,
## yogurt} => {whole milk} 0.002033554 0.8333333 0.002440264 3.261374 20
inspect(head(asoc_groc, n=10, by='confidence'))## lhs rhs support confidence coverage lift count
## [1] {rice,
## sugar} => {whole milk} 0.001220132 1 0.001220132 3.913649 12
## [2] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
## [3] {butter,
## rice,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [4] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1 0.001728521 3.913649 17
## [5] {butter,
## domestic eggs,
## soft cheese} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [6] {citrus fruit,
## root vegetables,
## soft cheese} => {other vegetables} 0.001016777 1 0.001016777 5.168156 10
## [7] {butter,
## hygiene articles,
## pip fruit} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [8] {hygiene articles,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [9] {hygiene articles,
## pip fruit,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [10] {cream cheese,
## domestic eggs,
## sugar} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
inspect(head(asoc_groc, n=10, by='lift'))|> kable()|> kable_styling(bootstrap_options='striped',position='left')## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {citrus fruit,
## fruit/vegetable juice,
## other vegetables,
## soda} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [3] {oil,
## other vegetables,
## tropical fruit,
## whole milk,
## yogurt} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [4] {citrus fruit,
## fruit/vegetable juice,
## grapes} => {tropical fruit} 0.001118454 0.8461538 0.001321810 8.063879 11
## [5] {other vegetables,
## rice,
## whole milk,
## yogurt} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [6] {oil,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [7] {ham,
## other vegetables,
## pip fruit,
## yogurt} => {tropical fruit} 0.001016777 0.8333333 0.001220132 7.941699 10
## [8] {beef,
## citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.001016777 0.8333333 0.001220132 7.645367 10
## [9] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
## [10] {butter,
## sliced cheese,
## tropical fruit,
## whole milk} => {yogurt} 0.001016777 0.9090909 0.001118454 6.516698 10
#Also remove subset rules
subset.rules <- which(colSums(is.subset(asoc_groc, asoc_groc)) > 1)
length(subset.rules)## [1] 91
subset_asoc_groc <- asoc_groc[-subset.rules]#Plot All Rules
plot(asoc_groc, jitter=0, method='two-key plot')head(asoc_groc, n = 10, by = "confidence") %>%
plot(method = "graph", engine = "htmlwidget")#Perform Cluster analysis
library(cluster)
library(factoextra)
library(dendextend)
#Hierarchical cluster analysis to create cluster dendrograms.
t1 <- grocery[ , itemFrequency(grocery) > 0.08]
glist <- dissimilarity(t1, which = "items")
# plot dendrogram with complete
hc <- hclust(glist, method = "complete" )
plot(hc, cex = 0.8, hang = 1)
rect.hclust(hc, k = 4, border = 2:5)