HW10: Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
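For reference, the three measures requested above can be computed by hand. The sketch below uses five made-up baskets (not the Groceries data) and hypothetical helper names `support`, `confidence` and `lift`; it only illustrates the definitions and is not how arules computes them internally.

```r
# Toy baskets, invented for illustration (not taken from the Groceries data).
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("milk")
)

# support(X): share of baskets that contain every item in X.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# confidence(X => Y) = support(X and Y) / support(X)
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

# lift(X => Y) = confidence(X => Y) / support(Y); values above 1 mean Y is
# more likely in baskets that contain X than in baskets overall.
lift <- function(lhs, rhs) confidence(lhs, rhs) / support(rhs)

support(c("bread", "milk"))   # 2 of the 5 baskets
confidence("bread", "milk")   # 2 of the 3 bread baskets also hold milk
lift("bread", "milk")         # (2/3) / (4/5)
```

In this toy example the lift is below 1: milk is in 4 of 5 baskets overall but only 2 of the 3 bread baskets, so buying bread makes milk slightly less likely.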
Load the data and print a quick summary of the dataset, then read the same file as transactions with the “arules” library.
df <- read.csv("C:/Users/xmei/Desktop/GroceryDataSet.csv",header = FALSE)
summary(df)
## V1 V2 V3 V4
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V5 V6 V7 V8
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V9 V10 V11 V12
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V13 V14 V15 V16
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V17 V18 V19 V20
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V21 V22 V23 V24
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V25 V26 V27 V28
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V29 V30 V31 V32
## Length:9835 Length:9835 Length:9835 Length:9835
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
library(arules)  # read.transactions(), apriori(), inspect()
tr = read.transactions("C:/Users/xmei/Desktop/GroceryDataSet.csv", format = 'basket', sep=',')
summary(tr)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
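The density and the mean basket size in this summary are two views of the same count. As a quick check, the sketch below recomputes both from the figures printed above; the variable names are my own.

```r
# Figures copied from the summary(tr) output above.
n_transactions <- 9835
n_items        <- 169
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top 5 + (Other)

# Total number of item occurrences across all baskets.
total_items <- sum(item_counts)

# Density = filled cells / all cells of the 9835 x 169 item matrix.
density <- total_items / (n_transactions * n_items)

# Mean basket size = item occurrences per transaction.
mean_size <- total_items / n_transactions

round(density, 8)    # compare with the reported density of 0.02609146
round(mean_size, 3)  # compare with the reported mean of 4.409
```

Equivalently, the mean basket size is the density multiplied by the number of items.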
Plot the 20 most frequent items and their absolute counts in a bar chart. The top items include whole milk, other vegetables, rolls/buns…
library(RColorBrewer)  # brewer.pal()
itemFrequencyPlot(tr, topN = 20, type = "absolute",
                  col = brewer.pal(8, 'Pastel2'), main = "Absolute Item Frequency Plot")
Mine association rules with apriori, requiring a minimum support of 0.002 and a minimum confidence of 0.5, with at most 10 items per rule.
association.rules = arules::apriori(tr, parameter=list(supp=0.002, conf=0.5, maxlen=10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
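The line `Absolute minimum support count: 19` follows from the relative threshold. A small sketch of the arithmetic, assuming the printed value is the truncated product (which is consistent with the log above):

```r
n_transactions <- 9835   # from the apriori output above
min_support    <- 0.002  # supp passed to apriori()

# 0.002 * 9835 = 19.67; truncating gives the 19 printed in the log.
floor(min_support * n_transactions)

# A rule's count is a whole number, so reaching a relative support of
# 0.002 takes at least 20 transactions.
ceiling(min_support * n_transactions)
```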
Summary of the mined rules.
summary(association.rules)
## set of 1098 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 6 576 471 45
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.505 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.002034 Min. :0.5000 Min. :0.002440 Min. :1.957
## 1st Qu.:0.002237 1st Qu.:0.5263 1st Qu.:0.003864 1st Qu.:2.194
## Median :0.002644 Median :0.5676 Median :0.004677 Median :2.584
## Mean :0.003289 Mean :0.5845 Mean :0.005765 Mean :2.668
## 3rd Qu.:0.003559 3rd Qu.:0.6223 3rd Qu.:0.006304 3rd Qu.:2.899
## Max. :0.022267 Max. :0.8857 Max. :0.043416 Max. :7.154
## count
## Min. : 20.00
## 1st Qu.: 22.00
## Median : 26.00
## Mean : 32.35
## 3rd Qu.: 35.00
## Max. :219.00
##
## mining info:
## data ntransactions support confidence
## tr 9835 0.002 0.5
## call
## arules::apriori(data = tr, parameter = list(supp = 0.002, conf = 0.5, maxlen = 10))
The first 10 rules, in the order apriori returned them (they are not ranked by any measure).
inspect(association.rules[1:10])
## lhs rhs support
## [1] {cereals} => {whole milk} 0.003660397
## [2] {jam} => {whole milk} 0.002948653
## [3] {specialty cheese} => {other vegetables} 0.004270463
## [4] {rice} => {other vegetables} 0.003965430
## [5] {rice} => {whole milk} 0.004677173
## [6] {baking powder} => {whole milk} 0.009252669
## [7] {specialty cheese, yogurt} => {whole milk} 0.002033554
## [8] {specialty cheese, whole milk} => {yogurt} 0.002033554
## [9] {other vegetables, specialty cheese} => {whole milk} 0.002236909
## [10] {specialty cheese, whole milk} => {other vegetables} 0.002236909
## confidence coverage lift count
## [1] 0.6428571 0.005693950 2.515917 36
## [2] 0.5471698 0.005388917 2.141431 29
## [3] 0.5000000 0.008540925 2.584078 42
## [4] 0.5200000 0.007625826 2.687441 39
## [5] 0.6133333 0.007625826 2.400371 46
## [6] 0.5229885 0.017691917 2.046793 91
## [7] 0.7142857 0.002846975 2.795464 20
## [8] 0.5405405 0.003762074 3.874793 20
## [9] 0.5238095 0.004270463 2.050007 22
## [10] 0.5945946 0.003762074 3.072957 22
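The quality columns can be reproduced from raw counts. As an illustration, the sketch below rebuilds the measures for rule [1], {cereals} => {whole milk}. The count of 56 baskets containing cereals is not printed anywhere above; I infer it from coverage × 9835, so treat it as an assumption.

```r
n      <- 9835  # transactions
n_both <- 36    # baskets with cereals AND whole milk (count column)
n_lhs  <- 56    # baskets with cereals; inferred from coverage, an assumption
n_rhs  <- 2513  # baskets with whole milk (from summary(tr))

support    <- n_both / n                # P(lhs and rhs)
coverage   <- n_lhs / n                 # P(lhs)
confidence <- n_both / n_lhs            # P(rhs | lhs)
lift       <- confidence / (n_rhs / n)  # P(rhs | lhs) / P(rhs)

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 6)
```

A lift of about 2.5 means whole milk appears roughly two and a half times as often in baskets with cereals as in baskets overall.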
Top 10 rules by lift. Here I keep only the rules with confidence strictly greater than 50% and then list the 10 with the highest lift.
subRules = association.rules[quality(association.rules)$confidence > 0.5]
top10RulesByLift = head(subRules, n = 10, by = "lift")
inspect(top10RulesByLift)
## lhs rhs support confidence coverage lift count
## [1] {butter,
## hard cheese} => {whipped/sour cream} 0.002033554 0.5128205 0.003965430 7.154028 20
## [2] {beef,
## citrus fruit,
## other vegetables} => {root vegetables} 0.002135231 0.6363636 0.003355363 5.838280 21
## [3] {citrus fruit,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.003152008 0.6326531 0.004982206 5.804238 31
## [4] {citrus fruit,
## frozen vegetables,
## other vegetables} => {root vegetables} 0.002033554 0.6250000 0.003253686 5.734025 20
## [5] {beef,
## other vegetables,
## tropical fruit} => {root vegetables} 0.002745297 0.6136364 0.004473818 5.629770 27
## [6] {bottled water,
## root vegetables,
## yogurt} => {tropical fruit} 0.002236909 0.5789474 0.003863752 5.517391 22
## [7] {herbs,
## other vegetables,
## whole milk} => {root vegetables} 0.002440264 0.6000000 0.004067107 5.504664 24
## [8] {grapes,
## pip fruit} => {tropical fruit} 0.002135231 0.5675676 0.003762074 5.408941 21
## [9] {herbs,
## yogurt} => {root vegetables} 0.002033554 0.5714286 0.003558719 5.242537 20
## [10] {beef,
## other vegetables,
## soda} => {root vegetables} 0.002033554 0.5714286 0.003558719 5.242537 20
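Note that the filter uses a strict `>`, so a rule whose confidence is exactly 0.5 is dropped even though apriori admitted it at `conf=0.5`. The base R sketch below mimics the filter-then-rank step on four rules copied from the first-10 listing earlier in this report; the data frame and its column names are my own stand-in for the rules object, not the arules API.

```r
# Four rules copied from the inspect() output of the first 10 rules
# (rules [3], [7], [8] and [10]).
rules_df <- data.frame(
  rule = c("{specialty cheese} => {other vegetables}",
           "{specialty cheese, yogurt} => {whole milk}",
           "{specialty cheese, whole milk} => {yogurt}",
           "{specialty cheese, whole milk} => {other vegetables}"),
  confidence = c(0.5000000, 0.7142857, 0.5405405, 0.5945946),
  lift       = c(2.584078, 2.795464, 3.874793, 3.072957),
  stringsAsFactors = FALSE
)

# Strict filter: the rule with confidence exactly 0.5 does not survive.
kept <- rules_df[rules_df$confidence > 0.5, ]

# Rank the survivors by lift, highest first, and keep at most 10.
top <- head(kept[order(kept$lift, decreasing = TRUE), ], n = 10)
top$rule
```

Using `>=` instead would keep the boundary rules; for this data set it would not change the top 10, since every rule listed above has a confidence above 0.5.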
Use a parallel coordinates plot to visualize how items flow from the left-hand side to the right-hand side of the top 10 rules.
library(arulesViz)  # plot() method for rules, including method = "paracoord"
plot(top10RulesByLift, method="paracoord")
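Cluster analysis on the transactions using k-means. The transactions are converted to a 9835 x 169 item indicator matrix, each column is standardized, and k-means is run with 5 centers and 25 random starts.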
tr_data = as(tr, "matrix")
norm_data = as.data.frame(scale(tr_data))
dim(norm_data)
## [1] 9835 169
set.seed(1234)
kmfit = kmeans(norm_data, centers=5, nstart = 25)
str(kmfit)
## List of 9
## $ cluster : int [1:9835] 4 4 4 4 4 1 4 4 4 4 ...
## $ centers : num [1:5, 1:169] 0.1246 -0.0598 0.3498 -0.0418 0.6703 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
## $ totss : num 1661946
## $ withinss : num [1:5] 823255 16544 9517 741302 16571
## $ tot.withinss: num 1607189
## $ betweenss : num 54757
## $ size : int [1:5] 2277 17 41 7477 23
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
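The `str(kmfit)` output already contains enough to judge the fit. The sketch below recomputes two simple diagnostics from the printed numbers: the share of the total sum of squares that lies between clusters, and the share of transactions in each cluster. Variable names are my own.

```r
# Values copied from str(kmfit) above.
totss     <- 1661946
betweenss <- 54757
sizes     <- c(2277, 17, 41, 7477, 23)

# Share of the total sum of squares captured between clusters, in percent.
explained <- betweenss / totss
round(100 * explained, 1)

# Share of transactions per cluster, in percent.
round(100 * sizes / sum(sizes), 1)

# After scale(), every column has variance 1, so the total sum of squares
# equals (rows - 1) * columns.
(9835 - 1) * 169 == totss
```

With only about 3% of the total sum of squares between clusters and two of the five clusters holding roughly 99% of the transactions, this solution does not appear to separate the baskets well.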
To assess the quality and separation of the identified clusters, plot the normalized data (norm_data) projected onto its first two principal components.
library(ggplot2)  # theme_minimal(), theme(), element_text()
factoextra::fviz_cluster(kmfit, data = norm_data,
                         palette = "Set2",
                         ellipse.type = "convex",
                         repel = TRUE,
                         star.plot = TRUE,
                         ggtheme = theme_minimal(),
                         main = "K-means Clustering Results",
                         xlab = "Principal Component 1",
                         ylab = "Principal Component 2") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        legend.position = "right",
        legend.title = element_text(face = "bold"))
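For data with more than two columns, `fviz_cluster` draws the points on the first two principal components. A minimal base R sketch of that projection, on a small made-up 0/1 matrix rather than the grocery data, might look like this (the toy matrix and all variable names are assumptions for illustration):

```r
# Toy 0/1 purchase matrix: 8 baskets x 4 items, invented for illustration.
toy <- matrix(c(1, 1, 0, 0,
                1, 0, 0, 0,
                1, 1, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 1,
                0, 1, 1, 1,
                0, 0, 0, 1,
                0, 0, 1, 0),
              ncol = 4, byrow = TRUE,
              dimnames = list(NULL, c("milk", "bread", "beer", "chips")))

# Same normalization step as for norm_data: center and scale each column.
toy_scaled <- scale(toy)

set.seed(1234)
toy_km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Project the scaled data onto its first two principal components.
pca    <- prcomp(toy_scaled)
coords <- pca$x[, 1:2]

# Coordinates and cluster labels that a 2-D cluster plot would draw.
data.frame(coords, cluster = toy_km$cluster)
```

Calling `plot(coords, col = toy_km$cluster, pch = 19)` on the result gives a plain base R version of the picture.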