In this assignment we work with the Grocery dataset and apply market basket analysis and cluster analysis to the data. The dataset includes:
path <- "C:/Users/Kevin/Google Drive/Documents/CUNY MSDS/DATA624/Week13/"
df <- read_csv(paste0(path, "GroceryDataSet.csv"),
col_names = FALSE,
col_types = cols(.default = col_character()))
#str(df)
#glimpse(df)
We use the arules and arulesViz libraries to perform and visualize a market basket analysis of the data. From the item frequency plot below, we see that the 10 top items represented in the transaction data include:
# load & review data
trans <- read.transactions(paste0(path, "GroceryDataSet.csv"),
format = "basket",
header = FALSE,
sep = ",")
summary(trans)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
#inspect(trans)
# top items
#itemFrequency(trans)
itemFrequencyPlot(trans, type = "relative", topN = 25, horiz = TRUE,
main = "Relative item frequency: Top 25 items")
Now we use the apriori function to find association rules in the transaction data. After some experimentation, we set the parameters as follows in order to generate a reasonably large set of rules:
The result of running the Apriori mining algorithm with these parameters is a set of 232 association rules, virtually all based on a total of 2 or 3 items.
# find rules: vary conf 0.5 = 15 rules, 0.4 = 62 rules, 0.3 = 125 rules, 0.2 = 232
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [232 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 232 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3
## 1 151 80
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.341 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01007 Min. :0.2006 Min. :0.8991 Min. : 99.0
## 1st Qu.:0.01200 1st Qu.:0.2470 1st Qu.:1.4432 1st Qu.: 118.0
## Median :0.01490 Median :0.3170 Median :1.7277 Median : 146.5
## Mean :0.02005 Mean :0.3321 Mean :1.7890 Mean : 197.2
## 3rd Qu.:0.02227 3rd Qu.:0.4033 3rd Qu.:2.0762 3rd Qu.: 219.0
## Max. :0.25552 Max. :0.5862 Max. :3.2950 Max. :2513.0
##
## mining info:
## data ntransactions support confidence
## trans 9835 0.01 0.2
#inspect(rules)
The top 15 rules based on lift are shown in the table below. We see that rules associated with “root vegetables” or “other vegetables” as consequent (RHS) dominate the top 10 rules, with lift ratios ranging from 2.7x to 3.3x.
# top 15 by lift
inspect(head(rules, n = 15, by = "lift"))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## other vegetables} => {root vegetables} 0.01037112 0.3591549 3.295045 102
## [2] {other vegetables,
## yogurt} => {whipped/sour cream} 0.01016777 0.2341920 3.267062 100
## [3] {other vegetables,
## tropical fruit} => {root vegetables} 0.01230300 0.3427762 3.144780 121
## [4] {beef} => {root vegetables} 0.01738688 0.3313953 3.040367 171
## [5] {citrus fruit,
## root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608 102
## [6] {root vegetables,
## tropical fruit} => {other vegetables} 0.01230300 0.5845411 3.020999 121
## [7] {other vegetables,
## whole milk} => {root vegetables} 0.02318251 0.3097826 2.842082 228
## [8] {curd,
## whole milk} => {yogurt} 0.01006609 0.3852140 2.761356 99
## [9] {other vegetables,
## yogurt} => {root vegetables} 0.01291307 0.2974239 2.728698 127
## [10] {other vegetables,
## yogurt} => {tropical fruit} 0.01230300 0.2833724 2.700550 121
## [11] {other vegetables,
## root vegetables} => {citrus fruit} 0.01037112 0.2188841 2.644626 102
## [12] {other vegetables,
## rolls/buns} => {root vegetables} 0.01220132 0.2863962 2.627525 120
## [13] {tropical fruit,
## whole milk} => {root vegetables} 0.01199797 0.2836538 2.602365 118
## [14] {rolls/buns,
## root vegetables} => {other vegetables} 0.01220132 0.5020921 2.594890 120
## [15] {root vegetables,
## yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 127
For comparison we also show the top 15 rules based on confidence in the table below, where the confidence levels range from 0.5 to almost 0.6. In contrast to the top rules by lift, we see that the top rules by confidence are dominated by rules associated with “whole milk” as consequent (RHS). Interestingly, the top 2 rules by confidence are identical to rules #5 and #6 by lift, whereas there is no other overlap between the two tables:
# top 15 by confidence
inspect(head(rules, n = 15, by = "confidence"))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608 102
## [2] {root vegetables,
## tropical fruit} => {other vegetables} 0.01230300 0.5845411 3.020999 121
## [3] {curd,
## yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 99
## [4] {butter,
## other vegetables} => {whole milk} 0.01148958 0.5736041 2.244885 113
## [5] {root vegetables,
## tropical fruit} => {whole milk} 0.01199797 0.5700483 2.230969 118
## [6] {root vegetables,
## yogurt} => {whole milk} 0.01453991 0.5629921 2.203354 143
## [7] {domestic eggs,
## other vegetables} => {whole milk} 0.01230300 0.5525114 2.162336 121
## [8] {whipped/sour cream,
## yogurt} => {whole milk} 0.01087951 0.5245098 2.052747 107
## [9] {rolls/buns,
## root vegetables} => {whole milk} 0.01270971 0.5230126 2.046888 125
## [10] {other vegetables,
## pip fruit} => {whole milk} 0.01352313 0.5175097 2.025351 133
## [11] {tropical fruit,
## yogurt} => {whole milk} 0.01514997 0.5173611 2.024770 149
## [12] {other vegetables,
## yogurt} => {whole milk} 0.02226741 0.5128806 2.007235 219
## [13] {other vegetables,
## whipped/sour cream} => {whole milk} 0.01464159 0.5070423 1.984385 144
## [14] {rolls/buns,
## root vegetables} => {other vegetables} 0.01220132 0.5020921 2.594890 120
## [15] {root vegetables,
## yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 127
We show several visualizations of the association rules including:
# scatter plot of all rules
plot(rules, jitter = 3)
# graph of top 10 rules
head(rules, n = 10, by = "lift") %>%
plot(method = "graph", engine = "htmlwidget")
# parallel coord plot of top 10 rules
head(rules, n = 10, by = "lift") %>%
plot(method = "paracoord")
Finally we run a simple cluster analysis, focusing on the items in the transaction data that have greater than 4% support. We follow the logic and code example given in the help file for the dissimilarity function in the arules library. First we compute the distance between items using the dissimilarity function; the distance measure in this case defaults to the “jaccard” measure. Then we apply hierarchical clustering to the dissimilarity matrix with the hclust function, using the “ward.D2” method.
The results of the cluster analysis are displayed in the dendrogram below. We see that “root vegetables”, “other vegetables”, and “whole milk” form a prominent cluster, consistent with the association rules observed earlier.
# follow logic of "dissimilarity" help file from "arules" package
# cluster analysis on items with support > 5%
trans2 <- trans[ , itemFrequency(trans) > 0.04]
d_jaccard <- dissimilarity(trans2, which = "items")
# plot dendrogram
plot(hclust(d_jaccard, method = "ward.D2"),
main = "Dendrogram for items", sub = "", xlab = "")