Association Rule Mining (ARM) is a data mining technique that looks for associations between itemsets, which can then be used in further applications.
We will use the arules package for ARM and arulesViz for visualizing association rules.
The data set will be the built-in Groceries data.
setwd("D:/Class Materials & Work/Summer 2020 practice/ARM")
library(arules) #for ARM
library(arulesViz) #for ARM visualization
Load the data set and inspect it.
data(Groceries)
class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
Groceries
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
The data is a transactional data set with N = 9835 transactions and 169 items.
We can look at the data at the itemset level by using arules::inspect. Be careful to specify the number of rows to avoid flooding your output, and indicate whether you want to inspect from the top (head) or the bottom (tail).
inspect(head(Groceries, 2))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
The output above shows the first two transactions.
The eclat() function takes a transactions object and returns the frequent itemsets in the data, based on the minimum support you provide through the supp argument. The maxlen argument defines the maximum number of items in each frequent itemset.
frequentItems <- eclat (Groceries, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.07 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 688
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing ... [19 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
inspect(frequentItems)
## items support transIdenticalToItemsets count
## [1] {other vegetables,whole milk} 0.07483477 736 736
## [2] {whole milk} 0.25551601 2513 2513
## [3] {other vegetables} 0.19349263 1903 1903
## [4] {rolls/buns} 0.18393493 1809 1809
## [5] {yogurt} 0.13950178 1372 1372
## [6] {soda} 0.17437722 1715 1715
## [7] {root vegetables} 0.10899847 1072 1072
## [8] {tropical fruit} 0.10493137 1032 1032
## [9] {bottled water} 0.11052364 1087 1087
## [10] {sausage} 0.09395018 924 924
## [11] {shopping bags} 0.09852567 969 969
## [12] {citrus fruit} 0.08276563 814 814
## [13] {pastry} 0.08896797 875 875
## [14] {pip fruit} 0.07564820 744 744
## [15] {whipped/sour cream} 0.07168277 705 705
## [16] {fruit/vegetable juice} 0.07229283 711 711
## [17] {newspapers} 0.07981698 785 785
## [18] {bottled beer} 0.08052872 792 792
## [19] {canned beer} 0.07768175 764 764
We can plot item frequency with itemFrequencyPlot.
itemFrequencyPlot(Groceries, topN=10, type="absolute", main="Item Frequency")
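The numbers behind the plot can also be read off directly. The following is a small sketch, assuming the arules function itemFrequency() with type = "absolute"; the names item_counts and top5 are introduced here only for illustration.

```r
library(arules)
data(Groceries)

# Absolute item counts, sorted from most to least frequent.
item_counts <- sort(itemFrequency(Groceries, type = "absolute"), decreasing = TRUE)

# The five most frequent items, as a named numeric vector.
top5 <- head(item_counts, 5)
print(top5)
```

These counts should agree with the count column of the eclat output above, for example 2513 for whole milk.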
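Rule mining is driven by two thresholds, support and confidence, and the resulting rules are commonly ranked by lift as a measure of interestingness.
Support indicates how frequently an itemset appears in the data set. For example, among the first two transactions shown above, the support of the item citrus fruit is 1/2, as it appears in only one of the two transactions.
Confidence of a rule X => Y is the proportion of the transactions containing X that also contain Y, that is, support(X and Y) / support(X). Lift is the confidence divided by the support of Y; a lift above 1 means that X and Y occur together more often than would be expected if they were independent.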
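To make these measures concrete, here is a minimal sketch in base R on a made-up set of five transactions. The toy data and the helper functions support(), confidence() and lift() are assumptions written for this illustration; they are not part of arules.

```r
# Toy data (an assumption for illustration, not taken from Groceries):
# each element is one transaction, i.e. the set of items bought together.
toy_transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk")
)

# Support: share of transactions that contain every item in `items`.
support <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

# Confidence of lhs => rhs: support(lhs and rhs) / support(lhs).
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

# Lift of lhs => rhs: confidence(lhs => rhs) / support(rhs).
lift <- function(lhs, rhs, transactions) {
  confidence(lhs, rhs, transactions) / support(rhs, transactions)
}

# Measures for the rule {bread} => {milk}.
supp_bm <- support(c("bread", "milk"), toy_transactions)
conf_bm <- confidence("bread", "milk", toy_transactions)
lift_bm <- lift("bread", "milk", toy_transactions)
print(c(support = supp_bm, confidence = conf_bm, lift = lift_bm))
```

In this toy set, bread and milk appear together in 3 of 5 transactions (support 0.6), and 3 of the 4 bread transactions also contain milk (confidence 0.75). Because milk itself is in 4 of 5 transactions, the lift is 0.75 / 0.8 = 0.9375, slightly below 1.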
Let's find the rules using the apriori algorithm.
grocery_rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(grocery_rules)
## set of 15 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01007 Min. :0.5000 Min. :0.01729 Min. :1.984
## 1st Qu.:0.01174 1st Qu.:0.5151 1st Qu.:0.02089 1st Qu.:2.036
## Median :0.01230 Median :0.5245 Median :0.02430 Median :2.203
## Mean :0.01316 Mean :0.5411 Mean :0.02454 Mean :2.299
## 3rd Qu.:0.01403 3rd Qu.:0.5718 3rd Qu.:0.02598 3rd Qu.:2.432
## Max. :0.02227 Max. :0.5862 Max. :0.04342 Max. :3.030
## count
## Min. : 99.0
## 1st Qu.:115.5
## Median :121.0
## Mean :129.4
## 3rd Qu.:138.0
## Max. :219.0
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.01 0.5
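The Apriori algorithm generated 15 rules with the given constraints (parameters). The Parameter specification section of the output echoes the settings used: besides the support (0.01) and confidence (0.5) we supplied, it shows defaults such as minlen = 1, maxlen = 10 and target = rules.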
We can inspect the top three rules sorted by confidence.
inspect(head(sort(grocery_rules, by = "confidence"), 3))
## lhs rhs support
## [1] {citrus fruit,root vegetables} => {other vegetables} 0.01037112
## [2] {tropical fruit,root vegetables} => {other vegetables} 0.01230300
## [3] {curd,yogurt} => {whole milk} 0.01006609
## confidence coverage lift count
## [1] 0.5862069 0.01769192 3.029608 102
## [2] 0.5845411 0.02104728 3.020999 121
## [3] 0.5823529 0.01728521 2.279125 99
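The same pattern works for other quality measures. As a sketch, assuming the arules methods sort() and subset() for rules, the rules can be ranked by lift or filtered on a lift threshold; the cut-off of 2.5 and the names top_by_lift and high_lift_rules are arbitrary choices made for this example.

```r
library(arules)
data(Groceries)

# Same thresholds as above; verbose = FALSE only silences the progress log.
grocery_rules <- apriori(Groceries,
                         parameter = list(support = 0.01, confidence = 0.5),
                         control = list(verbose = FALSE))

# Top three rules by lift instead of confidence.
top_by_lift <- head(sort(grocery_rules, by = "lift"), 3)
inspect(top_by_lift)

# Keep only rules whose lift exceeds a threshold (2.5 is an arbitrary choice).
high_lift_rules <- subset(grocery_rules, subset = lift > 2.5)
length(high_lift_rules)
```

Ranking by lift rather than confidence helps when the right-hand side is a very common item such as whole milk, since a high confidence can then partly reflect how popular the item is.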
Package arulesViz supports visualization of association rules with scatter plot, balloon plot, graph, parallel coordinates plot, etc.
# Scatter plot of the rules (the default plot method)
plot(grocery_rules)
#Graph plot for items
plot(grocery_rules, method="graph", control=list(verbose = FALSE))
#Parallel coordinate plot
plot(grocery_rules, method="paracoord", control=list(reorder=TRUE))
We can also restrict mining to the rules that are relevant for a particular question. For example, the appearance argument lets us keep only rules with whole milk on the right-hand side.
wholemilk_rules <- apriori(data=Groceries, parameter=list (supp=0.001,conf = 0.08),
appearance = list (rhs="whole milk"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.08 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [3765 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(wholemilk_rules)
## set of 3765 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6
## 1 134 1503 1792 325 10
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 3.62 4.00 6.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.1071 Min. :0.001017 Min. :0.4193
## 1st Qu.:0.001118 1st Qu.:0.4783 1st Qu.:0.001932 1st Qu.:1.8717
## Median :0.001423 Median :0.5702 Median :0.002644 Median :2.2317
## Mean :0.002348 Mean :0.5749 Mean :0.004952 Mean :2.2500
## 3rd Qu.:0.002237 3rd Qu.:0.6667 3rd Qu.:0.004372 3rd Qu.:2.6091
## Max. :0.255516 Max. :1.0000 Max. :1.000000 Max. :3.9136
## count
## Min. : 10.00
## 1st Qu.: 11.00
## Median : 14.00
## Mean : 23.09
## 3rd Qu.: 22.00
## Max. :2513.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.001 0.08
The above code generates rules that lead to buying “whole milk”, that is, it shows which products tend to be bought together with whole milk.
There are more than 3,000 rules (3765), which is too many to work with at once. You can limit the number of rules by tweaking a few parameters, depending on the data. The most common options are raising support or confidence and adjusting other parameters such as minlen and maxlen.
grocery_rules_increased_support <- apriori(Groceries, parameter = list(support = 0.02, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(grocery_rules_increased_support)
## set of 1 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02227 Min. :0.5129 Min. :0.04342 Min. :2.007
## 1st Qu.:0.02227 1st Qu.:0.5129 1st Qu.:0.04342 1st Qu.:2.007
## Median :0.02227 Median :0.5129 Median :0.04342 Median :2.007
## Mean :0.02227 Mean :0.5129 Mean :0.04342 Mean :2.007
## 3rd Qu.:0.02227 3rd Qu.:0.5129 3rd Qu.:0.04342 3rd Qu.:2.007
## Max. :0.02227 Max. :0.5129 Max. :0.04342 Max. :2.007
## count
## Min. :219
## 1st Qu.:219
## Median :219
## Mean :219
## 3rd Qu.:219
## Max. :219
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.02 0.5
We raised the minimum support from 0.01 to 0.02, and that leaves a single rule.
If you want stronger rules, increase the confidence. If you want lengthier rules, increase the maxlen parameter. If you want to eliminate shorter rules, increase the minlen parameter.
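As a sketch of how length constraints cut down the rule set, the whole-milk mining above can be rerun with minlen and maxlen set explicitly. The values 2 and 3 and the name short_milk_rules are assumptions chosen for illustration.

```r
library(arules)
data(Groceries)

# minlen = 2 drops rules with an empty left-hand side; maxlen = 3 caps rule length.
# apriori() may warn that mining stopped because maxlen was reached.
short_milk_rules <- apriori(Groceries,
                            parameter = list(supp = 0.001, conf = 0.08,
                                             minlen = 2, maxlen = 3),
                            appearance = list(rhs = "whole milk"),
                            control = list(verbose = FALSE))

# Rule length counts the items on both sides (lhs + rhs).
table(size(short_milk_rules))
```

Compared with the unrestricted run, only rules of length 2 and 3 should remain, which removes the bulk of the 3765 rules reported above.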
Sometimes you might be interested in the rules involving the maximum number of items and want to remove the shorter rules that are subsets of longer rules, since these are considered redundant.
subsets <- which(colSums(is.subset(wholemilk_rules, grocery_rules)) > 1) # indices of rules in grocery_rules that contain more than one whole-milk rule as a subset
length(subsets)
## [1] 11
grocery_rules_prunned <- grocery_rules[-subsets]
grocery_rules_prunned
## set of 4 rules
plot(grocery_rules_prunned, method="paracoord", control=list(reorder=TRUE))
We can see that 11 of the 15 rules are removed and 4 remain. The pruning can be adjusted based on the nature of the data. The lower the support threshold (0.0001, for instance), the more rules will be generated.
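A different way to prune, not the is.subset() approach used above, is the arules function is.redundant(). By default it flags a rule as redundant when a more general rule exists with the same or a higher confidence. A minimal sketch, with milk_rules and milk_rules_pruned as names made up for this example:

```r
library(arules)
data(Groceries)

# Re-mine the whole-milk rules with the same thresholds as above.
milk_rules <- apriori(Groceries,
                      parameter = list(supp = 0.001, conf = 0.08),
                      appearance = list(rhs = "whole milk"),
                      control = list(verbose = FALSE))

# Logical vector: TRUE where a more general rule has the same or higher confidence.
redundant <- is.redundant(milk_rules)

# Keep only the non-redundant rules.
milk_rules_pruned <- milk_rules[!redundant]

c(before = length(milk_rules), after = length(milk_rules_pruned))
```

Note that this keeps the more general (shorter) rule and drops the longer one, which is the opposite preference to keeping the rules with the most items, so the choice between the two approaches depends on what the rules will be used for.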