library("arules")
Loading required package: Matrix
Attaching package: 'arules'
The following objects are masked from 'package:base':
abbreviate, write
library(readr)

Association analysis identifies relations or correlations between observations and/or between variables in our datasets. These relationships are then expressed as a collection of “association rules”.
It is a core technique of data mining and is very useful for mining very large transactional databases, such as shopping baskets and online customer purchases.
Knowledge Representation: Association Rules
The general format is A -> C, where A is the antecedent and C is the consequent. A can be generalized to a specific variable or value combination, so the format applies to various datasets.
The basis of an association analysis algorithm is the generation of frequent itemsets.
The “apriori algorithm” is a generate-and-test type of search algorithm. Only after exploring all of the possible associations containing k items does it consider those containing k + 1 items. For each k, all candidates are tested to determine whether they have enough support.
A frequent itemset is a set of items that occur together frequently enough to be considered as a candidate for generating association rules.
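The level-wise, generate-and-test search described above can be sketched in a few lines of base R. This is only an illustrative sketch, not the implementation used by arules: the baskets, the `min_support` value, and the helper names `support_of` and `frequent_itemsets` are all hypothetical, and the usual subset-pruning optimization is left out because the test step filters the same candidates anyway.

```r
# Toy baskets (hypothetical data, not taken from the text).
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk")
)
min_support <- 0.6

# Proportion of baskets that contain every item in `itemset`.
support_of <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level-wise search: test all k-item candidates before building k + 1.
frequent_itemsets <- function(baskets, min_support) {
  candidates <- as.list(sort(unique(unlist(baskets))))  # k = 1
  found <- list()
  while (length(candidates) > 0) {
    # Test step: keep only candidates with enough support.
    keep <- Filter(function(s) support_of(s, baskets) >= min_support,
                   candidates)
    found <- c(found, keep)
    if (length(keep) < 2) break
    # Generate step: join pairs of frequent k-itemsets into (k + 1)-itemsets.
    k <- length(keep[[1]])
    pairs <- combn(length(keep), 2, simplify = FALSE)
    merged <- lapply(pairs,
                     function(p) sort(union(keep[[p[1]]], keep[[p[2]]])))
    candidates <- unique(Filter(function(s) length(s) == k + 1, merged))
  }
  found
}

result <- frequent_itemsets(baskets, min_support)
labels <- vapply(result, paste, character(1), collapse = ", ")
print(labels)
```

With these toy baskets every single item and every pair reaches the 0.6 threshold, while the three-item set appears in only two of five baskets and is rejected, so the search stops at k = 3.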
Support
“Support” is a measure of how frequently the items must appear in the whole dataset before they can be considered as a candidate association rule.
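“Support” for a collection of items is the proportion of all transactions in which the items appear together.
support(A -> C) = P(A U C)
Here A U C denotes the combined itemset, so P(A U C) is the proportion of transactions that contain every item of both A and C.
We use small values of “support” because we are not looking for the obvious associations.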
Confidence
The actual association rules that we retain are those that meet a “confidence” criterion.
“Confidence” calculates the proportion of transactions containing A that also contain C.
confidence(A -> C) = P(C|A) = P(A U C) / P(A)
confidence(A -> C) = support(A -> C) / support(A)
We are typically looking for larger values of confidence.
Lift
“Lift” is the increased likelihood of C being in a transaction if A is included in the transaction:
lift(A -> C) = confidence(A -> C) / support(C)
Leverage
“Leverage” captures the fact that a rule with a higher frequency of A and C but a lower lift may still be interesting:
leverage(A -> C) = support(A -> C) - support(A) * support(C)
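As a worked illustration of the four formulas above, the following base R sketch computes support, confidence, lift, and leverage for a single rule. The five baskets and the helper names `support` and `rule_measures` are assumptions made up for this example.

```r
# Hypothetical baskets, for illustration only.
baskets <- list(
  c("A", "C"),
  c("A", "C", "D"),
  c("A"),
  c("C"),
  c("D")
)

# Proportion of baskets containing every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Measures for the rule lhs -> rhs, following the definitions in the text.
rule_measures <- function(lhs, rhs) {
  s_rule <- support(union(lhs, rhs))
  conf <- s_rule / support(lhs)
  c(support    = s_rule,
    confidence = conf,
    lift       = conf / support(rhs),
    leverage   = s_rule - support(lhs) * support(rhs))
}

m <- rule_measures("A", "C")
print(round(m, 3))
```

Here A and C each appear in three of five baskets and together in two, so the rule has support 0.4, confidence 0.4 / 0.6, lift just above 1, and a small positive leverage of 0.4 - 0.36.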
Two types of association rules were identified corresponding to the type of data available.
The simplest case, known as “market basket analysis”, is when we have a transaction dataset that records just a transaction identifier alongside each item. The identifier might identify a single shopping basket containing multiple items, or a particular customer or patient and their associated purchases or medical treatments over time.
A simple example of a market basket dataset might record the purchases of DVDs by customers (three customers in this case):
| ID | Item |
|---|---|
| 1 | Sixth Sense |
| 1 | LOTR1 |
| 1 | Harry Potter1 |
| 1 | Green Mile |
| 1 | LOTR2 |
| 2 | Gladiator |
| 2 | Patriot |
| 2 | Braveheart |
| 3 | LOTR1 |
| 3 | LOTR2 |
When loading a dataset to process with apriori(), it must be converted into a transaction data structure. Consider a basket dataset with two columns, one being the identifier of the “basket” and the other being an item contained in the basket, as is the case for the dvdtrans.csv data.
library(readr)
dvdtrans <- read.csv("dvdtrans.csv")
str(dvdtrans)

'data.frame': 30 obs. of 2 variables:
$ ID : int 1 1 1 1 1 2 2 2 3 3 ...
$ Item: chr "Sixth Sense" "LOTR1" "Harry Potter1" "Green Mile" ...
dvdtrans

ID Item
1 1 Sixth Sense
2 1 LOTR1
3 1 Harry Potter1
4 1 Green Mile
5 1 LOTR2
6 2 Gladiator
7 2 Patriot
8 2 Braveheart
9 3 LOTR1
10 3 LOTR2
11 4 Gladiator
12 4 Patriot
13 4 Sixth Sense
14 5 Gladiator
15 5 Patriot
16 5 Sixth Sense
17 6 Gladiator
18 6 Patriot
19 6 Sixth Sense
20 7 Harry Potter1
21 7 Harry Potter2
22 8 Gladiator
23 8 Patriot
24 9 Gladiator
25 9 Patriot
26 9 Sixth Sense
27 10 Sixth Sense
28 10 LOTR
29 10 Gladiator
30 10 Green Mile
dvdDS <- new.env()
dvdDS$data <- as(split(dvdtrans$Item, dvdtrans$ID),
"transactions")
dvdDS$data

transactions in sparse format with
10 transactions (rows) and
10 items (columns)
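To make the conversion step more concrete, the sketch below shows what split() produces before the result is coerced to transactions. It uses only base R and only the first three baskets from the table in the text; the variable names `dvd_long` and `baskets` are chosen for this example.

```r
# First three baskets from the DVD table, in long (ID, Item) form.
dvd_long <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Item = c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2",
           "Gladiator", "Patriot", "Braveheart", "LOTR1", "LOTR2"),
  stringsAsFactors = FALSE
)

# split() groups the items by basket ID into a named list,
# one character vector per basket.
baskets <- split(dvd_long$Item, dvd_long$ID)
str(baskets)

# Number of items in each basket.
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

A named list of item vectors like this is the shape that the as(..., "transactions") call in the text receives.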
We can then build the model using the transformed dataset:
dvdAPRIORI <- new.env(parent=dvdDS)
evalq({
model <- apriori(data,
parameter=list(support=0.2,
confidence=0.1))
}, dvdAPRIORI)

Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.1 0.1 1 none FALSE TRUE 5 0.2 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 2
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [7 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [20 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
dvdAPRIORI$model

set of 20 rules
The rules can be ordered by confidence with sort() and displayed with inspect():
inspect(sort(dvdAPRIORI$model,
# limit display to first 5 rules
by="confidence")[1:5])

lhs rhs support confidence coverage
[1] {LOTR1} => {LOTR2} 0.2 1 0.2
[2] {LOTR2} => {LOTR1} 0.2 1 0.2
[3] {Green Mile} => {Sixth Sense} 0.2 1 0.2
[4] {Patriot} => {Gladiator} 0.6 1 0.6
[5] {Patriot, Sixth Sense} => {Gladiator} 0.4 1 0.4
lift count
[1] 5.000000 2
[2] 5.000000 2
[3] 1.666667 2
[4] 1.428571 6
[5] 1.428571 4
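The figures reported for rule [4], {Patriot} => {Gladiator}, can be checked by hand. The sketch below rebuilds the ten baskets from the dvdtrans listing and derives the measures from a co-occurrence matrix in base R. It is a cross-check under the definitions given earlier, not a description of how apriori() computes them, and the names `incidence`, `co`, and `pair_measures` are hypothetical.

```r
# The ten baskets from the dvdtrans listing, rebuilt so the block stands alone.
baskets <- list(
  `1`  = c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
  `2`  = c("Gladiator", "Patriot", "Braveheart"),
  `3`  = c("LOTR1", "LOTR2"),
  `4`  = c("Gladiator", "Patriot", "Sixth Sense"),
  `5`  = c("Gladiator", "Patriot", "Sixth Sense"),
  `6`  = c("Gladiator", "Patriot", "Sixth Sense"),
  `7`  = c("Harry Potter1", "Harry Potter2"),
  `8`  = c("Gladiator", "Patriot"),
  `9`  = c("Gladiator", "Patriot", "Sixth Sense"),
  `10` = c("Sixth Sense", "LOTR", "Gladiator", "Green Mile")
)

# 0/1 incidence matrix: one row per basket, one column per item.
items <- sort(unique(unlist(baskets)))
incidence <- t(vapply(baskets, function(b) as.numeric(items %in% b),
                      numeric(length(items))))
colnames(incidence) <- items

# co[i, j] counts the baskets containing both item i and item j;
# the diagonal holds the per-item counts.
co <- crossprod(incidence)
n <- nrow(incidence)

# Measures for a single-item rule lhs -> rhs.
pair_measures <- function(lhs, rhs) {
  s <- co[lhs, rhs] / n
  conf <- co[lhs, rhs] / co[lhs, lhs]
  c(support    = s,
    confidence = conf,
    lift       = conf / (co[rhs, rhs] / n),
    leverage   = s - (co[lhs, lhs] / n) * (co[rhs, rhs] / n))
}

patriot_gladiator <- pair_measures("Patriot", "Gladiator")
print(round(patriot_gladiator, 6))
```

Patriot appears in six baskets, always together with Gladiator, and Gladiator appears in seven, which gives the support of 0.6, confidence of 1, and lift of 1 / 0.7 shown in the output above.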
library(arulesViz)
plot(dvdAPRIORI$model, method = "graph", measure = "lift", shading = "confidence")
data(Groceries)
summary(Groceries)

transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
whole milk other vegetables rolls/buns soda
2513 1903 1809 1715
yogurt (Other)
1372 34055
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
17 18 19 20 21 22 23 24 26 27 28 29 32
29 14 14 9 11 4 6 1 1 1 1 3 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 4.409 6.000 32.000
includes extended item information - examples:
labels level2 level1
1 frankfurter sausage meat and sausage
2 sausage sausage meat and sausage
3 liver loaf sausage meat and sausage
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.5 0.1 1 none FALSE TRUE 5 0.01 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 98
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [15 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(rules)

lhs rhs support
[1] {curd, yogurt} => {whole milk} 0.01006609
[2] {other vegetables, butter} => {whole milk} 0.01148958
[3] {other vegetables, domestic eggs} => {whole milk} 0.01230300
[4] {yogurt, whipped/sour cream} => {whole milk} 0.01087951
[5] {other vegetables, whipped/sour cream} => {whole milk} 0.01464159
[6] {pip fruit, other vegetables} => {whole milk} 0.01352313
[7] {citrus fruit, root vegetables} => {other vegetables} 0.01037112
[8] {tropical fruit, root vegetables} => {other vegetables} 0.01230300
[9] {tropical fruit, root vegetables} => {whole milk} 0.01199797
[10] {tropical fruit, yogurt} => {whole milk} 0.01514997
[11] {root vegetables, yogurt} => {other vegetables} 0.01291307
[12] {root vegetables, yogurt} => {whole milk} 0.01453991
[13] {root vegetables, rolls/buns} => {other vegetables} 0.01220132
[14] {root vegetables, rolls/buns} => {whole milk} 0.01270971
[15] {other vegetables, yogurt} => {whole milk} 0.02226741
confidence coverage lift count
[1] 0.5823529 0.01728521 2.279125 99
[2] 0.5736041 0.02003050 2.244885 113
[3] 0.5525114 0.02226741 2.162336 121
[4] 0.5245098 0.02074225 2.052747 107
[5] 0.5070423 0.02887646 1.984385 144
[6] 0.5175097 0.02613116 2.025351 133
[7] 0.5862069 0.01769192 3.029608 102
[8] 0.5845411 0.02104728 3.020999 121
[9] 0.5700483 0.02104728 2.230969 118
[10] 0.5173611 0.02928317 2.024770 149
[11] 0.5000000 0.02582613 2.584078 127
[12] 0.5629921 0.02582613 2.203354 143
[13] 0.5020921 0.02430097 2.594890 120
[14] 0.5230126 0.02430097 2.046888 125
[15] 0.5128806 0.04341637 2.007235 219
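As a quick arithmetic check, the columns for rule [1], {curd, yogurt} => {whole milk}, can be tied back to the formulas using only numbers printed above. The sketch assumes that the coverage column is the support of the left-hand side, which is how arules reports it; the variable names are chosen for this example.

```r
# Values copied from the printed output in the text.
n_transactions   <- 9835        # rows reported by summary(Groceries)
whole_milk_count <- 2513        # "most frequent items" in summary(Groceries)
rule_count       <- 99          # count column, rule [1]
coverage         <- 0.01728521  # coverage column, rule [1]

# support = transactions matching the whole rule / all transactions
support <- rule_count / n_transactions

# confidence = support(rule) / support(lhs), with coverage as support(lhs)
confidence <- support / coverage

# lift = confidence / support(rhs)
lift <- confidence / (whole_milk_count / n_transactions)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

Up to rounding of the printed coverage, the results should land on the reported support of 0.01006609, confidence of 0.5823529, and lift of 2.279125.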
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules_sorted[1:10])

lhs rhs support
[1] {citrus fruit, root vegetables} => {other vegetables} 0.01037112
[2] {tropical fruit, root vegetables} => {other vegetables} 0.01230300
[3] {root vegetables, rolls/buns} => {other vegetables} 0.01220132
[4] {root vegetables, yogurt} => {other vegetables} 0.01291307
[5] {curd, yogurt} => {whole milk} 0.01006609
[6] {other vegetables, butter} => {whole milk} 0.01148958
[7] {tropical fruit, root vegetables} => {whole milk} 0.01199797
[8] {root vegetables, yogurt} => {whole milk} 0.01453991
[9] {other vegetables, domestic eggs} => {whole milk} 0.01230300
[10] {yogurt, whipped/sour cream} => {whole milk} 0.01087951
confidence coverage lift count
[1] 0.5862069 0.01769192 3.029608 102
[2] 0.5845411 0.02104728 3.020999 121
[3] 0.5020921 0.02430097 2.594890 120
[4] 0.5000000 0.02582613 2.584078 127
[5] 0.5823529 0.01728521 2.279125 99
[6] 0.5736041 0.02003050 2.244885 113
[7] 0.5700483 0.02104728 2.230969 118
[8] 0.5629921 0.02582613 2.203354 143
[9] 0.5525114 0.02226741 2.162336 121
[10] 0.5245098 0.02074225 2.052747 107
library(arulesViz)
# The deprecated interactive = FALSE argument is dropped; pass engine = "interactive" to request an interactive plot.
plot(rules_sorted, method = "graph", measure = "lift", shading = "confidence")