Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It aims to identify strong rules in a database using measures of interestingness such as support, confidence, and lift. Given transactions that each contain a variety of items, association rule mining looks for rules describing which items tend to occur together and how strongly they are connected.
We will use the SunBai dataset from the arules package to explore the association rule method.
library(arules)
library(arulesViz)
data(SunBai)
str(SunBai)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:18] 0 1 2 3 4 2 5 6 0 1 ...
## .. .. ..@ p : int [1:7] 0 5 8 10 11 15 18
## .. .. ..@ Dim : int [1:2] 8 6
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 8 obs. of 1 variable:
## .. ..$ labels: chr [1:8] "A" "B" "C" "D" ...
## ..@ itemsetInfo:'data.frame': 6 obs. of 2 variables:
## .. ..$ transactionID: num [1:6] 100 200 300 400 500 600
## .. ..$ weight : num [1:6] 0.518 0.436 0.232 0.148 0.544 ...
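As an aside, a transactions object like this can be built directly from a list of item vectors via the standard as() coercion; a minimal sketch with made-up baskets:
baskets <- list(c("A", "B", "C"),
                c("A", "C"),
                c("B", "D"))
trans <- as(baskets, "transactions")  # coerce the list to a transactions object
inspect(trans)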
summary(SunBai)
## transactions as itemMatrix in sparse format with
## 6 rows (elements/itemsets/transactions) and
## 8 columns (items) and a density of 0.375
##
## most frequent items:
## A C G B F (Other)
## 4 3 3 2 2 4
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5
## 1 1 2 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.25 3.00 3.00 3.75 5.00
##
## includes extended item information - examples:
## labels
## 1 A
## 2 B
## 3 C
##
## includes extended transaction information - examples:
## transactionID weight
## 1 100 0.5176528
## 2 200 0.4362571
## 3 300 0.2321374
With the inspect() function, you can view the individual transactions: the items in each one, together with its transaction ID and weight.
inspect(SunBai[1:5])
## items transactionID weight
## [1] {A, B, C, D, E} 100 0.5176528
## [2] {C, F, G} 200 0.4362571
## [3] {A, B} 300 0.2321374
## [4] {A} 400 0.1476262
## [5] {C, F, G, H} 500 0.5440458
itemFrequency(SunBai[,1:8])
## A B C D E F G H
## 0.6666667 0.3333333 0.5000000 0.1666667 0.1666667 0.3333333 0.5000000 0.3333333
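Under the hood, itemFrequency() simply reports the column support of the underlying binary item matrix; here is a quick hand-computed check using the standard coercion to a logical matrix:
m <- as(SunBai, "matrix")  # transactions as a logical matrix (rows = transactions, columns = items)
colSums(m) / nrow(m)       # should reproduce the itemFrequency() values above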
We can also plot the item frequencies; the support argument hides items whose support falls below the given threshold.
itemFrequencyPlot(SunBai, support=0.1)
The topN argument instead plots the most frequent items, ranked by support:
itemFrequencyPlot(SunBai, topN=20)
The next step is to train our model. We set the minimum support to 0.2 and the minimum confidence to 0.5 and run the Apriori algorithm on the data:
rules <- apriori(SunBai, parameter = list(support = 0.2, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 6 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [16 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
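Three of these 16 rules have an empty left-hand side, i.e. rules of the form {} => {X}, whose confidence is simply the support of X. From the item frequencies above, these are {} => {A}, {} => {C}, and {} => {G}, the only items with support of at least 0.5. A quick way to pick them out, using the standard lhs() and size() accessors:
empty_lhs <- rules[size(lhs(rules)) == 0]  # rules whose antecedent is empty
inspect(empty_lhs)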
Such rules carry no useful information, so we set minlen to 2 to avoid creating them:
rules <- apriori(SunBai, parameter = list(support = 0.2, confidence = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.2 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 6 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 13 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 10 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.231 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift count
## Min. :0.3333 Min. :0.5000 Min. :0.3333 Min. :1.333 Min. :2
## 1st Qu.:0.3333 1st Qu.:0.6667 1st Qu.:0.3333 1st Qu.:1.500 1st Qu.:2
## Median :0.3333 Median :1.0000 Median :0.3333 Median :2.000 Median :2
## Mean :0.3333 Mean :0.8333 Mean :0.4231 Mean :1.897 Mean :2
## 3rd Qu.:0.3333 3rd Qu.:1.0000 3rd Qu.:0.5000 3rd Qu.:2.000 3rd Qu.:2
## Max. :0.3333 Max. :1.0000 Max. :0.6667 Max. :3.000 Max. :2
##
## mining info:
## data ntransactions support confidence
## SunBai 6 0.2 0.5
## call
## apriori(data = SunBai, parameter = list(support = 0.2, confidence = 0.5, minlen = 2))
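The quality measures summarized above travel with the rules and can be extracted as a plain data frame with the quality() accessor, which is handy for custom filtering or plotting:
head(quality(rules))  # support, confidence, coverage, lift and count per rule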
Then we should also consider how to refine the rule set. The most useful rules are generally those with high support, confidence, and lift. The arules package provides a sort() function that reorders the rules by the measure named in its by parameter ("support", "confidence", or "lift"); by default the sorting is in descending order.
Lift measures how much more likely the right-hand-side item is to be purchased when the left-hand-side items are purchased, relative to its overall purchase rate: lift(A => B) = confidence(A => B) / support(B). A lift of 1 means the two sides are independent, while values well above 1 indicate a genuine association, so lift guards against rules that merely reflect generally popular items. Here we inspect the three rules with the highest lift:
inspect(head(sort(rules, by = "lift"), 3))
## lhs rhs support confidence coverage lift count
## [1] {C, G} => {F} 0.3333333 1.0000000 0.3333333 3 2
## [2] {F} => {G} 0.3333333 1.0000000 0.3333333 2 2
## [3] {G} => {F} 0.3333333 0.6666667 0.5000000 2 2
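As a sanity check, the lift of the top rule can be recomputed by hand: lift(A => B) equals confidence(A => B) divided by support(B), and for {C, G} => {F} the confidence is 1 while support(F) = 1/3, so the lift is 3.
# lift({C, G} => {F}) = confidence / support(F) = 1 / (1/3) = 3
1 / itemFrequency(SunBai)["F"]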
Next, we visualize the rules, first as a grouped matrix plot:
plot(rules, method = "grouped")
A scatter plot shows how support and confidence are distributed across the rules:
plot(rules, method='scatterplot')
In the graph visualization, vertices represent items or itemsets and edges represent the rules connecting them. The larger a circle, the higher the support; the darker its color, the higher the lift. With many rules, however, the graph quickly becomes cluttered and hard to read, so this plot is best suited to a small rule set, or to rules of interest picked out with the subset() function.
plot(rules, method='graph', shading = "lift", control = list(type='items'))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
sub_rules <- subset(rules, items %in% "C")
sub_rules
## set of 7 rules
inspect(sub_rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {F} => {C} 0.3333333 1.0000000 0.3333333 2.000000 2
## [2] {C} => {F} 0.3333333 0.6666667 0.5000000 2.000000 2
## [3] {G} => {C} 0.3333333 0.6666667 0.5000000 1.333333 2
## [4] {C} => {G} 0.3333333 0.6666667 0.5000000 1.333333 2
## [5] {F, G} => {C} 0.3333333 1.0000000 0.3333333 2.000000 2
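subset() conditions can also combine item constraints with quality thresholds; for example, rules involving C whose lift is above 1.5 (the 1.5 cutoff is arbitrary, chosen just for illustration):
inspect(subset(rules, items %in% "C" & lift > 1.5))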
If lift = 1, the left- and right-hand sides are statistically independent; if lift < 1, the presence of one makes the other less likely. As a rule of thumb, mined association rules are considered valuable when the lift exceeds 3, so for this dataset we did not find very strong association rules.
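Indeed, only a single rule even reaches a lift of 3, and none exceed it:
inspect(subset(rules, lift >= 3))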