Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It aims to identify strong rules in a database using measures of interestingness such as support, confidence, and lift. Given transactions that each contain a variety of items, association rule mining looks for rules describing which items tend to occur together and how strongly they are connected.
We will use the SunBai dataset from the arules package to explore the association rule method.
library(arules)
library(arulesViz)
data(SunBai)
str(SunBai)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:18] 0 1 2 3 4 2 5 6 0 1 ...
## .. .. ..@ p : int [1:7] 0 5 8 10 11 15 18
## .. .. ..@ Dim : int [1:2] 8 6
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 8 obs. of 1 variable:
## .. ..$ labels: chr [1:8] "A" "B" "C" "D" ...
## ..@ itemsetInfo:'data.frame': 6 obs. of 2 variables:
## .. ..$ transactionID: num [1:6] 100 200 300 400 500 600
## .. ..$ weight : num [1:6] 0.518 0.436 0.232 0.148 0.544 ...
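As an aside, a transactions object like this can be built directly from a list of item vectors via the standard as() coercion; a minimal sketch with made-up baskets:
baskets <- list(c("A", "B", "C"),
                c("A", "C"),
                c("B", "D"))
trans <- as(baskets, "transactions")  # coerce the list to a transactions object
inspect(trans)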
summary(SunBai)
## transactions as itemMatrix in sparse format with
## 6 rows (elements/itemsets/transactions) and
## 8 columns (items) and a density of 0.375
##
## most frequent items:
## A C G B F (Other)
## 4 3 3 2 2 4
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5
## 1 1 2 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.25 3.00 3.00 3.75 5.00
##
## includes extended item information - examples:
## labels
## 1 A
## 2 B
## 3 C
##
## includes extended transaction information - examples:
## transactionID weight
## 1 100 0.5176528
## 2 200 0.4362571
## 3 300 0.2321374
With the inspect() function, you can view the individual transactions: the items in each one, together with its transaction ID and weight.
inspect(SunBai[1:5])
## items transactionID weight
## [1] {A, B, C, D, E} 100 0.5176528
## [2] {C, F, G} 200 0.4362571
## [3] {A, B} 300 0.2321374
## [4] {A} 400 0.1476262
## [5] {C, F, G, H} 500 0.5440458
itemFrequency(SunBai[,1:8])
## A B C D E F G H
## 0.6666667 0.3333333 0.5000000 0.1666667 0.1666667 0.3333333 0.5000000 0.3333333
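Under the hood, itemFrequency() simply reports the column support of the underlying binary item matrix; here is a quick hand-computed check using the standard coercion to a logical matrix:
m <- as(SunBai, "matrix")  # transactions as a logical matrix (rows = transactions, columns = items)
colSums(m) / nrow(m)       # should reproduce the itemFrequency() values above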
We can also plot the item frequencies; the support argument hides items whose support falls below the given threshold.
itemFrequencyPlot(SunBai, support=0.1)
The topN argument instead plots the most frequent items, ranked by support:
itemFrequencyPlot(SunBai, topN=20)
The next step is to train our model. We set the minimum support to 0.2 and the minimum confidence to 0.5 and run the Apriori algorithm on the data:
rules <- apriori(SunBai, parameter = list(support = 0.2, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 6 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [16 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
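Three of these 16 rules have an empty left-hand side, i.e. rules of the form {} => {X}, whose confidence is simply the support of X. From the item frequencies above, these are {} => {A}, {} => {C}, and {} => {G}, the only items with support of at least 0.5. A quick way to pick them out, using the standard lhs() and size() accessors:
empty_lhs <- rules[size(lhs(rules)) == 0]  # rules whose antecedent is empty
inspect(empty_lhs)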
Such rules carry no useful information, so we set minlen to 2 to avoid creating them:
rules <- apriori(SunBai, parameter = list(support = 0.2, confidence = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.2 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 6 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 13 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 10 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.231 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift count
## Min. :0.3333 Min. :0.5000 Min. :0.3333 Min. :1.333 Min. :2
## 1st Qu.:0.3333 1st Qu.:0.6667 1st Qu.:0.3333 1st Qu.:1.500 1st Qu.:2
## Median :0.3333 Median :1.0000 Median :0.3333 Median :2.000 Median :2
## Mean :0.3333 Mean :0.8333 Mean :0.4231 Mean :1.897 Mean :2
## 3rd Qu.:0.3333 3rd Qu.:1.0000 3rd Qu.:0.5000 3rd Qu.:2.000 3rd Qu.:2
## Max. :0.3333 Max. :1.0000 Max. :0.6667 Max. :3.000 Max. :2
##
## mining info:
## data ntransactions support confidence
## SunBai 6 0.2 0.5
## call
## apriori(data = SunBai, parameter = list(support = 0.2, confidence = 0.5, minlen = 2))
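The quality measures summarized above travel with the rules and can be extracted as a plain data frame with the quality() accessor, which is handy for custom filtering or plotting:
head(quality(rules))  # support, confidence, coverage, lift and count per rule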
Then we should also consider how to refine the rule set. The most useful rules are generally those with high support, confidence, and lift. The arules package provides a sort() function that reorders the rules by the measure named in its by parameter ("support", "confidence", or "lift"); by default the sorting is in descending order.
Lift measures how much more likely the right-hand-side item is to be purchased when the left-hand-side items are purchased, relative to its overall purchase rate: lift(A => B) = confidence(A => B) / support(B). A lift of 1 means the two sides are independent, while values well above 1 indicate a genuine association, so lift guards against rules that merely reflect generally popular items. Here we inspect the three rules with the highest lift:
inspect(head(sort(rules, by = "lift"), 3))
## lhs rhs support confidence coverage lift count
## [1] {C, G} => {F} 0.3333333 1.0000000 0.3333333 3 2
## [2] {F} => {G} 0.3333333 1.0000000 0.3333333 2 2
## [3] {G} => {F} 0.3333333 0.6666667 0.5000000 2 2
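As a sanity check, the lift of the top rule can be recomputed by hand: lift(A => B) equals confidence(A => B) divided by support(B), and for {C, G} => {F} the confidence is 1 while support(F) = 1/3, so the lift is 3.
# lift({C, G} => {F}) = confidence / support(F) = 1 / (1/3) = 3
1 / itemFrequency(SunBai)["F"]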
Next, we visualize the rules, first as a grouped matrix plot:
plot(rules, method = "grouped")
A scatter plot shows how support and confidence are distributed across the rules:
plot(rules, method='scatterplot')
In the graph visualization, vertices represent items or itemsets and edges represent the rules connecting them. The larger a circle, the higher the support; the darker its color, the higher the lift. With many rules, however, the graph quickly becomes cluttered and hard to read, so this plot is best suited to a small rule set, or to rules of interest picked out with the subset() function.
plot(rules, method='graph', shading = "lift", control = list(type='items'))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
sub_rules <- subset(rules, items %in% "C")
sub_rules
## set of 7 rules
inspect(sub_rules[1:5])
## lhs rhs support confidence coverage lift count
## [1] {F} => {C} 0.3333333 1.0000000 0.3333333 2.000000 2
## [2] {C} => {F} 0.3333333 0.6666667 0.5000000 2.000000 2
## [3] {G} => {C} 0.3333333 0.6666667 0.5000000 1.333333 2
## [4] {C} => {G} 0.3333333 0.6666667 0.5000000 1.333333 2
## [5] {F, G} => {C} 0.3333333 1.0000000 0.3333333 2.000000 2
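subset() conditions can also combine item constraints with quality thresholds; for example, rules involving C whose lift is above 1.5 (the 1.5 cutoff is arbitrary, chosen just for illustration):
inspect(subset(rules, items %in% "C" & lift > 1.5))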
If lift = 1, the left- and right-hand sides are statistically independent; if lift < 1, the presence of one makes the other less likely. As a rule of thumb, mined association rules are considered valuable when the lift exceeds 3, so for this dataset we did not find very strong association rules.
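Indeed, only a single rule even reaches a lift of 3, and none exceed it:
inspect(subset(rules, lift >= 3))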