ASSOCIATION ANALYSIS

LEARNING OBJECTIVES

Students can analyze and interpret data and information and make appropriate decisions based on the association analysis approach (CPMK1, CPMK2, KUE, KKB).

- Affinity Analysis
- Apriori Algorithm in R Studio
- FP Growth in R Studio

##AFFINITY ANALYSIS

Affinity analysis is the study of attributes or characteristics that “go together”. Methods for affinity analysis, also known as market basket analysis, seek to uncover associations among these attributes; that is, they seek to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form “If antecedent, then consequent”, along with a measure of the support and confidence associated with the rule.

##APRIORI IN R STUDIO

The apriori() function generates the most relevant set of rules from given transaction data. It also reports the support, confidence and lift of those rules. These three measures can be used to judge the relative strength of the rules. So what do these terms mean?

###INSTALL PACKAGES

Install the following packages from the console:

- arules
- arulesViz
- grid (ships with base R, so it only needs to be loaded)

To compute these three metrics (support, confidence and lift), let's consider the rule A => B.
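For a rule A => B, the usual definitions are: support is the fraction of transactions that contain both A and B; confidence is that support divided by the support of A; lift is the confidence divided by the support of B. The following is a minimal base R sketch of these definitions on a toy data set. The baskets, the item names and the helper contains() are made up for illustration and are not part of arules or of the Groceries data.

```r
# Toy data: each element is one transaction (hypothetical baskets)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "eggs")
)

# Hypothetical helper: TRUE for every transaction that contains all given items
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

A <- "bread"   # antecedent (LHS)
B <- "milk"    # consequent (RHS)

support_A  <- mean(contains(A))         # share of baskets with A
support_B  <- mean(contains(B))         # share of baskets with B
support_AB <- mean(contains(c(A, B)))   # share of baskets with both A and B

confidence <- support_AB / support_A    # how often B appears when A does
lift       <- confidence / support_B    # confidence relative to B's baseline

print(c(support = support_AB, confidence = confidence, lift = lift))
```

In this toy data the lift comes out below 1, which would suggest that bread and milk appear together slightly less often than expected if they were independent.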

###EXAMPLE: Transactions data

Let's play with the Groceries data that comes with the arules package. Unlike with a data frame, calling head(Groceries) does not display the items in each transaction. To view the transactions, use the inspect() function instead.

Since association mining deals with transactions, the data has to be converted to an object of class transactions, which is provided in R by the arules package. This step is necessary because apriori() operates on objects of class transactions; input in any other form must first be coerced to that class.
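As an illustration of that conversion, the sketch below coerces a plain R list into a transactions object with as(). It assumes the arules package is installed; the baskets and item names are hypothetical and only serve as an example.

```r
library(arules)

# Hypothetical baskets: one character vector of item names per transaction
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk")
)

# Coerce the list to the 'transactions' class used by arules
trans <- as(baskets, "transactions")

class(trans)     # class of the converted object
inspect(trans)   # show the items in each transaction
```

The Groceries data used below already has this class, so no conversion is needed for it.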

###LOAD DATA

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
library(grid)
data(Groceries)
class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
inspect(head(Groceries, 3))
##     items                
## [1] {citrus fruit,       
##      semi-finished bread,
##      margarine,          
##      ready soups}        
## [2] {tropical fruit,     
##      yogurt,             
##      coffee}             
## [3] {whole milk}

###How to see the most frequent items?

The eclat() function takes a transactions object and returns the frequent itemsets in the data, based on the minimum support you pass to the supp argument. The maxlen argument defines the maximum number of items in each frequent itemset.

frequentItems <- eclat (Groceries, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.07      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 688 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing  ... [19 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
inspect(frequentItems)
##      items                         support    transIdenticalToItemsets count
## [1]  {other vegetables,whole milk} 0.07483477  736                      736 
## [2]  {whole milk}                  0.25551601 2513                     2513 
## [3]  {other vegetables}            0.19349263 1903                     1903 
## [4]  {rolls/buns}                  0.18393493 1809                     1809 
## [5]  {yogurt}                      0.13950178 1372                     1372 
## [6]  {soda}                        0.17437722 1715                     1715 
## [7]  {root vegetables}             0.10899847 1072                     1072 
## [8]  {tropical fruit}              0.10493137 1032                     1032 
## [9]  {bottled water}               0.11052364 1087                     1087 
## [10] {sausage}                     0.09395018  924                      924 
## [11] {shopping bags}               0.09852567  969                      969 
## [12] {citrus fruit}                0.08276563  814                      814 
## [13] {pastry}                      0.08896797  875                      875 
## [14] {pip fruit}                   0.07564820  744                      744 
## [15] {whipped/sour cream}          0.07168277  705                      705 
## [16] {fruit/vegetable juice}       0.07229283  711                      711 
## [17] {newspapers}                  0.07981698  785                      785 
## [18] {bottled beer}                0.08052872  792                      792 
## [19] {canned beer}                 0.07768175  764                      764
itemFrequencyPlot(Groceries, topN=10, type="absolute", main="Item Frequency") # plot frequent items
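As a cross-check on the eclat() output, the support of single items can also be read directly with itemFrequency() from arules. This is a minimal sketch that assumes the arules package is installed; the variable names item_support and item_count are illustrative.

```r
library(arules)
data(Groceries)

# Relative support of every single item, largest first
item_support <- sort(itemFrequency(Groceries, type = "relative"),
                     decreasing = TRUE)
head(item_support, 5)

# The same information as absolute transaction counts
item_count <- itemFrequency(Groceries, type = "absolute")

# Support is the count divided by the number of transactions
item_count[["whole milk"]] / length(Groceries)
```

The values for whole milk should agree with the support and count columns printed by inspect(frequentItems) above.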

###How to get the product recommendation rules?

rules <- apriori (Groceries, parameter = list(supp = 0.001, conf = 0.5)) # Min support as 0.001, min confidence as 0.5.
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, confidence and lift for the top 6 rules by confidence
##     lhs                     rhs                    support confidence    coverage     lift count
## [1] {rice,                                                                                      
##      sugar}              => {whole milk}       0.001220132          1 0.001220132 3.913649    12
## [2] {canned fish,                                                                               
##      hygiene articles}   => {whole milk}       0.001118454          1 0.001118454 3.913649    11
## [3] {root vegetables,                                                                           
##      butter,                                                                                    
##      rice}               => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [4] {root vegetables,                                                                           
##      whipped/sour cream,                                                                        
##      flour}              => {whole milk}       0.001728521          1 0.001728521 3.913649    17
## [5] {butter,                                                                                    
##      soft cheese,                                                                               
##      domestic eggs}      => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [6] {citrus fruit,                                                                              
##      root vegetables,                                                                           
##      soft cheese}        => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, confidence and lift for the top 6 rules by lift
##     lhs                        rhs                  support confidence    coverage     lift count
## [1] {Instant food products,                                                                      
##      soda}                  => {hamburger meat} 0.001220132  0.6315789 0.001931876 18.99565    12
## [2] {soda,                                                                                       
##      popcorn}               => {salty snack}    0.001220132  0.6315789 0.001931876 16.69779    12
## [3] {flour,                                                                                      
##      baking powder}         => {sugar}          0.001016777  0.5555556 0.001830198 16.40807    10
## [4] {ham,                                                                                        
##      processed cheese}      => {white bread}    0.001931876  0.6333333 0.003050330 15.04549    19
## [5] {whole milk,                                                                                 
##      Instant food products} => {hamburger meat} 0.001525165  0.5000000 0.003050330 15.03823    15
## [6] {other vegetables,                                                                           
##      curd,                                                                                       
##      yogurt,                                                                                     
##      whipped/sour cream}    => {cream cheese }  0.001016777  0.5882353 0.001728521 14.83409    10

The rules with a confidence of 1 (see rules_conf above) imply that whenever the LHS items (the antecedent itemset) were purchased, the RHS item (the consequent itemset) was also purchased 100% of the time.

A rule with a lift of about 19 (see rules_lift above) implies that the items in the LHS and RHS are about 19 times more likely to be purchased together than they would be if their purchases were unrelated (statistically independent).
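The reported measures can be checked by hand. The sketch below recomputes the support and lift of the rule {rice, sugar} => {whole milk} using only numbers printed in the outputs above (9835 transactions, a rule count of 12, a confidence of 1, and a support of 0.25551601 for {whole milk}); the variable names are illustrative.

```r
n_transactions <- 9835         # from the apriori() log above
count_rule     <- 12           # count of {rice, sugar} => {whole milk}
confidence     <- 1            # reported confidence of that rule
support_rhs    <- 0.25551601   # support of {whole milk} in the eclat() output

# Support of the rule: transactions containing LHS and RHS, over all transactions
support_rule <- count_rule / n_transactions

# Lift: confidence divided by the baseline support of the RHS
lift <- confidence / support_rhs

print(c(support = support_rule, lift = lift))
```

Both values should match the support and lift columns of the first rule in rules_conf up to rounding. Because the confidence is 1, the lift here is simply the reciprocal of the support of whole milk.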

###How To Control The Number Of Rules in Output?

Adjust the maxlen, supp and conf arguments in the apriori() function to control the number of rules generated. You will have to tune these values based on the sparseness of your data.

rules <- apriori(Groceries, parameter = list (supp = 0.001, conf = 0.2, maxlen=3)) # maxlen = 3 limits the elements in a rule to 3
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## Warning in apriori(Groceries, parameter = list(supp = 0.001, conf = 0.2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 3 returned!
##  done [0.00s].
## writing ... [9958 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
#summary of rules
summary(rules)
## set of 9958 rules
## 
## rule length distribution (lhs + rhs):sizes
##    1    2    3 
##    1  620 9337 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   2.938   3.000   3.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001017   Min.   :0.2000   Min.   :0.001118   Min.   : 0.8028  
##  1st Qu.:0.001220   1st Qu.:0.2439   1st Qu.:0.003762   1st Qu.: 1.8658  
##  Median :0.001627   Median :0.3077   Median :0.005287   Median : 2.3603  
##  Mean   :0.002554   Mean   :0.3452   Mean   :0.008272   Mean   : 2.6338  
##  3rd Qu.:0.002542   3rd Qu.:0.4194   3rd Qu.:0.008236   3rd Qu.: 3.0742  
##  Max.   :0.255516   Max.   :1.0000   Max.   :1.000000   Max.   :35.7158  
##      count        
##  Min.   :  10.00  
##  1st Qu.:  12.00  
##  Median :  16.00  
##  Mean   :  25.12  
##  3rd Qu.:  25.00  
##  Max.   :2513.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001        0.2
  1. To get 'stronger' rules, increase the value of the conf parameter.
  2. To get 'longer' rules, increase maxlen.
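An alternative to re-running apriori() with a higher conf is to filter an existing rule set by its quality measures. The sketch below assumes the arules package is installed and re-mines the same rules so that it is self-contained; the thresholds 0.5 and 3 are arbitrary illustrative cut-offs, and the name strong_rules is hypothetical.

```r
library(arules)
data(Groceries)

# Same mining call as above, with the progress log switched off
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.2, maxlen = 3),
                 control = list(verbose = FALSE))

# Keep only rules that exceed stricter (illustrative) thresholds
strong_rules <- subset(rules, subset = confidence > 0.5 & lift > 3)

length(rules)          # number of rules before filtering
length(strong_rules)   # number of rules after filtering
```

Filtering this way leaves the original rules object untouched, so several thresholds can be tried without mining again.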
# Inspect rules
#inspect(rules)
#inspect top 5 rules by highest lift
inspect(head(sort(rules, by ="lift"),5))
##     lhs                              rhs                     support    
## [1] {bottled beer,red/blush wine} => {liquor}                0.001931876
## [2] {hamburger meat,soda}         => {Instant food products} 0.001220132
## [3] {ham,white bread}             => {processed cheese}      0.001931876
## [4] {bottled beer,liquor}         => {red/blush wine}        0.001931876
## [5] {Instant food products,soda}  => {hamburger meat}        0.001220132
##     confidence coverage    lift     count
## [1] 0.3958333  0.004880529 35.71579 19   
## [2] 0.2105263  0.005795628 26.20919 12   
## [3] 0.3800000  0.005083884 22.92822 19   
## [4] 0.4130435  0.004677173 21.49356 19   
## [5] 0.6315789  0.001931876 18.99565 12
# Visualization of rules
#Plotting rules
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

# Two key plot
plot(rules , shading="order", control=list(main="two-key plot"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

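1. Purchase patterns related to beverages (wine, beer, soda): find the subset of rules that has a given beverage on the right-hand side. The code below uses soda; for wine, the corresponding item label seen in the rules above is red/blush wine.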

RulesBev1 <- subset(rules, subset = rhs %ain% "soda")
summary(RulesBev1)
## set of 699 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
##  64 635 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.908   3.000   3.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.001017   Min.   :0.2000   Min.   :0.001322   Min.   :1.147  
##  1st Qu.:0.001220   1st Qu.:0.2390   1st Qu.:0.004067   1st Qu.:1.371  
##  Median :0.001627   Median :0.2807   Median :0.005592   Median :1.610  
##  Mean   :0.002408   Mean   :0.2994   Mean   :0.008989   Mean   :1.717  
##  3rd Qu.:0.002440   3rd Qu.:0.3415   3rd Qu.:0.009049   3rd Qu.:1.958  
##  Max.   :0.038332   Max.   :0.7692   Max.   :0.183935   Max.   :4.411  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 12.00  
##  Median : 16.00  
##  Mean   : 23.68  
##  3rd Qu.: 24.00  
##  Max.   :377.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001        0.2
inspect(head(sort(RulesBev1, by ="lift"),5))
##     lhs                             rhs    support     confidence coverage   
## [1] {coffee,misc. beverages}     => {soda} 0.001016777 0.7692308  0.001321810
## [2] {pastry,misc. beverages}     => {soda} 0.001220132 0.6315789  0.001931876
## [3] {chicken,waffles}            => {soda} 0.001220132 0.5714286  0.002135231
## [4] {tropical fruit,canned beer} => {soda} 0.001728521 0.5666667  0.003050330
## [5] {bottled water,cake bar}     => {soda} 0.001016777 0.5555556  0.001830198
##     lift     count
## [1] 4.411303 10   
## [2] 3.621912 12   
## [3] 3.276968 12   
## [4] 3.249660 17   
## [5] 3.185941 10
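The same kind of filter can target wine instead of soda. The sketch below assumes the arules package is installed and re-mines the rules so that it runs on its own. It uses %in% for an exact item label and %pin% for partial matching on the label text; the names wine_rules and any_wine_rules are illustrative.

```r
library(arules)
data(Groceries)

# Same mining call as above, with the progress log switched off
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.2, maxlen = 3),
                 control = list(verbose = FALSE))

# Exact match: rules whose RHS is the item "red/blush wine"
wine_rules <- subset(rules, subset = rhs %in% "red/blush wine")
inspect(head(sort(wine_rules, by = "lift"), 5))

# Partial match: rules whose RHS label contains the text "wine"
any_wine_rules <- subset(rules, subset = rhs %pin% "wine")
length(any_wine_rules)
```

Partial matching is convenient when several item labels share a word, but it can also pick up labels that were not intended, so it is worth inspecting the result.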