A9 association rules

Use the “AdultUCI” data available in the “arules” package and do as follows in R Script.

Attach the AdultUCI data in R

library(arules)

## Warning: package 'arules' was built under R version 4.1.2

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

data("AdultUCI")

Check the class, structure, dimension, head and tail of the attached data

class(AdultUCI)

## [1] "data.frame"

str(AdultUCI)

## 'data.frame':    48842 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : Ord.factor w/ 16 levels "Preschool"<"1st-4th"<..: 14 14 9 7 14 15 5 9 15 14 ...
##  $ education-num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital-status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital-gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital-loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours-per-week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native-country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ income        : Ord.factor w/ 2 levels "small"<"large": 1 1 1 1 1 1 1 2 2 2 ...

dim(AdultUCI)

## [1] 48842    15

head(AdultUCI)

##   age        workclass fnlwgt education education-num     marital-status
## 1  39        State-gov  77516 Bachelors            13      Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3  38          Private 215646   HS-grad             9           Divorced
## 4  53          Private 234721      11th             7 Married-civ-spouse
## 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital-gain capital-loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hours-per-week native-country income
## 1             40  United-States  small
## 2             13  United-States  small
## 3             40  United-States  small
## 4             40  United-States  small
## 5             40           Cuba  small
## 6             40  United-States  small

tail(AdultUCI)

##       age    workclass fnlwgt education education-num     marital-status
## 48837  33      Private 245211 Bachelors            13      Never-married
## 48838  39      Private 215419 Bachelors            13           Divorced
## 48839  64         <NA> 321403   HS-grad             9            Widowed
## 48840  38      Private 374983 Bachelors            13 Married-civ-spouse
## 48841  44      Private  83891 Bachelors            13           Divorced
## 48842  35 Self-emp-inc 182148 Bachelors            13 Married-civ-spouse
##            occupation   relationship               race    sex capital-gain
## 48837  Prof-specialty      Own-child              White   Male            0
## 48838  Prof-specialty  Not-in-family              White Female            0
## 48839            <NA> Other-relative              Black   Male            0
## 48840  Prof-specialty        Husband              White   Male            0
## 48841    Adm-clerical      Own-child Asian-Pac-Islander   Male         5455
## 48842 Exec-managerial        Husband              White   Male            0
##       capital-loss hours-per-week native-country income
## 48837            0             40  United-States   <NA>
## 48838            0             36  United-States   <NA>
## 48839            0             40  United-States   <NA>
## 48840            0             50  United-States   <NA>
## 48841            0             40  United-States   <NA>
## 48842            0             60  United-States   <NA>

Remove “fnlwgt” and “education-num” variables from the attached data

AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL

Convert “age” to an ordered factor with cuts at 15, 25, 45, 65 and 100 and label the groups “Young”, “Middle-aged”, “Senior” and “Old”

AdultUCI[[ "age"]] <- ordered(cut(AdultUCI[[ "age"]], c(15,25,45,65,100)),
                              labels = c("Young", "Middle-aged", "Senior", "Old"))
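
The interval convention of cut() matters at the break points. The following is a minimal sketch on made-up ages (the vector ages is hypothetical and not taken from AdultUCI). By default cut() builds right-closed intervals such as (15,25], so an age of exactly 25 is labelled “Young” and an age of 15 or below becomes NA.

```r
# Hypothetical ages chosen to sit on and around the break points
ages <- c(15, 17, 25, 26, 65, 66)

# Same call pattern as above: cut() yields (15,25], (25,45], (45,65], (65,100]
age_group <- ordered(cut(ages, c(15, 25, 45, 65, 100)),
                     labels = c("Young", "Middle-aged", "Senior", "Old"))

print(data.frame(ages, age_group))
```

If the lower bound should be included, cut() accepts include.lowest = TRUE; the assignment does not ask for it, so it is left out here.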

Convert “hours-per-week” to an ordered factor with cuts at 0, 25, 40, 60 and 168 and label the groups “Part-time”, “Full-time”, “Over-time” and “Workaholic”

AdultUCI[[ "hours-per-week"]] <- ordered(cut(AdultUCI[[ "hours-per-week"]],
                                             c(0,25,40,60,168)),
                                         labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

Convert “capital-gain” to an ordered factor with cuts at -Inf, 0, the median of the positive values and Inf, and label the groups “None”, “Low” and “High”

AdultUCI[[ "capital-gain"]] <- ordered(cut(AdultUCI[[ "capital-gain"]],
                                           c(-Inf,0,median(AdultUCI[[ "capital-gain"]][AdultUCI[[ "capital-gain"]]>0]),
                                             Inf)), labels = c("None", "Low", "High"))

Convert “capital-loss” to an ordered factor with cuts at -Inf, 0, the median of the positive values and Inf, and label the groups “None”, “Low” and “High”

AdultUCI[[ "capital-loss"]] <- ordered(cut(AdultUCI[[ "capital-loss"]],
                                           c(-Inf,0, median(AdultUCI[[ "capital-loss"]][AdultUCI[[ "capital-loss"]]>0]),
                                             Inf)), labels = c("None", "Low", "High"))
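
The two conversions above take the median of the strictly positive values only. A minimal sketch on a made-up vector (gain is hypothetical) shows the idea: when most values are zero, the median of all values would itself be 0 and would duplicate the break at 0, so the zeros are excluded before the median is taken.

```r
# Hypothetical capital-gain values: mostly zero, a few positive
gain <- c(0, 0, 0, 100, 200, 900)

# Median of the strictly positive values only
pos_median <- median(gain[gain > 0])

# Breaks -Inf < 0 < median < Inf give (-Inf,0], (0,median], (median,Inf]
gain_group <- ordered(cut(gain, c(-Inf, 0, pos_median, Inf)),
                      labels = c("None", "Low", "High"))

print(data.frame(gain, gain_group))
```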

Coerce the AdultUCI data to transactions named “Adult” and print the object to check it

Adult <- transactions(AdultUCI)
Adult

## transactions in sparse format with
##  48842 transactions (rows) and
##  115 items (columns)

Get summary of the “Adult” and interpret it critically

summary(Adult)

## transactions as itemMatrix in sparse format with
##  48842 rows (elements/itemsets/transactions) and
##  115 columns (items) and a density of 0.1089939 
## 
## most frequent items:
##            capital-loss=None            capital-gain=None 
##                        46560                        44807 
## native-country=United-States                   race=White 
##                        43832                        41762 
##            workclass=Private                      (Other) 
##                        33906                       401333 
## 
## element (itemset/transaction) length distribution:
## sizes
##     9    10    11    12    13 
##    19   971  2067 15623 30162 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   12.00   13.00   12.53   13.00   13.00 
## 
## includes extended item information - examples:
##            labels variables      levels
## 1       age=Young       age       Young
## 2 age=Middle-aged       age Middle-aged
## 3      age=Senior       age      Senior
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

Interpretation: Rule-mining algorithms such as Apriori work on categorical items, so the numeric variables were first converted to ordered factors. A list or data frame can then be coerced to the transactions class with the transactions() function. The reason for the coercion is to tell R that the data are transactional, so that association rules can be mined afterwards. The transactions were summarized with the summary() function. The summary shows 48842 rows, one transaction per row of the data frame, and 115 columns, one item per variable=level combination (the levels of the 13 remaining variables add up to 115). A transaction holds at most one item per variable, so at most 13 items. The most frequent items are capital-loss=None (46560), capital-gain=None (44807), native-country=United-States (43832), race=White (41762) and workclass=Private (33906), so these items occur in most transactions. The length distribution shows that 19 transactions contain 9 items, 971 contain 10 items, and so on; transactions with all 13 items are the most common, with a count of 30162. Transactions with fewer than 13 items come from rows with missing values (NA), which produce no item. The summary also reports the minimum, quartiles and mean of the transaction length. The density of 0.1089939 is the share of filled cells in the 48842 x 115 transaction-by-item matrix, which equals the mean transaction length divided by the number of items (12.53 / 115).
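
As a cross-check, the density and the mean transaction length can be recomputed from the length distribution printed by summary(Adult). This is only a sketch that copies the printed counts by hand; it does not call arules.

```r
# Transaction length distribution copied from summary(Adult)
sizes  <- c(9, 10, 11, 12, 13)
counts <- c(19, 971, 2067, 15623, 30162)

n_transactions <- sum(counts)          # rows of the item matrix
n_items        <- 115                  # columns of the item matrix
total_items    <- sum(sizes * counts)  # filled cells in the matrix

density     <- total_items / (n_transactions * n_items)
mean_length <- total_items / n_transactions

print(c(density = density, mean_length = mean_length))
```

Both values agree with the summary output (density 0.1089939, mean length 12.53).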

Inspect head and tail of the “Adult” and interpret them carefully

inspect(head(Adult))

##     items                                transactionID
## [1] {age=Middle-aged,                                 
##      workclass=State-gov,                             
##      education=Bachelors,                             
##      marital-status=Never-married,                    
##      occupation=Adm-clerical,                         
##      relationship=Not-in-family,                      
##      race=White,                                      
##      sex=Male,                                        
##      capital-gain=Low,                                
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States,                    
##      income=small}                                   1
## [2] {age=Senior,                                      
##      workclass=Self-emp-not-inc,                      
##      education=Bachelors,                             
##      marital-status=Married-civ-spouse,               
##      occupation=Exec-managerial,                      
##      relationship=Husband,                            
##      race=White,                                      
##      sex=Male,                                        
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Part-time,                        
##      native-country=United-States,                    
##      income=small}                                   2
## [3] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=HS-grad,                               
##      marital-status=Divorced,                         
##      occupation=Handlers-cleaners,                    
##      relationship=Not-in-family,                      
##      race=White,                                      
##      sex=Male,                                        
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States,                    
##      income=small}                                   3
## [4] {age=Senior,                                      
##      workclass=Private,                               
##      education=11th,                                  
##      marital-status=Married-civ-spouse,               
##      occupation=Handlers-cleaners,                    
##      relationship=Husband,                            
##      race=Black,                                      
##      sex=Male,                                        
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States,                    
##      income=small}                                   4
## [5] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=Bachelors,                             
##      marital-status=Married-civ-spouse,               
##      occupation=Prof-specialty,                       
##      relationship=Wife,                               
##      race=Black,                                      
##      sex=Female,                                      
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=Cuba,                             
##      income=small}                                   5
## [6] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=Masters,                               
##      marital-status=Married-civ-spouse,               
##      occupation=Exec-managerial,                      
##      relationship=Wife,                               
##      race=White,                                      
##      sex=Female,                                      
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States,                    
##      income=small}                                   6

inspect(tail(Adult))

##     items                                transactionID
## [1] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=Bachelors,                             
##      marital-status=Never-married,                    
##      occupation=Prof-specialty,                       
##      relationship=Own-child,                          
##      race=White,                                      
##      sex=Male,                                        
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States}               48837
## [2] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=Bachelors,                             
##      marital-status=Divorced,                         
##      occupation=Prof-specialty,                       
##      relationship=Not-in-family,                      
##      race=White,                                      
##      sex=Female,                                      
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States}               48838
## [3] {age=Senior,                                      
##      education=HS-grad,                               
##      marital-status=Widowed,                          
##      relationship=Other-relative,                     
##      race=Black,                                      
##      sex=Male,                                        
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States}               48839
## [4] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=Bachelors,                             
##      marital-status=Married-civ-spouse,               
##      occupation=Prof-specialty,                       
##      relationship=Husband,                            
##      race=White,                                      
##      sex=Male,                                        
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Over-time,                        
##      native-country=United-States}               48840
## [5] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=Bachelors,                             
##      marital-status=Divorced,                         
##      occupation=Adm-clerical,                         
##      relationship=Own-child,                          
##      race=Asian-Pac-Islander,                         
##      sex=Male,                                        
##      capital-gain=Low,                                
##      capital-loss=None,                               
##      hours-per-week=Full-time,                        
##      native-country=United-States}               48841
## [6] {age=Middle-aged,                                 
##      workclass=Self-emp-inc,                          
##      education=Bachelors,                             
##      marital-status=Married-civ-spouse,               
##      occupation=Exec-managerial,                      
##      relationship=Husband,                            
##      race=White,                                      
##      sex=Male,                                        
##      capital-gain=None,                               
##      capital-loss=None,                               
##      hours-per-week=Over-time,                        
##      native-country=United-States}               48842

Interpretation: Calling inspect() on the head and tail of the transactions lists the first and last six transactions together with their transactionID. Each transaction is a set of items of the form variable=level. After the conversion, the data frame is in flat-file (basket) form: each row represents one transaction, and each column represents an item that may or may not be in the basket. The last six transactions contain only 10 to 12 items because income is missing (NA) in those rows, and transaction 48839 also lacks workclass and occupation. Once the ordered-factor data have been coerced to transactions, they are ready for mining itemsets or rules.
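
The way a data-frame row turns into items can be imitated in base R. The sketch below uses a made-up row (row is hypothetical) and only mimics the labelling seen in the inspect() output; it is not how arules implements the coercion internally.

```r
# One made-up row; NA marks a missing value
row <- c(age = "Senior", workclass = NA, education = "HS-grad", sex = "Male")

# Build variable=level labels and drop the variables that are missing
items <- paste0(names(row), "=", row)[!is.na(row)]

print(items)
```

The missing workclass yields no item, which is why transactions with NA values are shorter than 13 items.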

Create absolute and relative item frequency plot and color it with RColorBrewer package

library(RColorBrewer)
# using itemFrequencyPlot() function
arules::itemFrequencyPlot(Adult, topN = 20,
                          col = brewer.pal(8, 'Pastel2'),
                          main = 'Absolute Item Frequency Plot',
                          type = "absolute",
                          ylab = "Item Frequency (Absolute)")

arules::itemFrequencyPlot(Adult, topN = 20,
                          col = brewer.pal(8, 'Pastel2'),
                          main = 'Relative Item Frequency Plot',
                          type = "relative",
                          ylab = "Item Frequency (Relative)")

Create an apriori rule as “association.rule” with support = 1%, confidence = 80% and maximum length of the rule as 10. Get summary of this rule and interpret it carefully.

aParam  = new("APparameter", "support" =0.01,  "confidence" = 0.8, "maxlen"= 10)
association.rule <-apriori(Adult,aParam)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 488 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(Adult, aParam): Mining stopped (maxlen reached). Only
## patterns up to a length of 10 returned!

##  done [0.90s].
## writing ... [197371 rule(s)] done [0.05s].
## creating S4 object  ... done [0.10s].

summary(association.rule)

## set of 197371 rules
## 
## rule length distribution (lhs + rhs):sizes
##     1     2     3     4     5     6     7     8     9    10 
##     4   266  3303 15219 37015 53616 48402 27754  9827  1965 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.318   7.000  10.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift        
##  Min.   :0.01001   Min.   :0.8000   Min.   :0.01001   Min.   : 0.8677  
##  1st Qu.:0.01251   1st Qu.:0.8953   1st Qu.:0.01353   1st Qu.: 1.0059  
##  Median :0.01708   Median :0.9372   Median :0.01847   Median : 1.0398  
##  Mean   :0.02726   Mean   :0.9283   Mean   :0.02949   Mean   : 1.2899  
##  3rd Qu.:0.02766   3rd Qu.:0.9669   3rd Qu.:0.02995   3rd Qu.: 1.2160  
##  Max.   :0.95328   Max.   :1.0000   Max.   :1.00000   Max.   :20.6826  
##      count      
##  Min.   :  489  
##  1st Qu.:  611  
##  Median :  834  
##  Mean   : 1331  
##  3rd Qu.: 1351  
##  Max.   :46560  
## 
## mining info:
##   data ntransactions support confidence
##  Adult         48842    0.01        0.8
##                                       call
##  apriori(data = Adult, parameter = aParam)

Interpretation: The Apriori run above can be divided into the following two sub-tasks:

Frequent itemset generation: finding all frequent itemsets in the Adult transactions. An itemset is frequent when it satisfies the minimum support threshold, here 1% (reported by Apriori as an absolute minimum support count of 488).
Rule generation: from these frequent itemsets, the algorithm generates the association rules whose confidence is at least 80%, the confidence threshold passed to Apriori.

Many association rules, with any number of items, can satisfy a support of 1% and a confidence of 80%. To limit the output, the maximum number of items in a rule (lhs + rhs) is set to 10 with maxlen. The warning "Mining stopped (maxlen reached)" confirms that mining stopped at this limit.

The summary function applied to the rules gives the following information:
1. 197371 rules were created with support = 1%, confidence = 80% and maxlen = 10.
2. The rule length distribution gives the number of rules for each number of items (lhs + rhs). Rules with 6 items are the most numerous (53616), and 6 is also the median length.
3. The summary of the quality measures shows a median support of 0.01708, which is close to the 1% threshold, so most rules are only just frequent. A few rules have very high support (maximum 0.95328).
4. Confidence, the conditional probability P(rhs | lhs), is high (median 0.9372). This is expected, because only rules with a confidence of at least 80% are kept.
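
The figures quoted from the summary can be checked with a few lines of base R. This sketch copies the printed rule length distribution by hand and recomputes the total, the most common length and the mean length. It also shows why the smallest count among the rules is 489, although Apriori reports an absolute minimum support count of 488: 1% of 48842 is 488.42, and a whole number of transactions must be at least 489 to reach it.

```r
# Rule length distribution copied from summary(association.rule)
rule_length <- 1:10
rule_counts <- c(4, 266, 3303, 15219, 37015, 53616, 48402, 27754, 9827, 1965)

total_rules  <- sum(rule_counts)
modal_length <- rule_length[which.max(rule_counts)]
mean_length  <- sum(rule_length * rule_counts) / total_rules

# Smallest whole number of transactions that reaches 1% support
min_count <- ceiling(0.01 * 48842)

print(c(total = total_rules, mode = modal_length,
        mean = round(mean_length, 3), min_count = min_count))
```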

Inspect the first 10 rules and interpret it critically.

inspect(head(association.rule, 10))

##      lhs                                       rhs                               support confidence   coverage      lift count
## [1]  {}                                     => {race=White}                   0.85504279  0.8550428 1.00000000 1.0000000 41762
## [2]  {}                                     => {native-country=United-States} 0.89742435  0.8974243 1.00000000 1.0000000 43832
## [3]  {}                                     => {capital-gain=None}            0.91738668  0.9173867 1.00000000 1.0000000 44807
## [4]  {}                                     => {capital-loss=None}            0.95327792  0.9532779 1.00000000 1.0000000 46560
## [5]  {education=5th-6th}                    => {capital-loss=None}            0.01009377  0.9685658 0.01042136 1.0160372   493
## [6]  {education=Doctorate}                  => {race=White}                   0.01076942  0.8855219 0.01216166 1.0356463   526
## [7]  {education=Doctorate}                  => {capital-loss=None}            0.01076942  0.8855219 0.01216166 0.9289231   526
## [8]  {marital-status=Married-spouse-absent} => {capital-gain=None}            0.01218214  0.9474522 0.01285779 1.0327730   595
## [9]  {marital-status=Married-spouse-absent} => {capital-loss=None}            0.01240735  0.9649682 0.01285779 1.0122632   606
## [10] {education=12th}                       => {native-country=United-States} 0.01140412  0.8477930 0.01345154 0.9446958   557

Interpretation: The first 10 rules mined with a minimum support of 1%, a minimum confidence of 80% and at most 10 items are inspected above. They have either an empty or a single-item lhs, because the shortest rules appear first in the output; head() does not sort by any quality measure. The lhs and rhs columns are the "if" and "then" parts of a rule. For example, {education=Doctorate} => {race=White} reads: if education is Doctorate, then race is White. Its support of 0.01077 is above the 1% threshold, so the itemset {education=Doctorate, race=White} is frequent (526 transactions). Its confidence of 0.8855 means that about 89% of the transactions with education=Doctorate also contain race=White, which passes the 80% threshold. An itemset that passes the support threshold but not the confidence threshold does not produce a rule: to become a rule under Apriori, the itemset must be frequent and the rule must also pass the confidence threshold. Lift helps to judge the quality of a rule. A lift above 1 means that lhs and rhs occur together more often than expected if they were independent, a lift below 1 means less often, and a lift close to 1 means that there is almost no association and the co-occurrence may be just random. For example, {marital-status=Married-spouse-absent} => {capital-loss=None} has a support above 1%, so it is frequent, and a confidence of 96%, so capital-loss is very likely to be None when marital status is Married-spouse-absent. However, its lift of 1.012 is close to 1: capital-loss=None already occurs in 95% of all transactions, so the rule adds almost no information.

Confidence is directional: confidence(X => Y) generally differs from confidence(Y => X), so a high confidence alone does not show that X and Y are associated. Lift is symmetric and has no direction. In the summary above, the first quartile of lift is 1.0059, so at least three quarters of the rules have a lift above 1. For most of them the lift is only slightly above 1 (median 1.0398), which indicates a weak positive association between lhs and rhs.
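
The quality measures of a single rule can be reproduced by hand from the counts in the inspect() output. The sketch below does this for {education=Doctorate} => {race=White}. The count of the lhs is not printed directly, so it is recovered from the coverage column, which is an assumption about how to read the table rather than a value taken from the data.

```r
n        <- 48842   # total transactions
count_xy <- 526     # transactions with education=Doctorate and race=White
count_y  <- 41762   # transactions with race=White (rule [1] above)

# coverage is the support of the lhs; recover its count from the printed value
count_x <- round(0.01216166 * n)

support    <- count_xy / n
confidence <- count_xy / count_x
lift       <- confidence / (count_y / n)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

Swapping lhs and rhs would change the confidence (the denominator becomes count_y) but not the lift, which illustrates that lift has no direction.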

Remove the empty rules from the “association.rule” and inspect the first 10 rules with interpretations.

aParam@minlen <- 2L
association.rule <- apriori(Adult, aParam)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8      0    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 488 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(Adult, aParam): Mining stopped (maxlen reached). Only
## patterns up to a length of 10 returned!

##  done [0.85s].
## writing ... [197367 rule(s)] done [0.05s].
## creating S4 object  ... done [0.11s].

inspect(head(association.rule, 10))

##      lhs                                       rhs                               support confidence   coverage      lift count
## [1]  {education=5th-6th}                    => {capital-loss=None}            0.01009377  0.9685658 0.01042136 1.0160372   493
## [2]  {education=Doctorate}                  => {race=White}                   0.01076942  0.8855219 0.01216166 1.0356463   526
## [3]  {education=Doctorate}                  => {capital-loss=None}            0.01076942  0.8855219 0.01216166 0.9289231   526
## [4]  {marital-status=Married-spouse-absent} => {capital-gain=None}            0.01218214  0.9474522 0.01285779 1.0327730   595
## [5]  {marital-status=Married-spouse-absent} => {capital-loss=None}            0.01240735  0.9649682 0.01285779 1.0122632   606
## [6]  {education=12th}                       => {native-country=United-States} 0.01140412  0.8477930 0.01345154 0.9446958   557
## [7]  {education=12th}                       => {capital-gain=None}            0.01289873  0.9589041 0.01345154 1.0452562   630
## [8]  {education=12th}                       => {capital-loss=None}            0.01322632  0.9832572 0.01345154 1.0314487   646
## [9]  {education=9th}                        => {race=White}                   0.01250973  0.8082011 0.01547848 0.9452171   611
## [10] {education=9th}                        => {capital-gain=None}            0.01457762  0.9417989 0.01547848 1.0266107   712

Interpretation: Some of the rules have an empty lhs. Their support is above the threshold and their confidence is above 80%, but their lift is exactly 1, because with an empty lhs the confidence is simply the support of the rhs. Such rules are uninformative. They can be removed by setting the minlen parameter to 2, which is the minimum number of items (lhs + rhs) in a rule. The number of rules drops from 197371 to 197367, so exactly the four rules with an empty lhs are removed.
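
Re-running apriori() with minlen = 2 is one option. As an alternative sketch, rules with an empty lhs can also be filtered out of an existing rule set by the size of their lhs. The example uses a small made-up set of baskets (baskets and toy are hypothetical) so that it runs quickly; the same indexing should apply to association.rule.

```r
library(arules)

# Hypothetical toy baskets, only for illustration
baskets <- list(c("a", "b"), c("a", "b"), c("a", "c"), c("b"))
toy <- as(baskets, "transactions")

# minlen = 1 (the default) allows rules with an empty lhs
toy_rules <- apriori(toy,
                     parameter = list(support = 0.5, confidence = 0.5, minlen = 1),
                     control = list(verbose = FALSE))

# Keep only the rules whose lhs contains at least one item
nonempty <- toy_rules[size(lhs(toy_rules)) > 0]

inspect(nonempty)
```

Filtering afterwards avoids mining twice, while minlen = 2 keeps the empty rules from being created in the first place.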

Create a new rule as “capital.gain.rhs.rule” with “capital-gain=None” in the RHS with support of 1%, confidence of 80%, maximum length of 10 and minimum length of 2.

capital.gain.rhs.rule <- apriori(Adult, parameter = list(support=0.01, confidence = 0.8, minlen = 2, maxlen = 10), 
                        appearance = list(rhs = c("capital-gain=None"), default="lhs"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 488 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(Adult, parameter = list(support = 0.01, confidence = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!

##  done [0.85s].
## writing ... [35433 rule(s)] done [0.01s].
## creating S4 object  ... done [0.03s].

Get summary of this rule and interpret it critically.

summary(capital.gain.rhs.rule)

## set of 35433 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6    7    8    9   10 
##   60  706 3062 7110 9790 8377 4537 1508  283 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   6.212   7.000  10.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01001   Min.   :0.8000   Min.   :0.01009   Min.   :0.8720  
##  1st Qu.:0.01265   1st Qu.:0.9015   1st Qu.:0.01359   1st Qu.:0.9827  
##  Median :0.01744   Median :0.9453   Median :0.01882   Median :1.0304  
##  Mean   :0.02819   Mean   :0.9287   Mean   :0.03051   Mean   :1.0124  
##  3rd Qu.:0.02864   3rd Qu.:0.9654   3rd Qu.:0.03092   3rd Qu.:1.0523  
##  Max.   :0.87066   Max.   :1.0000   Max.   :0.95328   Max.   :1.0901  
##      count      
##  Min.   :  489  
##  1st Qu.:  618  
##  Median :  852  
##  Mean   : 1377  
##  3rd Qu.: 1399  
##  Max.   :42525  
## 
## mining info:
##   data ntransactions support confidence
##  Adult         48842    0.01        0.8
##                                                                                                                                                                  call
##  apriori(data = Adult, parameter = list(support = 0.01, confidence = 0.8, minlen = 2, maxlen = 10), appearance = list(rhs = c("capital-gain=None"), default = "lhs"))

inspect(head(capital.gain.rhs.rule))

##     lhs                                       rhs                    support confidence   coverage     lift count
## [1] {marital-status=Married-spouse-absent} => {capital-gain=None} 0.01218214  0.9474522 0.01285779 1.032773   595
## [2] {education=12th}                       => {capital-gain=None} 0.01289873  0.9589041 0.01345154 1.045256   630
## [3] {education=9th}                        => {capital-gain=None} 0.01457762  0.9417989 0.01547848 1.026611   712
## [4] {education=7th-8th}                    => {capital-gain=None} 0.01822202  0.9319372 0.01955284 1.015861   890
## [5] {native-country=Mexico}                => {capital-gain=None} 0.01871340  0.9610936 0.01947095 1.047643   914
## [6] {occupation=Protective-serv}           => {capital-gain=None} 0.01863151  0.9257375 0.02012612 1.009103   910

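Interpretation: The appearance argument of apriori() restricts which items may appear on each side of a rule. Here rhs is fixed to capital-gain=None and default = "lhs" allows every other item on the left-hand side, so all 35433 rules returned have capital-gain=None as their consequent. Confidence is high for every rule (0.80 to 1.00), but this mostly reflects how common the consequent is: dividing confidence by lift for any rule in the table gives a base rate of about 0.917 for capital-gain=None. Lift therefore ranges only from 0.872 to 1.090, with a mean of 1.012, so the rules are at best weakly informative. A lift above 1 means that the lhs and rhs occur together more often than expected under independence; for example, {education=12th} => {capital-gain=None} has a lift of 1.045, a positive but small association. Rules with a lift below 1 indicate that the lhs makes capital-gain=None slightly less likely than average, even though their confidence still exceeds 80%. The maxlen warning only notes that mining stopped at rules of length 10.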
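The base rate of capital-gain=None is not printed in the summary, but it can be recovered from the rule table. The following is a minimal base-R sketch that uses only the confidence and lift values copied from the first three rows of the inspect() output; the variable names are illustrative.

```r
# Confidence and lift copied from the first three rows of the inspect() output.
confidence <- c(0.9474522, 0.9589041, 0.9417989)
lift       <- c(1.032773,  1.045256,  1.026611)

# lift = confidence / P(rhs), so P(rhs) = confidence / lift.
base_rate <- confidence / lift
print(round(base_rate, 4))

# A rule with confidence 1 attains the largest possible lift, 1 / P(rhs).
max_lift <- 1 / mean(base_rate)
print(round(max_lift, 3))
```

Every row gives nearly the same base rate, and the implied ceiling on lift agrees with the maximum lift of about 1.09 reported by summary().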

Create a new rule as “hours.per.week.ft.rule” with “hours-per-week=Full-time” in the RHS, with support of 1%, confidence of 80%, maximum length of 10 and minimum length of 2.

hours.per.week.ft.rule <- apriori(Adult, parameter = list(support=0.01, confidence = 0.8, minlen = 2, maxlen = 10), 
                                 appearance = list(rhs = c("hours-per-week=Full-time"), default="lhs"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 488 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.06s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(Adult, parameter = list(support = 0.01, confidence = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!

##  done [0.85s].
## writing ... [159 rule(s)] done [0.01s].
## creating S4 object  ... done [0.02s].

Get summary of this rule and interpret it critically.

summary(hours.per.week.ft.rule)

## set of 159 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5  6  7  8 
##  3 16 48 58 29  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.686   6.000   8.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01001   Min.   :0.8000   Min.   :0.01216   Min.   :1.367  
##  1st Qu.:0.01066   1st Qu.:0.8047   1st Qu.:0.01318   1st Qu.:1.375  
##  Median :0.01179   Median :0.8086   Median :0.01456   Median :1.382  
##  Mean   :0.01237   Mean   :0.8089   Mean   :0.01530   Mean   :1.383  
##  3rd Qu.:0.01331   3rd Qu.:0.8129   3rd Qu.:0.01654   3rd Qu.:1.389  
##  Max.   :0.01992   Max.   :0.8266   Max.   :0.02471   Max.   :1.413  
##      count      
##  Min.   :489.0  
##  1st Qu.:520.5  
##  Median :576.0  
##  Mean   :604.4  
##  3rd Qu.:650.0  
##  Max.   :973.0  
## 
## mining info:
##   data ntransactions support confidence
##  Adult         48842    0.01        0.8
##                                                                                                                                                                         call
##  apriori(data = Adult, parameter = list(support = 0.01, confidence = 0.8, minlen = 2, maxlen = 10), appearance = list(rhs = c("hours-per-week=Full-time"), default = "lhs"))

inspect(head(hours.per.week.ft.rule))

##     lhs                                rhs                           support confidence   coverage     lift count
## [1] {occupation=Machine-op-inspct,                                                                               
##      sex=Female}                    => {hours-per-week=Full-time} 0.01324680  0.8047264 0.01646124 1.375387   647
## [2] {occupation=Adm-clerical,                                                                                    
##      race=Black}                    => {hours-per-week=Full-time} 0.01224356  0.8102981 0.01510995 1.384910   598
## [3] {occupation=Adm-clerical,                                                                                    
##      relationship=Unmarried}        => {hours-per-week=Full-time} 0.01756685  0.8003731 0.02194832 1.367947   858
## [4] {workclass=Private,                                                                                          
##      occupation=Machine-op-inspct,                                                                               
##      sex=Female}                    => {hours-per-week=Full-time} 0.01293968  0.8092190 0.01599034 1.383066   632
## [5] {occupation=Machine-op-inspct,                                                                               
##      sex=Female,                                                                                                 
##      capital-gain=None}             => {hours-per-week=Full-time} 0.01281684  0.8067010 0.01588797 1.378762   626
## [6] {occupation=Machine-op-inspct,                                                                               
##      sex=Female,                                                                                                 
##      capital-loss=None}             => {hours-per-week=Full-time} 0.01296016  0.8073980 0.01605176 1.379953   633

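Interpretation: Here too the appearance argument fixes the rhs, this time to hours-per-week=Full-time, while default = "lhs" allows any combination of the remaining items on the lhs. Only 159 rules meet the thresholds, far fewer than the 35433 rules for capital-gain=None. Rule length (lhs + rhs) ranges from 3 to 8 items with a median of 6, so no single item predicts full-time work with at least 80% confidence. Confidence lies in a narrow band just above the threshold (0.800 to 0.827) and support is low (1.0% to 2.0%, or 489 to 973 transactions), so each rule describes a small subgroup. Lift is between 1.367 and 1.413 for every rule, a clearer positive association than in the capital-gain rules. The first six rules overlap heavily: several only add workclass=Private, capital-gain=None or capital-loss=None to {occupation=Machine-op-inspct, sex=Female}, which changes confidence very little. Combining the parameter and appearance arguments of apriori() gives flexible control over which rules are generated.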
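Many of the rules differ only by an extra item on the lhs, so one possible follow-up is to drop redundant rules before interpreting them. The sketch below assumes that the arules package is installed and that the Adult transactions shipped with arules match the Adult object used in this report; the names rules and pruned are illustrative. It uses is.redundant(), which flags a rule when a more general rule has the same or a higher confidence.

```r
# Assumes the arules package is installed; the block is skipped otherwise.
pruned <- NULL
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  data("Adult")  # assumed to match the Adult transactions used in this report

  # Same thresholds and rhs as hours.per.week.ft.rule, without progress output.
  rules <- apriori(
    Adult,
    parameter  = list(support = 0.01, confidence = 0.8, minlen = 2, maxlen = 10),
    appearance = list(rhs = "hours-per-week=Full-time", default = "lhs"),
    control    = list(verbose = FALSE)
  )

  # Keep only rules that are not flagged as redundant.
  pruned <- rules[!is.redundant(rules)]

  cat("rules mined:", length(rules), "\n")
  cat("non-redundant rules:", length(pruned), "\n")
  inspect(head(sort(pruned, by = "lift"), 3))
}
```

The pruned set is usually easier to read because each remaining rule adds information over its shorter sub-rules.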

Get new rule of “hours.per.week.ft.rule” as “conf.sort.rule” by sorting this rule in descending order by “confidence” and inspect the head and tail rules with critical interpretation.

conf.sort.rule <- sort(hours.per.week.ft.rule, by = "confidence", decreasing = TRUE)

inspect(head(conf.sort.rule))

##     lhs                           rhs                           support confidence   coverage     lift count
## [1] {age=Middle-aged,                                                                                       
##      occupation=Adm-clerical,                                                                               
##      relationship=Unmarried,                                                                                
##      sex=Female,                                                                                            
##      capital-gain=None}        => {hours-per-week=Full-time} 0.01005282  0.8265993 0.01216166 1.412771   491
## [2] {age=Middle-aged,                                                                                       
##      occupation=Adm-clerical,                                                                               
##      relationship=Unmarried,                                                                                
##      capital-gain=None}        => {hours-per-week=Full-time} 0.01066705  0.8243671 0.01293968 1.408956   521
## [3] {age=Middle-aged,                                                                                       
##      occupation=Adm-clerical,                                                                               
##      relationship=Unmarried,                                                                                
##      capital-gain=None,                                                                                     
##      capital-loss=None}        => {hours-per-week=Full-time} 0.01042136  0.8236246 0.01265304 1.407687   509
## [4] {age=Middle-aged,                                                                                       
##      relationship=Unmarried,                                                                                
##      race=Black,                                                                                            
##      sex=Female,                                                                                            
##      capital-gain=None}        => {hours-per-week=Full-time} 0.01029851  0.8218954 0.01253020 1.404732   503
## [5] {age=Middle-aged,                                                                                       
##      workclass=Private,                                                                                     
##      education=HS-grad,                                                                                     
##      race=Black,                                                                                            
##      capital-gain=None,                                                                                     
##      capital-loss=None}        => {hours-per-week=Full-time} 0.01148602  0.8201754 0.01400434 1.401792   561
## [6] {age=Middle-aged,                                                                                       
##      education=HS-grad,                                                                                     
##      occupation=Adm-clerical,                                                                               
##      sex=Female,                                                                                            
##      capital-gain=None,                                                                                     
##      capital-loss=None}        => {hours-per-week=Full-time} 0.01031899  0.8195122 0.01259162 1.400658   504

inspect(tail(conf.sort.rule))

##     lhs                                rhs                           support confidence   coverage     lift count
## [1] {occupation=Adm-clerical,                                                                                    
##      relationship=Unmarried}        => {hours-per-week=Full-time} 0.01756685  0.8003731 0.02194832 1.367947   858
## [2] {workclass=Private,                                                                                          
##      occupation=Adm-clerical,                                                                                    
##      relationship=Unmarried,                                                                                     
##      sex=Female,                                                                                                 
##      capital-gain=None,                                                                                          
##      capital-loss=None,                                                                                          
##      native-country=United-States}  => {hours-per-week=Full-time} 0.01033946  0.8003170 0.01291921 1.367851   505
## [3] {occupation=Adm-clerical,                                                                                    
##      relationship=Unmarried,                                                                                     
##      sex=Female,                                                                                                 
##      capital-gain=None,                                                                                          
##      capital-loss=None,                                                                                          
##      income=small}                  => {hours-per-week=Full-time} 0.01042136  0.8003145 0.01302158 1.367847   509
## [4] {occupation=Machine-op-inspct,                                                                               
##      sex=Female,                                                                                                 
##      capital-gain=None,                                                                                          
##      native-country=United-States}  => {hours-per-week=Full-time} 0.01031899  0.8000000 0.01289873 1.367309   504
## [5] {occupation=Machine-op-inspct,                                                                               
##      sex=Female,                                                                                                 
##      capital-loss=None,                                                                                          
##      native-country=United-States}  => {hours-per-week=Full-time} 0.01040088  0.8000000 0.01300111 1.367309   508
## [6] {workclass=Private,                                                                                          
##      education=HS-grad,                                                                                          
##      race=Black,                                                                                                 
##      capital-gain=None,                                                                                          
##      native-country=United-States,                                                                               
##      income=small}                  => {hours-per-week=Full-time} 0.01171123  0.8000000 0.01463904 1.367309   572

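Interpretation: The rules in hours.per.week.ft.rule are sorted in descending order of confidence, so head() shows the six rules with the highest confidence and tail() the six with the lowest. Confidence estimates how often the rhs holds among the transactions that match the lhs, so a higher value indicates a more reliable rule. The spread is small: confidence runs from 0.8266 at the top to 0.8000 at the bottom, which is the threshold itself.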

For example, take the top rule:

{age=Middle-aged, occupation=Adm-clerical, relationship=Unmarried, sex=Female, capital-gain=None} => {hours-per-week=Full-time}

Let lhs = {age=Middle-aged, occupation=Adm-clerical, relationship=Unmarried, sex=Female, capital-gain=None} and rhs = {hours-per-week=Full-time}.

$P(\mathrm{rhs} \mid \mathrm{lhs}) = P(\mathrm{lhs} \cap \mathrm{rhs}) / P(\mathrm{lhs}) = 0.01005282 / 0.01216166 \approx 0.8266$, the highest confidence in the rule set. The lift of this rule is 1.412771, which is greater than 1.

This implies a positive association between lhs and rhs: full-time work is about 1.41 times as likely in this group as in the data overall. The rule can be read as: if age is Middle-aged, occupation is Adm-clerical, relationship is Unmarried, sex is Female and capital-gain is None, then hours-per-week is Full-time in about 83% of cases. The rule rests on only 491 transactions, about 1% of the data, so it describes a small subgroup.
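The printed measures for this rule can be reproduced from its count. The worked calculation below assumes N = 48842 transactions, as reported in the mining info. The lhs count of 594 is not printed; it is obtained here as coverage times N. Likewise, P(rhs) is recovered from confidence and lift.

```latex
\begin{align*}
P(\mathrm{lhs} \cap \mathrm{rhs}) &= \frac{491}{48842} \approx 0.01005
  && \text{(support)} \\
P(\mathrm{lhs}) &= 0.01216166 \approx \frac{594}{48842}
  && \text{(coverage)} \\
P(\mathrm{rhs} \mid \mathrm{lhs}) &= \frac{P(\mathrm{lhs} \cap \mathrm{rhs})}{P(\mathrm{lhs})}
  = \frac{491}{594} \approx 0.8266
  && \text{(confidence)} \\
P(\mathrm{rhs}) &= \frac{P(\mathrm{rhs} \mid \mathrm{lhs})}{\mathrm{lift}}
  = \frac{0.8265993}{1.412771} \approx 0.585
  && \text{(base rate of the rhs)}
\end{align*}
```

Under these assumptions, the lhs raises the share of full-time workers from about 59% in the whole data set to about 83% within the subgroup.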

Plot the “hours.per.week.ft.rule” with the arulesViz package using the default plot, the “two-key plot” method, engine = “plotly”, method = “graph” with engine = “htmlwidget”, and the parallel coordinates plot, and interpret each graph carefully.

library(arulesViz)

plot(hours.per.week.ft.rule)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(hours.per.week.ft.rule, method="two-key plot")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(hours.per.week.ft.rule, engine = "plotly")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(hours.per.week.ft.rule, method = "graph",  engine = "htmlwidget")

plot(hours.per.week.ft.rule, method="paracoord")

Interpretations:

Scatter plot: The default plot(rules) call draws a scatter plot with support on the x-axis and confidence on the y-axis. Each dot is one rule, and the colour encodes lift: the darker the dot, the higher the lift. Because every rule has the same rhs, lift is confidence divided by a constant, so the colour follows the y-axis exactly. The plot shows that the rules with the highest confidence and lift have the lowest support. Raising the support threshold above about 1.5% would therefore likely discard the rules with the highest confidence and lift, so the support and confidence thresholds should be chosen with care.
Two-key plot: The default method is the scatter plot. With method = "two-key plot", support and confidence are still plotted against each other, but the colour encodes the order of the rule, that is, the number of items it contains (lhs + rhs). In this plot the orange points are the rules of order 3.
Interactive plot using plotly: The two plots above are static. With engine = "plotly", arulesViz draws the same support-confidence scatter plot as an interactive widget that supports zooming, autoscaling and resetting the axes. Hovering over a point shows the quality measures of that rule. Support and confidence are on the x- and y-axes, and the points are coloured by lift.
Graph plot: The graph method visualizes the rules with nodes and arrows. Each rule is a node labelled with its rule number, the lhs and rhs items are shown in rectangular boxes, and the arrows run from the lhs to the rhs. With a very large number of rules the graph becomes practically unreadable, so the widget provides filters for selecting by lhs item or by rule. The support, confidence and lift of a rule can be read by selecting the edge that connects its lhs and rhs.
Parallel coordinates plot: A parallel coordinates plot shows how the items combine into rules. The y-axis lists the items and the x-axis gives the position of each item within the rule. Each rule is drawn as a line that passes through its lhs items in turn and ends at the rhs. For example, take the rule {age=Middle-aged, occupation=Adm-clerical, relationship=Unmarried, sex=Female, capital-gain=None} => {hours-per-week=Full-time}.

In this rule, age=Middle-aged is at position 1, occupation=Adm-clerical at position 2, relationship=Unmarried at position 3, sex=Female at position 4 and capital-gain=None at position 5, and the line ends at the rhs, hours-per-week=Full-time.

The parallel coordinates plot makes it easier to see which items lead to the formation of the rules.
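The link between the colour (lift) and the y-axis (confidence) in the scatter plot can be checked directly. The following is a minimal base-R sketch that uses only the values copied from the head and tail of the sorted rules shown earlier; the variable names are illustrative.

```r
# Confidence and lift copied from the sorted inspect() output:
# the three highest-confidence and three of the lowest-confidence rules.
confidence <- c(0.8265993, 0.8243671, 0.8236246, 0.8003145, 0.8000000, 0.8000000)
lift       <- c(1.412771,  1.408956,  1.407687,  1.367847,  1.367309,  1.367309)

# With a fixed rhs, lift = confidence / P(rhs), so the ratio is constant.
ratio <- lift / confidence
print(round(ratio, 4))

# The spread of the ratio is negligible compared with its size.
print(diff(range(ratio)))
```

Since the ratio is the same for every rule, sorting or colouring by lift gives the same ordering as confidence whenever the rhs is fixed.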

Sachin Kafle

1/4/2022