Association Rule Mining with data preprocessing

For this practice, we will use the AdultUCI built-in data set in arules package. The data set is not transactionalized yet, so we will practice data preprocessing as well.

setwd("D:/Class Materials & Work/Summer 2020 practice/ARM_2")

library(arules) #for ARM
library(arulesViz) #for ARM visualization

Converting data frame into transactional data

When we convert the dataframe into a transactional dataset, each row of this dataframe will become a transaction, and each column will become an item.

However, variables should be discretized to coerce them into fixed value first.

data("AdultUCI")

str(AdultUCI)

## 'data.frame':    48842 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : Ord.factor w/ 16 levels "Preschool"<"1st-4th"<..: 14 14 9 7 14 15 5 9 15 14 ...
##  $ education-num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital-status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital-gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital-loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours-per-week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native-country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ income        : Ord.factor w/ 2 levels "small"<"large": 1 1 1 1 1 1 1 2 2 2 ...

In AdultUCI dataframe, columns 1, 3, 5, 11, 12, 13 are integers. So we either convert them into factors or remove them.

#remove Attributes "fnlwgt" and "education-num"

AdultUCI[["fnlwgt"]] <- NULL

AdultUCI[["education-num"]] <- NULL

#Discretize the variables by cut-points

AdultUCI[[ "age"]] <- ordered(cut(AdultUCI[[ "age"]], c(15,25,45,65,100)),
                              labels = c("Young", "Middle-aged", "Senior", "Old")) #for age

AdultUCI[[ "hours-per-week"]] <- ordered(cut(AdultUCI[[ "hours-per-week"]],c(0,25,40,60,168)),
                                         labels = c("Part-time", "Full-time", "Over-time", "Workaholic")) #for hours-per-week

AdultUCI[[ "capital-gain"]] <- ordered(cut(AdultUCI[[ "capital-gain"]],
                                           c(-Inf,0,median(AdultUCI[[ "capital-gain"]][AdultUCI[[ "capital-gain"]]>0]),Inf)), 
                                       labels = c("None", "Low", "High")) #for capital gain

AdultUCI[[ "capital-loss"]] <- ordered(cut(AdultUCI[[ "capital-loss"]],
                                           c(-Inf,0, median(AdultUCI[[ "capital-loss"]][AdultUCI[[ "capital-loss"]]>0]),Inf)), 
                                       labels = c("None", "Low", "High")) #for capital loss

Now that we have processed the integer variables, let us check the data structure again.

str(AdultUCI)

## 'data.frame':    48842 obs. of  13 variables:
##  $ age           : Ord.factor w/ 4 levels "Young"<"Middle-aged"<..: 2 3 2 3 2 2 3 3 2 2 ...
##  $ workclass     : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
##  $ education     : Ord.factor w/ 16 levels "Preschool"<"1st-4th"<..: 14 14 9 7 14 15 5 9 15 14 ...
##  $ marital-status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital-gain  : Ord.factor w/ 3 levels "None"<"Low"<"High": 2 1 1 1 1 1 1 1 3 2 ...
##  $ capital-loss  : Ord.factor w/ 3 levels "None"<"Low"<"High": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hours-per-week: Ord.factor w/ 4 levels "Part-time"<"Full-time"<..: 2 1 2 2 2 2 1 3 3 2 ...
##  $ native-country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ income        : Ord.factor w/ 2 levels "small"<"large": 1 1 1 1 1 1 1 2 2 2 ...

All variables are either discretized or already a factor. Now to convert the dataframe into a transactional data:

transactional_AdultUCI <- as(AdultUCI, "transactions")

transactional_AdultUCI

## transactions in sparse format with
##  48842 transactions (rows) and
##  115 items (columns)

ARM practice with the transactionalized `AdultUCI` data set.

Let’s try plotting the frequencies of items in the data set with support greater than 20%.

itemFrequencyPlot(transactional_AdultUCI, support=.20)

Mining for rules with the minimum support of 1% and a confidence of 60%.

AdultUCI_rule <- apriori(transactional_AdultUCI, parameter=list(support=.01,confidence=.6,
                                                                minlen=2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 488 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.07s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [1.30s].
## writing ... [276437 rule(s)] done [0.11s].
## creating S4 object  ... done [0.38s].

summary(AdultUCI_rule)

## set of 276437 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9    10 
##   432  4981 22127 52669 75104 67198 38094 13244  2588 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   6.289   7.000  10.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift        
##  Min.   :0.01001   Min.   :0.6000   Min.   :0.01001   Min.   : 0.7171  
##  1st Qu.:0.01253   1st Qu.:0.7691   1st Qu.:0.01474   1st Qu.: 1.0100  
##  Median :0.01701   Median :0.9051   Median :0.02043   Median : 1.0554  
##  Mean   :0.02677   Mean   :0.8600   Mean   :0.03177   Mean   : 1.3110  
##  3rd Qu.:0.02741   3rd Qu.:0.9542   3rd Qu.:0.03288   3rd Qu.: 1.2980  
##  Max.   :0.87066   Max.   :1.0000   Max.   :0.95328   Max.   :20.6826  
##      count      
##  Min.   :  489  
##  1st Qu.:  612  
##  Median :  831  
##  Mean   : 1308  
##  3rd Qu.: 1339  
##  Max.   :42525  
## 
## mining info:
##                    data ntransactions support confidence
##  transactional_AdultUCI         48842    0.01        0.6

The method yields 276437 rules with minimum support 0.01 and confidence threshold 0.60.

Next, we can try finding rules that are related with each level of variable income (small and large).

#rules with high income as antecedent

AdultUCI_high_income <- apriori(transactional_AdultUCI, parameter = list(supp = .01, conf = .6, minlen = 2), 
                                appearance = list(default = "rhs",lhs = "income=large"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 488 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.07s].
## sorting and recoding items ... [67 item(s)] done [0.02s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

summary(AdultUCI_high_income)

## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 
## 8 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.1016   Min.   :0.6330   Min.   :0.1605   Min.   :0.8569  
##  1st Qu.:0.1249   1st Qu.:0.7783   1st Qu.:0.1605   1st Qu.:0.9371  
##  Median :0.1367   Median :0.8515   Median :0.1605   Median :1.0403  
##  Mean   :0.1325   Mean   :0.8251   Mean   :0.1605   Mean   :1.2248  
##  3rd Qu.:0.1450   3rd Qu.:0.9030   3rd Qu.:0.1605   3rd Qu.:1.4189  
##  Max.   :0.1468   Max.   :0.9146   Max.   :0.1605   Max.   :1.8697  
##      count     
##  Min.   :4963  
##  1st Qu.:6102  
##  Median :6677  
##  Mean   :6469  
##  3rd Qu.:7080  
##  Max.   :7171  
## 
## mining info:
##                    data ntransactions support confidence
##  transactional_AdultUCI         48842    0.01        0.6

#parallel plot
plot(AdultUCI_high_income, method="paracoord", control=list(reorder=TRUE))

#graph plot
plot(AdultUCI_high_income, method="graph", control=list(verbose = FALSE))

There are eight rules with high income as antecedent.

#rules with low income as antecedent

AdultUCI_low_income <- apriori(transactional_AdultUCI, parameter = list(supp = .01, conf = .6, minlen = 2), 
                               appearance = list(default = "rhs", lhs = "income=small"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 488 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.06s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 done [0.01s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

summary(AdultUCI_low_income)

## set of 7 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 
## 7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.3097   Min.   :0.6120   Min.   :0.5061   Min.   :0.9155  
##  1st Qu.:0.3383   1st Qu.:0.6684   1st Qu.:0.5061   1st Qu.:0.9855  
##  Median :0.4238   Median :0.8373   Median :0.5061   Median :1.0174  
##  Mean   :0.4052   Mean   :0.8006   Mean   :0.5061   Mean   :1.0057  
##  3rd Qu.:0.4677   3rd Qu.:0.9240   3rd Qu.:0.5061   3rd Qu.:1.0389  
##  Max.   :0.4908   Max.   :0.9698   Max.   :0.5061   Max.   :1.0586  
##      count      
##  Min.   :15128  
##  1st Qu.:16522  
##  Median :20699  
##  Mean   :19790  
##  3rd Qu.:22842  
##  Max.   :23974  
## 
## mining info:
##                    data ntransactions support confidence
##  transactional_AdultUCI         48842    0.01        0.6

#parallel plot
plot(AdultUCI_low_income, method="paracoord", control=list(reorder=TRUE))

#graph plot
plot(AdultUCI_low_income, method="graph", control=list(verbose = FALSE))

There are seven rules with low income as antecedent.

Rule Prunning

We can check whether there are redundant rules and remove them.

#check whether there is redundant rule for high income
summary(is.redundant(AdultUCI_high_income))

##    Mode   FALSE 
## logical       8

#remove them
AdultUCI_high_income[-is.redundant(AdultUCI_high_income)]

## set of 0 rules

There is no redundant rule for high income.

#check whether there is redundant rule for low income
summary(is.redundant(AdultUCI_low_income))

##    Mode   FALSE 
## logical       7

#remove them
AdultUCI_low_income[-is.redundant(AdultUCI_low_income)]

## set of 0 rules

There is no redundant rule for low income.

We can inspect the rules for item sets with highest parameter value (lift) for both high and low income.

inspect(head(sort(AdultUCI_low_income, by ="lift"),3))

##     lhs               rhs                        support   confidence coverage 
## [1] {income=small} => {hours-per-week=Full-time} 0.3134802 0.6193770  0.5061218
## [2] {income=small} => {capital-gain=None}        0.4849310 0.9581311  0.5061218
## [3] {income=small} => {workclass=Private}        0.3630687 0.7173544  0.5061218
##     lift     count
## [1] 1.058600 15311
## [2] 1.044414 23685
## [3] 1.033358 17733

For low income, {income=small} => {hours-per-week=Full-time} has the highest lift.

inspect(head(sort(AdultUCI_high_income, by ="lift"),3))

##     lhs               rhs                                 support   confidence
## [1] {income=large} => {relationship=Husband}              0.1211662 0.7547507 
## [2] {income=large} => {marital-status=Married-civ-spouse} 0.1370132 0.8534626 
## [3] {income=large} => {sex=Male}                          0.1363990 0.8496365 
##     coverage  lift     count
## [1] 0.1605381 1.869727 5918 
## [2] 0.1605381 1.862676 6692 
## [3] 0.1605381 1.270994 6662

For high income, {income=large} => {relationship=Husband} has the highest lift.

Credit: https://rstudio-pubs-static.s3.amazonaws.com/384365_ff41bbcd1af347dfbfac2e966da30795.html

Association Rule Mining with data preprocessing

Tarid Wongvorachan

November 10th, 2020

Converting data frame into transactional data

ARM practice with the transactionalized `AdultUCI` data set.

Rule Prunning

Association Rule Mining with data preprocessing

Tarid Wongvorachan

November 10th, 2020

Converting data frame into transactional data

ARM practice with the transactionalized AdultUCI data set.

Rule Prunning

ARM practice with the transactionalized `AdultUCI` data set.