For this practice, we will use the AdultUCI built-in data set in arules package. The data set is not transactionalized yet, so we will practice data preprocessing as well.
setwd("D:/Class Materials & Work/Summer 2020 practice/ARM_2")
library(arules) #for ARM
library(arulesViz) #for ARM visualization
When we convert the dataframe into a transactional dataset, each row of this dataframe will become a transaction, and each column will become an item.
However, variables should be discretized to coerce them into fixed value first.
data("AdultUCI")
str(AdultUCI)
## 'data.frame': 48842 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Ord.factor w/ 16 levels "Preschool"<"1st-4th"<..: 14 14 9 7 14 15 5 9 15 14 ...
## $ education-num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital-status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital-gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital-loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours-per-week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native-country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ income : Ord.factor w/ 2 levels "small"<"large": 1 1 1 1 1 1 1 2 2 2 ...
In AdultUCI dataframe, columns 1, 3, 5, 11, 12, 13 are integers. So we either convert them into factors or remove them.
#remove Attributes "fnlwgt" and "education-num"
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
#Discretize the variables by cut-points
AdultUCI[[ "age"]] <- ordered(cut(AdultUCI[[ "age"]], c(15,25,45,65,100)),
labels = c("Young", "Middle-aged", "Senior", "Old")) #for age
AdultUCI[[ "hours-per-week"]] <- ordered(cut(AdultUCI[[ "hours-per-week"]],c(0,25,40,60,168)),
labels = c("Part-time", "Full-time", "Over-time", "Workaholic")) #for hours-per-week
AdultUCI[[ "capital-gain"]] <- ordered(cut(AdultUCI[[ "capital-gain"]],
c(-Inf,0,median(AdultUCI[[ "capital-gain"]][AdultUCI[[ "capital-gain"]]>0]),Inf)),
labels = c("None", "Low", "High")) #for capital gain
AdultUCI[[ "capital-loss"]] <- ordered(cut(AdultUCI[[ "capital-loss"]],
c(-Inf,0, median(AdultUCI[[ "capital-loss"]][AdultUCI[[ "capital-loss"]]>0]),Inf)),
labels = c("None", "Low", "High")) #for capital loss
Now that we have processed the integer variables, let us check the data structure again.
str(AdultUCI)
## 'data.frame': 48842 obs. of 13 variables:
## $ age : Ord.factor w/ 4 levels "Young"<"Middle-aged"<..: 2 3 2 3 2 2 3 3 2 2 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ education : Ord.factor w/ 16 levels "Preschool"<"1st-4th"<..: 14 14 9 7 14 15 5 9 15 14 ...
## $ marital-status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital-gain : Ord.factor w/ 3 levels "None"<"Low"<"High": 2 1 1 1 1 1 1 1 3 2 ...
## $ capital-loss : Ord.factor w/ 3 levels "None"<"Low"<"High": 1 1 1 1 1 1 1 1 1 1 ...
## $ hours-per-week: Ord.factor w/ 4 levels "Part-time"<"Full-time"<..: 2 1 2 2 2 2 1 3 3 2 ...
## $ native-country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ income : Ord.factor w/ 2 levels "small"<"large": 1 1 1 1 1 1 1 2 2 2 ...
All variables are either discretized or already a factor. Now to convert the dataframe into a transactional data:
transactional_AdultUCI <- as(AdultUCI, "transactions")
transactional_AdultUCI
## transactions in sparse format with
## 48842 transactions (rows) and
## 115 items (columns)
AdultUCI data set.Let’s try plotting the frequencies of items in the data set with support greater than 20%.
itemFrequencyPlot(transactional_AdultUCI, support=.20)
Mining for rules with the minimum support of 1% and a confidence of 60%.
AdultUCI_rule <- apriori(transactional_AdultUCI, parameter=list(support=.01,confidence=.6,
minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 488
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.07s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [1.30s].
## writing ... [276437 rule(s)] done [0.11s].
## creating S4 object ... done [0.38s].
summary(AdultUCI_rule)
## set of 276437 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 432 4981 22127 52669 75104 67198 38094 13244 2588
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.289 7.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01001 Min. :0.6000 Min. :0.01001 Min. : 0.7171
## 1st Qu.:0.01253 1st Qu.:0.7691 1st Qu.:0.01474 1st Qu.: 1.0100
## Median :0.01701 Median :0.9051 Median :0.02043 Median : 1.0554
## Mean :0.02677 Mean :0.8600 Mean :0.03177 Mean : 1.3110
## 3rd Qu.:0.02741 3rd Qu.:0.9542 3rd Qu.:0.03288 3rd Qu.: 1.2980
## Max. :0.87066 Max. :1.0000 Max. :0.95328 Max. :20.6826
## count
## Min. : 489
## 1st Qu.: 612
## Median : 831
## Mean : 1308
## 3rd Qu.: 1339
## Max. :42525
##
## mining info:
## data ntransactions support confidence
## transactional_AdultUCI 48842 0.01 0.6
The method yields 276437 rules with minimum support 0.01 and confidence threshold 0.60.
Next, we can try finding rules that are related with each level of variable income (small and large).
#rules with high income as antecedent
AdultUCI_high_income <- apriori(transactional_AdultUCI, parameter = list(supp = .01, conf = .6, minlen = 2),
appearance = list(default = "rhs",lhs = "income=large"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 488
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.07s].
## sorting and recoding items ... [67 item(s)] done [0.02s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
summary(AdultUCI_high_income)
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1016 Min. :0.6330 Min. :0.1605 Min. :0.8569
## 1st Qu.:0.1249 1st Qu.:0.7783 1st Qu.:0.1605 1st Qu.:0.9371
## Median :0.1367 Median :0.8515 Median :0.1605 Median :1.0403
## Mean :0.1325 Mean :0.8251 Mean :0.1605 Mean :1.2248
## 3rd Qu.:0.1450 3rd Qu.:0.9030 3rd Qu.:0.1605 3rd Qu.:1.4189
## Max. :0.1468 Max. :0.9146 Max. :0.1605 Max. :1.8697
## count
## Min. :4963
## 1st Qu.:6102
## Median :6677
## Mean :6469
## 3rd Qu.:7080
## Max. :7171
##
## mining info:
## data ntransactions support confidence
## transactional_AdultUCI 48842 0.01 0.6
#parallel plot
plot(AdultUCI_high_income, method="paracoord", control=list(reorder=TRUE))
#graph plot
plot(AdultUCI_high_income, method="graph", control=list(verbose = FALSE))
There are eight rules with high income as antecedent.
#rules with low income as antecedent
AdultUCI_low_income <- apriori(transactional_AdultUCI, parameter = list(supp = .01, conf = .6, minlen = 2),
appearance = list(default = "rhs", lhs = "income=small"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 488
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.06s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 done [0.01s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
summary(AdultUCI_low_income)
## set of 7 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.3097 Min. :0.6120 Min. :0.5061 Min. :0.9155
## 1st Qu.:0.3383 1st Qu.:0.6684 1st Qu.:0.5061 1st Qu.:0.9855
## Median :0.4238 Median :0.8373 Median :0.5061 Median :1.0174
## Mean :0.4052 Mean :0.8006 Mean :0.5061 Mean :1.0057
## 3rd Qu.:0.4677 3rd Qu.:0.9240 3rd Qu.:0.5061 3rd Qu.:1.0389
## Max. :0.4908 Max. :0.9698 Max. :0.5061 Max. :1.0586
## count
## Min. :15128
## 1st Qu.:16522
## Median :20699
## Mean :19790
## 3rd Qu.:22842
## Max. :23974
##
## mining info:
## data ntransactions support confidence
## transactional_AdultUCI 48842 0.01 0.6
#parallel plot
plot(AdultUCI_low_income, method="paracoord", control=list(reorder=TRUE))
#graph plot
plot(AdultUCI_low_income, method="graph", control=list(verbose = FALSE))
There are seven rules with low income as antecedent.
We can check whether there are redundant rules and remove them.
#check whether there is redundant rule for high income
summary(is.redundant(AdultUCI_high_income))
## Mode FALSE
## logical 8
#remove them
AdultUCI_high_income[-is.redundant(AdultUCI_high_income)]
## set of 0 rules
There is no redundant rule for high income.
#check whether there is redundant rule for low income
summary(is.redundant(AdultUCI_low_income))
## Mode FALSE
## logical 7
#remove them
AdultUCI_low_income[-is.redundant(AdultUCI_low_income)]
## set of 0 rules
There is no redundant rule for low income.
We can inspect the rules for item sets with highest parameter value (lift) for both high and low income.
inspect(head(sort(AdultUCI_low_income, by ="lift"),3))
## lhs rhs support confidence coverage
## [1] {income=small} => {hours-per-week=Full-time} 0.3134802 0.6193770 0.5061218
## [2] {income=small} => {capital-gain=None} 0.4849310 0.9581311 0.5061218
## [3] {income=small} => {workclass=Private} 0.3630687 0.7173544 0.5061218
## lift count
## [1] 1.058600 15311
## [2] 1.044414 23685
## [3] 1.033358 17733
For low income, {income=small} => {hours-per-week=Full-time} has the highest lift.
inspect(head(sort(AdultUCI_high_income, by ="lift"),3))
## lhs rhs support confidence
## [1] {income=large} => {relationship=Husband} 0.1211662 0.7547507
## [2] {income=large} => {marital-status=Married-civ-spouse} 0.1370132 0.8534626
## [3] {income=large} => {sex=Male} 0.1363990 0.8496365
## coverage lift count
## [1] 0.1605381 1.869727 5918
## [2] 0.1605381 1.862676 6692
## [3] 0.1605381 1.270994 6662
For high income, {income=large} => {relationship=Husband} has the highest lift.
Credit: https://rstudio-pubs-static.s3.amazonaws.com/384365_ff41bbcd1af347dfbfac2e966da30795.html