Use the “AdultUCI” data set available in the “arules” package and carry out the following steps in an R script.
library(arules)
## Warning: package 'arules' was built under R version 4.1.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
data("AdultUCI")
class(AdultUCI)
## [1] "data.frame"
str(AdultUCI)
## 'data.frame': 48842 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Ord.factor w/ 16 levels "Preschool"<"1st-4th"<..: 14 14 9 7 14 15 5 9 15 14 ...
## $ education-num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital-status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital-gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital-loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours-per-week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native-country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ income : Ord.factor w/ 2 levels "small"<"large": 1 1 1 1 1 1 1 2 2 2 ...
dim(AdultUCI)
## [1] 48842 15
head(AdultUCI)
## age workclass fnlwgt education education-num marital-status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital-gain capital-loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours-per-week native-country income
## 1 40 United-States small
## 2 13 United-States small
## 3 40 United-States small
## 4 40 United-States small
## 5 40 Cuba small
## 6 40 United-States small
tail(AdultUCI)
## age workclass fnlwgt education education-num marital-status
## 48837 33 Private 245211 Bachelors 13 Never-married
## 48838 39 Private 215419 Bachelors 13 Divorced
## 48839 64 <NA> 321403 HS-grad 9 Widowed
## 48840 38 Private 374983 Bachelors 13 Married-civ-spouse
## 48841 44 Private 83891 Bachelors 13 Divorced
## 48842 35 Self-emp-inc 182148 Bachelors 13 Married-civ-spouse
## occupation relationship race sex capital-gain
## 48837 Prof-specialty Own-child White Male 0
## 48838 Prof-specialty Not-in-family White Female 0
## 48839 <NA> Other-relative Black Male 0
## 48840 Prof-specialty Husband White Male 0
## 48841 Adm-clerical Own-child Asian-Pac-Islander Male 5455
## 48842 Exec-managerial Husband White Male 0
## capital-loss hours-per-week native-country income
## 48837 0 40 United-States <NA>
## 48838 0 36 United-States <NA>
## 48839 0 40 United-States <NA>
## 48840 0 50 United-States <NA>
## 48841 0 40 United-States <NA>
## 48842 0 60 United-States <NA>
# Drop the variables that are not used for rule mining
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
# Discretize the numeric variables into ordered factors
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
                             labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
                                            c(0, 25, 40, 60, 168)),
                                        labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
# Capital gain and loss: "None" for 0, then split at the median of the positive values
AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
                                          c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
                                      labels = c("None", "Low", "High"))
AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
                                          c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
                                      labels = c("None", "Low", "High"))
Adult <- transactions(AdultUCI)
Adult
## transactions in sparse format with
## 48842 transactions (rows) and
## 115 items (columns)
summary(Adult)
## transactions as itemMatrix in sparse format with
## 48842 rows (elements/itemsets/transactions) and
## 115 columns (items) and a density of 0.1089939
##
## most frequent items:
## capital-loss=None capital-gain=None
## 46560 44807
## native-country=United-States race=White
## 43832 41762
## workclass=Private (Other)
## 33906 401333
##
## element (itemset/transaction) length distribution:
## sizes
## 9 10 11 12 13
## 19 971 2067 15623 30162
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 12.00 13.00 12.53 13.00 13.00
##
## includes extended item information - examples:
## labels variables levels
## 1 age=Young age Young
## 2 age=Middle-aged age Middle-aged
## 3 age=Senior age Senior
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
Interpretation: Rule-mining algorithms such as Apriori work on categorical items, so the numeric variables are first discretized into ordered factors. The data frame is then coerced to the transactions class with the transactions() function. The coercion tells R that we are dealing with transactional data, from which association rules can be mined afterwards. The transactions are summarized with the summary() function. The summary shows 48842 rows, each of which is one transaction (the set of items belonging to one person), and 115 columns, each of which is one item. An item is a variable=level pair such as age=Young, so the 115 items are simply all the levels of the 13 retained variables; they are not permutations of the observations. The most frequent items are capital-loss=None (46560 transactions), capital-gain=None (44807), native-country=United-States (43832), race=White (41762) and workclass=Private (33906). The length distribution shows that 19 transactions contain 9 items, 971 contain 10 items, and so on. The most common length is 13 items (30162 transactions), which corresponds to a record with no missing value; shorter transactions arise because a missing value (NA) produces no item. The density of 0.1089939 is the fraction of non-empty cells in the 48842 by 115 transaction-by-item matrix, i.e. a transaction contains about 10.9% of all items on average (12.53 out of 115).
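The density and the mean transaction length can be checked by hand. The sketch below uses only the counts printed by summary(Adult), so it runs in base R without the arules package; the variable names are illustrative and not part of the original script.

```r
# Counts copied from the summary(Adult) output above
n_transactions <- 48842
n_items <- 115

# Transaction length distribution: length -> number of transactions
sizes <- c("9" = 19, "10" = 971, "11" = 2067, "12" = 15623, "13" = 30162)

# Total number of non-empty cells in the transaction-by-item matrix
filled_cells <- sum(as.numeric(names(sizes)) * sizes)

# Density = non-empty cells / all cells
density <- filled_cells / (n_transactions * n_items)

# Mean number of items per transaction
mean_length <- filled_cells / n_transactions

print(round(density, 7))      # 0.1089939, as reported by summary()
print(round(mean_length, 2))  # 12.53, the mean transaction length
```

Both values agree with the summary, which confirms that density is simply the share of filled cells in the sparse matrix.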
inspect(head(Adult))
## items transactionID
## [1] {age=Middle-aged,
## workclass=State-gov,
## education=Bachelors,
## marital-status=Never-married,
## occupation=Adm-clerical,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital-gain=Low,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 1
## [2] {age=Senior,
## workclass=Self-emp-not-inc,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Part-time,
## native-country=United-States,
## income=small} 2
## [3] {age=Middle-aged,
## workclass=Private,
## education=HS-grad,
## marital-status=Divorced,
## occupation=Handlers-cleaners,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 3
## [4] {age=Senior,
## workclass=Private,
## education=11th,
## marital-status=Married-civ-spouse,
## occupation=Handlers-cleaners,
## relationship=Husband,
## race=Black,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 4
## [5] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Prof-specialty,
## relationship=Wife,
## race=Black,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=Cuba,
## income=small} 5
## [6] {age=Middle-aged,
## workclass=Private,
## education=Masters,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Wife,
## race=White,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States,
## income=small} 6
inspect(tail(Adult))
## items transactionID
## [1] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Never-married,
## occupation=Prof-specialty,
## relationship=Own-child,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States} 48837
## [2] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Divorced,
## occupation=Prof-specialty,
## relationship=Not-in-family,
## race=White,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States} 48838
## [3] {age=Senior,
## education=HS-grad,
## marital-status=Widowed,
## relationship=Other-relative,
## race=Black,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States} 48839
## [4] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Prof-specialty,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Over-time,
## native-country=United-States} 48840
## [5] {age=Middle-aged,
## workclass=Private,
## education=Bachelors,
## marital-status=Divorced,
## occupation=Adm-clerical,
## relationship=Own-child,
## race=Asian-Pac-Islander,
## sex=Male,
## capital-gain=Low,
## capital-loss=None,
## hours-per-week=Full-time,
## native-country=United-States} 48841
## [6] {age=Middle-aged,
## workclass=Self-emp-inc,
## education=Bachelors,
## marital-status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital-gain=None,
## capital-loss=None,
## hours-per-week=Over-time,
## native-country=United-States} 48842
Interpretation: Calling inspect() on the head and tail of the transactions lists the first and last six transactions together with their transaction IDs. Each transaction is the set of items (variable=level pairs) that belongs to one row of the original data frame. The last six transactions have no income item because income is NA for those rows, and transaction 48839 also lacks the workclass and occupation items. After the coercion, the data are in basket form: each row represents one transaction and each column represents one item that may or may not be in the basket. Once the data frame of factors has been coerced to transactions, the data are ready for mining itemsets or rules.
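How a row of a data frame turns into a set of items can be sketched in base R. This is only an illustration of the idea, not how arules implements the coercion; the toy data frame and the helper row_to_items() are hypothetical names introduced here.

```r
# Toy data: the first three ages and sexes shown by head(AdultUCI)
toy <- data.frame(age = c(39, 50, 38),
                  sex = c("Male", "Male", "Male"),
                  stringsAsFactors = TRUE)

# Discretize age with the same break points as in the script above.
# Passing labels to cut() directly stays valid even if a bin is empty.
toy$age <- cut(toy$age, breaks = c(15, 25, 45, 65, 100),
               labels = c("Young", "Middle-aged", "Senior", "Old"),
               ordered_result = TRUE)

# Turn one row into its set of "variable=level" items, skipping NA values
row_to_items <- function(df, i) {
  values <- vapply(df, function(col) as.character(col[i]), character(1))
  keep <- !is.na(values)
  paste(names(df)[keep], values[keep], sep = "=")
}

baskets <- lapply(seq_len(nrow(toy)), function(i) row_to_items(toy, i))
print(baskets[[1]])  # "age=Middle-aged" "sex=Male"
print(baskets[[2]])  # "age=Senior" "sex=Male"
```

The first two baskets match the age and sex items of transactions 1 and 2 in the inspect() output.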
library(RColorBrewer)
# using itemFrequencyPlot() function
arules::itemFrequencyPlot(Adult, topN = 20,
col = brewer.pal(8, 'Pastel2'),
main = 'Absolute Item Frequency Plot',
type = "absolute",
ylab = "Item Frequency (Absolute)")
arules::itemFrequencyPlot(Adult, topN = 20,
col = brewer.pal(8, 'Pastel2'),
main = 'Relative Item Frequency Plot',
type = "relative",
ylab = "Item Frequency (Relative)")
aParam <- new("APparameter", support = 0.01, confidence = 0.8, maxlen = 10)
association.rule <- apriori(Adult, aParam)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 488
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(Adult, aParam): Mining stopped (maxlen reached). Only
## patterns up to a length of 10 returned!
## done [0.90s].
## writing ... [197371 rule(s)] done [0.05s].
## creating S4 object ... done [0.10s].
summary(association.rule)
## set of 197371 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8 9 10
## 4 266 3303 15219 37015 53616 48402 27754 9827 1965
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.318 7.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01001 Min. :0.8000 Min. :0.01001 Min. : 0.8677
## 1st Qu.:0.01251 1st Qu.:0.8953 1st Qu.:0.01353 1st Qu.: 1.0059
## Median :0.01708 Median :0.9372 Median :0.01847 Median : 1.0398
## Mean :0.02726 Mean :0.9283 Mean :0.02949 Mean : 1.2899
## 3rd Qu.:0.02766 3rd Qu.:0.9669 3rd Qu.:0.02995 3rd Qu.: 1.2160
## Max. :0.95328 Max. :1.0000 Max. :1.00000 Max. :20.6826
## count
## Min. : 489
## 1st Qu.: 611
## Median : 834
## Mean : 1331
## 3rd Qu.: 1351
## Max. :46560
##
## mining info:
## data ntransactions support confidence
## Adult 48842 0.01 0.8
## call
## apriori(data = Adult, parameter = aParam)
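Interpretation: Rule mining with Apriori can be divided into two sub-tasks: first, finding all frequent itemsets, i.e. itemsets whose support is at least the minimum support (1% here); second, generating from those itemsets all rules whose confidence is at least the minimum confidence (80% here).
A very large number of rules of any length can satisfy a support of 1% and a confidence of 80%. To limit them, we restrict the number of items in a rule (lhs plus rhs) to at most 10 with maxlen. The warning in the output confirms that mining stopped at that length.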
The summary() function applied to the rules gives the following information:
1. 197371 rules were created with support = 1%, confidence = 80% and maxlen = 10.
2. The rule length distribution gives the number of rules for each rule length (number of items in lhs plus rhs). Rules with 6 items are the most common (53616 rules), and 6 is also the median length.
3. The summary of the quality measures shows a median support of 0.01708 (about 1.7%), which is close to the 1% threshold, so most rules are only just frequent.
4. Confidence, the conditional probability P(rhs | lhs), is high for every rule (median 0.9372) because rules below the 80% threshold are discarded.
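The figures in this summary can be reproduced from the printed counts alone. The following base R sketch (variable names are illustrative) derives the smallest transaction count that satisfies the 1% support threshold and the mode, median and mean of the rule length distribution. Note that the apriori() log reports 488 as the absolute minimum support count, while 488/48842 is just below 1%; the smallest count among the mined rules is 489.

```r
# Smallest count whose relative support is at least 1%
n_transactions <- 48842
min_support <- 0.01
min_count <- ceiling(min_support * n_transactions)

# Rule length distribution copied from summary(association.rule)
rule_lengths <- 1:10
rule_counts <- c(4, 266, 3303, 15219, 37015, 53616, 48402, 27754, 9827, 1965)

total_rules <- sum(rule_counts)
modal_length <- rule_lengths[which.max(rule_counts)]
median_length <- median(rep(rule_lengths, rule_counts))
mean_length <- sum(rule_lengths * rule_counts) / total_rules

print(min_count)              # 489, the minimum count in the summary
print(total_rules)            # 197371
print(modal_length)           # 6
print(median_length)          # 6
print(round(mean_length, 3))  # 6.318
```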
inspect(head(association.rule, 10))
## lhs rhs support confidence coverage lift count
## [1] {} => {race=White} 0.85504279 0.8550428 1.00000000 1.0000000 41762
## [2] {} => {native-country=United-States} 0.89742435 0.8974243 1.00000000 1.0000000 43832
## [3] {} => {capital-gain=None} 0.91738668 0.9173867 1.00000000 1.0000000 44807
## [4] {} => {capital-loss=None} 0.95327792 0.9532779 1.00000000 1.0000000 46560
## [5] {education=5th-6th} => {capital-loss=None} 0.01009377 0.9685658 0.01042136 1.0160372 493
## [6] {education=Doctorate} => {race=White} 0.01076942 0.8855219 0.01216166 1.0356463 526
## [7] {education=Doctorate} => {capital-loss=None} 0.01076942 0.8855219 0.01216166 0.9289231 526
## [8] {marital-status=Married-spouse-absent} => {capital-gain=None} 0.01218214 0.9474522 0.01285779 1.0327730 595
## [9] {marital-status=Married-spouse-absent} => {capital-loss=None} 0.01240735 0.9649682 0.01285779 1.0122632 606
## [10] {education=12th} => {native-country=United-States} 0.01140412 0.8477930 0.01345154 0.9446958 557
Interpretation: The first 10 rules mined with a minimum support of 1%, a minimum confidence of 80% and a maximum length of 10 are inspected above. They are not the 10 best rules: they have an empty or single-item lhs because the rules are listed in order of increasing length. The lhs and rhs are the "if" and "then" parts of a rule. For example: if education is Doctorate, then race is White. This rule has a support of 0.01077, which is above the 1% threshold, so the itemset {education=Doctorate, race=White} is frequent. Its confidence is 0.8855, i.e. about 89% of the transactions with education=Doctorate also contain race=White. An itemset that passes the support threshold but not the confidence threshold does not produce a rule: to become a rule under Apriori, the itemset must be frequent and the rule must also reach the confidence threshold. Lift complements these two measures. A lift above 1 means that lhs and rhs occur together more often than expected if they were independent, a lift below 1 means less often, and a lift close to 1 means that there is almost no association and the co-occurrence may be due to chance. For example, {marital-status=Married-spouse-absent} => {capital-loss=None} has a support above 1% and a confidence of 96%, so capital-loss is very likely to be None when marital status is Married-spouse-absent. But its lift is only 1.012: capital-loss=None already holds for 95% of all transactions, so the rule tells us almost nothing new.
Because confidence(X => Y) generally differs from confidence(Y => X), confidence is a directed measure, and a high confidence alone does not show that X and Y are associated. Lift, in contrast, has no direction: lift(X => Y) = lift(Y => X). Here more than three quarters of the rules have a lift above 1 (the first quartile is 1.0059), which indicates a positive association between lhs and rhs, although for many rules it is weak (the median lift is 1.0398).
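The difference between the directed confidence and the undirected lift can be checked on the rule {education=Doctorate} => {race=White}. The base R sketch below uses the counts from the output above; the count of education=Doctorate is not printed directly, so it is derived here from the reported coverage (an assumption of this sketch), and the variable names are illustrative.

```r
n <- 48842              # number of transactions
count_both <- 526       # education=Doctorate and race=White (rule count)
count_white <- 41762    # race=White (count of the rule with empty lhs)

# Derived: coverage of the rule times n gives the count of education=Doctorate
count_doctorate <- round(0.01216166 * n)

# Confidence depends on the direction of the rule
conf_doctorate_to_white <- count_both / count_doctorate
conf_white_to_doctorate <- count_both / count_white

# Lift is the same in both directions
lift_doctorate_to_white <- conf_doctorate_to_white / (count_white / n)
lift_white_to_doctorate <- conf_white_to_doctorate / (count_doctorate / n)

print(count_doctorate)                     # 594
print(round(conf_doctorate_to_white, 4))   # 0.8855
print(round(conf_white_to_doctorate, 4))   # 0.0126
print(round(lift_doctorate_to_white, 4))   # 1.0356
print(round(lift_white_to_doctorate, 4))   # 1.0356
```

The reverse rule has a confidence of about 1.3% and would never pass the 80% threshold, yet its lift is identical.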
aParam@minlen <- 2L
association.rule <- apriori(Adult, aParam)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 488
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(Adult, aParam): Mining stopped (maxlen reached). Only
## patterns up to a length of 10 returned!
## done [0.85s].
## writing ... [197367 rule(s)] done [0.05s].
## creating S4 object ... done [0.11s].
inspect(head(association.rule, 10))
## lhs rhs support confidence coverage lift count
## [1] {education=5th-6th} => {capital-loss=None} 0.01009377 0.9685658 0.01042136 1.0160372 493
## [2] {education=Doctorate} => {race=White} 0.01076942 0.8855219 0.01216166 1.0356463 526
## [3] {education=Doctorate} => {capital-loss=None} 0.01076942 0.8855219 0.01216166 0.9289231 526
## [4] {marital-status=Married-spouse-absent} => {capital-gain=None} 0.01218214 0.9474522 0.01285779 1.0327730 595
## [5] {marital-status=Married-spouse-absent} => {capital-loss=None} 0.01240735 0.9649682 0.01285779 1.0122632 606
## [6] {education=12th} => {native-country=United-States} 0.01140412 0.8477930 0.01345154 0.9446958 557
## [7] {education=12th} => {capital-gain=None} 0.01289873 0.9589041 0.01345154 1.0452562 630
## [8] {education=12th} => {capital-loss=None} 0.01322632 0.9832572 0.01345154 1.0314487 646
## [9] {education=9th} => {race=White} 0.01250973 0.8082011 0.01547848 0.9452171 611
## [10] {education=9th} => {capital-gain=None} 0.01457762 0.9417989 0.01547848 1.0266107 712
Interpretation: Some of the rules have an empty lhs. Their support is above the threshold and their confidence is above 80%, but their confidence simply equals the support of the rhs and their lift is exactly 1, so they carry no information. We remove these rules by setting the minlen parameter to 2, which is the minimum number of items (lhs plus rhs) that a rule must contain. The four rules with an empty lhs disappear, leaving 197367 rules.
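Why a rule with an empty lhs always has a lift of 1 follows directly from the definitions. A minimal base R sketch for {} => {race=White}, using the count from the output above (variable names are illustrative):

```r
n <- 48842
count_white <- 41762

support_rhs <- count_white / n

# An empty lhs is contained in every transaction, so its coverage is 1
coverage <- 1
support_rule <- support_rhs            # lhs adds no restriction
confidence <- support_rule / coverage  # equals the support of the rhs
lift <- confidence / support_rhs       # always exactly 1

print(round(support_rule, 7))  # 0.8550428
print(round(confidence, 7))    # 0.8550428
print(lift)                    # 1
```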
15: Create a new rule as “capital.gain.rhs.rule” with “capital-gain=None” in the RHS with support of 1%, confidence of 80%, maximum length of 10 and minimum length of 2.
capital.gain.rhs.rule <- apriori(Adult, parameter = list(support=0.01, confidence = 0.8, minlen = 2, maxlen = 10),
appearance = list(rhs = c("capital-gain=None"), default="lhs"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 488
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(Adult, parameter = list(support = 0.01, confidence = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.85s].
## writing ... [35433 rule(s)] done [0.01s].
## creating S4 object ... done [0.03s].
summary(capital.gain.rhs.rule)
## set of 35433 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 60 706 3062 7110 9790 8377 4537 1508 283
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.212 7.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01001 Min. :0.8000 Min. :0.01009 Min. :0.8720
## 1st Qu.:0.01265 1st Qu.:0.9015 1st Qu.:0.01359 1st Qu.:0.9827
## Median :0.01744 Median :0.9453 Median :0.01882 Median :1.0304
## Mean :0.02819 Mean :0.9287 Mean :0.03051 Mean :1.0124
## 3rd Qu.:0.02864 3rd Qu.:0.9654 3rd Qu.:0.03092 3rd Qu.:1.0523
## Max. :0.87066 Max. :1.0000 Max. :0.95328 Max. :1.0901
## count
## Min. : 489
## 1st Qu.: 618
## Median : 852
## Mean : 1377
## 3rd Qu.: 1399
## Max. :42525
##
## mining info:
## data ntransactions support confidence
## Adult 48842 0.01 0.8
## call
## apriori(data = Adult, parameter = list(support = 0.01, confidence = 0.8, minlen = 2, maxlen = 10), appearance = list(rhs = c("capital-gain=None"), default = "lhs"))
inspect(head(capital.gain.rhs.rule))
## lhs rhs support confidence coverage lift count
## [1] {marital-status=Married-spouse-absent} => {capital-gain=None} 0.01218214 0.9474522 0.01285779 1.032773 595
## [2] {education=12th} => {capital-gain=None} 0.01289873 0.9589041 0.01345154 1.045256 630
## [3] {education=9th} => {capital-gain=None} 0.01457762 0.9417989 0.01547848 1.026611 712
## [4] {education=7th-8th} => {capital-gain=None} 0.01822202 0.9319372 0.01955284 1.015861 890
## [5] {native-country=Mexico} => {capital-gain=None} 0.01871340 0.9610936 0.01947095 1.047643 914
## [6] {occupation=Protective-serv} => {capital-gain=None} 0.01863151 0.9257375 0.02012612 1.009103 910
Interpretation: To restrict the rules mined by Apriori, we can use the appearance argument, which controls where items may appear in a rule. Here rhs is set to capital-gain=None and default is set to "lhs", so all other items may appear only in the lhs. Every one of the 35433 rules therefore has capital-gain=None as its rhs. Some of these rules have a lift greater than 1, which means that lhs and rhs co-occur more often than expected under independence. For example: if education is 12th, then capital-gain is None (lift 1.045). The association is positive but weak: no rule in this set has a lift above 1.0901, because capital-gain=None is already present in about 92% of all transactions.
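The narrow lift range of these rules can be derived from the definition lift = confidence / support(rhs). With the rhs fixed and confidence restricted to the interval from 0.8 to 1, lift is bounded on both sides. A base R sketch using the count of capital-gain=None from the earlier output (variable names are illustrative):

```r
n <- 48842
support_rhs <- 44807 / n   # support of capital-gain=None
min_conf <- 0.8            # confidence threshold used for mining

# lift = confidence / support(rhs), and confidence lies in [min_conf, 1]
max_lift <- 1 / support_rhs
min_lift <- min_conf / support_rhs

print(round(support_rhs, 4))  # 0.9174
print(round(max_lift, 4))     # 1.0901, the maximum lift in the summary
print(round(min_lift, 4))     # 0.872, the minimum lift in the summary
```

A very frequent rhs therefore cannot produce high-lift rules, however high the confidence.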
hours.per.week.ft.rule <- apriori(Adult, parameter = list(support=0.01, confidence = 0.8, minlen = 2, maxlen = 10),
appearance = list(rhs = c("hours-per-week=Full-time"), default="lhs"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 488
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.06s].
## sorting and recoding items ... [67 item(s)] done [0.01s].
## creating transaction tree ... done [0.04s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(Adult, parameter = list(support = 0.01, confidence = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.85s].
## writing ... [159 rule(s)] done [0.01s].
## creating S4 object ... done [0.02s].
summary(hours.per.week.ft.rule)
## set of 159 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6 7 8
## 3 16 48 58 29 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.686 6.000 8.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01001 Min. :0.8000 Min. :0.01216 Min. :1.367
## 1st Qu.:0.01066 1st Qu.:0.8047 1st Qu.:0.01318 1st Qu.:1.375
## Median :0.01179 Median :0.8086 Median :0.01456 Median :1.382
## Mean :0.01237 Mean :0.8089 Mean :0.01530 Mean :1.383
## 3rd Qu.:0.01331 3rd Qu.:0.8129 3rd Qu.:0.01654 3rd Qu.:1.389
## Max. :0.01992 Max. :0.8266 Max. :0.02471 Max. :1.413
## count
## Min. :489.0
## 1st Qu.:520.5
## Median :576.0
## Mean :604.4
## 3rd Qu.:650.0
## Max. :973.0
##
## mining info:
## data ntransactions support confidence
## Adult 48842 0.01 0.8
## call
## apriori(data = Adult, parameter = list(support = 0.01, confidence = 0.8, minlen = 2, maxlen = 10), appearance = list(rhs = c("hours-per-week=Full-time"), default = "lhs"))
inspect(head(hours.per.week.ft.rule))
## lhs rhs support confidence coverage lift count
## [1] {occupation=Machine-op-inspct,
## sex=Female} => {hours-per-week=Full-time} 0.01324680 0.8047264 0.01646124 1.375387 647
## [2] {occupation=Adm-clerical,
## race=Black} => {hours-per-week=Full-time} 0.01224356 0.8102981 0.01510995 1.384910 598
## [3] {occupation=Adm-clerical,
## relationship=Unmarried} => {hours-per-week=Full-time} 0.01756685 0.8003731 0.02194832 1.367947 858
## [4] {workclass=Private,
## occupation=Machine-op-inspct,
## sex=Female} => {hours-per-week=Full-time} 0.01293968 0.8092190 0.01599034 1.383066 632
## [5] {occupation=Machine-op-inspct,
## sex=Female,
## capital-gain=None} => {hours-per-week=Full-time} 0.01281684 0.8067010 0.01588797 1.378762 626
## [6] {occupation=Machine-op-inspct,
## sex=Female,
## capital-loss=None} => {hours-per-week=Full-time} 0.01296016 0.8073980 0.01605176 1.379953 633
Interpretation: Here too the appearance argument restricts the rhs, this time to hours-per-week=Full-time. Setting default to "lhs" means that any combination of the other items may appear in the lhs while the rhs is fixed. The summary shows 159 rules; the shortest contain 3 items and the longest contain 8. Inspecting the rules shows each rule together with its support, confidence and lift. Combining the parameter and appearance arguments of apriori() gives the flexibility to filter the rules that are created.
conf.sort.rule <- sort(hours.per.week.ft.rule, by = "confidence", decreasing = TRUE)
inspect(head(conf.sort.rule))
## lhs rhs support confidence coverage lift count
## [1] {age=Middle-aged,
## occupation=Adm-clerical,
## relationship=Unmarried,
## sex=Female,
## capital-gain=None} => {hours-per-week=Full-time} 0.01005282 0.8265993 0.01216166 1.412771 491
## [2] {age=Middle-aged,
## occupation=Adm-clerical,
## relationship=Unmarried,
## capital-gain=None} => {hours-per-week=Full-time} 0.01066705 0.8243671 0.01293968 1.408956 521
## [3] {age=Middle-aged,
## occupation=Adm-clerical,
## relationship=Unmarried,
## capital-gain=None,
## capital-loss=None} => {hours-per-week=Full-time} 0.01042136 0.8236246 0.01265304 1.407687 509
## [4] {age=Middle-aged,
## relationship=Unmarried,
## race=Black,
## sex=Female,
## capital-gain=None} => {hours-per-week=Full-time} 0.01029851 0.8218954 0.01253020 1.404732 503
## [5] {age=Middle-aged,
## workclass=Private,
## education=HS-grad,
## race=Black,
## capital-gain=None,
## capital-loss=None} => {hours-per-week=Full-time} 0.01148602 0.8201754 0.01400434 1.401792 561
## [6] {age=Middle-aged,
## education=HS-grad,
## occupation=Adm-clerical,
## sex=Female,
## capital-gain=None,
## capital-loss=None} => {hours-per-week=Full-time} 0.01031899 0.8195122 0.01259162 1.400658 504
inspect(tail(conf.sort.rule))
## lhs rhs support confidence coverage lift count
## [1] {occupation=Adm-clerical,
## relationship=Unmarried} => {hours-per-week=Full-time} 0.01756685 0.8003731 0.02194832 1.367947 858
## [2] {workclass=Private,
## occupation=Adm-clerical,
## relationship=Unmarried,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## native-country=United-States} => {hours-per-week=Full-time} 0.01033946 0.8003170 0.01291921 1.367851 505
## [3] {occupation=Adm-clerical,
## relationship=Unmarried,
## sex=Female,
## capital-gain=None,
## capital-loss=None,
## income=small} => {hours-per-week=Full-time} 0.01042136 0.8003145 0.01302158 1.367847 509
## [4] {occupation=Machine-op-inspct,
## sex=Female,
## capital-gain=None,
## native-country=United-States} => {hours-per-week=Full-time} 0.01031899 0.8000000 0.01289873 1.367309 504
## [5] {occupation=Machine-op-inspct,
## sex=Female,
## capital-loss=None,
## native-country=United-States} => {hours-per-week=Full-time} 0.01040088 0.8000000 0.01300111 1.367309 508
## [6] {workclass=Private,
## education=HS-grad,
## race=Black,
## capital-gain=None,
## native-country=United-States,
## income=small} => {hours-per-week=Full-time} 0.01171123 0.8000000 0.01463904 1.367309 572
Interpretation: The rules in hours.per.week.ft.rule are sorted in descending order of confidence, so the rule with the highest confidence is printed first. The higher the confidence of a rule, the more often its rhs holds when its lhs holds.
For example, take the first rule, with
lhs = {age=Middle-aged, occupation=Adm-clerical, relationship=Unmarried, sex=Female, capital-gain=None}
rhs = {hours-per-week=Full-time}
confidence = P(rhs | lhs) = P(lhs ∩ rhs) / P(lhs) = 0.01005282 / 0.01216166 ≈ 0.8266, the highest confidence in this rule set. The lift of this rule is 1.412771 (greater than 1).
This implies a positive association between lhs and rhs. The rule can be rephrased as: if age is Middle-aged, occupation is Adm-clerical, relationship is Unmarried, sex is Female and capital-gain is None, then hours-per-week is Full-time in about 83% of cases, roughly 1.4 times the rate in the data as a whole.
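Sorting by confidence amounts to ordering a table of quality measures by one column. The base R sketch below mimics this with three rules copied from the output above (the data frame and its id labels are illustrative, not arules objects). It also shows that confidence divided by lift gives the same value for every rule, because all rules share the rhs hours-per-week=Full-time; that value is the implied support of the rhs.

```r
# Confidence and lift of three rules from the inspect() output above
rules <- data.frame(
  id = c("a", "b", "c"),
  confidence = c(0.8000000, 0.8265993, 0.8243671),
  lift = c(1.367309, 1.412771, 1.408956),
  stringsAsFactors = FALSE
)

# Sort in descending order of confidence
sorted <- rules[order(rules$confidence, decreasing = TRUE), ]

# lift = confidence / support(rhs), so confidence / lift = support(rhs)
implied_rhs_support <- sorted$confidence / sorted$lift

print(sorted$id)                      # "b" "c" "a"
print(round(implied_rhs_support, 4))  # 0.5851 for every rule
```

Since the rhs support is constant, ranking these rules by confidence and ranking them by lift give the same order.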
library(arulesViz)
plot(hours.per.week.ft.rule)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(hours.per.week.ft.rule, method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(hours.per.week.ft.rule, engine = "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(hours.per.week.ft.rule, method = "graph", engine = "htmlwidget")
plot(hours.per.week.ft.rule, method="paracoord")
Interpretations:
Scatter plot: By default, plot() applied to a set of rules draws a scatter plot with support on the x-axis and confidence on the y-axis. Each dot represents one rule, and its color encodes the lift: the darker the dot, the higher the lift. The plot shows that rules with higher confidence have higher lift (all rules share the same rhs, so lift is proportional to confidence) but lower support. If the support threshold were raised above 1.5%, we would probably miss the rules with the highest confidence and lift, so the support and confidence thresholds should be chosen carefully.
Two-key plot: With method = "two-key plot", support is again plotted against confidence, but the color now encodes the order of the rule, i.e. the number of items it contains. In the plot, orange marks the rules of order 3.
Interactive plot using plotly: The two plots above are static. With engine = "plotly", arulesViz produces an interactive plot that supports zooming, auto-scaling and resetting the axes. Hovering over a point shows the quality measures of that rule. Support and confidence are plotted on the x- and y-axes respectively, and the points are colored by lift.
Graph plot: The graph plot visualizes the rules with nodes and arrows. A node represents the rule number, and the lhs and rhs items are shown in rectangular boxes. The direction of a rule is shown by arrows that run from the lhs to the rhs. If the number of rules is very large, the graph becomes practically impossible to read, so the plot provides options to filter by lhs and by rule. The support, confidence and lift of a rule can be displayed by selecting the edge that connects its lhs and rhs.
Parallel coordinates plot: This plot shows how the items of each rule lead to its rhs. The items are on the y-axis and the positions of the items within the rule are on the x-axis, so each rule is drawn as a line that passes through its lhs items, position by position, and ends at the rhs. For example, for the rule
{age=Middle-aged, occupation=Adm-clerical, relationship=Unmarried, sex=Female, capital-gain=None} => {hours-per-week=Full-time}
age=Middle-aged is at position 1, occupation=Adm-clerical at position 2, relationship=Unmarried at position 3, sex=Female at position 4 and capital-gain=None at position 5, and the line ends at the rhs, hours-per-week=Full-time.
Parallel coordinates plots make it easier to see which items have led to the formation of each rule.