Association rule learning is a data analysis technique in machine learning that reveals relationships between variables in a large dataset. The basic idea is to find common patterns of elements that occur together in transactions or records. These patterns are then used to generate rules, called association rules, that suggest a relationship between elements. The use of association rule learning is significant in data mining as it provides insights into the connections between variables in the data. This information can then be used to make informed choices and forecast future trends. By discovering hidden patterns in the data, organizations can also improve their efficiency and make more informed decisions.
The following analysis is an example of the use of association rules in practice.
The data used in the analysis was taken from Kaggle: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
The original dataset named ‘adult’ came from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/adult
Before cleaning, the data contains 15 variables (6 continuous and 9 nominal/ordinal) and 48842 observations. The database contains information on the individuals income and personal information. These include age, gender, race, country of origin, marital status and others. All variables and their descriptions can be found in the tables below.
| Variable | Description |
|---|---|
| age | age of respondent |
| fnlwgt | represents final weight, which is the number of units in the target population that the responding unit represents |
| education.num | number of years of education in total |
| capital.gain | income from investment sources other than wage/salary |
| capital.loss | losses from investment sources |
| hours.per.week | number of hours worked per week |
| Variable | Description |
|---|---|
| workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
| education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
| marital.status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
| occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
| relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
| race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
| gender | Female, Male |
| native.country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
| income | >50K, <=50K |
data<-read.csv("adult.csv", header=TRUE, sep=",")
head(data)In order to apply association rules methods, variables need to be prepared. The data must be in transactional format. It is therefore necessary to convert continuous variables into nominal variables. We can get rid of some variables such as “fnlwgt”, “educational.num”, as they do not provide valuable information for analysis. The remaining continuous variables need to be grouped and converted into a factor variable type.
data <- select(data,-c("fnlwgt", "educational.num"))
data[!complete.cases(data),]data %>% group_by(education) %>% tally()data %>% group_by(workclass) %>% tally()data %>% group_by(marital.status) %>% tally()data %>% group_by(occupation) %>% tally()data %>% group_by(relationship) %>% tally()data %>% group_by(race) %>% tally()data %>% group_by(gender) %>% tally()data %>% group_by(native.country) %>% tally()data %>% group_by(income) %>% tally()data <- data %>% mutate(workclass = ifelse(workclass == "?", "Unknown_workclass", workclass))
data <- data %>% mutate(occupation = ifelse(occupation == "?", "Unknown_occupation", occupation))
data <- data %>% mutate(native.country = ifelse(native.country == "?", "Unknown_native.country", native.country))
str(data)'data.frame': 48842 obs. of 13 variables:
$ age : int 25 38 28 44 18 34 29 63 24 55 ...
$ workclass : chr "Private" "Private" "Local-gov" "Private" ...
$ education : chr "11th" "HS-grad" "Assoc-acdm" "Some-college" ...
$ marital.status: chr "Never-married" "Married-civ-spouse" "Married-civ-spouse" "Married-civ-spouse" ...
$ occupation : chr "Machine-op-inspct" "Farming-fishing" "Protective-serv" "Machine-op-inspct" ...
$ relationship : chr "Own-child" "Husband" "Husband" "Husband" ...
$ race : chr "Black" "White" "White" "Black" ...
$ gender : chr "Male" "Male" "Male" "Male" ...
$ capital.gain : int 0 0 0 7688 0 0 0 3103 0 0 ...
$ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hours.per.week: int 40 50 40 40 30 30 40 32 40 10 ...
$ native.country: chr "United-States" "United-States" "United-States" "United-States" ...
$ income : chr "<=50K" "<=50K" ">50K" ">50K" ...
ggplot(data, aes(age)) +
geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)max(data$age)[1] 90
min(data$age)[1] 17
data$age <- cut(data$age,
breaks = c(0,24,34,44,54,64,74,100),
labels = c("17-24","25-34","35-44","45-54","55-64","65-74","75+"),
ordered_result = T)
class(data$age)[1] "ordered" "factor"
data %>% group_by(age) %>% tally()ggplot(data, aes(hours.per.week)) +
geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)max(data$hours.per.week)[1] 99
min(data$hours.per.week)[1] 1
table(data$hours.per.week)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
27 53 59 84 95 92 45 218 27 425 20 247 28 55 623 303 42 129 19 1862
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
46 62 40 354 958 40 43 140 15 1700 12 423 61 48 1937 336 242 714 63 22803
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
59 338 227 310 2717 129 82 770 39 4246 20 205 39 62 1051 141 19 38 7 2177
61 62 63 64 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81
4 23 15 22 355 23 6 16 1 437 107 4 3 105 4 9 13 1 210 3
82 84 85 86 87 88 89 90 91 92 94 95 96 97 98 99
1 72 17 4 1 4 3 42 3 3 1 2 9 2 14 137
data$hours.per.week <- cut(data$hours.per.week,
breaks = c(0,31,40,60,100),
labels = c("Working_part-time","Working_full-time","Working_long-hours","Possible_workaholic"),
ordered_result = T)
class(data$hours.per.week)[1] "ordered" "factor"
data %>% group_by(hours.per.week) %>% tally()ggplot(data, aes(capital.gain)) +
geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)max(data$capital.gain)[1] 99999
min(data$capital.gain)[1] 0
table(data$capital.gain)
0 114 401 594 914 991 1055 1086 1111 1151 1173 1264 1409 1424 1455 1471 1506 1639 1731 1797
44807 8 5 52 10 6 37 8 1 13 5 2 10 4 4 9 24 1 1 10
1831 1848 2009 2036 2050 2062 2105 2174 2176 2202 2228 2290 2329 2346 2354 2387 2407 2414 2463 2538
9 9 3 5 5 3 15 74 31 28 5 10 7 8 21 1 25 10 15 5
2580 2597 2635 2653 2829 2885 2907 2936 2961 2964 2977 2993 3103 3137 3273 3325 3411 3418 3432 3456
20 31 14 11 42 30 18 4 4 14 11 3 152 51 7 81 34 8 4 6
3464 3471 3674 3781 3818 3887 3908 3942 4064 4101 4386 4416 4508 4650 4687 4787 4865 4931 4934 5013
33 11 22 16 11 8 42 18 54 29 108 24 23 63 4 35 25 4 10 117
5060 5178 5455 5556 5721 6097 6360 6418 6497 6514 6612 6723 6767 6849 7262 7298 7430 7443 7688 7896
2 146 18 6 7 2 3 16 15 10 1 5 6 42 1 364 15 7 410 4
7978 8614 9386 9562 10520 10566 10605 11678 13550 14084 14344 15020 15024 15831 18481 20051 22040 25124 25236 27828
2 82 31 5 64 8 19 4 42 49 34 10 513 8 2 49 1 6 14 58
34095 41310 99999
6 3 244
x <- data %>% filter(capital.gain > 0)
ggplot(x, aes(capital.gain)) +
geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)max(x$capital.gain)[1] 99999
min(x$capital.gain)[1] 114
table(x$capital.gain)
114 401 594 914 991 1055 1086 1111 1151 1173 1264 1409 1424 1455 1471 1506 1639 1731 1797 1831
8 5 52 10 6 37 8 1 13 5 2 10 4 4 9 24 1 1 10 9
1848 2009 2036 2050 2062 2105 2174 2176 2202 2228 2290 2329 2346 2354 2387 2407 2414 2463 2538 2580
9 3 5 5 3 15 74 31 28 5 10 7 8 21 1 25 10 15 5 20
2597 2635 2653 2829 2885 2907 2936 2961 2964 2977 2993 3103 3137 3273 3325 3411 3418 3432 3456 3464
31 14 11 42 30 18 4 4 14 11 3 152 51 7 81 34 8 4 6 33
3471 3674 3781 3818 3887 3908 3942 4064 4101 4386 4416 4508 4650 4687 4787 4865 4931 4934 5013 5060
11 22 16 11 8 42 18 54 29 108 24 23 63 4 35 25 4 10 117 2
5178 5455 5556 5721 6097 6360 6418 6497 6514 6612 6723 6767 6849 7262 7298 7430 7443 7688 7896 7978
146 18 6 7 2 3 16 15 10 1 5 6 42 1 364 15 7 410 4 2
8614 9386 9562 10520 10566 10605 11678 13550 14084 14344 15020 15024 15831 18481 20051 22040 25124 25236 27828 34095
82 31 5 64 8 19 4 42 49 34 10 513 8 2 49 1 6 14 58 6
41310 99999
3 244
data$capital.gain <- cut(data$capital.gain,
breaks = c(-1,0,median(x$capital.gain),max(x$capital.gain)),
labels = c("0_capital_gain","below_median_capital_gain","above_median_capital_gain"),
ordered_result = T)
class(data$capital.gain)[1] "ordered" "factor"
data %>% group_by(capital.gain) %>% tally()y <- data %>% filter(capital.loss > 0)
ggplot(data, aes(capital.loss)) +
geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)max(data$capital.loss)[1] 4356
min(data$capital.loss)[1] 0
table(data$capital.loss)
0 155 213 323 419 625 653 810 880 974 1092 1138 1258 1340 1380 1408 1411 1421 1429 1485
46560 1 5 5 3 17 4 2 6 2 11 4 6 11 10 35 4 1 3 71
1504 1510 1539 1564 1573 1579 1590 1594 1602 1617 1628 1648 1651 1668 1669 1672 1719 1721 1726 1735
26 3 1 43 12 30 62 9 62 11 24 3 11 9 35 50 38 28 9 3
1740 1741 1755 1762 1816 1825 1844 1848 1870 1876 1887 1902 1911 1944 1974 1977 1980 2001 2002 2042
58 44 2 20 4 5 3 67 1 59 233 304 1 3 28 253 36 35 33 12
2051 2057 2080 2129 2149 2163 2174 2179 2201 2205 2206 2231 2238 2246 2258 2267 2282 2339 2352 2377
29 16 1 7 5 2 10 20 1 19 6 7 4 8 39 3 2 27 2 25
2392 2415 2444 2457 2465 2467 2472 2489 2547 2559 2603 2754 2824 3004 3175 3683 3770 3900 4356
11 72 20 4 1 2 4 1 5 17 7 2 14 5 2 2 4 2 3
data$capital.loss <- cut(data$capital.loss,
breaks = c(-1,0,median(y$capital.loss),max(y$capital.loss)),
labels = c("0_capital_loss","below_median_capital_loss","above_median_capital_loss"),
ordered_result = T)
class(data$capital.loss)[1] "ordered" "factor"
data %>% group_by(capital.loss) %>% tally()Due to the asymmetric distribution of the variable capital.gain and capital.loss, the median instead of the average of all non-zero income/losses respectively was used in determining the new levels.
summary(data) age workclass education marital.status occupation relationship
17-24: 8432 Length:48842 Length:48842 Length:48842 Length:48842 Length:48842
25-34:12577 Class :character Class :character Class :character Class :character Class :character
35-44:12193 Mode :character Mode :character Mode :character Mode :character Mode :character
45-54: 8771
55-64: 4782
65-74: 1642
75+ : 445
race gender capital.gain capital.loss
Length:48842 Length:48842 0_capital_gain :44807 0_capital_loss :46560
Class :character Class :character below_median_capital_gain: 2345 below_median_capital_loss: 1166
Mode :character Mode :character above_median_capital_gain: 1690 above_median_capital_loss: 1116
hours.per.week native.country income
Working_part-time : 7863 Length:48842 Length:48842
Working_full-time :26627 Class :character Class :character
Working_long-hours :12676 Mode :character Mode :character
Possible_workaholic: 1676
trans <- transactions(data)Warning: Column(s) 2, 3, 4, 5, 6, 7, 8, 12, 13 not logical or factor. Applying default discretization (see '? discretizeDF').
length(trans)[1] 48842
LIST(head(trans, 2))$`1`
[1] "age=25-34" "workclass=Private" "education=11th"
[4] "marital.status=Never-married" "occupation=Machine-op-inspct" "relationship=Own-child"
[7] "race=Black" "gender=Male" "capital.gain=0_capital_gain"
[10] "capital.loss=0_capital_loss" "hours.per.week=Working_full-time" "native.country=United-States"
[13] "income=<=50K"
$`2`
[1] "age=35-44" "workclass=Private" "education=HS-grad"
[4] "marital.status=Married-civ-spouse" "occupation=Farming-fishing" "relationship=Husband"
[7] "race=White" "gender=Male" "capital.gain=0_capital_gain"
[10] "capital.loss=0_capital_loss" "hours.per.week=Working_long-hours" "native.country=United-States"
[13] "income=<=50K"
The finished data prepared for further analysis was saved under the name ‘trans’. The number of transactions is 48842, with a fixed length of 13.
head(sort(itemFrequency(trans, type="absolute"), decreasing = T), 20) capital.loss=0_capital_loss capital.gain=0_capital_gain native.country=United-States
46560 44807 43832
race=White income=<=50K workclass=Private
41762 37155 33906
gender=Male hours.per.week=Working_full-time marital.status=Married-civ-spouse
32650 26627 22379
relationship=Husband gender=Female marital.status=Never-married
19716 16192 16117
education=HS-grad hours.per.week=Working_long-hours relationship=Not-in-family
15784 12676 12583
age=25-34 age=35-44 income=>50K
12577 12193 11687
education=Some-college age=45-54
10878 8771
head(sort(itemFrequency(trans, type="absolute"), decreasing = F), 20) native.country=Holand-Netherlands workclass=Never-worked
1 10
occupation=Armed-Forces native.country=Hungary
15 19
native.country=Honduras workclass=Without-pay
20 21
native.country=Scotland native.country=Laos
21 23
native.country=Outlying-US(Guam-USVI-etc) native.country=Yugoslavia
23 23
native.country=Trinadad&Tobago native.country=Cambodia
27 28
native.country=Hong native.country=Thailand
30 30
marital.status=Married-AF-spouse native.country=Ireland
37 37
native.country=France native.country=Ecuador
38 45
native.country=Peru native.country=Greece
46 49
itemFrequencyPlot(trans, topN=20, type="relative", main="Frequency plot")The chart above shows the most frequently occurring values in the dataset. As you can see, there are several values that appear in almost every ‘transaction’. For example, the frequency of capital.loss=0 is over 90%. This should not come as a surprise, as it was already visible in the data analysis. The same is true for the values native.country=United-States and race=White. Again, this is not surprising as the data relates to people from the United States, where white people made up the vast majority of the population at the time of data collection.
It is therefore to be expected that these most common values will appear with the vast majority of rules.
Support of an association rule is the percentage of groups containing all elements in the rule, calculated from the total number of groups examined. It represents the frequency of occurrence of the rule’s body and head. Setting a minimum support level when mining can produce more significant results and control the number of rules generated.
Confidence measures the reliability of an association rule by indicating the frequency at which the head of the rule appears in groups containing the body of the rule. A higher confidence value means the rule is more reliable. Mining settings can include a minimum confidence threshold to produce more significant results and limit the number of generated rules.
The lift of an association rule indicates its significance by comparing the actual confidence of the rule to its expected confidence, which is calculated as the support of the head divided by the product of the body and head supports. A lift value greater than 1 means the body positively affects the head, a value less than 1 means a negative effect, and a value near 1 means no effect.
Apriori is an algorithm for discovering frequent item sets and association rules in databases. It starts with individual items with high frequency and builds larger sets while maintaining sufficient frequency. The frequent item sets produced by Apriori reveal general trends in the database.
Given the usefulness of the apriori algorithm and its popularity, it is this algorithm that will be used to further analyse association rules.
rules<-apriori(trans, parameter=list(minlen=2))Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 4884
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[121 item(s), 48842 transaction(s)] done [0.03s].
sorting and recoding items ... [32 item(s)] done [0.01s].
creating transaction tree ... done [0.02s].
checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.09s].
writing ... [7067 rule(s)] done [0.01s].
creating S4 object ... done [0.02s].
summary(rules)set of 7067 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5 6 7 8 9
135 718 1724 2245 1559 577 102 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 4.000 5.000 4.927 6.000 9.000
summary of quality measures:
support confidence coverage lift count
Min. :0.1000 Min. :0.8001 Min. :0.1003 Min. :0.8722 Min. : 4885
1st Qu.:0.1150 1st Qu.:0.8891 1st Qu.:0.1245 1st Qu.:1.0001 1st Qu.: 5615
Median :0.1360 Median :0.9267 Median :0.1473 Median :1.0278 Median : 6641
Mean :0.1692 Mean :0.9244 Mean :0.1839 Mean :1.2065 Mean : 8263
3rd Qu.:0.1876 3rd Qu.:0.9631 3rd Qu.:0.2052 3rd Qu.:1.1480 3rd Qu.: 9163
Max. :0.8707 Max. :1.0000 Max. :0.9533 Max. :2.7453 Max. :42525
mining info:
Basic parameters of the apriori function: support = 0.1 and confidence = 0.8.
rules.supp<-sort(rules, by="support", decreasing=TRUE)
inspectDT(rules.supp[is.significant(rules.supp, trans, alpha=0.05)])By sorting the resulting rules by the highest support value, we obtain similar conclusions to those reached by frequency chart analysis. The expected combinations race=White and native.country=United-States, or in combination with capital.loss=0 have the highest support value. This means that these connections have the highest number of rules obtained. It is also worth mentioning that only rules that are statistically significant at the 5% significance level were selected using the is.significant function.
rules.conf<-sort(rules, by="confidence", decreasing=TRUE)
inspectDT(rules.conf[is.significant(rules.conf, trans, alpha=0.05)])Sorting by highest confidence values already yields slightly different conclusions. A value of one for relationship=Husband and gender=Male should not come as a surprise. This means that when a person has the value relationship=Husband he is also male. There is a similar relationship between relationship=Husband and marital.status=Married for obvious reasons. However, there are many other less obvious and interesting connections that are worth exploring.
rules.lift<-sort(rules, by="lift", decreasing=TRUE)
inspectDT(rules.lift[is.significant(rules.lift, trans, alpha=0.05)])As described earlier, lift values greater than 1 indicate a positive effect of itemset 𝑋 (body of the rule) on 𝑌 (head of the rule). The highest lift values for the analysed dataset are reached by rules describing the impact of, for example, relationship=Own-child on marital.status=Never-married. Or another obvious example like the influence of marital.status=Married-civ-spouse and gender=Male on {relationship=Husband}.
A properly prepared dataset, together with the association rules detection techniques used, provides a wealth of knowledge about the units under investigation. Thanks to a large number of variables with multiple levels, we are able to investigate the most diverse connections between specific values. It would be impossible to prepare and describe all the interesting connections, hence I limited the selection to the study of rules related to income=>50K and native.country=Poland.
I was first interested in the variables and their values that had the greatest impact on income=>50K.
rules.income<-apriori(data=trans, parameter=list(supp=0.1,conf = 0.4), appearance=list(default="lhs", rhs="income=>50K"), control=list(verbose=F))
rules.income.byconf<-sort(rules.income, by="confidence", decreasing=TRUE)
inspectDT(rules.income.byconf)By placing income=>50K to the right of the rules and adjusting the parameters accordingly, it was possible to obtain 102 rules with a minimum confidence of 0.4 and support = 0.1. For the data analysed, it can be concluded that married white US citizens have the highest probability of high earnings. This rule of thumb reaches a confidence level of 0.464. A good indicator may also be the value of the variable relationship=Husband because it also reaches similar confidence scores. In other rules, variables such as hours.per.week=Working_long-hours and capital.gain=0_capital_gain also appear, which seems reasonable.
The second interesting idea from my perspective to explore was to check the rules determining the impact of the native.country=Poland variable on all the others.
rules.Poland<-apriori(data=trans, parameter=list(supp=0.00001, conf = 0.1, minlen = 2),
appearance=list(default="rhs", lhs="native.country=Poland"), control=list(verbose=F))
rules.Poland.byconf<-sort(rules.Poland, by="confidence", decreasing=TRUE)
inspectDT(rules.Poland.byconf)Due to the very small number of people whose native country was Poland, support had to be reduced to as low as 0.00001 in order to comply with the 26 rules. Again, the rules were sorted due to the highest confidence. Unsurprisingly, the highest relationship was achieved between {native.country=Poland} and {race=White}. (the vast majority of Poles are White), {capital.loss=0_capital_loss, capital.gain=0_capital_gain} (significant advantage of these categories over others for the entire dataset). A very high confidence level because 0.805 was also achieved with income=<=50K. This shows that the average Pole surveyed was very likely to have low earnings. Further high values can also be read for rules where there is workclass=Private, gender=Male, hours.per.week=Working_full-time. Thus, it can be concluded that among the surveyed Poles there was a significant preponderance of men who worked privately on a full-time basis. Married people also prevailed among the surveyed Poles. They are also relatively young people (a preponderance of ages 35-44 followed by 25-34) who have graduated from high school.
plot(rules[1:100,], method="matrix", measure="lift")Itemsets in Antecedent (LHS)
[1] "{marital.status=Married-civ-spouse}" "{relationship=Husband}" "{relationship=Own-child}"
[4] "{age=17-24}" "{income=>50K}" "{occupation=Craft-repair}"
[7] "{hours.per.week=Working_part-time}" "{marital.status=Never-married}" "{marital.status=Divorced}"
[10] "{hours.per.week=Working_long-hours}" "{relationship=Not-in-family}" "{gender=Female}"
[13] "{education=HS-grad}" "{education=Some-college}" "{relationship=Unmarried}"
[16] "{occupation=Adm-clerical}" "{occupation=Sales}" "{age=25-34}"
[19] "{occupation=Exec-managerial}" "{age=35-44}" "{age=45-54}"
[22] "{education=Bachelors}" "{occupation=Prof-specialty}"
Itemsets in Consequent (RHS)
[1] "{capital.loss=0_capital_loss}" "{capital.gain=0_capital_gain}" "{race=White}"
[4] "{native.country=United-States}" "{income=<=50K}" "{gender=Male}"
[7] "{marital.status=Married-civ-spouse}" "{relationship=Husband}" "{marital.status=Never-married}"
plot(rules, measure=c("support","confidence"), shading="lift")To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules[1:20], method="grouped")plot(rules.income, measure=c("support","confidence"), shading="lift")plot(rules.income[1:20,], method="graph")plot(rules.income, method="paracoord", control=list(reorder=TRUE))set.seed(123)
plot(rules.income, method="graph", measure="support", shading="lift", engine="html")Warning: Too many rules supplied. Only plotting the best 100 using ‘lift’ (change control parameter max if needed).
The above charts are a valuable addition to the analysis of association rules. Using the example of the rules with income=>50k, you can perfectly see in the graphs the connections to other variables and their values. Graphical analysis provides the same information as a thorough review of the tables of metrics, hence its conclusions are analogous to those described earlier. However, many times graphical analysis provides a better and easier way to see the complexity and strength of connections between variables and their levels.
As stated earlier association rule learning is a method in machine learning used to uncover relationships between variables in a dataset. The technique involves finding common patterns in elements that occur together and generating rules that suggest connections between elements. The method was used in a study and proved to be effective, providing valuable insights into the connections within the data.
By using the Apriori algorithm, valuable information was obtained regarding the relationship between high income and the other variables and their values, as well as between the native country Poland and the other variables. However, this is a small part of the available rules to explore. In further work on the analysis, it would be worthwhile to deepen the interpretations of the results with other interesting variables like race, age, marital status and education.
https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
https://www.ibm.com/docs/en/ias?topic=procedures-association-rules