Association Rules - Census Income Dataset

Dataset overview

This analysis uses the Adult dataset from the UCI Machine Learning Repository. It was extracted from the 1994 U.S. Census database and is also known as the “Census Income Dataset.” The dataset contains both categorical and numerical variables:
1. Age (numerical) – Age of the individual.
2. Workclass (categorical) – Type of employer (e.g., private, self-employed, government).
3. Fnlwgt (numerical) – Final weight (adjusted for population sampling).
4. Education (categorical) – Highest level of education attained.
5. Education-num (numerical) – Education level encoded as a numeric variable.
6. Marital-status (categorical) – Marital status (e.g., married, divorced, never married).
7. Occupation (categorical) – Job category (e.g., tech support, sales, clerical).
8. Relationship (categorical) – Relationship to household head (e.g., husband, wife, unmarried).
9. Race (categorical) – Race of the individual.
10. Sex (categorical) – Gender (Male/Female).
11. Capital-gain (numerical) – Income from capital gains.
12. Capital-loss (numerical) – Income lost due to capital losses.
13. Hours-per-week (numerical) – Weekly working hours.
14. Native-country (categorical) – Country of origin.
15. Income (categorical) – Either “<=50K” or “>50K” (annual income in $).

Goal of this paper

The primary objective of this paper is to apply association rules to find hidden relationships between demographic, occupational, and financial attributes. Association rules help identify patterns such as: Which education levels are strongly associated with high income? Which work classes frequently appear with specific occupations? Are certain marital statuses more likely to be linked with a higher number of weekly work hours? I will pay special attention to finding patterns that strongly indicate high-income and low-income individuals. This will be achieved through:

The Apriori algorithm, a foundational method for discovering frequent itemsets and generating association rules. It is known for its ability to identify these relationships in large datasets.¹
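To make the idea concrete, the level-wise search behind Apriori can be sketched in base R. This is a minimal illustration on a made-up set of five transactions, not the arules implementation used later in this paper; the item names, the helper names support_of() and apriori_sketch(), and the 0.3 threshold are all assumptions chosen for the example.

```r
# Hypothetical toy transactions (not taken from the census data)
toy <- list(
  c("Education=High", "Married", "Income=>50K"),
  c("Education=High", "Married", "Income=>50K"),
  c("Education=Medium", "Married", "Income=<=50K"),
  c("Education=Medium", "Single", "Income=<=50K"),
  c("Education=High", "Single", "Income=<=50K")
)

# Support: share of transactions that contain every item of the itemset
support_of <- function(itemset, transactions) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets
apriori_sketch <- function(transactions, min_support, max_len = 3) {
  items <- sort(unique(unlist(transactions)))
  current <- Filter(function(s) support_of(s, transactions) >= min_support,
                    as.list(items))
  frequent <- current
  k <- 1
  while (length(current) > 0 && k < max_len) {
    k <- k + 1
    # Only items that survived the previous level can appear in candidates
    alive <- sort(unique(unlist(current)))
    if (length(alive) < k) break
    candidates <- combn(alive, k, simplify = FALSE)
    # Apriori pruning: drop candidates that have an infrequent (k-1)-subset
    known <- vapply(current, paste, character(1), collapse = "|")
    candidates <- Filter(function(cand) {
      subsets <- combn(cand, k - 1, simplify = FALSE)
      all(vapply(subsets, paste, character(1), collapse = "|") %in% known)
    }, candidates)
    current <- Filter(function(s) support_of(s, transactions) >= min_support,
                      candidates)
    frequent <- c(frequent, current)
  }
  frequent
}

frequent_sets <- apriori_sketch(toy, min_support = 0.3)
for (s in frequent_sets) {
  cat(sprintf("%.1f  {%s}\n", support_of(s, toy), paste(s, collapse = ", ")))
}
```

The pruning step is what gives the algorithm its name: a candidate is counted against the data only if all of its smaller subsets were already found to be frequent.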

Preprocessing the data

First, I preprocess the data: renaming columns, removing rows with missing values (NAs), converting categorical variables into factors, and binning numerical variables into labelled intervals.

col_names <- c('Age', 'WorkClass', 'fnlwgt', 'Education', 'EducationNum', 'MaritalStatus', 'Occupation', 
               'Relationship', 'Race', 'Gender', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry', 'Income')

data <- read.csv("adult.csv", na.strings = "?")
head(data, 10)

##    age        workclass fnlwgt    education educational.num     marital.status
## 1   25          Private 226802         11th               7      Never-married
## 2   38          Private  89814      HS-grad               9 Married-civ-spouse
## 3   28        Local-gov 336951   Assoc-acdm              12 Married-civ-spouse
## 4   44          Private 160323 Some-college              10 Married-civ-spouse
## 5   18             <NA> 103497 Some-college              10      Never-married
## 6   34          Private 198693         10th               6      Never-married
## 7   29             <NA> 227026      HS-grad               9      Never-married
## 8   63 Self-emp-not-inc 104626  Prof-school              15 Married-civ-spouse
## 9   24          Private 369667 Some-college              10      Never-married
## 10  55          Private 104996      7th-8th               4 Married-civ-spouse
##           occupation  relationship  race gender capital.gain capital.loss
## 1  Machine-op-inspct     Own-child Black   Male            0            0
## 2    Farming-fishing       Husband White   Male            0            0
## 3    Protective-serv       Husband White   Male            0            0
## 4  Machine-op-inspct       Husband Black   Male         7688            0
## 5               <NA>     Own-child White Female            0            0
## 6      Other-service Not-in-family White   Male            0            0
## 7               <NA>     Unmarried Black   Male            0            0
## 8     Prof-specialty       Husband White   Male         3103            0
## 9      Other-service     Unmarried White Female            0            0
## 10      Craft-repair       Husband White   Male            0            0
##    hours.per.week native.country income
## 1              40  United-States  <=50K
## 2              50  United-States  <=50K
## 3              40  United-States   >50K
## 4              40  United-States   >50K
## 5              30  United-States  <=50K
## 6              30  United-States  <=50K
## 7              40  United-States  <=50K
## 8              32  United-States   >50K
## 9              40  United-States  <=50K
## 10             10  United-States  <=50K

colnames(data) <- col_names
data <- na.omit(data)

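# WorkClass, MaritalStatus, Occupation, Relationship, Race, Gender, NativeCountry and Income are character columns.
# Education is also a character column, but it is rebuilt from EducationNum below, so it is not converted here.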

characters <- c('WorkClass', 'MaritalStatus', 'Occupation', 'Relationship', 'Race', 
                         'Gender', 'NativeCountry', 'Income')

data[characters] <- lapply(data[characters], as.factor)

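# EducationNum 1-5 (up to 9th grade) -> Low, 6-10 (10th grade to some college) -> Medium,
# 11-15 (associate degree to professional school) -> High, 16 (doctorate) -> Very High
data$Education <- cut(data$EducationNum, breaks = c(0, 5, 10, 15, 16), labels = c("Low", "Medium", "High", "Very High"), right = TRUE)

# Note: with right = TRUE the first bin is (17, 30], so Age == 17 is not covered and becomes NA;
# adding include.lowest = TRUE would keep such rows in "17-30"
data$Age <- cut(data$Age, breaks = c(17, 30, 40, 50, 65, 90), labels = c("17-30", "31-40", "41-50", "51-65", "66-90"), right = TRUE)

data$CapitalGain <- cut(data$CapitalGain, breaks = c(-Inf, 1000, 5000, 20000, 50000, Inf), labels = c("Very Low", "Low", "Medium", "High", "Very High"))

# breaks = 5 splits the observed range of CapitalLoss into five equal-width bins
data$CapitalLoss <- cut(data$CapitalLoss, breaks = 5, labels = c("Very Low", "Low", "Medium", "High", "Very High"), include.lowest = TRUE)

# Note: "Low" covers 21-40 hours, so a standard 40-hour week is labelled "Low"
data$HoursPerWeek <- cut(data$HoursPerWeek, breaks = c(0, 20, 40, 60, 80, 100), labels = c("Very Low", "Low", "Medium", "High", "Very High"), right = TRUE)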

data <- subset(data, select = -c(EducationNum, fnlwgt))

str(data)

## 'data.frame':    45222 obs. of  13 variables:
##  $ Age          : Factor w/ 5 levels "17-30","31-40",..: 1 2 1 3 2 4 1 4 4 2 ...
##  $ WorkClass    : Factor w/ 7 levels "Federal-gov",..: 3 3 2 3 3 5 3 3 3 1 ...
##  $ Education    : Factor w/ 4 levels "Low","Medium",..: 2 2 3 2 2 3 2 1 2 3 ...
##  $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 3 3 5 3 5 3 3 3 ...
##  $ Occupation   : Factor w/ 14 levels "Adm-clerical",..: 7 5 11 7 8 10 8 3 7 1 ...
##  $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 4 1 1 1 2 1 5 1 1 1 ...
##  $ Race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 3 5 5 3 5 5 5 5 5 5 ...
##  $ Gender       : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 2 2 ...
##  $ CapitalGain  : Factor w/ 5 levels "Very Low","Low",..: 1 1 1 3 1 2 1 1 3 1 ...
##  $ CapitalLoss  : Factor w/ 5 levels "Very Low","Low",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HoursPerWeek : Factor w/ 5 levels "Very Low","Low",..: 2 3 2 2 2 2 2 1 2 2 ...
##  $ NativeCountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 39 39 39 39 39 39 ...
##  $ Income       : Factor w/ 2 levels "<=50K",">50K": 1 1 2 2 1 2 1 1 2 1 ...
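One detail of the binning above is worth checking in isolation: how cut() treats a value that sits exactly on the lowest break. The following sketch uses a few made-up ages (an assumption for illustration, not rows from the dataset) with the same breaks and labels as the Age column.

```r
# Hypothetical ages chosen to sit on or near the bin edges
ages   <- c(17, 18, 30, 31, 90)
breaks <- c(17, 30, 40, 50, 65, 90)
labels <- c("17-30", "31-40", "41-50", "51-65", "66-90")

# right = TRUE builds intervals (17,30], (30,40], ... so 17 itself is not covered
as_in_text <- cut(ages, breaks = breaks, labels = labels, right = TRUE)

# include.lowest = TRUE closes the first interval: [17,30]
with_lowest <- cut(ages, breaks = breaks, labels = labels, right = TRUE,
                   include.lowest = TRUE)

print(data.frame(ages, as_in_text, with_lowest))
```

With the breaks used in this paper, an age of exactly 17 maps to NA. When a data frame is converted to transactions, a missing value simply contributes no item, which would be consistent with some transactions having 12 rather than 13 items.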

Next, I transform the data into a format suitable for the Apriori algorithm.

library(arules)
transactions_income <- as(data, "transactions")

summary(transactions_income)

## transactions as itemMatrix in sparse format with
##  45222 rows (elements/itemsets/transactions) and
##  108 columns (items) and a density of 0.1202694 
## 
## most frequent items:
##        CapitalLoss=Very Low        CapitalGain=Very Low 
##                       43117                       41498 
## NativeCountry=United-States                  Race=White 
##                       41292                       38903 
##                Income=<=50K                     (Other) 
##                       34014                      388569 
## 
## element (itemset/transaction) length distribution:
## sizes
##    12    13 
##   493 44729 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   13.00   13.00   12.99   13.00   13.00 
## 
## includes extended item information - examples:
##      labels variables levels
## 1 Age=17-30       Age  17-30
## 2 Age=31-40       Age  31-40
## 3 Age=41-50       Age  41-50
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
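The density reported in the summary can be reproduced by hand: it is the share of filled cells in the 45,222 by 108 transaction-by-item matrix. A minimal check in base R, using only the numbers printed above (the variable names are my own):

```r
n_items <- 108                          # columns (items) in the summary
sizes   <- c(`12` = 493, `13` = 44729)  # transaction length distribution

n_transactions <- sum(sizes)
total_items    <- sum(as.numeric(names(sizes)) * sizes)

# Average number of items per transaction
mean_length <- total_items / n_transactions

# Filled cells divided by all cells of the sparse matrix
density <- total_items / (n_transactions * n_items)

cat("transactions:", n_transactions, "\n")
cat("mean length: ", round(mean_length, 2), "\n")
cat("density:     ", round(density, 7), "\n")
```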

inspect(transactions_income[1:10])

##      items                               transactionID
## [1]  {Age=17-30,                                      
##       WorkClass=Private,                              
##       Education=Medium,                               
##       MaritalStatus=Never-married,                    
##       Occupation=Machine-op-inspct,                   
##       Relationship=Own-child,                         
##       Race=Black,                                     
##       Gender=Male,                                    
##       CapitalGain=Very Low,                           
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=<=50K}                                 1 
## [2]  {Age=31-40,                                      
##       WorkClass=Private,                              
##       Education=Medium,                               
##       MaritalStatus=Married-civ-spouse,               
##       Occupation=Farming-fishing,                     
##       Relationship=Husband,                           
##       Race=White,                                     
##       Gender=Male,                                    
##       CapitalGain=Very Low,                           
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Medium,                            
##       NativeCountry=United-States,                    
##       Income=<=50K}                                 2 
## [3]  {Age=17-30,                                      
##       WorkClass=Local-gov,                            
##       Education=High,                                 
##       MaritalStatus=Married-civ-spouse,               
##       Occupation=Protective-serv,                     
##       Relationship=Husband,                           
##       Race=White,                                     
##       Gender=Male,                                    
##       CapitalGain=Very Low,                           
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=>50K}                                  3 
## [4]  {Age=41-50,                                      
##       WorkClass=Private,                              
##       Education=Medium,                               
##       MaritalStatus=Married-civ-spouse,               
##       Occupation=Machine-op-inspct,                   
##       Relationship=Husband,                           
##       Race=Black,                                     
##       Gender=Male,                                    
##       CapitalGain=Medium,                             
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=>50K}                                  4 
## [5]  {Age=31-40,                                      
##       WorkClass=Private,                              
##       Education=Medium,                               
##       MaritalStatus=Never-married,                    
##       Occupation=Other-service,                       
##       Relationship=Not-in-family,                     
##       Race=White,                                     
##       Gender=Male,                                    
##       CapitalGain=Very Low,                           
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=<=50K}                                 6 
## [6]  {Age=51-65,                                      
##       WorkClass=Self-emp-not-inc,                     
##       Education=High,                                 
##       MaritalStatus=Married-civ-spouse,               
##       Occupation=Prof-specialty,                      
##       Relationship=Husband,                           
##       Race=White,                                     
##       Gender=Male,                                    
##       CapitalGain=Low,                                
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=>50K}                                  8 
## [7]  {Age=17-30,                                      
##       WorkClass=Private,                              
##       Education=Medium,                               
##       MaritalStatus=Never-married,                    
##       Occupation=Other-service,                       
##       Relationship=Unmarried,                         
##       Race=White,                                     
##       Gender=Female,                                  
##       CapitalGain=Very Low,                           
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=<=50K}                                 9 
## [8]  {Age=51-65,                                      
##       WorkClass=Private,                              
##       Education=Low,                                  
##       MaritalStatus=Married-civ-spouse,               
##       Occupation=Craft-repair,                        
##       Relationship=Husband,                           
##       Race=White,                                     
##       Gender=Male,                                    
##       CapitalGain=Very Low,                           
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Very Low,                          
##       NativeCountry=United-States,                    
##       Income=<=50K}                                 10
## [9]  {Age=51-65,                                      
##       WorkClass=Private,                              
##       Education=Medium,                               
##       MaritalStatus=Married-civ-spouse,               
##       Occupation=Machine-op-inspct,                   
##       Relationship=Husband,                           
##       Race=White,                                     
##       Gender=Male,                                    
##       CapitalGain=Medium,                             
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=>50K}                                  11
## [10] {Age=31-40,                                      
##       WorkClass=Federal-gov,                          
##       Education=High,                                 
##       MaritalStatus=Married-civ-spouse,               
##       Occupation=Adm-clerical,                        
##       Relationship=Husband,                           
##       Race=White,                                     
##       Gender=Male,                                    
##       CapitalGain=Very Low,                           
##       CapitalLoss=Very Low,                           
##       HoursPerWeek=Low,                               
##       NativeCountry=United-States,                    
##       Income=<=50K}                                 12

itemFrequencyPlot(transactions_income, support = 0.1, col="coral4")

Apriori Algorithm

I now apply the Apriori algorithm. I set the minimum support to 10%, which means that an itemset must appear in at least 10% of the transactions. This filters out rare, less meaningful patterns. The minimum confidence is set to 80%: whenever the antecedent (left-hand side) occurs, the consequent (right-hand side) occurs in at least 80% of those cases. Rules are limited to at most three items (maxlen = 3). After applying the algorithm, I sort the rules by lift, confidence, and count. Sorting by lift finds strong dependencies between attributes beyond what would be expected if they were independent, which can then be examined in relation to high income. Sorting by confidence shows how reliable a rule is in predicting its consequent. Finally, sorting by count (equivalent to sorting by support) finds frequent patterns that appear in many transactions.

rules <- apriori(transactions_income, parameter = list(support = 0.1, confidence = 0.8), maxlen = 3)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4522 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[108 item(s), 45222 transaction(s)] done [0.02s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3

##  done [0.01s].
## writing ... [946 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 946 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3 
##   4 133 809 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   2.851   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.1002   Min.   :0.8002   Min.   :0.1029   Min.   :0.8725  
##  1st Qu.:0.1308   1st Qu.:0.8847   1st Qu.:0.1440   1st Qu.:0.9934  
##  Median :0.2046   Median :0.9177   Median :0.2238   Median :1.0172  
##  Mean   :0.2587   Mean   :0.9144   Mean   :0.2837   Mean   :1.1066  
##  3rd Qu.:0.2991   3rd Qu.:0.9477   3rd Qu.:0.3228   3rd Qu.:1.0591  
##  Max.   :0.9535   Max.   :1.0000   Max.   :1.0000   Max.   :2.9114  
##      count      
##  Min.   : 4531  
##  1st Qu.: 5913  
##  Median : 9253  
##  Mean   :11700  
##  3rd Qu.:13527  
##  Max.   :43117  
## 
## mining info:
##                 data ntransactions support confidence
##  transactions_income         45222     0.1        0.8
##                                                                                                call
##  apriori(data = transactions_income, parameter = list(support = 0.1, confidence = 0.8), maxlen = 3)

inspect(sort(rules, by = "lift")[1:10])

##      lhs                                    rhs                             support confidence  coverage     lift count
## [1]  {Age=17-30,                                                                                                       
##       Relationship=Own-child}            => {MaritalStatus=Never-married} 0.1022290  0.9398252 0.1087745 2.911411  4623
## [2]  {WorkClass=Private,                                                                                               
##       Relationship=Own-child}            => {MaritalStatus=Never-married} 0.1107868  0.8927299 0.1240989 2.765518  5010
## [3]  {Relationship=Own-child,                                                                                          
##       Race=White}                        => {MaritalStatus=Never-married} 0.1100128  0.8909384 0.1234797 2.759968  4975
## [4]  {Relationship=Own-child,                                                                                          
##       Income=<=50K}                      => {MaritalStatus=Never-married} 0.1282783  0.8895875 0.1441997 2.755783  5801
## [5]  {Relationship=Own-child,                                                                                          
##       NativeCountry=United-States}       => {MaritalStatus=Never-married} 0.1221529  0.8876748 0.1376100 2.749858  5524
## [6]  {Relationship=Own-child,                                                                                          
##       CapitalGain=Very Low}              => {MaritalStatus=Never-married} 0.1266640  0.8875116 0.1427181 2.749353  5728
## [7]  {Relationship=Own-child,                                                                                          
##       CapitalLoss=Very Low}              => {MaritalStatus=Never-married} 0.1269294  0.8860759 0.1432489 2.744905  5740
## [8]  {Relationship=Own-child}            => {MaritalStatus=Never-married} 0.1296714  0.8849985 0.1465216 2.741567  5864
## [9]  {Education=Medium,                                                                                                
##       Relationship=Own-child}            => {MaritalStatus=Never-married} 0.1022069  0.8845933 0.1155411 2.740312  4622
## [10] {MaritalStatus=Married-civ-spouse,                                                                                
##       Gender=Male}                       => {Relationship=Husband}        0.4124983  0.9900223 0.4166556 2.398521 18654
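The quality measures in this table follow directly from counts. As a check, the sketch below recomputes them for rule [10], {MaritalStatus=Married-civ-spouse, Gender=Male} => {Relationship=Husband}. The count of 18,654 is printed above; the other two counts are not printed and are my own back-calculation (coverage times 45,222 for the left-hand side, and the coverage of the {Relationship=Husband} rule in the confidence-sorted table for the right-hand side), so treat them as assumptions.

```r
n      <- 45222   # transactions
n_both <- 18654   # LHS and RHS together (count column)
n_lhs  <- 18842   # assumed: 0.4166556 * 45222, rounded
n_rhs  <- 18666   # assumed: 0.4127637 * 45222, rounded

support    <- n_both / n                # how often the whole rule occurs
coverage   <- n_lhs / n                 # how often the LHS occurs
confidence <- n_both / n_lhs            # share of LHS cases where the RHS also holds
lift       <- confidence / (n_rhs / n)  # confidence relative to the RHS base rate

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 6))
```

A lift of about 2.4 means that a married man is roughly 2.4 times as likely to be recorded as “Husband” as an arbitrary individual in the dataset, which is an expected, almost definitional relationship rather than a discovery.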

inspect(sort(rules, by = "confidence")[1:10])

##      lhs                                    rhs             support confidence  coverage     lift count
## [1]  {Age=41-50,                                                                                       
##       Relationship=Husband}              => {Gender=Male} 0.1194109  1.0000000 0.1194109 1.481377  5400
## [2]  {Relationship=Husband,                                                                            
##       Income=>50K}                       => {Gender=Male} 0.1881164  1.0000000 0.1881164 1.481377  8507
## [3]  {Relationship=Husband,                                                                            
##       HoursPerWeek=Medium}               => {Gender=Male} 0.1543054  1.0000000 0.1543054 1.481377  6978
## [4]  {Education=High,                                                                                  
##       Relationship=Husband}              => {Gender=Male} 0.1471629  1.0000000 0.1471629 1.481377  6655
## [5]  {Relationship=Husband}              => {Gender=Male} 0.4127416  0.9999464 0.4127637 1.481298 18665
## [6]  {MaritalStatus=Married-civ-spouse,                                                                
##       Relationship=Husband}              => {Gender=Male} 0.4124983  0.9999464 0.4125205 1.481298 18654
## [7]  {Relationship=Husband,                                                                            
##       CapitalLoss=Very Low}              => {Gender=Male} 0.3861616  0.9999427 0.3861837 1.481292 17463
## [8]  {Relationship=Husband,                                                                            
##       NativeCountry=United-States}       => {Gender=Male} 0.3776702  0.9999415 0.3776923 1.481290 17079
## [9]  {Relationship=Husband,                                                                            
##       Race=White}                        => {Gender=Male} 0.3751935  0.9999411 0.3752156 1.481290 16967
## [10] {Relationship=Husband,                                                                            
##       CapitalGain=Very Low}              => {Gender=Male} 0.3631639  0.9999391 0.3631861 1.481287 16423

inspect(sort(rules, by = "count")[1:10])

##      lhs                              rhs                           support  
## [1]  {}                            => {CapitalLoss=Very Low}        0.9534519
## [2]  {}                            => {CapitalGain=Very Low}        0.9176507
## [3]  {}                            => {NativeCountry=United-States} 0.9130954
## [4]  {CapitalGain=Very Low}        => {CapitalLoss=Very Low}        0.8711026
## [5]  {CapitalLoss=Very Low}        => {CapitalGain=Very Low}        0.8711026
## [6]  {NativeCountry=United-States} => {CapitalLoss=Very Low}        0.8698421
## [7]  {CapitalLoss=Very Low}        => {NativeCountry=United-States} 0.8698421
## [8]  {}                            => {Race=White}                  0.8602671
## [9]  {NativeCountry=United-States} => {CapitalGain=Very Low}        0.8364955
## [10] {CapitalGain=Very Low}        => {NativeCountry=United-States} 0.8364955
##      confidence coverage  lift      count
## [1]  0.9534519  1.0000000 1.0000000 43117
## [2]  0.9176507  1.0000000 1.0000000 41498
## [3]  0.9130954  1.0000000 1.0000000 41292
## [4]  0.9492747  0.9176507 0.9956189 39393
## [5]  0.9136304  0.9534519 0.9956189 39393
## [6]  0.9526300  0.9130954 0.9991381 39336
## [7]  0.9123084  0.9534519 0.9991381 39336
## [8]  0.8602671  1.0000000 1.0000000 38903
## [9]  0.9161097  0.9130954 0.9983207 37828
## [10] 0.9115620  0.9176507 0.9983207 37828

What drives low and high income?

Next, I generate association rules with the right-hand side fixed to “Income=<=50K”, that is, patterns that strongly indicate low-income individuals. I sort by lift because I want meaningful, strong associations. I set the minimum support to 0.01 to catch rarer but potentially more meaningful patterns, and the minimum confidence to 0.5 to discard rules that hold in fewer than half of the matching cases.

# What drives low income?
rules.less50K <- apriori(data = transactions_income,
                         parameter = list(supp = 0.01, conf = 0.5),
                         appearance = list(default = "lhs", rhs = "Income=<=50K"),
                         control = list(verbose = FALSE))
rules.less50K.bylift <- sort(rules.less50K, by = "lift", decreasing = TRUE)
inspect(head(rules.less50K.bylift))

##     lhs                                rhs               support confidence   coverage     lift count
## [1] {Occupation=Other-service,                                                                       
##      Relationship=Own-child,                                                                         
##      HoursPerWeek=Very Low}         => {Income=<=50K} 0.01090177          1 0.01090177 1.329511   493
## [2] {MaritalStatus=Never-married,                                                                    
##      Occupation=Other-service,                                                                       
##      HoursPerWeek=Very Low}         => {Income=<=50K} 0.01364380          1 0.01364380 1.329511   617
## [3] {Relationship=Own-child,                                                                         
##      Gender=Male,                                                                                    
##      HoursPerWeek=Very Low}         => {Income=<=50K} 0.01581089          1 0.01581089 1.329511   715
## [4] {Relationship=Own-child,                                                                         
##      CapitalGain=Very Low,                                                                           
##      HoursPerWeek=Very Low}         => {Income=<=50K} 0.03498297          1 0.03498297 1.329511  1582
## [5] {Age=17-30,                                                                                      
##      Occupation=Handlers-cleaners,                                                                   
##      Relationship=Own-child,                                                                         
##      CapitalGain=Very Low}          => {Income=<=50K} 0.01068064          1 0.01068064 1.329511   483
## [6] {MaritalStatus=Never-married,                                                                    
##      Occupation=Handlers-cleaners,                                                                   
##      Relationship=Own-child,                                                                         
##      Gender=Male}                   => {Income=<=50K} 0.01112290          1 0.01112290 1.329511   503

Analyzing the rules, several attribute values appear to be highly associated with earning at most $50K in this dataset: occupation (“Other-service” and “Handlers-cleaners”), hours worked (“Very Low”, that is, at most 20 hours per week), relationship (“Own-child”), marital status (“Never-married”), capital gains (“Very Low”), and age (17-30 years). All six rules have 100% confidence, meaning that in this dataset every individual who matches the left-hand side earns at most $50K. This suggests that these combinations are strong indicators of low income, although each rule covers only about 1% to 3.5% of the transactions. The lift is the same for all six rules (1.329511), which means that individuals matching the left-hand side are about 33% more likely to earn at most $50K than an individual picked at random from the dataset.
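The identical lift of 1.329511 is no coincidence. Lift is confidence divided by the overall share of the consequent, so a rule with confidence 1 reaches the largest lift possible for that consequent. A quick check with the item count from the transaction summary (variable names are my own):

```r
n         <- 45222   # transactions
n_low     <- 34014   # Income=<=50K, from the most frequent items
base_rate <- n_low / n

confidence <- 1      # all six rules shown above
lift       <- confidence / base_rate

cat("base rate of Income=<=50K:", round(base_rate, 4), "\n")
cat("lift at confidence 1:     ", round(lift, 6), "\n")
```

Because about three quarters of the dataset is already in the low-income class, no rule with this consequent can exceed a lift of roughly 1.33, however specific its left-hand side.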

Below, I generate association rules with the right-hand side fixed to “Income=>50K”, that is, patterns that strongly indicate high-income individuals.

# What drives high income?
rules.more50K <- apriori(data = transactions_income,
                         parameter = list(supp = 0.01, conf = 0.5),
                         appearance = list(default = "lhs", rhs = "Income=>50K"),
                         control = list(verbose = FALSE))
rules.more50K.bylift <- sort(rules.more50K, by = "lift", decreasing = TRUE)
inspect(head(rules.more50K.bylift))

##     lhs                                    rhs              support confidence   coverage     lift count
## [1] {Occupation=Exec-managerial,                                                                        
##      Race=White,                                                                                        
##      CapitalGain=Medium}                => {Income=>50K} 0.01023838  0.9585921 0.01068064 3.867724   463
## [2] {Occupation=Exec-managerial,                                                                        
##      Race=White,                                                                                        
##      CapitalGain=Medium,                                                                                
##      CapitalLoss=Very Low}              => {Income=>50K} 0.01023838  0.9585921 0.01068064 3.867724   463
## [3] {Education=High,                                                                                    
##      MaritalStatus=Married-civ-spouse,                                                                  
##      CapitalGain=Medium,                                                                                
##      NativeCountry=United-States}       => {Income=>50K} 0.01742515  0.9563107 0.01822122 3.858519   788
## [4] {Education=High,                                                                                    
##      MaritalStatus=Married-civ-spouse,                                                                  
##      CapitalGain=Medium,                                                                                
##      CapitalLoss=Very Low,                                                                              
##      NativeCountry=United-States}       => {Income=>50K} 0.01742515  0.9563107 0.01822122 3.858519   788
## [5] {Education=High,                                                                                    
##      MaritalStatus=Married-civ-spouse,                                                                  
##      CapitalGain=Medium}                => {Income=>50K} 0.01864137  0.9557823 0.01950378 3.856387   843
## [6] {Education=High,                                                                                    
##      MaritalStatus=Married-civ-spouse,                                                                  
##      CapitalGain=Medium,                                                                                
##      CapitalLoss=Very Low}              => {Income=>50K} 0.01864137  0.9557823 0.01950378 3.856387   843

These rules describe associations between various factors and the likelihood of earning more than $50K.

The strong indicators of high income revealed by the rules above are:

Occupation: “Exec-managerial” jobs are highly associated with earning more than $50K.
Race: “White” appears in the top rules, although this value is present in about 86% of all transactions, so it discriminates little on its own.
Capital Gain: “Medium” capital gain (above 5,000 and up to 20,000) appears in every one of the top rules.
Education: Having “High” education (EducationNum 11 to 15, that is, an associate degree up to a professional-school degree) is an important factor for earning more than $50K.
Marital Status: Being “Married-civ-spouse” is also strongly associated with earning more than $50K.
Native Country: “United-States” appears in some of the rules, but adding it raises confidence only marginally (0.9563 with it versus 0.9558 without it).

The lift values for these rules are consistently high (around 3.86), meaning that individuals matching the left-hand side are almost four times as likely to earn more than $50K as an individual picked at random from the dataset. This suggests that combinations of these attributes (e.g., occupation, education, capital gain, marital status) are very strong indicators of income above $50K.
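Several of the six rules differ only by an extra item that changes nothing. One common convention (the one I understand arules' is.redundant() to follow by default) calls a rule redundant when a more general rule, with a subset of its left-hand side, has the same or higher confidence. The sketch below applies that definition in base R to just the six printed rules; the left-hand sides and confidences are copied from the table above, and the helper more_general() is my own. Since only six rules are compared, a rule flagged as non-redundant here could still be redundant with respect to rules that were not printed.

```r
# Left-hand sides and confidences copied from the six rules printed above
lhs <- list(
  c("Occupation=Exec-managerial", "Race=White", "CapitalGain=Medium"),
  c("Occupation=Exec-managerial", "Race=White", "CapitalGain=Medium",
    "CapitalLoss=Very Low"),
  c("Education=High", "MaritalStatus=Married-civ-spouse", "CapitalGain=Medium",
    "NativeCountry=United-States"),
  c("Education=High", "MaritalStatus=Married-civ-spouse", "CapitalGain=Medium",
    "CapitalLoss=Very Low", "NativeCountry=United-States"),
  c("Education=High", "MaritalStatus=Married-civ-spouse", "CapitalGain=Medium"),
  c("Education=High", "MaritalStatus=Married-civ-spouse", "CapitalGain=Medium",
    "CapitalLoss=Very Low")
)
confidence <- c(0.9585921, 0.9585921, 0.9563107, 0.9563107, 0.9557823, 0.9557823)

# TRUE if rule j is strictly more general than rule i (proper subset of its LHS)
more_general <- function(j, i) {
  length(lhs[[j]]) < length(lhs[[i]]) && all(lhs[[j]] %in% lhs[[i]])
}

# A rule is flagged when some more general rule is at least as confident
redundant <- vapply(seq_along(lhs), function(i) {
  any(vapply(seq_along(lhs), function(j) {
    more_general(j, i) && confidence[j] >= confidence[i]
  }, logical(1)))
}, logical(1))

print(data.frame(rule = seq_along(lhs), confidence, redundant))
```

Rules 2, 4, and 6 are flagged: each only adds “CapitalLoss=Very Low” to rules 1, 3, and 5 without changing confidence, so they carry no additional information.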

library(arulesViz)


itemFrequencyPlot(transactions_income, topN=10, type="absolute", main="Item Frequency - absolute", col="coral")

itemFrequencyPlot(transactions_income, topN=10, type="relative", main="Item Frequency - relative", col="coral4")

plot(rules, method = "grouped")

The first plot above shows the 10 most frequent items in the dataset by absolute count. The second plot shows the same items by relative frequency, that is, the share of transactions that contain each item. The third plot visualizes the association rules generated by the Apriori algorithm, showing the relationship between the left-hand sides and right-hand sides of the rules. Next, I display network graphs of the rules for each income class.

# income <=50K
plot(rules.less50K, method="graph", control = list(max = 10))

# income >50K
plot(rules.more50K, method="graph", control = list(max = 10))

Conclusions

By mining association rules, I identified key factors that are strongly associated with income level, distinguishing between individuals earning above and below $50K per year. The analysis indicates that low-income individuals are commonly associated with service and manual occupations, young age, very short working weeks, never having been married or living as a dependent child, and very low capital gains, whereas high-income individuals tend to have managerial or executive roles, high education levels, medium capital gains, and are often married. The lift values of the high-income rules (around 3.9) show that these relationships are far from random; the low-income rules have a more modest lift (about 1.33) because most of the dataset is low-income to begin with. Together, these rules can provide meaningful insights.

Unsupervised Learning Project: Association Rules

Aleksandra Dobosz

2025-01-26