Abstract

In this project, I analyze the “Adult Census Income” dataset. I use a technique called Association Rule Mining. My goal is to find patterns that predict if a person earns a high income or a low income. I use the Apriori algorithm to find these rules. My results show that marital status and job type are very strong predictors of wealth. In fact, they are often more important than just education level alone.

Dataset

I used data from the UCI Machine Learning Repository. It contains nearly 50,000 records from the 1994 Census. I loaded the data directly from the website.

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
col_names <- c("age", "workclass", "fnlwgt", "education", "education_num", 
               "marital_status", "occupation", "relationship", "race", "sex", 
               "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")

raw_data <- read.csv(url, header = FALSE, col.names = col_names, 
                     stringsAsFactors = TRUE, strip.white = TRUE, na.strings = "?")

dim(raw_data)
## [1] 32561    15

Data Cleaning and Classification

Cleaning the Data The Apriori algorithm needs categories, not numbers. Therefore, I had to process the data. - Removing columns: I removed fnlwgt and education_num because I do not need them for this analysis. - Removing missing data: I removed rows that had empty values. - Discretization: I converted numbers into groups. For example, I grouped age into “Young,” “Middle-aged,” and “Senior.”

adult_clean <- raw_data[, -c(3, 5)]
adult_clean <- na.omit(adult_clean) 

adult_clean[[ "age" ]] <- discretize(adult_clean[[ "age" ]], method = "frequency", breaks = 3, labels = c("Young", "Middle-aged", "Senior"))
adult_clean[[ "hours_per_week" ]] <- cut(adult_clean[[ "hours_per_week" ]], 
                                         breaks = c(0, 39, 40, 168), 
                                         labels = c("Part-time", "Full-time", "Over-time"),
                                         include.lowest = TRUE)
adult_clean[[ "capital_gain" ]] <- discretize(adult_clean[[ "capital_gain" ]], method = "interval", breaks = 3, labels = c("Low", "Medium", "High"))
adult_clean[[ "capital_loss" ]] <- discretize(adult_clean[[ "capital_loss" ]], method = "interval", breaks = 3, labels = c("Low", "Medium", "High"))

trans_adult <- as(adult_clean, "transactions")

inspect(head(trans_adult, 3))
##     items                                transactionID
## [1] {age=Middle-aged,                                 
##      workclass=State-gov,                             
##      education=Bachelors,                             
##      marital_status=Never-married,                    
##      occupation=Adm-clerical,                         
##      relationship=Not-in-family,                      
##      race=White,                                      
##      sex=Male,                                        
##      capital_gain=Low,                                
##      capital_loss=Low,                                
##      hours_per_week=Full-time,                        
##      native_country=United-States,                    
##      income=<=50K}                                   1
## [2] {age=Senior,                                      
##      workclass=Self-emp-not-inc,                      
##      education=Bachelors,                             
##      marital_status=Married-civ-spouse,               
##      occupation=Exec-managerial,                      
##      relationship=Husband,                            
##      race=White,                                      
##      sex=Male,                                        
##      capital_gain=Low,                                
##      capital_loss=Low,                                
##      hours_per_week=Part-time,                        
##      native_country=United-States,                    
##      income=<=50K}                                   2
## [3] {age=Middle-aged,                                 
##      workclass=Private,                               
##      education=HS-grad,                               
##      marital_status=Divorced,                         
##      occupation=Handlers-cleaners,                    
##      relationship=Not-in-family,                      
##      race=White,                                      
##      sex=Male,                                        
##      capital_gain=Low,                                
##      capital_loss=Low,                                
##      hours_per_week=Full-time,                        
##      native_country=United-States,                    
##      income=<=50K}                                   3

Exploring Data

Before I run the algorithm, I want to see which items are the most common. I used a plot to visualize the top 10 attributes.

itemFrequencyPlot(trans_adult, topN = 10, type = "absolute", 
                  main = "Top 10 Most Common Attributes", 
                  col = "steelblue", xlab = "Attribute")

Observation: Most people in this dataset are from the United States. Also, most people have “Low” capital gain and “Low” capital loss. This is normal, but it means I need to look carefully to find interesting patterns.

Finding the Association Rules

Apriori

I used the Apriori algorithm to generate rules. I set the minimum support to 0.5%. I set the minimum confidence to 80%. This ensures that I only find strong and reliable rules.

rules <- apriori(trans_adult, 
                 parameter = list(supp = 0.005, conf = 0.8, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 150 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[112 item(s), 30162 transaction(s)] done [0.01s].
## sorting and recoding items ... [66 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(trans_adult, parameter = list(supp = 0.005, conf = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
##  done [0.33s].
## writing ... [713318 rule(s)] done [0.04s].
## creating S4 object  ... done [0.18s].
summary(rules)
## set of 713318 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8      9     10 
##    303   4859  29323  90307 164287 189433 142721  70369  21716 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   6.000   7.000   6.853   8.000  10.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.005006   Min.   :0.8000   Min.   :0.005006   Min.   : 0.8385  
##  1st Qu.:0.006498   1st Qu.:0.9086   1st Qu.:0.006896   1st Qu.: 1.0051  
##  Median :0.009184   Median :0.9565   Median :0.009781   Median : 1.0412  
##  Mean   :0.016245   Mean   :0.9430   Mean   :0.017302   Mean   : 1.3044  
##  3rd Qu.:0.015980   3rd Qu.:0.9936   3rd Qu.:0.017008   3rd Qu.: 1.3022  
##  Max.   :0.950269   Max.   :1.0000   Max.   :0.994927   Max.   :32.0061  
##      count      
##  Min.   :  151  
##  1st Qu.:  196  
##  Median :  277  
##  Mean   :  490  
##  3rd Qu.:  482  
##  Max.   :28662  
## 
## mining info:
##         data ntransactions support confidence
##  trans_adult         30162   0.005        0.8
##                                                                                 call
##  apriori(data = trans_adult, parameter = list(supp = 0.005, conf = 0.8, minlen = 2))

Who earns more than 50k?

I filtered the rules to look for high income. I selected rules where the result (RHS) is income=>50K. I sorted them by “lift.” Lift is a measure of how interesting a rule is. High lift means the relationship is very strong.

rules_high_income <- subset(rules, rhs %in% "income=>50K")
rules_high_income_sorted <- sort(rules_high_income, by = "lift", decreasing = TRUE)

inspect(head(rules_high_income_sorted, 10))
##      lhs                                     rhs               support confidence    coverage     lift count
## [1]  {education=Masters,                                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005636231  0.9444444 0.005967774 3.794131   170
## [2]  {education=Masters,                                                                                    
##       marital_status=Married-civ-spouse,                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005636231  0.9444444 0.005967774 3.794131   170
## [3]  {education=Masters,                                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       sex=Male,                                                                                             
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005636231  0.9444444 0.005967774 3.794131   170
## [4]  {education=Masters,                                                                                    
##       marital_status=Married-civ-spouse,                                                                    
##       occupation=Exec-managerial,                                                                           
##       race=White,                                                                                           
##       sex=Male,                                                                                             
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005636231  0.9444444 0.005967774 3.794131   170
## [5]  {education=Masters,                                                                                    
##       marital_status=Married-civ-spouse,                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       sex=Male,                                                                                             
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005636231  0.9444444 0.005967774 3.794131   170
## [6]  {education=Masters,                                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       capital_gain=Low,                                                                                     
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005437305  0.9425287 0.005768848 3.786435   164
## [7]  {education=Masters,                                                                                    
##       marital_status=Married-civ-spouse,                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       capital_gain=Low,                                                                                     
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005437305  0.9425287 0.005768848 3.786435   164
## [8]  {education=Masters,                                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       sex=Male,                                                                                             
##       capital_gain=Low,                                                                                     
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005437305  0.9425287 0.005768848 3.786435   164
## [9]  {education=Masters,                                                                                    
##       marital_status=Married-civ-spouse,                                                                    
##       occupation=Exec-managerial,                                                                           
##       race=White,                                                                                           
##       sex=Male,                                                                                             
##       capital_gain=Low,                                                                                     
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005437305  0.9425287 0.005768848 3.786435   164
## [10] {education=Masters,                                                                                    
##       marital_status=Married-civ-spouse,                                                                    
##       occupation=Exec-managerial,                                                                           
##       relationship=Husband,                                                                                 
##       race=White,                                                                                           
##       sex=Male,                                                                                             
##       capital_gain=Low,                                                                                     
##       hours_per_week=Over-time,                                                                             
##       native_country=United-States}       => {income=>50K} 0.005437305  0.9425287 0.005768848 3.786435   164

My Findings: - Marriage matters: The attribute Married-civ-spouse appears in almost all the top rules. - Jobs matter: Managerial and professional jobs (Exec-managerial, Prof-specialty) are strong predictors of wealth. - Education helps: Having a Bachelor’s or Master’s degree increases the chance of high income, especially when combined with marriage.

Who earns less than 50k?

Now, I look at the opposite. What predicts an income of less than 50k? I sorted these rules by “confidence.”

rules_low_income <- subset(rules, rhs %in% "income=<=50K")
rules_low_income_sorted <- sort(rules_low_income, by = "confidence", decreasing = TRUE)

inspect(head(rules_low_income_sorted, 10))
##      lhs                                rhs                support confidence    coverage    lift count
## [1]  {occupation=Other-service,                                                                        
##       relationship=Other-relative}   => {income=<=50K} 0.006000928          1 0.006000928 1.33142   181
## [2]  {occupation=Transport-moving,                                                                     
##       relationship=Own-child}        => {income=<=50K} 0.005437305          1 0.005437305 1.33142   164
## [3]  {workclass=Private,                                                                               
##       education=12th,                                                                                  
##       marital_status=Never-married}  => {income=<=50K} 0.006100391          1 0.006100391 1.33142   184
## [4]  {age=Young,                                                                                       
##       marital_status=Never-married,                                                                    
##       native_country=Mexico}         => {income=<=50K} 0.006000928          1 0.006000928 1.33142   181
## [5]  {education=10th,                                                                                  
##       occupation=Other-service,                                                                        
##       native_country=United-States}  => {income=<=50K} 0.005669385          1 0.005669385 1.33142   171
## [6]  {age=Young,                                                                                       
##       education=10th,                                                                                  
##       relationship=Own-child}        => {income=<=50K} 0.005072608          1 0.005072608 1.33142   153
## [7]  {education=10th,                                                                                  
##       marital_status=Never-married,                                                                    
##       relationship=Own-child}        => {income=<=50K} 0.005138917          1 0.005138917 1.33142   155
## [8]  {education=10th,                                                                                  
##       marital_status=Never-married,                                                                    
##       hours_per_week=Part-time}      => {income=<=50K} 0.005404151          1 0.005404151 1.33142   163
## [9]  {education=10th,                                                                                  
##       relationship=Not-in-family,                                                                      
##       capital_loss=Low}              => {income=<=50K} 0.005934620          1 0.005934620 1.33142   179
## [10] {age=Young,                                                                                       
##       education=10th,                                                                                  
##       marital_status=Never-married}  => {income=<=50K} 0.007857569          1 0.007857569 1.33142   237

My Findings: - Age: Young people (age=Young) almost always earn less than 50k. This makes sense because they are just starting their careers. - Single status: People who have never married (Never-married) and have lower education (HS-grad) are very likely to have lower income.

Visualization

It is easier to understand these rules with a picture. I created a network graph. The circles represent the rules. The arrows show the connection between attributes and income.

plot(head(rules_high_income_sorted, 10), method = "graph", engine = "htmlwidget")
plot(rules_high_income, measure = c("support", "lift"), shading = "confidence", 
     main = "High Income Rules: Support vs Lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(head(rules_high_income_sorted, 10), 
     method = "paracoord", 
     control = list(reorder = TRUE),
     main = "Parallel Coordinates for High Income")

Conclusion

In this project, I successfully used Association Rule Mining on census data. I found clear profiles for different income groups. The results show that high income is not just about education. Marital status and job type are actually stronger indicators. This method is very useful. It allows me to find complex patterns in social data that are hard to see with simple charts.