In this project, I analyze the “Adult Census Income” dataset. I use a technique called Association Rule Mining. My goal is to find patterns that predict if a person earns a high income or a low income. I use the Apriori algorithm to find these rules. My results show that marital status and job type are very strong predictors of wealth. In fact, they are often more important than just education level alone.
I used data from the UCI Machine Learning Repository. It contains nearly 50,000 records from the 1994 Census. I loaded the data directly from the website.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
col_names <- c("age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race", "sex",
"capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
raw_data <- read.csv(url, header = FALSE, col.names = col_names,
stringsAsFactors = TRUE, strip.white = TRUE, na.strings = "?")
dim(raw_data)
## [1] 32561 15
Cleaning the Data The Apriori algorithm needs categories, not numbers. Therefore, I had to process the data. - Removing columns: I removed fnlwgt and education_num because I do not need them for this analysis. - Removing missing data: I removed rows that had empty values. - Discretization: I converted numbers into groups. For example, I grouped age into “Young,” “Middle-aged,” and “Senior.”
adult_clean <- raw_data[, -c(3, 5)]
adult_clean <- na.omit(adult_clean)
adult_clean[[ "age" ]] <- discretize(adult_clean[[ "age" ]], method = "frequency", breaks = 3, labels = c("Young", "Middle-aged", "Senior"))
adult_clean[[ "hours_per_week" ]] <- cut(adult_clean[[ "hours_per_week" ]],
breaks = c(0, 39, 40, 168),
labels = c("Part-time", "Full-time", "Over-time"),
include.lowest = TRUE)
adult_clean[[ "capital_gain" ]] <- discretize(adult_clean[[ "capital_gain" ]], method = "interval", breaks = 3, labels = c("Low", "Medium", "High"))
adult_clean[[ "capital_loss" ]] <- discretize(adult_clean[[ "capital_loss" ]], method = "interval", breaks = 3, labels = c("Low", "Medium", "High"))
trans_adult <- as(adult_clean, "transactions")
inspect(head(trans_adult, 3))
## items transactionID
## [1] {age=Middle-aged,
## workclass=State-gov,
## education=Bachelors,
## marital_status=Never-married,
## occupation=Adm-clerical,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital_gain=Low,
## capital_loss=Low,
## hours_per_week=Full-time,
## native_country=United-States,
## income=<=50K} 1
## [2] {age=Senior,
## workclass=Self-emp-not-inc,
## education=Bachelors,
## marital_status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital_gain=Low,
## capital_loss=Low,
## hours_per_week=Part-time,
## native_country=United-States,
## income=<=50K} 2
## [3] {age=Middle-aged,
## workclass=Private,
## education=HS-grad,
## marital_status=Divorced,
## occupation=Handlers-cleaners,
## relationship=Not-in-family,
## race=White,
## sex=Male,
## capital_gain=Low,
## capital_loss=Low,
## hours_per_week=Full-time,
## native_country=United-States,
## income=<=50K} 3
Before I run the algorithm, I want to see which items are the most common. I used a plot to visualize the top 10 attributes.
itemFrequencyPlot(trans_adult, topN = 10, type = "absolute",
main = "Top 10 Most Common Attributes",
col = "steelblue", xlab = "Attribute")
Observation: Most people in this dataset are from the United States. Also, most people have “Low” capital gain and “Low” capital loss. This is normal, but it means I need to look carefully to find interesting patterns.
I used the Apriori algorithm to generate rules. I set the minimum support to 0.5%. I set the minimum confidence to 80%. This ensures that I only find strong and reliable rules.
rules <- apriori(trans_adult,
parameter = list(supp = 0.005, conf = 0.8, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 150
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[112 item(s), 30162 transaction(s)] done [0.01s].
## sorting and recoding items ... [66 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(trans_adult, parameter = list(supp = 0.005, conf = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.33s].
## writing ... [713318 rule(s)] done [0.04s].
## creating S4 object ... done [0.18s].
summary(rules)
## set of 713318 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 303 4859 29323 90307 164287 189433 142721 70369 21716
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 6.000 7.000 6.853 8.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005006 Min. :0.8000 Min. :0.005006 Min. : 0.8385
## 1st Qu.:0.006498 1st Qu.:0.9086 1st Qu.:0.006896 1st Qu.: 1.0051
## Median :0.009184 Median :0.9565 Median :0.009781 Median : 1.0412
## Mean :0.016245 Mean :0.9430 Mean :0.017302 Mean : 1.3044
## 3rd Qu.:0.015980 3rd Qu.:0.9936 3rd Qu.:0.017008 3rd Qu.: 1.3022
## Max. :0.950269 Max. :1.0000 Max. :0.994927 Max. :32.0061
## count
## Min. : 151
## 1st Qu.: 196
## Median : 277
## Mean : 490
## 3rd Qu.: 482
## Max. :28662
##
## mining info:
## data ntransactions support confidence
## trans_adult 30162 0.005 0.8
## call
## apriori(data = trans_adult, parameter = list(supp = 0.005, conf = 0.8, minlen = 2))
I filtered the rules to look for high income. I selected rules where the result (RHS) is income=>50K. I sorted them by “lift.” Lift is a measure of how interesting a rule is. High lift means the relationship is very strong.
rules_high_income <- subset(rules, rhs %in% "income=>50K")
rules_high_income_sorted <- sort(rules_high_income, by = "lift", decreasing = TRUE)
inspect(head(rules_high_income_sorted, 10))
## lhs rhs support confidence coverage lift count
## [1] {education=Masters,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005636231 0.9444444 0.005967774 3.794131 170
## [2] {education=Masters,
## marital_status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005636231 0.9444444 0.005967774 3.794131 170
## [3] {education=Masters,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005636231 0.9444444 0.005967774 3.794131 170
## [4] {education=Masters,
## marital_status=Married-civ-spouse,
## occupation=Exec-managerial,
## race=White,
## sex=Male,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005636231 0.9444444 0.005967774 3.794131 170
## [5] {education=Masters,
## marital_status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005636231 0.9444444 0.005967774 3.794131 170
## [6] {education=Masters,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## capital_gain=Low,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005437305 0.9425287 0.005768848 3.786435 164
## [7] {education=Masters,
## marital_status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## capital_gain=Low,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005437305 0.9425287 0.005768848 3.786435 164
## [8] {education=Masters,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital_gain=Low,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005437305 0.9425287 0.005768848 3.786435 164
## [9] {education=Masters,
## marital_status=Married-civ-spouse,
## occupation=Exec-managerial,
## race=White,
## sex=Male,
## capital_gain=Low,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005437305 0.9425287 0.005768848 3.786435 164
## [10] {education=Masters,
## marital_status=Married-civ-spouse,
## occupation=Exec-managerial,
## relationship=Husband,
## race=White,
## sex=Male,
## capital_gain=Low,
## hours_per_week=Over-time,
## native_country=United-States} => {income=>50K} 0.005437305 0.9425287 0.005768848 3.786435 164
My Findings: - Marriage matters: The attribute Married-civ-spouse appears in almost all the top rules. - Jobs matter: Managerial and professional jobs (Exec-managerial, Prof-specialty) are strong predictors of wealth. - Education helps: Having a Bachelor’s or Master’s degree increases the chance of high income, especially when combined with marriage.
Now, I look at the opposite. What predicts an income of less than 50k? I sorted these rules by “confidence.”
rules_low_income <- subset(rules, rhs %in% "income=<=50K")
rules_low_income_sorted <- sort(rules_low_income, by = "confidence", decreasing = TRUE)
inspect(head(rules_low_income_sorted, 10))
## lhs rhs support confidence coverage lift count
## [1] {occupation=Other-service,
## relationship=Other-relative} => {income=<=50K} 0.006000928 1 0.006000928 1.33142 181
## [2] {occupation=Transport-moving,
## relationship=Own-child} => {income=<=50K} 0.005437305 1 0.005437305 1.33142 164
## [3] {workclass=Private,
## education=12th,
## marital_status=Never-married} => {income=<=50K} 0.006100391 1 0.006100391 1.33142 184
## [4] {age=Young,
## marital_status=Never-married,
## native_country=Mexico} => {income=<=50K} 0.006000928 1 0.006000928 1.33142 181
## [5] {education=10th,
## occupation=Other-service,
## native_country=United-States} => {income=<=50K} 0.005669385 1 0.005669385 1.33142 171
## [6] {age=Young,
## education=10th,
## relationship=Own-child} => {income=<=50K} 0.005072608 1 0.005072608 1.33142 153
## [7] {education=10th,
## marital_status=Never-married,
## relationship=Own-child} => {income=<=50K} 0.005138917 1 0.005138917 1.33142 155
## [8] {education=10th,
## marital_status=Never-married,
## hours_per_week=Part-time} => {income=<=50K} 0.005404151 1 0.005404151 1.33142 163
## [9] {education=10th,
## relationship=Not-in-family,
## capital_loss=Low} => {income=<=50K} 0.005934620 1 0.005934620 1.33142 179
## [10] {age=Young,
## education=10th,
## marital_status=Never-married} => {income=<=50K} 0.007857569 1 0.007857569 1.33142 237
My Findings: - Age: Young people (age=Young) almost always earn less than 50k. This makes sense because they are just starting their careers. - Single status: People who have never married (Never-married) and have lower education (HS-grad) are very likely to have lower income.
It is easier to understand these rules with a picture. I created a network graph. The circles represent the rules. The arrows show the connection between attributes and income.
plot(head(rules_high_income_sorted, 10), method = "graph", engine = "htmlwidget")
plot(rules_high_income, measure = c("support", "lift"), shading = "confidence",
main = "High Income Rules: Support vs Lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(head(rules_high_income_sorted, 10),
method = "paracoord",
control = list(reorder = TRUE),
main = "Parallel Coordinates for High Income")
In this project, I successfully used Association Rule Mining on census data. I found clear profiles for different income groups. The results show that high income is not just about education. Marital status and job type are actually stronger indicators. This method is very useful. It allows me to find complex patterns in social data that are hard to see with simple charts.