Name: Uyen Nguyen
Adult Income Level Prediction using Machine Learning Classification
Techniques
I. Introduction
This project utilizes the Adult Income Dataset from UCI Machine Learning Repository. The response variable is income level, which is binary data that takes 2 values. >50k indicates an individual earns more than $50000 annually while <=50k specifies that they make less than or equal to $50000 a year. The explanatory variables include age, workclass, final weight, education, education number, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, native country.
First I will perform data cleaning, then exploratory data analysis and data visualization to have the initial understanding of the relations between different features and income level. Finally, I will utilize machine learning models to perform the classification, namely random forest, logistic regression, support vector machine and decision tree to find out what methods are most effective in predicting an individual’s income based on many factors.
To begin with, I will upload the necessary library and load the csv file of the dataset into RStudio and check the first 6 rows to ensure the dataset is correct. There are 32561 rows and 15 columns.
#import library
library("ggplot2")
Warning: package ‘ggplot2’ was built under R version 4.2.3
library("dplyr")
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
#load file
df <- read.csv('C:/Users/Gia Uyen/Downloads/Adult_Income.csv')
print(head(df))
dim(df)
[1] 32561 15
Cleaning data
It is noticed that there are some columns labeled with the wrong data types. To fix it, I create a list of columns with wrong data types and convert them from character to factor. Also, it turns out that the missing values are not labeled as NA but instead with a question mark “?”. Thus, I replace “?” values with NA and then examine the percentage of NA values in the dataset. Na values appear in workclass, occupation and native country columns, with 5.63%, 5.66% and 1.79% respectively. Because the NA values only account for a small percentage, they are removed from the dataset. I also removed extra whitespace present in many values to avoid creating confusion when filtering out data.
#Missing values
sum(is.na(df))
[1] 0
#replace ? with NA
df[df == " ?"] <- NA
#Calculate NA percentage in the dataset
print(sapply(df, function(df){ sum(is.na(df)==T) * 100 /length(df) }))
age workclass final.weight education education.number marital.status occupation relationship
0.000000 5.638647 0.000000 0.000000 0.000000 0.000000 5.660146 0.000000
race sex capital.gain capital.loss hours.per.week native.country income
0.000000 0.000000 0.000000 0.000000 0.000000 1.790486 0.000000
#Because the NA values only account for more than 5% at max, so we can remove it
df <- na.omit(df)
sum(is.na(df))
[1] 0
#remove whitespace from the beginning of the string of each column
df[]<- lapply(df,trimws)
#change data types to numeric and factor
num_var <- c("age","final.weight","education.number",
"capital.gain","capital.loss","hours.per.week")
df[num_var] <- sapply(df[num_var], as.numeric)
categ_var <- c("workclass","education","marital.status",
"occupation","relationship","race",
"sex","native.country")
df[,categ_var] <- lapply(df[,categ_var],factor)
str(df)
'data.frame': 30162 obs. of 15 variables:
$ age : num 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 7 levels "Federal-gov",..: 6 5 3 3 3 3 3 5 3 3 ...
$ final.weight : num 77516 83311 215646 234721 338409 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ education.number: num 13 13 9 7 13 14 5 9 14 13 ...
$ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capital.gain : num 2174 0 0 0 0 ...
$ capital.loss : num 0 0 0 0 0 0 0 0 0 0 ...
$ hours.per.week : num 40 13 40 40 40 40 16 45 50 40 ...
$ native.country : Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
- attr(*, "na.action")= 'omit' Named int [1:2399] 15 28 39 52 62 70 78 94 107 129 ...
..- attr(*, "names")= chr [1:2399] "15" "28" "39" "52" ...
The graph below shows the correlation between the income level and quantitative features, including age, final weight, education number, capital gain, capital loss, hours per week. Since the income column is categorical data, I turned the data into 0 and 1 to be able to compare with other quantitative features, in which 0 represents income less than or equal to $50k and 1 represents income more than $50k. The correlation heatmap shows that the final weight does not have the least correlation with income, with r = -0.01 while education number appears to be more correlated with income (r = 0.34).
corr_var <- c("age","final.weight","education.number",
"capital.gain","capital.loss","hours.per.week","income")
df$income<-ifelse(df$income =='<=50K',0,1)
df$income <- as.numeric(df$income)
correlation <- round(cor(df[corr_var]),2)
correlation
age final.weight education.number capital.gain capital.loss hours.per.week income
age 1.00 -0.08 0.04 0.08 0.06 0.10 0.24
final.weight -0.08 1.00 -0.04 0.00 -0.01 -0.02 -0.01
education.number 0.04 -0.04 1.00 0.12 0.08 0.15 0.34
capital.gain 0.08 0.00 0.12 1.00 -0.03 0.08 0.22
capital.loss 0.06 -0.01 0.08 -0.03 1.00 0.05 0.15
hours.per.week 0.10 -0.02 0.15 0.08 0.05 1.00 0.23
income 0.24 -0.01 0.34 0.22 0.15 0.23 1.00
library(reshape2)
Warning: package ‘reshape2’ was built under R version 4.2.3
melted_corr <- melt(correlation)
ggplot(data = melted_corr, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
Because the final weight column doesn’t have much impact while capital gain and capital loss columns contain many zero values, I removed the columns out of the dataset. Additionally, I removed the native country column because a majority of people come from the United States and only a small number come from other countries.
#Because final weight column doesn't have much impact and capital gain and capital loss columns contain many zero value, so we will remove the columns out of the dataset
df$final.weight = NULL
df$capital.gain = NULL
df$capital.loss = NULL
df$native.country = NULL
summary(df)
age workclass education education.number marital.status occupation
Min. :17.00 Federal-gov : 943 HS-grad :9840 Min. : 1.00 Divorced : 4214 Prof-specialty :4038
1st Qu.:28.00 Local-gov : 2067 Some-college:6678 1st Qu.: 9.00 Married-AF-spouse : 21 Craft-repair :4030
Median :37.00 Private :22286 Bachelors :5044 Median :10.00 Married-civ-spouse :14065 Exec-managerial:3992
Mean :38.44 Self-emp-inc : 1074 Masters :1627 Mean :10.12 Married-spouse-absent: 370 Adm-clerical :3721
3rd Qu.:47.00 Self-emp-not-inc: 2499 Assoc-voc :1307 3rd Qu.:13.00 Never-married : 9726 Sales :3584
Max. :90.00 State-gov : 1279 11th :1048 Max. :16.00 Separated : 939 Other-service :3212
Without-pay : 14 (Other) :4618 Widowed : 827 (Other) :7585
relationship race sex hours.per.week income
Husband :12463 Amer-Indian-Eskimo: 286 Female: 9782 Min. : 1.00 Min. :0.0000
Not-in-family : 7726 Asian-Pac-Islander: 895 Male :20380 1st Qu.:40.00 1st Qu.:0.0000
Other-relative: 889 Black : 2817 Median :40.00 Median :0.0000
Own-child : 4466 Other : 231 Mean :40.93 Mean :0.2489
Unmarried : 3212 White :25933 3rd Qu.:45.00 3rd Qu.:0.0000
Wife : 1406 Max. :99.00 Max. :1.0000
Age vs Income
The figure below shows that the age distribution skews right, with a majority of the entries having ages between 27 and 50.
df$age = as.numeric(df$age)
hist(df$age,col = 'lavender', main = "Age Distribution",
xlab = "Age", ylab = "Number of people",breaks = 100,prob = T)
abline(v=quantile(df$age, .25), col='red', lwd = 2, lty = 'dashed')
abline(v=quantile(df$age, .50), col='red', lwd = 2, lty = 'dashed')
abline(v=quantile(df$age, .75), col='red', lwd = 2, lty = 'dashed')
lines(density(df$age),col='purple',lwd = 2)
The stacked histogram and boxplot below illustrate the income distribution by age. Most people who make more than $50k are between their mid thirties to mid fifties.
df$income <- as.factor(df$income)
ggplot(df, aes(x=age, fill=income)) +
geom_histogram(alpha=0.5, bins=30, color = 'black') +
geom_density(aes(y=after_stat(density), fill=income), alpha=0.5)+
scale_fill_manual(values=c("lightgreen", "salmon")) +
labs(x="Age", y="Count") +
ggtitle('Income Classification by Age')
theme_classic()
List of 97
$ line :List of 6
..$ colour : chr "black"
..$ linewidth : num 0.5
..$ linetype : num 1
..$ lineend : chr "butt"
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ rect :List of 5
..$ fill : chr "white"
..$ colour : chr "black"
..$ linewidth : num 0.5
..$ linetype : num 1
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ text :List of 11
..$ family : chr ""
..$ face : chr "plain"
..$ colour : chr "black"
..$ size : num 11
..$ hjust : num 0.5
..$ vjust : num 0.5
..$ angle : num 0
..$ lineheight : num 0.9
..$ margin : 'margin' num [1:4] 0points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ title : NULL
$ aspect.ratio : NULL
$ axis.title : NULL
$ axis.title.x :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 2.75points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.title.x.top :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 0
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 2.75points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.title.x.bottom : NULL
$ axis.title.y :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 1
..$ angle : num 90
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 2.75points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.title.y.left : NULL
$ axis.title.y.right :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 0
..$ angle : num -90
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 0points 2.75points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : chr "grey30"
..$ size : 'rel' num 0.8
..$ hjust : NULL
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.x :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 2.2points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.x.top :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 0
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 2.2points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.x.bottom : NULL
$ axis.text.y :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 1
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 2.2points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.y.left : NULL
$ axis.text.y.right :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 0
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 0points 2.2points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.ticks :List of 6
..$ colour : chr "grey20"
..$ linewidth : NULL
..$ linetype : NULL
..$ lineend : NULL
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ axis.ticks.x : NULL
$ axis.ticks.x.top : NULL
$ axis.ticks.x.bottom : NULL
$ axis.ticks.y : NULL
$ axis.ticks.y.left : NULL
$ axis.ticks.y.right : NULL
$ axis.ticks.length : 'simpleUnit' num 2.75points
..- attr(*, "unit")= int 8
$ axis.ticks.length.x : NULL
$ axis.ticks.length.x.top : NULL
$ axis.ticks.length.x.bottom: NULL
$ axis.ticks.length.y : NULL
$ axis.ticks.length.y.left : NULL
$ axis.ticks.length.y.right : NULL
$ axis.line :List of 6
..$ colour : chr "black"
..$ linewidth : 'rel' num 1
..$ linetype : NULL
..$ lineend : NULL
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ axis.line.x : NULL
$ axis.line.x.top : NULL
$ axis.line.x.bottom : NULL
$ axis.line.y : NULL
$ axis.line.y.left : NULL
$ axis.line.y.right : NULL
$ legend.background :List of 5
..$ fill : NULL
..$ colour : logi NA
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ legend.margin : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
..- attr(*, "unit")= int 8
$ legend.spacing : 'simpleUnit' num 11points
..- attr(*, "unit")= int 8
$ legend.spacing.x : NULL
$ legend.spacing.y : NULL
$ legend.key : list()
..- attr(*, "class")= chr [1:2] "element_blank" "element"
$ legend.key.size : 'simpleUnit' num 1.2lines
..- attr(*, "unit")= int 3
$ legend.key.height : NULL
$ legend.key.width : NULL
$ legend.text :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 0.8
..$ hjust : NULL
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ legend.text.align : NULL
$ legend.title :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 0
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ legend.title.align : NULL
$ legend.position : chr "right"
$ legend.direction : NULL
$ legend.justification : chr "center"
$ legend.box : NULL
$ legend.box.just : NULL
$ legend.box.margin : 'margin' num [1:4] 0cm 0cm 0cm 0cm
..- attr(*, "unit")= int 1
$ legend.box.background : list()
..- attr(*, "class")= chr [1:2] "element_blank" "element"
$ legend.box.spacing : 'simpleUnit' num 11points
..- attr(*, "unit")= int 8
$ panel.background :List of 5
..$ fill : chr "white"
..$ colour : logi NA
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ panel.border : list()
..- attr(*, "class")= chr [1:2] "element_blank" "element"
$ panel.spacing : 'simpleUnit' num 5.5points
..- attr(*, "unit")= int 8
$ panel.spacing.x : NULL
$ panel.spacing.y : NULL
$ panel.grid :List of 6
..$ colour : chr "grey92"
..$ linewidth : NULL
..$ linetype : NULL
..$ lineend : NULL
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ panel.grid.major : list()
..- attr(*, "class")= chr [1:2] "element_blank" "element"
$ panel.grid.minor : list()
..- attr(*, "class")= chr [1:2] "element_blank" "element"
$ panel.grid.major.x : NULL
$ panel.grid.major.y : NULL
$ panel.grid.minor.x : NULL
$ panel.grid.minor.y : NULL
$ panel.ontop : logi FALSE
$ plot.background :List of 5
..$ fill : NULL
..$ colour : chr "white"
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ plot.title :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 1.2
..$ hjust : num 0
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 5.5points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.title.position : chr "panel"
$ plot.subtitle :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 0
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 5.5points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.caption :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 0.8
..$ hjust : num 1
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 5.5points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.caption.position : chr "panel"
$ plot.tag :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 1.2
..$ hjust : num 0.5
..$ vjust : num 0.5
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.tag.position : chr "topleft"
$ plot.margin : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
..- attr(*, "unit")= int 8
$ strip.background :List of 5
..$ fill : chr "white"
..$ colour : chr "black"
..$ linewidth : 'rel' num 2
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ strip.background.x : NULL
$ strip.background.y : NULL
$ strip.clip : chr "inherit"
$ strip.placement : chr "inside"
$ strip.text :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : chr "grey10"
..$ size : 'rel' num 0.8
..$ hjust : NULL
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 4.4points 4.4points 4.4points 4.4points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ strip.text.x : NULL
$ strip.text.x.bottom : NULL
$ strip.text.x.top : NULL
$ strip.text.y :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : NULL
..$ angle : num -90
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ strip.text.y.left :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : NULL
..$ angle : num 90
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ strip.text.y.right : NULL
$ strip.switch.pad.grid : 'simpleUnit' num 2.75points
..- attr(*, "unit")= int 8
$ strip.switch.pad.wrap : 'simpleUnit' num 2.75points
..- attr(*, "unit")= int 8
- attr(*, "class")= chr [1:2] "theme" "gg"
- attr(*, "complete")= logi TRUE
- attr(*, "validate")= logi TRUE
ggplot(df, aes(x= income, y=age)) +
geom_boxplot(fill="salmon") +
labs(x="Income", y="Age") +
ggtitle("Age Distribution by Income Levels") +
theme_classic()
NA
For the workclass column, I combine “State-gov”, “Local-gov”, “Federal-gov” as “Government”, and “Self-emp-not-inc”, “Self-emp-inc” as “Self-employment”, and change “Without-pay” to “Unemployment”. This would allow me to interpret the models late on easier.
#Combining like factors of workclass column
df$workclass <- as.character(df$workclass)
df$workclass[df$workclass == "State-gov" | df$workclass == "Local-gov" | df$workclass == "Federal-gov"] <- "Government"
df$workclass[df$workclass == "Self-emp-not-inc" | df$workclass == "Self-emp-inc"] <- "Self_Employment"
df$workclass[df$workclass == "Without-pay"] <- "Unemployment"
unique(df$workclass)
[1] "Government" "Self_Employment" "Private" "Unemployment"
From the graph, people who work for private corporations tend to earn income more than $50k the most. There seems to be no distinguishable difference between working for Governmen and Self-Employment.
ggplot(df, aes(x=workclass, fill=income)) +
geom_bar() +
labs(x="Workclass", y="Count") +
ggtitle('Income Classification by Workclass')+
theme_classic()
Next, I also combine factors in Education column to make the data easier to interpret. “Bachelors”, “Masters”, “Doctorate”, “Prof-school” are lumped as “Higher Education”. “Assoc-acdm”, “Assoc-voc” are lumped as “Associates Degree”. “HS-grad”, “12th”, “11th”, “10th” are lumped as “High School”. I keep the “Some college” factor and the remaining factors are lumped as “Others”.
#Combining like factors of workclass column
df$education <- as.character(df$education)
df$education[df$education == "Bachelors" | df$education == "Masters" | df$education == "Doctorate" | df$education == "Prof-school"] <- "Higher Education"
df$education[df$education == "Assoc-acdm" | df$education == "Assoc-voc"] <- "Associates Degree"
df$education[df$education == "HS-grad" | df$education == "12th" | df$education == "11th" | df$education == "10th"] <- "High School"
df$education[df$education == "Some-college"] <- "Some College"
df$education[df$education == "Preschool" | df$education == "1st-4th" | df$education == "5th-6th" | df$education == "7th-8th" | df$education == "9th"] <- "Others"
unique(df$education)
[1] "Higher Education" "High School" "Others" "Some College" "Associates Degree"
In this dataset, the majority of people finished high school. However, people who have higher education (who have Bachelors degree or above) are most likely to to earn more than $50k. This result is consistent when we look at the education number. People who have 13 years of education tend to have higher income than others. Also, this observation explains the high correlation between education number and income columns as stated previously.
# plot the education distribution and sort the values
ggplot(df, aes(y = education, fill = income)) +
geom_bar(position = "dodge") +
scale_y_discrete(limits = rev(levels(factor(df$education)))) +
scale_fill_manual(values = c("lightgreen", "salmon")) +
labs(x = "Count", y = "Education Level", fill = "Income") +
theme_classic()
ggplot(df, aes(y = education.number,fill=income)) +
geom_bar() +
labs(x = "Count", y = "Years of education",
title = "Income Distribution by Years of education") +
theme_classic()
In the martial status column, “Married-civ-spouse”, “Married-spouse-absent”, “Married-AF-spouse” are combined as “Married”, and I change “Never-married” to “Single” and keep the remaining factors.
df$marital.status <- as.character(df$marital.status)
df$marital.status[df$marital.status == "Never_married"] <- "Single"
df$marital.status[df$marital.status == "Married-civ-spouse" | df$marital.status == "Married-spouse-absent" | df$marital.status == "Married-AF-spouse"] <- "Married"
unique(df$marital.status)
[1] "Never-married" "Married" "Divorced" "Separated" "Widowed"
I did the same to the occupation where I combine the factors into “White-collar”, “Blue-collar”, “Professisonal”, “Service”, “Sales” and “Other”.
df$occupation <- as.character(df$occupation)
df$occupation[df$occupation == "Adm-clerical" | df$occupation == "Exec-managerial"] <- "White-collar"
df$occupation[df$occupation == "Handlers-cleaners" |df$occupation == "Transport-moving" | df$occupation == "Farming-fishing" |df$occupation == "Machine-op-inspct" |df$occupation == "Craft-repair" ] <- "Blue-collar"
df$occupation[df$occupation == "Tech-support" | df$occupation == "Protective-serv" | df$occupation == "Priv-house-serv" | df$occupation == "Other-service"] <- "Service"
df$occupation[df$occupation == "Prof-specialty"] <- "Professional"
df$occupation[df$occupation == "Armed-Forces"] <- "Other"
unique(df$occupation)
[1] "White-collar" "Blue-collar" "Professional" "Service" "Sales" "Other"
In the bar graph, people who have white collar and professional jobs are more likely to earn more than $50k.
ggplot(df, aes(y = reorder(occupation, -table(occupation)[occupation]), fill = income)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set2") +
labs(y = "Occupation", x = "Count", fill = "Income")
It is noticeable that those who work 40 hours per week have much higher chance to earn higher than $50k than those who work overtime or undertime. This makes sense because those who work overtime often do blue-collar job and get paid the minimum wage so they still have a low income despite their working hours per week.
ggplot(df, aes(x= hours.per.week, fill=income)) +
geom_bar() +
labs(x="Hours per week", y="Count") +
ggtitle('Income Classification by Hours per week')+
theme_classic()
As shown in the graph and table below, 24720 people earn less than or equal to $50000 while there are only 7841 earning more than $50000, which creates class imbalance, a common problem in classification that can affect the model accuracy.
table(df$income)
0 1
22654 7508
barplot(table(df$income),main = 'Income Classification',col='pink',ylab ='Number of people')
df$income <- as.factor(df$income)
#A majority of people in this dataset earn below 50k
To perform machine learning techniques, the original dataset is split into training set, which accounts for 70% of the data, and testing set for the remaining 30% of the data. As mentioned earlier, the data imbalance in the response variable income needs to be addressed to prevent model inaccuracy. In a dataset with highly unbalanced classes, the classifier will always pick the most common one without actually performing any classification. Therefore, resampling techniques, oversampling and undersampling are used alternately to deal with class imbalance.
#split data
df$income <- as.factor(df$income)
library(caTools)
Warning: package ‘caTools’ was built under R version 4.2.3
set.seed(200)
sample <- sample.split(df$income,SplitRatio = 0.7)
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)
head(test)
#Address imbalanced data
library("ROSE")
Warning: package ‘ROSE’ was built under R version 4.2.3Loaded ROSE 0.0-4
balanced_data <- ovun.sample(income~.,data = train,method = "both")$data
print(table(train$income))
0 1
15858 5256
print(table(balanced_data$income))
0 1
10583 10531
Before I do resampling in the train data, only 5256 people who earn more than $50k and 15858 earn less than 50k, which will create class imbalance. However, after performing sampling technique, the number of people who earn more than $50k and those who earn less than $50k are roughly the same. This will enhance the performance of the models later on.
For every model created, we will use the function confusionMatrix to compare the model performance by using the accuracy, sensitivity, specificity of each model.
Logistic Regression The Logistic Regression model has the accuracy of 77.69%, sensitivity of 75.84%, and specificity of 83.26%
library('caret')
Warning: package ‘caret’ was built under R version 4.2.3Loading required package: lattice
library('lattice')
log_model <- glm(income ~ ., family = binomial(), balanced_data)
log_pred <- predict(log_model, test, type = "response")
log_pred <- ifelse(log_pred > 0.5, "1", "0")
confusionMatrix(as.factor(log_pred), as.factor(test$income))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5154 377
1 1642 1875
Accuracy : 0.7769
95% CI : (0.7681, 0.7854)
No Information Rate : 0.7511
P-Value [Acc > NIR] : 5.333e-09
Kappa : 0.4975
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7584
Specificity : 0.8326
Pos Pred Value : 0.9318
Neg Pred Value : 0.5331
Prevalence : 0.7511
Detection Rate : 0.5696
Detection Prevalence : 0.6113
Balanced Accuracy : 0.7955
'Positive' Class : 0
According to the logistic regression model, the most important features to determine whether an individual’s income is more than $50k is age, education number, hours per week, relationship and martial status.
summary(log_model)
Call:
glm(formula = income ~ ., family = binomial(), data = balanced_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3651 -0.6007 -0.0443 0.7098 3.0277
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.560940 0.451568 -16.744 < 2e-16 ***
age 0.034344 0.001805 19.025 < 2e-16 ***
workclassPrivate 0.002873 0.054182 0.053 0.95772
workclassSelf_Employment -0.283572 0.069979 -4.052 5.07e-05 ***
workclassUnemployment -12.854263 83.755868 -0.153 0.87803
educationHigh School 0.180923 0.101129 1.789 0.07361 .
educationHigher Education 0.285250 0.090577 3.149 0.00164 **
educationOthers 0.270863 0.244922 1.106 0.26876
educationSome College 0.176595 0.082572 2.139 0.03246 *
education.number 0.295505 0.028453 10.386 < 2e-16 ***
marital.statusMarried 0.618989 0.157202 3.938 8.23e-05 ***
marital.statusNever-married -0.571545 0.079204 -7.216 5.35e-13 ***
marital.statusSeparated -0.040462 0.148019 -0.273 0.78458
marital.statusWidowed 0.081058 0.137936 0.588 0.55677
occupationOther 0.063129 1.148453 0.055 0.95616
occupationProfessional 0.761634 0.070892 10.744 < 2e-16 ***
occupationSales 0.654951 0.064838 10.101 < 2e-16 ***
occupationService 0.202202 0.064654 3.127 0.00176 **
occupationWhite-collar 0.786312 0.052995 14.837 < 2e-16 ***
relationshipNot-in-family -0.880874 0.153632 -5.734 9.83e-09 ***
relationshipOther-relative -0.967560 0.185017 -5.230 1.70e-07 ***
relationshipOwn-child -2.148161 0.192497 -11.159 < 2e-16 ***
relationshipUnmarried -1.149195 0.170763 -6.730 1.70e-11 ***
relationshipWife 1.463172 0.102542 14.269 < 2e-16 ***
raceAsian-Pac-Islander 0.212937 0.248924 0.855 0.39231
raceBlack -0.285794 0.231320 -1.235 0.21665
raceOther -0.551936 0.336939 -1.638 0.10140
raceWhite -0.038290 0.221093 -0.173 0.86250
sexMale 1.156552 0.071179 16.248 < 2e-16 ***
hours.per.week 0.037388 0.001884 19.850 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 29270 on 21113 degrees of freedom
Residual deviance: 18189 on 21084 degrees of freedom
AIC: 18249
Number of Fisher Scoring iterations: 11
Random Forest
This time I build Random Forest model with the number of trees equals to 500. The Random Forest model has the accuracy of 78.15%, sensitivity of 82.86%, and specificity of 76.59%
library('randomForest')
Warning: package ‘randomForest’ was built under R version 4.2.3randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:ggplot2’:
margin
rf <- randomForest(income ~ ., data = balanced_data, ntree = 500)
rf.pred <- predict(rf, newdata = test)
confusionMatrix(as.factor(rf.pred),as.factor(test$income),positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5205 386
1 1591 1866
Accuracy : 0.7815
95% CI : (0.7728, 0.79)
No Information Rate : 0.7511
P-Value [Acc > NIR] : 6.48e-12
Kappa : 0.5043
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8286
Specificity : 0.7659
Pos Pred Value : 0.5398
Neg Pred Value : 0.9310
Prevalence : 0.2489
Detection Rate : 0.2062
Detection Prevalence : 0.3821
Balanced Accuracy : 0.7972
'Positive' Class : 1
Support Vector Machine The Support Vector Machine model has the accuracy of 76.77%, sensitivity of 87.30%, and specificity of 73.28%
library('e1071')
Warning: package ‘e1071’ was built under R version 4.2.3
svm_model <- svm(income ~ ., data = balanced_data)
svm.pred <- predict(svm_model, newdata = test)
confusionMatrix(svm.pred,test$income,positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4980 286
1 1816 1966
Accuracy : 0.7677
95% CI : (0.7588, 0.7764)
No Information Rate : 0.7511
P-Value [Acc > NIR] : 0.0001257
Kappa : 0.4937
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8730
Specificity : 0.7328
Pos Pred Value : 0.5198
Neg Pred Value : 0.9457
Prevalence : 0.2489
Detection Rate : 0.2173
Detection Prevalence : 0.4180
Balanced Accuracy : 0.8029
'Positive' Class : 1
Decision Tree
library("rpart")
library("rpart.plot")
Warning: package ‘rpart.plot’ was built under R version 4.2.3
dec_tree <- rpart(income~.,data=balanced_data,method='class')
rpart.plot(dec_tree, box.col=c("salmon", "lavender"))
The Decision Tree model has the accuracy of 79.05%, sensitivity of 73.49%, and specificity of 80.89%
dec_tree.pred <- predict(dec_tree, newdata = test,type="class")
confusionMatrix(dec_tree.pred,test$income,positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5497 597
1 1299 1655
Accuracy : 0.7905
95% CI : (0.7819, 0.7988)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4924
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7349
Specificity : 0.8089
Pos Pred Value : 0.5603
Neg Pred Value : 0.9020
Prevalence : 0.2489
Detection Rate : 0.1829
Detection Prevalence : 0.3265
Balanced Accuracy : 0.7719
'Positive' Class : 1
Performance Comparison As the graph shows, based on accuracy, the Decision model seems to have the best performance while the Support Vector Machine model has the lowest accuracy among four models. However, the accuracy difference between models is not large.
accuracy<-data.frame(Model=c('Logistic Regression','Random Forest','Support Vector Machine','Decision Tree'),accuracy_of_models = c(0.7769,0.7815,0.7677,0.7905))
ggplot(accuracy,aes(x=Model,y=accuracy_of_models,fill=Model))+geom_bar(stat = 'identity')+ggtitle('Accuracy of each model')
Application Building the machine learning models to predict whether an individual’s income will exceed $50k or not can have a huge application in real life. It can benefit researches about income inequality and inform the government on which groups of people might not have a good living standard and need financial assistance. One limitation of this study is that because this is a classification problem, we cannot build predictive model to predict the actual income of an individual.