Income Level Classification
Introduction
Introduction The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database.
Initial Settings
# Loading Packages data manipulating===================
library(tidyverse)
library(DT) # datatable()
## graphics=======================
library(ggthemr)
library(scales)
library(ggridges)
## Models===================
library(tidymodels)
library(tune)
library(workflows)
# Loading our data
data <- read_csv("adult.txt")
# Set the plot theme
ggthemr("grape", type = "outer", layout = "scientific", spacing = 2)Exploratory Analysis
The Census Income dataset has 48,842 entries. Each entry contains the following information about an individual:
- age: the age of an individual
- Integer greater than 0
- workclass: a general term to represent the employment status of an individual
- Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: this is the number of people the census believes the entry represents.
- Integer greater than 0
- education: the highest level of education achieved by an individual.
- Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: the highest level of education achieved in numerical form.
- Integer greater than 0
- marital-status: marital status of an individual. Married-civ-spouse corresponds to a civilian spouse while Married-AF-spouse is a spouse in the Armed Forces.
- Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: the general type of occupation of an individual
- Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute and is somewhat redundant with marital status. We might not make use of this attribute at all
- Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: Descriptions of an individual’s race
- White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: the biological sex of the individual
- Male, female
- capital-gain: capital gains for an individual
- Integer greater than or equal to 0
- capital-loss: capital loss for an individual
- Integer greater than or equal to 0
- hours-per-week: the hours an individual has reported to work per week
- continuous
- native-country: country of origin for an individual
- United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- the label: whether or not an individual makes more than $50,000 annually.
- <= 50K, >50K
The original dataset contains a distribution of 23.93% entries labeled with >50k and 76.07% entries labeled with <=50k. We split the dataset into training and test sets while maintaining the above distribution. The following graphs and statistics pertain to the training set.
| Label | Number | Percentage |
|---|---|---|
| ‘<= 50K’ | 7841 | 24.1 |
| ‘> 50K’ | 24720 | 75.9 |
Data Pre-processing
We want to do some pre-processing on the raw data for the up-coming analysis and modeling steps.
names(data) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
"occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
"hours_per_week", "native_country", "class")
edu_level <- function(x) {
if (x == "HS-grad") {
"High_School_grad"
} else if (x == "Bachelors" | x == "Some-college") {
"Bachelors"
} else if (x %in% c("11th", "9th", "7th-8th", "5th-6th", "10th", "1st-4th", "12th",
"preschool")) {
"compulsory"
} else if (x == "Assoc-acdm" | x == "Assoc-voc") {
"Associate"
} else {
x
}
}
education_level <- apply(data["education"], 1, edu_level)
data <- cbind(data, education_level)
data$class <- factor(data$class)
data$workclass <- factor(data$workclass)
data$education <- factor(data$education)
data$marital_status <- factor(data$marital_status)
data$occupation <- factor(data$occupation)
data$relationship <- factor(data$relationship)
data$race <- factor(data$race)
data$sex <- factor(data$sex)
data$native_country <- factor(data$native_country)
data$education_level <- factor(data$education_level)Analysis on Continuous Variables
Age
ggplot(data, aes(x = age, fill = class)) + geom_density(alpha = 0.8) + ggtitle("The age distribution is right-skewed",
subtitle = "The median age of high-income groups is higher") + labs(y = NULL)Initially, we make analysis on the age distribution: By applying ggplot() function, we can see that the distribution of the variable age is right-skewed, especially in the low-income groups (annual income <= 50K); Besides, we can also see that the median age of high-income group is higher, which implies the older generation prossess more wealth than the young.
fnlwgt
ggplot(data, aes(x = fnlwgt, fill = class)) + geom_density(alpha = 0.8) + ggtitle("No Significant Difference in fnlwgt") +
labs(y = NULL)As we can see from the figure, there is no significant difference between two income communities in the variable ‘fnlwgt’. Hence we take deleting the variable for consideration.
Education
Next, Education. We want to know whether the year of education will affect individuals annual income. From the density plot we can see that, almost every person who earned 50K a year has recieved at least 8 years of education. As for the low income group, most of them have 9 to 10 years of education. The interesting part is that there is a lot of people with long-term education carrer but still gainning low annual income.
ggplot(data, aes(x = education_num, fill = class)) + geom_density(alpha = 0.8) +
ggtitle("Most low-income groups have 9-10 years of education") + labs(x = "education/year",
y = NULL)In order to have a clear look on the impact that the year of education have on the annual income, we generate a boxplot. The boxplot can easily illustrates that High-income groups have more years of education.
data %>% ggplot(aes(x = class, y = education_num, fill = class)) + geom_boxplot() +
ggtitle("High-income groups have more years of education") + labs(y = "education/year")Capital
capital_gain <- data %>% filter(capital_gain < 20000) %>% ggplot(aes(x = capital_gain,
fill = class)) + geom_density(alpha = 0.8) + ggtitle("The capital gain gap between the two income groups is huge") +
labs(y = NULL)
capital_loss <- data %>% ggplot(aes(x = capital_loss, fill = class)) + geom_density(alpha = 0.8) +
ggtitle("The high-income group peaked at 1,900") + labs(y = NULL)
Rmisc::multiplot(capital_gain, capital_loss)As for the capital loss and capital gain, from the density plot above we can see that the capital gain gap between the two income groups is huge. On the other hand, the high-income group peaked at 1,900.
Discontinuous variable analysis
Workclass
Now let’s begin our analysis on discontinuous variable with workclass. The table below illustrates the frequency distribution of the variable workclass. In order to balance the datasets we need to do some further data manipulation.
We decide to merge “?” and “unknown” into one working class. Then remove “Without-pay” and “Never-worked” class since there are only 21 of them.
data <- data %>% filter(!workclass %in% c("Without-pay", "Never-worked"))
data %>% group_by(class, workclass) %>% summarise(n = n()) %>% ggplot(aes(workclass,
n, fill = class)) + geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5,
"cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) +
coord_flip() + labs(x = NULL, y = NULL) + ggtitle("Imbalance inside Private workclass internal income")After we cleaned the data, we use a barplot to demonstrate the frequency distribution. In the dataset, the majority of the samples’ workclass is private. It clear to see the imbalance inside private workclass internal income. We can infer that only 1/4 of the total observation have earned more than 50K a year.
Marital Status & Relationship
The frequency distribution barplot perfectly demonstrates the impact that marriage have on people’s annual income. Almost 90% of the never-married group earns less than 50K a year (Same thing within the divorced community). On the opposite, it reveals that most of the high income people are married couples.data %>% filter(marital_status != "Married-AF-spouse", age > 20) %>% group_by(class,
marital_status) %>% summarise(n = n()) %>% ggplot(aes(marital_status, n, fill = class)) +
geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5,
"cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) +
coord_flip() + labs(x = NULL, y = NULL) + ggtitle("Most of the high income people are married")data %>% group_by(class, relationship) %>% summarise(n = n()) %>% ggplot(aes(relationship,
n, fill = class)) + geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5,
"cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) +
coord_flip() + labs(x = NULL, y = NULL) + ggtitle("Families with children are overwhelmingly low-income")Occupation
Here we take a close look into the occupation variable.Other-service has the highest proportion of middle and low income
Adm-clerical also has a high proportion of low income.
Prof-specialty & Exec-managerial paid really well! Almost 50% of the people working in the Exec-managerial business can earn more than 50K a year.
occ <- as.data.frame(table(data$occupation)) %>% arrange(desc(Freq))
data %>% filter(occupation %in% head(occ, 7)$Var1) %>% group_by(class, occupation) %>%
summarise(n = n()) %>% ggplot(aes(occupation, n, fill = class)) + geom_bar(stat = "identity",
position = "dodge") + theme(axis.ticks.length = unit(0.5, "cm"), axis.text.x = element_text(angle = 330)) +
guides(fill = guide_legend(title = NULL)) + coord_flip() + labs(x = NULL, y = NULL) +
ggtitle("Prof-specialty & Exec-managerial paid really well!")Race & Native Country
When we look inside the samples in the race variable, most of the people are caucasian. Hence, there is a racial imbalance in the dataset that might make this variable less vital in the modeling section. More, if we compare the income distribution across different race, we don’t see any significant differences between income and race.data %>% filter(!race %in% c("Amer-Indian-Eskimo", "Other")) %>% group_by(class,
race) %>% summarise(n = n()) %>% ggplot(aes(race, n, fill = class)) + geom_bar(stat = "identity",
position = "dodge") + theme(axis.ticks.length = unit(0.5, "cm"), axis.text.x = element_text(angle = 330)) +
guides(fill = guide_legend(title = NULL)) + coord_flip() + labs(x = NULL, y = NULL) +
ggtitle("No significant difference between income and race", subtitle = "Racial imbalance in dataset")Now we make analysis basing on different countries. Likewise, there is also an imbalance inside the native_country variable. The U.S. citizens account for above 90% of the total observation. So we decide to take this analysis into two part: USA and other countries.
data %>% filter(native_country == "United-States") %>% group_by(class) %>% summarise(n = n()) %>%
ggplot(aes(class, n, fill = class)) + geom_bar(stat = "identity", position = "dodge") +
theme(axis.ticks.length = unit(0.5, "cm")) + guides(fill = guide_legend(title = NULL)) +
labs(x = NULL, y = NULL) + ggtitle("Income from Americans", subtitle = "Low-income groups make up the majority") At U.S, low-income groups make up the majority of the population.
data %>% filter(!native_country == "United-States") %>% group_by(class, native_country) %>%
summarise(n = n()) %>% arrange(desc(n)) %>% ggplot(aes(native_country, n, fill = class)) +
geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5,
"cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) +
coord_flip() + labs(x = NULL, y = NULL)- Lower income countries:
- Vietnam
- Mexico
- Puerto-Rico
- Jamaica
- Guatemala
- El-Salvador
- Dominican-Republic
- Columbia
- Higher income countries:
- Japan
- Iran
- India
- Germany
- England
- Canada
- United States of America
Sex
At last, we make analysis on gender. First, the pie plot below illustrates an imbalance distribution in gender, only about 1/3 of the total individuals are female. Therefore, it would be pointless to compare them directly, hence, we make the analysis on both gender respectively.
data %>% group_by(sex) %>% summarise(n = n()) %>% ggplot(aes(x = "", y = n, fill = sex)) +
geom_bar(stat = "identity", width = 2) + coord_polar("y") + geom_text(aes(label = n),
position = position_stack(vjust = 0.5), check_overlap = T, size = 5) + labs(x = NULL,
y = NULL, fill = NULL, title = "Proportion of Gender") + theme(axis.line = element_blank(),
axis.text = element_blank(), axis.ticks = element_blank(), plot.title = element_text(size = 14))Within the male group, about 31% of the males can earn more than 50K a year. However, within the women group, only 11% of the females are able to make more than 50K a year. This implys that there exist a gap between the income of male and female.
male <- data %>% filter(sex == "Male") %>% group_by(class) %>% summarise(n = n()) %>%
ggplot(aes(x = "", y = n, fill = class)) + geom_bar(stat = "identity", width = 2) +
coord_polar("y") + geom_text(aes(label = paste(100 * round(n/21776, 2), "%")),
position = position_stack(vjust = 0.5), check_overlap = T, size = 5) + labs(x = NULL,
y = NULL, fill = NULL, title = "Male income distribution") + theme(axis.line = element_blank(),
axis.text = element_blank(), axis.ticks = element_blank(), plot.title = element_text(size = 14))
female <- data %>% filter(!sex == "Male") %>% group_by(class) %>% summarise(n = n()) %>%
ggplot(aes(x = "", y = n, fill = class)) + geom_bar(stat = "identity", width = 2) +
coord_polar("y") + geom_text(aes(label = paste(100 * round(n/10764, 2), "%")),
position = position_stack(vjust = 0.5), check_overlap = T, size = 5) + labs(x = NULL,
y = NULL, fill = NULL, title = "Female income distribution") + theme(axis.line = element_blank(),
axis.text = element_blank(), axis.ticks = element_blank(), plot.title = element_text(size = 14))
Rmisc::multiplot(male, female, cols = 2)Models
At last, a model is built to make classification on annual income.
- Model: Random Forest
- Step1: Create dummy variables for nominal predictors.
- Step2: Use Principle Componets Analysis to reduce the dimensionality of the dataset.
- Step3: Downsample the traindata to erase the imbalance problems within the labels.
- Steo4: Train the model
- Step5: Make predictions on testset
- Step6: Evaluation
We use auc(area under curve) to evaluate the models. The auc for our model is up to 0.9755!
# ========================================
set.seed(1024)
tr_te_split <- initial_split(data)
income_rec_best <- recipe(class ~ ., data) %>% step_dummy(all_nominal(), -all_outcomes()) %>%
step_pca(all_predictors(), num_comp = 10) %>% step_downsample(class) %>% prep()
income_training <- income_rec_best %>% juice()
income_testing <- income_rec_best %>% bake(testing(tr_te_split))
roc_vals <- metric_set(roc_auc)
ctrl <- control_grid(verbose = F)
# Random Forest
rf_mod <- rand_forest() %>% set_mode("classification") %>% set_engine("ranger") %>%
fit(class ~ ., data = income_training)
rf_modparsnip model object
Fit time: 31.8s
Ranger result
Call:
ranger::ranger(formula = formula, data = data, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 500
Sample size: 15682
Number of independent variables: 10
Mtry: 3
Target node size: 10
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.129078
# Predictions
predict_probs_rf <- predict(rf_mod, income_testing, type = "prob") %>% bind_cols(income_testing)
predict_probs_rf %>% gain_curve(class, `.pred_<=50K`) %>% autoplot()# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.976