Income Level Classification

Introduction
Initial Settings
Exploratory Analysis
Models

Introduction

Introduction The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database.

Initial Settings

# Loading Packages data manipulating===================
library(tidyverse)
library(DT)  # datatable()
## graphics=======================
library(ggthemr)
library(scales)
library(ggridges)
## Models===================
library(tidymodels)
library(tune)
library(workflows)
# Loading our data
data <- read_csv("adult.txt")
# Set the plot theme
ggthemr("grape", type = "outer", layout = "scientific", spacing = 2)

Exploratory Analysis

The Census Income dataset has 48,842 entries. Each entry contains the following information about an individual:

age: the age of an individual
- Integer greater than 0
workclass: a general term to represent the employment status of an individual
- Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: this is the number of people the census believes the entry represents.
- Integer greater than 0
education: the highest level of education achieved by an individual.
- Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: the highest level of education achieved in numerical form.
- Integer greater than 0
marital-status: marital status of an individual. Married-civ-spouse corresponds to a civilian spouse while Married-AF-spouse is a spouse in the Armed Forces.
- Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: the general type of occupation of an individual
- Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute and is somewhat redundant with marital status. We might not make use of this attribute at all
- Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: Descriptions of an individual’s race
- White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: the biological sex of the individual
- Male, female
capital-gain: capital gains for an individual
- Integer greater than or equal to 0
capital-loss: capital loss for an individual
- Integer greater than or equal to 0
hours-per-week: the hours an individual has reported to work per week
- continuous
native-country: country of origin for an individual
- United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
the label: whether or not an individual makes more than $50,000 annually.
- <= 50K, >50K

The original dataset contains a distribution of 23.93% entries labeled with >50k and 76.07% entries labeled with <=50k. We split the dataset into training and test sets while maintaining the above distribution. The following graphs and statistics pertain to the training set.

Label	Number	Percentage
‘<= 50K’	7841	24.1
‘> 50K’	24720	75.9

Data Pre-processing

We want to do some pre-processing on the raw data for the up-coming analysis and modeling steps.

names(data) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", 
    "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", 
    "hours_per_week", "native_country", "class")


edu_level <- function(x) {
    if (x == "HS-grad") {
        "High_School_grad"
    } else if (x == "Bachelors" | x == "Some-college") {
        "Bachelors"
    } else if (x %in% c("11th", "9th", "7th-8th", "5th-6th", "10th", "1st-4th", "12th", 
        "preschool")) {
        "compulsory"
    } else if (x == "Assoc-acdm" | x == "Assoc-voc") {
        "Associate"
    } else {
        x
    }
}
education_level <- apply(data["education"], 1, edu_level)
data <- cbind(data, education_level)

data$class <- factor(data$class)
data$workclass <- factor(data$workclass)
data$education <- factor(data$education)
data$marital_status <- factor(data$marital_status)
data$occupation <- factor(data$occupation)
data$relationship <- factor(data$relationship)
data$race <- factor(data$race)
data$sex <- factor(data$sex)
data$native_country <- factor(data$native_country)
data$education_level <- factor(data$education_level)

Analysis on Continuous Variables

Age

ggplot(data, aes(x = age, fill = class)) + geom_density(alpha = 0.8) + ggtitle("The age distribution is right-skewed", 
    subtitle = "The median age of high-income groups is higher") + labs(y = NULL)

Initially, we make analysis on the age distribution: By applying ggplot() function, we can see that the distribution of the variable age is right-skewed, especially in the low-income groups (annual income <= 50K); Besides, we can also see that the median age of high-income group is higher, which implies the older generation prossess more wealth than the young.

fnlwgt

ggplot(data, aes(x = fnlwgt, fill = class)) + geom_density(alpha = 0.8) + ggtitle("No Significant Difference in fnlwgt") + 
    labs(y = NULL)

As we can see from the figure, there is no significant difference between two income communities in the variable ‘fnlwgt’. Hence we take deleting the variable for consideration.

Education

Next, Education. We want to know whether the year of education will affect individuals annual income. From the density plot we can see that, almost every person who earned 50K a year has recieved at least 8 years of education. As for the low income group, most of them have 9 to 10 years of education. The interesting part is that there is a lot of people with long-term education carrer but still gainning low annual income.

ggplot(data, aes(x = education_num, fill = class)) + geom_density(alpha = 0.8) + 
    ggtitle("Most low-income groups have 9-10 years of education") + labs(x = "education/year", 
    y = NULL)

In order to have a clear look on the impact that the year of education have on the annual income, we generate a boxplot. The boxplot can easily illustrates that High-income groups have more years of education.

data %>% ggplot(aes(x = class, y = education_num, fill = class)) + geom_boxplot() + 
    ggtitle("High-income groups have more years of education") + labs(y = "education/year")

Capital

capital_gain <- data %>% filter(capital_gain < 20000) %>% ggplot(aes(x = capital_gain, 
    fill = class)) + geom_density(alpha = 0.8) + ggtitle("The capital gain gap between the two income groups is huge") + 
    labs(y = NULL)
capital_loss <- data %>% ggplot(aes(x = capital_loss, fill = class)) + geom_density(alpha = 0.8) + 
    ggtitle("The high-income group peaked at 1,900") + labs(y = NULL)
Rmisc::multiplot(capital_gain, capital_loss)

As for the capital loss and capital gain, from the density plot above we can see that the capital gain gap between the two income groups is huge. On the other hand, the high-income group peaked at 1,900.

Discontinuous variable analysis

Workclass

Now let’s begin our analysis on discontinuous variable with workclass. The table below illustrates the frequency distribution of the variable workclass. In order to balance the datasets we need to do some further data manipulation.

as.data.frame(table(data$workclass)) %>% arrange(desc(Freq)) %>% datatable()

We decide to merge “?” and “unknown” into one working class. Then remove “Without-pay” and “Never-worked” class since there are only 21 of them.

data <- data %>% filter(!workclass %in% c("Without-pay", "Never-worked"))

data %>% group_by(class, workclass) %>% summarise(n = n()) %>% ggplot(aes(workclass, 
    n, fill = class)) + geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5, 
    "cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) + 
    coord_flip() + labs(x = NULL, y = NULL) + ggtitle("Imbalance inside Private workclass internal income")

After we cleaned the data, we use a barplot to demonstrate the frequency distribution. In the dataset, the majority of the samples’ workclass is private. It clear to see the imbalance inside private workclass internal income. We can infer that only 1/4 of the total observation have earned more than 50K a year.

Marital Status & Relationship

as.data.frame(table(data$marital_status)) %>% arrange(desc(Freq)) %>% datatable()

The frequency distribution barplot perfectly demonstrates the impact that marriage have on people’s annual income. Almost 90% of the never-married group earns less than 50K a year (Same thing within the divorced community). On the opposite, it reveals that most of the high income people are married couples.

data %>% filter(marital_status != "Married-AF-spouse", age > 20) %>% group_by(class, 
    marital_status) %>% summarise(n = n()) %>% ggplot(aes(marital_status, n, fill = class)) + 
    geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5, 
    "cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) + 
    coord_flip() + labs(x = NULL, y = NULL) + ggtitle("Most of the high income people are married")

as.data.frame(table(data$relationship)) %>% arrange(desc(Freq)) %>% datatable()

More, from the relationship variable, it seems like families with children are overwhelmingly low-income.

data %>% group_by(class, relationship) %>% summarise(n = n()) %>% ggplot(aes(relationship, 
    n, fill = class)) + geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5, 
    "cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) + 
    coord_flip() + labs(x = NULL, y = NULL) + ggtitle("Families with children are overwhelmingly low-income")

Occupation

as.data.frame(table(data$occupation)) %>% arrange(desc(Freq)) %>% datatable()

Here we take a close look into the occupation variable.

Other-service has the highest proportion of middle and low income
Adm-clerical also has a high proportion of low income.
Prof-specialty & Exec-managerial paid really well! Almost 50% of the people working in the Exec-managerial business can earn more than 50K a year.

occ <- as.data.frame(table(data$occupation)) %>% arrange(desc(Freq))
data %>% filter(occupation %in% head(occ, 7)$Var1) %>% group_by(class, occupation) %>% 
    summarise(n = n()) %>% ggplot(aes(occupation, n, fill = class)) + geom_bar(stat = "identity", 
    position = "dodge") + theme(axis.ticks.length = unit(0.5, "cm"), axis.text.x = element_text(angle = 330)) + 
    guides(fill = guide_legend(title = NULL)) + coord_flip() + labs(x = NULL, y = NULL) + 
    ggtitle("Prof-specialty & Exec-managerial paid really well!")

Race & Native Country

as.data.frame(table(data$race)) %>% arrange(desc(Freq)) %>% datatable()

When we look inside the samples in the race variable, most of the people are caucasian. Hence, there is a racial imbalance in the dataset that might make this variable less vital in the modeling section. More, if we compare the income distribution across different race, we don’t see any significant differences between income and race.

data %>% filter(!race %in% c("Amer-Indian-Eskimo", "Other")) %>% group_by(class, 
    race) %>% summarise(n = n()) %>% ggplot(aes(race, n, fill = class)) + geom_bar(stat = "identity", 
    position = "dodge") + theme(axis.ticks.length = unit(0.5, "cm"), axis.text.x = element_text(angle = 330)) + 
    guides(fill = guide_legend(title = NULL)) + coord_flip() + labs(x = NULL, y = NULL) + 
    ggtitle("No significant difference between income and race", subtitle = "Racial imbalance in dataset")

Now we make analysis basing on different countries. Likewise, there is also an imbalance inside the native_country variable. The U.S. citizens account for above 90% of the total observation. So we decide to take this analysis into two part: USA and other countries.

as.data.frame(table(data$native_country)) %>% arrange(desc(Freq)) %>% datatable()

data %>% filter(native_country == "United-States") %>% group_by(class) %>% summarise(n = n()) %>% 
    ggplot(aes(class, n, fill = class)) + geom_bar(stat = "identity", position = "dodge") + 
    theme(axis.ticks.length = unit(0.5, "cm")) + guides(fill = guide_legend(title = NULL)) + 
    labs(x = NULL, y = NULL) + ggtitle("Income from Americans", subtitle = "Low-income groups make up the majority")

At U.S, low-income groups make up the majority of the population.

data %>% filter(!native_country == "United-States") %>% group_by(class, native_country) %>% 
    summarise(n = n()) %>% arrange(desc(n)) %>% ggplot(aes(native_country, n, fill = class)) + 
    geom_bar(stat = "identity", position = "dodge") + theme(axis.ticks.length = unit(0.5, 
    "cm"), axis.text.x = element_text(angle = 330)) + guides(fill = guide_legend(title = NULL)) + 
    coord_flip() + labs(x = NULL, y = NULL)

Lower income countries:
- Vietnam
- Mexico
- Puerto-Rico
- Jamaica
- Guatemala
- El-Salvador
- Dominican-Republic
- Columbia
Higher income countries:
- Japan
- Iran
- India
- Germany
- England
- Canada
- United States of America

Sex

At last, we make analysis on gender. First, the pie plot below illustrates an imbalance distribution in gender, only about 1/3 of the total individuals are female. Therefore, it would be pointless to compare them directly, hence, we make the analysis on both gender respectively.

data %>% group_by(sex) %>% summarise(n = n()) %>% ggplot(aes(x = "", y = n, fill = sex)) + 
    geom_bar(stat = "identity", width = 2) + coord_polar("y") + geom_text(aes(label = n), 
    position = position_stack(vjust = 0.5), check_overlap = T, size = 5) + labs(x = NULL, 
    y = NULL, fill = NULL, title = "Proportion of Gender") + theme(axis.line = element_blank(), 
    axis.text = element_blank(), axis.ticks = element_blank(), plot.title = element_text(size = 14))

Within the male group, about 31% of the males can earn more than 50K a year. However, within the women group, only 11% of the females are able to make more than 50K a year. This implys that there exist a gap between the income of male and female.

male <- data %>% filter(sex == "Male") %>% group_by(class) %>% summarise(n = n()) %>% 
    ggplot(aes(x = "", y = n, fill = class)) + geom_bar(stat = "identity", width = 2) + 
    coord_polar("y") + geom_text(aes(label = paste(100 * round(n/21776, 2), "%")), 
    position = position_stack(vjust = 0.5), check_overlap = T, size = 5) + labs(x = NULL, 
    y = NULL, fill = NULL, title = "Male income distribution") + theme(axis.line = element_blank(), 
    axis.text = element_blank(), axis.ticks = element_blank(), plot.title = element_text(size = 14))
female <- data %>% filter(!sex == "Male") %>% group_by(class) %>% summarise(n = n()) %>% 
    ggplot(aes(x = "", y = n, fill = class)) + geom_bar(stat = "identity", width = 2) + 
    coord_polar("y") + geom_text(aes(label = paste(100 * round(n/10764, 2), "%")), 
    position = position_stack(vjust = 0.5), check_overlap = T, size = 5) + labs(x = NULL, 
    y = NULL, fill = NULL, title = "Female income distribution") + theme(axis.line = element_blank(), 
    axis.text = element_blank(), axis.ticks = element_blank(), plot.title = element_text(size = 14))
Rmisc::multiplot(male, female, cols = 2)

Models

At last, a model is built to make classification on annual income.

Model: Random Forest
- Step1: Create dummy variables for nominal predictors.
- Step2: Use Principle Componets Analysis to reduce the dimensionality of the dataset.
- Step3: Downsample the traindata to erase the imbalance problems within the labels.
- Steo4: Train the model
- Step5: Make predictions on testset
- Step6: Evaluation

We use auc(area under curve) to evaluate the models. The auc for our model is up to 0.9755!

# ========================================
set.seed(1024)
tr_te_split <- initial_split(data)

income_rec_best <- recipe(class ~ ., data) %>% step_dummy(all_nominal(), -all_outcomes()) %>% 
    step_pca(all_predictors(), num_comp = 10) %>% step_downsample(class) %>% prep()

income_training <- income_rec_best %>% juice()
income_testing <- income_rec_best %>% bake(testing(tr_te_split))

roc_vals <- metric_set(roc_auc)
ctrl <- control_grid(verbose = F)
# Random Forest
rf_mod <- rand_forest() %>% set_mode("classification") %>% set_engine("ranger") %>% 
    fit(class ~ ., data = income_training)
rf_mod

parsnip model object

Fit time:  31.8s 
Ranger result

Call:
 ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  500 
Sample size:                      15682 
Number of independent variables:  10 
Mtry:                             3 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.129078

# Predictions
predict_probs_rf <- predict(rf_mod, income_testing, type = "prob") %>% bind_cols(income_testing)
predict_probs_rf %>% gain_curve(class, `.pred_<=50K`) %>% autoplot()

predict_probs_rf %>% roc_auc(class, `.pred_<=50K`)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.976

predict_probs_rf %>% roc_curve(class, `.pred_<=50K`) %>% autoplot()