Income Level Classification

Introduction

Introduction The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database.

Exploratory Analysis

The Census Income dataset has 48,842 entries. Each entry contains the following information about an individual:

  • age: the age of an individual
    • Integer greater than 0
  • workclass: a general term to represent the employment status of an individual
    • Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: this is the number of people the census believes the entry represents.
    • Integer greater than 0
  • education: the highest level of education achieved by an individual.
    • Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: the highest level of education achieved in numerical form.
    • Integer greater than 0
  • marital-status: marital status of an individual. Married-civ-spouse corresponds to a civilian spouse while Married-AF-spouse is a spouse in the Armed Forces.
    • Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: the general type of occupation of an individual
    • Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute and is somewhat redundant with marital status. We might not make use of this attribute at all
    • Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Descriptions of an individual’s race
    • White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: the biological sex of the individual
    • Male, female
  • capital-gain: capital gains for an individual
    • Integer greater than or equal to 0
  • capital-loss: capital loss for an individual
    • Integer greater than or equal to 0
  • hours-per-week: the hours an individual has reported to work per week
    • continuous
  • native-country: country of origin for an individual
    • United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • the label: whether or not an individual makes more than $50,000 annually.
    • <= 50K, >50K

The original dataset contains a distribution of 23.93% entries labeled with >50k and 76.07% entries labeled with <=50k. We split the dataset into training and test sets while maintaining the above distribution. The following graphs and statistics pertain to the training set.

Label Number Percentage
‘<= 50K’ 7841 24.1
‘> 50K’ 24720 75.9

Analysis on Continuous Variables

Age

Initially, we make analysis on the age distribution: By applying ggplot() function, we can see that the distribution of the variable age is right-skewed, especially in the low-income groups (annual income <= 50K); Besides, we can also see that the median age of high-income group is higher, which implies the older generation prossess more wealth than the young.

fnlwgt

As we can see from the figure, there is no significant difference between two income communities in the variable ‘fnlwgt’. Hence we take deleting the variable for consideration.

Education

Next, Education. We want to know whether the year of education will affect individuals annual income. From the density plot we can see that, almost every person who earned 50K a year has recieved at least 8 years of education. As for the low income group, most of them have 9 to 10 years of education. The interesting part is that there is a lot of people with long-term education carrer but still gainning low annual income.

In order to have a clear look on the impact that the year of education have on the annual income, we generate a boxplot. The boxplot can easily illustrates that High-income groups have more years of education.

Discontinuous variable analysis

Workclass

Now let’s begin our analysis on discontinuous variable with workclass. The table below illustrates the frequency distribution of the variable workclass. In order to balance the datasets we need to do some further data manipulation.

We decide to merge “?” and “unknown” into one working class. Then remove “Without-pay” and “Never-worked” class since there are only 21 of them.

After we cleaned the data, we use a barplot to demonstrate the frequency distribution. In the dataset, the majority of the samples’ workclass is private. It clear to see the imbalance inside private workclass internal income. We can infer that only 1/4 of the total observation have earned more than 50K a year.

Marital Status & Relationship

The frequency distribution barplot perfectly demonstrates the impact that marriage have on people’s annual income. Almost 90% of the never-married group earns less than 50K a year (Same thing within the divorced community). On the opposite, it reveals that most of the high income people are married couples.

More, from the relationship variable, it seems like families with children are overwhelmingly low-income.

Occupation

Here we take a close look into the occupation variable.

  • Other-service has the highest proportion of middle and low income

  • Adm-clerical also has a high proportion of low income.

  • Prof-specialty & Exec-managerial paid really well! Almost 50% of the people working in the Exec-managerial business can earn more than 50K a year.

Race & Native Country

When we look inside the samples in the race variable, most of the people are caucasian. Hence, there is a racial imbalance in the dataset that might make this variable less vital in the modeling section. More, if we compare the income distribution across different race, we don’t see any significant differences between income and race.

Now we make analysis basing on different countries. Likewise, there is also an imbalance inside the native_country variable. The U.S. citizens account for above 90% of the total observation. So we decide to take this analysis into two part: USA and other countries.

At U.S, low-income groups make up the majority of the population.

  • Lower income countries:
    • Vietnam
    • Mexico
    • Puerto-Rico
    • Jamaica
    • Guatemala
    • El-Salvador
    • Dominican-Republic
    • Columbia
  • Higher income countries:
    • Japan
    • Iran
    • India
    • Germany
    • England
    • Canada
    • United States of America

Sex

At last, we make analysis on gender. First, the pie plot below illustrates an imbalance distribution in gender, only about 1/3 of the total individuals are female. Therefore, it would be pointless to compare them directly, hence, we make the analysis on both gender respectively.

Within the male group, about 31% of the males can earn more than 50K a year. However, within the women group, only 11% of the females are able to make more than 50K a year. This implys that there exist a gap between the income of male and female.

Models

At last, a model is built to make classification on annual income.

  • Model: Random Forest
    • Step1: Create dummy variables for nominal predictors.
    • Step2: Use Principle Componets Analysis to reduce the dimensionality of the dataset.
    • Step3: Downsample the traindata to erase the imbalance problems within the labels.
    • Steo4: Train the model
    • Step5: Make predictions on testset
    • Step6: Evaluation

We use auc(area under curve) to evaluate the models. The auc for our model is up to 0.9755!

parsnip model object

Fit time:  31.8s 
Ranger result

Call:
 ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  500 
Sample size:                      15682 
Number of independent variables:  10 
Mtry:                             3 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.129078 

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.976

Net Zissou

2020-01-04