The Challenge

Our goal here is to predict a person's salary bucket: high salary, meaning earning more than $50,000 per year, or low salary, meaning less than $50,000 a year.

To do that, we have access to several features such as the person's age, race, and education.

We will tackle this problem in several parts:

  • First, we will set up our working environment and import all the data we have access to
  • Then, we will explore the data available: how many features there are, what types they are, and how the data is distributed
  • That work will let us dive into each feature and decide which ones to use for our classifier
  • Finally, we will build a classification model, train and test it, and report the main results of our work

Set Up the Working Environment

Import the necessary packages

library(rmarkdown)
library(knitr)
library(dplyr)
library(gdata)
library(reshape2)
library(ggplot2)
library(scales)
library(lsr)
library(randomForest)

Choose the main directory

First, choose the directory where the 3 files below are stored:

  • census_income_metadata.txt
  • census_income_learn.csv
  • census_income_test.csv

main_repository = "~/Desktop/US_census/us_census_full"
setwd(main_repository) # Go into that directory

Import the train and test datasets

Let’s import and trim our two datasets to remove leading and trailing spaces.

raw_train_data = trim(read.table("census_income_learn.csv", sep=",", header=FALSE, stringsAsFactors=TRUE))
raw_test_data = trim(read.table("census_income_test.csv", sep=",", header=FALSE, stringsAsFactors=TRUE))

Finding the column names of our datasets

First, looking at the training set, I saw that it doesn't contain the column names, so my first task was to associate each column of this big dataset with its corresponding name.
For that, I looked at the metadata, tried to understand its structure, and matched the descriptions against our training set.
I noticed that at the end of the metadata, each column has its name, then a ':', then a list of its distinct values.
After several spot checks, it appeared that this list maps exactly to the columns of the training set, in the same order.
So instead of processing this manually, I copied and pasted that last part into a txt file that I called find_columns_name.txt.
The pattern is simple: each column is on its own line and its name comes before the ':'. I have also removed the line '| instance weight: ignore.'. Below is my code to process that:

fileName <- "find_columns_name.txt"
lines <- readLines(fileName)
# Keep everything before the first ":" of each line as the column name
columns_name <- sapply(strsplit(lines, ":"), `[`, 1)

The last column of our datasets is the output of our classifier which is a binary variable:

  • “50000+.”
  • “- 50000.”

And we will just call it “salary_classification”.

columns_name = c(columns_name,"salary_classification")

Several columns have names containing spaces or "-".
Let's replace those with underscores "_" and also remove any apostrophes "'":

columns_name = gsub(" ", "_", columns_name)
columns_name = gsub("-", "_", columns_name)
columns_name = gsub("'", "", columns_name)

All the other columns will be our features.

Associate the columns name to our train and test datasets

Our training and testing datasets don't have named columns yet, so let's label them using the result of our previous processing:

colnames(raw_train_data) = columns_name
colnames(raw_test_data) = columns_name

Reformat output values

I don't really like the fact that the two levels of the output are not formatted consistently (" 50000+." and " - 50000.").
I will make them consistent by transforming:

  • “50000+.” to “+50,000”
  • “- 50000.” to “-50,000”

raw_train_data$salary_classification = ifelse(raw_train_data$salary_classification == "50000+.", "+50,000", "-50,000")
raw_test_data$salary_classification = ifelse(raw_test_data$salary_classification == "50000+.", "+50,000", "-50,000")

Remove the feature instance weight

As instructed in the metadata file, we can ignore this feature for our model and analysis:
"This attribute should *not* be used in the classifiers, so it is set to "ignore" in this file."

raw_train_data$instance_weight = NULL
raw_test_data$instance_weight = NULL

Final shape of our datasets

number_columns = ncol(raw_train_data)
train_nrow = nrow(raw_train_data)
test_nrow = nrow(raw_test_data)

train_number_high_salary = sum(raw_train_data$salary_classification == "+50,000")
train_number_low_salary = sum(raw_train_data$salary_classification != "+50,000")

test_number_high_salary = sum(raw_test_data$salary_classification == "+50,000")
test_number_low_salary = sum(raw_test_data$salary_classification != "+50,000")
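
A quick way to print these proportions (a small sketch; the original report interpolates the numbers inline):

cat(sprintf("train: %d rows, %.2f%% high salary\n", train_nrow, 100 * train_number_high_salary / train_nrow))
cat(sprintf("test:  %d rows, %.2f%% high salary\n", test_nrow, 100 * test_number_high_salary / test_nrow))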

Below are the shapes of our two datasets:

  • raw_train_data:
    • number of columns: 41
    • number of rows: 199523
    • proportion of high salary: 6.21%
    • proportion of low salary: 93.79%
  • raw_test_data:
    • number of columns: 41
    • number of rows: 99762
    • proportion of high salary: 6.2%
    • proportion of low salary: 93.8%

We clearly see that our data is skewed towards the low salary bucket.
This finding will be important later: we will need to account for it when interpreting our insights and when training our models.

Now that we have set up our working environment, we can start the real work.

Understand our Variables

What is the class of our columns (integer or factor, for example)

for (i in 1:number_columns) {
  cat(paste0("- Type of the column ", colnames(raw_train_data)[i],": ", class(raw_train_data[,i]),"\n"))
}
## - Type of the column age: integer
## - Type of the column class_of_worker: factor
## - Type of the column detailed_industry_recode: integer
## - Type of the column detailed_occupation_recode: integer
## - Type of the column education: factor
## - Type of the column wage_per_hour: integer
## - Type of the column enroll_in_edu_inst_last_wk: factor
## - Type of the column marital_stat: factor
## - Type of the column major_industry_code: factor
## - Type of the column major_occupation_code: factor
## - Type of the column race: factor
## - Type of the column hispanic_origin: factor
## - Type of the column sex: factor
## - Type of the column member_of_a_labor_union: factor
## - Type of the column reason_for_unemployment: factor
## - Type of the column full_or_part_time_employment_stat: factor
## - Type of the column capital_gains: integer
## - Type of the column capital_losses: integer
## - Type of the column dividends_from_stocks: integer
## - Type of the column tax_filer_stat: factor
## - Type of the column region_of_previous_residence: factor
## - Type of the column state_of_previous_residence: factor
## - Type of the column detailed_household_and_family_stat: factor
## - Type of the column detailed_household_summary_in_household: factor
## - Type of the column migration_code_change_in_msa: factor
## - Type of the column migration_code_change_in_reg: factor
## - Type of the column migration_code_move_within_reg: factor
## - Type of the column live_in_this_house_1_year_ago: factor
## - Type of the column migration_prev_res_in_sunbelt: factor
## - Type of the column num_persons_worked_for_employer: integer
## - Type of the column family_members_under_18: factor
## - Type of the column country_of_birth_father: factor
## - Type of the column country_of_birth_mother: factor
## - Type of the column country_of_birth_self: factor
## - Type of the column citizenship: factor
## - Type of the column own_business_or_self_employed: integer
## - Type of the column fill_inc_questionnaire_for_veterans_admin: factor
## - Type of the column veterans_benefits: integer
## - Type of the column weeks_worked_in_year: integer
## - Type of the column year: integer
## - Type of the column salary_classification: character

Looking at those classes, we can see that some columns were assigned a default class we don't want. For example, the output salary_classification was read as "character" but we want it to be a factor.
The same goes for some columns that are interpreted as integers but should be factors, such as the industry codes.
We will also treat the column year as a factor since it has only two values (94 and 95), which I assume are the years when this data was gathered. Let's change the classes of those columns in the train and test datasets.

## Columns to convert to factors in both datasets
factor_columns = c("detailed_industry_recode", "detailed_occupation_recode",
                   "own_business_or_self_employed", "veterans_benefits",
                   "year", "salary_classification")

## Transforming the train set
raw_train_data[factor_columns] = lapply(raw_train_data[factor_columns], as.factor)

## Transforming the test set
raw_test_data[factor_columns] = lapply(raw_test_data[factor_columns], as.factor)

Now we have the new class of our columns:

## - Type of the column age: integer
## - Type of the column class_of_worker: factor
## - Type of the column detailed_industry_recode: factor
## - Type of the column detailed_occupation_recode: factor
## - Type of the column education: factor
## - Type of the column wage_per_hour: integer
## - Type of the column enroll_in_edu_inst_last_wk: factor
## - Type of the column marital_stat: factor
## - Type of the column major_industry_code: factor
## - Type of the column major_occupation_code: factor
## - Type of the column race: factor
## - Type of the column hispanic_origin: factor
## - Type of the column sex: factor
## - Type of the column member_of_a_labor_union: factor
## - Type of the column reason_for_unemployment: factor
## - Type of the column full_or_part_time_employment_stat: factor
## - Type of the column capital_gains: integer
## - Type of the column capital_losses: integer
## - Type of the column dividends_from_stocks: integer
## - Type of the column tax_filer_stat: factor
## - Type of the column region_of_previous_residence: factor
## - Type of the column state_of_previous_residence: factor
## - Type of the column detailed_household_and_family_stat: factor
## - Type of the column detailed_household_summary_in_household: factor
## - Type of the column migration_code_change_in_msa: factor
## - Type of the column migration_code_change_in_reg: factor
## - Type of the column migration_code_move_within_reg: factor
## - Type of the column live_in_this_house_1_year_ago: factor
## - Type of the column migration_prev_res_in_sunbelt: factor
## - Type of the column num_persons_worked_for_employer: integer
## - Type of the column family_members_under_18: factor
## - Type of the column country_of_birth_father: factor
## - Type of the column country_of_birth_mother: factor
## - Type of the column country_of_birth_self: factor
## - Type of the column citizenship: factor
## - Type of the column own_business_or_self_employed: factor
## - Type of the column fill_inc_questionnaire_for_veterans_admin: factor
## - Type of the column veterans_benefits: factor
## - Type of the column weeks_worked_in_year: integer
## - Type of the column year: factor
## - Type of the column salary_classification: factor

The code below gives us the classes present in our dataset and how many columns belong to each of those classes:

columns_stats = data.frame("name" = colnames(raw_train_data))
columns_stats$class = as.factor(sapply(raw_train_data, FUN = class))

summary(columns_stats$class)
##  factor integer 
##      34       7

As we can see, we have 34 columns that are categorical and 7 that are integer.

How many missing values per column

First let’s see if there is any NA value in our dataset and if yes, in what proportion.

columns_stats$prop_NA = sapply(raw_train_data, FUN = function(x) sum(is.na(x))/length(x))
filter(columns_stats, prop_NA > 0)
## [1] name    class   prop_NA
## <0 rows> (or 0-length row.names)

We can see that there are no NA values, so none of our integer columns have missing values. However, our categorical columns might encode missingness in another way. Let's look at the possible unique values of each categorical column.

integer_columns = as.character(filter(columns_stats, columns_stats$class == "integer")$name)
categorical_columns = as.character(filter(columns_stats, columns_stats$class == "factor")$name)
#sapply(subset(raw_train_data, select = categorical_columns), FUN = unique) # Uncomment this to see the unique values

I have omitted the result in the report as it is pretty long, but from the output of the code above we can see that some columns have the value "Not in universe". We won't treat this value as missing, since it carries information about the observation, and we will keep it as a distinct level of the corresponding columns (it likely means the column is not applicable to this person).
However, we also see some "?" values, which we will treat as our missing values. We will keep them in their own distinct level, "?".
Let's find out, for each categorical variable, what proportion of "?" instances we have; we will refer to these as "bad values".

columns_stats$prop_bad_val_in_col = sapply(raw_train_data, FUN = function(x) sum(x == "?")/length(x))
temp = filter(columns_stats, class == "factor" & prop_bad_val_in_col > 0)
kable(arrange(temp, desc(prop_bad_val_in_col)))
name                             class    prop_NA   prop_bad_val_in_col
migration_code_change_in_msa     factor   0         0.4996717
migration_code_change_in_reg     factor   0         0.4996717
migration_code_move_within_reg   factor   0         0.4996717
migration_prev_res_in_sunbelt    factor   0         0.4996717
country_of_birth_father          factor   0         0.0336452
country_of_birth_mother          factor   0         0.0306681
country_of_birth_self            factor   0         0.0170056
state_of_previous_residence      factor   0         0.0035485

We see that only 8 columns have those bad values.
4 of them have a high proportion of "?": they are all related to migration and share the same proportion of almost 50%. The 4 others have a really small proportion (less than 4% of the available observations), which we can consider negligible.

Deep dive in our numerical features

Global distribution

Let's look at the overall distribution of our data: the mean, median, min, max, and the first and third quartiles.

kable(sapply(subset(raw_train_data, select = integer_columns), FUN = summary))
          age   wage_per_hour   capital_gains   capital_losses   dividends_from_stocks   num_persons_worked_for_employer   weeks_worked_in_year
Min.     0.00            0.00             0.0             0.00                     0.0                             0.000                   0.00
1st Qu. 15.00            0.00             0.0             0.00                     0.0                             0.000                   0.00
Median  33.00            0.00             0.0             0.00                     0.0                             1.000                   8.00
Mean    34.49           55.43           434.7            37.31                   197.5                             1.956                  23.17
3rd Qu. 50.00            0.00             0.0             0.00                     0.0                             4.000                  52.00
Max.    90.00         9999.00        100000.0          4608.00                100000.0                             6.000                  52.00

We see that all the money-related features (wage_per_hour, capital_gains, capital_losses, dividends_from_stocks) are heavily concentrated at 0, as shown by the 3rd quartile being equal to 0. However, the mean is greater than 0 for those 4 features, which means a small number of very high observations pull up the mean while the median stays at 0.
This makes sense as our dataset is really skewed.

Boxplots per salary classification

## Using salary_classification as id variables
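
The plotting chunk is not shown in the report, so here is a minimal sketch of how these boxplots can be produced, assuming the numeric features were melted against salary_classification with reshape2 (which is what triggers the message above):

# Keep the numeric features plus the output, then melt to long format;
# with salary_classification as the only factor, melt() prints the message above
melted = melt(subset(raw_train_data, select = c(integer_columns, "salary_classification")))
ggplot(data = melted, aes(x = salary_classification, y = value, fill = salary_classification)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")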

The graph above gives us a lot of insights.

  • The age, as expected, seems to be strongly correlated with salary_classification. The richer you are, the older you are likely to be: there is almost 15 years of difference between the medians of the two classifications
  • The number of weeks worked in the year is also interesting. The medians are significantly different, as you would naturally expect: 0 weeks worked for the low salary bucket against 52 for the high one
  • The number of persons worked for employer is also a good feature for showing this distinction
  • For all the other features, the medians are 0 with a lot of outliers, whichever bucket you are in (low vs high salary)
    • Low salary people tend to have more capital losses; conversely, high salary people tend to have more dividends from stocks
    • The wage per hour has a median equal to 0 for both classifications, which leaves me confused about this column. I am not sure I really understand how it was computed, as I would have expected high salary people to have a high wage per hour; if they are simply not eligible for it, then it should be a missing value rather than a 0

What happens if we boxplot the combination of financial gains and losses together?

ggplot(data = raw_train_data, aes(x = salary_classification, y = capital_gains - capital_losses + dividends_from_stocks, fill = salary_classification)) + geom_boxplot()

Even if both medians are still equal to 0, we can now see a clearer difference across the two classifications.

Conclusions about numerical features

From the insights above, we can draw several conclusions:

  • We won't consider the feature wage_per_hour when we build our model
  • We will combine the three features capital_gains, capital_losses and dividends_from_stocks into one that we will call external_financial_balance:
    • external_financial_balance = capital_gains - capital_losses + dividends_from_stocks
  • All the other features will be considered

We will create new train and test datasets called "train_data" and "test_data" (versus the old ones with the "raw" prefix), where we keep only the features retained by our analysis above and add the transformed one.

## For the train dataset
train_data = subset(raw_train_data, select = c("age", "num_persons_worked_for_employer", "weeks_worked_in_year", "salary_classification"))
train_data$external_financial_balance = raw_train_data$capital_gains - raw_train_data$capital_losses + raw_train_data$dividends_from_stocks

## For the test dataset
test_data = subset(raw_test_data, select = c("age", "num_persons_worked_for_employer", "weeks_worked_in_year", "salary_classification"))
test_data$external_financial_balance = raw_test_data$capital_gains - raw_test_data$capital_losses + raw_test_data$dividends_from_stocks

Deep dive in our categorical variables

Correlation with the salary classification

For each of our categorical features, we will first make a Chi Square test of independence with the output which is the salary classification.

categorical_columns_stats = data.frame("name" = categorical_columns, stringsAsFactors = FALSE)
# Calculate the p value of the Chi Square test between each column and the salary classification
temp_p_value = sapply(subset(raw_train_data, select = as.character(categorical_columns)), FUN = function(x) chisq.test(x, raw_train_data$salary_classification)$p.value)

# If the p value is smaller than the 5% significance level, we reject the null hypothesis of independence, i.e. the categorical variable and the output salary classification are dependent
categorical_columns_stats$dependency_w_output = ifelse(temp_p_value <= 0.05, "dependent", "independent")

kable(categorical_columns_stats)
name                                        dependency_w_output
class_of_worker                             dependent
detailed_industry_recode                    dependent
detailed_occupation_recode                  dependent
education                                   dependent
enroll_in_edu_inst_last_wk                  dependent
marital_stat                                dependent
major_industry_code                         dependent
major_occupation_code                       dependent
race                                        dependent
hispanic_origin                             dependent
sex                                         dependent
member_of_a_labor_union                     dependent
reason_for_unemployment                     dependent
full_or_part_time_employment_stat           dependent
tax_filer_stat                              dependent
region_of_previous_residence                dependent
state_of_previous_residence                 dependent
detailed_household_and_family_stat          dependent
detailed_household_summary_in_household     dependent
migration_code_change_in_msa                dependent
migration_code_change_in_reg                dependent
migration_code_move_within_reg              dependent
live_in_this_house_1_year_ago               dependent
migration_prev_res_in_sunbelt               dependent
family_members_under_18                     dependent
country_of_birth_father                     dependent
country_of_birth_mother                     dependent
country_of_birth_self                       dependent
citizenship                                 dependent
own_business_or_self_employed               dependent
fill_inc_questionnaire_for_veterans_admin   dependent
veterans_benefits                           dependent
year                                        dependent
salary_classification                       dependent

We see that all the categorical features and the salary classification are dependent.

Now we can go one step further with this Chi Square test and calculate Cramér's V, a measure of association between each feature and the salary classification.
This measure goes from 0 to 1: the closer the value is to 1, the stronger the association.
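
For reference, Cramér's V is derived from the Chi Square statistic as V = sqrt(X² / (n × (min(r, c) − 1))), where n is the number of observations and r, c are the dimensions of the contingency table. Below is a small sketch of what lsr::cramersV essentially computes (the helper name cramers_v_manual is mine; results may differ slightly on 2x2 tables because of the continuity correction):

cramers_v_manual = function(x, y) {
  tab = table(x, y)
  chi2 = suppressWarnings(chisq.test(tab)$statistic) # Chi Square statistic X^2
  n = sum(tab) # number of observations
  as.numeric(sqrt(chi2 / (n * (min(dim(tab)) - 1))))
}
cramers_v_manual(raw_train_data$education, raw_train_data$salary_classification)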

categorical_columns_stats = data.frame("name" = categorical_columns, stringsAsFactors = FALSE)
categorical_columns_stats$correlation_w_output = sapply(subset(raw_train_data, select = as.character(categorical_columns)), FUN = function(x) cramersV(table(x, raw_train_data$salary_classification)))
kable(arrange(categorical_columns_stats, desc(correlation_w_output)))
name                                        correlation_w_output
salary_classification                       0.9999569
detailed_occupation_recode                  0.4382090
education                                   0.3898368
major_occupation_code                       0.3662641
detailed_industry_recode                    0.2929573
major_industry_code                         0.2778596
class_of_worker                             0.2605315
detailed_household_and_family_stat          0.2404576
tax_filer_stat                              0.2292316
detailed_household_summary_in_household     0.2272205
marital_stat                                0.1976550
full_or_part_time_employment_stat           0.1598469
family_members_under_18                     0.1591674
sex                                         0.1575890
veterans_benefits                           0.1443385
own_business_or_self_employed               0.0838999
member_of_a_labor_union                     0.0749995
country_of_birth_father                     0.0715032
country_of_birth_mother                     0.0705367
hispanic_origin                             0.0686084
enroll_in_edu_inst_last_wk                  0.0641903
country_of_birth_self                       0.0608721
race                                        0.0587371
citizenship                                 0.0409302
migration_code_change_in_msa                0.0386807
migration_code_change_in_reg                0.0380052
migration_code_move_within_reg              0.0380002
state_of_previous_residence                 0.0332277
region_of_previous_residence                0.0291698
migration_prev_res_in_sunbelt               0.0290768
live_in_this_house_1_year_ago               0.0279176
reason_for_unemployment                     0.0278955
fill_inc_questionnaire_for_veterans_admin   0.0273720
year                                        0.0147730

It is expected that salary_classification has the highest association, since it is measured against itself. The second strongest is detailed_occupation_recode, followed closely by major_occupation_code.
We see the same pattern with detailed_industry_recode and major_industry_code, and with detailed_household_and_family_stat and detailed_household_summary_in_household.
We will keep only the major ones from each pair.

For training our model, we will keep only the top 33% of the categorical features, i.e. the 10 most strongly associated with salary_classification.
Those features are:

temp = filter(categorical_columns_stats, name != "salary_classification" 
                                        & name != "detailed_occupation_recode"
                                        & name != "detailed_industry_recode"
                                        & name != "detailed_household_and_family_stat")
temp = arrange(temp, desc(correlation_w_output))$name
categorical_columns_to_keep = head(temp, round(length(temp)/3))
categorical_columns_to_keep
##  [1] "education"                              
##  [2] "major_occupation_code"                  
##  [3] "major_industry_code"                    
##  [4] "class_of_worker"                        
##  [5] "tax_filer_stat"                         
##  [6] "detailed_household_summary_in_household"
##  [7] "marital_stat"                           
##  [8] "full_or_part_time_employment_stat"      
##  [9] "family_members_under_18"                
## [10] "sex"

## For the train dataset
temp = subset(raw_train_data, select = categorical_columns_to_keep)
train_data = cbind(train_data, temp)

## For the test dataset
temp = subset(raw_test_data, select = categorical_columns_to_keep)
test_data = cbind(test_data, temp)

Graphing the above strongly correlated features against the salary classification

Sex

temp_prop = data.frame(table(train_data[,c("sex", "salary_classification")]))
colnames(temp_prop)[3] = "proportion"
  
ggplot(data = temp_prop, aes(x = sex, y = proportion, fill = salary_classification)) + 
    geom_bar(position = "fill",stat = "identity") + 
    scale_y_continuous(labels = percent_format()) +
    ggtitle("Repartition of salary classification depending on sex")

We clearly see that being female seems to have a negative effect on the salary: when you are female, you are more likely to be in the lower salary bucket.
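
The remaining features in this section are graphed with the exact same recipe as the sex plot above; the small helper below (hypothetical, not part of the original report) sketches how each of the following plots can be generated:

plot_salary_split = function(feature_name) {
  temp_prop = data.frame(table(train_data[, c(feature_name, "salary_classification")]))
  colnames(temp_prop)[3] = "proportion"
  ggplot(data = temp_prop, aes_string(x = feature_name, y = "proportion", fill = "salary_classification")) +
    geom_bar(position = "fill", stat = "identity") +
    scale_y_continuous(labels = percent_format()) +
    coord_flip() + # long level names read better horizontally
    ggtitle(paste("Repartition of salary classification depending on", feature_name))
}
plot_salary_split("education") # e.g. for the next subsection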

Education

We clearly see the impact of education on the salary bucket:

  • Being a child leaves only one possibility, as expected: the lower salary classification
  • Having a bachelors degree, masters degree, prof school degree, or PhD gives you the highest probability of being in the higher salary bucket (in ascending order, PhD being the highest)
  • The probability of being in the high bucket is really low if you didn't reach high school or don't have any diploma

Major occupation code

We can see that your major occupation has an impact on your salary as expected:

  • Being an executive, a technician or related support, in the armed forces, in a professional specialty, in protective services, or in sales seems to give you a better chance of belonging to the higher salary bucket
  • The farming, forestry and fishing industry, as well as transportation and material moving, and machine operating and inspection, seem to give approximately a 45%-60% chance of belonging to the lower salary bucket
  • All the other occupations strongly put you in the lower salary bucket, especially Not in universe (we did well to keep it as a distinct level) and private household services

Major industry code

Insights:

  • Armed forces, communications, mining, utilities and sanitary services, and other professional services seem to give you the highest probability of belonging to the high salary bucket

Class of worker

We can draw the following insights:

  • Someone without pay, someone who never worked, or someone Not in universe will very likely be in the lower bucket
  • Someone in the self-employed incorporated class will very likely be in the higher salary bucket
  • The other classes of worker are more likely than those first groups to be in the higher salary bucket

Tax filer stat

Here are some insights that can be drawn:

  • Not having filed any tax return makes you almost automatically a person from the low salary bucket
  • The status “joint both under 65” makes you more likely to be in the high salary bucket

Detailed household summary in household

Insights:

  • Being a householder is the category that puts you in the higher salary bucket with the highest probability

Marital stat

Insights:

  • As expected, the category with the highest probability of being in the higher salary bucket is “Married civilian spouse present”
  • The “Never married” category is the one most likely to put you in the lower salary bucket

Full or part time employment stat

Insights:

  • It makes sense that if you work full time or part time, you are more likely to be in the higher salary bucket, as we can see in the above graph
  • The categories that put you in the lower salary bucket with high probability are the “unemployed” ones and “Not in labor force”

Family members under 18

It seems that this feature is already captured by previous ones: the only level showing a mix of both salary buckets is “Not in universe”, with probabilities around 60-40.
We will remove this feature from our train and test datasets.

## For train dataset
train_data$family_members_under_18 = NULL

## For test dataset
test_data$family_members_under_18 = NULL

Bonus: race

Insights:

  • It seems that “White” and “Asian or Pacific Islander” are the two races with the highest probability of being in the higher salary bucket
  • All the others have almost three times less chance than those two of being in the higher salary classification

Let's add this feature to our train and test datasets, as it will be interesting to see the impact of this feature in our model.

## For train dataset
train_data$race = raw_train_data$race

## For test dataset
test_data$race = raw_test_data$race

Conclusions about categorical features

We have finished our analyses of the categorical features and selected the ones we want to use for our classification model: the ones ranked as most strongly associated with the salary classification output in the previous section.
From those, we removed family_members_under_18 and added race instead.

Build the classifier

Our training and testing sets

Number of features of interests

From our two main analyses of the numerical and categorical variables, we have narrowed the number of features of interest from 41 to 14.
The 14 ones being:

name                                        class
class_of_worker                             factor
detailed_household_summary_in_household     factor
education                                   factor
full_or_part_time_employment_stat           factor
major_industry_code                         factor
major_occupation_code                       factor
marital_stat                                factor
race                                        factor
sex                                         factor
tax_filer_stat                              factor
age                                         integer
external_financial_balance                  integer
num_persons_worked_for_employer             integer
weeks_worked_in_year                        integer

We have 10 categorical variables and 4 numerical variables.

Divide our training set into a train and a validation set

From our training set, we will make a split of 80% for the real train set and 20% for our validation set.

train = sample(c(TRUE, FALSE), size = nrow(train_data), replace = TRUE, prob = c(0.8, 0.2))

small_train_data = train_data[train,]
validation_data = train_data[!train,]

Unbalanced training set

As we have seen at the beginning of our report, our original training set is highly unbalanced and has way more low salary classifications than the high one:

  • proportion of high salary: 6.21%
  • proportion of low salary: 93.79%

We don't want our model to be impacted by this imbalance and to simply classify every observation as low salary: because of the imbalance, classifying all observations as low salary would already give an accuracy of around 94%, while missing every high salary person.

The code below modifies our small train dataset (80% of our original training set) to tackle this issue.
We will downsample the low salary bucket to exactly the same number of observations as the high salary bucket.

set.seed(1)
temp_high = filter(small_train_data, salary_classification == "+50,000") # Take only the high salary observations
temp_low = filter(small_train_data, salary_classification != "+50,000") # Take only the low salary observations
temp_low = sample_n(temp_low, nrow(temp_high), replace = FALSE) # Take a random sample of size the numbers of high salary observations without replacement from the low salary bucket
balanced_small_train_data = rbind(temp_high, temp_low) # Create the balanced dataset

kable(data.frame(table(balanced_small_train_data$salary_classification)), col.names = c("salary_classification", "number_of_observations"))
salary_classification   number_of_observations
-50,000                 9893
+50,000                 9893

Choose the right model using several cutoffs with the Random Forest Model

Train the model

First, let’s try to fit a random forest to our balanced small training set.

rf_fit = randomForest(salary_classification ~  ., data = balanced_small_train_data, importance = TRUE) # We set importance to be TRUE so that we can find the importance of each feature in our classification
print(rf_fit)
## 
## Call:
##  randomForest(formula = salary_classification ~ ., data = balanced_small_train_data,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 12.87%
## Confusion matrix:
##         -50,000 +50,000 class.error
## -50,000    8245    1648  0.16658243
## +50,000     898    8995  0.09077125

Tune the right cutoff using our validation set

Our random forest outputs the probability of being in the higher salary classification.
Instead of assigning an observation to “+50,000” whenever this probability is bigger than 0.5, let's try several cutoff values and take the one that leads to the best score.

For that, we need first to create the function that will give us a score for each model.
There are several candidates, but we will use the MCC (Matthews Correlation Coefficient), a measure of the quality of a binary classification: it takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even when the classes are of very different sizes.
Our function will also output the Precision, Recall and F1-score.
In our problem, the True Positive is an observation that is classified as “+50,000” when the real classification is indeed “+50,000”.
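
For reference, from the confusion matrix counts, the MCC is defined as

MCC = (TP × TN − FP × FN) / sqrt((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN))

which is exactly what the function below computes, factoring the square root to avoid integer overflow.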

calculate_scores = function(confusion_matrix) {
  # Assumes a 2x2 matrix with levels ordered ("-50,000", "+50,000"):
  # predicted classes in rows, actual classes in columns
  TP = confusion_matrix[2,2]
  TN = confusion_matrix[1,1]
  FP = confusion_matrix[2,1]
  FN = confusion_matrix[1,2]
  precision = TP / (TP + FP)
  recall = TP / (TP + FN)
  f1_score = 2 * (precision * recall) / (precision + recall)
  MCC = (TP * TN - FP * FN) / (sqrt(TP+FP) * sqrt(TP + FN) * sqrt(TN + FP) * sqrt(TN + FN)) # We need to calculate first the SQRT of each product otherwise the number is too big and we have the following error: "NAs produced by integer overflow"
  
  return(list("precision" = precision
              , "recall" = recall
              , "f1_score" = f1_score
              , "MCC" = MCC))
}

Let’s try now several cutoff parameters and output the MCC for each of them.
We will take the cutoff that leads to the highest MCC.

# Let's first predict the probabilities of being in class "+50,000" in our validation set
validation_prob = predict(rf_fit, validation_data, type = "prob")[,2]

# Let's create our vector of cutoff values from 0.1 to 1 with a 0.01 increment
validation_cutoff_result = data.frame("cutoff_value" = seq(0.1, 1, by = 0.01))

# Let's test our model for each of those cutoff
cutoff_MCC = c(NULL)
for (i in 1:nrow(validation_cutoff_result)) {
  temp_pred_class = ifelse(validation_prob >= validation_cutoff_result$cutoff_value[i], "+50,000", "-50,000")
  temp_table = table(temp_pred_class, validation_data$salary_classification)
  cutoff_MCC = c(cutoff_MCC, calculate_scores(temp_table)[["MCC"]])
}
validation_cutoff_result$MCC_result = cutoff_MCC

# Let's find the cutoff value that gives the highest MCC score
best_cutoff = validation_cutoff_result$cutoff_value[which.max(validation_cutoff_result$MCC_result)]

Here is the plot giving the MCC score for all the different cutoff parameters with a vertical line showing the value of the optimal cutoff that gives the maximal MCC:

best_MCC = validation_cutoff_result$MCC_result[validation_cutoff_result$cutoff_value == best_cutoff]

ggplot(data = validation_cutoff_result, aes(y = MCC_result, x = cutoff_value)) + 
  geom_point(aes(colour = MCC_result)) +
  scale_colour_gradient(low = "blue") +
  geom_vline(xintercept = best_cutoff, colour="blue") +
  geom_text(aes(best_cutoff, 0, label = best_cutoff, vjust = -1), size = 5) +
  geom_text(aes(best_cutoff, best_MCC, label = round(best_MCC,4)), size = 5)

We see that the best cutoff is 0.83 which gives an MCC equal to 0.5335164.
We will then test our model in the test set using this cutoff.

Test our model in the test set

We will test our model on the test set with the cutoff found in the earlier section, which is 0.83.
If the probability of being in the higher classification is bigger than this cutoff, we will classify the observation in the “+50,000” bucket.

# Calculate the predicted probabilities of being in the class "+50,000"
test_predict_prob = predict(rf_fit, test_data, type = "prob")[,2]

# Assign the observation to the class "+50,000" if the calculated predicted probability is higher than the best cutoff found in the earlier section
test_predict_class = ifelse(test_predict_prob >= best_cutoff, "+50,000", "-50,000")

# Calculate the confusion matrix for our test set
test_confusion_matrix = table(test_predict_class, test_data$salary_classification)

# Calculate the different scores of our predicted model
test_rf_results = calculate_scores(test_confusion_matrix)

# Calculate the accuracy which is the number of well classified observations out of all the observations
rf_accuracy = sum(test_predict_class == test_data$salary_classification) / length(test_predict_class)

We have the following confusion matrix for our model (the rows being the predicted classes and the columns being the real ones):

          -50,000   +50,000
-50,000     89223      2097
+50,000      4353      4089

Our model has those scores (a True Positive being a “+50,000” observation being correctly classified as “+50,000”):

  • Precision: 48.44%
  • Recall: 66.1%
  • F1-score: 0.5590648
  • MCC: 0.5324719
  • Accuracy: 93.53%
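
The “-50,000” scores discussed below come from the same confusion matrix, simply treating “-50,000” as the positive class (a small sketch, not in the original code):

# Treat "-50,000" as the positive class: rows are predicted, columns are actual
neg_precision = test_confusion_matrix[1,1] / sum(test_confusion_matrix[1,]) # ~97.7%
neg_recall    = test_confusion_matrix[1,1] / sum(test_confusion_matrix[,1]) # ~95.35%
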
Model conclusions

  • Accuracy
    • We see that our accuracy is 93.53% which is pretty good
    • The test set has 93.8% low salary observations, so a model that assigns every observation to the low salary classification would have an accuracy of 93.8%, slightly better than our random forest, but it would miss all the high salary people
  • MCC:
    • Our MCC is 0.5324719 which is pretty good as this score is always between -1 and 1 (0 meaning picking at random, 1 meaning a perfect classification and -1 meaning a perfect misclassification)
  • Precision and Recall for being in the “+50,000” classification:
    • Our precision is 48.44%
    • Our recall is 66.1%; neither value is really high
    • This seems to be due to our strongly unbalanced test set. Since we trained our model on an artificially balanced dataset, it tends to put more observations in the high salary classification, so our precision gets smaller. By changing the cutoff we could reach a higher recall, but precision would drop drastically. The analysis in the previous section found the optimal cutoff that maximizes the MCC, which takes the precision and recall of both classifications into consideration
  • Precision and Recall for being in the “-50,000” classification:
    • Our precision is 97.7%
    • Our recall is 95.35%
    • We can clearly see that our recall and precision for this classification are very high, which can also be attributed to the unbalanced structure of our dataset
  • We can now graph the importance of each of the features used in our Random Forest model:
varImpPlot(rf_fit, type = 1, main = "Features Importance of our Optimal Random Forest")

The mean decrease in accuracy caused by a variable is determined during the out-of-bag error calculation phase: the more the accuracy of the random forest decreases due to the exclusion of a single variable, the more important that variable is.
In other words, variables with a large mean decrease in accuracy are more important for classifying our observations.
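
To read the raw numbers behind the plot, the randomForest package exposes them through importance() (a quick sketch):

# Mean decrease in accuracy per feature, sorted from most to least important
imp = importance(rf_fit, type = 1)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]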

Here we can see that sex, education, external financial balance and age are the most important ones in our classification process, followed closely by the major occupation and the weeks worked in the year.

Those features make complete sense, as we would expect them to carry a lot of information about the salary bucket of a given observation.

Conclusions

Performance of our classifier

As explained in the previous section, our classifier, a Random Forest with an optimized cutoff, gives pretty good results, especially for the low salary classification (“-50,000”).

Main challenges

There were several challenges in this project:

  • First, the unbalanced structure of our datasets was a major issue in this challenge, as it can completely change the behavior of our model:
    • If we had only looked at the accuracy of the model as our main score, we would have lost a lot of information: as I said earlier, a simple classifier putting every observation in the “-50,000” bucket gives an accuracy of 93.8%
    • This issue is common with real world data. In general, data is unbalanced, and there are several ways to tackle that
    • I chose to balance my training set, but I could have chosen other options like stratified sampling of my training set or using different class weights in my Random Forest fitting process
  • Second, the high number of features of the dataset was really challenging (41 features, numerical and categorical):
    • It was hard to decide on how I wanted to grab the most important features. I decided to follow the path of dividing my analysis into two parts: numerical then categorical
    • For the numerical ones, the boxplots helped me in that process
    • For the categorical ones, the Cramer’s V coupled with the Chi Square test of independence gave me the most correlated features with the salary classification

Next steps

Here I have chosen to use the Random Forest model with different cutoff values, picking the one that maximizes a score I chose (here the MCC).
The natural next steps would be to test other classification models, such as Logistic Regression with different regularization penalties (L1 and L2), or a simple Decision Tree, to see how they perform compared to my Random Forest.