Exploring Data

Read

First, we have to load the data. For efficiency, I am going to load and transform both the learn and test files at the same time.

# Load the learning file and flag its origin
usc <- fread("~/Desktop/VINCENT/Dataiku/Data/census_income_learn.csv", integer64 = "character", showProgress = TRUE)
usc$part <- "learn"
# Load the test file, flag it, then stack both parts into a single table
usc2 <- fread("~/Desktop/VINCENT/Dataiku/Data/census_income_test.csv", integer64 = "character", showProgress = TRUE)
usc2$part <- "test"
usc <- rbind.data.frame(usc, usc2)
rm(usc2)

By comparing the metadata file with the U.S. census file, I figured out that some columns were missing and that others were not described in the metadata. I had to create my own file with the variable names:

Var <- read.csv("~/Desktop/VINCENT/Dataiku/Data/Var.txt", sep="")
Var<- Var$Lib
names(usc) <- as.vector(Var)

Now that we have our complete file, we can begin to explore it.

Let's plot the age, class of worker and sex variables, which I assume have a strong influence on income.

Exploratory Analysis

Age, class of worker and sex

a <- ggplot(usc, aes(Age,fill = class_of_worker)) +
  geom_histogram(binwidth=1, alpha=.5,position="identity") +
  facet_grid(~income) +
  ggtitle("Class of worker by age and income") 
ggplotly(a)

The first thing we see is the Not in universe category of the class of worker variable: almost no one in it has an income above 50000$. Most of these individuals are under 18 or over 60, so we can assume it represents the inactive population. It is not really worth predicting an income for a population that is not eligible for work, so let's exclude them. I also recode the income variable to make it clearer.

usc <- usc[which(class_of_worker != "Not in universe")]
usc$income[usc$income == "- 50000."] <- "-50000$"
usc$income[usc$income == "50000+."] <- "+50000$"

Let us give a "demographic" dimension to the age/sex breakdown by income, with a population-pyramid style plot. I prefer to hide the code because of its length; a sketch of one possible implementation is shown below.
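
Since the original code is hidden, here is a minimal sketch of how such a pyramid could be built with ggplot2/plotly. It assumes the sex column is named sex with the values "Male" and "Female".

# Sketch of a population pyramid by sex, age and income (assumed column: sex)
pyr <- usc %>%
  count(Age, sex, income) %>%
  mutate(n = if_else(sex == "Male", -n, n))  # mirror one sex to the left

b <- ggplot(pyr, aes(x = Age, y = n, fill = sex)) +
  geom_col(width = 1, alpha = .5) +
  coord_flip() +
  facet_grid(~income) +
  ggtitle("Population pyramid by sex, age and income")
ggplotly(b)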

It is obvious that both age and sex have an influence on the level of income.

Education

c <- ggplot(data=usc,aes(x = education,fill = income)) +
  geom_bar() +
  coord_flip() +
  scale_fill_manual(values = c("#CCCCCC","#FFCC33")) +
  theme_classic()
ggplotly(c)

We see that the education variable has too many categories. Let's regroup it by school grade, and convert it to a factor so that we have an ordinal variable for education (from least educated to the highest degrees).

usc$education_REC <- revalue(usc$education,c("10th grade" = "01-Less than High School",
                                             "1st 2nd 3rd or 4th grade"= "01-Less than High School",
                                             "9th grade" = "01-Less than High School",
                                             "Less than 1st grade" = "01-Less than High School",
                                             "12th grade no diploma" = "01-Less than High School",
                                             "5th or 6th grade" = "01-Less than High School",
                                             "7th and 8th grade" = "01-Less than High School",
                                             "11th grade" = "01-Less than High School",
                                             "High school graduate" = "02-High school",
                                             "Some college but no degree" = "02-High school",
                                             "Associates degree-occup /vocational" = "03-Associates degree",
                                             "Associates degree-academic program" = "03-Associates degree",
                                             "Bachelors degree(BA AB BS)" = "04-Degree",
                                             "Masters degree(MA MS MEng MEd MSW MBA)" = "04-Degree",
                                             "Doctorate degree(PhD EdD)" = "05-Phd",
                                             "Prof school degree (MD DDS DVM LLB JD)" = "05-Phd"))
usc$education_REC <- as.factor(usc$education_REC)

Then we can plot it as a proportional bar chart, which will highlight the categories in which the +50000$ incomes are overrepresented.

d <- ggplot(usc, aes(x = education_REC, fill = income)) + 
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle=45,face="bold")
        ,axis.title.x = element_blank()
        ,axis.title.y = element_blank()) +
  scale_fill_manual(values = c("#CCCCCC","#FFCC33")) +
  coord_flip()
plotly_build(d)

Education seems to have a strong influence on the odds of earning more than 50000$.

We can also explore the birth_self variable, which indicates where the individual was born. To do so, let's compute, for each country, the percentage of individuals with an income above 50000$.

usc$Less50 = if_else(usc$income == "-50000$",1,0)
usc$More50 = if_else(usc$income == "+50000$",1,0)
usc$TOTAL = 1
MAP <- as.data.frame(usc[,list(More50 = sum(More50),
                               Total = sum(TOTAL)),
                               by = "birth_self"])

MAP$percent = MAP$More50 / MAP$Total
head(MAP,5)
##      birth_self More50  Total    percent
## 1 United-States  15476 129756 0.11927001
## 2      Columbia     21    407 0.05159705
## 3        Mexico    120   5035 0.02383317
## 4          Peru     12    246 0.04878049
## 5   Philippines    109    897 0.12151616

We can try to project it on a map. First, let’s import a shapefile so that we have a border layer.

shape <- readShapePoly("~/Desktop/VINCENT/Dataiku/SHP/TM_WORLD_BORDER")

Next, we can merge the shapefile with our data. To make the job easier, I renamed the countries directly in the shapefile so that they match the birth_self values.
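
The renaming could just as well be done in R before the join; here is a sketch using dplyr::recode (the two country pairs are only illustrative examples of mismatches, not an exhaustive list).

# Hypothetical example: align country labels with the shapefile's NAME field
MAP$birth_self <- recode(MAP$birth_self,
                         "Columbia"      = "Colombia",
                         "United-States" = "United States")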

MAP_merged<- geo_join(shape,MAP, "NAME", "birth_self")

Finally, we can project the results:

####### POP-UP ########
popup <- paste0(shape$NAME, "<br>",
                "% Income > 50K$. : ",round((MAP_merged$percent),2),"<br>",
                "Nb individuals > 50K$ : " , MAP_merged$More50)
####### Palette #######
pal <- colorNumeric(palette = "PuBu",
                     domain = MAP_merged$percent)
####### MAP #######
map <-leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = MAP_merged,
              fillColor = ~pal(percent),
              color = "#b2aeae", # border color
              fillOpacity = 0.7,
              weight = 1, 
              smoothFactor = 0.2,
              popup = popup)
map

It is nice but not very useful… We can see that Iran and India have a high percentage of individuals earning more than 50000$, but:
- only 40 countries are represented, and what about countries that might appear in other datasets?
- there are too many categories, which makes it hard to interpret.
So I choose to exclude this variable (and also the place of birth of the parents).

I do the same for all the other variables, so that I can decide what to do with each of them. The exploreR package makes it easy to quickly inspect the distribution of each variable, the missing values and the disparities.

Keeping:
- Householder: dummy variable => True (householder) / False (non-householder), as logical data
- weeks_worked => 52 (full-time) vs <52 (not full-time)
- cap / dividends: dummy variable => True (speculator) / False (non-speculator)
- Race: dummy variable => White vs Others
- tax: keep as such
- Unemployment: dummy variable => True (unemployed) / False (employed)
- major_indus_code => keep as such
- citizenship => keep as such

Removing:
- Mig / Previous: not enough movement
- veteran: not enough veterans
- labor: not enough individuals in a union
- nb of employees: unexpected distribution
- self employed: unexpected distribution
- Nb children: irrelevant
- marital status: irrelevant

Of course, I made assumptions which could be reinforced or called into question by research papers.

usc$household_REC <- as.logical(if_else(usc$household == "Householder", 1, 0))
usc$weeks_worked_REC = if_else(usc$weeks_worked == 52, "Full-time", "Not full-time")
usc$cap = as.logical(if_else(usc$cap_gain > 0, 1,
                             if_else(usc$cap_losses > 0, 1,
                             if_else(usc$dividends > 0, 1, 0))))
usc$race_REC = if_else(usc$race == "White", "White", "Others")
usc$unemployment_REC = as.logical(if_else(usc$unemployment_reason != "Not in universe", 1, 0))

Dataset for modeling

usc_mod <- as.data.frame(usc %>%  select(income,Age,education_REC,cap,class_of_worker
                            ,race_REC, sex ,household_REC,tax,citizenship,Part
                            ,major_occupation_code,major_indus_code,unemployment_REC))

Now we can split the dataset in order to model it.

usc_mod_learn = usc_mod[usc_mod$Part == "learn",]
usc_mod_test = usc_mod[usc_mod$Part == "test",]

Modeling the data

In order to avoid a negative impact on model fitting due to the imbalance between the observed income classes, I am going to downsample the training dataset.

usc_mod_learn$Class = if_else(usc_mod_learn$income == "+50000$",1,0)
usc_mod_test$Class = if_else(usc_mod_test$income == "+50000$",1,0)
table(usc_mod_learn$Class)
## 
##     0     1 
## 87800 11478
set.seed(9560)
usc_mod_learn = downSample(x = usc_mod_learn[,- ncol(usc_mod_learn)],
                         y = as.factor(usc_mod_learn$Class))

table(usc_mod_learn$Class)
## 
##     0     1 
## 11478 11478

Now we have a 50/50 sample that we can use.

LOGIT
mod_logit <- glm(Class ~ Age + class_of_worker + 
                  education_REC + race_REC +
                  sex + cap + household_REC +
                  tax + citizenship + major_occupation_code + 
                  major_indus_code + unemployment_REC
                  ,data = usc_mod_learn
                  ,family = binomial(link = "logit"))

We can interpret the odds with summary(mod_logit).
For example: an additional year of age multiplies the odds of earning more than 50000$ by about 1.04, i.e. a 4% increase in the odds (exp(coef), other things being equal).
An individual with a PhD has odds of earning more than 50000$ about 20 times higher than an individual without any degree (other things being equal).
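
Those odds ratios are simply the exponentiated coefficients of the model:

exp(coef(mod_logit))              # odds ratios
exp(confint.default(mod_logit))   # Wald confidence intervals for the odds ratios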

Plus, we can check the importance of each variable:

vi <- varImp(mod_logit)
vi$names <- row.names(vi)
vi <- vi[order(-vi$Overall),]
vi <- vi[1:12,] # keep only the first 12 variables

p <- ggplot(vi,aes(reorder(x = names,Overall),y = Overall)) +
  geom_bar( stat = "identity" , width = 0.6, fill = "grey") +
  coord_flip() +
  theme_minimal() +
  ggtitle("Variable Importance") +
  theme(axis.title.y = element_blank())

plotly_build(p)

Confusion matrix:

usc_mod_test$pred = predict(mod_logit,usc_mod_test,type="response")
usc_mod_test$pred_log <- rep("-50000$")
usc_mod_test$pred_log[usc_mod_test$pred > 0.5] = "+50000$"
table(usc_mod_test$pred_log,usc_mod_test$income)
##          
##           -50000$ +50000$
##   -50000$   35200    1052
##   +50000$    8742    4689
#Confusion matrix

Specificity: 81%
Global accuracy: 80%
Note that the 0.5 threshold is totally subjective; we could play with it, as in the Dataiku platform :) (see the sketch below).
The result is not bad, but there is a clear tendency towards false positives.
Whether that matters depends on the use case: if we are trying to detect a disease, it is better to be safe than sorry, but if we are looking for fraud, it would be damaging to accuse someone who is not a fraudster.
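
As a quick illustration, the cutoff can be varied and the global accuracy recomputed for each value (a sketch reusing the predicted probabilities from above):

# Explore a few alternative probability thresholds
thresholds <- seq(0.3, 0.7, by = 0.1)
sapply(thresholds, function(t) {
  pred_t <- if_else(usc_mod_test$pred > t, "+50000$", "-50000$")
  mean(pred_t == usc_mod_test$income)   # global accuracy at threshold t
})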

We can check some stats about this population of false positives:

statFP <- subset(usc_mod_test,usc_mod_test$pred_log == "+50000$" & usc_mod_test$income == "-50000$")

We see in the dataset that our false positive population has a mean age of 44, that 50% of them are speculators, and that 31% of them hold higher degrees.
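
These figures can be derived directly from statFP, for instance like this (a sketch; counting bachelor's degrees and above as "higher degrees" is my own assumption):

mean(statFP$Age)                                          # mean age
mean(statFP$cap)                                          # share of speculators (cap is logical)
mean(statFP$education_REC %in% c("04-Degree", "05-Phd"))  # share with higher degrees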

Let's try another model.
The caret package offers a lot of models.

I am going to stay close to the logit, in order to compare with the previous model. I will also use cross-validation to limit overfitting.

Boosted Logistic Regression

set.seed(456)
trctr_logitB = trainControl(method = "cv", number = 10)
mod_logitB = train(income ~ Age + class_of_worker + 
                education_REC + race_REC +
                sex + cap  + household_REC +
                tax + citizenship + major_occupation_code + 
                  major_indus_code + unemployment_REC
                    ,data = usc_mod_learn
                    ,trControl = trctr_logitB
                    ,method = "gbm"
                    ,verbose = F)

confusionMatrix(predict(mod_logitB,usc_mod_test),usc_mod_test$income,positive = "+50000$")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction -50000$ +50000$
##    -50000$   35356    1026
##    +50000$    8586    4715
##                                        
##                Accuracy : 0.8065       
##                  95% CI : (0.803, 0.81)
##     No Information Rate : 0.8844       
##     P-Value [Acc > NIR] : 1            
##                                        
##                   Kappa : 0.398        
##  Mcnemar's Test P-Value : <2e-16       
##                                        
##             Sensitivity : 0.8213       
##             Specificity : 0.8046       
##          Pos Pred Value : 0.3545       
##          Neg Pred Value : 0.9718       
##              Prevalence : 0.1156       
##          Detection Rate : 0.0949       
##    Detection Prevalence : 0.2677       
##       Balanced Accuracy : 0.8129       
##                                        
##        'Positive' Class : +50000$      
## 

We get almost the same results. There does not seem to be any overfitting, but there are still a lot of false positives.
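
One quick way to check for overfitting is to compare the cross-validated accuracy obtained on the training folds with the test results above (a sketch; since the training set is balanced, this figure is best compared with the test Balanced Accuracy rather than the raw Accuracy):

# Best cross-validated accuracy across the gbm tuning grid
max(mod_logitB$results$Accuracy)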

Let's try a random forest model.

Random forests

The random forest algorithm builds many decision trees on resampled data and aggregates their predictions.
I could also have used an explicit resampling method such as cross-validation, but the processing time would be too long.

set.seed(789)
mod_RF <- ranger(as.factor(income) ~ Age + class_of_worker + 
                     education_REC + race_REC + 
                     sex + cap + household_REC +
                     tax + citizenship + major_occupation_code + 
                     major_indus_code + unemployment_REC
                   ,data = usc_mod_learn
                   ,num.trees = 100
                   ,write.forest = T
                   ,importance = "impurity")
pred = predict(mod_RF,usc_mod_test)

# confusion matrix
table(pred$predictions,usc_mod_test$income)
##          
##           -50000$ +50000$
##   -50000$   34286     935
##   +50000$    9656    4806

Sensitivity (84%) has improved a little, while global accuracy (79%) is slightly lower, and we still have a lot of false positives.
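
For reference, these figures can be recomputed from the confusion matrix above (a sketch):

cm <- table(pred$predictions, usc_mod_test$income)
sum(diag(cm)) / sum(cm)                           # global accuracy
cm["+50000$", "+50000$"] / sum(cm[, "+50000$"])   # sensitivity for the +50000$ class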

Concerning the importance of variables in this model:

data.frame(as.list(mod_RF$variable.importance)) %>% gather() %>% 
  ggplot(aes(x = reorder(key, value), y = value)) +
  geom_bar(stat = "identity", width = 0.6, fill = "grey") +
  coord_flip() +
  theme_minimal() +
  ggtitle("Variable Importance") +
  theme(axis.title.y = element_blank())

We get roughly the same result as with the logit:
the cap, education and age variables have a strong influence on the odds of earning more than 50000$.

The false positive population closely resembles the +50000$ one; we would need more variables to improve the performance of our models.