Introduction

In this project we explore and model the Adult Census Income data from the UCI Machine Learning Repository, hosted on Kaggle. The data consists of demographic information about survey respondents. Our objective is to perform exploratory data analysis and then classify each respondent's income bracket.

The data can be obtained from: https://www.kaggle.com/uciml/adult-census-income

Business Case

Let’s pretend that we are a government agency responsible for investigating tax compliance. The job requires us to identify higher-income individuals and check whether there is any possibility of tax evasion or fraud. Our main goal in this project is therefore to classify which income category each citizen falls into, based on their known demographics.

Data Preparation

Loading Libraries and Data

Loading all the required libraries

library(readr)       # reading csv files
library(dplyr)       # data wrangling
library(e1071)       # naive Bayes and SVM
library(partykit)    # recursive partitioning
library(caret)       # splitting, sampling, training, and evaluation
library(DMwR)        # imbalanced-class resampling utilities
library(UBL)         # imbalanced-class resampling utilities
library(rpart)       # decision trees
library(rpart.plot)  # tree visualization
library(wesanderson) # color palettes
library(ggpubr)      # arranging plots (attaches ggplot2)
library(ggalluvial)  # alluvial plots

Data Preprocessing

Loading the required data

income<-read_csv("adult.csv")

Let’s inspect the dataset

head(income)
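
For a compact view of every column's type alongside a few values, dplyr's glimpse() is also handy:

glimpse(income)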

From this initial inspection, the workclass and occupation columns contain question-mark ("?") entries.

Next, we check whether the data has any missing (NA) values.

colSums(is.na(income))
##            age      workclass         fnlwgt      education  education.num 
##              0              0              0              0              0 
## marital.status     occupation   relationship           race            sex 
##              0              0              0              0              0 
##   capital.gain   capital.loss hours.per.week native.country         income 
##              0              0              0              0              0

The NA check comes back clean, but we already saw question marks in the workclass and occupation variables. Let’s check whether any other columns in the dataset contain a question mark. Since the meaning is self-explanatory, we are going to convert all of the question marks to "Undisclosed".

colSums(income=="?")
##            age      workclass         fnlwgt      education  education.num 
##              0           1836              0              0              0 
## marital.status     occupation   relationship           race            sex 
##              0           1843              0              0              0 
##   capital.gain   capital.loss hours.per.week native.country         income 
##              0              0              0            583              0

Besides the occupation and workclass columns, the native.country column also contains question marks. Let’s replace those values with "Undisclosed" and convert the qualitative variables into factors.

income_clean<-income %>% 
  mutate(occupation = replace(occupation, occupation == "?", "Undisclosed")) %>% 
  mutate(workclass = replace(workclass, workclass == "?", "Undisclosed")) %>% 
  mutate(native.country = replace(native.country, native.country == "?", "Undisclosed")) %>%  
  mutate_if(is.character,as.factor)

Now, let’s inspect those columns again.

colSums(income_clean=="?")
##            age      workclass         fnlwgt      education  education.num 
##              0              0              0              0              0 
## marital.status     occupation   relationship           race            sex 
##              0              0              0              0              0 
##   capital.gain   capital.loss hours.per.week native.country         income 
##              0              0              0              0              0

As we can see, the question mark entries no longer exist.

Dataset Description

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.

Description of fnlwgt (final weight)

The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:

  • A single cell estimate of the population 16+ for each state.

  • Controls for Hispanic Origin by age and sex.

  • Controls by Race, age and sex.

We use all three sets of controls in our weighting program and “rake” through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
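
As a rough illustration of such a "weighted tally" (a sketch only; the official CPS estimation described above is more involved), summing fnlwgt within a group approximates how many people in the US population the sampled rows represent:

income_clean %>%
  group_by(income) %>%
  summarise(sample_rows = n(),
            weighted_population = sum(fnlwgt))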

Exploratory Data Analysis

a <- income_clean %>% group_by(marital.status, income) %>% count() %>%
  ggplot(aes(reorder(marital.status, n), n)) +
  geom_col(aes(fill = income)) + theme_light() + coord_flip() +
  scale_fill_manual(values = wes_palette(n = 7, name = "BottleRocket1")) +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank(), plot.title = element_text(size = 10)) +
  ggtitle("Marital Status")
b <- income_clean %>% group_by(relationship, income) %>% count() %>%
  ggplot(aes(reorder(relationship, n), n)) +
  geom_col(aes(fill = income)) + theme_light() + coord_flip() +
  scale_fill_manual(values = wes_palette(n = 6, name = "BottleRocket1")) +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank(), legend.position = "none",
        plot.title = element_text(size = 10)) +
  ggtitle("Relationship Status")
figure <- ggarrange(a, b, common.legend = TRUE, legend = "bottom")
annotate_figure(figure,
                top = text_grob("Income Based on Relationship and Marital Status",
                                color = "black", face = "bold", size = 15))

c <- income_clean %>% group_by(occupation, income) %>% count() %>%
  ggplot(aes(reorder(occupation, n), n)) +
  geom_col(aes(fill = income)) + theme_light() + coord_flip() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank(), plot.title = element_text(size = 10)) +
  scale_fill_manual(values = wes_palette(n = 2, name = "BottleRocket1")) +
  ggtitle("Occupation")
d <- income_clean %>% group_by(workclass, income) %>% count() %>%
  ggplot(aes(reorder(workclass, n), n)) +
  geom_col(aes(fill = income)) + theme_light() + coord_flip() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank(), plot.title = element_text(size = 10)) +
  scale_fill_manual(values = wes_palette(n = 2, name = "BottleRocket1")) +
  ggtitle("Work Class")
figure <- ggarrange(c, d, common.legend = TRUE, legend = "bottom")
annotate_figure(figure,
                top = text_grob("Income Based on Economic Factors",
                                color = "black", face = "bold", size = 15))

income_sex <- income_clean %>%
  group_by(sex, income) %>%
  count()

income_sex %>% ggplot(aes(axis1 = sex,
           axis2 = income,
           y = n)) +
  geom_alluvium(aes(fill = sex)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Sex", "Income"),
                   expand = c(.1, .1)) +
  scale_fill_manual(values = wes_palette(n = 2, name = "BottleRocket1")) +
  labs(title = "Income",
       subtitle = "Stratified by Sex",
       y = "Frequency") +
  theme_minimal()

income_clean %>% group_by(race, income) %>% count() %>%
  ggplot(aes(x = reorder(race, n), y = n)) +
  geom_col(aes(fill = income), position = "dodge") +
  coord_flip() + theme_light() +
  scale_fill_manual(values = wes_palette(n = 2, name = "BottleRocket1")) +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank(),
        plot.title = element_text(size = 12, face = "bold")) +
  ggtitle("Income Class Based on Race")

ggplot(income_clean, aes(x = hours.per.week)) +
  geom_histogram(aes(fill = income), bins = 30) +
  theme_light() +
  scale_fill_manual(values = wes_palette(n = 2, name = "BottleRocket1")) +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank(),
        plot.title = element_text(size = 12, face = "bold")) +
  ggtitle("Working Hours Distribution Based on Income")

income_clean %>% group_by(education, income) %>% count() %>%
  ggplot(aes(reorder(education, n), n)) +
  geom_col(aes(fill = income)) + theme_light() + coord_flip() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank(),
        plot.title = element_text(size = 12, face = "bold")) +
  scale_fill_manual(values = wes_palette(n = 2, name = "BottleRocket1")) +
  ggtitle("Income by Education Level")

Splitting Data

Splitting the data lets us detect whether our model overfits or underfits: the model is fit on a training set and then evaluated on a testing set it has never seen.

We will split the data with 70% of the rows in the training set and the remaining 30% in the testing set.

set.seed(110)

index <- sample(x = nrow(income_clean), nrow(income_clean) * 0.7)

income_train <- income_clean[index,]
income_test <- income_clean[-index,]

Now, let’s inspect whether the target variable is balanced in proportion.

prop.table(table(income_train$income))
## 
##     <=50K      >50K 
## 0.7590821 0.2409179

Since the classes are imbalanced, we are going to either downsample or upsample. To decide which, we examine the number of rows in each class of the response variable.

table(income_train$income)
## 
## <=50K  >50K 
## 17301  5491

We see that the >50K category has 5491 rows. Since that still leaves plenty of data, we will downsample the majority class rather than upsample the minority one.

income_train_balanced<-downSample(x=income_train %>% select(-income),
                                y=income_train$income,
                                yname="income")
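
For reference, caret's upSample() takes the same arguments and would instead replicate minority-class rows until the classes match; we skip it here since downsampling already leaves ample data:

# alternative (not used further): upsample the minority class instead
income_train_up <- upSample(x = income_train %>% select(-income),
                            y = income_train$income,
                            yname = "income")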

As we can see from the result below, our data is already balanced.

table(income_train_balanced$income)
## 
## <=50K  >50K 
##  5491  5491

Modeling

Naive Bayes

First we are going to try the naive Bayes model. Naive Bayes uses conditional probability to predict its outcome: it predicts the class of the target variable given that the observed values of the other variables have occurred, assuming those variables are conditionally independent of one another. The formula below shows the conditional probability it relies on.

Conditional Probability Formula:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

Now, let’s start our modeling process. We name our model naive_income and set the Laplace smoothing parameter to 1, so that a feature level that never co-occurs with a class in the training data does not force the predicted probability to zero.
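
Concretely, with Laplace smoothing $\alpha = 1$, the conditional probability of a categorical feature level is estimated as

$$P(x_i = v \mid y) = \frac{n_{v,y} + \alpha}{n_y + \alpha K}$$

where $n_{v,y}$ counts training rows of class $y$ with level $v$, $n_y$ counts all rows of class $y$, and $K$ is the number of levels of the feature, so no level ever receives probability zero.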

naive_income <- naiveBayes(x = income_train_balanced %>% select(-income),
                           y = income_train_balanced$income,
                           laplace = 1)
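
The fitted object stores the learned conditional probability tables per variable; for example, we can peek at the (smoothed) distribution of sex within each income class:

naive_income$tables$sex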

Now, let’s use our model to predict the test data.

predicted_naive<-predict(naive_income,income_test%>% select(-income))

Let’s evaluate our model

cm_bayes<-confusionMatrix(predicted_naive,income_test$income,positive =">50K")
cm_bayes
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  6704  945
##      >50K    715 1405
##                                           
##                Accuracy : 0.8301          
##                  95% CI : (0.8225, 0.8375)
##     No Information Rate : 0.7594          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5188          
##                                           
##  Mcnemar's Test P-Value : 1.903e-08       
##                                           
##             Sensitivity : 0.5979          
##             Specificity : 0.9036          
##          Pos Pred Value : 0.6627          
##          Neg Pred Value : 0.8765          
##              Prevalence : 0.2406          
##          Detection Rate : 0.1438          
##    Detection Prevalence : 0.2170          
##       Balanced Accuracy : 0.7507          
##                                           
##        'Positive' Class : >50K            
## 

Decision Tree

Next we are going to use a decision tree to predict which income bracket each individual falls into. A decision tree makes its prediction through a tree-like sequence of splits on the other variables. We are going to build the tree with the rpart package; for classification, rpart selects each split by minimizing Gini impurity.
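
As a quick aside (a minimal sketch, not part of the modeling pipeline), the Gini impurity of a node is $1 - \sum_k p_k^2$, where $p_k$ is the proportion of class $k$ in the node:

gini <- function(classes) {
  # class proportions in the node
  p <- prop.table(table(classes))
  1 - sum(p^2)
}
# the balanced training set is a 50/50 split, so its root node sits at the
# two-class maximum: 1 - (0.5^2 + 0.5^2) = 0.5
gini(income_train_balanced$income)

Now we fit the tree on the balanced training data.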

dtree_income <- rpart(income~., data = income_train_balanced, method = 'class')

rpart.plot(dtree_income, extra = 106)

printcp(dtree_income)
## 
## Classification tree:
## rpart(formula = income ~ ., data = income_train_balanced, method = "class")
## 
## Variables actually used in tree construction:
## [1] age          capital.gain education    occupation   relationship
## 
## Root node error: 5491/10982 = 0.5
## 
## n= 10982 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.528319      0   1.00000 1.02149 0.0095402
## 2 0.042069      1   0.47168 0.47168 0.0081020
## 3 0.010381      2   0.42961 0.43089 0.0078464
## 4 0.010000      6   0.37370 0.38463 0.0075217

The root node misclassifies 5491 out of 10982 rows (50%), as expected for a balanced two-class training set. The first and most informative split is on relationship.

Now, let’s predict using our model

predicted_tree<-predict(dtree_income,income_test,type="class")

Next, let’s evaluate the model

cm_tree<-confusionMatrix(predicted_tree,income_test$income,positive =">50K" )
cm_tree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5667  403
##      >50K   1752 1947
##                                          
##                Accuracy : 0.7794         
##                  95% CI : (0.771, 0.7876)
##     No Information Rate : 0.7594         
##     P-Value [Acc > NIR] : 1.682e-06      
##                                          
##                   Kappa : 0.4952         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.8285         
##             Specificity : 0.7638         
##          Pos Pred Value : 0.5264         
##          Neg Pred Value : 0.9336         
##              Prevalence : 0.2406         
##          Detection Rate : 0.1993         
##    Detection Prevalence : 0.3786         
##       Balanced Accuracy : 0.7962         
##                                          
##        'Positive' Class : >50K           
## 

Random Forest

Next, we are going to use a random forest model to predict which income bracket each individual falls into. Random forests use bagging (bootstrap aggregating): many decision trees are grown, each on a bootstrap sample of the training data (and with a random subset of variables considered at each split), and the final prediction is the majority vote of the trees.
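
To make the voting idea concrete, here is a minimal bagging sketch built from rpart trees. It is an illustration only: a real random forest additionally samples a random subset of variables at each split, and the tree count of 25 here is arbitrary.

set.seed(417)
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  # each tree sees a bootstrap resample of the balanced training data
  boot_idx <- sample(nrow(income_train_balanced), replace = TRUE)
  tree <- rpart(income ~ ., data = income_train_balanced[boot_idx, ], method = "class")
  as.character(predict(tree, income_test, type = "class"))
})
# majority vote across the trees for every test row
bagged_pred <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                      levels = levels(income_test$income))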

In our modeling process, we are going to tune our model using five-fold cross-validation with three repetitions.

set.seed(417)

ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 3) 


mtry <- sqrt(ncol(income_train_balanced))
tunegrid <- expand.grid(.mtry=mtry)
rf_income <- train(income~., 
                      data=income_train_balanced, 
                      method='rf', 
                      metric='Accuracy', 
                      tuneGrid=tunegrid, 
                      trControl=ctrl)

Now we save our model to an RDS file for future use and reproducibility.

saveRDS(rf_income, "rf_income.RDS")
income_forest<-readRDS("rf_income.RDS")

Now, we predict and evaluate the confusion matrix using our model.

predicted_forest <- predict(income_forest, income_test)
cm_rf <- confusionMatrix(data = predicted_forest,
                         reference = income_test$income,positive =">50K")
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5557  276
##      >50K   1862 2074
##                                           
##                Accuracy : 0.7811          
##                  95% CI : (0.7728, 0.7893)
##     No Information Rate : 0.7594          
##     P-Value [Acc > NIR] : 2.117e-07       
##                                           
##                   Kappa : 0.5132          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8826          
##             Specificity : 0.7490          
##          Pos Pred Value : 0.5269          
##          Neg Pred Value : 0.9527          
##              Prevalence : 0.2406          
##          Detection Rate : 0.2123          
##    Detection Prevalence : 0.4029          
##       Balanced Accuracy : 0.8158          
##                                           
##        'Positive' Class : >50K            
## 

Now, let’s inspect the variable importance in our random forest model.

varImp(income_forest)
## rf variable importance
## 
##   only 20 most important variables shown (out of 100)
## 
##                                  Overall
## marital.statusMarried-civ-spouse 100.000
## marital.statusNever-married       49.190
## age                               48.098
## education.num                     46.376
## capital.gain                      42.741
## relationshipOwn-child             27.904
## hours.per.week                    26.562
## relationshipNot-in-family         24.606
## sexMale                           20.198
## occupationExec-managerial         14.410
## occupationProf-specialty          12.347
## capital.loss                      12.068
## educationBachelors                11.681
## occupationOther-service           11.514
## relationshipUnmarried             11.371
## fnlwgt                             7.924
## educationMasters                   7.557
## relationshipWife                   7.345
## educationHS-grad                   6.817
## workclassSelf-emp-inc              4.123

We can see that marital status is the most important predictor in our random forest model; it even exceeds the importance of occupation and education. What a surprise!

SVM

The third model that we are going to attempt is the support vector machine (SVM). For classification, an SVM finds the hyperplane that best separates the classes, optionally after mapping the data into a higher-dimensional space with a kernel.
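
With the radial (RBF) kernel used below, the similarity between two observations is

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

so gamma controls how far the influence of a single training point reaches (larger gamma means more local, wigglier boundaries), while the cost parameter controls how heavily margin violations are penalized.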

In our first attempt, we set gamma to 16 and the cost parameter to 1.

svm_attempt <- svm(income ~ ., data = income_train_balanced,
                   type = "C-classification", kernel = "radial",
                   gamma = 16, cost = 1)

Now, we are going to predict using our model and evaluate the model using the confusion matrix.

pred_svm<-predict(svm_attempt,income_test)
cm_svm<-confusionMatrix(pred_svm,income_test$income,positive =">50K")
cm_svm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  6902 1823
##      >50K    517  527
##                                           
##                Accuracy : 0.7605          
##                  95% CI : (0.7519, 0.7689)
##     No Information Rate : 0.7594          
##     P-Value [Acc > NIR] : 0.4118          
##                                           
##                   Kappa : 0.1908          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.22426         
##             Specificity : 0.93031         
##          Pos Pred Value : 0.50479         
##          Neg Pred Value : 0.79106         
##              Prevalence : 0.24056         
##          Detection Rate : 0.05395         
##    Detection Prevalence : 0.10687         
##       Balanced Accuracy : 0.57728         
##                                           
##        'Positive' Class : >50K            
## 

Model Tuning

Decision Tree

Now, let’s tune the decision tree. Judging from the earlier tree plot it does not appear overfitted, so we will only tune the cp (complexity) parameter, which is exactly what the prune function adjusts.

First we find which row of the CP table gives the minimum cross-validation error (xerror).

which.min(dtree_income$cptable[,"xerror"])
## 4 
## 4

Note that which.min returned 4, the row index of the CP table rather than a cp value. So we prune using the CP value stored in row 4 (0.01) and plot the resulting tree.

pruned_tree<-prune(dtree_income, cp = dtree_income$cptable[4,"CP"])
rpart.plot(pruned_tree)

printcp(pruned_tree)
## 
## Classification tree:
## rpart(formula = income ~ ., data = income_train_balanced, method = "class")
## 
## Variables actually used in tree construction:
## [1] age          capital.gain education    occupation   relationship
## 
## Root node error: 5491/10982 = 0.5
## 
## n= 10982 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.528319      0   1.00000 1.02149 0.0095402
## 2 0.042069      1   0.47168 0.47168 0.0081020
## 3 0.010381      2   0.42961 0.43089 0.0078464
## 4 0.010000      6   0.37370 0.38463 0.0075217

Not much has changed from the pre-tuned model; the decision tree remains identical. This is because the CP value in the optimal row, 0.01, is exactly rpart's default cp, so the tree had already been grown at that complexity.

Now let’s try to evaluate our tuned model using the confusion matrix.

predicted_pruned<-predict(pruned_tree, income_test,type="class")
cm_tree_tuned<-confusionMatrix(predicted_pruned,income_test$income,positive =">50K")
cm_tree_tuned
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5667  403
##      >50K   1752 1947
##                                          
##                Accuracy : 0.7794         
##                  95% CI : (0.771, 0.7876)
##     No Information Rate : 0.7594         
##     P-Value [Acc > NIR] : 1.682e-06      
##                                          
##                   Kappa : 0.4952         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.8285         
##             Specificity : 0.7638         
##          Pos Pred Value : 0.5264         
##          Neg Pred Value : 0.9336         
##              Prevalence : 0.2406         
##          Detection Rate : 0.1993         
##    Detection Prevalence : 0.3786         
##       Balanced Accuracy : 0.7962         
##                                          
##        'Positive' Class : >50K           
## 

SVM

Now we attempt to tune our SVM model. We utilize the tune.svm function to perform a grid search, which determines the best gamma and cost parameters for our SVM model.

# "set.seed(123)
# doParallel::registerDoParallel()
# parameters<-tune.svm(income~., data=income_train_balanced, sampling = "fix",
# gamma = 10^(-5:-1), cost = 10^(-3:1))"
# "parameters"

Since this grid search takes about 2-3 hours to run, the code is not re-run during knitting. The best parameters it returned were gamma = 0.01 and cost = 10, which we use below.

Now, let’s construct the model using the obtained gamma and cost parameters.

svm_attempt_tuned <- svm(income ~ ., data = income_train_balanced,
                         type = "C-classification", kernel = "radial",
                         gamma = 0.01, cost = 10)

Using our tuned model, we predict on the test set and evaluate its performance.

pred_svm_tuned<-predict(svm_attempt_tuned,income_test)
cm_svm_tuned<-confusionMatrix(pred_svm_tuned,income_test$income,positive =">50K")
cm_svm_tuned
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  5746  317
##      >50K   1673 2033
##                                           
##                Accuracy : 0.7963          
##                  95% CI : (0.7882, 0.8042)
##     No Information Rate : 0.7594          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5343          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8651          
##             Specificity : 0.7745          
##          Pos Pred Value : 0.5486          
##          Neg Pred Value : 0.9477          
##              Prevalence : 0.2406          
##          Detection Rate : 0.2081          
##    Detection Prevalence : 0.3794          
##       Balanced Accuracy : 0.8198          
##                                           
##        'Positive' Class : >50K            
## 

General Recap and Comparison

In a generic classification exercise, neither false positives nor false negatives necessarily deserve priority. Here, however, our goal is to flag as many higher-income individuals as possible, so missing one (a false negative) is costlier than wrongly flagging one (a false positive). We therefore prioritize the models' sensitivity toward the >50K class. So far we have six sets of results: naive Bayes, decision tree, random forest, and support vector machine, plus the tuned decision tree and tuned SVM.
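
For reference, sensitivity is the share of actual positives that the model catches:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

For example, from the random forest confusion matrix above: 2074 / (2074 + 276) ≈ 0.8826.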

df_recap<-data.frame(row.names=c("Naive Bayes","Decision Tree","Random Forest","SVM","Tuned Decision Tree","Tuned SVM"),
           Sensitivity=c(cm_bayes$byClass['Sensitivity'],
           cm_tree$byClass['Sensitivity'],
           cm_rf$byClass['Sensitivity'],
           cm_svm$byClass['Sensitivity'],
           cm_tree_tuned$byClass['Sensitivity'],
           cm_svm_tuned$byClass['Sensitivity']))
df_recap %>% arrange(desc(Sensitivity))

As we can see, the random forest achieves the highest sensitivity. It seems that, after all, a voting ensemble of many decision trees outperforms every other method we attempted.

Conclusion

We made several attempts at classifying income, and the random forest model performed best. This suggests that a random forest classifier could alleviate the tedious task of checking citizens' data one by one, making the agency's work considerably more efficient.