In this project we will explore and model data from the UCI Machine Learning Repository, hosted on Kaggle. The data comprises demographic information about survey respondents. Our objective is to perform some exploratory data analysis and classification on this data.
This data is obtained from: https://www.kaggle.com/uciml/adult-census-income
Let's pretend that we are a government agency responsible for investigating tax payment compliance. This job requires us to identify higher-income individuals and check for any possibility of tax evasion or fraud. So, in this project, our main goal is to classify which income category each citizen falls into, based on the citizens' known demographics.
Loading all the required libraries
library(readr)
library(dplyr)
library(e1071)
library(partykit)
library(caret)
library(DMwR)
library(UBL)
library(rpart)
library(rpart.plot)
library(wesanderson)
library(ggpubr)
library(ggalluvial)
Loading the required data
income<-read_csv("adult.csv")
Let's inspect the dataset
head(income)
From the initial inspection, the workclass and occupation columns have entries consisting of a question mark.
Next, we examine if the data has any empty entries.
colSums(is.na(income))
## age workclass fnlwgt education education.num
## 0 0 0 0 0
## marital.status occupation relationship race sex
## 0 0 0 0 0
## capital.gain capital.loss hours.per.week native.country income
## 0 0 0 0 0
In the prior inspection, we saw that the workclass and occupation variables contain question-mark entries. Now, let's inspect whether any other columns in the dataset have question-mark entries. Though a question mark is fairly self-explanatory, we are going to convert all of them to "Undisclosed".
colSums(income=="?")## age workclass fnlwgt education education.num
## 0 1836 0 0 0
## marital.status occupation relationship race sex
## 0 1843 0 0 0
## capital.gain capital.loss hours.per.week native.country income
## 0 0 0 583 0
Other than the occupation and workclass columns, the native.country column also has question-mark entries. So, let's replace those values with "Undisclosed" and convert the qualitative variables into factors.
income_clean<-income %>%
mutate(occupation = replace(occupation, occupation == "?", "Undisclosed")) %>%
mutate(workclass = replace(workclass, workclass == "?", "Undisclosed")) %>%
mutate(native.country = replace(native.country, native.country == "?", "Undisclosed")) %>%
mutate_if(is.character,as.factor)
Now, let's inspect those columns again
colSums(income_clean=="?")
## age workclass fnlwgt education education.num
## 0 0 0 0 0
## marital.status occupation relationship race sex
## 0 0 0 0 0
## capital.gain capital.loss hours.per.week native.country income
## 0 0 0 0 0
As we can see, the question mark entries no longer exist.
This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
Description of fnlwgt (final weight), from the dataset documentation:
The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:
A single cell estimate of the population 16+ for each state.
Controls for Hispanic Origin by age and sex.
Controls by Race, age and sex.
We use all three sets of controls in our weighting program and “rake” through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
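To make the role of fnlwgt concrete, here is a minimal sketch (using the column names above; not part of the original analysis) of how the weight could turn raw sample counts into estimated population totals per income class:
# A hedged sketch: treating fnlwgt as a survey weight, the weighted sum
# estimates how many people in the population each income class represents,
# versus the raw number of sampled rows.
income_clean %>%
  group_by(income) %>%
  summarise(sample_rows = n(),
            estimated_population = sum(fnlwgt))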
a<-income_clean %>%group_by(marital.status,income) %>% count() %>%
ggplot(aes(reorder(marital.status,n),n))+geom_col(aes(fill=income))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=7, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Marital Status")
b<-income_clean %>%group_by(relationship,income) %>% count() %>%
ggplot(aes(reorder(relationship,n),n))+geom_col(aes(fill=income))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=6, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),legend.position = "none",plot.title = element_text(size=10))+ggtitle("Relationship Status")
figure<-ggarrange(a, b,
common.legend = TRUE, legend = "bottom")
annotate_figure(figure,top = text_grob("Income Based on Relationship and Marital Status", color = "black", face = "bold", size = 15))
c<-income_clean %>% group_by(occupation,income)%>% count() %>%
ggplot(aes(reorder(occupation,n),n))+geom_col(aes(fill=income))+theme_light()+coord_flip()+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+scale_fill_manual(values=wes_palette(n=2, name=("BottleRocket1")))+ggtitle("Occupation")
d<-income_clean %>%group_by(workclass,income) %>% count() %>%
ggplot(aes(reorder(workclass,n),n))+geom_col(aes(fill=income))+theme_light()+coord_flip()+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+scale_fill_manual(values=wes_palette(n=2, name=("BottleRocket1")))+ggtitle("Work Class")
figure<-ggarrange(c, d,
common.legend = TRUE, legend = "bottom")
annotate_figure(figure,top = text_grob("Income Based on Economic Factors", color = "black", face = "bold", size = 15))
income_sex<-income_clean%>%
group_by(sex,income) %>%
count()
income_sex %>% ggplot(aes(axis1 = sex,
axis2 = income,
y = n)) +
geom_alluvium(aes(fill = sex)) +
geom_stratum() +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
scale_x_discrete(limits = c("Sex", "Income"),
expand = c(.1, .1)) +
scale_fill_manual(values=wes_palette(n=2, name=("BottleRocket1"))) +
labs(title = "Income",
subtitle = "Stratified by Sex",
y = "Frequency") +
theme_minimal()
income_clean %>% group_by(race,income) %>% count() %>% ggplot(aes(x=reorder(race,n),y=n))+geom_col(aes(fill=income),position = "dodge")+coord_flip()+theme_light()+scale_fill_manual(values=wes_palette(n=2, name=("BottleRocket1")))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=12,face="bold"))+ggtitle("Income Class Based on Race")
ggplot(income_clean,aes(x=hours.per.week))+geom_histogram(aes(fill=income))+theme_light()+scale_fill_manual(values=wes_palette(n=2, name=("BottleRocket1")))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=12,face="bold"))+ggtitle("Working Hours Distribution Based on Income")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
income_clean %>%group_by(education,income) %>% count() %>%
ggplot(aes(reorder(education,n),n))+geom_col(aes(fill=income))+theme_light()+coord_flip()+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=12,face="bold"))+scale_fill_manual(values=wes_palette(n=2, name=("BottleRocket1")))+ggtitle("Income by Education Level")
To reduce the chance of overfitting or underfitting our model to the training data, we are going to split the data into a training and a testing set.
We split the data 70/30: 70% of the rows go into the training set and 30% into the testing set.
set.seed(110)
index <- sample(x = nrow(income_clean), nrow(income_clean) * 0.7)
income_train <- income_clean[index,]
income_test <- income_clean[-index,]
Now, let's inspect if the target variable is balanced in proportion
prop.table(table(income_train$income))
##
## <=50K >50K
## 0.7590821 0.2409179
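(As an aside, a hedged sketch of an alternative not used here: caret's createDataPartition performs a stratified split, which would preserve these class proportions in both sets by construction.)
# Hypothetical alternative: stratified 70/30 split on the income variable
# index_strat <- createDataPartition(income_clean$income, p = 0.7, list = FALSE)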
Since the classes are not balanced in proportion, we are going to down-sample or up-sample. To decide which, we examine the number of rows in each class of the response variable.
table(income_train$income)
##
## <=50K >50K
## 17301 5491
We see that the >50K category has 5491 rows of data. Since that is still a fair amount of data, we are going to down-sample the majority class.
income_train_balanced<-downSample(x=income_train %>% select(-income),
y=income_train$income,
yname="income")As we can see from the result below, our data is already balanced.
table(income_train_balanced$income)
##
## <=50K >50K
## 5491 5491
First, we are going to try the Naive Bayes model. A Naive Bayes classifier uses conditional probability to predict its outcome. To give a clearer picture, the formula below is the one Naive Bayes relies on: it predicts the outcome of the target variable conditional on the values the other related variables have taken.
Conditional Probability Formula: P(A | B) = P(B | A) P(A) / P(B)
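For intuition, a worked example with illustrative (hypothetical) numbers: if 24% of citizens earn >50K, 85% of >50K earners are married, and 50% of all citizens are married, then P(>50K | married) = 0.85 × 0.24 / 0.50 ≈ 0.41. Naive Bayes multiplies such per-variable conditional probabilities together, under the "naive" assumption that the predictors are independent given the class.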
Now, let's start our modeling process. We name our model naive_income and set the Laplace smoothing parameter to 1, in case some category never co-occurs with a class in the training data.
naive_income<-naiveBayes(x=income_train_balanced %>% select(-income),y=income_train_balanced$income,laplace = 1)
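As an optional sanity check (a sketch, assuming the standard e1071 model object), we can peek at one of the fitted class-conditional tables:
# The naiveBayes object stores per-class conditional distributions in $tables;
# e.g. the proportion of each sex within each income class (output omitted).
naive_income$tables$sex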
Now, let's use our model to predict the test data.
predicted_naive<-predict(naive_income,income_test%>% select(-income))
Let's evaluate our model
cm_bayes<-confusionMatrix(predicted_naive,income_test$income,positive =">50K")
cm_bayes
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 6704 945
## >50K 715 1405
##
## Accuracy : 0.8301
## 95% CI : (0.8225, 0.8375)
## No Information Rate : 0.7594
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5188
##
## Mcnemar's Test P-Value : 1.903e-08
##
## Sensitivity : 0.5979
## Specificity : 0.9036
## Pos Pred Value : 0.6627
## Neg Pred Value : 0.8765
## Prevalence : 0.2406
## Detection Rate : 0.1438
## Detection Prevalence : 0.2170
## Balanced Accuracy : 0.7507
##
## 'Positive' Class : >50K
##
Next, we are going to use the decision tree algorithm to predict which income bracket each individual falls into. A decision tree uses a tree-like sequence of splits to reach a decision based on the other variables. We are going to build the tree with the rpart package; for classification, rpart selects splits using Gini impurity.
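For intuition, the Gini impurity of a node with class proportions p is 1 - sum(p^2); a tiny illustration (a sketch, not part of the original analysis):
# Gini impurity: 0 for a pure node, 0.5 for a 50/50 two-class node
gini <- function(p) 1 - sum(p^2)
gini(c(0.5, 0.5)) # 0.5, like our balanced root node
gini(c(1, 0))     # 0, a pure node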
dtree_income <- rpart(income~., data = income_train_balanced, method = 'class')
rpart.plot(dtree_income, extra = 106)
printcp(dtree_income)
##
## Classification tree:
## rpart(formula = income ~ ., data = income_train_balanced, method = "class")
##
## Variables actually used in tree construction:
## [1] age capital.gain education occupation relationship
##
## Root node error: 5491/10982 = 0.5
##
## n= 10982
##
## CP nsplit rel error xerror xstd
## 1 0.528319 0 1.00000 1.02149 0.0095402
## 2 0.042069 1 0.47168 0.47168 0.0081020
## 3 0.010381 2 0.42961 0.43089 0.0078464
## 4 0.010000 6 0.37370 0.38463 0.0075217
The root node misclassifies 5491 out of 10982 observations (50%), as expected for a balanced training set. The first and most important split is on relationship.
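To verify this (a sketch, assuming the standard rpart fitted object), we could also inspect the variable importance scores that rpart records:
# rpart stores overall variable importance on the fitted object (output omitted)
dtree_income$variable.importance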
Now, let’s predict using our model
predicted_tree<-predict(dtree_income,income_test,type="class")
Next, let's evaluate the model
cm_tree<-confusionMatrix(predicted_tree,income_test$income,positive =">50K" )
cm_tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5667 403
## >50K 1752 1947
##
## Accuracy : 0.7794
## 95% CI : (0.771, 0.7876)
## No Information Rate : 0.7594
## P-Value [Acc > NIR] : 1.682e-06
##
## Kappa : 0.4952
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8285
## Specificity : 0.7638
## Pos Pred Value : 0.5264
## Neg Pred Value : 0.9336
## Prevalence : 0.2406
## Detection Rate : 0.1993
## Detection Prevalence : 0.3786
## Balanced Accuracy : 0.7962
##
## 'Positive' Class : >50K
##
Next, we are going to use a random forest model to predict which income bracket each individual falls into. Random forest uses bagging, i.e. bootstrap aggregation: it grows numerous decision trees on bootstrap samples of the data and determines the outcome by majority vote over the trees' predictions.
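A conceptual sketch of the majority vote (hypothetical votes, purely illustrative):
# Three hypothetical trees vote on one citizen; the forest takes the majority
votes <- c(">50K", "<=50K", ">50K")
names(which.max(table(votes))) # ">50K"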
In our modeling process, we are going to tune our model using five-fold cross-validation with three repetitions.
set.seed(417)
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3)
mtry <- sqrt(ncol(income_train_balanced))
tunegrid <- expand.grid(.mtry=mtry)
rf_income <- train(income~.,
data=income_train_balanced,
method='rf',
metric='Accuracy',
tuneGrid=tunegrid,
trControl=ctrl)
Now we save our model to an RDS file for future use and reproducibility.
saveRDS(rf_income, "rf_income.RDS")
income_forest<-readRDS("rf_income.RDS")
Now, we predict on the test data and evaluate the model with a confusion matrix.
predicted_forest <- predict(income_forest, income_test)
cm_rf <- confusionMatrix(data = predicted_forest,
reference = income_test$income,positive =">50K")
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5557 276
## >50K 1862 2074
##
## Accuracy : 0.7811
## 95% CI : (0.7728, 0.7893)
## No Information Rate : 0.7594
## P-Value [Acc > NIR] : 2.117e-07
##
## Kappa : 0.5132
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8826
## Specificity : 0.7490
## Pos Pred Value : 0.5269
## Neg Pred Value : 0.9527
## Prevalence : 0.2406
## Detection Rate : 0.2123
## Detection Prevalence : 0.4029
## Balanced Accuracy : 0.8158
##
## 'Positive' Class : >50K
##
Now, let’s inspect the variable importance in our random forest model.
varImp(income_forest)
## rf variable importance
##
## only 20 most important variables shown (out of 100)
##
## Overall
## marital.statusMarried-civ-spouse 100.000
## marital.statusNever-married 49.190
## age 48.098
## education.num 46.376
## capital.gain 42.741
## relationshipOwn-child 27.904
## hours.per.week 26.562
## relationshipNot-in-family 24.606
## sexMale 20.198
## occupationExec-managerial 14.410
## occupationProf-specialty 12.347
## capital.loss 12.068
## educationBachelors 11.681
## occupationOther-service 11.514
## relationshipUnmarried 11.371
## fnlwgt 7.924
## educationMasters 7.557
## relationshipWife 7.345
## educationHS-grad 6.817
## workclassSelf-emp-inc 4.123
We see that marital status is the most important variable in our random forest model; it even exceeds the importance of occupation and education. What a surprise!
The fourth model that we are going to attempt is the support vector machine (SVM). This method can be used for classification: it constructs hyperplanes that separate the data into classes.
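For intuition about the two knobs we tune below: with the radial (RBF) kernel, gamma controls how far a single training point's influence reaches, and cost controls the penalty for margin violations. A tiny illustrative helper (a sketch of the kernel formula, not part of the original analysis):
# RBF kernel: similarity decays with squared distance, scaled by gamma
rbf_kernel <- function(x1, x2, gamma) exp(-gamma * sum((x1 - x2)^2))
rbf_kernel(c(1, 2), c(1.5, 2.5), gamma = 16)   # large gamma: very local influence
rbf_kernel(c(1, 2), c(1.5, 2.5), gamma = 0.01) # small gamma: smoother boundary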
In our first attempt, we are going to use a gamma of 16 and a cost of 1.
svm_attempt<-svm(income~., data=income_train_balanced,
type="C-classification", kernel="radial",
gamma=16, cost=1)
Now, we are going to predict using our model and evaluate it with the confusion matrix.
pred_svm<-predict(svm_attempt,income_test)
cm_svm<-confusionMatrix(pred_svm,income_test$income,positive =">50K")
cm_svm
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 6902 1823
## >50K 517 527
##
## Accuracy : 0.7605
## 95% CI : (0.7519, 0.7689)
## No Information Rate : 0.7594
## P-Value [Acc > NIR] : 0.4118
##
## Kappa : 0.1908
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.22426
## Specificity : 0.93031
## Pos Pred Value : 0.50479
## Neg Pred Value : 0.79106
## Prevalence : 0.24056
## Detection Rate : 0.05395
## Detection Prevalence : 0.10687
## Balanced Accuracy : 0.57728
##
## 'Positive' Class : >50K
##
Now, let's tune our decision tree model. Judging from the plot of the earlier tree, it does not seem overfitted, so we are only going to tune the cp (complexity) parameter, which the prune function takes into account.
First we determine which cp value gives the lowest cross-validation error.
which.min(dtree_income$cptable[,"xerror"])
## 4
## 4
The lowest cross-validation error is in row 4 of the CP table, so we prune the tree using the cp value from that row and plot the result.
pruned_tree<-prune(dtree_income, cp = dtree_income$cptable[4,"CP"])
rpart.plot(pruned_tree)
printcp(pruned_tree)
##
## Classification tree:
## rpart(formula = income ~ ., data = income_train_balanced, method = "class")
##
## Variables actually used in tree construction:
## [1] age capital.gain education occupation relationship
##
## Root node error: 5491/10982 = 0.5
##
## n= 10982
##
## CP nsplit rel error xerror xstd
## 1 0.528319 0 1.00000 1.02149 0.0095402
## 2 0.042069 1 0.47168 0.47168 0.0081020
## 3 0.010381 2 0.42961 0.43089 0.0078464
## 4 0.010000 6 0.37370 0.38463 0.0075217
Nothing has changed from the pre-tuned model; the tree is identical. This is because rpart's default cp of 0.01 already corresponds to the row with the lowest cross-validation error.
Now let’s try to evaluate our tuned model using the confusion matrix.
predicted_pruned<-predict(pruned_tree, income_test,type="class")
cm_tree_tuned<-confusionMatrix(predicted_pruned,income_test$income,positive =">50K")
cm_tree_tuned## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5667 403
## >50K 1752 1947
##
## Accuracy : 0.7794
## 95% CI : (0.771, 0.7876)
## No Information Rate : 0.7594
## P-Value [Acc > NIR] : 1.682e-06
##
## Kappa : 0.4952
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8285
## Specificity : 0.7638
## Pos Pred Value : 0.5264
## Neg Pred Value : 0.9336
## Prevalence : 0.2406
## Detection Rate : 0.1993
## Detection Prevalence : 0.3786
## Balanced Accuracy : 0.7962
##
## 'Positive' Class : >50K
##
Now, we are attempting to tune our SVM model. We are going to use the tune.svm function to perform a grid search, which will determine the best gamma and cost parameters for our SVM.
# "set.seed(123)
# doParallel::registerDoParallel()
# parameters<-tune.svm(income~., data=income_train_balanced, sampling = "fix",
# gamma = 10^(-5:-1), cost = 10^(-3:1))"
# "parameters"Since this algorithm takes about 2-3 hours to run, I am not going to run the code again in the knitting process. However, I will include the end result in the screenshot below.
Now, let’s construct the model using the obtained gamma and cost parameter.
svm_attempt_tuned<-svm(income~., data=income_train_balanced,
type="C-classification", kernel="radial",
gamma=0.01 , cost=10)
Using our tuned model, we are going to predict and evaluate its performance.
pred_svm_tuned<-predict(svm_attempt_tuned,income_test)
cm_svm_tuned<-confusionMatrix(pred_svm_tuned,income_test$income,positive =">50K")
cm_svm_tuned
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5746 317
## >50K 1673 2033
##
## Accuracy : 0.7963
## 95% CI : (0.7882, 0.8042)
## No Information Rate : 0.7594
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5343
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8651
## Specificity : 0.7745
## Pos Pred Value : 0.5486
## Neg Pred Value : 0.9477
## Prevalence : 0.2406
## Detection Rate : 0.2081
## Detection Prevalence : 0.3794
## Balanced Accuracy : 0.8198
##
## 'Positive' Class : >50K
##
Prioritizing false positives versus false negatives would not matter much if we were doing generic classification. However, our goal is to catch as many higher-income individuals as possible, so false negatives (missed >50K earners) are the costly mistake; in this case we prioritize the sensitivity of the model. So far, we have obtained six models: Naive Bayes, decision tree, random forest, and support vector machine, plus tuned versions of the decision tree and the SVM.
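As a reminder, sensitivity is TP / (TP + FN) for the positive class ">50K"; a quick check against the random forest's confusion matrix above:
# True positives / (true positives + false negatives), read off cm_rf's table
2074 / (2074 + 276) # ~0.8826, matching the reported sensitivity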
df_recap<-data.frame(row.names=c("Naive Bayes","Decision Tree","Random Forest","SVM","Tuned Decision Tree","Tuned SVM"),
Sensitivity=c(cm_bayes$byClass['Sensitivity'],
cm_tree$byClass['Sensitivity'],
cm_rf$byClass['Sensitivity'],
cm_svm$byClass['Sensitivity'],
cm_tree_tuned$byClass['Sensitivity'],
cm_svm_tuned$byClass['Sensitivity']))
df_recap %>% arrange(desc(Sensitivity))
As we can see, the random forest performs best. It seems that, after all, a voting ensemble of many decision trees outperforms every other method we attempted.
We have made several attempts at classifying income, and the random forest model performed best. This shows that a random forest classifier could alleviate the heinous task of checking citizens' records one by one, making the agency's work much more efficient.