Census Income Dataset

The aim is to predict whether a person makes over $50k/year based on “Census Income” dataset. This dataset has variables like age, workclass, education, hours/week, salary, etc.

Loading Necessary Packages

library(ggplot2)
library(readr)
library(caret)
library(randomForest)
library(caTools)
library(rpart)
library(rpart.plot)
library(gbm)
library(lattice)
library(klaR)
library(mice)
library(VIM)

Reading Data

adult <- read.csv("C:/Users/Firas/Desktop/income.csv", header=T, sep=",")[,2:15]
head(adult)
##   age         workclass fnlwgt  education education.num
## 1  39         State-gov  77516  Bachelors            13
## 2  50  Self-emp-not-inc  83311  Bachelors            13
## 3  38           Private 215646    HS-grad             9
## 4  53           Private 234721       11th             7
## 5  28           Private 338409  Bachelors            13
## 6  37           Private 284582    Masters            14
##        marital.status         occupation   relationship   race     sex
## 1       Never-married       Adm-clerical  Not-in-family  White    Male
## 2  Married-civ-spouse    Exec-managerial        Husband  White    Male
## 3            Divorced  Handlers-cleaners  Not-in-family  White    Male
## 4  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
## 5  Married-civ-spouse     Prof-specialty           Wife  Black  Female
## 6  Married-civ-spouse    Exec-managerial           Wife  White  Female
##   capital.gain capital.loss hours.per.week salary
## 1         2174            0             40  <=50K
## 2            0            0             13  <=50K
## 3            0            0             40  <=50K
## 4            0            0             40  <=50K
## 5            0            0             40  <=50K
## 6            0            0             40  <=50K

Dealing With Missing Values

We use the aggr() function from the package VIM, to check if some values are missing from our dataset.

aggr(adult, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(adult), cex.axis=.6, gap=1, ylab=c("Histogram of missing data","Pattern"),prop=FALSE,cex.numbers=0.8)

## 
##  Variables sorted by number of missings: 
##        Variable Count
##      occupation  1843
##       workclass  1836
##             age     0
##          fnlwgt     0
##       education     0
##   education.num     0
##  marital.status     0
##    relationship     0
##            race     0
##             sex     0
##    capital.gain     0
##    capital.loss     0
##  hours.per.week     0
##          salary     0

The plots produced by the function are very useful to have an idea about the variables with missing values. We can see that the only variables with missing data are “workclass” and “occupation”. The missing values are about 5% of these variables.

The Aggregation plot can show us the “combination” of missing values : in our dataset, if values in the variable occupation are missing, they are mostly also missing in the variable “workclass”“, except for 7 times.

For the imputation of our data, we’ll work with the package “MICE” (Multiple Imputation by Chained Equations).

miceadult <- mice(adult, m=1, method='polyreg')
## 
##  iter imp variable
##   1   1  workclass  occupation
##   2   1  workclass  occupation
##   3   1  workclass  occupation
##   4   1  workclass  occupation
##   5   1  workclass  occupation

The argument “m” is the number of multiple imputations. Here, we’ll use m =1, for the sake of demonstration. The method argument specify the imputation method. Since we’re imputing the variables “workclass” and “occupation”, we’ll choose the method “polyreg” (Bayesian polytomous regression model, also known as Bayesian multinomial logistic regression) which is recommanded for the imputation of unordered categorical variables.

After completing the imputation procedure, we should take a look at the imputed data to make sure that it makes sense :

work_table = table (adult$workclass)
work_table_imp = table(miceadult$imp$workclass)

occup_table = table (adult$occupation)
occup_table_imp = table (miceadult$imp$occupation)
prop.table(work_table)
## 
##       Federal-gov         Local-gov      Never-worked           Private 
##      0.0312449146      0.0681204231      0.0002278275      0.7386818552 
##      Self-emp-inc  Self-emp-not-inc         State-gov       Without-pay 
##      0.0363222132      0.0827013832      0.0422457282      0.0004556550
prop.table(work_table_imp)
## 
##       Federal-gov         Local-gov      Never-worked           Private 
##        0.04956427        0.08605664        0.01361656        0.68518519 
##      Self-emp-inc  Self-emp-not-inc         State-gov       Without-pay 
##        0.02124183        0.08932462        0.04302832        0.01198257
prop.table(occup_table)
## 
##       Adm-clerical       Armed-Forces       Craft-repair 
##       0.1227293444       0.0002929878       0.1334396771 
##    Exec-managerial    Farming-fishing  Handlers-cleaners 
##       0.1323653884       0.0323588775       0.0445992578 
##  Machine-op-inspct      Other-service    Priv-house-serv 
##       0.0651735139       0.1072660981       0.0048505762 
##     Prof-specialty    Protective-serv              Sales 
##       0.1347743994       0.0211276776       0.1188228400 
##       Tech-support   Transport-moving 
##       0.0302103001       0.0519890618
prop.table(occup_table_imp)
## 
##       Adm-clerical       Armed-Forces       Craft-repair 
##         0.15301139         0.01953337         0.12371134 
##    Exec-managerial    Farming-fishing  Handlers-cleaners 
##         0.05805751         0.04991861         0.06565383 
##  Machine-op-inspct      Other-service    Priv-house-serv 
##         0.06456864         0.13673359         0.01844818 
##     Prof-specialty    Protective-serv              Sales 
##         0.08193163         0.02712968         0.11068909 
##       Tech-support   Transport-moving 
##         0.04937602         0.04123711

We can see that the proportions of our imputed values, are more or less close to the proportions of the variables levels in the original data.

We could have tried other methods of imputation or multiple rounds of imputed values using mice. But for the sake of demonstration, let’s say that we are satisfied with the imputed values. Finally, we’re going to merge them into our original data using the complete() function:

adult <- complete (miceadult)

Now we can start our Modeling work. Let’s create 70:30 split :

set.seed(101)

sample = sample.split(adult$salary,SplitRatio = 0.7)

x_train = subset(adult,sample==TRUE)
x_test = subset(adult,sample== FALSE)

y_train <- x_train$salary
y_test <- x_test$salary

x_train$salary <- NULL
x_test$salary <- NULL

Next, we’re going to use 10-fold cross validation to divide the dataset in training and test sets. Those parameters will be specified with the trainControl() function :

cv.10.folds <- createMultiFolds(y_train,k=10,times=2)
ctrl_param <- trainControl(method="repeatedcv",number=10,repeats = 2,index = cv.10.folds)

First Model : Logistic Regression

logreg <-train(x=x_train,y=y_train,method="glm", family="binomial",trControl = ctrl_param)
y_predicted<-predict(logreg,x_test)
df1<-data.frame(Orig=y_test,Pred=y_predicted)
confusionMatrix(table(df1$Orig,df1$Pred),positive=" >50K")
## Confusion Matrix and Statistics
## 
##         
##           <=50K  >50K
##    <=50K   6889   527
##    >50K     961  1391
##                                           
##                Accuracy : 0.8477          
##                  95% CI : (0.8404, 0.8547)
##     No Information Rate : 0.8036          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5553          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7252          
##             Specificity : 0.8776          
##          Pos Pred Value : 0.5914          
##          Neg Pred Value : 0.9289          
##              Prevalence : 0.1964          
##          Detection Rate : 0.1424          
##    Detection Prevalence : 0.2408          
##       Balanced Accuracy : 0.8014          
##                                           
##        'Positive' Class :  >50K           
## 

Second Model : GBM

gbmodel <-train(x=x_train,y=y_train,method="gbm",trControl = ctrl_param)
y_predicted<-predict(gbmodel,x_test)
df1<-data.frame(Orig=y_test,Pred=y_predicted)
confusionMatrix(table(df1$Orig,df1$Pred),positive=" >50K")
## Confusion Matrix and Statistics
## 
##         
##           <=50K  >50K
##    <=50K   6997   419
##    >50K     912  1440
##                                           
##                Accuracy : 0.8637          
##                  95% CI : (0.8568, 0.8705)
##     No Information Rate : 0.8097          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5986          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7746          
##             Specificity : 0.8847          
##          Pos Pred Value : 0.6122          
##          Neg Pred Value : 0.9435          
##              Prevalence : 0.1903          
##          Detection Rate : 0.1474          
##    Detection Prevalence : 0.2408          
##       Balanced Accuracy : 0.8296          
##                                           
##        'Positive' Class :  >50K           
## 

Third Model : Random Forests

rfmodel <-train(x=x_train,y=y_train,method="rf")
y_predicted<-predict(rfmodel,x_test)
df1<-data.frame(Orig=y_test,Pred=y_predicted)
confusionMatrix(table(df1$Orig,df1$Pred),positive=" >50K")
## Confusion Matrix and Statistics
## 
##         
##           <=50K  >50K
##    <=50K   6977   439
##    >50K     888  1464
##                                           
##                Accuracy : 0.8641          
##                  95% CI : (0.8572, 0.8709)
##     No Information Rate : 0.8052          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6025          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7693          
##             Specificity : 0.8871          
##          Pos Pred Value : 0.6224          
##          Neg Pred Value : 0.9408          
##              Prevalence : 0.1948          
##          Detection Rate : 0.1499          
##    Detection Prevalence : 0.2408          
##       Balanced Accuracy : 0.8282          
##                                           
##        'Positive' Class :  >50K           
## 

Fourth Model : Naive Bayes

nbmodel <-train(x=x_train,y=y_train,method="nb",trControl = ctrl_param)
y_predicted<-predict(nbmodel,x_test)
df1<-data.frame(Orig=y_test,Pred=y_predicted)
confusionMatrix(table(df1$Orig,df1$Pred),positive=" >50K")
## Confusion Matrix and Statistics
## 
##         
##           <=50K  >50K
##    <=50K   6920   496
##    >50K    1182  1170
##                                           
##                Accuracy : 0.8282          
##                  95% CI : (0.8206, 0.8356)
##     No Information Rate : 0.8294          
##     P-Value [Acc > NIR] : 0.6326          
##                                           
##                   Kappa : 0.4782          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7023          
##             Specificity : 0.8541          
##          Pos Pred Value : 0.4974          
##          Neg Pred Value : 0.9331          
##              Prevalence : 0.1706          
##          Detection Rate : 0.1198          
##    Detection Prevalence : 0.2408          
##       Balanced Accuracy : 0.7782          
##                                           
##        'Positive' Class :  >50K           
##