The aim is to predict whether a person makes over $50K/year, based on the “Census Income” dataset (the UCI “Adult” data). The dataset contains variables such as age, workclass, education, hours per week, and salary.
library(ggplot2)
library(readr)
library(caret)
library(randomForest)
library(caTools)
library(rpart)
library(rpart.plot)
library(gbm)
library(lattice)
library(klaR)
library(mice)
library(VIM)
adult <- read.csv("C:/Users/Firas/Desktop/income.csv", header = TRUE, sep = ",")[, 2:15]
head(adult)
## age workclass fnlwgt education education.num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## marital.status occupation relationship race sex
## 1 Never-married Adm-clerical Not-in-family White Male
## 2 Married-civ-spouse Exec-managerial Husband White Male
## 3 Divorced Handlers-cleaners Not-in-family White Male
## 4 Married-civ-spouse Handlers-cleaners Husband Black Male
## 5 Married-civ-spouse Prof-specialty Wife Black Female
## 6 Married-civ-spouse Exec-managerial Wife White Female
## capital.gain capital.loss hours.per.week salary
## 1 2174 0 40 <=50K
## 2 0 0 13 <=50K
## 3 0 0 40 <=50K
## 4 0 0 40 <=50K
## 5 0 0 40 <=50K
## 6 0 0 40 <=50K
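One caveat before checking for missing data: aggr() and mice() detect NA values, but the raw UCI “Census Income” file codes missing entries as “ ?”. If your copy of income.csv uses that coding, map those entries to NA when reading the file; a minimal sketch reusing the same path as above:
# Sketch: map the UCI "?" codes to NA on read so VIM and mice can see them;
# adjust na.strings to match how missing values are coded in your file.
# stringsAsFactors = TRUE keeps the character columns as factors on R >= 4.0.
adult <- read.csv("C:/Users/Firas/Desktop/income.csv", header = TRUE,
                  sep = ",", na.strings = c("?", " ?"),
                  stringsAsFactors = TRUE)[, 2:15]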
We use the aggr() function from the VIM package to check whether any values are missing from our dataset.
aggr(adult, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE,
     labels = names(adult), cex.axis = 0.6, gap = 1,
     ylab = c("Histogram of missing data", "Pattern"),
     prop = FALSE, cex.numbers = 0.8)
##
## Variables sorted by number of missings:
## Variable Count
## occupation 1843
## workclass 1836
## age 0
## fnlwgt 0
## education 0
## education.num 0
## marital.status 0
## relationship 0
## race 0
## sex 0
## capital.gain 0
## capital.loss 0
## hours.per.week 0
## salary 0
The plots produced by this function give a quick overview of which variables contain missing values. We can see that the only variables with missing data are “workclass” and “occupation”; in each case the missing values make up roughly 5-6% of the observations.
The aggregation plot also shows the combinations of missing values: in our dataset, whenever “workclass” is missing, “occupation” is missing as well, and “occupation” is missing on its own in only 7 additional rows.
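We can verify these counts directly; a quick sketch using md.pattern() from mice, run before imputing while the NAs are still present (the last line should match the 7 rows mentioned above):
# Sketch: tabulate the joint missingness patterns across variables.
md.pattern(adult)
# Rows where occupation is missing but workclass is observed:
sum(is.na(adult$occupation) & !is.na(adult$workclass))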
For the imputation of our data, we’ll work with the mice package (Multivariate Imputation by Chained Equations).
miceadult <- mice(adult, m=1, method='polyreg')
##
## iter imp variable
## 1 1 workclass occupation
## 2 1 workclass occupation
## 3 1 workclass occupation
## 4 1 workclass occupation
## 5 1 workclass occupation
The argument m is the number of multiple imputations; here we use m = 1, for the sake of demonstration. The method argument specifies the imputation method. Since we’re imputing the variables “workclass” and “occupation”, we choose the method “polyreg” (Bayesian polytomous regression, also known as Bayesian multinomial logistic regression), which is recommended for imputing unordered categorical variables.
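As an aside, a single string passed to method is applied to every incomplete variable. For finer control you can build a named method vector; a sketch (requires mice >= 3.0; mice_alt is an illustrative name):
# Sketch: per-variable imputation methods. Complete variables get "" and
# are left untouched by mice().
meth <- make.method(adult)
meth[c("workclass", "occupation")] <- "polyreg"
mice_alt <- mice(adult, m = 1, method = meth, seed = 101)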
After completing the imputation procedure, we should take a look at the imputed values to make sure that they make sense:
work_table <- table(adult$workclass)
work_table_imp <- table(miceadult$imp$workclass)
occup_table <- table(adult$occupation)
occup_table_imp <- table(miceadult$imp$occupation)
prop.table(work_table)
##
## Federal-gov Local-gov Never-worked Private
## 0.0312449146 0.0681204231 0.0002278275 0.7386818552
## Self-emp-inc Self-emp-not-inc State-gov Without-pay
## 0.0363222132 0.0827013832 0.0422457282 0.0004556550
prop.table(work_table_imp)
##
## Federal-gov Local-gov Never-worked Private
## 0.04956427 0.08605664 0.01361656 0.68518519
## Self-emp-inc Self-emp-not-inc State-gov Without-pay
## 0.02124183 0.08932462 0.04302832 0.01198257
prop.table(occup_table)
##
## Adm-clerical Armed-Forces Craft-repair
## 0.1227293444 0.0002929878 0.1334396771
## Exec-managerial Farming-fishing Handlers-cleaners
## 0.1323653884 0.0323588775 0.0445992578
## Machine-op-inspct Other-service Priv-house-serv
## 0.0651735139 0.1072660981 0.0048505762
## Prof-specialty Protective-serv Sales
## 0.1347743994 0.0211276776 0.1188228400
## Tech-support Transport-moving
## 0.0302103001 0.0519890618
prop.table(occup_table_imp)
##
## Adm-clerical Armed-Forces Craft-repair
## 0.15301139 0.01953337 0.12371134
## Exec-managerial Farming-fishing Handlers-cleaners
## 0.05805751 0.04991861 0.06565383
## Machine-op-inspct Other-service Priv-house-serv
## 0.06456864 0.13673359 0.01844818
## Prof-specialty Protective-serv Sales
## 0.08193163 0.02712968 0.11068909
## Tech-support Transport-moving
## 0.04937602 0.04123711
We can see that the proportions of the imputed values are reasonably close to the proportions of the corresponding levels in the original data.
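A side-by-side view makes the comparison easier to read; a small sketch using the tables computed above:
# Sketch: level proportions, original vs imputed, in one matrix per variable.
round(cbind(original = prop.table(work_table),
            imputed = prop.table(work_table_imp)), 3)
round(cbind(original = prop.table(occup_table),
            imputed = prop.table(occup_table_imp)), 3)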
We could have tried other imputation methods, or generated several imputed datasets (m > 1) using mice and pooled them. But for the sake of demonstration, let’s say that we are satisfied with the imputed values. Finally, we merge them into our original data using the complete() function:
adult <- complete(miceadult)
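A quick check confirms that the merged dataset no longer contains missing values:
# After complete(), there should be no NA left anywhere in adult.
anyNA(adult)  # expected: FALSE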
Now we can start the modeling work. Let’s create a 70:30 train/test split:
set.seed(101)
split_idx <- sample.split(adult$salary, SplitRatio = 0.7)  # logical vector marking the 70% training rows
x_train <- subset(adult, split_idx == TRUE)
x_test <- subset(adult, split_idx == FALSE)
y_train <- x_train$salary
y_test <- x_test$salary
x_train$salary <- NULL
x_test$salary <- NULL
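Because sample.split() stratifies on the outcome, both subsets should keep roughly the original class balance; a quick sanity check:
# Sketch: verify the 70:30 split preserved the salary proportions.
prop.table(table(y_train))
prop.table(table(y_test))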
Next, we set up repeated 10-fold cross-validation on the training data; caret will use these resamples to fit and assess each model (the final evaluation still happens on the held-out test set). The resampling indices are built with createMultiFolds() and passed to trainControl():
cv.10.folds <- createMultiFolds(y_train, k = 10, times = 2)
ctrl_param <- trainControl(method = "repeatedcv", number = 10, repeats = 2,
                           index = cv.10.folds)
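Each element of cv.10.folds holds the training-row indices for one resample, so k = 10 and times = 2 give 20 resamples in total; a quick look:
# Sketch: inspect the resampling scheme before handing it to train().
length(cv.10.folds)        # 20 resamples (10 folds x 2 repeats)
lengths(cv.10.folds)[1:3]  # each holds roughly 90% of the training rows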
We start with logistic regression (caret method "glm" with family = "binomial"). One caveat for reading the output, here and for the later models: confusionMatrix() applied to a table expects the predictions in the rows, while table(df1$Orig, df1$Pred) puts the observed classes there; Accuracy and Kappa are unaffected, but the reported Sensitivity/Specificity are effectively swapped with Pos/Neg Pred Value.
logreg <- train(x = x_train, y = y_train, method = "glm",
                family = "binomial", trControl = ctrl_param)
y_predicted <- predict(logreg, x_test)
df1 <- data.frame(Orig = y_test, Pred = y_predicted)
confusionMatrix(table(df1$Orig, df1$Pred), positive = " >50K")
## Confusion Matrix and Statistics
##
##
## <=50K >50K
## <=50K 6889 527
## >50K 961 1391
##
## Accuracy : 0.8477
## 95% CI : (0.8404, 0.8547)
## No Information Rate : 0.8036
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5553
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7252
## Specificity : 0.8776
## Pos Pred Value : 0.5914
## Neg Pred Value : 0.9289
## Prevalence : 0.1964
## Detection Rate : 0.1424
## Detection Prevalence : 0.2408
## Balanced Accuracy : 0.8014
##
## 'Positive' Class : >50K
##
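Accuracy can be flattering when the classes are imbalanced, so a threshold-free metric is also worth a look. A sketch computing the test-set AUC for the logistic model with colAUC() from caTools (already loaded); levels(y_test)[2] selects the probability column for the “>50K” class:
# Sketch: threshold-free test-set AUC for the logistic regression model.
probs <- predict(logreg, x_test, type = "prob")  # class probabilities
colAUC(probs[, levels(y_test)[2]], y_test)       # AUC for the ">50K" column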
Next, a gradient boosting machine (caret method "gbm"):
gbmodel <- train(x = x_train, y = y_train, method = "gbm", trControl = ctrl_param)
y_predicted <- predict(gbmodel, x_test)
df1 <- data.frame(Orig = y_test, Pred = y_predicted)
confusionMatrix(table(df1$Orig, df1$Pred), positive = " >50K")
## Confusion Matrix and Statistics
##
##
## <=50K >50K
## <=50K 6997 419
## >50K 912 1440
##
## Accuracy : 0.8637
## 95% CI : (0.8568, 0.8705)
## No Information Rate : 0.8097
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5986
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7746
## Specificity : 0.8847
## Pos Pred Value : 0.6122
## Neg Pred Value : 0.9435
## Prevalence : 0.1903
## Detection Rate : 0.1474
## Detection Prevalence : 0.2408
## Balanced Accuracy : 0.8296
##
## 'Positive' Class : >50K
##
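By default, caret searches a small built-in grid of gbm hyperparameters. For more control you can supply an explicit tuneGrid; a sketch (gbm_grid and gbmodel2 are illustrative names, and the values are examples rather than tuned choices):
# Sketch: an explicit gbm tuning grid; caret's "gbm" model expects exactly
# these four parameters. verbose = FALSE is passed through to gbm().
gbm_grid <- expand.grid(n.trees = c(100, 200),
                        interaction.depth = c(2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
gbmodel2 <- train(x = x_train, y = y_train, method = "gbm",
                  trControl = ctrl_param, tuneGrid = gbm_grid,
                  verbose = FALSE)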
Then a random forest (caret method "rf"). Note that this model is trained with caret’s default bootstrap resampling, since no trControl is passed:
rfmodel <- train(x = x_train, y = y_train, method = "rf")
y_predicted <- predict(rfmodel, x_test)
df1 <- data.frame(Orig = y_test, Pred = y_predicted)
confusionMatrix(table(df1$Orig, df1$Pred), positive = " >50K")
## Confusion Matrix and Statistics
##
##
## <=50K >50K
## <=50K 6977 439
## >50K 888 1464
##
## Accuracy : 0.8641
## 95% CI : (0.8572, 0.8709)
## No Information Rate : 0.8052
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6025
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7693
## Specificity : 0.8871
## Pos Pred Value : 0.6224
## Neg Pred Value : 0.9408
## Prevalence : 0.1948
## Detection Rate : 0.1499
## Detection Prevalence : 0.2408
## Balanced Accuracy : 0.8282
##
## 'Positive' Class : >50K
##
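caret’s varImp() shows which predictors drive the random forest; for method = "rf" it wraps randomForest’s importance measure:
# Sketch: variable importance for the fitted random forest.
varImp(rfmodel)
plot(varImp(rfmodel), top = 10)  # plot the ten most important predictors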
Finally, a naive Bayes classifier (caret method "nb", provided by the klaR package loaded earlier):
nbmodel <- train(x = x_train, y = y_train, method = "nb", trControl = ctrl_param)
y_predicted <- predict(nbmodel, x_test)
df1 <- data.frame(Orig = y_test, Pred = y_predicted)
confusionMatrix(table(df1$Orig, df1$Pred), positive = " >50K")
## Confusion Matrix and Statistics
##
##
## <=50K >50K
## <=50K 6920 496
## >50K 1182 1170
##
## Accuracy : 0.8282
## 95% CI : (0.8206, 0.8356)
## No Information Rate : 0.8294
## P-Value [Acc > NIR] : 0.6326
##
## Kappa : 0.4782
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7023
## Specificity : 0.8541
## Pos Pred Value : 0.4974
## Neg Pred Value : 0.9331
## Prevalence : 0.1706
## Detection Rate : 0.1198
## Detection Prevalence : 0.2408
## Balanced Accuracy : 0.7782
##
## 'Positive' Class : >50K
##
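To wrap up: on the held-out test set, the random forest (accuracy 0.8641) and gbm (0.8637) finish essentially tied in front, logistic regression follows (0.8477), and naive Bayes trails (0.8282). A short sketch that recomputes those held-out accuracies in one place (models and acc are illustrative names):
# Sketch: held-out accuracy for all four fitted models, side by side.
models <- list(glm = logreg, gbm = gbmodel, rf = rfmodel, nb = nbmodel)
acc <- sapply(models, function(m) mean(predict(m, x_test) == y_test))
round(sort(acc, decreasing = TRUE), 4)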