A machine learning “micro” (non-comprehensive) project attempting to predict the salary class of people in the dataset.
Meant to showcase general understanding of applied ML in R.
library(ggplot2)
library(dplyr)
Reading the data:
adult <- read.csv("adult_sal.csv")
head(adult)
## X age type_employer fnlwgt education education_num marital
## 1 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 3 38 Private 215646 HS-grad 9 Divorced
## 4 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hr_per_week country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
Removing repeated index:
adult <- select(adult, -X)
Checking out the structure of the variables and their summary.
str(adult)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ country : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(adult)
## age type_employer fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education_num marital
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial:4066 Other-relative: 981 Black : 3124
## Adm-clerical :3770 Own-child : 5068 Other : 271
## Sales :3650 Unmarried : 3446 White :27816
## Other-service :3295 Wife : 1568
## (Other) :9541
## sex capital_gain capital_loss hr_per_week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## country income
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## ? : 583
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
We have a lot of categorical variables, with too many factors that may be necessary for model building. In this section, we’ll collapse factors for these columns into fewer categories.
Type Employer
Let’s check out the frequency of the different factor levels in this column.
table(adult$type_employer)
##
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
We’ll combine similar looking factor levels into single levels.
levels(adult$type_employer)[4] <- "Unemployed"
levels(adult$type_employer)[9] <- "Unemployed"
levels(adult$type_employer)[3] <- "SL-gov"
levels(adult$type_employer)[8] <- "SL-gov"
levels(adult$type_employer)[6:7] <- "self-emp"
table(adult$type_employer)
##
## ? Federal-gov SL-gov Unemployed Private self-emp
## 1836 960 3391 21 22696 3657
Marital Status
We’ll do the same for this column.
levels(adult$marital)[1] <- "Not-Married"
levels(adult$marital)[6:7] <- "Not-Married"
levels(adult$marital)[2:4] <- "Married"
table(adult$marital)
##
## Not-Married Married Never-married
## 6461 15417 10683
Countries into region
Again, we have too any countries to deal with. It makes sense to convert them into regions. Let’s build vectors for regions.
Asia <- c('China','Hong','India','Iran','Cambodia','Japan', 'Laos' ,
'Philippines' ,'Vietnam' ,'Taiwan', 'Thailand')
North.America <- c('Canada','United-States','Puerto-Rico' )
Europe <- c('England' ,'France', 'Germany' ,'Greece','Holand-Netherlands','Hungary',
'Ireland','Italy','Poland','Portugal','Scotland','Yugoslavia')
Latin.and.South.America <- c('Columbia','Cuba','Dominican-Republic','Ecuador',
'El-Salvador','Guatemala','Haiti','Honduras',
'Mexico','Nicaragua','Outlying-US(Guam-USVI-etc)','Peru',
'Jamaica','Trinadad&Tobago')
Other <- c('South')
Function to collapse into continents
group_country <- function(ctry){
if (ctry %in% Asia){
return('Asia')
}else if (ctry %in% North.America){
return('North.America')
}else if (ctry %in% Europe){
return('Europe')
}else if (ctry %in% Latin.and.South.America){
return('Latin.and.South.America')
}else{
return('Other')
}
}
Applying the function:
adult$country <- sapply(adult$country,group_country)
colnames(adult)[14] <- "region"
table(adult$region)
##
## Asia Europe Latin.and.South.America
## 671 521 1301
## North.America Other
## 29405 663
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2016 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
adult[adult == "?"] <- NA
Refactoring the columns:
adult$type_employer <- factor(adult$type_employer)
adult$region <- factor(adult$region)
adult$marital <- factor(adult$marital)
adult$occupation <- factor(adult$occupation)
Creating a missing map for the data:
missmap(adult, y.at=c(1),y.labels = c(" "),col=c("yellow","black"))
Ommiting missing values:
adult <- na.omit(adult)
Checking the missing map again:
missmap(adult, y.at=c(1),y.labels = c(" "),col=c("yellow","black"))
Histogram of ages, by income:
ggplot(adult,aes(age)) + geom_histogram(aes(fill=income),color='black',binwidth=1) + theme_bw()
Histogram of hours worked per week:
ggplot(adult,aes(hr_per_week)) + geom_histogram() + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Barplot of region filled by income class:
ggplot(adult,aes(region)) + geom_bar(aes(fill=income),color="black") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Now, we’ll build a model to classify people into two groups: above or below 50k in Salary. It makes sense to use logistic regression here as we’re trying to predict a binary outcome.
Train-test split:
library(caTools)
#Splitting the data into 70% training and 30% test
sample <- sample.split(adult$income,SplitRatio = 0.7)
train = subset(adult,sample==TRUE)
test = subset(adult, sample==FALSE)
Creating the model:
model = glm(income ~ ., family = binomial(logit), data = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1628 -0.5221 -0.1992 0.0000 3.6239
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.137e+00 4.842e-01 -12.675 < 2e-16 ***
## age 2.551e-02 2.001e-03 12.750 < 2e-16 ***
## type_employerSL-gov -6.265e-01 1.250e-01 -5.011 5.42e-07 ***
## type_employerUnemployed -1.442e+01 4.060e+02 -0.036 0.971668
## type_employerPrivate -4.194e-01 1.104e-01 -3.800 0.000145 ***
## type_employerself-emp -6.636e-01 1.222e-01 -5.429 5.65e-08 ***
## fnlwgt 7.389e-07 2.061e-07 3.586 0.000336 ***
## education11th -2.427e-01 2.523e-01 -0.962 0.336037
## education12th 2.917e-01 3.272e-01 0.892 0.372656
## education1st-4th -4.290e-01 5.422e-01 -0.791 0.428796
## education5th-6th -1.965e-01 3.528e-01 -0.557 0.577407
## education7th-8th -6.989e-01 2.760e-01 -2.532 0.011333 *
## education9th -3.801e-01 3.154e-01 -1.205 0.228167
## educationAssoc-acdm 1.045e+00 2.051e-01 5.098 3.44e-07 ***
## educationAssoc-voc 1.138e+00 1.978e-01 5.755 8.68e-09 ***
## educationBachelors 1.707e+00 1.815e-01 9.407 < 2e-16 ***
## educationDoctorate 2.607e+00 2.569e-01 10.145 < 2e-16 ***
## educationHS-grad 5.999e-01 1.764e-01 3.401 0.000670 ***
## educationMasters 2.061e+00 1.947e-01 10.588 < 2e-16 ***
## educationPreschool -1.909e+01 1.679e+02 -0.114 0.909458
## educationProf-school 2.666e+00 2.372e-01 11.236 < 2e-16 ***
## educationSome-college 9.413e-01 1.792e-01 5.253 1.50e-07 ***
## education_num NA NA NA NA
## maritalMarried 8.097e-01 1.957e-01 4.137 3.51e-05 ***
## maritalNever-married -5.042e-01 9.950e-02 -5.067 4.03e-07 ***
## occupationArmed-Forces -3.540e-01 2.079e+00 -0.170 0.864781
## occupationCraft-repair -1.641e-02 9.491e-02 -0.173 0.862735
## occupationExec-managerial 6.925e-01 9.176e-02 7.547 4.45e-14 ***
## occupationFarming-fishing -1.095e+00 1.645e-01 -6.657 2.79e-11 ***
## occupationHandlers-cleaners -7.710e-01 1.676e-01 -4.601 4.20e-06 ***
## occupationMachine-op-inspct -3.501e-01 1.214e-01 -2.883 0.003944 **
## occupationOther-service -8.587e-01 1.379e-01 -6.227 4.75e-10 ***
## occupationPriv-house-serv -3.273e+00 2.252e+00 -1.453 0.146099
## occupationProf-specialty 4.520e-01 9.679e-02 4.670 3.02e-06 ***
## occupationProtective-serv 4.272e-01 1.507e-01 2.835 0.004589 **
## occupationSales 1.791e-01 9.767e-02 1.834 0.066645 .
## occupationTech-support 5.565e-01 1.324e-01 4.203 2.63e-05 ***
## occupationTransport-moving -2.126e-01 1.190e-01 -1.787 0.074005 .
## relationshipNot-in-family -8.618e-01 1.928e-01 -4.470 7.83e-06 ***
## relationshipOther-relative -1.246e+00 2.794e-01 -4.461 8.17e-06 ***
## relationshipOwn-child -1.712e+00 2.320e-01 -7.380 1.58e-13 ***
## relationshipUnmarried -1.111e+00 2.163e-01 -5.138 2.78e-07 ***
## relationshipWife 1.198e+00 1.235e-01 9.702 < 2e-16 ***
## raceAsian-Pac-Islander 7.541e-01 3.324e-01 2.269 0.023284 *
## raceBlack 6.431e-01 2.954e-01 2.177 0.029488 *
## raceOther 1.460e-01 4.349e-01 0.336 0.737126
## raceWhite 8.214e-01 2.831e-01 2.902 0.003710 **
## sexMale 7.821e-01 9.389e-02 8.331 < 2e-16 ***
## capital_gain 3.229e-04 1.273e-05 25.368 < 2e-16 ***
## capital_loss 6.524e-04 4.480e-05 14.564 < 2e-16 ***
## hr_per_week 3.023e-02 1.976e-03 15.304 < 2e-16 ***
## regionEurope 1.456e-01 2.541e-01 0.573 0.566734
## regionLatin.and.South.America -5.790e-01 2.574e-01 -2.250 0.024466 *
## regionNorth.America 7.752e-02 2.065e-01 0.375 0.707335
## regionOther -3.965e-01 2.323e-01 -1.707 0.087891 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24138 on 21502 degrees of freedom
## Residual deviance: 14060 on 21449 degrees of freedom
## AIC: 14168
##
## Number of Fisher Scoring iterations: 14
Removing unimportant features iteratively using step :
new.step.model <- step(model)
## Start: AIC=14167.88
## income ~ age + type_employer + fnlwgt + education + education_num +
## marital + occupation + relationship + race + sex + capital_gain +
## capital_loss + hr_per_week + region
##
##
## Step: AIC=14167.88
## income ~ age + type_employer + fnlwgt + education + marital +
## occupation + relationship + race + sex + capital_gain + capital_loss +
## hr_per_week + region
##
## Df Deviance AIC
## <none> 14060 14168
## - race 4 14078 14178
## - fnlwgt 1 14073 14179
## - region 4 14087 14187
## - type_employer 4 14104 14204
## - marital 2 14113 14217
## - sex 1 14131 14237
## - age 1 14224 14330
## - relationship 5 14282 14380
## - capital_loss 1 14278 14384
## - hr_per_week 1 14300 14406
## - occupation 13 14454 14536
## - education 15 14750 14828
## - capital_gain 1 15333 15439
Predictions and checking results, we’ll use the new step model for this:
test$predicted.income = predict(new.step.model, newdata=test, type="response")
table(test$income, test$predicted.income > 0.5)
##
## FALSE TRUE
## <=50K 6406 514
## >50K 887 1408