This is a smaller machine learning project exploring simple use of logisitc regression. The dataset we will look at is the “adult income”" dataset.
The task is to predict whether an individual has an income higher than 50k.
Loading libraries.
library(ggplot2)
library(dplyr)
The structure of the dataset “adult income”:
adult <- read.csv("adult.csv")
head(adult)
## age workclass fnlwgt education educational.num marital.status
## 1 25 Private 226802 11th 7 Never-married
## 2 38 Private 89814 HS-grad 9 Married-civ-spouse
## 3 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 4 44 Private 160323 Some-college 10 Married-civ-spouse
## 5 18 ? 103497 Some-college 10 Never-married
## 6 34 Private 198693 10th 6 Never-married
## occupation relationship race gender capital.gain capital.loss
## 1 Machine-op-inspct Own-child Black Male 0 0
## 2 Farming-fishing Husband White Male 0 0
## 3 Protective-serv Husband White Male 0 0
## 4 Machine-op-inspct Husband Black Male 7688 0
## 5 ? Own-child White Female 0 0
## 6 Other-service Not-in-family White Male 0 0
## hours.per.week native.country income
## 1 40 United-States <=50K
## 2 50 United-States <=50K
## 3 40 United-States >50K
## 4 40 United-States >50K
## 5 30 United-States <=50K
## 6 30 United-States <=50K
Basic data exploration.
str(adult)
## 'data.frame': 48842 obs. of 15 variables:
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : Factor w/ 9 levels "?","Federal-gov",..: 5 5 3 5 1 5 1 7 5 5 ...
## $ fnlwgt : int 226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 2 12 8 16 16 1 12 15 16 6 ...
## $ educational.num: int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 3 3 5 5 5 3 5 3 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 8 6 12 8 1 9 1 11 9 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 4 1 1 1 4 2 5 1 5 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 3 5 5 3 5 5 3 5 5 5 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 2 2 1 2 ...
## $ capital.gain : int 0 0 0 7688 0 0 0 3103 0 0 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
## $ native.country : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 40 40 40 40 40 40 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 2 2 1 1 1 2 1 1 ...
summary(adult)
## age workclass fnlwgt
## Min. :17.00 Private :33906 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 3862 1st Qu.: 117551
## Median :37.00 Local-gov : 3136 Median : 178145
## Mean :38.64 ? : 2799 Mean : 189664
## 3rd Qu.:48.00 State-gov : 1981 3rd Qu.: 237642
## Max. :90.00 Self-emp-inc : 1695 Max. :1490400
## (Other) : 1463
## education educational.num marital.status
## HS-grad :15784 Min. : 1.00 Divorced : 6633
## Some-college:10878 1st Qu.: 9.00 Married-AF-spouse : 37
## Bachelors : 8025 Median :10.00 Married-civ-spouse :22379
## Masters : 2657 Mean :10.08 Married-spouse-absent: 628
## Assoc-voc : 2061 3rd Qu.:12.00 Never-married :16117
## 11th : 1812 Max. :16.00 Separated : 1530
## (Other) : 7625 Widowed : 1518
## occupation relationship race
## Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470
## Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519
## Exec-managerial: 6086 Other-relative: 1506 Black : 4685
## Adm-clerical : 5611 Own-child : 7581 Other : 406
## Sales : 5504 Unmarried : 5125 White :41762
## Other-service : 4923 Wife : 2331
## (Other) :14434
## gender capital.gain capital.loss hours.per.week
## Female:16192 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :32650 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1079 Mean : 87.5 Mean :40.42
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native.country income
## United-States:43832 <=50K:37155
## Mexico : 951 >50K :11687
## ? : 857
## Philippines : 295
## Germany : 206
## Puerto-Rico : 184
## (Other) : 2517
We have a lot of categorical variables, with too many factors that may be necessary for model building. In this section, we’ll collapse factors for these columns into fewer categories.
table(adult$workclass)
##
## ? Federal-gov Local-gov Never-worked
## 2799 1432 3136 10
## Private Self-emp-inc Self-emp-not-inc State-gov
## 33906 1695 3862 1981
## Without-pay
## 21
The type workclass has alot of similar factors, we put them in fewer categories.
levels(adult$workclass)[4] <- "Unemployed"
levels(adult$workclass)[9] <- "Unemployed"
levels(adult$workclass)[3] <- "SL-gov"
levels(adult$workclass)[8] <- "SL-gov"
levels(adult$workclass)[6:7] <- "self-emp"
table(adult$workclass)
##
## ? Federal-gov SL-gov Unemployed Private self-emp
## 2799 1432 5117 31 33906 5557
Again, here we put these in fewer categories.
table(adult$marital)
##
## Divorced Married-AF-spouse Married-civ-spouse
## 6633 37 22379
## Married-spouse-absent Never-married Separated
## 628 16117 1530
## Widowed
## 1518
The type workclass has alot of similar factors, we put them in fewer categories.
levels(adult$marital)[1] <- "Not-Married"
levels(adult$marital)[6:7] <- "Not-Married"
levels(adult$marital)[2:4] <- "Married"
table(adult$marital)
##
## Not-Married Married Never-married
## 9681 23044 16117
The type for native country has very many factors.
levels(adult$native.country)
## [1] "?" "Cambodia"
## [3] "Canada" "China"
## [5] "Columbia" "Cuba"
## [7] "Dominican-Republic" "Ecuador"
## [9] "El-Salvador" "England"
## [11] "France" "Germany"
## [13] "Greece" "Guatemala"
## [15] "Haiti" "Holand-Netherlands"
## [17] "Honduras" "Hong"
## [19] "Hungary" "India"
## [21] "Iran" "Ireland"
## [23] "Italy" "Jamaica"
## [25] "Japan" "Laos"
## [27] "Mexico" "Nicaragua"
## [29] "Outlying-US(Guam-USVI-etc)" "Peru"
## [31] "Philippines" "Poland"
## [33] "Portugal" "Puerto-Rico"
## [35] "Scotland" "South"
## [37] "Taiwan" "Thailand"
## [39] "Trinadad&Tobago" "United-States"
## [41] "Vietnam" "Yugoslavia"
So we will put them in the following categories.
Asia <- c('China','Hong','India','Iran','Cambodia','Japan', 'Laos' ,
'Philippines' ,'Vietnam' ,'Taiwan', 'Thailand')
North.America <- c('Canada','United-States','Puerto-Rico' )
Europe <- c('England' ,'France', 'Germany' ,'Greece','Holand-Netherlands','Hungary',
'Ireland','Italy','Poland','Portugal','Scotland','Yugoslavia')
Latin.and.South.America <- c('Columbia','Cuba','Dominican-Republic','Ecuador',
'El-Salvador','Guatemala','Haiti','Honduras',
'Mexico','Nicaragua','Outlying-US(Guam-USVI-etc)','Peru',
'Jamaica','Trinadad&Tobago')
Other <- c('South')
With the following function we do this.
group_country <- function(ctry){
if (ctry %in% Asia){
return('Asia')
}else if (ctry %in% North.America){
return('North.America')
}else if (ctry %in% Europe){
return('Europe')
}else if (ctry %in% Latin.and.South.America){
return('Latin.and.South.America')
}else{
return('Other')
}
}
adult$native.country <- sapply(adult$native.country,group_country)
colnames(adult)[14] <- "region"
This creates the following categories.
table(adult$region)
##
## Asia Europe Latin.and.South.America
## 981 780 1911
## North.America Other
## 44198 972
library(Amelia)
Making the columns factors again.
adult$workclass <- factor(adult$workclass)
adult$region <- factor(adult$region)
adult$marital <- factor(adult$marital)
adult$occupation <- factor(adult$occupation)
Ommitting missing values.
adult <- na.omit(adult)
Start by looking at a histogram of ages, sorted by income.
ggplot(adult,aes(age)) + geom_histogram(aes(fill=income),color='black',binwidth=1) + theme_bw()
Histogram of hours worked per week:
ggplot(adult,aes(hours.per.week)) + geom_histogram(bins = 30) + theme_bw()
Barplot of region filled by income class:
ggplot(adult,aes(region)) + geom_bar(aes(fill=income),color="black") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Now we build the model we will use to classify the individuals. As we are trying to predict a binary outcome it makes sense to use logistic regression. ### Train-Test split Splitting the dataset in 70% training and 30% testing.
library(caTools)
set.seed(3)
sample <- sample.split(adult$income,SplitRatio=0.7)
train <- subset(adult,sample==T)
test <- subset(adult,sample==F)
The logistic model is created with the following code.
model <- glm(income ~ ., family = binomial(logit), data = train)
summary(model)
##
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.2318 -0.5238 -0.1960 -0.0131 3.7886
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.021e+00 4.529e-01 -17.711 < 2e-16
## age 2.626e-02 1.649e-03 15.928 < 2e-16
## workclassSL-gov -7.488e-01 1.018e-01 -7.354 1.92e-13
## workclassUnemployed -1.856e+00 1.080e+00 -1.718 0.085780
## workclassPrivate -5.197e-01 9.081e-02 -5.723 1.05e-08
## workclassself-emp -8.507e-01 1.011e-01 -8.414 < 2e-16
## fnlwgt 7.463e-07 1.695e-07 4.404 1.06e-05
## education11th 9.291e-02 2.172e-01 0.428 0.668780
## education12th 6.625e-01 2.662e-01 2.489 0.012810
## education1st-4th -6.187e-01 5.291e-01 -1.169 0.242271
## education5th-6th 8.413e-02 2.944e-01 0.286 0.775050
## education7th-8th -3.454e-01 2.372e-01 -1.456 0.145386
## education9th -1.254e-01 2.599e-01 -0.482 0.629476
## educationAssoc-acdm 1.493e+00 1.804e-01 8.277 < 2e-16
## educationAssoc-voc 1.464e+00 1.744e-01 8.394 < 2e-16
## educationBachelors 2.062e+00 1.629e-01 12.665 < 2e-16
## educationDoctorate 2.877e+00 2.158e-01 13.332 < 2e-16
## educationHS-grad 9.376e-01 1.590e-01 5.895 3.75e-09
## educationMasters 2.428e+00 1.720e-01 14.113 < 2e-16
## educationPreschool -1.099e+01 1.020e+02 -0.108 0.914198
## educationProf-school 2.820e+00 2.054e-01 13.726 < 2e-16
## educationSome-college 1.316e+00 1.611e-01 8.172 3.04e-16
## educational.num NA NA NA NA
## marital.statusMarried-AF-spouse 3.020e+00 5.297e-01 5.702 1.18e-08
## marital.statusMarried-civ-spouse 2.356e+00 2.633e-01 8.949 < 2e-16
## marital.statusMarried-spouse-absent 8.824e-02 2.200e-01 0.401 0.688354
## marital.statusNever-married -3.992e-01 8.593e-02 -4.646 3.39e-06
## marital.statusSeparated -7.093e-02 1.597e-01 -0.444 0.656871
## marital.statusWidowed 1.188e-01 1.510e-01 0.787 0.431195
## occupationArmed-Forces 5.629e-01 8.625e-01 0.653 0.513968
## occupationCraft-repair 9.386e-02 7.706e-02 1.218 0.223236
## occupationExec-managerial 7.981e-01 7.433e-02 10.737 < 2e-16
## occupationFarming-fishing -9.485e-01 1.353e-01 -7.010 2.38e-12
## occupationHandlers-cleaners -6.078e-01 1.354e-01 -4.491 7.10e-06
## occupationMachine-op-inspct -2.690e-01 9.796e-02 -2.746 0.006034
## occupationOther-service -8.864e-01 1.154e-01 -7.682 1.57e-14
## occupationPriv-house-serv -2.230e+00 1.050e+00 -2.123 0.033736
## occupationProf-specialty 5.806e-01 7.811e-02 7.433 1.06e-13
## occupationProtective-serv 5.445e-01 1.215e-01 4.482 7.39e-06
## occupationSales 2.625e-01 7.967e-02 3.295 0.000983
## occupationTech-support 6.877e-01 1.057e-01 6.508 7.61e-11
## occupationTransport-moving -1.031e-01 9.708e-02 -1.062 0.288255
## relationshipNot-in-family 6.214e-01 2.605e-01 2.385 0.017057
## relationshipOther-relative -4.388e-01 2.324e-01 -1.888 0.059061
## relationshipOwn-child -4.132e-01 2.524e-01 -1.637 0.101559
## relationshipUnmarried 4.349e-01 2.765e-01 1.573 0.115645
## relationshipWife 1.162e+00 1.008e-01 11.527 < 2e-16
## raceAsian-Pac-Islander 7.685e-01 2.633e-01 2.919 0.003515
## raceBlack 3.862e-01 2.341e-01 1.649 0.099051
## raceOther 2.304e-01 3.328e-01 0.692 0.488679
## raceWhite 5.557e-01 2.237e-01 2.484 0.012994
## genderMale 7.754e-01 7.662e-02 10.120 < 2e-16
## capital.gain 3.159e-04 1.032e-05 30.603 < 2e-16
## capital.loss 6.197e-04 3.666e-05 16.902 < 2e-16
## hours.per.week 3.006e-02 1.623e-03 18.521 < 2e-16
## regionEurope 5.885e-01 2.093e-01 2.812 0.004930
## regionLatin.and.South.America -2.783e-01 2.099e-01 -1.326 0.184956
## regionNorth.America 3.185e-01 1.688e-01 1.887 0.059196
## regionOther 4.324e-02 1.912e-01 0.226 0.821045
## maritalMarried NA NA NA NA
## maritalNever-married NA NA NA NA
##
## (Intercept) ***
## age ***
## workclassSL-gov ***
## workclassUnemployed .
## workclassPrivate ***
## workclassself-emp ***
## fnlwgt ***
## education11th
## education12th *
## education1st-4th
## education5th-6th
## education7th-8th
## education9th
## educationAssoc-acdm ***
## educationAssoc-voc ***
## educationBachelors ***
## educationDoctorate ***
## educationHS-grad ***
## educationMasters ***
## educationPreschool
## educationProf-school ***
## educationSome-college ***
## educational.num
## marital.statusMarried-AF-spouse ***
## marital.statusMarried-civ-spouse ***
## marital.statusMarried-spouse-absent
## marital.statusNever-married ***
## marital.statusSeparated
## marital.statusWidowed
## occupationArmed-Forces
## occupationCraft-repair
## occupationExec-managerial ***
## occupationFarming-fishing ***
## occupationHandlers-cleaners ***
## occupationMachine-op-inspct **
## occupationOther-service ***
## occupationPriv-house-serv *
## occupationProf-specialty ***
## occupationProtective-serv ***
## occupationSales ***
## occupationTech-support ***
## occupationTransport-moving
## relationshipNot-in-family *
## relationshipOther-relative .
## relationshipOwn-child
## relationshipUnmarried
## relationshipWife ***
## raceAsian-Pac-Islander **
## raceBlack .
## raceOther
## raceWhite *
## genderMale ***
## capital.gain ***
## capital.loss ***
## hours.per.week ***
## regionEurope **
## regionLatin.and.South.America
## regionNorth.America .
## regionOther
## maritalMarried
## maritalNever-married
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36106 on 32222 degrees of freedom
## Residual deviance: 21065 on 32165 degrees of freedom
## AIC: 21181
##
## Number of Fisher Scoring iterations: 13
Removing features iteratively using step:
new.model <- step(model)
Predictions and checking results.
test$predicted.income = predict(new.model, newdata=test, type="response")
table(test$income, test$predicted.income > 0.5)
##
## FALSE TRUE
## <=50K 9661 722
## >50K 1340 2087