I downloaded the data from Kaggle and the goal of this project is to predict wether a person earn greater than \(\$50\)k a year.
I will start by cleaning the data.
adults <- read.csv("adult.csv", na.strings = c('','?'))
str(adults)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 90 82 66 54 41 34 38 74 68 41 ...
## $ workclass : chr NA "Private" NA "Private" ...
## $ fnlwgt : int 77053 132870 186061 140359 264663 216864 150601 88638 422013 70037 ...
## $ education : chr "HS-grad" "HS-grad" "Some-college" "7th-8th" ...
## $ education.num : int 9 9 10 4 10 9 6 16 9 10 ...
## $ marital.status: chr "Widowed" "Widowed" "Widowed" "Divorced" ...
## $ occupation : chr NA "Exec-managerial" NA "Machine-op-inspct" ...
## $ relationship : chr "Not-in-family" "Not-in-family" "Unmarried" "Unmarried" ...
## $ race : chr "White" "White" "Black" "White" ...
## $ sex : chr "Female" "Female" "Female" "Female" ...
## $ capital.gain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ capital.loss : int 4356 4356 4356 3900 3900 3770 3770 3683 3683 3004 ...
## $ hours.per.week: int 40 18 40 40 40 45 40 20 40 60 ...
## $ native.country: chr "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
library(skimr)
skim(adults)
| Name | adults |
| Number of rows | 32561 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| workclass | 1836 | 0.94 | 7 | 16 | 0 | 8 | 0 |
| education | 0 | 1.00 | 3 | 12 | 0 | 16 | 0 |
| marital.status | 0 | 1.00 | 7 | 21 | 0 | 7 | 0 |
| occupation | 1843 | 0.94 | 5 | 17 | 0 | 14 | 0 |
| relationship | 0 | 1.00 | 4 | 14 | 0 | 6 | 0 |
| race | 0 | 1.00 | 5 | 18 | 0 | 5 | 0 |
| sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| native.country | 583 | 0.98 | 4 | 26 | 0 | 41 | 0 |
| income | 0 | 1.00 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 38.58 | 13.64 | 17 | 28 | 37 | 48 | 90 | ▇▇▅▂▁ |
| fnlwgt | 0 | 1 | 189778.37 | 105549.98 | 12285 | 117827 | 178356 | 237051 | 1484705 | ▇▁▁▁▁ |
| education.num | 0 | 1 | 10.08 | 2.57 | 1 | 9 | 10 | 12 | 16 | ▁▁▇▃▁ |
| capital.gain | 0 | 1 | 1077.65 | 7385.29 | 0 | 0 | 0 | 0 | 99999 | ▇▁▁▁▁ |
| capital.loss | 0 | 1 | 87.30 | 402.96 | 0 | 0 | 0 | 0 | 4356 | ▇▁▁▁▁ |
| hours.per.week | 0 | 1 | 40.44 | 12.35 | 1 | 40 | 40 | 45 | 99 | ▁▇▃▁▁ |
The data has 32561 observations and 15 variables with some missing values. Now, I will parse character type variables to factor type for convenience on the analysis part.
adults$workclass <- as.factor(adults$workclass)
adults$education <- as.factor(adults$education)
adults$marital.status <- as.factor(adults$marital.status)
adults$relationship <- as.factor(adults$relationship)
adults$race <- as.factor(adults$race)
adults$sex <- as.factor(adults$sex)
adults$native.country <- as.factor(adults$native.country)
adults$occupation <- as.factor(adults$occupation)
adults$income <- as.factor(adults$income)
Checking the missing and unique values for every variables.
sapply(adults, function(x) sum(is.na(x)))
## age workclass fnlwgt education education.num
## 0 1836 0 0 0
## marital.status occupation relationship race sex
## 0 1843 0 0 0
## capital.gain capital.loss hours.per.week native.country income
## 0 0 0 583 0
sapply(adults, function(x) length(unique(x)))
## age workclass fnlwgt education education.num
## 73 9 21648 16 16
## marital.status occupation relationship race sex
## 7 15 6 5 2
## capital.gain capital.loss hours.per.week native.country income
## 119 92 94 42 2
Visualizing missing values.
library(Amelia)
missmap(adults, main = "Missing values vs observed")
table (complete.cases(adults))
##
## FALSE TRUE
## 2399 30162
7% (2399/32561) of the total data has missing value. Missing values are from the variables workclass, occupation, and native.country. Note that these variables are categorical so imputing them won’t be a good idea. So I’ll just remove observations with missing values.
adults <- adults[complete.cases(adults), ]
library(ggplot2)
library(gridExtra)
n1 <- ggplot(aes(x = income, y = age), data = adults) +
geom_boxplot() + labs(title = "Income vs Age")
n2 <- ggplot(aes(x = income, y = fnlwgt), data = adults) +
geom_boxplot() + labs(title = "Income vs fnlwgt")
n3 <- ggplot(aes(x = income, y = education.num), data = adults) +
geom_boxplot() + labs(title = "Income vs Education")
n4 <- ggplot(aes(x = income, y = capital.gain), data = adults) +
geom_boxplot() + labs(title = "Income vs CapGain")
n5 <- ggplot(aes(x = income, y = capital.loss), data = adults) +
geom_boxplot() + labs(title = "Income vs CapLoss")
n6 <- ggplot(aes(x = income, y = hours.per.week), data = adults) +
geom_boxplot() + labs(title = "Income vs Hrs/week")
grid.arrange(n1, n2, n3, n4, n5, n6, ncol = 3)
All numerical variables show significant variation with income except
capital loss, capital gain, and final weight. So I’ll remove these variables.
adults$capital.gain <- NULL
adults$capital.loss <- NULL
adults$fnlwgt <- NULL
library(dplyr)
by_workclass <- adults %>% group_by(workclass, income) %>% summarise(n = n())
by_education <- adults %>% group_by(education, income) %>% summarise(n = n())
by_marital.status <- adults %>% group_by(marital.status, income) %>% summarise(n = n())
by_occupation <- adults %>% group_by(occupation, income) %>% summarise(n = n())
by_relationship <- adults %>% group_by(relationship, income) %>% summarise(n = n())
by_race <- adults %>% group_by(race, income) %>% summarise(n = n())
by_sex <- adults %>% group_by(sex, income) %>% summarise(n = n())
by_native.country <- adults %>% group_by(native.country, income) %>% summarise(n = n())
c1 <- ggplot(aes(x = workclass, y = n), data = by_workclass) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=50, size = 5, vjust=0.5))
c2 <- ggplot(aes(x = education, y = n), data = by_education) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=50, size = 5, vjust=0.5))
c3 <- ggplot(aes(x = marital.status , y = n), data = by_marital.status ) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=50, size = 5, vjust=0.5))
c4 <- ggplot(aes(x = occupation, y = n), data = by_occupation) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=50, size = 5, vjust=0.5))
c5 <- ggplot(aes(x = relationship, y = n), data = by_relationship) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=50, size = 5, vjust=0.5))
c6 <- ggplot(aes(x = race, y = n), data = by_race) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=50, size = 5, vjust=0.5))
c7 <- ggplot(aes(x = sex, y = n), data = by_sex) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=50, size = 5, vjust=0.5))
c8 <- ggplot(aes(x = native.country, y = n), data = by_native.country) +
geom_bar(aes(fill = income), stat = "identity", position = "dodge") +
theme(axis.text.x=element_text(angle=90, size = 3, vjust=0.5))
grid.arrange(c1, c2, c3, c4, c5, c6, c7, c8, ncol = 2)
Only the variable native.country doesn’t show significant variation. All variables so far vary significantly. I’ll exclude the variable native.country.
adults$native.country <- NULL
Splitting the data.
train <- adults[1:24000,]
test <- adults[24001:30162,]
Fitting logistic regression.
glm.fit <- glm(income ~ ., data = train, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = income ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7337 -0.5893 -0.2207 0.4322 3.2731
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.663460 0.445005 -14.974 < 2e-16 ***
## age 0.028169 0.001777 15.851 < 2e-16 ***
## workclassLocal-gov -0.642040 0.119262 -5.383 7.31e-08 ***
## workclassPrivate -0.440045 0.099142 -4.439 9.06e-06 ***
## workclassSelf-emp-inc -0.207660 0.129859 -1.599 0.10979
## workclassSelf-emp-not-inc -0.915222 0.115654 -7.913 2.50e-15 ***
## workclassState-gov -0.837047 0.133330 -6.278 3.43e-10 ***
## workclassWithout-pay -13.126008 236.324426 -0.056 0.95571
## education11th 0.073830 0.218639 0.338 0.73561
## education12th 0.327111 0.287523 1.138 0.25525
## education1st-4th -0.618988 0.502610 -1.232 0.21812
## education5th-6th -0.686193 0.386149 -1.777 0.07557 .
## education7th-8th -0.692057 0.255299 -2.711 0.00671 **
## education9th -0.651847 0.298656 -2.183 0.02907 *
## educationAssoc-acdm 1.243420 0.184045 6.756 1.42e-11 ***
## educationAssoc-voc 1.206325 0.176179 6.847 7.53e-12 ***
## educationBachelors 1.903236 0.163269 11.657 < 2e-16 ***
## educationDoctorate 2.968018 0.228355 12.997 < 2e-16 ***
## educationHS-grad 0.700652 0.158620 4.417 1.00e-05 ***
## educationMasters 2.309539 0.175222 13.181 < 2e-16 ***
## educationPreschool -11.577233 135.367738 -0.086 0.93184
## educationProf-school 2.982229 0.211883 14.075 < 2e-16 ***
## educationSome-college 1.017408 0.161178 6.312 2.75e-10 ***
## education.num NA NA NA NA
## marital.statusMarried-AF-spouse 2.760292 0.672255 4.106 4.03e-05 ***
## marital.statusMarried-civ-spouse 1.920928 0.292816 6.560 5.37e-11 ***
## marital.statusMarried-spouse-absent -0.156763 0.243575 -0.644 0.51984
## marital.statusNever-married -0.411046 0.089457 -4.595 4.33e-06 ***
## marital.statusSeparated 0.023924 0.162875 0.147 0.88322
## marital.statusWidowed 0.160395 0.160641 0.998 0.31805
## occupationArmed-Forces -0.868829 1.283486 -0.677 0.49845
## occupationCraft-repair 0.137775 0.084557 1.629 0.10324
## occupationExec-managerial 0.893693 0.081381 10.982 < 2e-16 ***
## occupationFarming-fishing -0.872898 0.144174 -6.054 1.41e-09 ***
## occupationHandlers-cleaners -0.652086 0.153927 -4.236 2.27e-05 ***
## occupationMachine-op-inspct -0.291588 0.109080 -2.673 0.00751 **
## occupationOther-service -0.858382 0.123617 -6.944 3.81e-12 ***
## occupationPriv-house-serv -2.427559 1.132261 -2.144 0.03203 *
## occupationProf-specialty 0.606014 0.085938 7.052 1.77e-12 ***
## occupationProtective-serv 0.552056 0.133155 4.146 3.38e-05 ***
## occupationSales 0.352614 0.086596 4.072 4.66e-05 ***
## occupationTech-support 0.664931 0.118672 5.603 2.11e-08 ***
## occupationTransport-moving -0.028411 0.104086 -0.273 0.78489
## relationshipNot-in-family 0.384219 0.289971 1.325 0.18516
## relationshipOther-relative -0.497337 0.263173 -1.890 0.05879 .
## relationshipOwn-child -0.768478 0.290424 -2.646 0.00814 **
## relationshipUnmarried 0.152143 0.304899 0.499 0.61778
## relationshipWife 1.351812 0.108484 12.461 < 2e-16 ***
## raceAsian-Pac-Islander 0.139896 0.260619 0.537 0.59142
## raceBlack 0.319200 0.246140 1.297 0.19469
## raceOther -0.645877 0.400563 -1.612 0.10687
## raceWhite 0.360961 0.235036 1.536 0.12459
## sexMale 0.858415 0.080612 10.649 < 2e-16 ***
## hours.per.week 0.029153 0.001801 16.185 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27607 on 23999 degrees of freedom
## Residual deviance: 17501 on 23947 degrees of freedom
## AIC: 17607
##
## Number of Fisher Scoring iterations: 13
The smallest p-value here is associated with the variables age, hours.per.week, occupation, and education. This suggest a strong association with the probability of wage that is greater $50K. Race variable turns out not to be significant so we can eliminate this from our model.
anova(glm.fit, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: income
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 23999 27607
## age 1 1390.0 23998 26217 < 2.2e-16 ***
## workclass 6 357.4 23992 25859 < 2.2e-16 ***
## education 15 3009.1 23977 22850 < 2.2e-16 ***
## education.num 0 0.0 23977 22850
## marital.status 6 4121.0 23971 18729 < 2.2e-16 ***
## occupation 13 634.8 23958 18094 < 2.2e-16 ***
## relationship 5 167.8 23953 17926 < 2.2e-16 ***
## race 4 19.9 23949 17907 0.0005157 ***
## sex 1 136.3 23948 17770 < 2.2e-16 ***
## hours.per.week 1 269.2 23947 17501 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above table we can see the drop in deviance when adding each variable one at a time. Adding age, workclass, education, marital status, occupation, relationship, race, sex, capital gain, capital loss and hours per week significantly reduces the residual deviance. education.num seem to have no effect.
For prediction.
attach(train)
glm.probs <- predict(glm.fit, train, type="response")
glm.probs[1:10]
## 2 4 5 6 7 8
## 0.145668613 0.007488877 0.028122890 0.011222816 0.030132973 0.141080678
## 9 11 12 13
## 0.177990524 0.316438464 0.397569686 0.042891690
contrasts(income)
## >50K
## <=50K 0
## >50K 1
I have printed only the first ten probabilities. We know that these values correspond to the probability that a person’s income is greater than $50k, rather than less than or equal to $50k, because the contrasts() function indicates that R has created a dummy variable with a 1 for greater than $50k.
glm.pred <- rep("<=50K", 24000)
glm.pred[glm.probs > .5] = ">50k"
The first command creates a vector of 24000, \(\leq 50\)K elements. The second line transforms to \(> 50\)K all of the elements for which the predicted probability of a person earning greater than $50k exceeds 0.5.
Confusion matrix.
attach(train)
table(glm.pred, income)
## income
## glm.pred <=50K >50K
## <=50K 16219 2649
## >50k 1493 3639
mean(glm.pred == income)
## [1] 0.6757917
mean(glm.pred != income)
## [1] 0.3242083
The model correctly predicted 14193 persons whose in come is <=50K and 2050 perons whose income is >50K, for a total of 16243 correct predictions. In this case, logistic regression correctly predicted 67 % of the time on the training data. The test error rate is 32%.
glm.probs <- predict(glm.fit, test, type="response")
glm.pred <- rep("<=50K", 6162)
glm.pred[glm.probs > .5] = ">50k"
attach(test)
table(glm.pred, income)
## income
## glm.pred <=50K >50K
## <=50K 4525 526
## >50k 417 694
mean(glm.pred == income)
## [1] 0.7343395
mean(glm.pred != income)
## [1] 0.2656605
The model is doing good in our test set with accuracy 73% and test error rate 26%.
At last, plot the ROC curve and calculate the AUC (area under the curve). The closer AUC for a model comes to 1, the better predictive ability.
library(ROCR)
p <- predict(glm.fit, newdata=test, type="response")
pr <- prediction(p, test$income)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8868692
Here, I only used simple validation set approach in splitting the data. One powerful method to use would be k-fold cross-validation. I can also improve the model by removing those variables which were not significant or perform variable selection methods.