Title: ‘General Social Survey 2015: Relationship between education and earnings’
Synopsis: I explore the relationship between education and earnings in the US, from the General Social Survey 2015 using stepwise regression and cross-validation.
Skills: Logistic regression, data visualization, model assessment, stepwise regression, 10-fold cross-validation, classification errors, for-loops
Cleaning the environment and setting the working directory
rm(list = ls())
setwd("/Users/Roberta/Desktop/Data Analysis & Exploration/HW10")
x <- read.csv("GSS_HW10.csv", as.is = TRUE)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(pander)
## Warning: package 'pander' was built under R version 3.4.2
GSS dataset: > https://www.dropbox.com/s/691s1aun0ujsiyd/GSS_HW10.csv?raw=1
Variables included:
wrkstat: employment statuspolviews: political viewshappy: general rating of happinessborn in the United States?racesexdegree: highest degree attainedagemarital statushrs1: hours worked in the week prior to the surveyearn25k: is the respondent’s income level greater than $25,000?We restrict our attention to the respondents who worked either part-time or full-time in 2015. One question in the GSS asks respondents for their income level by checking off one of several income categories. The highest income bucket provided is >= “$25,000”. (In the 1970s, when the GSS was first conducted, this was a large number, and few respondents checked off highest income category.)
We will use binomial logistic regression to predict whether someone earns more than $25,000 a year, as seen in the earn25k variable.
I consider the relationship between degree and earn25k through a variety of lens including
I Convert earn25k, the response variable, into a factor and degree and polviews into factors with suitable orderings of the levels in order to more easily interpret interpreting regression coefficients).
#EARN25K
x$earn25k <- as.factor(x$earn25k)
levels(x$earn25k)
## [1] "No" "Yes"
#DEGREE
x %>% select(degree) %>% distinct #The different distinct entries under
## degree
## 1 Bachelor
## 2 High school
## 3 Graduate
## 4 Lt high school
## 5 Junior college
#Lt high school, High school, Junior College, Bachelor, Graduate
#Using mutate, currently returns a string of NAs
x<- x %>% mutate(degree = factor(degree, levels = c("Lt high school", "High school", "Junior college", "Bachelor", "Graduate"))) #WHY?
## Warning: package 'bindrcpp' was built under R version 3.4.2
## Warning: `as_dictionary()` is soft-deprecated as of rlang 0.3.0.
## Please use `as_data_pronoun()` instead
## This warning is displayed once per session.
## Warning: `new_overscope()` is soft-deprecated as of rlang 0.2.0.
## Please use `new_data_mask()` instead
## This warning is displayed once per session.
## Warning: The `parent` argument of `new_data_mask()` is deprecated.
## The parent of the data mask is determined from either:
##
## * The `env` argument of `eval_tidy()`
## * Quosure environments when applicable
## This warning is displayed once per session.
## Warning: `overscope_clean()` is soft-deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
x$degree[1:5]
## [1] Bachelor High school High school Graduate High school
## Levels: Lt high school High school Junior college Bachelor Graduate
levels(x$degree)
## [1] "Lt high school" "High school" "Junior college" "Bachelor"
## [5] "Graduate"
#POLVIEWS
x <- x %>% mutate(polviews = factor(polviews, levels = c("Conservative", "Liberal", "Moderate")))
x$polviews[1:5]
## [1] Moderate Liberal Moderate Liberal Liberal
## Levels: Conservative Liberal Moderate
degree against earn25k.ggplot(x) + geom_bar(aes(x=degree,fill = earn25k), position="fill")
As can be seen from the above graph, as educational attainment increases, individuals are increasingly likely to earn 25k or over.
degree against earn25k.table1 <- table(education=x$degree, earn25k=x$earn25k)
table1
## earn25k
## education No Yes
## Lt high school 53 43
## High school 298 362
## Junior college 39 80
## Bachelor 100 230
## Graduate 55 143
prop.table(table1, 1)*100
## earn25k
## education No Yes
## Lt high school 55.20833 44.79167
## High school 45.15152 54.84848
## Junior college 32.77311 67.22689
## Bachelor 30.30303 69.69697
## Graduate 27.77778 72.22222
prop.table(table1)*100
## earn25k
## education No Yes
## Lt high school 3.777619 3.064861
## High school 21.240200 25.801853
## Junior college 2.779758 5.702067
## Bachelor 7.127584 16.393443
## Graduate 3.920171 10.192445
#Expected number of people with Bachelor degree and do earn25k
sum(x$degree == "Bachelor") * sum(x$earn25k == "Yes") / nrow(x)
## [1] 201.8104
Based on the tables above we can see that individuals with Bachelor’s degree’s have nearly a 75% chance of earning over 25k and above. There is a general pattern of more people earning 25 and above as education increases, indicating a strong postive correlation between education and earnings. If the two variables were independant I would expect that number of respondents with degree = "Bachelor" and earn25k = "No" to be 201.8104.
Using counts in this table (and showing your work), compute the odds ratio of Bachelor-level respondents earning $25k+ vs. less than high school-level respondents earning $25k.
table1
## earn25k
## education No Yes
## Lt high school 53 43
## High school 298 362
## Junior college 39 80
## Bachelor 100 230
## Graduate 55 143
oddsb <- 230/100
oddslth <- 43/53
oddsb/oddslth
## [1] 2.834884
The odds of someone with a Bachelor’s degree earning 25k+ is 69.7% while that of someone with a Less than high school degree is 44.8%. The odds ratio of someone with a Bachelor’s earning 25k and over vs less than high school education doing the same is 2.834884.
“An odds ratio of exactly 1 means that exposure to property A does not affect the odds of property B. An odds ratio of more than 1 means that there is a higher odds of property B happening with exposure to property A. An odds ratio is less than 1 is associated with lower odds.” (http://www.statisticshowto.com/odds-ratio/)
degree, display the model summary, and verify your calculation of the odds ratio in (c).m0 <- glm(earn25k ~ degree, data=x, family=binomial)
summary(m0)
##
## Call:
## glm(formula = earn25k ~ degree, family = binomial, data = x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6006 -1.2611 0.8497 1.0960 1.2674
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2091 0.2052 -1.019 0.30832
## degreeHigh school 0.4036 0.2196 1.838 0.06610 .
## degreeJunior college 0.9276 0.2833 3.274 0.00106 **
## degreeBachelor 1.0420 0.2376 4.385 1.16e-05 ***
## degreeGraduate 1.1646 0.2594 4.489 7.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1874.6 on 1402 degrees of freedom
## Residual deviance: 1830.2 on 1398 degrees of freedom
## AIC: 1840.2
##
## Number of Fisher Scoring iterations: 4
According to the model the odds ratio for someone who has a Bachelor’s degree earning over 25k is \(e^1^.^0^4 = 2.83\). The baseline odds of earning over 25k is \(e^-0.21 = 0.06\), this is also the odds ratio of someone with a less than high school education earning over 25k.
The model coefficients agree with my intuition about the relationship between educational attainment and earning potential, more specifically that educational attainment is positively correlated with earning potential. Each cofficient increases the likelihood that the individual will earn over 25. Additionally, most of the coefficients have p-values are statistically significant. In the case of High School degree, which is not statistically significant at the 0.05 level, this is likely because a high school degree does not significantly affect your earning potential as compared to the baseline, or a less than high school degree.
Here is I will examine the relationship between earnings, degree and age. I then interpret the model and assess the fit of the model given the odds ratio for age.
m1 <- glm(earn25k ~ degree + age, data=x, family=binomial)
summary(m1)
##
## Call:
## glm(formula = earn25k ~ degree + age, family = binomial, data = x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7985 -1.2277 0.8106 1.0091 1.3940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.814966 0.280838 -2.902 0.00371 **
## degreeHigh school 0.440827 0.220919 1.995 0.04600 *
## degreeJunior college 0.949214 0.284535 3.336 0.00085 ***
## degreeBachelor 1.066712 0.238752 4.468 7.90e-06 ***
## degreeGraduate 1.134876 0.260432 4.358 1.31e-05 ***
## age 0.013286 0.004173 3.184 0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1874.6 on 1402 degrees of freedom
## Residual deviance: 1819.9 on 1397 degrees of freedom
## AIC: 1831.9
##
## Number of Fisher Scoring iterations: 4
CI_m1 <- confint(m1)
## Waiting for profiling to be done...
CI_m1
## 2.5 % 97.5 %
## (Intercept) -1.369630662 -0.26718455
## degreeHigh school 0.009344090 0.87755480
## degreeJunior college 0.396121189 1.51336054
## degreeBachelor 0.600947631 1.53866474
## degreeGraduate 0.627946668 1.65051552
## age 0.005134799 0.02150278
exp(CI_m1) #confidence interval for the odds ratios
## 2.5 % 97.5 %
## (Intercept) 0.2542008 0.7655318
## degreeHigh school 1.0093879 2.4050118
## degreeJunior college 1.4860494 4.5419687
## degreeBachelor 1.8238463 4.6583660
## degreeGraduate 1.8737592 5.2096648
## age 1.0051480 1.0217356
The odds ratio of age is \(e^0.0133 = 1.0134\). This indicates that for every unit increase in age, a multiplicative factor of 1.01 is applied to the baseline predicted odds of \(e^-0.81 = 0.44\)
The 95% confidence interval for the age OR is 0.0051 to 0.0215, signaling that our odds ratio value falls within that range.
hist1 <- hist(fitted(m1))
#Alternatively, hist(predict(m1, type="response"))
#Converting probabilities to predicted classes with a 0.5 threshold
pred_classm1 <- fitted(m1) >= 0.5
table(pred_classm1)
## pred_classm1
## FALSE TRUE
## 215 1188
#Comparing predicted classes to actual classes
tablem1 <- table(predicted = pred_classm1, actual = x$earn25k)
#Computing the classification error rate
1-sum(diag(tablem1))/sum(tablem1)
## [1] 0.3449751
The error classification rate of the model is 0.345, or 34.4%, which is incredibly high.
#2-way table comparing predicted classes to actual classes
tablem1
## actual
## predicted No Yes
## FALSE 138 77
## TRUE 407 781
#Specificity, true negative, ie the number of people in a population correctly identified as not having the condition being tested for, in this case not earning 25k and over
tablem1[1,1]/sum(tablem1[,1]) #i.e. 138/545 (the sum of the first column)
## [1] 0.253211
#Sensitivity, the true positive, ie the number of people in people in a population correctly identified as having the condition being tested for, in this case earning 25k and over.
tablem1[2,2]/sum(tablem1[,2])
## [1] 0.9102564
The true negative (specificty) has a value of 0.2432 or 25.32% while the true positive (sensitivity) have a value of .9103 or 91.03%. Values of 90% and over are considered to be credible. Based on the outcomes of the sensitivity and specificity test, teh specificity test is more difficult to predict. In other words, it is more difficult to identify people who do not earn 25k and over or the test sometimes mistakenly identifies other values as a positive.
earn25k. I then show the summary of the final model including co-efficients.summary(x)
## wrkstat polviews happy
## Length:1403 Conservative:461 Length:1403
## Class :character Liberal :431 Class :character
## Mode :character Moderate :511 Mode :character
##
##
##
## born race sex
## Length:1403 Length:1403 Length:1403
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## degree age marital hrs1
## Lt high school: 96 Min. :18.00 Length:1403 Min. : 1.00
## High school :660 1st Qu.:33.00 Class :character 1st Qu.:36.00
## Junior college:119 Median :44.00 Mode :character Median :40.00
## Bachelor :330 Mean :44.27 Mean :41.21
## Graduate :198 3rd Qu.:55.00 3rd Qu.:48.00
## Max. :87.00 Max. :89.00
## earn25k
## No :545
## Yes:858
##
##
##
##
#Converting rest of predictors into factors
x$wrkstat <- as.factor(x$wrkstat)
sum(is.na(x$wrkstat))
## [1] 0
x$happy <- as.factor(x$happy)
sum(is.na(x$happy))
## [1] 0
x$born <- as.factor(x$born)
sum(is.na(x$born))
## [1] 0
x$race <- as.factor(x$race)
sum(is.na(x$race))
## [1] 0
x$sex <- as.factor(x$sex)
sum(is.na(x$sex))
## [1] 0
x$marital <- as.factor(x$marital)
sum(is.na(x$marital))
## [1] 0
sum(is.na(x)) #Just to be certain that there are no NAs in the dataset
## [1] 0
#Stepwise regression
m2null <- glm(earn25k ~ 1, data=x, family=binomial)
m2full <- glm(earn25k ~ ., data=x, family=binomial)
m2step <- step(m2full, scope=list(lower=m2null))
## Start: AIC=1611.31
## earn25k ~ wrkstat + polviews + happy + born + race + sex + degree +
## age + marital + hrs1
##
## Df Deviance AIC
## - polviews 2 1571.5 1607.5
## <none> 1571.3 1611.3
## - race 2 1575.9 1611.9
## - born 1 1574.1 1612.1
## - happy 2 1576.1 1612.1
## - sex 1 1577.9 1615.9
## - hrs1 1 1579.5 1617.5
## - age 1 1579.9 1617.9
## - marital 4 1586.8 1618.8
## - degree 4 1595.0 1627.0
## - wrkstat 1 1646.2 1684.2
##
## Step: AIC=1607.47
## earn25k ~ wrkstat + happy + born + race + sex + degree + age +
## marital + hrs1
##
## Df Deviance AIC
## <none> 1571.5 1607.5
## - race 2 1576.0 1608.0
## - born 1 1574.2 1608.2
## - happy 2 1576.3 1608.3
## - sex 1 1578.1 1612.1
## - hrs1 1 1579.6 1613.6
## - age 1 1579.9 1613.9
## - marital 4 1586.9 1614.9
## - degree 4 1596.2 1624.2
## - wrkstat 1 1646.5 1680.5
summary(m2step)
##
## Call:
## glm(formula = earn25k ~ wrkstat + happy + born + race + sex +
## degree + age + marital + hrs1, family = binomial, data = x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2851 -0.9566 0.6316 0.8399 2.0551
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.149939 0.541840 -3.968 7.25e-05 ***
## wrkstatWorking parttime -1.669239 0.199957 -8.348 < 2e-16 ***
## happyPretty happy 0.373102 0.208398 1.790 0.07340 .
## happyVery happy 0.499037 0.227493 2.194 0.02826 *
## bornYes 0.346718 0.208198 1.665 0.09585 .
## raceOther -0.221528 0.264220 -0.838 0.40179
## raceWhite 0.212476 0.165810 1.281 0.20004
## sexMale 0.328496 0.127959 2.567 0.01025 *
## degreeHigh school 0.298342 0.254804 1.171 0.24165
## degreeJunior college 0.890924 0.324912 2.742 0.00611 **
## degreeBachelor 0.839172 0.271850 3.087 0.00202 **
## degreeGraduate 0.890965 0.293478 3.036 0.00240 **
## age 0.015687 0.005435 2.886 0.00390 **
## maritalMarried 0.336504 0.180207 1.867 0.06186 .
## maritalNever married -0.072820 0.203491 -0.358 0.72045
## maritalSeparated -0.713231 0.340628 -2.094 0.03627 *
## maritalWidowed 0.007107 0.385226 0.018 0.98528
## hrs1 0.015681 0.005565 2.818 0.00484 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1874.6 on 1402 degrees of freedom
## Residual deviance: 1571.5 on 1385 degrees of freedom
## AIC: 1607.5
##
## Number of Fisher Scoring iterations: 4
The signs of most of the coefficients are acceptable. The coefficient represent log-odds, and once exponented reveal the odds associated with each predictor. As such a negative coefficient results in odds < 1, thus the odd of the event occurring are lower than the baseline. A positive coefficient, on the other hand has odds > 1, implying increased odds of observing event, relative to baseline.
#Histogram of fitted values
hist2 <- hist(fitted(m2step))
#Converting probabilities to predicted classes with a 0.5 threshold
pred_class_m2step <- fitted(m2step) >= 0.5
table(pred_class_m2step)
## pred_class_m2step
## FALSE TRUE
## 343 1060
#Comparing predicted classes to actual classes
table_m2step <- table(predicted = pred_class_m2step, actual = x$earn25k)
#Computing the classification error rate
1-sum(diag(table_m2step))/sum(table_m2step)
## [1] 0.2637206
#2-way table comparing predicted classes to actual classes
table_m2step
## actual
## predicted No Yes
## FALSE 259 84
## TRUE 286 774
#Specificity, true negative
table_m2step[1,1]/sum(table_m2step[,1])
## [1] 0.4752294
#Sensitivity, the true positive
table_m2step[2,2]/sum(table_m2step[,2])
## [1] 0.9020979
The classification error rate is now 0.2637, or 26.37%. It has been reduced by 8.03% from 34.4%. The specificity value has also increased to 0.475 or 46.5%, while the sensitivity rate has decreased slightly to 90.2%.
#Testing classification accuracy
N <- 99
threshs <- seq(0.01, 0.99, by=0.01)
errors <- rep(NA, length(threshs))
preds <- rep(NA, nrow(x))
for (i in 1:length(threshs)) {
pred_class <- fitted(m2step) >= threshs[i]
errors[i] <- mean(pred_class != x$earn25k)
}
#Optimal classification thresholds
threshs[which.min(errors)]
## [1] 0.01
#Associated classification error rates
errors[which.min(errors)]
## [1] 1
The optimal classification threshold is 0.01, its associated classification error rate is 1.
I repeat part (c) while using 10-fold cross-validation to fit the model/generate predictions. I used the same set of predictors across all folds. This approach gives a better estimation of the out-of-sample prediction accuracy.
#10-fold cross-validation
K <- 10
set.seed(230)
folds <- rep(1:K, length.out = nrow(x))
folds <- sample(folds)
preds <- rep(NA, nrow(x))
for (k in 1:K) {
train <- x[folds != k,]
test <- x[folds == k,]
mtmp <- glm(formula(m2step), data = train, family=binomial)
preds[folds == k] <- predict(mtmp, newdata = test, type="response")
}
#Testing classification accuracy
threshs <- seq(0.01, 0.99, by=0.01)
errs <- rep(NA, length(threshs))
for (i in 1:length(threshs)) {
pred_class <- preds >= threshs[i]
errs[i] <- mean(pred_class != x$earn25k)
}
plot(errs ~ threshs, main="Classification Error by Threshold", xlab="Classification Threshold",ylab="Classification Error Rate", type="l")
#Optimal classification thresholds
threshs[which.min(errs)]
## [1] 0.01
#Associated classification error rates
errs[which.min(errs)]
## [1] 1
The classification threshold with minimum errors is 0.01, and its associated classification error rate is 1. This is the same as the classification threshold and error rate from the previous question.