Title: ‘General Social Survey 2015: Relationship between education and earnings’

Synopsis: I explore the relationship between education and earnings in the US, using data from the 2015 General Social Survey together with stepwise regression and cross-validation.

Skills: Logistic regression, data visualization, model assessment, stepwise regression, 10-fold cross-validation, classification errors, for-loops

Problems

Cleaning the environment and setting the working directory

rm(list = ls()) 
setwd("/Users/Roberta/Desktop/Data Analysis & Exploration/HW10")
x <- read.csv("GSS_HW10.csv", as.is = TRUE)

library(ggplot2)
library(dplyr)
library(lubridate)
library(pander)

GSS dataset: https://www.dropbox.com/s/691s1aun0ujsiyd/GSS_HW10.csv?raw=1

Variables included: wrkstat, polviews, happy, born, race, sex, degree, age, marital, hrs1, and earn25k.

We restrict our attention to respondents who worked either part-time or full-time in 2015. One question in the GSS asks respondents for their income level by checking off one of several income categories. The highest income bucket provided is "$25,000 or more." (In the 1970s, when the GSS was first conducted, this was a large amount, and few respondents checked off the highest income category.)

We will use binomial logistic regression to predict whether someone earns more than $25,000 a year, as seen in the earn25k variable.
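
Concretely, binomial logistic regression models the log-odds of the positive outcome as a linear function of the predictors:

\(\log \frac{P(\text{earn25k} = \text{Yes})}{1 - P(\text{earn25k} = \text{Yes})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\)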

  1. Data Exploration

I consider the relationship between degree and earn25k through a variety of lenses, including visualizations, contingency tables, and logistic regression.

  a. Data Cleaning

I convert earn25k, the response variable, into a factor, and convert degree and polviews into factors with suitable orderings of the levels, to make the regression coefficients easier to interpret.

#EARN25K
x$earn25k <- as.factor(x$earn25k)
levels(x$earn25k)
## [1] "No"  "Yes"
#DEGREE
x %>% select(degree) %>% distinct #The distinct entries under degree
##           degree
## 1       Bachelor
## 2    High school
## 3       Graduate
## 4 Lt high school
## 5 Junior college
#Lt high school, High school, Junior College, Bachelor, Graduate 

#Reorder the degree levels from lowest to highest educational attainment
x <- x %>% mutate(degree = factor(degree, levels = c("Lt high school", "High school", "Junior college", "Bachelor", "Graduate")))
x$degree[1:5]
## [1] Bachelor    High school High school Graduate    High school
## Levels: Lt high school High school Junior college Bachelor Graduate
levels(x$degree)
## [1] "Lt high school" "High school"    "Junior college" "Bachelor"      
## [5] "Graduate"
#POLVIEWS
x <- x %>% mutate(polviews = factor(polviews, levels = c("Conservative", "Liberal", "Moderate")))
x$polviews[1:5]
## [1] Moderate Liberal  Moderate Liberal  Liberal 
## Levels: Conservative Liberal Moderate
  b. Visual representation of the relationship between degree and earn25k.
ggplot(x) + geom_bar(aes(x=degree,fill = earn25k), position="fill")

As can be seen from the graph above, individuals become increasingly likely to earn $25k or more as educational attainment increases.

  c. Table of degree against earn25k.
table1 <- table(education=x$degree, earn25k=x$earn25k)
table1 
##                 earn25k
## education         No Yes
##   Lt high school  53  43
##   High school    298 362
##   Junior college  39  80
##   Bachelor       100 230
##   Graduate        55 143
prop.table(table1, 1)*100
##                 earn25k
## education              No      Yes
##   Lt high school 55.20833 44.79167
##   High school    45.15152 54.84848
##   Junior college 32.77311 67.22689
##   Bachelor       30.30303 69.69697
##   Graduate       27.77778 72.22222
prop.table(table1)*100
##                 earn25k
## education               No       Yes
##   Lt high school  3.777619  3.064861
##   High school    21.240200 25.801853
##   Junior college  2.779758  5.702067
##   Bachelor        7.127584 16.393443
##   Graduate        3.920171 10.192445
#Expected count, under independence, of respondents with a Bachelor's degree earning $25k+
sum(x$degree == "Bachelor") * sum(x$earn25k == "Yes") / nrow(x)
## [1] 201.8104

Based on the tables above, individuals with a Bachelor’s degree have nearly a 70% chance of earning $25k or more. There is a general pattern of more people earning $25k+ as education increases, indicating a strong positive association between education and earnings. If the two variables were independent, I would expect the number of respondents with degree = "Bachelor" and earn25k = "Yes" to be about 201.8, whereas we observe 230.
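
As a check on this independence reasoning, chisq.test computes the expected count for every cell of the table, along with a formal test of independence; a quick sketch, going one step beyond what the problem asks:

indep <- chisq.test(table1)
indep$expected #the Bachelor/Yes cell should reproduce 201.8104
indep #a small p-value indicates degree and earn25k are not independent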

Using counts in this table (and showing the work), compute the odds ratio of Bachelor-level respondents earning $25k+ vs. less-than-high-school respondents earning $25k+.

table1
##                 earn25k
## education         No Yes
##   Lt high school  53  43
##   High school    298 362
##   Junior college  39  80
##   Bachelor       100 230
##   Graduate        55 143
oddsb <- 230/100 #odds of earning $25k+ with a Bachelor's degree
oddslth <- 43/53 #odds of earning $25k+ with less than a high school education
oddsb/oddslth #odds ratio
## [1] 2.834884

The probability of someone with a Bachelor’s degree earning $25k+ is 69.7% (odds of 230/100 = 2.3), while for someone with less than a high school education it is 44.8% (odds of 43/53 ≈ 0.81). The odds ratio of someone with a Bachelor’s degree earning $25k+ vs. someone with less than a high school education doing the same is therefore 2.3/0.81 ≈ 2.834884.

“An odds ratio of exactly 1 means that exposure to property A does not affect the odds of property B. An odds ratio of more than 1 means that there is a higher odds of property B happening with exposure to property A. An odds ratio is less than 1 is associated with lower odds.” (http://www.statisticshowto.com/odds-ratio/)
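
Following this definition, the hand calculation above can be wrapped in a small helper function (a hypothetical convenience, not part of the original assignment):

#Odds ratio of earning $25k+ between two education levels in a counts table
odds_ratio <- function(tab, row1, row2) {
  (tab[row1, "Yes"] / tab[row1, "No"]) / (tab[row2, "Yes"] / tab[row2, "No"])
}
odds_ratio(table1, "Bachelor", "Lt high school") #should reproduce 2.834884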

  d. Fit a logistic regression model using degree, display the model summary, and verify your calculation of the odds ratio in (c).
m0 <- glm(earn25k ~ degree, data=x, family=binomial)
summary(m0)
## 
## Call:
## glm(formula = earn25k ~ degree, family = binomial, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6006  -1.2611   0.8497   1.0960   1.2674  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -0.2091     0.2052  -1.019  0.30832    
## degreeHigh school      0.4036     0.2196   1.838  0.06610 .  
## degreeJunior college   0.9276     0.2833   3.274  0.00106 ** 
## degreeBachelor         1.0420     0.2376   4.385 1.16e-05 ***
## degreeGraduate         1.1646     0.2594   4.489 7.15e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1874.6  on 1402  degrees of freedom
## Residual deviance: 1830.2  on 1398  degrees of freedom
## AIC: 1840.2
## 
## Number of Fisher Scoring iterations: 4

According to the model, the odds ratio for someone with a Bachelor’s degree earning over $25k is \(e^{1.04} \approx 2.83\), matching the hand calculation. The baseline odds of earning over $25k are \(e^{-0.21} \approx 0.81\); these are the odds for someone with less than a high school education, the reference level.
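
The same verification can be read directly off the fitted model by exponentiating its coefficients:

exp(coef(m0)["degreeBachelor"]) #about 2.83, reproducing the hand calculation
exp(coef(m0)["(Intercept)"]) #about 0.81, the baseline odds for Lt high school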

  e. Do the model coefficients agree with your intuition about the relationship between educational attainment and earning potential? Justify your answer.

The model coefficients agree with my intuition about the relationship between educational attainment and earning potential: educational attainment is positively associated with earning potential. Each coefficient increases the predicted odds that an individual earns over $25k, and most of the coefficients are statistically significant. The High school coefficient, which is not significant at the 0.05 level, suggests that a high school degree alone does not significantly change earning potential relative to the baseline of less than a high school education.

  2. Model Assessment

Here I examine the relationship between earnings, degree, and age. I then interpret the model and assess its fit using the odds ratio for age.

  a. Model creation
m1 <- glm(earn25k ~ degree + age, data=x, family=binomial)
summary(m1)
## 
## Call:
## glm(formula = earn25k ~ degree + age, family = binomial, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7985  -1.2277   0.8106   1.0091   1.3940  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.814966   0.280838  -2.902  0.00371 ** 
## degreeHigh school     0.440827   0.220919   1.995  0.04600 *  
## degreeJunior college  0.949214   0.284535   3.336  0.00085 ***
## degreeBachelor        1.066712   0.238752   4.468 7.90e-06 ***
## degreeGraduate        1.134876   0.260432   4.358 1.31e-05 ***
## age                   0.013286   0.004173   3.184  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1874.6  on 1402  degrees of freedom
## Residual deviance: 1819.9  on 1397  degrees of freedom
## AIC: 1831.9
## 
## Number of Fisher Scoring iterations: 4
CI_m1 <- confint(m1)
## Waiting for profiling to be done...
CI_m1
##                             2.5 %      97.5 %
## (Intercept)          -1.369630662 -0.26718455
## degreeHigh school     0.009344090  0.87755480
## degreeJunior college  0.396121189  1.51336054
## degreeBachelor        0.600947631  1.53866474
## degreeGraduate        0.627946668  1.65051552
## age                   0.005134799  0.02150278
exp(CI_m1) #confidence interval for the odds ratios
##                          2.5 %    97.5 %
## (Intercept)          0.2542008 0.7655318
## degreeHigh school    1.0093879 2.4050118
## degreeJunior college 1.4860494 4.5419687
## degreeBachelor       1.8238463 4.6583660
## degreeGraduate       1.8737592 5.2096648
## age                  1.0051480 1.0217356

The odds ratio for age is \(e^{0.0133} \approx 1.0134\). This indicates that for every one-year increase in age, the predicted odds of earning over $25k are multiplied by about 1.01, holding degree constant. The baseline odds (the intercept) are \(e^{-0.81} \approx 0.44\).

The 95% confidence interval for the age coefficient is 0.0051 to 0.0215; exponentiating gives 1.0051 to 1.0217 for the odds ratio. This interval contains the point estimate of 1.0134 and excludes 1, so the positive association with age is statistically significant.
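
As a quick check, the point estimate itself can be pulled from the model:

exp(coef(m1)["age"]) #about 1.0134, inside the exponentiated interval above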

  b. Histogram of the predicted probabilities and the classification error rate using 0.50 as the classification threshold.
hist1 <- hist(fitted(m1))

#Alternatively, hist(predict(m1, type="response"))

#Converting probabilities to predicted classes with a 0.5 threshold
pred_classm1 <- fitted(m1) >= 0.5
table(pred_classm1)
## pred_classm1
## FALSE  TRUE 
##   215  1188
#Comparing predicted classes to actual classes
tablem1 <- table(predicted = pred_classm1, actual = x$earn25k)

#Computing the classification error rate
1-sum(diag(tablem1))/sum(tablem1)
## [1] 0.3449751

The classification error rate of the model is 0.345, or 34.5%, which is quite high: it is only modestly better than the 38.8% error rate of always predicting "Yes", the majority class (545 of the 1403 respondents answered "No").
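
For reference, the majority-class baseline error rate can be computed directly:

#Error rate from always predicting the majority class ("Yes")
1 - max(table(x$earn25k)) / nrow(x) #545/1403, about 0.388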

  c. Display a 2-way table comparing the predicted classes and the actual classes. Based on calculations of sensitivity and specificity, which of the two outcomes seems harder to predict?
#2-way table comparing predicted classes to actual classes
tablem1
##          actual
## predicted  No Yes
##     FALSE 138  77
##     TRUE  407 781
#Specificity: the true negative rate, i.e. the proportion of respondents correctly identified as not having the condition of interest, here not earning $25k+
tablem1[1,1]/sum(tablem1[,1]) #i.e. 138/545 (the sum of the first column)
## [1] 0.253211
#Sensitivity: the true positive rate, i.e. the proportion of respondents correctly identified as having the condition of interest, here earning $25k+
tablem1[2,2]/sum(tablem1[,2])
## [1] 0.9102564

The specificity (true negative rate) is 0.2532, or 25.32%, while the sensitivity (true positive rate) is 0.9103, or 91.03%. (As a rough rule of thumb, rates of 90% and above are considered strong.) Based on these values, the negative outcome is much harder to predict: the model misses most of the people who do not earn $25k+, frequently misclassifying them as positives.
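
These two calculations can be wrapped in a small helper (a hypothetical convenience; it assumes the confusion-table layout used above, with predicted FALSE/TRUE rows and actual No/Yes columns):

class_metrics <- function(tab) {
  c(specificity = tab[1, 1] / sum(tab[, 1]), #true negative rate
    sensitivity = tab[2, 2] / sum(tab[, 2])) #true positive rate
}
class_metrics(tablem1)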

  3. Stepwise Regression
  a. I started from a full model containing all available predictors and used backward stepwise regression to predict earn25k. I then show the summary of the final model, including coefficients.
summary(x)
##    wrkstat                  polviews      happy          
##  Length:1403        Conservative:461   Length:1403       
##  Class :character   Liberal     :431   Class :character  
##  Mode  :character   Moderate    :511   Mode  :character  
##                                                          
##                                                          
##                                                          
##      born               race               sex           
##  Length:1403        Length:1403        Length:1403       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##             degree         age          marital               hrs1      
##  Lt high school: 96   Min.   :18.00   Length:1403        Min.   : 1.00  
##  High school   :660   1st Qu.:33.00   Class :character   1st Qu.:36.00  
##  Junior college:119   Median :44.00   Mode  :character   Median :40.00  
##  Bachelor      :330   Mean   :44.27                      Mean   :41.21  
##  Graduate      :198   3rd Qu.:55.00                      3rd Qu.:48.00  
##                       Max.   :87.00                      Max.   :89.00  
##  earn25k  
##  No :545  
##  Yes:858  
##           
##           
##           
## 
#Converting the rest of the predictors into factors
x$wrkstat <- as.factor(x$wrkstat) 
sum(is.na(x$wrkstat))
## [1] 0
x$happy <- as.factor(x$happy) 
sum(is.na(x$happy))
## [1] 0
x$born <- as.factor(x$born) 
sum(is.na(x$born))
## [1] 0
x$race <- as.factor(x$race) 
sum(is.na(x$race))
## [1] 0
x$sex <- as.factor(x$sex) 
sum(is.na(x$sex))
## [1] 0
x$marital <- as.factor(x$marital) 
sum(is.na(x$marital))
## [1] 0
sum(is.na(x)) #Just to be certain that there are no NAs in the dataset
## [1] 0
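#A more compact, equivalent alternative to the one-by-one conversions above
#(a sketch; at this point it is a no-op because the columns are already factors)
char_cols <- sapply(x, is.character)
x[char_cols] <- lapply(x[char_cols], as.factor)
stopifnot(sum(is.na(x)) == 0) #confirm the dataset still has no NAs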
#Stepwise regression
m2null <- glm(earn25k ~ 1, data=x, family=binomial)
m2full <- glm(earn25k ~ ., data=x, family=binomial)
m2step <- step(m2full, scope=list(lower=m2null))
## Start:  AIC=1611.31
## earn25k ~ wrkstat + polviews + happy + born + race + sex + degree + 
##     age + marital + hrs1
## 
##            Df Deviance    AIC
## - polviews  2   1571.5 1607.5
## <none>          1571.3 1611.3
## - race      2   1575.9 1611.9
## - born      1   1574.1 1612.1
## - happy     2   1576.1 1612.1
## - sex       1   1577.9 1615.9
## - hrs1      1   1579.5 1617.5
## - age       1   1579.9 1617.9
## - marital   4   1586.8 1618.8
## - degree    4   1595.0 1627.0
## - wrkstat   1   1646.2 1684.2
## 
## Step:  AIC=1607.47
## earn25k ~ wrkstat + happy + born + race + sex + degree + age + 
##     marital + hrs1
## 
##           Df Deviance    AIC
## <none>         1571.5 1607.5
## - race     2   1576.0 1608.0
## - born     1   1574.2 1608.2
## - happy    2   1576.3 1608.3
## - sex      1   1578.1 1612.1
## - hrs1     1   1579.6 1613.6
## - age      1   1579.9 1613.9
## - marital  4   1586.9 1614.9
## - degree   4   1596.2 1624.2
## - wrkstat  1   1646.5 1680.5
summary(m2step)
## 
## Call:
## glm(formula = earn25k ~ wrkstat + happy + born + race + sex + 
##     degree + age + marital + hrs1, family = binomial, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2851  -0.9566   0.6316   0.8399   2.0551  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.149939   0.541840  -3.968 7.25e-05 ***
## wrkstatWorking parttime -1.669239   0.199957  -8.348  < 2e-16 ***
## happyPretty happy        0.373102   0.208398   1.790  0.07340 .  
## happyVery happy          0.499037   0.227493   2.194  0.02826 *  
## bornYes                  0.346718   0.208198   1.665  0.09585 .  
## raceOther               -0.221528   0.264220  -0.838  0.40179    
## raceWhite                0.212476   0.165810   1.281  0.20004    
## sexMale                  0.328496   0.127959   2.567  0.01025 *  
## degreeHigh school        0.298342   0.254804   1.171  0.24165    
## degreeJunior college     0.890924   0.324912   2.742  0.00611 ** 
## degreeBachelor           0.839172   0.271850   3.087  0.00202 ** 
## degreeGraduate           0.890965   0.293478   3.036  0.00240 ** 
## age                      0.015687   0.005435   2.886  0.00390 ** 
## maritalMarried           0.336504   0.180207   1.867  0.06186 .  
## maritalNever married    -0.072820   0.203491  -0.358  0.72045    
## maritalSeparated        -0.713231   0.340628  -2.094  0.03627 *  
## maritalWidowed           0.007107   0.385226   0.018  0.98528    
## hrs1                     0.015681   0.005565   2.818  0.00484 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1874.6  on 1402  degrees of freedom
## Residual deviance: 1571.5  on 1385  degrees of freedom
## AIC: 1607.5
## 
## Number of Fisher Scoring iterations: 4

The signs of most of the coefficients are plausible. The coefficients represent log-odds; once exponentiated, they give the odds ratio associated with each predictor. A negative coefficient yields an odds ratio below 1, meaning the odds of the event are lower than the baseline; a positive coefficient yields an odds ratio above 1, implying increased odds of observing the event relative to the baseline.
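
To read the stepwise model on the odds scale, its coefficients can be exponentiated directly:

exp(coef(m2step)) #odds ratios implied by the model; e.g. working part-time
#multiplies the odds of earning $25k+ by exp(-1.67), about 0.19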

  b. Now repeat Problem 2(b) and 2(c) for the stepwise model. By how much did we reduce the error rate?
#Histogram of fitted values
hist2 <- hist(fitted(m2step))

#Converting probabilities to predicted classes with a 0.5 threshold
pred_class_m2step <- fitted(m2step) >= 0.5
table(pred_class_m2step)
## pred_class_m2step
## FALSE  TRUE 
##   343  1060
#Comparing predicted classes to actual classes
table_m2step <- table(predicted = pred_class_m2step, actual = x$earn25k)

#Computing the classification error rate
1-sum(diag(table_m2step))/sum(table_m2step)
## [1] 0.2637206
#2-way table comparing predicted classes to actual classes
table_m2step 
##          actual
## predicted  No Yes
##     FALSE 259  84
##     TRUE  286 774
#Specificity, true negative
table_m2step[1,1]/sum(table_m2step[,1]) 
## [1] 0.4752294
#Sensitivity, the true positive
table_m2step[2,2]/sum(table_m2step[,2])
## [1] 0.9020979

The classification error rate is now 0.2637, or 26.37%, a reduction of about 8.1 percentage points from 34.5%. The specificity has also increased to 0.4752, or 47.52%, while the sensitivity has decreased slightly to 90.21%.

  c. Let’s see if we can improve on the classification error rate by using different classification thresholds. I use a for-loop to compute the classification error rate that results from thresholds ranging from 0.01 up to 0.99 in steps of 0.01. In one plot, display (1) sensitivity, (2) specificity, and (3) the classification error rate (on the y-axis) against the threshold (on the x-axis). What threshold minimizes the classification error rate?
#Testing classification accuracy
threshs <- seq(0.01, 0.99, by=0.01)
errors <- rep(NA, length(threshs))

for (i in seq_along(threshs)) {
  #Recode predictions as "Yes"/"No" so they are comparable to the factor
  #earn25k; comparing a logical TRUE/FALSE vector to it directly never matches
  pred_class <- ifelse(fitted(m2step) >= threshs[i], "Yes", "No")
  errors[i] <- mean(pred_class != x$earn25k)
}

#Optimal classification threshold
threshs[which.min(errors)]
#Associated classification error rate
errors[which.min(errors)]

In the original run, every threshold reported a classification error rate of 1, and the "optimal" threshold came out as 0.01. That was an artifact of comparing the logical vector pred_class (TRUE/FALSE) against the factor earn25k ("No"/"Yes"): the two codings never match, so every prediction counted as an error. With the predictions recoded as "Yes"/"No", the error curve is U-shaped, and its minimum is at most the 26.37% achieved at the 0.5 threshold in part (b).
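
The problem also asks for sensitivity and specificity on the same plot; a sketch, assuming the corrected "Yes"/"No" coding and the errors vector from above:

#Compute sensitivity and specificity at each threshold
sens <- spec <- rep(NA, length(threshs))
for (i in seq_along(threshs)) {
  pred_class <- ifelse(fitted(m2step) >= threshs[i], "Yes", "No")
  sens[i] <- mean(pred_class[x$earn25k == "Yes"] == "Yes")
  spec[i] <- mean(pred_class[x$earn25k == "No"] == "No")
}
matplot(threshs, cbind(sens, spec, errors), type = "l", lty = 1,
        col = c("blue", "red", "black"),
        xlab = "Classification Threshold", ylab = "Rate")
legend("right", legend = c("Sensitivity", "Specificity", "Error rate"),
       lty = 1, col = c("blue", "red", "black"))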

  d. 10-fold cross-validation

I repeat part (c) using 10-fold cross-validation to fit the model and generate predictions, with the same set of predictors across all folds. This approach gives a better estimate of out-of-sample prediction accuracy.

#10-fold cross-validation
K <- 10
set.seed(230)
folds <- rep(1:K, length.out = nrow(x))
folds <- sample(folds)

preds <- rep(NA, nrow(x))
for (k in 1:K) {
  train <- x[folds != k,]
  test <- x[folds == k,]
  mtmp <- glm(formula(m2step), data = train, family=binomial)
  preds[folds == k] <- predict(mtmp, newdata = test, type="response")
}

#Testing classification accuracy
threshs <- seq(0.01, 0.99, by=0.01)
errs <- rep(NA, length(threshs))

for (i in seq_along(threshs)) {
  pred_class <- ifelse(preds >= threshs[i], "Yes", "No") #same recoding fix as in part (c)
  errs[i] <- mean(pred_class != x$earn25k)
}

plot(errs ~ threshs, main="Classification Error by Threshold", xlab="Classification Threshold",ylab="Classification Error Rate", type="l")

#Optimal classification threshold
threshs[which.min(errs)]
#Associated classification error rate
errs[which.min(errs)]

The original run likewise reported a threshold of 0.01 with an error rate of 1, for the same logical-versus-factor reason as in part (c). With the recoding fix applied, the cross-validated error curve gives an honest estimate of out-of-sample accuracy: the predictions for each fold come from a model fit without those observations, so the CV error at each threshold is typically slightly higher than the corresponding in-sample error from part (c).
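
To see the cost of in-sample optimism directly, the two error curves can be overlaid (a sketch, assuming the corrected errors and errs vectors from above):

#Overlay in-sample (part c) and cross-validated error curves
plot(threshs, errs, type = "l", col = "red",
     xlab = "Classification Threshold", ylab = "Classification Error Rate")
lines(threshs, errors, col = "blue")
legend("topright", legend = c("10-fold CV", "In-sample"),
       lty = 1, col = c("red", "blue"))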