In this homework assignment, I'll be creating various parametric models to estimate the effects of independent variables upon our outcome variable: first-time entry into community college after high school. This data is derived from the ELS:2002 dataset.
#loading packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(haven)
library(survival)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
library(ggplot2)
library(muhaz)
library(eha)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
#loading raw data
load("C:/Users/hhx360/Google Drive (naslund.utsa@gmail.com)/School Files/School/MS Applied Demography/semesters/Fall 2020/DEM 7223/ELS 2002/unzipped/els_02_12_byf3pststu_v1_0.rdata")
els<-els_02_12_byf3pststu_v1_0
rm(els_02_12_byf3pststu_v1_0)
#standardizing variable names
colnames(els)<-toupper(colnames(els))
#college entry variables
entry_variables<-els %>% select(STU_ID,SCH_ID,STRAT_ID,PSU,F2HS2P_P,F2PS1SEC,F3HS2PS1,F3PS1SEC,BYQXDATP,F1RDTLFT,BYSES1QU,F3QWT,BYP33,BYSEX,BYPARED)
entry_variables<-entry_variables %>% mutate(F2_months_to_college=ifelse(F2HS2P_P<0,NA_real_,F2HS2P_P))
entry_variables<-entry_variables %>% mutate(F3_months_to_college=ifelse(F3HS2PS1<0,NA_real_,F3HS2PS1))
entry_variables<-entry_variables %>% mutate(F2_first_college_2_year=ifelse(F2PS1SEC<0,NA_real_,
                                                                           ifelse(F2PS1SEC==4,1,0)))
entry_variables<-entry_variables %>% mutate(F3_first_college_2_year=ifelse(F3PS1SEC<0,NA_real_,
                                                                           ifelse(F3PS1SEC==4,1,0)))
entry_variables_complete<-entry_variables %>% filter(!is.na(F3_first_college_2_year))
entry_variables_complete<-entry_variables_complete %>% mutate(TV_ENTERTAIN_HRS=ifelse(BYP33<0,NA,
ifelse(BYP33>=0 & BYP33<=3,0,
ifelse(BYP33>=4 & BYP33<=10,1,NA))))
entry_variables_complete<-entry_variables_complete %>% mutate(sex=ifelse(BYSEX==1,"Male",
ifelse(BYSEX==2,"Female",
ifelse(BYSEX<0,NA,NA))))
entry_variables_complete<-entry_variables_complete %>% mutate(MALE=ifelse(sex=="Male",1,0))
entry_variables_complete<-entry_variables_complete %>% mutate(parents_highest_ed=ifelse(BYPARED==1,"Did.Not.Graduate.HS",
ifelse(BYPARED==2,"HS.Grad.or.GED",
ifelse(BYPARED==3,"2.Year.School.No.Deg",
ifelse(BYPARED==4,"2.Year.School.Awarded.Deg",
ifelse(BYPARED==5,"4.Year.School.No.Deg",
ifelse(BYPARED==6,"4.Year.School.Awarded.Deg",
ifelse(BYPARED==7,"Masters.Degree",
ifelse(BYPARED==8,"PhD.or.other.doctorate",
ifelse(BYPARED<0,NA,NA))))))))))
entry_variables_complete<-entry_variables_complete %>% mutate(PARENT_HS_OR_LOWER=ifelse(parents_highest_ed=="Did.Not.Graduate.HS" | parents_highest_ed=="HS.Grad.or.GED",1,0))
entry_variables_complete<-entry_variables_complete %>% mutate(F3_UPDATED_MONTHS_TO_COLLEGE=ifelse(is.na(F3_months_to_college),21,F3_months_to_college))
entry_variables_complete<-entry_variables_complete %>% mutate(BY_LOW_SES=ifelse(BYSES1QU==1,1,
ifelse(BYSES1QU==2,1,
ifelse(BYSES1QU==3,0,
ifelse(BYSES1QU==4,0,
ifelse(BYSES1QU<0,NA_real_,0))))))
entry_variables_complete<-entry_variables_complete %>% filter(!is.na(BY_LOW_SES) & !is.na(F3_UPDATED_MONTHS_TO_COLLEGE) & !is.na(PARENT_HS_OR_LOWER) & !is.na(MALE) & !is.na(TV_ENTERTAIN_HRS) & !is.na(F3_first_college_2_year))
For this assignment, I hypothesize that the following variables will affect our dependent variable: male status, parents' highest education of high school or less, watching four or more hours of TV/entertainment, and low baseline socioeconomic status. Summary statistics for each variable appear below.
#summary statistics for the analysis variables
#Male Status:
summary(entry_variables_complete$MALE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4543 1.0000 1.0000
#Parents' highest education: HS diploma/GED or less
summary(entry_variables_complete$PARENT_HS_OR_LOWER)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1836 0.0000 1.0000
#TV four hours or longer each week
summary(entry_variables_complete$TV_ENTERTAIN_HRS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.347 1.000 1.000
#Low SES (bottom two base-year SES quartiles)
summary(entry_variables_complete$BY_LOW_SES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3751 1.0000 1.0000
Here is the survival curve for community college first-time entry in the ELS:2002 dataset:
empirical_fx<-survfit(Surv(time=F3_UPDATED_MONTHS_TO_COLLEGE,event=F3_first_college_2_year)~1,data=entry_variables_complete)
plot(empirical_fx,conf.int = T,main="Survival F(x) - ELS:2002 time to CC Entry", ylab="S(t)",xlab="Months")
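To attach some numbers to this curve, the survfit object can also be summarized at a handful of time points. This is just an illustrative check of the estimated survival probabilities; the chosen months are arbitrary:
#quick illustrative check: survival probabilities at selected months since high school
summary(empirical_fx,times=c(3,6,12,21))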
Because some students began community college in "month zero" (that is, immediately upon graduating high school), we add a small positive offset (0.1 of a month) to those zero times so they can be included in the hazard calculations:
#adding a very small amount to zero month observations
entry_variables_complete2<-entry_variables_complete
entry_variables_complete2<-entry_variables_complete2 %>% mutate(F3_UPDATED_MONTHS_TO_COLLEGE=ifelse(F3_UPDATED_MONTHS_TO_COLLEGE==0,0.1,F3_UPDATED_MONTHS_TO_COLLEGE))
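As a quick sanity check (purely illustrative), we can confirm that no zero-month times remain after the adjustment:
#sanity check: the minimum time should now be 0.1 rather than 0
min(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE)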
For these models, we will use a proportional hazards model rather than an accelerated failure time (AFT) model. The proportional hazards model is more appropriate here because it models how the covariates change the hazard of the event (in this case, first-time entry into community college after high school), whereas an AFT model instead describes how the covariates stretch or compress the time until the event.
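For reference only, the AFT counterpart of these models could be fit in eha with aftreg() using the same formula; the sketch below assumes a Weibull baseline and is not used in the analysis that follows.
#illustrative sketch: an AFT analogue of the models fit below (Weibull baseline assumed)
fit.aft<-aftreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist="weibull")
summary(fit.aft)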
Here is the hazard plot of first time entry into community college after high school for students in the ELS:2002 dataset, along with a smoothed hazard plot interpolated on top:
fit.haz.km<-kphaz.fit(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete2$F3_first_college_2_year,method="product-limit")
fit.haz.sm<-muhaz(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete2$F3_first_college_2_year,min.time = 0,max.time = 21)
kphaz.plot(fit.haz.km,main="Hazard Plot - Attending CC")
lines(fit.haz.sm,col=2,lwd=3)
legend("topleft",legend=c("KM Hazard","Smoothed Hazard"),col=c(1,2),lty=c(1,1))
Now, I will fit several parametric models (exponential, Weibull, lognormal, and piecewise constant) to see how well they fit our empirical data.
First, we fit an exponential model and compare it to the empirical data. Here are the main effects:
#let's fit various models to see how well they fit
#exponential model
fit.exp<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "weibull",shape=1)
summary(fit.exp)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "weibull", shape = 1)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 -0.062 0.940 0.039 0.116
## PARENT_HS_OR_LOW 0.241 -0.018 0.982 0.051 0.728
## TV_ENTERTAIN_HRS 0.344 0.036 1.037 0.041 0.380
## BY_LOW_SES 0.485 0.178 1.195 0.045 0.000
##
## log(scale) 2.950 0.038 0.000
##
## Shape is fixed at 1
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -10214
## LR test statistic 22.69
## Degrees of freedom 4
## Overall p-value 0.000145763
The model suggests that of our chosen covariates, only Low SES has a significant effect upon first-time entry into community college after HS.
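To make these effects easier to read, hazard ratios with rough 95% confidence intervals can be tabulated directly from the fitted object. This is only a sketch; it assumes the phreg fit stores its estimates in $coefficients and their covariance matrix in $var (rows for baseline parameters such as log(scale) will also appear and can be ignored):
#sketch: hazard ratios and approximate 95% CIs from a phreg fit
#(assumes $coefficients and $var are available on the fitted object)
hr_table<-function(fit){
  b<-fit$coefficients
  se<-sqrt(diag(fit$var))
  data.frame(HR=exp(b),lower95=exp(b-1.96*se),upper95=exp(b+1.96*se))
}
hr_table(fit.exp)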
Here we see the exponential model compared to the empirical data in graphical format:
plot(fit.exp,fn="haz",ylim =c(0,0.2),main = "Exponential")
lines(fit.haz.sm,col=2)
Obviously, the exponential model leaves something to be desired. It does not resemble the empirical data at all.
Now, we will test a Weibull model and compare it to the empirical data. Here are the main effects for Weibull:
#Weibull model, shape parameter estimated rather than fixed
fit.weibull<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "weibull")
summary(fit.weibull)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "weibull")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 -0.079 0.924 0.039 0.045
## PARENT_HS_OR_LOW 0.241 -0.031 0.970 0.051 0.550
## TV_ENTERTAIN_HRS 0.344 0.032 1.033 0.041 0.435
## BY_LOW_SES 0.485 0.108 1.114 0.045 0.018
##
## log(scale) 2.813 0.034 0.000
## log(shape) 0.149 0.014 0.000
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -10159
## LR test statistic 11.06
## Degrees of freedom 4
## Overall p-value 0.0259236
Unlike the exponential model, the Weibull model puts male status at the edge of significance with a negative coefficient. Low SES retains its significance along with a positive coefficient.
Here is a graphical representation of the Weibull model overlaid on the empirical data:
plot(fit.weibull,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)
Clearly, the Weibull model also does not resemble the empirical data in any reasonable way. This makes any main effects drawn from the Weibull model highly suspect, at best.
Next, I fit a lognormal model and compare it to the empirical data. Here are the main effects for the lognormal model:
#lognormal hazard function
fit.lognormal<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "lognormal",center=T)
summary(fit.lognormal)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "lognormal", center = T)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## (Intercept) 6.741 1.632 0.000
## MALE 0.476 -0.077 0.926 0.039 0.049
## PARENT_HS_OR_LOW 0.241 -0.028 0.972 0.051 0.578
## TV_ENTERTAIN_HRS 0.344 0.033 1.033 0.041 0.423
## BY_LOW_SES 0.485 0.116 1.123 0.045 0.011
##
## log(scale) 12.328 2.637 0.000
## log(shape) -1.140 0.116 0.000
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -10148
## LR test statistic 12.03
## Degrees of freedom 4
## Overall p-value 0.0171399
As we can see, the lognormal model's main effects are similar to those of the Weibull model. Male status remains a negative effect at the very edge of significance, while low SES continues to have a positive and significant effect.
Here, we graphically examine the lognormal model against the empirical data:
plot(fit.lognormal,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)
While the lognormal hazard rises faster in the early months than the Weibull hazard, it also levels off quickly and does not capture the shape of the empirical hazard.
Here, I'll fit a piecewise constant exponential model and compare it to our empirical data. The main effects for this model are as follows:
#piecewise constant exponential model
fit.piecewise<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "pch",cuts=c(5,10,15,20))
summary(fit.piecewise)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "pch", cuts = c(5,
## 10, 15, 20))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 -0.041 0.959 0.039 0.290
## PARENT_HS_OR_LOW 0.241 -0.012 0.988 0.051 0.811
## TV_ENTERTAIN_HRS 0.344 0.036 1.036 0.041 0.382
## BY_LOW_SES 0.485 0.236 1.266 0.045 0.000
##
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -9337.6
## LR test statistic 35.95
## Degrees of freedom 4
## Overall p-value 2.95916e-07
Notice that the significance of male status has evaporated, while low SES retains its positive and significant effect.
Here is the piecewise hazard curve plotted against the empirical data:
plot(fit.piecewise,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)
We may have found an (admittedly rough) approximation of the empirical data: the piecewise hazard function, cut into five-month brackets, seems to capture the variation in the data with reasonable accuracy.
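As a rough cross-check on that impression (illustrative only, and separate from the fitted model), the muhaz package's pehaz() function can estimate a purely data-driven piecewise exponential hazard with the same five-month bin width, which can then be plotted alongside the smoothed hazard:
#illustrative cross-check: data-driven piecewise exponential hazard with five-month bins
fit.haz.pe<-pehaz(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete2$F3_first_college_2_year,width=5)
plot(fit.haz.pe,main="Data-driven piecewise hazard (5-month bins)")
lines(fit.haz.sm,col=2,lwd=2)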
That being said, there are more formal ways to judge the fit of one model against another -- namely the Akaike Information Criterion (AIC), computed as -2 times the maximized log likelihood plus 2 times the number of estimated parameters, with lower values indicating a better fit.
In this section, I compute AIC statistics for each model (exponential, Weibull, lognormal, and piecewise):
#AIC = -2*log-likelihood + 2*(number of estimated coefficients)
AIC_fit.exp<- -2*fit.exp$loglik[2]+2*length(fit.exp$coefficients)
AIC_fit.weibull<- -2*fit.weibull$loglik[2]+2*length(fit.weibull$coefficients)
AIC_fit.lognormal<- -2*fit.lognormal$loglik[2]+2*length(fit.lognormal$coefficients)
AIC_fit.piecewise<- -2*fit.piecewise$loglik[2]+2*length(fit.piecewise$coefficients)
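As an aside, the same calculation can be wrapped in a small helper and applied over the fits in one step. Following the code above, the parameter count is simply the length of $coefficients, so (as an acknowledged simplification) the pch model's baseline hazard pieces are not counted:
#sketch: the same AIC computation wrapped in a helper
#(parameter count = length of $coefficients, mirroring the lines above)
aic_phreg<-function(fit) -2*fit$loglik[2]+2*length(fit$coefficients)
sapply(list(exponential=fit.exp,weibull=fit.weibull,lognormal=fit.lognormal,piecewise=fit.piecewise),aic_phreg)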
When collected, the AIC scores paint a strong picture of relative model fit:
AIC_FIT<-as.data.frame(rbind(AIC_fit.exp,AIC_fit.weibull,AIC_fit.lognormal,AIC_fit.piecewise))
colnames(AIC_FIT)<-c("AIC")
AIC_FIT
##                         AIC
## AIC_fit.exp        20437.06
## AIC_fit.weibull    20330.96
## AIC_fit.lognormal  20309.25
## AIC_fit.piecewise  18683.24
As we can see, the piecewise constant exponential model has a lower AIC score than the other models, suggesting a better overall fit to the empirical data. This result also matches our visual examination of the model fits.
Now that we have settled on a (relatively) accurate model, we can test interaction terms with the variables that we have selected. In this test, I'll be creating an interaction term between male status and low SES, as I surmise the effects of male status may be enhanced by the effect of poverty.
# creating a model with a MALE*LOW SES interaction term
fit.piecewise.interact<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES+(MALE*BY_LOW_SES),data=entry_variables_complete2,dist = "pch",cuts=c(5,10,15,20))
AIC_fit.piecewise.interact<- -2*fit.piecewise.interact$loglik[2]+2*length(fit.piecewise.interact$coefficients)
summary(fit.piecewise.interact)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES +
## (MALE * BY_LOW_SES), data = entry_variables_complete2,
## dist = "pch", cuts = c(5, 10, 15, 20))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 0.026 1.026 0.057 0.653
## PARENT_HS_OR_LOW 0.241 -0.014 0.986 0.051 0.789
## TV_ENTERTAIN_HRS 0.344 0.035 1.036 0.041 0.392
## BY_LOW_SES 0.485 0.296 1.344 0.058 0.000
## MALE:BY_LOW_SES            -0.128     0.880     0.079     0.103
##
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -9336.3
## LR test statistic 38.62
## Degrees of freedom 5
## Overall p-value 2.83434e-07
From this model, we see that the interaction between Male Status and Low SES is not significant. Low SES remains the only significant independent variable explaining variance in first-time entry to community college.
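For completeness, the coefficients can be combined by hand to see what the (non-significant) interaction would imply on the hazard-ratio scale. This sketch assumes the coefficient names match the summary above:
#sketch: implied hazard ratio for low-SES males relative to non-low-SES females,
#combining the two main effects with the interaction term
b<-fit.piecewise.interact$coefficients
exp(b["MALE"]+b["BY_LOW_SES"]+b["MALE:BY_LOW_SES"])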
We can also compute the AIC for the new interaction model to see whether it improves on the other models:
AIC_FIT2<-as.data.frame(rbind(AIC_fit.exp,AIC_fit.weibull,AIC_fit.lognormal,AIC_fit.piecewise,AIC_fit.piecewise.interact))
colnames(AIC_FIT2)<-c("AIC")
AIC_FIT2
##                                  AIC
## AIC_fit.exp                 20437.06
## AIC_fit.weibull             20330.96
## AIC_fit.lognormal           20309.25
## AIC_fit.piecewise           18683.24
## AIC_fit.piecewise.interact  18682.58
The piecewise model with the Male*Low SES interaction term has virtually the same AIC score as the plain piecewise model. When the AIC values are this close, it is probably best to select the less complicated piecewise model: it contains fewer parameters and its AIC is only marginally worse than that of the interaction model.
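Because the two piecewise models are nested, a likelihood ratio test offers one more check on this choice. The sketch below reuses the maximized log likelihoods stored in $loglik[2], as in the AIC code above, and should be broadly consistent with the Wald p-value for the interaction term reported earlier:
#sketch: likelihood ratio test of the interaction term (nested models, 1 df)
lr_stat<-2*(fit.piecewise.interact$loglik[2]-fit.piecewise$loglik[2])
pchisq(lr_stat,df=1,lower.tail=FALSE)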
From this exercise, we can draw the following conclusions:
A proportional hazards model is more appropriate for our data than an AFT model, because the proportional hazards model uses the hazard of the event as its dependent quantity -- which is exactly the kind of analysis I am interested in for the ELS:2002 dataset.
The piecewise constant exponential model is likely the most accurate parametric model for the data as presented -- this is supported by both visual examination and the Akaike Information Criterion.
Low SES is the only significant independent variable (among the four chosen variables) affecting the hazard of attending community college directly after high school.
This concludes my homework assignment for the week. Thank you!