In this homework assignment, I'll be creating various parametric models to estimate the effects of independent variables upon our outcome variable: first-time entry into community college after high school. This data is derived from the ELS:2002 dataset.
#loading packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(haven)
library(survival)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
library(ggplot2)
library(muhaz)
library(eha)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
#loading raw data
load("C:/Users/hhx360/Google Drive (naslund.utsa@gmail.com)/School Files/School/MS Applied Demography/semesters/Fall 2020/DEM 7223/ELS 2002/unzipped/els_02_12_byf3pststu_v1_0.rdata")
els<-els_02_12_byf3pststu_v1_0
rm(els_02_12_byf3pststu_v1_0)
#standardizing variable names
colnames(els)<-toupper(colnames(els))
#college entry variables
entry_variables<-els %>% select(STU_ID,SCH_ID,STRAT_ID,PSU,F2HS2P_P,F2PS1SEC,F3HS2PS1,F3PS1SEC,BYQXDATP,F1RDTLFT,BYSES1QU,F3QWT,BYP33,BYSEX,BYPARED)
entry_variables<-entry_variables %>% mutate(F2_months_to_college=ifelse(F2HS2P_P<0,NA_real_,F2HS2P_P))
entry_variables<-entry_variables %>% mutate(F3_months_to_college=ifelse(F3HS2PS1<0,NA_real_,F3HS2PS1))
entry_variables<-entry_variables %>% mutate(F2_first_college_2_year=ifelse(F2PS1SEC<0,NA_real_,
                                                                           ifelse(F2PS1SEC==4,1,0)))
entry_variables<-entry_variables %>% mutate(F3_first_college_2_year=ifelse(F3PS1SEC<0,NA_real_,
                                                                           ifelse(F3PS1SEC==4,1,0)))
entry_variables_complete<-entry_variables %>% filter(!is.na(F3_first_college_2_year))
entry_variables_complete<-entry_variables_complete %>% mutate(TV_ENTERTAIN_HRS=ifelse(BYP33<0,NA,
ifelse(BYP33>=0 & BYP33<=3,0,
ifelse(BYP33>=4 & BYP33<=10,1,NA))))
entry_variables_complete<-entry_variables_complete %>% mutate(sex=ifelse(BYSEX==1,"Male",
ifelse(BYSEX==2,"Female",
ifelse(BYSEX<0,NA,NA))))
entry_variables_complete<-entry_variables_complete %>% mutate(MALE=ifelse(sex=="Male",1,0))
entry_variables_complete<-entry_variables_complete %>% mutate(parents_highest_ed=ifelse(BYPARED==1,"Did.Not.Graduate.HS",
ifelse(BYPARED==2,"HS.Grad.or.GED",
ifelse(BYPARED==3,"2.Year.School.No.Deg",
ifelse(BYPARED==4,"2.Year.School.Awarded.Deg",
ifelse(BYPARED==5,"4.Year.School.No.Deg",
ifelse(BYPARED==6,"4.Year.School.Awarded.Deg",
ifelse(BYPARED==7,"Masters.Degree",
ifelse(BYPARED==8,"PhD.or.other.doctorate",
ifelse(BYPARED<0,NA,NA))))))))))
entry_variables_complete<-entry_variables_complete %>% mutate(PARENT_HS_OR_LOWER=ifelse(parents_highest_ed=="Did.Not.Graduate.HS" | parents_highest_ed=="HS.Grad.or.GED",1,0))
entry_variables_complete<-entry_variables_complete %>% mutate(F3_UPDATED_MONTHS_TO_COLLEGE=ifelse(is.na(F3_months_to_college),21,F3_months_to_college))
entry_variables_complete<-entry_variables_complete %>% mutate(BY_LOW_SES=ifelse(BYSES1QU==1,1,
ifelse(BYSES1QU==2,1,
ifelse(BYSES1QU==3,0,
ifelse(BYSES1QU==4,0,
ifelse(BYSES1QU<0,NA_real_,0))))))
entry_variables_complete<-entry_variables_complete %>% filter(!is.na(BY_LOW_SES) & !is.na(F3_UPDATED_MONTHS_TO_COLLEGE) & !is.na(PARENT_HS_OR_LOWER) & !is.na(MALE) & !is.na(TV_ENTERTAIN_HRS) & !is.na(F3_first_college_2_year))
For this assignment, I hypothesize that the following variables will affect our dependent variable: male status, parents' highest education of high school or less, watching four or more hours of TV/entertainment, and low baseline socioeconomic status. Summary statistics for each variable appear below.
#summary statistics for the analysis variables
#Male Status:
summary(entry_variables_complete$MALE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4543 1.0000 1.0000
#Parents' highest education: HS diploma/GED or less
summary(entry_variables_complete$PARENT_HS_OR_LOWER)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1836 0.0000 1.0000
#TV four hours or longer each week
summary(entry_variables_complete$TV_ENTERTAIN_HRS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.347 1.000 1.000
#Low SES (bottom two base-year SES quartiles)
summary(entry_variables_complete$BY_LOW_SES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3751 1.0000 1.0000
Here is the survival curve for community college first-time entry in the ELS:2002 dataset:
empirical_fx<-survfit(Surv(time=F3_UPDATED_MONTHS_TO_COLLEGE,event=F3_first_college_2_year)~1,data=entry_variables_complete)
plot(empirical_fx,conf.int = T,main="Survival F(x) - ELS:2002 time to CC Entry", ylab="S(t)",xlab="Months")
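To attach some numbers to this curve, the survfit object can also be summarized at a handful of time points. This is just an illustrative check of the estimated survival probabilities; the chosen months are arbitrary:
#quick illustrative check: survival probabilities at selected months since high school
summary(empirical_fx,times=c(3,6,12,21))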
Because some students began community college in "month zero" (that is, immediately upon graduating high school), we add a small positive offset (0.1 of a month) to those zero times so they can be included in the hazard calculations:
#adding a very small amount to zero month observations
entry_variables_complete2<-entry_variables_complete
entry_variables_complete2<-entry_variables_complete2 %>% mutate(F3_UPDATED_MONTHS_TO_COLLEGE=ifelse(F3_UPDATED_MONTHS_TO_COLLEGE==0,0.1,F3_UPDATED_MONTHS_TO_COLLEGE))
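As a quick sanity check (purely illustrative), we can confirm that no zero-month times remain after the adjustment:
#sanity check: the minimum time should now be 0.1 rather than 0
min(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE)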
For these models, we will use a proportional hazards model rather than an accelerated failure time (AFT) model. The proportional hazards model is more appropriate here because it models how the covariates change the hazard of the event (in this case, first-time entry into community college after high school), whereas an AFT model instead describes how the covariates stretch or compress the time until the event.
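For reference only, the AFT counterpart of these models could be fit in eha with aftreg() using the same formula; the sketch below assumes a Weibull baseline and is not used in the analysis that follows.
#illustrative sketch: an AFT analogue of the models fit below (Weibull baseline assumed)
fit.aft<-aftreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist="weibull")
summary(fit.aft)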
Here is the hazard plot of first time entry into community college after high school for students in the ELS:2002 dataset, along with a smoothed hazard plot interpolated on top:
fit.haz.km<-kphaz.fit(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete2$F3_first_college_2_year,method="product-limit")
fit.haz.sm<-muhaz(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete2$F3_first_college_2_year,min.time = 0,max.time = 21)
kphaz.plot(fit.haz.km,main="Hazard Plot - Attending CC")
lines(fit.haz.sm,col=2,lwd=3)
legend("topleft",legend=c("KM Hazard","Smoothed Hazard"),col=c(1,2),lty=c(1,1))
Now, I will fit several parametric models (exponential, Weibull, lognormal, and piecewise constant) to see how well they fit our empirical data.
First, we fit an exponential model and compare it to the empirical data. Here are the main effects:
#let's fit various models to see how well they fit
#exponential model
fit.exp<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "weibull",shape=1)
summary(fit.exp)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "weibull", shape = 1)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 -0.062 0.940 0.039 0.116
## PARENT_HS_OR_LOW 0.241 -0.018 0.982 0.051 0.728
## TV_ENTERTAIN_HRS 0.344 0.036 1.037 0.041 0.380
## BY_LOW_SES 0.485 0.178 1.195 0.045 0.000
##
## log(scale) 2.950 0.038 0.000
##
## Shape is fixed at 1
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -10214
## LR test statistic 22.69
## Degrees of freedom 4
## Overall p-value 0.000145763
The model suggests that of our chosen covariates, only Low SES has a significant effect upon first-time entry into community college after HS.
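To make these effects easier to read, hazard ratios with rough 95% confidence intervals can be tabulated directly from the fitted object. This is only a sketch; it assumes the phreg fit stores its estimates in $coefficients and their covariance matrix in $var (rows for baseline parameters such as log(scale) will also appear and can be ignored):
#sketch: hazard ratios and approximate 95% CIs from a phreg fit
#(assumes $coefficients and $var are available on the fitted object)
hr_table<-function(fit){
  b<-fit$coefficients
  se<-sqrt(diag(fit$var))
  data.frame(HR=exp(b),lower95=exp(b-1.96*se),upper95=exp(b+1.96*se))
}
hr_table(fit.exp)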
Here we see the exponential model compared to the empirical data in graphical format:
plot(fit.exp,fn="haz",ylim =c(0,0.2),main = "Exponential")
lines(fit.haz.sm,col=2)
Obviously, the exponential model leaves something to be desired. It does not resemble the empirical data at all.
Now, we will test a Weibull model and compare it to the empirical data. Here are the main effects for Weibull:
#Weibull model, shape parameter estimated rather than fixed
fit.weibull<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "weibull")
summary(fit.weibull)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "weibull")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 -0.079 0.924 0.039 0.045
## PARENT_HS_OR_LOW 0.241 -0.031 0.970 0.051 0.550
## TV_ENTERTAIN_HRS 0.344 0.032 1.033 0.041 0.435
## BY_LOW_SES 0.485 0.108 1.114 0.045 0.018
##
## log(scale) 2.813 0.034 0.000
## log(shape) 0.149 0.014 0.000
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -10159
## LR test statistic 11.06
## Degrees of freedom 4
## Overall p-value 0.0259236
Unlike the exponential model, the Weibull model puts male status at the edge of significance with a negative coefficient. Low SES retains its significance along with a positive coefficient.
Here is a graphical representation of the Weibull model overlaid on the empirical data:
plot(fit.weibull,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)
Clearly, the Weibull model also does not resemble the empirical data in any reasonable way. This makes any main effects drawn from the Weibull model highly suspect, at best.
Next, I fit a lognormal model and compare it to the empirical data. Here are the main effects for the lognormal model:
#lognormal hazard function
fit.lognormal<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "lognormal",center=T)
summary(fit.lognormal)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "lognormal", center = T)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## (Intercept) 6.741 1.632 0.000
## MALE 0.476 -0.077 0.926 0.039 0.049
## PARENT_HS_OR_LOW 0.241 -0.028 0.972 0.051 0.578
## TV_ENTERTAIN_HRS 0.344 0.033 1.033 0.041 0.423
## BY_LOW_SES 0.485 0.116 1.123 0.045 0.011
##
## log(scale) 12.328 2.637 0.000
## log(shape) -1.140 0.116 0.000
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -10148
## LR test statistic 12.03
## Degrees of freedom 4
## Overall p-value 0.0171399
As we can see, the lognormal model's main effects are similar to those of the Weibull model. Male status remains a negative effect at the very edge of significance, while low SES continues to have a positive and significant effect.
Here, we graphically examine the lognormal model against the empirical data:
plot(fit.lognormal,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)
While the lognormal hazard rises faster in the early months than the Weibull hazard, it also levels off quickly and does not capture the shape of the empirical hazard.
Here, I'll fit a piecewise constant exponential model and compare it to our empirical data. The main effects for this model are as follows:
#piecewise constant exponential model
fit.piecewise<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES,data=entry_variables_complete2,dist = "pch",cuts=c(5,10,15,20))
summary(fit.piecewise)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES,
## data = entry_variables_complete2, dist = "pch", cuts = c(5,
## 10, 15, 20))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 -0.041 0.959 0.039 0.290
## PARENT_HS_OR_LOW 0.241 -0.012 0.988 0.051 0.811
## TV_ENTERTAIN_HRS 0.344 0.036 1.036 0.041 0.382
## BY_LOW_SES 0.485 0.236 1.266 0.045 0.000
##
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -9337.6
## LR test statistic 35.95
## Degrees of freedom 4
## Overall p-value 2.95916e-07
Notice that the significance of male status has evaporated, while low SES retains its positive and significant effect.
Here is the piecewise hazard curve plotted against the empirical data:
plot(fit.piecewise,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)
We may have found an (admittedly rough) approximation of the empirical data: the piecewise hazard function, cut into five-month brackets, seems to capture the variation in the data with reasonable accuracy.
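As a rough cross-check on that impression (illustrative only, and separate from the fitted model), the muhaz package's pehaz() function can estimate a purely data-driven piecewise exponential hazard with the same five-month bin width, which can then be plotted alongside the smoothed hazard:
#illustrative cross-check: data-driven piecewise exponential hazard with five-month bins
fit.haz.pe<-pehaz(entry_variables_complete2$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete2$F3_first_college_2_year,width=5)
plot(fit.haz.pe,main="Data-driven piecewise hazard (5-month bins)")
lines(fit.haz.sm,col=2,lwd=2)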
That being said, there are more formal ways to judge the fit of one model against another -- namely the Akaike Information Criterion (AIC), computed as -2 times the maximized log likelihood plus 2 times the number of estimated parameters, with lower values indicating a better fit.
In this section, I compute AIC statistics for each model (exponential, Weibull, lognormal, and piecewise):
#AIC = -2*log-likelihood + 2*(number of estimated coefficients)
AIC_fit.exp<- -2*fit.exp$loglik[2]+2*length(fit.exp$coefficients)
AIC_fit.weibull<- -2*fit.weibull$loglik[2]+2*length(fit.weibull$coefficients)
AIC_fit.lognormal<- -2*fit.lognormal$loglik[2]+2*length(fit.lognormal$coefficients)
AIC_fit.piecewise<- -2*fit.piecewise$loglik[2]+2*length(fit.piecewise$coefficients)
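As an aside, the same calculation can be wrapped in a small helper and applied over the fits in one step. Following the code above, the parameter count is simply the length of $coefficients, so (as an acknowledged simplification) the pch model's baseline hazard pieces are not counted:
#sketch: the same AIC computation wrapped in a helper
#(parameter count = length of $coefficients, mirroring the lines above)
aic_phreg<-function(fit) -2*fit$loglik[2]+2*length(fit$coefficients)
sapply(list(exponential=fit.exp,weibull=fit.weibull,lognormal=fit.lognormal,piecewise=fit.piecewise),aic_phreg)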
When collected, the AIC scores paint a strong picture of relative model fit:
AIC_FIT<-as.data.frame(rbind(AIC_fit.exp,AIC_fit.weibull,AIC_fit.lognormal,AIC_fit.piecewise))
colnames(AIC_FIT)<-c("AIC")
AIC_FIT
##                         AIC
## AIC_fit.exp        20437.06
## AIC_fit.weibull    20330.96
## AIC_fit.lognormal  20309.25
## AIC_fit.piecewise  18683.24
As we can see, the piecewise constant exponential model has a lower AIC score than the other models, suggesting a better overall fit to the empirical data. This result also matches our visual examination of the model fits.
Now that we have settled on a (relatively) accurate model, we can test interaction terms with the variables that we have selected. In this test, I'll be creating an interaction term between male status and low SES, as I surmise the effects of male status may be enhanced by the effect of poverty.
# creating a model with a MALE*LOW SES interaction term
fit.piecewise.interact<-phreg(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~MALE+PARENT_HS_OR_LOWER+TV_ENTERTAIN_HRS+BY_LOW_SES+(MALE*BY_LOW_SES),data=entry_variables_complete2,dist = "pch",cuts=c(5,10,15,20))
AIC_fit.piecewise.interact<- -2*fit.piecewise.interact$loglik[2]+2*length(fit.piecewise.interact$coefficients)
summary(fit.piecewise.interact)
## Call:
## phreg(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## MALE + PARENT_HS_OR_LOWER + TV_ENTERTAIN_HRS + BY_LOW_SES +
## (MALE * BY_LOW_SES), data = entry_variables_complete2,
## dist = "pch", cuts = c(5, 10, 15, 20))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## MALE 0.476 0.026 1.026 0.057 0.653
## PARENT_HS_OR_LOW 0.241 -0.014 0.986 0.051 0.789
## TV_ENTERTAIN_HRS 0.344 0.035 1.036 0.041 0.392
## BY_LOW_SES 0.485 0.296 1.344 0.058 0.000
## MALE:BY_LOW_SES            -0.128     0.880     0.079     0.103
##
##
## Events 2635
## Total time at risk 46960
## Max. log. likelihood -9336.3
## LR test statistic 38.62
## Degrees of freedom 5
## Overall p-value 2.83434e-07
From this model, we see that the interaction between Male Status and Low SES is not significant. Low SES remains the only significant independent variable explaining variance in first-time entry to community college.
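For completeness, the coefficients can be combined by hand to see what the (non-significant) interaction would imply on the hazard-ratio scale. This sketch assumes the coefficient names match the summary above:
#sketch: implied hazard ratio for low-SES males relative to non-low-SES females,
#combining the two main effects with the interaction term
b<-fit.piecewise.interact$coefficients
exp(b["MALE"]+b["BY_LOW_SES"]+b["MALE:BY_LOW_SES"])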
We can also compute the AIC for the new interaction model to see whether it improves on the other models:
AIC_FIT2<-as.data.frame(rbind(AIC_fit.exp,AIC_fit.weibull,AIC_fit.lognormal,AIC_fit.piecewise,AIC_fit.piecewise.interact))
colnames(AIC_FIT2)<-c("AIC")
AIC_FIT2
##                                  AIC
## AIC_fit.exp                 20437.06
## AIC_fit.weibull             20330.96
## AIC_fit.lognormal           20309.25
## AIC_fit.piecewise           18683.24
## AIC_fit.piecewise.interact  18682.58
The piecewise model with the Male*Low SES interaction term has virtually the same AIC score as the plain piecewise model. When the AIC values are this close, it is probably best to select the less complicated piecewise model: it contains fewer parameters and its AIC is only marginally worse than that of the interaction model.
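Because the two piecewise models are nested, a likelihood ratio test offers one more check on this choice. The sketch below reuses the maximized log likelihoods stored in $loglik[2], as in the AIC code above, and should be broadly consistent with the Wald p-value for the interaction term reported earlier:
#sketch: likelihood ratio test of the interaction term (nested models, 1 df)
lr_stat<-2*(fit.piecewise.interact$loglik[2]-fit.piecewise$loglik[2])
pchisq(lr_stat,df=1,lower.tail=FALSE)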
From this exercise, we can draw the following conclusions:
A proportional hazards model is more appropriate for our data than an AFT model, because the proportional hazards model uses the hazard of the event as its dependent quantity -- which is exactly the kind of analysis I am interested in for the ELS:2002 dataset.
The piecewise constant exponential model is likely the most accurate parametric model for the data as presented -- this is supported by both visual examination and the Akaike Information Criterion.
Low SES is the only significant independent variable (among the four chosen variables) affecting the hazard of attending community college directly after high school.
This concludes my homework assignment for the week. Thank you!