For homework #1 in DEM 7223, I will be performing a survival analysis on ELS:2002 data, specifically measuring the survival time for entry into community college after high school.

#loading packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(haven)
library(survival)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
library(ggplot2)
library(muhaz)

#loading raw data

load("C:/Users/hhx360/Google Drive (naslund.utsa@gmail.com)/School Files/School/MS Applied Demography/semesters/Fall 2020/DEM 7223/ELS 2002/unzipped/els_02_12_byf3pststu_v1_0.rdata")

els<-els_02_12_byf3pststu_v1_0

rm(els_02_12_byf3pststu_v1_0)

#standardizing variable names

colnames(els)<-toupper(colnames(els))

#college entry variables

entry_variables<-els %>% select(STU_ID,SCH_ID,STRAT_ID,PSU,F2HS2P_P,F2PS1SEC,F3HS2PS1,F3PS1SEC,BYQXDATP,F1RDTLFT,BYSES1QU,F3QWT)

entry_variables<-entry_variables %>% mutate(F2_months_to_college=ifelse(F2HS2P_P<0,NA_real_,F2HS2P_P))

entry_variables<-entry_variables %>% mutate(F3_months_to_college=ifelse(F3HS2PS1<0,NA_real_,F3HS2PS1))

entry_variables<-entry_variables %>% mutate(F2_first_college_2_year=ifelse(F2PS1SEC==4,1,
                                                                        ifelse(F2PS1SEC<0,NA_real_,
                                                                               ifelse(F2PS1SEC<!0 & F2PS1SEC!=4,0,0))))

entry_variables<-entry_variables %>% mutate(F3_first_college_2_year=ifelse(F3PS1SEC==4,1,
                                                                        ifelse(F3PS1SEC<0,NA_real_,
                                                                               ifelse(F3PS1SEC<!0 & F3PS1SEC!=4,0,0))))

entry_variables_complete<-entry_variables %>% filter(!is.na(F3_first_college_2_year))

entry_variables_complete<-entry_variables_complete %>% mutate(F3_UPDATED_MONTHS_TO_COLLEGE=ifelse(is.na(F3_months_to_college),21,F3_months_to_college))

entry_variables_complete<-entry_variables_complete %>% mutate(BY_LOW_SES=ifelse(BYSES1QU==1,1,
                                                                                ifelse(BYSES1QU==2,1,
                                                                                       ifelse(BYSES1QU==3,0,
                                                                                              ifelse(BYSES1QU==4,0,
                                                                                                     ifelse(BYSES1QU<0,NA_real_,0))))))

entry_variables_complete<-entry_variables_complete %>% filter(!is.na(BY_LOW_SES))

Event Variable

Our event variable for this exercise is "F3_first_college_2_year" -- this is the variable measured in the third follow up (F3) if the first college the student has attended was a public 2-year community college. It registers "1" if the student's first college out of high school was a public 2-year college, and "0" if not. There were 10716 students in total who provided data regarding their first college after high school (or the lack thereof) and their family's base year socioeconmic status (more on that later).

table(entry_variables_complete$F3_first_college_2_year)
## 
##    0    1 
## 7220 3496

Duration or Time Variable

Our duration or time variable is "F3_UPDATED_MONTHS_TO_COLLEGE" - defined as months from high school graduation to entry into a public 2-year college, or to the censoring time (21 months). No months were recorded after 21 months, regardless of how long it took the student to enter community college, or if they ever entered community college at all. This data was also gathered during the third followup (F3).

summary(entry_variables_complete$F3_UPDATED_MONTHS_TO_COLLEGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   5.982   4.000  21.000

Censoring Indicator

As stated above, the censoring indicator for this data is the event variable at 21 months. An event variable of "0" at 21 months indicates that the student did not enroll at a community college by the time of the third follow-up. An event variable of "0" and a time variable of less than 21 months indicates that the student first enrolled in some college within 21 months of high school exit, but that it was not a 2-year college.

Estimate Survival Function and Interpretation

Here is the estimated Kaplan-Meier survival analysis of the student's risk of community college enrollment:

cc_entry<-survfit(Surv(time=F3_UPDATED_MONTHS_TO_COLLEGE,event=F3_first_college_2_year)~1,data=entry_variables_complete)

summary(cc_entry)
## Call: survfit(formula = Surv(time = F3_UPDATED_MONTHS_TO_COLLEGE, event = F3_first_college_2_year) ~ 
##     1, data = entry_variables_complete)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0  10716     124    0.988 0.00103        0.986        0.990
##     1  10465     112    0.978 0.00143        0.975        0.981
##     2  10186     609    0.919 0.00266        0.914        0.925
##     3   7817    1176    0.781 0.00435        0.773        0.790
##     4   3234     140    0.747 0.00501        0.737        0.757
##     5   2600      19    0.742 0.00513        0.732        0.752
##     6   2540      20    0.736 0.00525        0.726        0.746
##     7   2494     106    0.705 0.00584        0.693        0.716
##     8   2314     130    0.665 0.00647        0.653        0.678
##     9   2090      24    0.657 0.00658        0.645        0.670
##    10   2039      13    0.653 0.00664        0.640        0.666
##    11   2012      12    0.649 0.00669        0.636        0.663
##    12   1987      18    0.643 0.00677        0.630        0.657
##    13   1946      25    0.635 0.00688        0.622        0.649
##    14   1901      70    0.612 0.00718        0.598        0.626
##    15   1778     139    0.564 0.00768        0.549        0.579
##    16   1555      29    0.553 0.00778        0.538        0.569
##    17   1488       9    0.550 0.00781        0.535        0.566
##    18   1468      10    0.546 0.00785        0.531        0.562
##    19   1446      28    0.536 0.00795        0.520        0.552
##    20   1397      34    0.523 0.00806        0.507        0.539
##    21   1344     649    0.270 0.00826        0.255        0.287

As we can see, survival starts out strong but takes a dramatic dip in month "3" after high school exit, going from .92 in month "2" to .78 in month "3".

After the third month, survival gradually decreases until the censoring date. This is more easily visualized in the following plot:

ggsurvplot(cc_entry,conf.int=T,risk.table=T,title="Survival to 2-year CC Entry")

Estimate Hazard Function

Here, we will use the kphaz.fit function to find the hazard function of the above dataset:

data_haz<-kphaz.fit(time=entry_variables_complete$F3_UPDATED_MONTHS_TO_COLLEGE,status=entry_variables_complete$F3_first_college_2_year,method="product-limit")
data.frame(data_haz)
##    time         haz          var
## 1   0.5 0.010842127 1.049629e-06
## 2   1.5 0.067718581 7.561816e-06
## 3   2.5 0.215776091 4.193655e-05
## 4   3.5 0.047344972 1.605839e-05
## 5   4.5 0.007378540 2.865521e-06
## 6   5.5 0.007941931 3.153802e-06
## 7   6.5 0.044073284 1.833295e-05
## 8   7.5 0.059001037 2.679990e-05
## 9   8.5 0.011624716 5.630858e-06
## 10  9.5 0.006427032 3.177479e-06
## 11 10.5 0.006002574 3.002611e-06
## 12 11.5 0.009162544 4.664228e-06
## 13 12.5 0.013017458 6.778458e-06
## 14 13.5 0.038062555 2.070388e-05
## 15 14.5 0.082896154 4.950193e-05
## 16 15.5 0.019103787 1.258602e-05
## 17 16.5 0.006093329 4.125492e-06
## 18 17.5 0.006869238 4.718755e-06
## 19 18.5 0.019654878 1.379821e-05
## 20 19.5 0.024851992 1.816714e-05
## 21 20.5 2.752627611          NaN

We notice the same pattern here, although in the inverse of the survival plot: hazard peaks around the third month after high school exit, and remains somewhat stable until the censoring point.

This function can be represented in a plot, such as this:

kphaz.plot(data_haz,main="Hazard Function Plot for CC Entry")

This hazard plot demonstrates that there is a clear increase in hazard during the third month after high school exit, followed by a smaller but still noticeable increase in hazard around the fifteenth month after high school exit.

Estimate Cumulative Hazard Function

Now we can use the above hazard function to calculate the cumulative hazard function, like so:

plot(cumsum(data_haz$haz)~data_haz$time,
     main="Cumulative Hazard Function for CC Entry",
     ylab="H(t)",
     xlab="Time in Months",
     type="l",col=2)

From this graph, it appears that the cumulative risk of community college entry rises gradually over time, with noticeable upward movement in the third and (to a lesser extent) the fifteenth month after high school exit. It seems reasonable that these are two time points in which high school graduates face the daunting prospects of facing adulthood without a post-secondary education, and decide for themselves to attempt attending college.

Define categorical grouping variable and hypothesis

I have selected a categorical grouping variable ("BY_LOW_SES") which will register "1" if the student's household is in the bottom two quartiles of socioeconomic status during the base year survey, and "0" if their household is in the top two quartiles during the same period. I expect that students in the bottom two quartiles ("1") will be more likely to attend community college in the first 21 months after high school, if only because community college is frequently a much more inexpensive route to post-secondary education than a traditional university, and therefore more appealing to low-SES students.

low_ses<-table(entry_variables_complete$BY_LOW_SES)
low_ses
## 
##    0    1 
## 6383 4333

Here is the continuation of the Kaplan-Meier survival analysis of our outcome (LOW SES) by months to first 2-year community college entry:

SES_fit<-survfit(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~BY_LOW_SES,data=entry_variables_complete)
SES_fit
## Call: survfit(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~ 
##     BY_LOW_SES, data = entry_variables_complete)
## 
##                 n events median 0.95LCL 0.95UCL
## BY_LOW_SES=0 6383   1580     21      21      21
## BY_LOW_SES=1 4333   1916     19      16      21

As you can see, there appears to be some difference between students from low SES ("1") and students from high SES ("0"). These differences can be further seen in a plot of this same model:

ggsurvplot(SES_fit,conf.int = T,title="Survival Function for CC Entry by Low vs. High SES")

As we can see, the hypothesis appears to be correct: low SES students in the base-year appear to have smaller survival probabilities in all months after high school exit than students who come from high SES households.

Kaplan-Meier Survival Analysis Across SES groups

Here, I have run the Kaplan-Meier Survival Analysis across both SES groups:

summary(SES_fit)
## Call: survfit(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~ 
##     BY_LOW_SES, data = entry_variables_complete)
## 
##                 BY_LOW_SES=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   6383      43    0.993 0.00102        0.991        0.995
##     1   6286      62    0.983 0.00160        0.980        0.987
##     2   6135     300    0.935 0.00311        0.929        0.941
##     3   4542     573    0.817 0.00535        0.807        0.828
##     4   1459      70    0.778 0.00684        0.765        0.792
##     5   1052       8    0.772 0.00710        0.758        0.786
##     6   1023       8    0.766 0.00736        0.752        0.781
##     7   1005      46    0.731 0.00865        0.714        0.748
##     8    925      44    0.696 0.00970        0.678        0.716
##     9    833      12    0.686 0.00998        0.667        0.706
##    10    813       8    0.680 0.01017        0.660        0.700
##    11    801       3    0.677 0.01023        0.657        0.697
##    12    796      10    0.669 0.01045        0.648        0.689
##    13    778      13    0.657 0.01073        0.637        0.679
##    14    754      24    0.636 0.01120        0.615        0.659
##    15    702      45    0.596 0.01202        0.573        0.620
##    16    611      12    0.584 0.01225        0.560        0.608
##    17    584       2    0.582 0.01229        0.558        0.607
##    18    579       6    0.576 0.01241        0.552        0.601
##    19    571      12    0.564 0.01263        0.540        0.589
##    20    550      15    0.548 0.01289        0.524        0.574
##    21    526     264    0.273 0.01357        0.248        0.301
## 
##                 BY_LOW_SES=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   4333      81    0.981 0.00206        0.977        0.985
##     1   4179      50    0.970 0.00262        0.964        0.975
##     2   4051     309    0.896 0.00471        0.886        0.905
##     3   3275     603    0.731 0.00718        0.717        0.745
##     4   1775      70    0.702 0.00768        0.687        0.717
##     5   1548      11    0.697 0.00777        0.682        0.712
##     6   1517      12    0.691 0.00787        0.676        0.707
##     7   1489      60    0.664 0.00834        0.647        0.680
##     8   1389      86    0.622 0.00892        0.605        0.640
##     9   1257      12    0.617 0.00900        0.599        0.634
##    10   1226       5    0.614 0.00903        0.597        0.632
##    11   1211       9    0.609 0.00909        0.592        0.628
##    12   1191       8    0.605 0.00914        0.588        0.624
##    13   1168      12    0.599 0.00922        0.581        0.617
##    14   1147      46    0.575 0.00951        0.557        0.594
##    15   1076      94    0.525 0.00999        0.506        0.545
##    16    944      17    0.515 0.01007        0.496        0.536
##    17    904       7    0.511 0.01011        0.492        0.532
##    18    889       4    0.509 0.01013        0.490        0.529
##    19    875      16    0.500 0.01020        0.480        0.520
##    20    847      19    0.489 0.01029        0.469        0.509
##    21    818     385    0.259 0.01012        0.240        0.279

Initial survival probabilities are relatively similar, but diverge quickly and dramatically after the third month.

Here, I have continued the Kaplan-Meier Survival Analysis across both SES groups:

chi_sq<-survdiff(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~BY_LOW_SES,data=entry_variables_complete)
survdiff(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~BY_LOW_SES,data=entry_variables_complete)
## Call:
## survdiff(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~ 
##     BY_LOW_SES, data = entry_variables_complete)
## 
##                 N Observed Expected (O-E)^2/E (O-E)^2/V
## BY_LOW_SES=0 6383     1580     1782      22.8      57.4
## BY_LOW_SES=1 4333     1916     1714      23.7      57.4
## 
##  Chisq= 57.4  on 1 degrees of freedom, p= 4e-14

We can see clearly that the survival rates of low SES vs. high SES are quite different, especially with a p-value of 3.537058210^{-14} -- which is to say, a p-value of almost but not-quite "zero".

Hazard Analysis of BY_LOW_SES strata

Now I will run a hazard analysis against the data with the BY_LOW_SES variable:

low_ses_haz<-kphaz.fit(entry_variables_complete$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete$F3_first_college_2_year,entry_variables_complete$BY_LOW_SES)

low_ses_haz
## $time
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5
## [16] 15.5 16.5 17.5 18.5 19.5 20.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5
## [31]  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5 20.5
## 
## $haz
##  [1] 0.009977496 0.056445467 0.201391148 0.054179409 0.007696298 0.007891735
##  [7] 0.047515962 0.049858523 0.014568319 0.009918065 0.003757838 0.012681391
## [13] 0.016979527 0.032963481 0.067831212 0.020107957 0.003436467 0.010437949
## [19] 0.021320092 0.027861019 2.901372476 0.012143404 0.084824233 0.242453622
## [25] 0.042144195 0.007158715 0.007969650 0.041725948 0.065062659 0.009664183
## [31] 0.004107141 0.007485584 0.006807899 0.010382360 0.041402986 0.092732177
## [37] 0.018427385 0.007808983 0.004531400 0.018545395 0.022880183 2.501165359
## 
## $var
##  [1] 1.605727e-06 1.068113e-05 7.881494e-05 4.219889e-05 7.404405e-06
##  [6] 7.785174e-06 4.910681e-05 5.654541e-05 1.768729e-05 1.229625e-05
## [11] 4.707127e-06 1.608265e-05 2.217911e-05 4.529053e-05 1.023828e-04
## [16] 3.369725e-05 5.904721e-06 1.815871e-05 3.788341e-05 5.175679e-05
## [21] 2.901289e-01 2.949459e-06 2.335443e-05 1.003230e-04 2.540775e-05
## [26] 4.658982e-06 5.293076e-06 2.902859e-05 4.926253e-05 7.783361e-06
## [31] 3.373743e-06 6.226113e-06 5.793532e-06 8.982991e-06 3.727820e-05
## [36] 9.160412e-05 1.997694e-05 8.711702e-06 5.133516e-06 2.149734e-05
## [41] 2.755479e-05 1.205704e-01
## 
## $strata
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1

In the above data, we clearly see that there are 21 entries in the first strata (BY_LOW_SES=="0") and 21 in the second strata (BY_LOW_SES=="1"). With this knowledge, we can create a hazard plot of the two strata, as follows:

plot(y=low_ses_haz$haz[1:21],x=low_ses_haz$time[1:21],col=1,lty=1,type="s")
lines(y=low_ses_haz$haz[22:42],x=low_ses_haz$time[22:42],col=2,lty=1,type="s")

Once hazard functions for each strata are plotted, we see that low SES students (the red line) do have slightly higher risk levels in esentially the same places that high SES students have: three and fifteen months, respectively.

This concludes my homework for this week. Thank you!