For homework #1 in DEM 7223, I will be performing a survival analysis on ELS:2002 data, specifically measuring the survival time for entry into community college after high school.
#loading packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(haven)
library(survival)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
library(ggplot2)
library(muhaz)
#loading raw data
load("C:/Users/hhx360/Google Drive (naslund.utsa@gmail.com)/School Files/School/MS Applied Demography/semesters/Fall 2020/DEM 7223/ELS 2002/unzipped/els_02_12_byf3pststu_v1_0.rdata")
els<-els_02_12_byf3pststu_v1_0
rm(els_02_12_byf3pststu_v1_0)
#standardizing variable names
colnames(els)<-toupper(colnames(els))
#college entry variables
entry_variables<-els %>% select(STU_ID,SCH_ID,STRAT_ID,PSU,F2HS2P_P,F2PS1SEC,F3HS2PS1,F3PS1SEC,BYQXDATP,F1RDTLFT,BYSES1QU,F3QWT)
entry_variables<-entry_variables %>% mutate(F2_months_to_college=ifelse(F2HS2P_P<0,NA_real_,F2HS2P_P))
entry_variables<-entry_variables %>% mutate(F3_months_to_college=ifelse(F3HS2PS1<0,NA_real_,F3HS2PS1))
entry_variables<-entry_variables %>% mutate(F2_first_college_2_year=ifelse(F2PS1SEC==4,1,
ifelse(F2PS1SEC<0,NA_real_,
ifelse(F2PS1SEC<!0 & F2PS1SEC!=4,0,0))))
entry_variables<-entry_variables %>% mutate(F3_first_college_2_year=ifelse(F3PS1SEC==4,1,
ifelse(F3PS1SEC<0,NA_real_,
ifelse(F3PS1SEC<!0 & F3PS1SEC!=4,0,0))))
entry_variables_complete<-entry_variables %>% filter(!is.na(F3_first_college_2_year))
entry_variables_complete<-entry_variables_complete %>% mutate(F3_UPDATED_MONTHS_TO_COLLEGE=ifelse(is.na(F3_months_to_college),21,F3_months_to_college))
entry_variables_complete<-entry_variables_complete %>% mutate(BY_LOW_SES=ifelse(BYSES1QU==1,1,
ifelse(BYSES1QU==2,1,
ifelse(BYSES1QU==3,0,
ifelse(BYSES1QU==4,0,
ifelse(BYSES1QU<0,NA_real_,0))))))
entry_variables_complete<-entry_variables_complete %>% filter(!is.na(BY_LOW_SES))
Our event variable for this exercise is "F3_first_college_2_year" -- this is the variable measured in the third follow up (F3) if the first college the student has attended was a public 2-year community college. It registers "1" if the student's first college out of high school was a public 2-year college, and "0" if not. There were 10716 students in total who provided data regarding their first college after high school (or the lack thereof) and their family's base year socioeconmic status (more on that later).
table(entry_variables_complete$F3_first_college_2_year)
##
## 0 1
## 7220 3496
Our duration or time variable is "F3_UPDATED_MONTHS_TO_COLLEGE" - defined as months from high school graduation to entry into a public 2-year college, or to the censoring time (21 months). No months were recorded after 21 months, regardless of how long it took the student to enter community college, or if they ever entered community college at all. This data was also gathered during the third followup (F3).
summary(entry_variables_complete$F3_UPDATED_MONTHS_TO_COLLEGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 5.982 4.000 21.000
As stated above, the censoring indicator for this data is the event variable at 21 months. An event variable of "0" at 21 months indicates that the student did not enroll at a community college by the time of the third follow-up. An event variable of "0" and a time variable of less than 21 months indicates that the student first enrolled in some college within 21 months of high school exit, but that it was not a 2-year college.
Here is the estimated Kaplan-Meier survival analysis of the student's risk of community college enrollment:
cc_entry<-survfit(Surv(time=F3_UPDATED_MONTHS_TO_COLLEGE,event=F3_first_college_2_year)~1,data=entry_variables_complete)
summary(cc_entry)
## Call: survfit(formula = Surv(time = F3_UPDATED_MONTHS_TO_COLLEGE, event = F3_first_college_2_year) ~
## 1, data = entry_variables_complete)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 10716 124 0.988 0.00103 0.986 0.990
## 1 10465 112 0.978 0.00143 0.975 0.981
## 2 10186 609 0.919 0.00266 0.914 0.925
## 3 7817 1176 0.781 0.00435 0.773 0.790
## 4 3234 140 0.747 0.00501 0.737 0.757
## 5 2600 19 0.742 0.00513 0.732 0.752
## 6 2540 20 0.736 0.00525 0.726 0.746
## 7 2494 106 0.705 0.00584 0.693 0.716
## 8 2314 130 0.665 0.00647 0.653 0.678
## 9 2090 24 0.657 0.00658 0.645 0.670
## 10 2039 13 0.653 0.00664 0.640 0.666
## 11 2012 12 0.649 0.00669 0.636 0.663
## 12 1987 18 0.643 0.00677 0.630 0.657
## 13 1946 25 0.635 0.00688 0.622 0.649
## 14 1901 70 0.612 0.00718 0.598 0.626
## 15 1778 139 0.564 0.00768 0.549 0.579
## 16 1555 29 0.553 0.00778 0.538 0.569
## 17 1488 9 0.550 0.00781 0.535 0.566
## 18 1468 10 0.546 0.00785 0.531 0.562
## 19 1446 28 0.536 0.00795 0.520 0.552
## 20 1397 34 0.523 0.00806 0.507 0.539
## 21 1344 649 0.270 0.00826 0.255 0.287
As we can see, survival starts out strong but takes a dramatic dip in month "3" after high school exit, going from .92 in month "2" to .78 in month "3".
After the third month, survival gradually decreases until the censoring date. This is more easily visualized in the following plot:
ggsurvplot(cc_entry,conf.int=T,risk.table=T,title="Survival to 2-year CC Entry")
Here, we will use the kphaz.fit function to find the hazard function of the above dataset:
data_haz<-kphaz.fit(time=entry_variables_complete$F3_UPDATED_MONTHS_TO_COLLEGE,status=entry_variables_complete$F3_first_college_2_year,method="product-limit")
data.frame(data_haz)
## time haz var
## 1 0.5 0.010842127 1.049629e-06
## 2 1.5 0.067718581 7.561816e-06
## 3 2.5 0.215776091 4.193655e-05
## 4 3.5 0.047344972 1.605839e-05
## 5 4.5 0.007378540 2.865521e-06
## 6 5.5 0.007941931 3.153802e-06
## 7 6.5 0.044073284 1.833295e-05
## 8 7.5 0.059001037 2.679990e-05
## 9 8.5 0.011624716 5.630858e-06
## 10 9.5 0.006427032 3.177479e-06
## 11 10.5 0.006002574 3.002611e-06
## 12 11.5 0.009162544 4.664228e-06
## 13 12.5 0.013017458 6.778458e-06
## 14 13.5 0.038062555 2.070388e-05
## 15 14.5 0.082896154 4.950193e-05
## 16 15.5 0.019103787 1.258602e-05
## 17 16.5 0.006093329 4.125492e-06
## 18 17.5 0.006869238 4.718755e-06
## 19 18.5 0.019654878 1.379821e-05
## 20 19.5 0.024851992 1.816714e-05
## 21 20.5 2.752627611 NaN
We notice the same pattern here, although in the inverse of the survival plot: hazard peaks around the third month after high school exit, and remains somewhat stable until the censoring point.
This function can be represented in a plot, such as this:
kphaz.plot(data_haz,main="Hazard Function Plot for CC Entry")
This hazard plot demonstrates that there is a clear increase in hazard during the third month after high school exit, followed by a smaller but still noticeable increase in hazard around the fifteenth month after high school exit.
Now we can use the above hazard function to calculate the cumulative hazard function, like so:
plot(cumsum(data_haz$haz)~data_haz$time,
main="Cumulative Hazard Function for CC Entry",
ylab="H(t)",
xlab="Time in Months",
type="l",col=2)
From this graph, it appears that the cumulative risk of community college entry rises gradually over time, with noticeable upward movement in the third and (to a lesser extent) the fifteenth month after high school exit. It seems reasonable that these are two time points in which high school graduates face the daunting prospects of facing adulthood without a post-secondary education, and decide for themselves to attempt attending college.
I have selected a categorical grouping variable ("BY_LOW_SES") which will register "1" if the student's household is in the bottom two quartiles of socioeconomic status during the base year survey, and "0" if their household is in the top two quartiles during the same period. I expect that students in the bottom two quartiles ("1") will be more likely to attend community college in the first 21 months after high school, if only because community college is frequently a much more inexpensive route to post-secondary education than a traditional university, and therefore more appealing to low-SES students.
low_ses<-table(entry_variables_complete$BY_LOW_SES)
low_ses
##
## 0 1
## 6383 4333
Here is the continuation of the Kaplan-Meier survival analysis of our outcome (LOW SES) by months to first 2-year community college entry:
SES_fit<-survfit(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~BY_LOW_SES,data=entry_variables_complete)
SES_fit
## Call: survfit(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## BY_LOW_SES, data = entry_variables_complete)
##
## n events median 0.95LCL 0.95UCL
## BY_LOW_SES=0 6383 1580 21 21 21
## BY_LOW_SES=1 4333 1916 19 16 21
As you can see, there appears to be some difference between students from low SES ("1") and students from high SES ("0"). These differences can be further seen in a plot of this same model:
ggsurvplot(SES_fit,conf.int = T,title="Survival Function for CC Entry by Low vs. High SES")
As we can see, the hypothesis appears to be correct: low SES students in the base-year appear to have smaller survival probabilities in all months after high school exit than students who come from high SES households.
Here, I have run the Kaplan-Meier Survival Analysis across both SES groups:
summary(SES_fit)
## Call: survfit(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## BY_LOW_SES, data = entry_variables_complete)
##
## BY_LOW_SES=0
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 6383 43 0.993 0.00102 0.991 0.995
## 1 6286 62 0.983 0.00160 0.980 0.987
## 2 6135 300 0.935 0.00311 0.929 0.941
## 3 4542 573 0.817 0.00535 0.807 0.828
## 4 1459 70 0.778 0.00684 0.765 0.792
## 5 1052 8 0.772 0.00710 0.758 0.786
## 6 1023 8 0.766 0.00736 0.752 0.781
## 7 1005 46 0.731 0.00865 0.714 0.748
## 8 925 44 0.696 0.00970 0.678 0.716
## 9 833 12 0.686 0.00998 0.667 0.706
## 10 813 8 0.680 0.01017 0.660 0.700
## 11 801 3 0.677 0.01023 0.657 0.697
## 12 796 10 0.669 0.01045 0.648 0.689
## 13 778 13 0.657 0.01073 0.637 0.679
## 14 754 24 0.636 0.01120 0.615 0.659
## 15 702 45 0.596 0.01202 0.573 0.620
## 16 611 12 0.584 0.01225 0.560 0.608
## 17 584 2 0.582 0.01229 0.558 0.607
## 18 579 6 0.576 0.01241 0.552 0.601
## 19 571 12 0.564 0.01263 0.540 0.589
## 20 550 15 0.548 0.01289 0.524 0.574
## 21 526 264 0.273 0.01357 0.248 0.301
##
## BY_LOW_SES=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 4333 81 0.981 0.00206 0.977 0.985
## 1 4179 50 0.970 0.00262 0.964 0.975
## 2 4051 309 0.896 0.00471 0.886 0.905
## 3 3275 603 0.731 0.00718 0.717 0.745
## 4 1775 70 0.702 0.00768 0.687 0.717
## 5 1548 11 0.697 0.00777 0.682 0.712
## 6 1517 12 0.691 0.00787 0.676 0.707
## 7 1489 60 0.664 0.00834 0.647 0.680
## 8 1389 86 0.622 0.00892 0.605 0.640
## 9 1257 12 0.617 0.00900 0.599 0.634
## 10 1226 5 0.614 0.00903 0.597 0.632
## 11 1211 9 0.609 0.00909 0.592 0.628
## 12 1191 8 0.605 0.00914 0.588 0.624
## 13 1168 12 0.599 0.00922 0.581 0.617
## 14 1147 46 0.575 0.00951 0.557 0.594
## 15 1076 94 0.525 0.00999 0.506 0.545
## 16 944 17 0.515 0.01007 0.496 0.536
## 17 904 7 0.511 0.01011 0.492 0.532
## 18 889 4 0.509 0.01013 0.490 0.529
## 19 875 16 0.500 0.01020 0.480 0.520
## 20 847 19 0.489 0.01029 0.469 0.509
## 21 818 385 0.259 0.01012 0.240 0.279
Initial survival probabilities are relatively similar, but diverge quickly and dramatically after the third month.
Here, I have continued the Kaplan-Meier Survival Analysis across both SES groups:
chi_sq<-survdiff(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~BY_LOW_SES,data=entry_variables_complete)
survdiff(Surv(F3_UPDATED_MONTHS_TO_COLLEGE,F3_first_college_2_year)~BY_LOW_SES,data=entry_variables_complete)
## Call:
## survdiff(formula = Surv(F3_UPDATED_MONTHS_TO_COLLEGE, F3_first_college_2_year) ~
## BY_LOW_SES, data = entry_variables_complete)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## BY_LOW_SES=0 6383 1580 1782 22.8 57.4
## BY_LOW_SES=1 4333 1916 1714 23.7 57.4
##
## Chisq= 57.4 on 1 degrees of freedom, p= 4e-14
We can see clearly that the survival rates of low SES vs. high SES are quite different, especially with a p-value of 3.537058210^{-14} -- which is to say, a p-value of almost but not-quite "zero".
Now I will run a hazard analysis against the data with the BY_LOW_SES variable:
low_ses_haz<-kphaz.fit(entry_variables_complete$F3_UPDATED_MONTHS_TO_COLLEGE,entry_variables_complete$F3_first_college_2_year,entry_variables_complete$BY_LOW_SES)
low_ses_haz
## $time
## [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5
## [16] 15.5 16.5 17.5 18.5 19.5 20.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
## [31] 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5 20.5
##
## $haz
## [1] 0.009977496 0.056445467 0.201391148 0.054179409 0.007696298 0.007891735
## [7] 0.047515962 0.049858523 0.014568319 0.009918065 0.003757838 0.012681391
## [13] 0.016979527 0.032963481 0.067831212 0.020107957 0.003436467 0.010437949
## [19] 0.021320092 0.027861019 2.901372476 0.012143404 0.084824233 0.242453622
## [25] 0.042144195 0.007158715 0.007969650 0.041725948 0.065062659 0.009664183
## [31] 0.004107141 0.007485584 0.006807899 0.010382360 0.041402986 0.092732177
## [37] 0.018427385 0.007808983 0.004531400 0.018545395 0.022880183 2.501165359
##
## $var
## [1] 1.605727e-06 1.068113e-05 7.881494e-05 4.219889e-05 7.404405e-06
## [6] 7.785174e-06 4.910681e-05 5.654541e-05 1.768729e-05 1.229625e-05
## [11] 4.707127e-06 1.608265e-05 2.217911e-05 4.529053e-05 1.023828e-04
## [16] 3.369725e-05 5.904721e-06 1.815871e-05 3.788341e-05 5.175679e-05
## [21] 2.901289e-01 2.949459e-06 2.335443e-05 1.003230e-04 2.540775e-05
## [26] 4.658982e-06 5.293076e-06 2.902859e-05 4.926253e-05 7.783361e-06
## [31] 3.373743e-06 6.226113e-06 5.793532e-06 8.982991e-06 3.727820e-05
## [36] 9.160412e-05 1.997694e-05 8.711702e-06 5.133516e-06 2.149734e-05
## [41] 2.755479e-05 1.205704e-01
##
## $strata
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1
In the above data, we clearly see that there are 21 entries in the first strata (BY_LOW_SES=="0") and 21 in the second strata (BY_LOW_SES=="1"). With this knowledge, we can create a hazard plot of the two strata, as follows:
plot(y=low_ses_haz$haz[1:21],x=low_ses_haz$time[1:21],col=1,lty=1,type="s")
lines(y=low_ses_haz$haz[22:42],x=low_ses_haz$time[22:42],col=2,lty=1,type="s")
Once hazard functions for each strata are plotted, we see that low SES students (the red line) do have slightly higher risk levels in esentially the same places that high SES students have: three and fifteen months, respectively.
This concludes my homework for this week. Thank you!