Event History Analysis Homework 2
library(dplyr)
df1 <- read.csv('cat_data.csv')
df2 <- read.csv('raw_data.csv')
df1b <- df1 %>%
# filter(CV_HIGHEST_DEGREE_EVER_EDT_2015 %in%
# c("Associate/Junior college (AA)",
# "Bachelor's degree (BA, BS)",
# "Master's degree (MA, MS)",
# "PhD",
# "Professional degree (DDS, JD, MD)"
# )
# ) %>%
  mutate(
    debt_cat = if_else(CVC_ASSETS_DEBTS_25_XRND > 17000, 1, 0),
    age1 = 2005 - KEY_BDATE_Y_1997,
    age2 = 2014 - KEY_BDATE_Y_1997,
    child05 = if_else(CV_BIO_CHILD_HH_2005 > 0, 1, 0),
    child14 = if_else(CV_BIO_CHILD_HH_2015 > 0, 1, 0),
    childtran = if_else(child05 == 0 & child14 == 0, 0, 1)
  ) %>%
  # Filter out those with children at first time point
  filter(child05 == 0) %>%
  filter(complete.cases(.)) %>%
  # Select only variables needed
  select(KEY_SEX_1997, KEY_RACE_ETHNICITY_1997, CV_HIGHEST_DEGREE_EVER_EDT_2015, debt_cat:childtran)
Define your event variable
- Presence of biological child in household
Define a duration or time variable
- Respondent's age in 2014 (age2), with an observation window of 2005 to 2014
Define a censoring indicator
- Censored if respondent does not report having a biological child in the household within the time period
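As a minimal sketch of how these three definitions combine into one survival object, the example below uses made-up toy values (the `toy` data frame is hypothetical and is not the homework data). It assumes the same coding as above: `age2` is the time variable and `childtran` is 1 for an event and 0 for a censored respondent.

```r
library(survival)

# Hypothetical toy data: age at the second time point and whether a
# biological child was in the household by then (1 = event, 0 = censored).
toy <- data.frame(
  age2      = c(30, 31, 32, 33, 34),
  childtran = c(1, 0, 1, 0, 1)
)

# Surv() pairs each time with its status; censored times print with a "+".
toy_surv <- Surv(toy$age2, toy$childtran)
print(toy_surv)
```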
Estimate the survival function for your outcome and plot it
1. by age, proportion of women at a given age who haven’t had a birth
library(survival)
child_fit <- survfit(Surv(age2, childtran) ~ 1, data = df1b)
library(ggsurvfit)
child_fit %>%
  ggsurvfit() +
  add_confidence_interval(type = "ribbon") +
  add_quantile()
summary(child_fit)
Call: survfit(formula = Surv(age2, childtran) ~ 1, data = df1b)
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   30    261      19    0.927  0.0161        0.896        0.959
   31    229      24    0.830  0.0237        0.785        0.878
   32    182      28    0.702  0.0299        0.646        0.763
   33    123      29    0.537  0.0353        0.472        0.611
   34     60      25    0.313  0.0399        0.244        0.402
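The estimated survival probability is 0.537 at age 33 and falls to 0.313 at age 34, so the curve crosses 0.5 at age 34. The median age at which respondents have a biological child in the household by the second time point is therefore 34.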
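The survival column can be reproduced by hand from the at-risk and event counts. The sketch below applies the Kaplan-Meier product-limit formula to the counts printed by `summary(child_fit)`; the `km` data frame and its column names are introduced here only for illustration.

```r
# Counts copied from the summary(child_fit) output.
km <- data.frame(
  age     = c(30, 31, 32, 33, 34),
  n_risk  = c(261, 229, 182, 123, 60),
  n_event = c(19, 24, 28, 29, 25)
)

# Product-limit estimate: S(t) is the running product of (1 - d_i / n_i).
km$survival <- cumprod(1 - km$n_event / km$n_risk)

# Median survival time: first age at which S(t) drops to 0.5 or below.
median_age <- km$age[which(km$survival <= 0.5)[1]]

print(round(km$survival, 3))
print(median_age)
```

On a fitted object, `quantile(child_fit, probs = 0.5)` should report the same median along with its confidence limits.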
Carry out the following analysis:
Kaplan-Meier survival analysis of the outcome
Define a grouping variable, this can be dichotomous or categorical.
- Debt less than or equal to $17k (debt_cat = 0) or Debt greater than $17k (debt_cat = 1)
Do you have a research hypothesis about the survival patterns for the levels of the categorical variable? State it.
- Those with debt totaling more than $17k will be less likely to have a child in the home in the second time period.
Comparison of Kaplan-Meier survival across grouping variables in your data. Interpret your results.
Plot the survival function for the analysis for each level of the group variable
kpfit <- survfit(Surv(age2, childtran) ~ debt_cat, data = df1b)
summary(kpfit)
Call: survfit(formula = Surv(age2, childtran) ~ debt_cat, data = df1b)
                debt_cat=0
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   30    227      17    0.925  0.0175        0.891        0.960
   31    197      20    0.831  0.0254        0.783        0.882
   32    155      23    0.708  0.0321        0.648        0.774
   33    105      24    0.546  0.0381        0.476        0.626
   34     50      18    0.349  0.0444        0.272        0.448
                debt_cat=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   30     34       2    0.941  0.0404       0.8653        1.000
   31     32       4    0.824  0.0654       0.7049        0.962
   32     27       5    0.671  0.0814       0.5290        0.851
   33     18       5    0.485  0.0921       0.3340        0.703
   34     10       7    0.145  0.0755       0.0526        0.402
## Compare difference across groups
survdiff(Surv(age2, childtran) ~ debt_cat, data = df1b)
Call:
survdiff(formula = Surv(age2, childtran) ~ debt_cat, data = df1b)
             N Observed Expected (O-E)^2/E (O-E)^2/V
debt_cat=0 227      102    106.6     0.199       1.7
debt_cat=1  34       23     18.4     1.154       1.7
 Chisq= 1.7 on 1 degrees of freedom, p= 0.2
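Those with debt above $17k had more events than expected under the null (23 observed vs. 18.4 expected) and lower estimated survival at age 34 (0.145 vs. 0.349), which is the opposite direction from the hypothesis. However, the log-rank test does not find a significant difference between the groups (chi-square = 1.7, df = 1, p = 0.2).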
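The log-rank table can be checked by hand. The sketch below is base R only and uses the rounded counts printed by `survdiff()`, so the per-group contributions may differ from the table in the third decimal. The names `observed`, `expected`, `contrib`, and `p_value` are introduced here for illustration, and the chi-square statistic of 1.7 is taken from the output, not recomputed.

```r
# Observed and expected event counts copied from the survdiff() output.
observed <- c(debt0 = 102, debt1 = 23)
expected <- c(debt0 = 106.6, debt1 = 18.4)

# Per-group contribution (O - E)^2 / E, as in the fourth column.
contrib <- (observed - expected)^2 / expected

# p-value for the reported chi-square statistic on 1 degree of freedom.
p_value <- pchisq(1.7, df = 1, lower.tail = FALSE)

print(round(contrib, 3))
print(round(p_value, 2))
```

A positive `observed - expected` for the higher-debt group means that group had more transitions to a child in the household than the null of equal survival would predict.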
## Plot survival function
kpfit %>%
  ggsurvfit() +
  add_confidence_interval() +
  ggplot2::labs(title = "Survival for transition to child in household")