For this homework I will use Kaplan-Meier estimation to create and plot survival functions and make comparisons between groups.
library (haven)
library (tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
loki <- read_stata ("C:/Users/maman/OneDrive - University of Texas at San Antonio/Event History Analysis Data/nhis_00002.dta.gz" )
View (loki)
loki <- haven:: zap_labels (loki)
loki <- loki %>%
filter (mortelig == 1 )
loki_16 <- loki %>%
filter (loki$ year == 2016 )
For this project data from the National Health Interview Survey (NHIS) will be used.
1. Define the event variable
The event variable that will be used in this analysis is an individuals final mortality status, labeled “mortstat”.
2. Define a duration or time variable
The time variable will be the age in years of the respondents
3. Define a censoring indicator
The variable “mortelig” will be used to filter for respondents who are eligible for the mortality follow-up (non-children). The year of the survey, 2016, is the censoring year for mortality.
4. Estimate the survival function of your outcome and plot it.
The plotted survival function for the age-at-death for the 2016 NHIS with mortality follow-up is plotted below:
loki_16 <- loki_16 %>%
mutate (death_age = ifelse ( mortstat == 1 ,
mortdody - (year - age),
2016 - (year - age)),
d.event = ifelse (mortstat == 1 , 1 , 0 ))
library (survival)
age_fit <- survfit (Surv (death_age, d.event) ~ 1 ,
data = loki_16)
library (ggsurvfit)
age_fit %>%
ggsurvfit () +
add_confidence_interval (type = "ribbon" ) +
add_quantile ()
5b. Define a grouping variable.
United States citizenship status (yes/no) will be used as a grouping variable.
5c. State a research hypothesis about the survival patterns by the grouping variable.
People who are U.S. citizens will have a higher survival rate compared to those who are not citizens.
5d. Compare Kaplan-Meier survival across grouping variables in your data. Interpret your results.
Below are the Kaplan-Meier estimates by citizenship status:
loki_16$ cit <- Recode (loki_16$ citizen,
recodes= "1='no'; 2='yes'; else=NA" )
fit <- survfit (Surv (death_age, d.event)~ cit, data= loki_16)
summary (fit)
Call: survfit(formula = Surv(death_age, d.event) ~ cit, data = loki_16)
36 observations deleted due to missingness
cit=no
time n.risk n.event survival std.err lower 95% CI upper 95% CI
20 1902 1 0.999 0.000526 0.9984 1.000
39 947 1 0.998 0.001178 0.9961 1.000
43 757 1 0.997 0.001767 0.9936 1.000
46 630 1 0.996 0.002369 0.9909 1.000
48 561 1 0.994 0.002956 0.9880 1.000
50 492 1 0.992 0.003574 0.9847 0.999
52 435 1 0.989 0.004231 0.9812 0.998
53 404 1 0.987 0.004878 0.9775 0.997
55 357 1 0.984 0.005593 0.9733 0.995
56 332 1 0.981 0.006313 0.9690 0.994
59 268 1 0.978 0.007274 0.9634 0.992
61 224 1 0.973 0.008450 0.9568 0.990
71 95 1 0.963 0.013182 0.9375 0.989
72 86 1 0.952 0.017137 0.9188 0.986
76 60 1 0.936 0.023052 0.8918 0.982
78 50 1 0.917 0.029219 0.8617 0.976
81 28 1 0.884 0.042762 0.8045 0.972
83 22 1 0.844 0.056647 0.7402 0.963
85 15 2 0.732 0.088889 0.5767 0.928
86 3 1 0.488 0.207770 0.2117 1.000
87 2 1 0.244 0.201332 0.0484 1.000
88 1 1 0.000 NaN NA NA
cit=yes
time n.risk n.event survival std.err lower 95% CI upper 95% CI
21 29274 2 1.000 4.83e-05 1.000 1.000
22 28874 1 1.000 5.94e-05 1.000 1.000
24 28141 2 1.000 7.78e-05 1.000 1.000
25 27728 2 1.000 9.30e-05 1.000 1.000
26 27284 1 1.000 1.00e-04 1.000 1.000
27 26855 3 1.000 1.19e-04 0.999 1.000
28 26398 1 1.000 1.25e-04 0.999 1.000
30 25501 2 0.999 1.37e-04 0.999 1.000
32 24608 1 0.999 1.42e-04 0.999 1.000
33 24193 1 0.999 1.48e-04 0.999 1.000
34 23720 1 0.999 1.54e-04 0.999 1.000
36 22781 2 0.999 1.66e-04 0.999 1.000
37 22330 2 0.999 1.78e-04 0.999 1.000
38 21918 3 0.999 1.95e-04 0.999 0.999
39 21495 2 0.999 2.05e-04 0.999 0.999
40 21089 4 0.999 2.26e-04 0.998 0.999
41 20702 7 0.998 2.60e-04 0.998 0.999
42 20317 2 0.998 2.69e-04 0.998 0.999
44 19518 2 0.998 2.78e-04 0.998 0.999
45 19095 2 0.998 2.88e-04 0.998 0.999
46 18641 11 0.998 3.38e-04 0.997 0.998
47 18151 2 0.997 3.47e-04 0.997 0.998
48 17704 4 0.997 3.65e-04 0.996 0.998
49 17302 4 0.997 3.82e-04 0.996 0.998
50 16856 8 0.996 4.17e-04 0.996 0.997
51 16390 11 0.996 4.63e-04 0.995 0.997
52 15904 8 0.995 4.96e-04 0.994 0.996
53 15370 13 0.994 5.47e-04 0.993 0.996
54 14785 8 0.994 5.79e-04 0.993 0.995
55 14254 12 0.993 6.27e-04 0.992 0.994
56 13744 12 0.992 6.75e-04 0.991 0.994
57 13184 17 0.991 7.42e-04 0.990 0.992
58 12596 9 0.990 7.78e-04 0.989 0.992
59 12057 12 0.989 8.27e-04 0.988 0.991
60 11486 21 0.987 9.15e-04 0.986 0.989
61 10903 21 0.986 1.00e-03 0.984 0.988
62 10349 12 0.984 1.05e-03 0.982 0.986
63 9783 19 0.983 1.14e-03 0.980 0.985
64 9192 25 0.980 1.26e-03 0.977 0.982
65 8648 27 0.977 1.38e-03 0.974 0.979
66 8055 23 0.974 1.50e-03 0.971 0.977
67 7534 33 0.970 1.66e-03 0.966 0.973
68 6952 40 0.964 1.87e-03 0.960 0.968
69 6330 24 0.960 2.01e-03 0.957 0.964
70 5773 34 0.955 2.22e-03 0.950 0.959
71 5313 39 0.948 2.47e-03 0.943 0.953
72 4911 42 0.940 2.75e-03 0.934 0.945
73 4503 35 0.932 2.99e-03 0.927 0.938
74 4072 41 0.923 3.30e-03 0.917 0.930
75 3731 36 0.914 3.59e-03 0.907 0.921
76 3381 30 0.906 3.85e-03 0.898 0.914
77 3070 37 0.895 4.20e-03 0.887 0.903
78 2787 30 0.885 4.51e-03 0.877 0.894
79 2506 40 0.871 4.96e-03 0.862 0.881
80 2238 33 0.858 5.37e-03 0.848 0.869
81 2009 38 0.842 5.88e-03 0.831 0.854
82 1776 32 0.827 6.35e-03 0.815 0.840
83 1567 38 0.807 6.98e-03 0.793 0.821
84 1367 33 0.788 7.59e-03 0.773 0.803
85 1186 82 0.733 9.14e-03 0.715 0.751
86 394 134 0.484 1.85e-02 0.449 0.521
87 260 134 0.234 1.75e-02 0.203 0.271
88 126 126 0.000 NaN NA NA
A statistical test for differences between groups is shown below. The results suggest that there is no statistical difference in the survival rates between citizens and non-citizens.
survdiff (Surv (death_age,d.event)~ cit,data= loki_16)
Call:
survdiff(formula = Surv(death_age, d.event) ~ cit, data = loki_16)
n=32451, 36 observations deleted due to missingness.
N Observed Expected (O-E)^2/E (O-E)^2/V
cit=no 1935 23 23.3 4.31e-03 0.00484
cit=yes 30516 1433 1432.7 7.02e-05 0.00484
Chisq= 0 on 1 degrees of freedom, p= 0.9
5e. Plot the survival function for the analysis for each level of the group variable
fit%>% ggsurvfit () + labs (title = "Mortality survival plot" , subtitle = "By Citizenship Status" , x = "Years" , y = "Survival Probability" )