library(haven)
library(survival)
library(readr)
library(tidyverse)
#This reads in the dta file from DHS
<-read_dta("C:/Users/codar/OneDrive/Documents/R/R 2022/COHR72FL.DTA")
dat
# dat<-zap_labels(dat)
#This code creates a multigen variable from hv101
<- dat %>%
hhdatselect(hhid, starts_with("hv101"))%>%
pivot_longer(-hhid) %>%
filter(is.na(value)==F)%>%
group_by(hhid)%>%
mutate(mgenhh = ifelse(value >4, 1,0))%>%
summarise(mgen = sum(mgenhh))%>%
mutate(mgenhh = ifelse(mgen>=1, "mgenhh", "notmgen"))
Homework 2
Homework #2
Data Prep
For my dataset I am using DHS data for Colombia, year 2015. The code below is from Dr. Sparks. It joins the two files together to have multigenerational household id and children’s recode data files into one data frame.
For (my own) reference, here is the code information for the HV101 variable from the code book:
The code below will join both files together
#print(hhdat, n=20)
<- read_dta("C:/Users/codar/OneDrive/Documents/R/R 2022/COKR72FL.DTA")
kdat $hh <- substr(kdat$caseid, 1,12)
kdat<- left_join(kdat, hhdat, by=c("hh" ="hhid")) mdat
Now, I am going to recode the death age variable using Sparks code.
For my own reference, below is the variable from the code book:
$death.age<-ifelse(mdat$b5==1,
mdat$v008)))-(((mdat$b3))))
((((mdat$b7)
,mdat
#censoring indicator for death by age 1, in months (12 months)
$d.event<-ifelse(is.na(mdat$b7)==T|mdat$b7>12,0,1)
mdat$d.eventfac<-factor(mdat$d.event)
mdatlevels(mdat$d.eventfac)<-c("Alive at 1", "Dead by 1")
table(mdat$d.eventfac)
Alive at 1 Dead by 1
11581 178
Defining Variables
The event variable I will be using is infant mortality. The time variable used is 12 months for the year 2015.
Estimating the Survival Function + Plot
To generate the survival function, I used survfit(). Essentially, this generates a life table. The first chunk just generates the life table for the first 20 cases. I’ve also used the head and surv functions to display the first 20 cases with the + indicating where there were no deaths.
library(knitr)
kable(head(mdat[,c("death.age","d.event")], n=20))
death.age | d.event |
---|---|
58 | 0 |
53 | 0 |
1 | 0 |
0 | 1 |
6 | 0 |
37 | 0 |
6 | 0 |
3 | 0 |
10 | 1 |
13 | 0 |
31 | 0 |
32 | 0 |
0 | 1 |
45 | 0 |
6 | 0 |
7 | 0 |
28 | 0 |
7 | 0 |
32 | 0 |
20 | 0 |
head(Surv(mdat$death.age, mdat$d.event), n=20)
[1] 58+ 53+ 1+ 0 6+ 37+ 6+ 3+ 10 13+ 31+ 32+ 0 45+ 6+ 7+ 28+ 7+ 32+
[20] 20+
Here is the real answer to the homework assignment for this portion. Below I estimate and plot the survival function.
<-survfit(Surv(death.age, d.event)~1,
mortdata=mdat,
conf.type="none")
plot(mort, ylim=c(.97,1), #I changed the ylim to .97...I'm truncating?
xlim=c(0,12),
main="Survival Function for Infant Mortality")
The Kaplan-Meier estimates
In the following chunk I produce the KM analysis of the outcome.
summary(mort)
Call: survfit(formula = Surv(death.age, d.event) ~ 1, data = mdat,
conf.type = "none")
time n.risk n.event survival std.err
0 11759 104 0.991 0.000863
1 11587 17 0.990 0.000931
2 11370 9 0.989 0.000967
3 11184 8 0.988 0.000998
4 10988 5 0.988 0.001017
5 10814 5 0.987 0.001037
6 10594 5 0.987 0.001057
7 10370 3 0.987 0.001070
8 10145 4 0.986 0.001087
9 9957 3 0.986 0.001100
10 9778 5 0.985 0.001122
11 9577 1 0.985 0.001127
12 9390 9 0.984 0.001169
1000+(1-summary(mort)$surv[12])
[1] 1000.015
Group Comparison
The grouping variable is multigenerational households. It is a categorical variable. Here I will examine if the survival chances vary based on whether the household is classified as multigenerational or not. T
- Hypothesis: There will be no overall difference between the two (multigenerational households and those that are not multigenerational) KM curves.
To test my hypothesis, I will use survfit(). which is the equivalent of doing a t-test for difference between two groups. (noted from Ex2)
library(ggsurvfit)
<-survfit(Surv(death.age, d.event)~mgenhh, data=mdat)
fit
summary (fit)
Call: survfit(formula = Surv(death.age, d.event) ~ mgenhh, data = mdat)
mgenhh=mgenhh
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 5630 37 0.993 0.00108 0.991 0.996
1 5550 11 0.991 0.00123 0.989 0.994
2 5423 3 0.991 0.00127 0.988 0.993
3 5326 5 0.990 0.00133 0.987 0.993
4 5199 3 0.989 0.00137 0.987 0.992
5 5098 4 0.989 0.00142 0.986 0.991
6 4985 3 0.988 0.00146 0.985 0.991
7 4867 2 0.988 0.00149 0.985 0.991
8 4740 2 0.987 0.00152 0.984 0.990
9 4646 2 0.987 0.00155 0.984 0.990
10 4557 3 0.986 0.00159 0.983 0.989
12 4365 5 0.985 0.00167 0.982 0.988
mgenhh=notmgen
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 6129 67 0.989 0.00133 0.986 0.992
1 6037 6 0.988 0.00139 0.985 0.991
2 5947 6 0.987 0.00144 0.984 0.990
3 5858 3 0.987 0.00147 0.984 0.989
4 5789 2 0.986 0.00149 0.983 0.989
5 5716 1 0.986 0.00150 0.983 0.989
6 5609 2 0.986 0.00152 0.983 0.989
7 5503 1 0.986 0.00153 0.983 0.989
8 5405 2 0.985 0.00155 0.982 0.988
9 5311 1 0.985 0.00156 0.982 0.988
10 5221 2 0.985 0.00158 0.982 0.988
11 5116 1 0.984 0.00160 0.981 0.988
12 5025 4 0.984 0.00164 0.980 0.987
#A different table with gtsummary to show survival probability estimates
library(gtsummary)
tbl_survfit(fit, times=12, label_header = "1-Year Survival")
Characteristic | 1-Year Survival |
---|---|
mgenhh | |
mgenhh | 99% (98%, 99%) |
notmgen | 98% (98%, 99%) |
By only looking at summary output, you can see almost twice as many infant deaths in the first month of event observations for multigenerational households.
Plotting the Survival Function for Both Groups
%>%
fitggsurvfit() +
labs(title = "Infant mortality survival plot", subtitle = "By multigen household ", x = "Years", y = "Survival Probability")+
xlim(c(0, 12))+
ylim(c(.95, 1))+
add_confidence_interval()
# add_risktable()
Results
To test for differences between the two groups, the survdiff() function will perform a log-rank test to compare the survival patterns.
survdiff(Surv(death.age, d.event)~mgenhh, data=mdat)
Call:
survdiff(formula = Surv(death.age, d.event) ~ mgenhh, data = mdat)
N Observed Expected (O-E)^2/E (O-E)^2/V
mgenhh=mgenhh 5630 80 84.8 0.268 0.515
mgenhh=notmgen 6129 98 93.2 0.244 0.515
Chisq= 0.5 on 1 degrees of freedom, p= 0.5
chisq.test(table(mdat$d.event, mdat$mgenhh))
Pearson's Chi-squared test with Yates' continuity correction
data: table(mdat$d.event, mdat$mgenhh)
X-squared = 0.50995, df = 1, p-value = 0.4752
Interpretation
The results of the test produce a chi square that is exactly at .05, which means I cannot reject the null hypothesis that there is no difference between the groups. This means there is no statistical difference in the survival chances for infants in homes that are multigenerational versus those that are not.