Homework 2

Author

Coda Rayo-Garza

Homework #2

Data Prep

For my dataset I am using DHS data for Colombia, year 2015. The code below is from Dr. Sparks. It joins the two files together to have multigenerational household id and children’s recode data files into one data frame.

library(haven)
library(survival)
library(readr)
library(tidyverse)


#This reads in the dta file from DHS
dat<-read_dta("C:/Users/codar/OneDrive/Documents/R/R 2022/COHR72FL.DTA")

# dat<-zap_labels(dat)

#This code creates a multigen variable from  hv101
hhdat<- dat %>%
  select(hhid, starts_with("hv101"))%>%
  pivot_longer(-hhid) %>%
  filter(is.na(value)==F)%>%
  group_by(hhid)%>%
  mutate(mgenhh = ifelse(value >4, 1,0))%>%
  summarise(mgen = sum(mgenhh))%>%
  mutate(mgenhh = ifelse(mgen>=1, "mgenhh", "notmgen"))

For (my own) reference, here is the code information for the HV101 variable from the code book:

The code below will join both files together

#print(hhdat, n=20)

kdat <- read_dta("C:/Users/codar/OneDrive/Documents/R/R 2022/COKR72FL.DTA")
kdat$hh <- substr(kdat$caseid, 1,12)
mdat<- left_join(kdat, hhdat, by=c("hh" ="hhid"))

Now, I am going to recode the death age variable using Sparks code.

For my own reference, below is the variable from the code book:

mdat$death.age<-ifelse(mdat$b5==1,
                          ((((mdat$v008)))-(((mdat$b3)))) 
                          ,mdat$b7)

#censoring indicator for death by age 1, in months (12 months)
mdat$d.event<-ifelse(is.na(mdat$b7)==T|mdat$b7>12,0,1)
mdat$d.eventfac<-factor(mdat$d.event)
levels(mdat$d.eventfac)<-c("Alive at 1", "Dead by 1")
table(mdat$d.eventfac)

Alive at 1  Dead by 1 
     11581        178 

Defining Variables

The event variable I will be using is infant mortality. The time variable used is 12 months for the year 2015.

Estimating the Survival Function + Plot

To generate the survival function, I used survfit(). Essentially, this generates a life table. The first chunk just generates the life table for the first 20 cases. I’ve also used the head and surv functions to display the first 20 cases with the + indicating where there were no deaths.

library(knitr)

kable(head(mdat[,c("death.age","d.event")], n=20))
death.age d.event
58 0
53 0
1 0
0 1
6 0
37 0
6 0
3 0
10 1
13 0
31 0
32 0
0 1
45 0
6 0
7 0
28 0
7 0
32 0
20 0
head(Surv(mdat$death.age, mdat$d.event), n=20)
 [1] 58+ 53+  1+  0   6+ 37+  6+  3+ 10  13+ 31+ 32+  0  45+  6+  7+ 28+  7+ 32+
[20] 20+

Here is the real answer to the homework assignment for this portion. Below I estimate and plot the survival function.

mort<-survfit(Surv(death.age, d.event)~1,
              data=mdat,
              conf.type="none")
plot(mort, ylim=c(.97,1),  #I changed the ylim to .97...I'm truncating? 
     xlim=c(0,12),
     main="Survival Function for Infant Mortality")

The Kaplan-Meier estimates

In the following chunk I produce the KM analysis of the outcome.

summary(mort)
Call: survfit(formula = Surv(death.age, d.event) ~ 1, data = mdat, 
    conf.type = "none")

 time n.risk n.event survival  std.err
    0  11759     104    0.991 0.000863
    1  11587      17    0.990 0.000931
    2  11370       9    0.989 0.000967
    3  11184       8    0.988 0.000998
    4  10988       5    0.988 0.001017
    5  10814       5    0.987 0.001037
    6  10594       5    0.987 0.001057
    7  10370       3    0.987 0.001070
    8  10145       4    0.986 0.001087
    9   9957       3    0.986 0.001100
   10   9778       5    0.985 0.001122
   11   9577       1    0.985 0.001127
   12   9390       9    0.984 0.001169
1000+(1-summary(mort)$surv[12])
[1] 1000.015

Group Comparison

The grouping variable is multigenerational households. It is a categorical variable. Here I will examine if the survival chances vary based on whether the household is classified as multigenerational or not. T

  • Hypothesis: There will be no overall difference between the two (multigenerational households and those that are not multigenerational) KM curves.

To test my hypothesis, I will use survfit(). which is the equivalent of doing a t-test for difference between two groups. (noted from Ex2)

library(ggsurvfit)

fit <-survfit(Surv(death.age, d.event)~mgenhh, data=mdat) 

summary (fit)
Call: survfit(formula = Surv(death.age, d.event) ~ mgenhh, data = mdat)

                mgenhh=mgenhh 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   5630      37    0.993 0.00108        0.991        0.996
    1   5550      11    0.991 0.00123        0.989        0.994
    2   5423       3    0.991 0.00127        0.988        0.993
    3   5326       5    0.990 0.00133        0.987        0.993
    4   5199       3    0.989 0.00137        0.987        0.992
    5   5098       4    0.989 0.00142        0.986        0.991
    6   4985       3    0.988 0.00146        0.985        0.991
    7   4867       2    0.988 0.00149        0.985        0.991
    8   4740       2    0.987 0.00152        0.984        0.990
    9   4646       2    0.987 0.00155        0.984        0.990
   10   4557       3    0.986 0.00159        0.983        0.989
   12   4365       5    0.985 0.00167        0.982        0.988

                mgenhh=notmgen 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   6129      67    0.989 0.00133        0.986        0.992
    1   6037       6    0.988 0.00139        0.985        0.991
    2   5947       6    0.987 0.00144        0.984        0.990
    3   5858       3    0.987 0.00147        0.984        0.989
    4   5789       2    0.986 0.00149        0.983        0.989
    5   5716       1    0.986 0.00150        0.983        0.989
    6   5609       2    0.986 0.00152        0.983        0.989
    7   5503       1    0.986 0.00153        0.983        0.989
    8   5405       2    0.985 0.00155        0.982        0.988
    9   5311       1    0.985 0.00156        0.982        0.988
   10   5221       2    0.985 0.00158        0.982        0.988
   11   5116       1    0.984 0.00160        0.981        0.988
   12   5025       4    0.984 0.00164        0.980        0.987
#A different table with gtsummary to show survival probability estimates 
library(gtsummary)
  tbl_survfit(fit, times=12, label_header = "1-Year Survival")
Characteristic 1-Year Survival
mgenhh
mgenhh 99% (98%, 99%)
notmgen 98% (98%, 99%)

By only looking at summary output, you can see almost twice as many infant deaths in the first month of event observations for multigenerational households.

Plotting the Survival Function for Both Groups

fit%>%
  ggsurvfit() +
  labs(title = "Infant mortality survival plot", subtitle = "By multigen household ", x = "Years", y = "Survival Probability")+
  xlim(c(0, 12))+
  ylim(c(.95, 1))+ 
  add_confidence_interval()

  # add_risktable()

Results

To test for differences between the two groups, the survdiff() function will perform a log-rank test to compare the survival patterns.

survdiff(Surv(death.age, d.event)~mgenhh, data=mdat)
Call:
survdiff(formula = Surv(death.age, d.event) ~ mgenhh, data = mdat)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
mgenhh=mgenhh  5630       80     84.8     0.268     0.515
mgenhh=notmgen 6129       98     93.2     0.244     0.515

 Chisq= 0.5  on 1 degrees of freedom, p= 0.5 
chisq.test(table(mdat$d.event, mdat$mgenhh))

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(mdat$d.event, mdat$mgenhh)
X-squared = 0.50995, df = 1, p-value = 0.4752

Interpretation

The results of the test produce a chi square that is exactly at .05, which means I cannot reject the null hypothesis that there is no difference between the groups. This means there is no statistical difference in the survival chances for infants in homes that are multigenerational versus those that are not.