Part 1. Intent-to-Treat Analysis of Efficacy and Confidence Intervals

1. Create a table that displays the numbers of deaths and numbers of survivors for the vitamin A and control groups separately for the six age-by-sex strata.

Include totals across all strata and percentages so that the reader of your table can see the evidence for whether or not vitamin A is beneficial. Ignoring age and sex for the moment, estimate the proportion of children who died in the vitamin A group and in the control group and estimate the difference in mortality rates between the two groups.

# import data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nepal621 = read_csv("nepal621_v2.csv")
## Rows: 27121 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): sex, age, trt, status
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create a table that displays the numbers of deaths and numbers of survivors for the vitamin A and control groups separately for the six age-by-sex strata.
nepal621 %>%
  group_by(trt, sex, age) %>%
  summarize(N_Alive = sum(status=="Alive"),
            Perc_Alive = N_Alive/n(),
            N_Died = sum(status=="Died"),
            Perc_Died = N_Died/n(),
            Total=n())
## `summarise()` has grouped output by 'trt', 'sex'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 8
## # Groups:   trt, sex [4]
##    trt     sex    age    N_Alive Perc_Alive N_Died Perc_Died Total
##    <chr>   <chr>  <chr>    <int>      <dbl>  <int>     <dbl> <int>
##  1 Placebo Female 1 to 2    2615      0.973     72   0.0268   2687
##  2 Placebo Female 3 to 4    2542      0.990     25   0.00974  2567
##  3 Placebo Female <1        1219      0.946     69   0.0536   1288
##  4 Placebo Male   1 to 2    2770      0.983     47   0.0167   2817
##  5 Placebo Male   3 to 4    2677      0.990     26   0.00962  2703
##  6 Placebo Male   <1        1276      0.962     51   0.0384   1327
##  7 Vit A   Female 1 to 2    2724      0.981     52   0.0187   2776
##  8 Vit A   Female 3 to 4    2529      0.994     15   0.00590  2544
##  9 Vit A   Female <1        1291      0.960     54   0.0401   1345
## 10 Vit A   Male   1 to 2    2837      0.986     40   0.0139   2877
## 11 Vit A   Male   3 to 4    2752      0.996     12   0.00434  2764
## 12 Vit A   Male   <1        1366      0.958     60   0.0421   1426
## adj % (Make it a little prettier!)
nepal621 %>%
  group_by(trt, sex, age) %>%
  summarize(N_Alive = sum(status=="Alive"),
            Perc_Alive = round(N_Alive/n(),4)*100,
            N_Died = sum(status=="Died"),
            Perc_Died = round(N_Died/n(),4)*100,
            Total=n())
## `summarise()` has grouped output by 'trt', 'sex'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 8
## # Groups:   trt, sex [4]
##    trt     sex    age    N_Alive Perc_Alive N_Died Perc_Died Total
##    <chr>   <chr>  <chr>    <int>      <dbl>  <int>     <dbl> <int>
##  1 Placebo Female 1 to 2    2615       97.3     72      2.68  2687
##  2 Placebo Female 3 to 4    2542       99.0     25      0.97  2567
##  3 Placebo Female <1        1219       94.6     69      5.36  1288
##  4 Placebo Male   1 to 2    2770       98.3     47      1.67  2817
##  5 Placebo Male   3 to 4    2677       99.0     26      0.96  2703
##  6 Placebo Male   <1        1276       96.2     51      3.84  1327
##  7 Vit A   Female 1 to 2    2724       98.1     52      1.87  2776
##  8 Vit A   Female 3 to 4    2529       99.4     15      0.59  2544
##  9 Vit A   Female <1        1291       96.0     54      4.01  1345
## 10 Vit A   Male   1 to 2    2837       98.6     40      1.39  2877
## 11 Vit A   Male   3 to 4    2752       99.6     12      0.43  2764
## 12 Vit A   Male   <1        1366       95.8     60      4.21  1426
# (Ignoring age and sex for the moment) estimate the proportion of children who died in the vitamin A group and in the control group and estimate the difference in mortality rates between the two groups.
nepal621 %>%
  group_by(trt) %>%
  summarize(N_Alive = sum(status=="Alive"),
            Perc_Alive = round(N_Alive/n(),4)*100,
            N_Died = sum(status=="Died"),
            Perc_Died = round(N_Died/n(),4)*100,
            Total=n())
## # A tibble: 2 × 6
##   trt     N_Alive Perc_Alive N_Died Perc_Died Total
##   <chr>     <int>      <dbl>  <int>     <dbl> <int>
## 1 Placebo   13099       97.8    290      2.17 13389
## 2 Vit A     13499       98.3    233      1.7  13732
# Calculate a 95% confidence interval for each true mortality rate.
nepal621 %>%
  group_by(trt) %>%
  summarize(N_Alive = sum(status=="Alive"),
            p_Alive = N_Alive/n(),
            N_Died = sum(status=="Died"),
            p_Died = N_Died/n(),
            Total=n(),
            se_Died = sqrt(p_Died *(1-p_Died)/Total),
            CI_L = p_Died - 1.96*se_Died,
            CI_U = p_Died + 1.96*se_Died)
## # A tibble: 2 × 9
##   trt     N_Alive p_Alive N_Died p_Died Total se_Died   CI_L   CI_U
##   <chr>     <int>   <dbl>  <int>  <dbl> <int>   <dbl>  <dbl>  <dbl>
## 1 Placebo   13099   0.978    290 0.0217 13389 0.00126 0.0192 0.0241
## 2 Vit A     13499   0.983    233 0.0170 13732 0.00110 0.0148 0.0191
# By hand, calculate a 95% confidence interval for the true difference in mortality rates by treatment. Confirm using R.
p.1 = 290/13389 # fill in sample proportion for first sample
n.1 = 13389 # fill in sample size for first sample
p.2 = 233/13732 # fill in sample proportion for second sample
n.2 = 13732 # fill in sample size for second sample

diff = p.1 - p.2
se = sqrt(p.1*(1-p.1)/n.1 + p.2*(1-p.2)/n.2)       # standard error
diff - 1.96*se; diff + 1.96*se   # confidence interval
## [1] 0.001413756
## [1] 0.007970053

2. Calculate a 95% confidence interval for the difference in mortality rates for the vitamin A and control groups separately for each age-sex stratum

I displayed the confidence intervals for the six strata and for the overall groups (Table A) on the following graph. We can find the expected difference in mortality rate for the vitamin A and control groups, and it’s confidence interval for each stratum. We can find larger confidence intervals for those aged 3-4 year, which was due to smaller sample size of the age 3-4 group compared with the otehrs.

# 95% CI for the difference in mortality rates for the vitamin A and control groups separately for each age-sex stratum.
nepal621 %>%
  group_by(sex, age, trt) %>%
  summarize(N_Died = sum(status=="Died"),
            p_Died = N_Died/n(),
            Total = n())
## `summarise()` has grouped output by 'sex', 'age'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 6
## # Groups:   sex, age [6]
##    sex    age    trt     N_Died  p_Died Total
##    <chr>  <chr>  <chr>    <int>   <dbl> <int>
##  1 Female 1 to 2 Placebo     72 0.0268   2687
##  2 Female 1 to 2 Vit A       52 0.0187   2776
##  3 Female 3 to 4 Placebo     25 0.00974  2567
##  4 Female 3 to 4 Vit A       15 0.00590  2544
##  5 Female <1     Placebo     69 0.0536   1288
##  6 Female <1     Vit A       54 0.0401   1345
##  7 Male   1 to 2 Placebo     47 0.0167   2817
##  8 Male   1 to 2 Vit A       40 0.0139   2877
##  9 Male   3 to 4 Placebo     26 0.00962  2703
## 10 Male   3 to 4 Vit A       12 0.00434  2764
## 11 Male   <1     Placebo     51 0.0384   1327
## 12 Male   <1     Vit A       60 0.0421   1426
## alternatively, calculate the CIs directly within each age/sex strata
nepal621 %>%
  group_by(sex, age) %>%
  summarize(N_Plac = sum(trt=="Placebo"),
            p_Plac = sum(status=="Died" & trt=="Placebo")/N_Plac,
            N_VitA = sum(trt=="Vit A"),
            p_VitA = sum(status=="Died" & trt=="Vit A")/N_VitA,
            diff = p_Plac - p_VitA,
            se = sqrt(p_Plac*(1 - p_Plac)/N_Plac + p_VitA*(1 - p_VitA)/N_VitA),
            CI_L = diff - 1.96*se,
            CI_U = diff + 1.96*se)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 10
## # Groups:   sex [2]
##   sex    age    N_Plac  p_Plac N_VitA  p_VitA     diff      se      CI_L    CI_U
##   <chr>  <chr>   <int>   <dbl>  <int>   <dbl>    <dbl>   <dbl>     <dbl>   <dbl>
## 1 Female 1 to 2   2687 0.0268    2776 0.0187   0.00806 0.00404  0.000144 0.0160 
## 2 Female 3 to 4   2567 0.00974   2544 0.00590  0.00384 0.00246 -0.000983 0.00867
## 3 Female <1       1288 0.0536    1345 0.0401   0.0134  0.00825 -0.00274  0.0296 
## 4 Male   1 to 2   2817 0.0167    2877 0.0139   0.00278 0.00325 -0.00360  0.00916
## 5 Male   3 to 4   2703 0.00962   2764 0.00434  0.00528 0.00226  0.000856 0.00970
## 6 Male   <1       1327 0.0384    1426 0.0421  -0.00364 0.00749 -0.0183   0.0110

3. Display the confidence intervals for the six strata and for the overall groups (from step iii) on a graph

# draw the graph for the above 95% ci
#utils::install.packages("Hmisc")
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(dplyr)

dataForCIplot = nepal621 %>%
  group_by(sex, age) %>%
  dplyr::summarize(N_Plac = sum(trt=="Placebo"),
            p_Plac = sum(status=="Died" & trt=="Placebo")/N_Plac,
            N_VitA = sum(trt=="Vit A"),
            p_VitA = sum(status=="Died" & trt=="Vit A")/N_VitA,
            diff = p_Plac - p_VitA,
            se = sqrt(p_Plac*(1 - p_Plac)/N_Plac + p_VitA*(1 - p_VitA)/N_VitA),
            CI_L = diff - 1.96*se,
            CI_U = diff + 1.96*se)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
agestrata = c(1,2,3,4,5,6,7)
agestrata_labels = c("F < 1", "F 1-2", "F 3-4", "M < 1", "M 1-2", "M 3-4", "Overall")
diff = c(dataForCIplot$diff, 0.0047)
LL = c(dataForCIplot$CI_L, 0.00142)
UL = c(dataForCIplot$CI_U, 0.00798)

## Add labels to the axes
errbar(x = agestrata,
       y = diff,
       yplus = LL,
       yminus = UL,
       xaxt = "n",     #xaxt removes the numberic lables
       xlab = "Age/Gender Group",  #label for x-axis
       ylab = "Difference in Mortality Rates (Placebo - VitA)")  #label for y-axis()

## Add a title
title(main="95% Confidence Intervals for Difference in Mortality Rates")

## Add group labels for the age-gender groups
axis(side=1, #1 = the bottom of graph
     at=agestrata,   #where on x-axis; same as "x" in errbar
     labels=agestrata_labels)  #what the labels are

# Add horizontal line at zero
abline(h=0, col="red")

4. linear regression model

  1. β0 : the risk of death in the placebo group.
  2. β0 + β1 : the risk of death in the vitamin A group.
  3. β1 : the difference in risk between vitamin A and placebo groups.
  4. When comparing to the estimates of the following key parameters from R output with what I obtained from the above parts 3 and 4, I can find the β0, the coefficient for placebo, is same as the estimated β0 from our data as 290/13389 = 0.0217, which is P(Y=1| placebo) = β0 + β1(0). In addition, the β1, the coefficient for vitamin A, is same as the estimated β1 with subtracting the two proportions from our data, which is P(Y=1| vitamin A) - P(Y=1| placebo) = {β0 + β1(1)} –{β0 + β1(0)}.

5. Explain in a brief sentence that estimated difference in mortality rates with an expression of uncertainty in this estimate for all children.

Vitamin A supplementation is associated with large reductions in mortality in children. Our results indicate statistically significant decrease in the mortality rates for the vitamin A group compared with that of the control group in overall children (risk difference = 0.0047, 95% CI = (0.00798, 0.00142)). When comparing with each age-gender strata, we can find vitamine A supplementation significantly decreases the mortality rate in only girls aged under 1 year and boys aged 1-2 year. Several factors such as risk of bias in study design, small sample size, and risk of measurement error can contribute to increase uncertainty in our estimated values.

# linear regression
model1 = glm(as.factor(status) ~ trt, data=nepal621, family=binomial(link="identity"))
summary(model1)
## 
## Call:
## glm(formula = as.factor(status) ~ trt, family = binomial(link = "identity"), 
##     data = nepal621)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.021660   0.001258  17.217  < 2e-16 ***
## trtVit A    -0.004692   0.001673  -2.805  0.00503 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5166.0  on 27120  degrees of freedom
## Residual deviance: 5158.1  on 27119  degrees of freedom
## AIC: 5162.1
## 
## Number of Fisher Scoring iterations: 2
confint(model1)
## Waiting for profiling to be done...
##                    2.5 %       97.5 %
## (Intercept)  0.019284720  0.024217501
## trtVit A    -0.007988047 -0.001420949

Part 2. Vitamin A Supplementation to Prevent Children’s Mortality in Nepal

6. With binary outcome(< 3 years, ≥ 3 years), Estimate the odds ratio with a 95% confidence interval for vitamin A exposure by vital status within each age stratum.

Based on the following logistic regression, there are statistically significant difference in the log odds between vitamin A and placedo group in both age strata.

Among children aged more than 3 years, I can estimate the odds of death is statistically significantly reduced by 48% in children who have vitamin A (OR = 0.523, 95%CI = (0.328, 0.835)). With the confidence intervals excluding the null value, I can reject the null hypothesis that vitamin A is not effective for the age group. Among children aged less than 3 years, I can estimate the odds of death is reduced by 18% in children who have vitamin A (OR = 0.826, 95%CI = (0.684, 0.998)). I can also reject the null hypothesis with the confidence intervals.

Among children aged more than 3 years, the odds of death is statistically significantly reduced by 48% in children who have vitamin A compared with those without vitamin A (OR = 0.523, 95%CI = (0.328, 0.835)). Among children aged less than 3 years, the odds of death statistically significantly decrease by 18% in children who have vitamin A compared with the control group (OR = 0.826, 95%CI = (0.684, 0.998)). As a result, vitamin A supplementation is associated with reductions in mortality in children.

# import dataset
library(tidyverse)
nepal621 = read_csv("nepal621.csv")
## Rows: 27121 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): sex, age, trt, status
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create two age groups (< 3 years, ≥ 3 years)
nepal621 = nepal621 %>%
  mutate(agegp = ifelse(age == "3-4", "3+ years", "<3 years"))

# Calculates the odds by age group and trt:
nepal621 %>%
  group_by(agegp, trt) %>%
  dplyr::summarize(N_Alive = sum(status=="Alive"),
            N_Died = sum(status=="Died"),
            Odds = N_Died/N_Alive)
## `summarise()` has grouped output by 'agegp'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 5
## # Groups:   agegp [2]
##   agegp    trt     N_Alive N_Died    Odds
##   <chr>    <chr>     <int>  <int>   <dbl>
## 1 3+ years Placebo    5219     51 0.00977
## 2 3+ years Vit A      5281     27 0.00511
## 3 <3 years Placebo    7880    239 0.0303 
## 4 <3 years Vit A      8218    206 0.0251
nepal621 %>%
  group_by(agegp) %>%
  dplyr::summarize(N_Alive_P = sum(status=="Alive" & trt=="Placebo"),
            N_Died_P = sum(status=="Died" & trt=="Placebo"),
            N_Alive_V = sum(status=="Alive" & trt=="Vit A"),
            N_Died_V = sum(status=="Died" & trt=="Vit A"),
            OR = (N_Died_V/N_Alive_V)/(N_Died_P/N_Alive_P),
            se = sqrt(1/N_Alive_P + 1/N_Died_P + 1/N_Alive_V + 1/N_Died_V),
            CI_L = exp(log(OR)-1.96*se),
            CI_U = exp(log(OR)+1.96*se))
## # A tibble: 2 × 9
##   agegp    N_Alive_P N_Died_P N_Alive_V N_Died_V    OR     se  CI_L  CI_U
##   <chr>        <int>    <int>     <int>    <int> <dbl>  <dbl> <dbl> <dbl>
## 1 3+ years      5219       51      5281       27 0.523 0.239  0.328 0.835
## 2 <3 years      7880      239      8218      206 0.826 0.0964 0.684 0.998

7. estimate the odds ratio by using a logistic regression of the binary

Now separately for each age stratum, estimate the odds ratio by using a logistic regression of the binary survival indicator on vitamin A.

# seperately for each age startum, estimate the odds ratio by using a logistic regression of the binary survival indicator
nepal621.lowage = nepal621 %>% filter(agegp == "<3 years")
model2 = glm(as.factor(status) ~ trt, data=nepal621.lowage,
             family=binomial(link="logit"))

summary(model2)  # This summary is on the logOR scale
## 
## Call:
## glm(formula = as.factor(status) ~ trt, family = binomial(link = "logit"), 
##     data = nepal621.lowage)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.49562    0.06566 -53.240   <2e-16 ***
## trtVit A    -0.19059    0.09637  -1.978    0.048 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4095.8  on 16542  degrees of freedom
## Residual deviance: 4091.9  on 16541  degrees of freedom
## AIC: 4095.9
## 
## Number of Fisher Scoring iterations: 6
exp(model2$coefficients)  # We exponentiate to get on the OR scale
## (Intercept)    trtVit A 
##  0.03032995  0.82647439
exp(confint(model2))  # We can also exponentiate the CI to the OR scale
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) 0.02659672 0.03440764
## trtVit A    0.68385047 0.99800770
nepal621.highage = nepal621 %>% filter(agegp == "3+ years")
model3 = glm(as.factor(status) ~ trt, data=nepal621.highage,
             family=binomial(link="logit"))

summary(model3)
## 
## Call:
## glm(formula = as.factor(status) ~ trt, family = binomial(link = "logit"), 
##     data = nepal621.highage)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.6282     0.1407 -32.892  < 2e-16 ***
## trtVit A     -0.6478     0.2388  -2.713  0.00667 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 921.36  on 10577  degrees of freedom
## Residual deviance: 913.62  on 10576  degrees of freedom
## AIC: 917.62
## 
## Number of Fisher Scoring iterations: 8
exp(model3$coefficients)
## (Intercept)    trtVit A 
## 0.009771987 0.523196364
exp(confint(model3))
## Waiting for profiling to be done...
##                   2.5 %     97.5 %
## (Intercept) 0.007320666 0.01272323
## trtVit A    0.323301346 0.82799851