Using R for Reproducible and Transparent Health Research

Boni M. Ale, MD, MSc, MPH

July 24, 2024

Outline

  1. Why reproducibility matters
  2. Limitations of point-and-click software
  3. Advantages of Programming Software
  4. Reproducibility with R
  5. Transparency with R
  6. Example case study
  7. Overcoming Challenges
  8. Q&A session

Why Reproducibility Matters

Definition of Reproducibility:

  • Ability to achieve consistent results using the same data and methods.

Definition of Transparency:

  • Clear and accessible documentation of methods and data.

Why Reproducibility Matters Cont’

  • Ensures integrity and credibility of research.
  • Facilitates peer review and verification.
  • Enhances collaboration and knowledge sharing.

Limitations of point-and-click software

  • Lack of Scriptability:
    • Difficult to automate repetitive tasks.
  • Poor Documentation:
    • Limited ability to track and document the analytical process.
  • Reproducibility Challenges:
    • Hard to replicate analyses without exact same software version and settings.

Limitations of point-and-click software Cont’

  • Transparency Issues:
    • Opaque processes and settings make it hard to understand how results are obtained.

Advantages of Programming Software

  • Script-Based Approach:
    • Everything is coded, allowing for automation and easy replication.
  • Comprehensive Documentation:
    • Code scripts serve as documentation of the entire analysis process.
  • Version Control:
    • Integration with tools like Git for tracking changes and collaboration.

Advantages of Programming Software

  • Extensive Libraries:
    • CRAN, Bioconductor, and other repositories offer a vast range of packages.
  • Community Support:
    • Large and active user community for support and sharing best practices.

Reproducibility with R

  • Script Sharing:
    • Easily share scripts via email, GitHub, GitLab or other platforms.
  • Version Control:
    • Use Git for tracking changes and collaborating with others.
  • Environment Management:
    • RStudio projects and renv for managing package versions and dependencies.

Transparency with R

  • Commenting Code:
    • Use comments to explain each step of the analysis.
  • Documentation:
    • Use RMarkdown and Quarto for combining code and documentation.
    • Create detailed reports and presentations.

Automated Publisheable Table

Code
bas_bp_summary <- bas_bp %>%
  dplyr::select(
    `Treatment Protocol`,
    `Age, Mean (SD), years`,
    `Sex, No. (%)`,
    `Body mass index, Mean (SD), kg/m2`,
    #`Body mass index class, No. (%)`, 
    `Education, No. (%)`,
    `History of hypertension, No. (%)`,
    `History of diabetes, No. (%)`,
    `History of chronic kidney disease, No. (%)`,
    `History of stroke, No. (%)`,
    `History of heart attack, No. (%)`,
    `Smoker/Tobacco user, No. (%)`, 
    `Alcohol user, No. (%)`,
    `Systolic blood pressure, Mean (SD), mmHg`,
    `Diastolic blood pressure, Mean (SD), mmHg`,
    `Hypertension Control, No. (%)`,
    `Heart rate, Mean (SD), bpm`,
    `Baseline treatment rate, No. (%)`
  )

table_1 <- tbl_summary(bas_bp_summary,
                       by = `Treatment Protocol`,
                       missing = "no",
                       digits = list(all_categorical() ~ c(0, 1), all_continuous() ~ c(0, 1)),
                       statistic = list(all_continuous() ~ "{mean} ({sd})",
                                        all_categorical() ~ "{n} ({p})")) %>% 
  # add_n() %>% 
  add_p(pvalue_fun = ~ style_pvalue(.x, digits = 2)) %>% 
  add_overall() %>% 
  modify_header(label = "**Variables**") %>% 
  bold_p() %>% 
  modify_caption("**Table 1**. Patient baseline demographics, clinical characteristics and blood pressure overall and by study arm.") %>% 
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Protocol Received**") %>%   modify_footnote(
    update = everything() ~ NA
  )

table_1 
Table 1. Patient baseline demographics, clinical characteristics and blood pressure overall and by study arm.
Variables Overall, N = 4,427 Treatment Protocol Received p-value
Protocol 1, N = 2,046 Protocol 2, N = 2,381
Age, Mean (SD), years 49 (12.4) 50 (12.0) 49 (12.7) 0.083
Sex, No. (%)


0.013
    Male 1,308 (29.5) 642 (31.4) 666 (28.0)
    Female 3,119 (70.5) 1,404 (68.6) 1,715 (72.0)
Body mass index, Mean (SD), kg/m2 28 (6.1) 28 (6.1) 28 (6.1) 0.23
Education, No. (%)


<0.001
    Never attended 1,179 (26.7) 614 (30.0) 565 (23.8)
    Primary 791 (17.9) 379 (18.5) 412 (17.3)
    Secondary 1,186 (26.8) 468 (22.9) 718 (30.2)
    Higher 1,219 (27.6) 574 (28.1) 645 (27.1)
    Other 49 (1.1) 11 (0.5) 38 (1.6)
History of hypertension, No. (%) 2,286 (51.9) 1,094 (53.8) 1,192 (50.2) 0.019
History of diabetes, No. (%) 205 (4.6) 80 (3.9) 125 (5.3) 0.035
History of chronic kidney disease, No. (%) 2 (0.0) 1 (0.0) 1 (0.0) >0.99
History of stroke, No. (%) 23 (0.5) 8 (0.4) 15 (0.6) 0.27
History of heart attack, No. (%) 6 (0.1) 1 (0.0) 5 (0.2) 0.23
Smoker/Tobacco user, No. (%) 54 (1.2) 34 (1.7) 20 (0.8) 0.013
Alcohol user, No. (%) 188 (4.3) 105 (5.1) 83 (3.5) 0.007
Systolic blood pressure, Mean (SD), mmHg 155 (20.6) 154 (20.4) 155 (20.7) 0.25
Diastolic blood pressure, Mean (SD), mmHg 96 (13.1) 95 (13.0) 96 (13.2) 0.061
Hypertension Control, No. (%) 626 (14.1) 300 (14.7) 326 (13.7) 0.36
Heart rate, Mean (SD), bpm 83 (13.6) 83 (13.5) 83 (13.7) 0.036
Baseline treatment rate, No. (%) 4,251 (96.0) 1,990 (97.3) 2,261 (95.0) <0.001

Automated Publisheable Graphs

Code
# summary data on htn control
htnc_b <- dat_final %>% 
  group_by(id) %>%
  #filter(time > 2) %>%
  filter(time == min(time)) %>% 
  group_by(study_arm) %>%
  count(htn_c) %>%
  mutate(total = sum(n)) %>% 
  filter(htn_c == "1") %>%
   mutate( #total = sum(n),
           rate = map2(n, total, ~ prop.test(.x, .y, conf.level = 0.95) %>%
                   broom::tidy()),
           htn_c = factor(htn_c, levels = c("0", "1"), labels = c("Not controlled", "Controlled"))) %>%
  unnest(rate) %>%
  filter(htn_c == "Controlled") %>% 
  mutate(
    n_base = n,
    total_base = total,
    estimate_base = estimate
  )

htn_control_baseline_itt <- htnc_b %>% 
  dplyr::select(
    study_arm,
    n_base,
    total_base,
    estimate_base
  )


htn_control_finale_itt <- htnc_b %>% 
  dplyr::select(
    study_arm,
    n_finale = n,
    total_finale = total,
    estimate_finale = estimate
  )



## selecting people who attended the PHC twice and are controlled at the last visit
htnc_new <- dat_final %>% 
  group_by(id) %>%
  filter(time > 2) %>%
  filter(time == max(time)) %>%
  group_by(study_arm) %>%
  count(htn_c) %>% 
  mutate(total = sum(n)) %>% 
  filter(htn_c == "1") %>% 
   mutate( #total = sum(n),
           rate = map2(n, total, ~ prop.test(.x, .y, conf.level = 0.95) %>%
                   broom::tidy()),
           htn_c = factor(htn_c, levels = c("0", "1"), labels = c("Not controlled", "Controlled"))) %>%
  unnest(rate) %>%
  filter(htn_c == "Controlled") %>% 
   mutate(
    n_finale = n,
    total_finale = total,
    estimate_finale = estimate
  )


htn_control_baseline <- htnc_b %>% 
  dplyr::select(
    study_arm,
    n_base,
    total_base,
    estimate_base
  )

htn_control_finale <- htnc_new %>% 
  dplyr::select(
    study_arm,
    n_finale = n,
    total_finale = total,
    estimate_finale = estimate
  )



## difference between hypertension control at baseline and at last visit
## let's join the database of htn control at baseline and at the last visit. 

htn_control_diff <- htn_control_baseline %>% 
  left_join(htn_control_finale, by = "study_arm") %>% 
  mutate(
    htn_diff = estimate_finale - estimate_base
  )

## let's compare the proportion of htn control in baseline and at last visit in each study arm









controlled_least_twice <- ggplot(data = htnc_new, aes(x = study_arm, y = estimate)) + 
  geom_col(aes(fill = htn_c), position = "dodge", show.legend = FALSE) + 
    geom_errorbar(
    aes(ymin = conf.low,
        ymax = conf.high),
     position = position_dodge(0.9),
     width = 0.2
  ) + 
  geom_text(aes(label = scales::percent(estimate), 
                y = estimate,
                group = htn_c),
            position = position_dodge(width = 0.9),
            vjust = -2.1) +
  scale_color_jama() +
  #scale_fill_manual(values=c("grey70", "grey70")) + 
  labs(
    x = "Treatment Protocol",
    y = "Percentage (%) of HTN Controlled at last visit",
    fill = "Hypertension"
  ) +  
   scale_y_continuous(limits = c(0, 1), labels = scales::percent_format()) +
   theme_bw() +
  geom_signif(comparisons=list(c("1", "2")), annotations="Cluster adjusted p-value = 0.29",
              y_position = 0.68,
              tip_length = 0.03, 
              vjust = -0.005
              )
  
  

controlled_least_twice 

Overcoming Challenges

  • Learning Curve:
    • Resources for learning R (online courses, books, community forums).
  • Transition from Point-and-Click:
    • Strategies for transitioning to R (start small, build confidence).
  • Collaborative Work:
    • Using version control and collaborative platforms (GitHub, GitLab).

Questions And Answers

Asante Sana