Using R for Reproducible and Transparent Health Research

Boni M. Ale, MD, MSc, MPH

July 24, 2024

Outline

Why reproducibility matters
Limitations of point-and-click software
Advantages of Programming Software
Reproducibility with R
Transparency with R
Example case study
Overcoming Challenges
Q&A session

Why Reproducibility Matters

Definition of Reproducibility:

Ability to achieve consistent results using the same data and methods.

Definition of Transparency:

Clear and accessible documentation of methods and data.

Why Reproducibility Matters Cont’

Ensures integrity and credibility of research.
Facilitates peer review and verification.
Enhances collaboration and knowledge sharing.

Limitations of point-and-click software

Lack of Scriptability:
- Difficult to automate repetitive tasks.
Poor Documentation:
- Limited ability to track and document the analytical process.
Reproducibility Challenges:
- Hard to replicate analyses without exact same software version and settings.

Limitations of point-and-click software Cont’

Transparency Issues:
- Opaque processes and settings make it hard to understand how results are obtained.

Advantages of Programming Software

Script-Based Approach:
- Everything is coded, allowing for automation and easy replication.
Comprehensive Documentation:
- Code scripts serve as documentation of the entire analysis process.
Version Control:
- Integration with tools like Git for tracking changes and collaboration.

Advantages of Programming Software

Extensive Libraries:
- CRAN, Bioconductor, and other repositories offer a vast range of packages.
Community Support:
- Large and active user community for support and sharing best practices.

Reproducibility with R

Script Sharing:
- Easily share scripts via email, GitHub, GitLab or other platforms.
Version Control:
- Use Git for tracking changes and collaborating with others.
Environment Management:
- RStudio projects and renv for managing package versions and dependencies.

Transparency with R

Commenting Code:
- Use comments to explain each step of the analysis.
Documentation:
- Use RMarkdown and Quarto for combining code and documentation.
- Create detailed reports and presentations.

Automated Publisheable Table

Code

bas_bp_summary <- bas_bp %>%
  dplyr::select(
    `Treatment Protocol`,
    `Age, Mean (SD), years`,
    `Sex, No. (%)`,
    `Body mass index, Mean (SD), kg/m2`,
    #`Body mass index class, No. (%)`, 
    `Education, No. (%)`,
    `History of hypertension, No. (%)`,
    `History of diabetes, No. (%)`,
    `History of chronic kidney disease, No. (%)`,
    `History of stroke, No. (%)`,
    `History of heart attack, No. (%)`,
    `Smoker/Tobacco user, No. (%)`, 
    `Alcohol user, No. (%)`,
    `Systolic blood pressure, Mean (SD), mmHg`,
    `Diastolic blood pressure, Mean (SD), mmHg`,
    `Hypertension Control, No. (%)`,
    `Heart rate, Mean (SD), bpm`,
    `Baseline treatment rate, No. (%)`
  )

table_1 <- tbl_summary(bas_bp_summary,
                       by = `Treatment Protocol`,
                       missing = "no",
                       digits = list(all_categorical() ~ c(0, 1), all_continuous() ~ c(0, 1)),
                       statistic = list(all_continuous() ~ "{mean} ({sd})",
                                        all_categorical() ~ "{n} ({p})")) %>% 
  # add_n() %>% 
  add_p(pvalue_fun = ~ style_pvalue(.x, digits = 2)) %>% 
  add_overall() %>% 
  modify_header(label = "**Variables**") %>% 
  bold_p() %>% 
  modify_caption("**Table 1**. Patient baseline demographics, clinical characteristics and blood pressure overall and by study arm.") %>% 
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Protocol Received**") %>%   modify_footnote(
    update = everything() ~ NA
  )

table_1

**Table 1**. Patient baseline demographics, clinical characteristics and blood pressure overall and by study arm.
Variables	Overall, N = 4,427	Treatment Protocol Received		p-value
Variables	Overall, N = 4,427	Protocol 1, N = 2,046	Protocol 2, N = 2,381	p-value
Age, Mean (SD), years	49 (12.4)	50 (12.0)	49 (12.7)	0.083
Sex, No. (%)				0.013
Male	1,308 (29.5)	642 (31.4)	666 (28.0)
Female	3,119 (70.5)	1,404 (68.6)	1,715 (72.0)
Body mass index, Mean (SD), kg/m2	28 (6.1)	28 (6.1)	28 (6.1)	0.23
Education, No. (%)				<0.001
Never attended	1,179 (26.7)	614 (30.0)	565 (23.8)
Primary	791 (17.9)	379 (18.5)	412 (17.3)
Secondary	1,186 (26.8)	468 (22.9)	718 (30.2)
Higher	1,219 (27.6)	574 (28.1)	645 (27.1)
Other	49 (1.1)	11 (0.5)	38 (1.6)
History of hypertension, No. (%)	2,286 (51.9)	1,094 (53.8)	1,192 (50.2)	0.019
History of diabetes, No. (%)	205 (4.6)	80 (3.9)	125 (5.3)	0.035
History of chronic kidney disease, No. (%)	2 (0.0)	1 (0.0)	1 (0.0)	>0.99
History of stroke, No. (%)	23 (0.5)	8 (0.4)	15 (0.6)	0.27
History of heart attack, No. (%)	6 (0.1)	1 (0.0)	5 (0.2)	0.23
Smoker/Tobacco user, No. (%)	54 (1.2)	34 (1.7)	20 (0.8)	0.013
Alcohol user, No. (%)	188 (4.3)	105 (5.1)	83 (3.5)	0.007
Systolic blood pressure, Mean (SD), mmHg	155 (20.6)	154 (20.4)	155 (20.7)	0.25
Diastolic blood pressure, Mean (SD), mmHg	96 (13.1)	95 (13.0)	96 (13.2)	0.061
Hypertension Control, No. (%)	626 (14.1)	300 (14.7)	326 (13.7)	0.36
Heart rate, Mean (SD), bpm	83 (13.6)	83 (13.5)	83 (13.7)	0.036
Baseline treatment rate, No. (%)	4,251 (96.0)	1,990 (97.3)	2,261 (95.0)	<0.001

Automated Publisheable Graphs

Code

# summary data on htn control
htnc_b <- dat_final %>% 
  group_by(id) %>%
  #filter(time > 2) %>%
  filter(time == min(time)) %>% 
  group_by(study_arm) %>%
  count(htn_c) %>%
  mutate(total = sum(n)) %>% 
  filter(htn_c == "1") %>%
   mutate( #total = sum(n),
           rate = map2(n, total, ~ prop.test(.x, .y, conf.level = 0.95) %>%
                   broom::tidy()),
           htn_c = factor(htn_c, levels = c("0", "1"), labels = c("Not controlled", "Controlled"))) %>%
  unnest(rate) %>%
  filter(htn_c == "Controlled") %>% 
  mutate(
    n_base = n,
    total_base = total,
    estimate_base = estimate
  )

htn_control_baseline_itt <- htnc_b %>% 
  dplyr::select(
    study_arm,
    n_base,
    total_base,
    estimate_base
  )


htn_control_finale_itt <- htnc_b %>% 
  dplyr::select(
    study_arm,
    n_finale = n,
    total_finale = total,
    estimate_finale = estimate
  )



## selecting people who attended the PHC twice and are controlled at the last visit
htnc_new <- dat_final %>% 
  group_by(id) %>%
  filter(time > 2) %>%
  filter(time == max(time)) %>%
  group_by(study_arm) %>%
  count(htn_c) %>% 
  mutate(total = sum(n)) %>% 
  filter(htn_c == "1") %>% 
   mutate( #total = sum(n),
           rate = map2(n, total, ~ prop.test(.x, .y, conf.level = 0.95) %>%
                   broom::tidy()),
           htn_c = factor(htn_c, levels = c("0", "1"), labels = c("Not controlled", "Controlled"))) %>%
  unnest(rate) %>%
  filter(htn_c == "Controlled") %>% 
   mutate(
    n_finale = n,
    total_finale = total,
    estimate_finale = estimate
  )


htn_control_baseline <- htnc_b %>% 
  dplyr::select(
    study_arm,
    n_base,
    total_base,
    estimate_base
  )

htn_control_finale <- htnc_new %>% 
  dplyr::select(
    study_arm,
    n_finale = n,
    total_finale = total,
    estimate_finale = estimate
  )



## difference between hypertension control at baseline and at last visit
## let's join the database of htn control at baseline and at the last visit. 

htn_control_diff <- htn_control_baseline %>% 
  left_join(htn_control_finale, by = "study_arm") %>% 
  mutate(
    htn_diff = estimate_finale - estimate_base
  )

## let's compare the proportion of htn control in baseline and at last visit in each study arm









controlled_least_twice <- ggplot(data = htnc_new, aes(x = study_arm, y = estimate)) + 
  geom_col(aes(fill = htn_c), position = "dodge", show.legend = FALSE) + 
    geom_errorbar(
    aes(ymin = conf.low,
        ymax = conf.high),
     position = position_dodge(0.9),
     width = 0.2
  ) + 
  geom_text(aes(label = scales::percent(estimate), 
                y = estimate,
                group = htn_c),
            position = position_dodge(width = 0.9),
            vjust = -2.1) +
  scale_color_jama() +
  #scale_fill_manual(values=c("grey70", "grey70")) + 
  labs(
    x = "Treatment Protocol",
    y = "Percentage (%) of HTN Controlled at last visit",
    fill = "Hypertension"
  ) +  
   scale_y_continuous(limits = c(0, 1), labels = scales::percent_format()) +
   theme_bw() +
  geom_signif(comparisons=list(c("1", "2")), annotations="Cluster adjusted p-value = 0.29",
              y_position = 0.68,
              tip_length = 0.03, 
              vjust = -0.005
              )
  
  

controlled_least_twice

Overcoming Challenges

Learning Curve:
- Resources for learning R (online courses, books, community forums).
Transition from Point-and-Click:
- Strategies for transitioning to R (start small, build confidence).
Collaborative Work:
- Using version control and collaborative platforms (GitHub, GitLab).

Questions And Answers

Asante Sana