How to Do the Stats for Your Fellowship Project

Author

Brian Locke, MD MSCI

Learning Objectives: 

  1. Set Up Statistical Software
  2. Share “tacit knowledge” (how it’s actually done) of how to choose a statistical analysis and perform it.
  3. Understand the structure of frequentist statistical tests
  4. Understand the logic of regression models

Bad news: you’ll need to learn how to do statistical analyses:

  • realistically, you’ll probably have to do at least some of your own stats
  • if you join a project that has a statistician (or get CTSI assistance), you’ll still need to be able to specify, critique, and interpret the analyses. It helps substantially to have (at least) a basic working knowledge.

Statistical Programming

|  | R | Stata | Python |
|---|---|---|---|
| Cost | Free | Requires a license | Free |
| IDE | RStudio | Built-in editor | Many (VS Code is best) |
| Strengths | Best libraries of helpful functions for epi/trials | Simple functionality; powerful quasi-experimental/meta-analysis; U of U MSCI uses it | Best NLP and machine learning libraries |
| Weakness | Clunky syntax; many ‘dialects’ | Simple syntax | Moderately complex syntax |
| Explainable programming* | Quarto | No options | Jupyter |

*The idea that code should be readable by consumers of science has caught on in more quantitative fields (math, CS) and is coming to medicine. Long overdue. Learn it now.

Objective 1: Get Set Up

Step 1: Install the R language: https://cran.r-project.org/
Step 2: Install RStudio: https://posit.co/downloads/ (RStudio is an IDE, i.e., an integrated development environment)
Step 3: Install Quarto (the successor to R Markdown): https://quarto.org/docs/get-started/ Quarto facilitates sharing and explaining your code and will soon be standard in medical science.
Step 4: Download this document, “fellow_stats.qmd”, from https://github.com/reblocke/fellow_stats
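
Once everything is installed, install the R packages used in the examples below from the RStudio console (a one-time step; readxl is one common choice for reading Excel files and is assumed in the loading code later on):

# Install the packages used in this workbook (run once)
install.packages(c("readxl", "dplyr", "ggplot2", "meta", "RTSA", "tidyverse"))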

This page is made using Quarto.

Objective 2(a): How to do a statistical analysis

Step 0: Save yourself a headache and collect your data in a processable format https://open.substack.com/pub/statsepi/p/simple-tips-for-recording-data-in 

Step 1: Data Wrangling

  • Each row is an observation (usually a patient)
  • Each column contains only 1 type of data (more below)
  • No free text (if you need to, categorize responses)

Step 2: For each data element, consider the data type

  • Binary (aka dichotomous scale): e.g. Yes or No, 0 or 1
  • Unordered Categorical (nominal scale): e.g. Utah, Colorado, Nevada, Idaho
  • Ordered Categorical (ordinal scale): e.g. Room air, nasal cannula, HFNC, intubated, ECMO, dead
  • Continuous (interval & ratio scales - differ by whether 0 is special): e.g. Temperature (Celsius or Kelvin, respectively)
|  | dichotomous | nominal | ordinal | interval |
|---|---|---|---|---|
| a.k.a. | binary | categorical | ordered categorical | continuous |
| n | X | X | X | X |
| % | X | X | X | X |
| min |  |  | X | X |
| max |  |  | X | X |
| range |  |  | X | X |
| mode | X | X | X | X |
| mean |  |  |  | X |
| median |  |  | X | X |
| IQR |  |  | X | X |
| Std. dev. |  |  |  | X |
| Std. err. |  |  |  | X |

From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.
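
In R, these levels of measurement map onto column classes (a minimal sketch; the variable names are illustrative):

# How the four data types are typically represented in R
binary     <- c(TRUE, FALSE, TRUE)                        # dichotomous -> logical
nominal    <- factor(c("Utah", "Colorado", "Nevada"))     # unordered categorical -> factor
ordinal    <- factor(c("room air", "nasal cannula", "HFNC"),
                     levels = c("room air", "nasal cannula", "HFNC"),
                     ordered = TRUE)                      # ordered categorical -> ordered factor
continuous <- c(36.6, 38.2, 37.1)                         # interval/ratio -> numeric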

Step 3: Visualize the distribution of each data-point (detect outliers, data entry errors, etc.)

Darren’s hypothetical data live in a spreadsheet, “darren_proj.xlsx”.

Here is some code that loads the Excel spreadsheet into R (we’ll revisit it later):
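
# A minimal sketch, assuming darren_proj.xlsx sits next to this .qmd file
# (readxl is one common package for reading Excel spreadsheets)
library(readxl)
library(dplyr)

darren_data_sheet <- read_excel("darren_proj.xlsx")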


Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

It’s already (mostly) clean.

Let’s summarize it:

summary(darren_data_sheet)
   patient_id    splenectomy        prox_v_dist           qanadli     
 Min.   : 1.00   Length:20          Length:20          Min.   : 2.00  
 1st Qu.: 5.75   Class :character   Class :character   1st Qu.: 3.75  
 Median :10.50   Mode  :character   Mode  :character   Median :10.00  
 Mean   :10.50                                         Mean   :10.30  
 3rd Qu.:15.25                                         3rd Qu.:15.00  
 Max.   :20.00                                         Max.   :25.00  
   got_cteph?       hosp          
 Min.   :0.00   Length:20         
 1st Qu.:0.00   Class :character  
 Median :0.00   Mode  :character  
 Mean   :0.25                     
 3rd Qu.:0.25                     
 Max.   :1.00                     

Hmmm… what’s wrong with this?

R needs to be told that the binary variables are binary (and not character strings):

library(dplyr)

# Convert 'y'/'n' in the splenectomy column to TRUE/FALSE
darren_data_sheet <- darren_data_sheet %>%
  mutate(splenectomy = ifelse(splenectomy == "y", TRUE, FALSE))

# Convert 1/0 in the got_cteph? column to TRUE/FALSE
darren_data_sheet <- darren_data_sheet %>%
  mutate(`got_cteph?` = ifelse(`got_cteph?` == 1, TRUE, FALSE))
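
A quick check that the recoding worked (both columns should now be logical):

# Re-summarize the two recoded columns; both should show TRUE/FALSE counts
summary(darren_data_sheet$splenectomy)
summary(darren_data_sheet$`got_cteph?`)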

Let’s visualize each element:

library(ggplot2)

# First, the binary ones

# Plot for splenectomy
ggplot(darren_data_sheet, aes(x = factor(splenectomy))) +
  geom_bar() +
  labs(title = "Distribution of Splenectomy", x = "Splenectomy", y = "Count")

# Plot for prox_v_dist
ggplot(darren_data_sheet, aes(x = factor(prox_v_dist))) +
  geom_bar() +
  labs(title = "Distribution of Proximal vs. Distal", x = "Proximal vs Distal", y = "Count")

# Plot for got_cteph?
ggplot(darren_data_sheet, aes(x = factor(`got_cteph?`))) +
  geom_bar() +
  labs(title = "Distribution of CTEPH Diagnosis", x = "Got CTEPH?", y = "Count")

The categorical one:

# Bar chart for hosp
ggplot(darren_data_sheet, aes(x = factor(hosp))) +
  geom_bar(fill = "coral", color = "black") +
  labs(title = "Distribution of Hospital", x = "Hospital", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Adjust text angle for better readability if needed

and finally, the continuous one:

# Histogram for qanadli
ggplot(darren_data_sheet, aes(x = qanadli)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Histogram of Qanadli Scores", x = "Qanadli Score", y = "Frequency")

Objective 3: the Logic of Frequentist Inferential Statistics

David Hume: causation is never directly observed.

If there is an association between an ‘exposure’ and an ‘outcome’, there are 4 possible explanations

  1. Chance
  2. Confounding (some other factor influences the exposure and the outcome)
  3. Bias
  4. Or, causation (meaning, a real effect)

Disjunctive syllogism:

When you have eliminated the impossible, whatever remains, however improbable, must be the truth. 

- Sherlock Holmes

*note: Bayesian analysis follows a different logic. 

P-values address ONLY possibility 1: how likely is it that chance alone could explain the observed result? That is: if the two variables were truly unrelated, what is the likelihood that you’d see results as extreme or more extreme, just as a result of chance?

Consider: if I start flipping a coin, how many consecutive “Heads” need to occur before you’ll suspect it’s not a fair coin?

| Sequence | Flips | P-value |
|---|---|---|
| HH | 2 flips | 0.25 |
| HHH | 3 flips | 0.125 |
| HHHH | 4 flips | 0.0625 |
|  | 4.32 flips | 0.05 |
| HHHHH | 5 flips | 0.03125 |
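
The arithmetic behind the table: the chance of n consecutive heads from a fair coin is 0.5^n, so the number of flips corresponding to p = 0.05 is log(0.05)/log(0.5) ≈ 4.32:

# P(n consecutive heads | fair coin) = 0.5^n
0.5^(2:5)             # 0.25, 0.125, 0.0625, 0.03125
log(0.05) / log(0.5)  # ~4.32 flips corresponds to p = 0.05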

More at: https://stat.lesslikely.com/s-values/

IMPORTANT POINT: the p-value is NOT the probability that the coin is biased. It is the probability of seeing that result (or one more extreme) ASSUMING the coin is fair.

Sidebar: understanding multiplicity. If we all flipped a coin 5 times, what is the chance that at least one of us would get 5 heads in a row?

  • If 10 people flip: (1 − 0.03125)^10 ≈ 0.728 chance that no one gets HHHHH, i.e., a 27.2% chance of at least one HHHHH -> we’re back to weak evidence of a real effect (see the quick check below).
  • This is obviously true when multiple tests are reported, but it is less obviously also true if you try several analyses and choose the “best one” after seeing the results. Hence, prespecification.
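
Checking the multiplicity arithmetic in R:

# Chance that at least one of 10 people flips 5 heads in a row
1 - (1 - 0.5^5)^10  # ~0.272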

‘Signal to noise’: if the alternative hypothesis is that the coin is only subtly imbalanced, the signal will be much harder to detect. This is the logic of power analysis: if you’re looking for a subtle signal (e.g., a small difference, noisy data, or rare events), you’ll need a bigger study.
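
Base R’s power.t.test() illustrates the tradeoff (a sketch, assuming a two-group comparison of a continuous outcome with a standard deviation of 1):

# Subtle signal: detecting a 0.2-SD difference takes ~394 patients per group
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8)

# Obvious signal: detecting a 1-SD difference takes only ~17 per group
power.t.test(delta = 1.0, sd = 1, sig.level = 0.05, power = 0.8)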

However, there is no free lunch: every statistical test makes assumptions about the data, and the implications of the analysis follow only if those assumptions hold in reality.

Example interpretation:

If the p-value from a chi-squared test is p = 0.03, we say it is about as unlikely that this result arose from chance alone as it would be to flip a fair coin heads 5 times in a row.

IF observations are independent, both variables are categorical, and there are enough observations in each cell, THEN it is unlikely that chance alone explains the difference.

Using the usual significance threshold (alpha), IF the assumptions hold, we conclude it is unlikely that chance alone caused the finding (though it could still have been confounding, bias, or a real effect).

How do you choose the right test?

What type of variables? How many groups? Are the samples correlated (e.g. observation from the same patient at two different times)?

| Level of measurement of outcome variable | Two Independent Groups | Three or more Independent Groups | Two Correlated Samples | Three or more Correlated Samples |
|---|---|---|---|---|
| Dichotomous | chi-square or Fisher’s exact test | chi-square or Fisher-Freeman-Halton test | McNemar test | Cochran Q test |
| Unordered categorical | chi-square or Fisher-Freeman-Halton test | chi-square or Fisher-Freeman-Halton test | Stuart-Maxwell test | Multiplicity-adjusted Stuart-Maxwell tests |
| Ordered categorical | Wilcoxon-Mann-Whitney (WMW) test | Old school: Kruskal-Wallis one-way ANOVA; New school: multiplicity-adjusted WMW tests | Wilcoxon signed-rank test | Old school: Friedman two-way ANOVA by ranks; New school: multiplicity-adjusted Wilcoxon signed-rank tests |
| Continuous | independent-groups t-test | Old school: one-way ANOVA; New school: multiplicity-adjusted independent-groups t-tests | paired t-test | mixed-effects linear regression |
| Censored: time to event | log-rank test | Multiplicity-adjusted log-rank test | Shared-frailty Cox regression | Shared-frailty Cox regression |

From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.

How can you collaborate effectively with a statistician? They know these assumptions and can tell you when your analysis makes dubious ones (if you communicate the constraints of the problem correctly).

Examples from Darren

What test would we use to assess if “splenectomy” and “prox_v_dist” are associated beyond what’s attributable to chance?

To test if “splenectomy” and “hosp” are associated?

If “splenectomy” and “qanadli” are associated?

chi2_test_result <- chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$prox_v_dist)
Warning in chisq.test(darren_data_sheet$splenectomy,
darren_data_sheet$prox_v_dist): Chi-squared approximation may be incorrect
print(chi2_test_result)

    Pearson's Chi-squared test with Yates' continuity correction

data:  darren_data_sheet$splenectomy and darren_data_sheet$prox_v_dist
X-squared = 1.1122e-31, df = 1, p-value = 1
chi2_test_result <- chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$hosp)
Warning in chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$hosp):
Chi-squared approximation may be incorrect
print(chi2_test_result)

    Pearson's Chi-squared test

data:  darren_data_sheet$splenectomy and darren_data_sheet$hosp
X-squared = 0.13889, df = 3, p-value = 0.9868
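
Note the warnings: with only 20 patients, the expected cell counts are small, so the chi-squared approximation is unreliable. Per the table above, Fisher’s exact test is the usual fallback in this situation:

# Fisher's exact test uses no large-sample approximation, so no warning
fisher.test(darren_data_sheet$splenectomy, darren_data_sheet$prox_v_dist)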
t_test_result <- t.test(qanadli ~ splenectomy, data = darren_data_sheet)
print(t_test_result)

    Welch Two Sample t-test

data:  qanadli by splenectomy
t = 0.90543, df = 16.101, p-value = 0.3786
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
 -4.020416 10.020416
sample estimates:
mean in group FALSE  mean in group TRUE 
               11.5                 8.5 

Objective 4: Understand the logic of regression analysis

Recall, if there is an association between an ‘exposure’ and an ‘outcome’, there are 4 possible explanations

  1. Chance
  2. Confounding (some other factor influences the exposure and the outcome)
  3. Bias 
  4. Or, causation (a real effect)

Randomization addresses point 2 (essentially, converts it to point 1, in that only chance confounding can occur)

For non-randomized data, you must make an argument against point 2. This is the most common use of regression. 

(The methods section of your paper is the argument against point 3, bias; the RECORD and STROBE reporting guidelines help structure that argument.)

There are at least 3 uses of regression models: 

  1. Inferential Statistics: Hypothesis testing with confounding control
  2. Descriptive Statistics: Summarize the strength of association
  3. Prediction of an outcome (e.g. statistical machine learning)

Regression comes with additional assumptions:

  • Independent observations (special “mixed models” can relax this)
  • The form of the outcome variable is correct
  • The forms of the predictor variables are correct
  • The relationships between the predictors are properly specified
  • Additional constraints (e.g., constant variance; see the diagnostic-plot sketch below)
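
Some of these assumptions can be checked visually. For linear regression, base R’s built-in diagnostic plots are a quick start (a minimal sketch using the Darren data):

# Fit a simple linear model, then show base R's diagnostic plots
# (residuals vs. fitted, Q-Q plot, scale-location, residuals vs. leverage)
lm_fit <- lm(qanadli ~ splenectomy, data = darren_data_sheet)
plot(lm_fit)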

Thus the logic is: if the assumptions of the model hold in reality, then the described relationships are valid.

No model is perfect, but some models are useful.

- Morris moment™

Output variable (aka the dependent variable, or predicted variable) form determines the type of regression:

| Level of measurement of outcome variable | Two Independent Groups without Confounding Adjustment | Two Independent Groups with Confounding Adjustment |
|---|---|---|
| Dichotomous | chi-squared test | logistic regression |
| Unordered categorical | chi-squared test | multinomial logistic regression |
| Ordered categorical | Wilcoxon-Mann-Whitney test | ordinal logistic regression |
| Continuous (normally distributed) | t-test | linear regression |
| Censored: time to event | log-rank test | Cox regression |

From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.

Interpretation:

Regression coefficient = the change in the outcome you would expect if you change the predictor by 1 unit, holding all other variables constant:

  • For linear regression: additive change in outcome
  • For logistic regression: multiplicative change in odds of the outcome
  • For Cox regression: multiplicative change in the hazard of the outcome. 

Example:

Consider: if we want to test whether ‘splenectomy’ and ‘got_cteph?’ are associated, we could use a chi-squared test:

chi2_test_result <- chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$`got_cteph?`)
Warning in chisq.test(darren_data_sheet$splenectomy,
darren_data_sheet$`got_cteph?`): Chi-squared approximation may be incorrect
print(chi2_test_result)

    Pearson's Chi-squared test with Yates' continuity correction

data:  darren_data_sheet$splenectomy and darren_data_sheet$`got_cteph?`
X-squared = 0.27778, df = 1, p-value = 0.5982

Alternatively, you could specify a logistic regression.

(“GLM” stands for ‘generalized linear model’. Logistic regression is a type of GLM where the family is binomial.)

logistic_model <- glm(`got_cteph?` ~ splenectomy, data = darren_data_sheet, family = binomial())

# Output the summary of the model to see coefficients and statistics
summary(logistic_model)

Call:
glm(formula = `got_cteph?` ~ splenectomy, family = binomial(), 
    data = darren_data_sheet)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)      -1.6094     0.7746  -2.078   0.0377 *
splenectomyTRUE   1.0986     1.0646   1.032   0.3021  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22.493  on 19  degrees of freedom
Residual deviance: 21.398  on 18  degrees of freedom
AIC: 25.398

Number of Fisher Scoring iterations: 4
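
The coefficients are on the log-odds scale; exponentiate them to get odds ratios (and likewise for their confidence intervals):

# Odds ratios and 95% confidence intervals from the fitted model
exp(coef(logistic_model))
exp(confint(logistic_model))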

Causal relationship of Splenectomy and CTEPH

(https://www.dagitty.net/dags.html DAGitty is a tool for specifying such diagrams)

logistic_model_updated <- glm(`got_cteph?` ~ splenectomy + prox_v_dist, data = darren_data_sheet, family = binomial())
summary(logistic_model_updated)

Call:
glm(formula = `got_cteph?` ~ splenectomy + prox_v_dist, family = binomial(), 
    data = darren_data_sheet)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)      -1.8474     1.0473  -1.764   0.0777 .
splenectomyTRUE   1.1383     1.0762   1.058   0.2902  
prox_v_distprox   0.3873     1.0890   0.356   0.7221  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22.493  on 19  degrees of freedom
Residual deviance: 21.270  on 17  degrees of freedom
AIC: 27.27

Number of Fisher Scoring iterations: 4

Causal diagram of Splenectomy, Prox_v_dist, and CTEPH

Consider: do you want the adjusted or the unadjusted estimate? 

Hint: it depends….

Objective 2(b): More guidance for how to do the work

Packages

Other people have already written code for most of the analyses you’ll want to do:

Example: Say you want to do a meta-analysis.

Relevant packages: https://cran.r-project.org/web/views/MetaAnalysis.html

The ‘meta’ package looks good. Try `install.packages('meta')` to install it; then you can access the documentation with `?meta`.

Let’s try an example… I extracted data from all of the trials comparing high O2 to low O2 targets and uploaded them to GitHub.
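
Loading it looks something like this (a sketch: the file name below is hypothetical, so substitute the actual file in the fellow_stats repo):

library(readr)
# Hypothetical file name - point this at the actual data file in the repo
data_sheet <- read_csv("https://raw.githubusercontent.com/reblocke/fellow_stats/main/o2_trials.csv")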

head(data_sheet)
# A tibble: 6 × 15
  name   author  year doi   num_randomized num_patients num_high_o2 high_o2_died
  <chr>  <chr>  <dbl> <chr>          <dbl>        <dbl>       <dbl>        <dbl>
1 Oxyge… Girar…  2016 10.1…            460          432         216          74 
2 CLOSE  Panwar  2016 10.1…            104          103          51          19 
3 HYPER… Asfar   2017 10.1…            442          434         217         104 
4 Lang2… Lang    2018 10.1…             65           65          38           9 
5 COMAC… Jakku…  2018 10.1…            123          120          59          20 
6 ICU-R… Mackle  2020 10.1…           1000          965         484         157.
# ℹ 7 more variables: high_o2_alive <dbl>, num_low_o2 <dbl>, low_o2_died <dbl>,
#   low_o2_alive <dbl>, target <chr>, outcome <chr>, population <chr>
authors <- select(data_sheet, author)

Now, let’s meta-analyze it:

library(meta)
Loading required package: metadat
Loading 'meta' package (version 7.0-0).
Type 'help(meta)' for a brief overview.
Readers of 'Meta-Analysis with R (Use R!)' should install
older version of 'meta' package: https://tinyurl.com/dt4y5drs
#metabin takes events, total (rather than events, nonevents)
m_ex1 <- meta::metabin(low_o2_died, num_low_o2, high_o2_died, num_high_o2, data = data_sheet, studlab = paste(name, author, year), sm = "OR")
meta::forest(m_ex1, comb.random = FALSE, lab.c = "High Oxygen", lab.e = "Low Oxygen", label.left = "Favors Low O2", label.right = "Favors High O2")

And if you want to get really cutting edge, you can do a trial sequential analysis (TSA) on it:

library(RTSA) # A package for performing TSA
library(tidyverse) # A meta-package that loads many data-manipulation packages.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rtsa_df <- data_sheet |> 
  rename("eI" = high_o2_died, 
         "eC" = low_o2_died, 
         "nI" = num_high_o2,
         "nC" = num_low_o2, 
         "study" = name)

map(rtsa_df, class) # check that each column has the expected type
$study
[1] "character"

$author
[1] "character"

$year
[1] "numeric"

$doi
[1] "character"

$num_randomized
[1] "numeric"

$num_patients
[1] "numeric"

$nI
[1] "numeric"

$eI
[1] "numeric"

$high_o2_alive
[1] "numeric"

$nC
[1] "numeric"

$eC
[1] "numeric"

$low_o2_alive
[1] "numeric"

$target
[1] "character"

$outcome
[1] "character"

$population
[1] "character"
sapply(rtsa_df, function(x) which(is.na(x))) # check each column for missing values
$study
integer(0)

$author
integer(0)

$year
integer(0)

$doi
integer(0)

$num_randomized
integer(0)

$num_patients
integer(0)

$nI
integer(0)

$eI
integer(0)

$high_o2_alive
integer(0)

$nC
integer(0)

$eC
integer(0)

$low_o2_alive
integer(0)

$target
integer(0)

$outcome
integer(0)

$population
integer(0)
an_rtsa <- RTSA(type="analysis", data =  rtsa_df , outcome = "RR", mc = 0.9, side = 2,  alpha = 0.05, beta = 0.1, es_alpha = "esOF", es_beta = "esOF", futility = "non-binding", random_adj = "D2", D2 = 0.25 )
Warning: Unknown or uninitialised column: `order`.
Unknown or uninitialised column: `order`.
Warning in RTSA(type = "analysis", data = rtsa_df, outcome = "RR", mc = 0.9, :
NB. The required information size adjusted by Diversity (D^2). This might cause
an under-powered analysis. Consider changing the argument `random_adj` from
`D2` (default) to `tau2`.
plot(an_rtsa)

Meaning: we passed the futility boundary (at 90% power) for a 10% relative risk reduction a few trials ago. Cool.

Large Language Models: Options

  • OpenAI ChatGPT (requires a subscription for best performance; custom GPTs)
  • GitHub Copilot (programming-specific)
  • Microsoft Copilot (access to GPT-4 is free through the University of Utah)

Copilot:

  • Visit bing.com/chat.
  • Select “sign in with a work or school account” under the Sign in icon in the upper right corner of the page.
  • Enter your unid@umail.utah.edu and uNID password.
  • Complete Duo two-factor authentication.
  • The conversation is protected when a green shield appears in the upper right corner next to your username. It is critical to verify that the green shield is present for all conversations.

Prompt Engineering:

  1. Have the GPT take on the persona that you want.
  2. Spell out the chain of thought that you want the GPT to follow (either multiple steps in one prompt or several prompts building on one another).
  3. Give examples or specifications of what you want done. (This is particularly useful because the documents you give it form context and examples.)

How I used GPT-4 in creating this workbook:

The prompt I used to create the above example.

U of U Resources -