# How to Do the Stats for Your Fellowship Project
Learning Objectives:
- Set up statistical software
- Share "tacit knowledge" (how it's actually done) of how to choose a statistical analysis and perform it
- Understand the structure of frequentist statistical tests
- Understand the logic of regression models
Bad news: you'll need to learn how to do statistical analyses:
- Realistically, you'll probably have to do at least some of your own stats.
- Even if you join a project that has a statistician (or get CTSI assistance), you'll still need to be able to specify, critique, and interpret the analyses. It will help substantially to have (at least) a basic working knowledge.
Statistical Programming

| | R | Stata | Python |
|---|---|---|---|
| Cost | Free | Requires a license | Free |
| IDE | RStudio | Built-in editor | Many (VS Code is best) |
| Strengths | Best epi/trials libraries of helpful functions | Simple functionality; powerful quasi-experimental and meta-analysis tools. Used by the U of U MSCI program. | Best NLP and machine-learning libraries |
| Weaknesses | Clunky syntax; many 'dialects' | Simple syntax | Moderately complex syntax |
| Explainable programming* | Quarto | No options | Jupyter |

*The idea that code should be readable by consumers of science has caught on in more quantitative fields (math, CS) and is coming to medicine. Long overdue. Learn it now.
Objective 1: Get Set Up

| Step | Task | Link | Notes |
|---|---|---|---|
| 1 | Install the R language | https://cran.r-project.org/ | |
| 2 | Install RStudio | https://posit.co/downloads/ | RStudio is an IDE (integrated development environment) |
| 3 | Install Quarto (the successor to R Markdown) | https://quarto.org/docs/get-started/ | Facilitates sharing and explaining your code. Will soon be standard in medical science. |
| 4 | Download this document | https://github.com/reblocke/fellow_stats | "fellow_stats.qmd" |
This page is made using Quarto.
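Once everything is installed, a quick sanity check (a minimal sketch; run it in the RStudio console):

```r
# Confirm R is installed and see which version you have
R.version.string

# Install the packages used in the examples below (a one-time step)
install.packages(c("readxl", "dplyr", "ggplot2"))
```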
Objective 2(a): How to do a statistical analysis
Step 0: Save yourself a headache and collect your data in a processable format https://open.substack.com/pub/statsepi/p/simple-tips-for-recording-data-in
Step 1: Data Wrangling
- Each row is an observation (usually a patient)
- Each column contains only 1 type of data (more below)
- No free text (if you need it, categorize responses); see the sketch after this list
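As a sketch of what that layout looks like once loaded into R (hypothetical data):

```r
library(tibble)

# One row per patient; each column holds one type of data; no free text
demo_data <- tribble(
  ~patient_id, ~splenectomy, ~qanadli,
  1,           "y",          12,
  2,           "n",          3
)
```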
Step 2: For each data element, consider the data type
- Binary (aka dichotomous scale): e.g. Yes or No, 0 or 1
- Unordered Categorical (nominal scale): e.g. Utah, Colorado, Nevada, Idaho
- Ordered Categorical (ordinal scale): e.g. Room air, nasal cannula, HFNC, intubated, ECMO, dead
- Continuous (interval & ratio scales, which differ by whether 0 is special): e.g. temperature in Celsius or Kelvin, respectively — see the R sketch after this list
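In R, these four types map onto logicals, factors, ordered factors, and numerics (a minimal sketch; the variable names are hypothetical):

```r
# Binary: logical
had_event <- c(TRUE, FALSE, TRUE)

# Unordered categorical: factor
state <- factor(c("Utah", "Nevada"),
                levels = c("Utah", "Colorado", "Nevada", "Idaho"))

# Ordered categorical: ordered factor (the level order carries meaning)
support <- factor(c("Room air", "HFNC"),
                  levels = c("Room air", "nasal cannula", "HFNC",
                             "intubated", "ECMO", "dead"),
                  ordered = TRUE)

# Continuous: numeric
temp_c <- c(37.2, 38.5, 36.9)
```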
| | dichotomous | nominal | ordinal | interval |
|---|---|---|---|---|
| a.k.a. | binary | categorical | ordered categorical | continuous |
| n | X | X | X | X |
| % | X | X | X | X |
| min | | | X | X |
| max | | | X | X |
| range | | | X | X |
| mode | X | X | X | X |
| mean | | | | X |
| median | | | X | X |
| IQR | | | X | X |
| Std. dev. | | | | X |
| Std. err. | | | | X |
From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.
Step 3: Visualize the distribution of each data element (to detect outliers, data entry errors, etc.)

Darren's hypothetical data live in a spreadsheet, "darren_proj.xlsx".

Here is code that loads the Excel spreadsheet into R (we'll revisit this):
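A minimal sketch, assuming "darren_proj.xlsx" sits in the working directory:

```r
library(readxl)  # reads .xlsx files
library(dplyr)   # data-manipulation verbs used below

darren_data_sheet <- read_excel("darren_proj.xlsx")
```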
```
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
```
It’s already (mostly) clean.
Let’s summarize it:
```r
summary(darren_data_sheet)
```

```
   patient_id    splenectomy        prox_v_dist           qanadli     
 Min.   : 1.00   Length:20          Length:20          Min.   : 2.00  
 1st Qu.: 5.75   Class :character   Class :character   1st Qu.: 3.75  
 Median :10.50   Mode  :character   Mode  :character   Median :10.00  
 Mean   :10.50                                         Mean   :10.30  
 3rd Qu.:15.25                                         3rd Qu.:15.00  
 Max.   :20.00                                         Max.   :25.00  
   got_cteph?        hosp          
 Min.   :0.00    Length:20         
 1st Qu.:0.00    Class :character  
 Median :0.00    Mode  :character  
 Mean   :0.25                      
 3rd Qu.:0.25                      
 Max.   :1.00                      
```
Hmmm… what's wrong with this?

R needs to be told that the binary variables are binary (not free-form character data):
```r
library(dplyr)

# Convert 'y'/'n' in the splenectomy column to TRUE/FALSE
darren_data_sheet <- darren_data_sheet %>%
  mutate(splenectomy = ifelse(splenectomy == "y", TRUE, FALSE))

# Convert 0/1 in the got_cteph? column to TRUE/FALSE
darren_data_sheet <- darren_data_sheet %>%
  mutate(`got_cteph?` = ifelse(`got_cteph?` == 1, TRUE, FALSE))
```

Let's visualize each element:
```r
library(ggplot2)

# First, the binary ones

# Plot for splenectomy
ggplot(darren_data_sheet, aes(x = factor(splenectomy))) +
  geom_bar() +
  labs(title = "Distribution of Splenectomy", x = "Splenectomy", y = "Count")

# Plot for prox_v_dist
ggplot(darren_data_sheet, aes(x = factor(prox_v_dist))) +
  geom_bar() +
  labs(title = "Distribution of Proximal vs. Distal", x = "Proximal vs. Distal", y = "Count")

# Plot for got_cteph?
ggplot(darren_data_sheet, aes(x = factor(`got_cteph?`))) +
  geom_bar() +
  labs(title = "Distribution of CTEPH Diagnosis", x = "Got CTEPH?", y = "Count")
```

The categorical one:

```r
# Bar chart for hosp
ggplot(darren_data_sheet, aes(x = factor(hosp))) +
  geom_bar(fill = "coral", color = "black") +
  labs(title = "Distribution of Hospital", x = "Hospital", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Angle the labels for readability
```

And finally, the continuous one:

```r
# Histogram for qanadli
ggplot(darren_data_sheet, aes(x = qanadli)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Histogram of Qanadli Scores", x = "Qanadli Score", y = "Frequency")
```

Objective 3: The Logic of Frequentist Inferential Statistics
Hume: we cannot directly observe causation
If there is an association between an ‘exposure’ and an ‘outcome’, there are 4 possible explanations
- Chance
- Confounding (some other factor influences the exposure and the outcome)
- Bias
- Or, causation (meaning, a real effect)
Disjunctive syllogism:
When you have eliminated the impossible, whatever remains, however improbable, must be the truth.
- Sherlock Holmes
*note: Bayesian analysis follows a different logic.
P-values ONLY address possibility 1: how likely is it that chance alone could explain the observed result? If the two variables were not correlated, what is the likelihood that you'd see results this extreme, or more extreme, just as a result of chance?
Consider: if I start flipping a coin, how many consecutive "Heads" need to occur before you'll suspect it's not a fair coin?

| Sequence | Flips | P-value |
|---|---|---|
| HH | 2 flips | 0.25 |
| HHH | 3 flips | 0.125 |
| HHHH | 4 flips | 0.0625 |
| | 4.32 flips | 0.05 |
| HHHHH | 5 flips | 0.03125 |
More at: https://stat.lesslikely.com/s-values/
IMPORTANT POINT: the p-value is NOT the probability that the coin is biased. It is the probability of seeing that result (or one more extreme) ASSUMING the coin is fair.
Sidebar: understanding multiplicity. If we all flipped a coin 5 times, what is the chance that at least one of us would get 5 heads in a row?
- With 10 people: P(no one gets HHHHH) = (1 − 0.03125)^10 ≈ 0.728, so there is a ~27% chance of at least one HHHHH → we're back to weak evidence of a real effect (see the sketch below).
- This is obviously true when multiple tests are reported, but less obviously also true if you try several analyses and choose the "best one" after seeing the results. Hence, prespecification.
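A quick check of that arithmetic in R:

```r
p_hhhhh <- 0.5^5      # chance one person flips 5 heads in a row: 0.03125
1 - (1 - p_hhhhh)^10  # chance at least 1 of 10 people does: ~0.272
```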
'Signal to noise': if the alternative hypothesis is that the coin is subtly imbalanced, the signal will be much harder to detect. This is the logic of power analysis: if you're looking for a subtle signal (e.g., a small difference, noisy data, rare events), you'll need a bigger study.

However, there is no free lunch: a statistical test makes assumptions about the data, and only if those assumptions hold in reality do the implications of the analysis follow.
Example interpretation:

If the p-value from a chi-squared test is p = 0.03, we say the result is about as unlikely to arise from chance alone as flipping a fair coin heads 5 times in a row.

IF observations are independent, both variables are categorical, and there are enough observations of each, THEN it is unlikely that chance alone explains the difference.

Using the usual significance threshold (alpha), IF the assumptions hold, we conclude it is unlikely that chance alone caused the finding (though it could still be confounding, bias, or a real effect).
How do you choose the right test?

What type of variables? How many groups? Are the samples correlated (e.g., observations from the same patient at two different times)?

| Level of measurement of outcome variable | Two Independent Groups | Three or more Independent Groups | Two Correlated Samples | Three or more Correlated Samples |
|---|---|---|---|---|
| Dichotomous | chi-square or Fisher's exact test | chi-square or Fisher-Freeman-Halton test | McNemar test | Cochran Q test |
| Unordered categorical | chi-square or Fisher-Freeman-Halton test | chi-square or Fisher-Freeman-Halton test | Stuart-Maxwell test | Multiplicity-adjusted Stuart-Maxwell tests |
| Ordered categorical | Wilcoxon-Mann-Whitney (WMW) test | Old school: Kruskal-Wallis analysis of variance (ANOVA). New school: multiplicity-adjusted WMW tests | Wilcoxon signed-rank test | Old school: Friedman two-way ANOVA by ranks. New school: multiplicity-adjusted Wilcoxon signed-rank tests |
| Continuous | independent-groups t-test | Old school: one-way ANOVA. New school: multiplicity-adjusted independent-groups t-tests | paired t-test | mixed-effects linear regression |
| Censored: time to event | log-rank test | Multiplicity-adjusted log-rank tests | Shared-frailty Cox regression | Shared-frailty Cox regression |

From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.
How can you collaborate effectively with a statistician? They will know these assumptions and can tell you when your analysis makes dubious ones (if you communicate the constraints of the problem correctly).
Examples from Darren
What test would we use to assess if “splenectomy” and “prox_v_dist” are associated beyond what’s attributable to chance?
To test if “splenectomy” and “hosp” are associated?
If “splenectomy” and “qanadli” are associated?
```r
chi2_test_result <- chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$prox_v_dist)
```

```
Warning in chisq.test(darren_data_sheet$splenectomy,
darren_data_sheet$prox_v_dist): Chi-squared approximation may be incorrect
```

```r
print(chi2_test_result)
```

```

	Pearson's Chi-squared test with Yates' continuity correction

data:  darren_data_sheet$splenectomy and darren_data_sheet$prox_v_dist
X-squared = 1.1122e-31, df = 1, p-value = 1
```
```r
chi2_test_result <- chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$hosp)
```

```
Warning in chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$hosp):
Chi-squared approximation may be incorrect
```

```r
print(chi2_test_result)
```

```

	Pearson's Chi-squared test

data:  darren_data_sheet$splenectomy and darren_data_sheet$hosp
X-squared = 0.13889, df = 3, p-value = 0.9868
```
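Both warnings ("Chi-squared approximation may be incorrect") arise because, with only 20 patients, some expected cell counts are small. A common response, per the table above, is an exact test, which avoids the large-sample approximation; a sketch:

```r
# Fisher's exact test as a small-sample alternative to the chi-squared test
fisher.test(darren_data_sheet$splenectomy, darren_data_sheet$prox_v_dist)
```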
```r
t_test_result <- t.test(qanadli ~ splenectomy, data = darren_data_sheet)
print(t_test_result)
```

```

	Welch Two Sample t-test

data:  qanadli by splenectomy
t = 0.90543, df = 16.101, p-value = 0.3786
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
 -4.020416 10.020416
sample estimates:
mean in group FALSE  mean in group TRUE 
               11.5                 8.5 
```
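If the qanadli histogram from Step 3 had looked clearly non-normal (a real concern with only 20 patients), the rank-based row of the table above offers an alternative; a sketch:

```r
# Wilcoxon-Mann-Whitney test: compares qanadli across splenectomy groups
# without assuming normality
wilcox.test(qanadli ~ splenectomy, data = darren_data_sheet)
```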
Objective 4: Understand the logic of regression analysis
Recall, if there is an association between an ‘exposure’ and an ‘outcome’, there are 4 possible explanations
- Chance
- Confounding (some other factor influences the exposure and the outcome)
- Bias
- Or, causation (a real effect)
Randomization addresses point 2 (essentially, converts it to point 1, in that only chance confounding can occur)
For non-randomized data, you must make an argument against point 2. This is the most common use of regression.
(The methods section of your paper is the argument against point 3; see the RECORD/STROBE reporting recommendations.)
There are at least 3 uses of regression models:
- Inferential Statistics: Hypothesis testing with confounding control
- Descriptive Statistics: Summarize the strength of association
- Prediction of an outcome (e.g. statistical machine learning)
Regression comes with additional assumptions:
- Independent observations (special "mixed models" can relax this)
- The form of the outcome variable is correct
- The form of the predictor variables is correct
- The relationships among the predictors are properly specified
- Additional constraints (e.g., constant variance) — see the diagnostic sketch after this list
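For linear regression, base R's plot() method on a fitted model produces standard diagnostic plots for several of these assumptions (a minimal sketch using the hypothetical Darren data):

```r
# Fit a simple linear model, then inspect residual diagnostics:
# residuals vs. fitted (model form), Q-Q plot (normality of residuals),
# scale-location (constant variance), and residuals vs. leverage (outliers)
fit <- lm(qanadli ~ splenectomy, data = darren_data_sheet)
plot(fit)
```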
Thus the logic is: if the assumptions of the models hold in reality, then the described relationships are valid
No model is perfect, but some models are useful
- Morris moment(TM)
The form of the outcome variable (aka the dependent or predicted variable) determines the type of regression:

| Level of measurement of outcome variable | Two Independent Groups without Confounding Adjustment | Two Independent Groups with Confounding Adjustment |
|---|---|---|
| Dichotomous | chi-square test | logistic regression |
| Unordered categorical | chi-square test | multinomial logistic regression |
| Ordered categorical | Wilcoxon-Mann-Whitney | ordinal logistic regression |
| Continuous (normally distributed) | t-test | linear regression |
| Censored: time to event | log-rank test | Cox regression |

From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.
Interpretation:

Regression coefficient = the change in the outcome you would expect if you change the predictor by 1 unit, holding all other variables constant:
- For linear regression: additive change in the outcome
- For logistic regression: multiplicative change in the odds of the outcome
- For Cox regression: multiplicative change in the hazard of the outcome
Example:
Consider: if we want to test whether 'splenectomy' and 'got_cteph?' are associated, we could use a chi-squared test:

```r
chi2_test_result <- chisq.test(darren_data_sheet$splenectomy, darren_data_sheet$`got_cteph?`)
```

```
Warning in chisq.test(darren_data_sheet$splenectomy,
darren_data_sheet$`got_cteph?`): Chi-squared approximation may be incorrect
```

```r
print(chi2_test_result)
```

```

	Pearson's Chi-squared test with Yates' continuity correction

data:  darren_data_sheet$splenectomy and darren_data_sheet$`got_cteph?`
X-squared = 0.27778, df = 1, p-value = 0.5982
```
Alternatively, you could specify a logistic regression.

("GLM" stands for 'generalized linear model'. Logistic regression is the GLM whose family is binomial.)
```r
logistic_model <- glm(`got_cteph?` ~ splenectomy, data = darren_data_sheet, family = binomial())

# Output the summary of the model to see coefficients and statistics
summary(logistic_model)
```

```

Call:
glm(formula = `got_cteph?` ~ splenectomy, family = binomial(), 
    data = darren_data_sheet)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)      -1.6094     0.7746  -2.078   0.0377 *
splenectomyTRUE   1.0986     1.0646   1.032   0.3021  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22.493  on 19  degrees of freedom
Residual deviance: 21.398  on 18  degrees of freedom
AIC: 25.398

Number of Fisher Scoring iterations: 4
```
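These coefficients are on the log-odds scale; a common next step (a sketch, not part of the original output) is exponentiating them to get odds ratios:

```r
# Convert log-odds coefficients to odds ratios, with 95% confidence intervals
exp(coef(logistic_model))     # point estimates as odds ratios
exp(confint(logistic_model))  # profile-likelihood CIs on the odds-ratio scale
```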
[Figure: causal diagram (DAG) of the hypothesized relationships among the variables]

(https://www.dagitty.net/dags.html — DAGitty is a tool for specifying such diagrams.)

Suppose the diagram suggests adjusting for prox_v_dist:
```r
logistic_model_updated <- glm(`got_cteph?` ~ splenectomy + prox_v_dist, data = darren_data_sheet, family = binomial())
summary(logistic_model_updated)
```

```

Call:
glm(formula = `got_cteph?` ~ splenectomy + prox_v_dist, family = binomial(), 
    data = darren_data_sheet)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)      -1.8474     1.0473  -1.764   0.0777 .
splenectomyTRUE   1.1383     1.0762   1.058   0.2902  
prox_v_distprox   0.3873     1.0890   0.356   0.7221  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22.493  on 19  degrees of freedom
Residual deviance: 21.270  on 17  degrees of freedom
AIC: 27.27

Number of Fisher Scoring iterations: 4
```
Consider: do you want the adjusted or the unadjusted estimate?
Hint: it depends….
Objective 2(b): More guidance for how to do the work
Packages
Other people have already implemented most of the analyses you'll want to do:
- Curated lists of relevant packages: https://cran.r-project.org/web/views/
- `install.packages()` will install the packages
- `?package` or `?command` will bring up the documentation
Example: Say you want to do a meta-analysis.
Relevant packages: https://cran.r-project.org/web/views/MetaAnalysis.html
The 'meta' package looks good. Use `install.packages('meta')` to install it; then you can access the documentation using `?meta`.
Let's try an example… I extracted data on all of the trials comparing high O2 to low O2 targets and uploaded it to GitHub.
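A minimal sketch of reading it back in (the file name here is hypothetical; substitute the actual file from the repo):

```r
library(readxl)

# Hypothetical file name for the extracted trial data
data_sheet <- read_excel("o2_trials.xlsx")
```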
```r
head(data_sheet)
```

```
# A tibble: 6 × 15
  name   author  year doi   num_randomized num_patients num_high_o2 high_o2_died
  <chr>  <chr>  <dbl> <chr>          <dbl>        <dbl>       <dbl>        <dbl>
1 Oxyge… Girar…  2016 10.1…            460          432         216           74
2 CLOSE  Panwar  2016 10.1…            104          103          51           19
3 HYPER… Asfar   2017 10.1…            442          434         217          104
4 Lang2… Lang    2018 10.1…             65           65          38            9
5 COMAC… Jakku…  2018 10.1…            123          120          59           20
6 ICU-R… Mackle  2020 10.1…           1000          965         484         157.
# ℹ 7 more variables: high_o2_alive <dbl>, num_low_o2 <dbl>, low_o2_died <dbl>,
#   low_o2_alive <dbl>, target <chr>, outcome <chr>, population <chr>
```

```r
authors <- select(data_sheet, author)
```

Now, let's meta-analyze it:
```r
library(meta)
```

```
Loading required package: metadat
Loading 'meta' package (version 7.0-0).
Type 'help(meta)' for a brief overview.
Readers of 'Meta-Analysis with R (Use R!)' should install
older version of 'meta' package: https://tinyurl.com/dt4y5drs
```

```r
# metabin takes events and total (rather than events and non-events)
m_ex1 <- meta::metabin(low_o2_died, num_low_o2, high_o2_died, num_high_o2,
                       data = data_sheet, studlab = paste(name, author, year), sm = "OR")
meta::forest(m_ex1, comb.random = FALSE, lab.c = "High Oxygen", lab.e = "Low Oxygen",
             label.left = "Favors Low O2", label.right = "Favors High O2")
```

And if you want to get really cutting edge, you can do a trial sequential analysis (TSA) on it:
```r
library(RTSA)      # A package for performing TSA
library(tidyverse) # A package that pulls in many data-manipulation packages
```

```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
✔ readr     2.1.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```
```r
# Rename columns to the names RTSA expects: eI/nI = events/total in the
# intervention arm, eC/nC = events/total in the control arm
rtsa_df <- data_sheet |>
  rename("eI" = high_o2_died,
         "eC" = low_o2_died,
         "nI" = num_high_o2,
         "nC" = num_low_o2,
         "study" = name)
```
```r
# Check the class of every column
map(rtsa_df, class)
```

```
$study
[1] "character"

$author
[1] "character"

$year
[1] "numeric"

$doi
[1] "character"

$num_randomized
[1] "numeric"

$num_patients
[1] "numeric"

$nI
[1] "numeric"

$eI
[1] "numeric"

$high_o2_alive
[1] "numeric"

$nC
[1] "numeric"

$eC
[1] "numeric"

$low_o2_alive
[1] "numeric"

$target
[1] "character"

$outcome
[1] "character"

$population
[1] "character"
```
```r
# Check every column for missing values (integer(0) = none found)
sapply(rtsa_df, function(x) which(is.na(x)))
```

```
$study
integer(0)

$author
integer(0)

$year
integer(0)

$doi
integer(0)

$num_randomized
integer(0)

$num_patients
integer(0)

$nI
integer(0)

$eI
integer(0)

$high_o2_alive
integer(0)

$nC
integer(0)

$eC
integer(0)

$low_o2_alive
integer(0)

$target
integer(0)

$outcome
integer(0)

$population
integer(0)
```
```r
an_rtsa <- RTSA(type = "analysis", data = rtsa_df, outcome = "RR", mc = 0.9, side = 2,
                alpha = 0.05, beta = 0.1, es_alpha = "esOF", es_beta = "esOF",
                futility = "non-binding", random_adj = "D2", D2 = 0.25)
```

```
Warning: Unknown or uninitialised column: `order`.
Unknown or uninitialised column: `order`.
Warning in RTSA(type = "analysis", data = rtsa_df, outcome = "RR", mc = 0.9, :
NB. The required information size adjusted by Diversity (D^2). This might cause
an under-powered analysis. Consider changing the argument `random_adj` from
`D2` (default) to `tau2`.
```

```r
plot(an_rtsa)
```

Meaning: we passed futility (at 90% power) for a 10% relative risk reduction a few trials ago. Cool.
Large Language Models: Options
- OpenAI ChatGPT (requires a subscription for best performance; custom GPTs)
- GitHub Copilot (programming-specific)
- Microsoft Copilot (access to GPT-4; free through the University of Utah)
Copilot:
- Visit bing.com/chat.
- Select “sign in with a work or school account” under the Sign in icon in the upper right corner of the page.
- Enter your unid@umail.utah.edu and uNID password.
- Complete Duo two-factor authentication.
- The conversation is protected when a green shield appears in the upper right corner next to your username. It is critical to verify that the green shield is present for all conversations.
Prompt Engineering:
- Have the GPT take the persona that you want.
- Spell out the chain of thought that you want the GPT to follow (either multiple steps in 1 prompt, or several prompts building on one another, works).
- Give examples or specifications of what you want done (this is particularly useful because the documents you give it can provide context and examples). An example prompt follows this list.
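For example, a hypothetical troubleshooting prompt that combines all three tips:

"You are a biostatistics tutor for a medicine fellow. I ran `chisq.test()` on a 2×2 table from a 20-patient study and got the warning 'Chi-squared approximation may be incorrect.' First explain what the warning means, then tell me whether Fisher's exact test would be more appropriate here, and show the R code to run it."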
How I used GPT-4 in creating this workbook:
U of U Resources:
- One Utah Data Science Hub workshops: https://utah-data-science-hub.github.io/education_archived.html
- Request CTSI help: https://ctsi.utah.edu/cores-and-services/triad
- Intuitive Biostatistics by Harvey Motulsky: https://a.co/d/4NCk2bS