Main Study Designs: Evidence
from: Somerville et al. (2016), Public Health and Epidemiology at a Glance
Day 1: Basic Notions
09-2024
“The science of collecting, analyzing, presenting, and interpreting data.” (Dodge, 2003, The Oxford Dictionary of Statistical Terms)
“The science of learning from data, and of measuring, controlling, and communicating uncertainty.” (Davidian & Louis, 2012, Why statistics? Science, 336(6077), 12-13)
“The science of making decisions in the face of uncertainty through the collection, analysis, and interpretation of data.” (Wasserman, 2004, All of Statistics: A Concise Course in Statistical Inference)
Medical Statistics: “The application of statistical methods to medical and health-related data to answer clinical and epidemiological questions.” (Bland, 2015, An Introduction to Medical Statistics)
Biostatistics: “The application of statistics to a wide range of topics in biology, including medicine, public health, and agricultural and environmental sciences.” (Rosner, 2006, Fundamentals of Biostatistics)
17th century: Collection of vital statistics (London Bills of Mortality - J. Graunt, 1662)
19th century: Public health statistics (Cholera studies - J. Snow)
Early 20th century: Experimental design (Agricultural experiments - R.A. Fisher)
Mid 20th century: Randomized controlled trials (Streptomycin trial - A.B. Hill)
Late 20th century: Computational epidemiology (HIV/AIDS modeling - R.M. Anderson)
21st century: Big data and AI in healthcare (e.g., W.W. Stead)
16th-17th centuries: Early development of probability theory
18th century: Bayesian methods introduced
Early 19th century: Least squares method
Late 19th century: Correlation and regression analysis
Early 20th century: Student’s t-test (1908)
1920s-1930s: ANOVA and hypothesis testing framework
Mid 20th century: Randomized controlled trials
1960s-1970s: Cox proportional hazards model for survival analysis
Late 20th century: Advancements in computational methods and epidemiology
21st century: Big data analytics and personalized medicine
For any random variable \(X\) and real number \(a\): \[E(a X) = a E(X),\;Var(a X) = a^2 Var(X)\]
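The scaling rules can be checked exactly on a small discrete distribution. The sketch below uses a fair six-sided die and the factor `a = 3`; both are illustrative choices, not part of the original slides.

```python
from fractions import Fraction

def mean(values, probs):
    """Expected value of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    """Variance of a discrete random variable."""
    m = mean(values, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(values, probs))

# Illustrative assumption: X is the outcome of a fair die, a = 3.
faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
a = 3
scaled = [a * v for v in faces]

e_x, var_x = mean(faces, probs), variance(faces, probs)
e_ax, var_ax = mean(scaled, probs), variance(scaled, probs)

print(e_ax == a * e_x)          # E(aX) = a E(X)
print(var_ax == a ** 2 * var_x) # Var(aX) = a^2 Var(X)
```

Exact fractions are used so that both identities hold without rounding error.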
Average for [identical] random variables:
Scenario: Test for “Condition X”
Given: prevalence 1% (\(P(X) = 0.01\)); sensitivity 95% (\(P(T|X) = 0.95\)); specificity 90% (\(P(\neg T|\neg X) = 0.90\), so the false-positive rate is \(P(T|\neg X) = 0.10\))
Bayes’ Theorem: \[P(X|T) = \frac{P(T|X) \cdot P(X)}{P(T|X) \cdot P(X) + P(T|\neg X) \cdot P(\neg X)}\]
Calculation: \[\begin{aligned} P(X|T) &= \frac{0.95 \cdot 0.01}{(0.95 \cdot 0.01) + (0.10 \cdot 0.99)} \\ &\approx 0.0876 \end{aligned}\]
Interpretation: Given a positive test result, the probability of actually having Condition X is only about 8.76%. Because the condition is rare, false positives far outnumber true positives.
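The same calculation can be written as a small function, which makes it easy to see how the result changes with prevalence. This is a minimal sketch; the function name `positive_predictive_value` is chosen here for illustration.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(X | T) via Bayes' theorem for a binary test."""
    true_pos = sensitivity * prevalence               # P(T | X) * P(X)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(T | not X) * P(not X)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.95, specificity=0.90)
print(f"{ppv:.4f}")  # 0.0876
```

Re-running the function with a higher prevalence (for example 0.10) shows how strongly the result depends on how common the condition is.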
Consider \(n\) independent and identically distributed (i.i.d.) random variables \(X_1, X_2, \ldots, X_n\), following a normal distribution, \(X_i \sim N(\mu, \sigma^2)\).
The sample mean \(\bar{X}\) is: \(\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\)
Expected Value: \[E[\bar{X}] = \mu\]
Variance: \[Var(\bar{X}) = \frac{\sigma^2}{n}\]
Distribution: The sampling distribution of the mean is itself normal:
\[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\]
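A quick simulation can make the sampling distribution tangible. The sketch below draws many samples of size `n` and compares the empirical mean and variance of the sample means with \(\mu\) and \(\sigma^2/n\). The values of `mu`, `sigma`, `n`, and the number of repetitions are arbitrary illustrative assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

# Illustrative assumptions, not values from the slides.
mu, sigma, n = 10.0, 2.0, 25
n_repetitions = 20_000

# Draw many samples of size n and record each sample mean.
sample_means = []
for _ in range(n_repetitions):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"mean of sample means:     {mean_of_means:.3f} (theory: {mu})")
print(f"variance of sample means: {var_of_means:.4f} (theory: {sigma**2 / n})")
```

The empirical values should land close to the theoretical ones, and the agreement tightens as the number of repetitions grows.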
If we observe data \(x_i\), which values of \(\mu\) and \(\sigma\) make the model fit the data best?
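One common answer, offered here as an assumption rather than as the method the course goes on to use, is maximum likelihood: for the normal model, the estimate of \(\mu\) is the sample mean and the estimate of \(\sigma\) is the standard deviation with divisor \(n\). The data values below are hypothetical.

```python
import statistics

# Hypothetical observations x_i (illustration only).
x = [4.8, 5.1, 5.5, 4.9, 5.2]

mu_hat = statistics.fmean(x)            # estimate of mu: sample mean
sigma_mle = statistics.pstdev(x)        # divisor n (maximum likelihood)
sigma_sample = statistics.stdev(x)      # divisor n - 1 (usual sample SD)

print(f"mu_hat = {mu_hat:.3f}")
print(f"sigma (divisor n)   = {sigma_mle:.4f}")
print(f"sigma (divisor n-1) = {sigma_sample:.4f}")
```

The two standard deviation estimates differ only in the divisor; the version with \(n - 1\) is slightly larger and is the one most software reports by default.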
Population: Entire group about which we want to draw conclusions
Hard to analyze whole populations (costs, practicality)
Sample: Subset of the population used to make inferences
Sample Size: Number of participants needed to detect an effect, influenced by effect size, variability, significance level, and power.
Power: Probability of correctly rejecting the null hypothesis when it is false.
Importance: Adequate sample size and power are critical for the validity of the study results.
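To illustrate how effect size, variability, significance level, and power interact, the sketch below uses the standard normal-approximation formula for comparing two group means, \(n = 2\left((z_{1-\alpha/2} + z_{1-\beta})\,\sigma/\delta\right)^2\) per group. It assumes a two-sided test, equal group sizes, and a known common standard deviation. The function name `n_per_group` and the input values are illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample z-test.

    delta: difference in means to detect; sigma: common standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Illustrative inputs: detect a difference of half a standard deviation.
n_medium = n_per_group(delta=0.5, sigma=1.0)
n_high_power = n_per_group(delta=0.5, sigma=1.0, power=0.90)

print(n_medium)      # 63
print(n_high_power)  # more participants are needed for higher power
```

Halving `delta` roughly quadruples the required sample size, which is why small effects need large studies.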
Example: Depression Scale Assessment
from: Somerville et al. (2016), Public Health and Epidemiology at a Glance
Definition: Confounding occurs when the relationship between an exposure and an outcome is distorted by the presence of another variable (the confounder). It leads to a biased estimate of the association.
Criteria for a Confounder:
Impact: Can create a spurious association or mask a true relationship
Addressing Confounding:
Note: subject-matter knowledge is needed
Historical Context: 1960s studies showed higher lung cancer rates among coffee drinkers
Observed Association:
Confounder: Smoking status
True Relationships:
Result: Smoking created a spurious association between coffee and lung cancer
Resolution: When smoking was accounted for, no significant association between coffee and lung cancer remained (Guertin et al., 2016)
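The mechanism can be illustrated with a stratified analysis. The counts below are entirely hypothetical, constructed so that coffee has no effect within either smoking stratum while smokers drink more coffee; they are not taken from Guertin et al. or any other study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for unexposed (a cases, b non-cases) vs exposed (c cases, d non-cases)."""
    return (c * b) / (d * a)

# HYPOTHETICAL counts as (lung cancer cases, non-cases); illustration only.
smokers = {"no_coffee": (20, 180), "coffee": (80, 720)}
non_smokers = {"no_coffee": (8, 792), "coffee": (2, 198)}

def stratum_or(stratum):
    """Odds ratio of coffee vs no coffee within one table."""
    a, b = stratum["no_coffee"]
    c, d = stratum["coffee"]
    return odds_ratio(a, b, c, d)

# Pool both strata, ignoring smoking status (the "crude" analysis).
pooled = {
    group: tuple(s + n for s, n in zip(smokers[group], non_smokers[group]))
    for group in smokers
}

crude_or = stratum_or(pooled)
or_smokers = stratum_or(smokers)
or_non_smokers = stratum_or(non_smokers)

print(f"crude OR:       {crude_or:.2f}")        # 3.10
print(f"OR smokers:     {or_smokers:.2f}")      # 1.00
print(f"OR non-smokers: {or_non_smokers:.2f}")  # 1.00
```

In these made-up data the crude odds ratio suggests a strong association, yet within each smoking stratum the odds ratio is exactly 1, which is the pattern expected when smoking alone drives the association.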
A 2x2 contingency table is a cross-tabulation that displays frequencies in relation to two categorical variables (factors), each with two levels (without marginal distributions and totals).
Experiment: Variable 1 is considered the independent variable, whose influence on Variable 2 is to be examined. In a controlled experiment:
Note that groups have been selected (randomized) w.r.t. the first variable!
| | Success | No Success |
|---|---|---|
| Control Group | a | b |
| Experimental Group | c | d |
| Control Group | Experimental Group |
|---|---|
| \(P_C=\frac{a}{a+b}\) | \(P_E=\frac{c}{c+d}\) |
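As a small worked example, the success proportions can be computed directly from the four cell counts. The counts used here are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical cell counts for the 2x2 table (illustration only).
a, b = 30, 70   # control group: successes, no successes
c, d = 45, 55   # experimental group: successes, no successes

p_control = a / (a + b)        # P_C = a / (a + b)
p_experimental = c / (c + d)   # P_E = c / (c + d)

print(f"P_C = {p_control:.2f}")       # 0.30
print(f"P_E = {p_experimental:.2f}")  # 0.45
```

Each proportion uses its own row total as the denominator, which is valid here because the group sizes were fixed by the design.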
Case-Control: Variable 1 is considered the independent variable, whose influence on Variable 2 is to be examined. In a case-control study:
The first category of Variable 2 indicates “cases”, while the second indicates “non-cases”. The first category of Variable 1 refers to “no risk”, the second to “risk”.
Note that groups have been selected w.r.t. the second variable (case status); unlike in the experiment, there is no randomization!
| | case | no case |
|---|---|---|
| no risk | a | b |
| risk | c | d |
The odds indicate, within each group, how many times more likely the first outcome (success or case) is than the second (no success or no case):
| no risk | risk |
|---|---|
| \(O_C=\frac{a}{b}\) | \(O_E=\frac{c}{d}\) |
The ratio of the two odds is called the odds ratio. It indicates how many times higher (>1) or lower (<1) the odds are in the risk (experimental) group compared to the no-risk (control) group. \[ OR=\frac{O_E}{O_C}=\frac{c/d}{a/b}=\frac{bc}{ad} \]
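A short numerical sketch of the odds and the odds ratio follows, using the cell labels a, b, c, d from the table above. The counts are hypothetical.

```python
# Hypothetical cell counts (illustration only).
a, b = 20, 80   # no risk: cases, non-cases
c, d = 40, 60   # risk:    cases, non-cases

odds_no_risk = a / b   # O_C = a / b
odds_risk = c / d      # O_E = c / d
odds_ratio = odds_risk / odds_no_risk   # OR = O_E / O_C = (b * c) / (a * d)

print(f"O_C = {odds_no_risk:.3f}")
print(f"O_E = {odds_risk:.3f}")
print(f"OR  = {odds_ratio:.3f}")
```

With these counts the odds of being a case are about 2.67 times higher in the risk group than in the no-risk group.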
Copyright: Raimund Kovacevic