Biostatistics

Day 1: Basic Notions

Priv. Doz. Dr. Raimund Kovacevic

09-2024

Statistics

“The science of collecting, analyzing, presenting, and interpreting data.” (Dodge, 2003, The Oxford Dictionary of Statistical Terms)

“The science of learning from data, and of measuring, controlling, and communicating uncertainty.” (Davidian & Louis, 2012, Why statistics? Science, 336(6077), 12-13)

“The science of making decisions in the face of uncertainty through the collection, analysis, and interpretation of data.” (Wasserman, 2004, All of Statistics: A Concise Course in Statistical Inference)

Medical Statistics: “The application of statistical methods to medical and health-related data to answer clinical and epidemiological questions.” (Bland, 2015, An Introduction to Medical Statistics)
Biostatistics: “The application of statistics to a wide range of topics in biology, including medicine, public health, and agricultural and environmental sciences.” (Rosner, 2006, Fundamentals of Biostatistics)

Application of Statistical Methods

17th century: Collection of vital statistics (London Bills of Mortality - J. Graunt)
19th century: Public health statistics (Cholera studies - J. Snow)
Early 20th century: Experimental design (Agricultural experiments - R.A. Fisher)
Mid 20th century: Randomized controlled trials (Streptomycin trial - A.B. Hill)
Late 20th century: Computational epidemiology (HIV/AIDS modeling - R.M. Anderson)
21st century: Big data and AI in healthcare (e.g., W.W. Stead)

History of Relevant Statistical Methods

16th-17th centuries: Early development of probability theory
18th century: Bayesian methods introduced
Early 19th century: Least squares method
Late 19th century: Correlation and regression analysis
Early 20th century: Student’s t-test (1908)
1920s-1930s: ANOVA and hypothesis testing framework
Mid 20th century: Randomized controlled trials
1960s-1970s: Cox proportional hazards model for survival analysis
Late 20th century: Advancements in computational methods and epidemiology
21st century: Big data analytics and personalized medicine

Probability and Distributions

Probability and Events

Sample Space (\(\Omega\)): Set of all possible outcomes of an experiment
Event: A subset of the sample space
Elementary Event (atom): An individual outcome in the sample space
Example: Rolling a die
- Sample Space: \(\Omega = \{1, 2, 3, 4, 5, 6\}\)
- Elementary Event: Rolling a 3 = \(\{3\}\)
- Event: Rolling an even number = \(\{2, 4, 6\}\)

Probability (“Risk”)

Definition: A measure of the likelihood of an event occurring
Notation: \(P(A)\) denotes the probability (measure) of event \(A\)
Properties:
- \(0 \leq P(A) \leq 1\) for any event \(A\)
- \(P(\Omega) = 1\) (certain event)
- \(P(\emptyset) = 0\) (impossible event)
- \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
Example: Rolling a (fair) die: \(P(\{1\})=\dots=P(\{6\})=1/6\) \[P(\text{Rolling an even number}) = P(\{2, 4, 6\}) = \frac{3}{6} = \frac{1}{2}\]
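The die example can be reproduced in a few lines of R. This is a minimal sketch, assuming a fair die; the names `omega`, `p`, `prob`, `even`, and `high` are illustrative and not part of the slides.

```r
# Sample space of a fair die and the probability of each elementary event
omega <- 1:6
p <- rep(1 / 6, length(omega))

# Probability of an event = sum of the probabilities of its outcomes
prob <- function(event) sum(p[omega %in% event])

even <- c(2, 4, 6)   # event "even number"
high <- c(5, 6)      # event "5 or 6" (illustrative second event)

p_even <- prob(even)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union    <- prob(union(even, high))
p_addition <- prob(even) + prob(high) - prob(intersect(even, high))

print(c(p_even = p_even, p_union = p_union, p_addition = p_addition))
```

Both ways of computing the probability of the union should agree, which illustrates the fourth property listed above.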

Andrei N. Kolmogorov (1903-1987)

Soviet mathematician
Rigorous foundation for probability theory
Some key contributions:
- Axioms of probability (1933)
- Contributions to the theory of stochastic processes
- Kolmogorov complexity in information theory
Impact:
- Unified various approaches to probability
- Modern probability theory and its applications
Legacy: His axioms are now the standard foundation for probability theory and its applications

Random Variable

Definition: A function \(X(\cdot)\) that assigns a real number to each outcome \(\omega\in \Omega\) in the sample space: \[\omega\mapsto X(\omega)\]
Notation: \(X(\omega)\), usually abbreviated to a capital letter (e.g., \(X\), \(Y\)). Lowercase letters then usually denote concrete values: \[X(\omega)=x, \; x=1.0\]
Types:
- Discrete: Takes on countable values (e.g., number of patients)
- Continuous: Can take any value in a range (e.g., blood pressure)
Connection to events: \(\{\omega: X(\omega) \leq x\}\) is an event for any \(x\)

Probability Distribution

In theory, any probability concerning a random variable can be calculated from the probability measure on the sample space.
In practice, such probabilities are usually calculated from functions defined directly on the values of the random variable.
Types:
- Discrete values: \(x_1,\dots, x_n\): Probability Mass Function \(p(x_i)\), PMF
- Continuous: \(x\in \mathbb{R}\): Probability Density Function \(f(x)\), (PDF)
For both types, one may use “PDF”
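Cumulative Distribution Function (CDF): \[F_X(x) = P(X \leq x)\] For a continuous random variable, \(F_X(x)=\int_{-\infty}^x f(y)dy\); for a discrete one, \(F_X(x)=\sum_{x_i \leq x} p(x_i)\).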
These functions define a “probability distribution”

Probability Mass Function (PMF)

Definition: Gives the probability of each possible value for a discrete random variable
Notation: \[p_X(x)=P(X = x)\]
Properties:
- \(p_X(x) \geq 0\) for all \(x\)
- \(\sum_x p_X(x) = 1\)
Calculating probabilities: \(P(a \leq X \leq b) = \sum_{x=a}^b p_X(x)\)
Usually, PMFs are parameterized: \(p_X(x|\theta)\). Here, \(\theta\) denotes one or several parameters that tune the shape of the function.
Example: Probability of observing different numbers of side effects in a clinical trial
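A short sketch of the rule \(P(a \leq X \leq b) = \sum_{x=a}^b p_X(x)\), assuming a Binomial(10, 0.3) variable as in the plot on the next slide; the interval \([2, 4]\) is chosen arbitrarily for illustration.

```r
# PMF of a Binomial(10, 0.3) variable, evaluated on the interval of interest
a <- 2
b <- 4
p_sum <- sum(dbinom(a:b, size = 10, prob = 0.3))

# Same probability via the CDF: P(X <= b) - P(X <= a - 1)
p_cdf <- pbinom(b, size = 10, prob = 0.3) - pbinom(a - 1, size = 10, prob = 0.3)

# The PMF sums to 1 over all possible values
total <- sum(dbinom(0:10, size = 10, prob = 0.3))

print(c(p_sum = p_sum, p_cdf = p_cdf, total = total))
```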

Visualizing PMF

# Example: PMF of a binomial distribution (10 trials, 0.3 success probability)
x <- 0:10
pmf <- dbinom(x, size = 10, prob = 0.3)

# Plot
barplot(pmf, names.arg = x, 
        main = "PMF of Binomial(10, 0.3)",
        xlab = "Number of successes", ylab = "Probability")

Probability Density Function (PDF)

Definition: Describes the density of probability for different values of a continuous random variable
Notation: \(f_X(x)\)
Properties:
- \(f_X(x) \geq 0\) for all \(x\)
- \(\int_{-\infty}^{\infty} f_X(x) dx = 1\)
Usually, PDFs are parameterized: \(f_X(x|\theta)\). Here, \(\theta\) denotes one or several parameters that tune the shape of the function.
Calculating probabilities: \(P(a \leq X \leq b) = \int_a^b f_X(x) dx\)
Note: \(f_X(x)\) is not a probability, but rather a density
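The following sketch checks the integral rule numerically for a standard normal density, using base R's `integrate()`. The interval \([-1, 1]\) and the narrow density with `sd = 0.1` are illustrative assumptions.

```r
# P(a <= X <= b) as the area under the density between a and b
a <- -1
b <- 1
p_int <- integrate(dnorm, lower = a, upper = b)$value

# Same probability from the CDF
p_cdf <- pnorm(b) - pnorm(a)

# The density integrates to 1 over the whole real line
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# A density value is not a probability: it can exceed 1
dens_narrow <- dnorm(0, mean = 0, sd = 0.1)

print(c(p_int = p_int, p_cdf = p_cdf, total = total, dens_narrow = dens_narrow))
```

The last value is larger than 1, which is only possible because it is a density and not a probability.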

Visualizing PDF

# Example: PDF of a normal distribution (mean = 0, sd = 1)
x <- seq(-4, 4, length.out = 100)
pdf <- dnorm(x, mean = 0, sd = 1)

# Plot
plot(x, pdf, type = "l", 
     main = "PDF of Standard Normal Distribution",
     xlab = "x", ylab = "Density")

Visualizing Probability Calculations

Expected Value

Motivation: The average value of a random variable over many repetitions
Notation: \(E[X]\) or \(\mu\)
Definition:
- Discrete: \(E[X] = \sum_x x \cdot p_X(x)\)
- Continuous: \(E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) dx\)
Relevance: Estimation of population parameters from sample data
Simple examples: dice, uniform distribution
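The two simple examples can be worked out as follows. This is a sketch assuming a fair die and a uniform distribution on \([0, 1]\); both choices are illustrative.

```r
# Discrete case: fair die, E[X] = sum of x * p(x)
x <- 1:6
p <- rep(1 / 6, 6)
e_die <- sum(x * p)

# Continuous case: Uniform(0, 1), E[X] = integral of x * f(x)
e_unif <- integrate(function(x) x * dunif(x, min = 0, max = 1),
                    lower = 0, upper = 1)$value

print(c(e_die = e_die, e_unif = e_unif))
```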

Pierre de Fermat (1607-1665)

French lawyer and mathematician
Correspondence with Blaise Pascal in 1654: birth of probability theory
Key contributions:
- Solved the “problem of points” with Pascal
  - Question: How to fairly divide stakes in an interrupted game of chance?
  - Solution: Consider all possible outcomes
- Developed method of expected values
- Fermat’s Last Theorem (proved in 1994 by Andrew Wiles)
Legacy: First calculations for problems under uncertainty

Variance and Standard Deviation

Aim: Quantification of variability in (medical) measurements and outcomes
Variance: Measure of spread around the expected value
- \(Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2\)
Standard Deviation: Square root of variance
- \(SD(X) = \sqrt{Var(X)}\)
- in the same units of measure as the random variable itself
Observe the difference to mean and standard deviation of data
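A sketch contrasting the variance of a random variable with the standard deviation of data, again assuming a fair die. The simulated sample of 1000 rolls is hypothetical data used only for illustration.

```r
# Theoretical variance of a fair die, computed with both formulas
x <- 1:6
p <- rep(1 / 6, 6)
mu <- sum(x * p)

var_def   <- sum((x - mu)^2 * p)     # E[(X - mu)^2]
var_short <- sum(x^2 * p) - mu^2     # E[X^2] - (E[X])^2
sd_x      <- sqrt(var_def)

# Mean and standard deviation of data: computed from observed rolls,
# so they vary from sample to sample around the theoretical values
set.seed(1)
rolls <- sample(x, size = 1000, replace = TRUE)

print(c(mu = mu, sd_x = sd_x, sample_mean = mean(rolls), sample_sd = sd(rolls)))
```

Note that `sd()` divides by \(n - 1\), whereas the theoretical variance is an expectation under the model.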

Typical Calculations I

If we have random variables \(X_1,X_2\) then \[ E(X_1+X_2) = E(X_1) + E(X_2). \]
For a sequence \(X_1,X_2,\dots, X_n\): \[E\left(\sum_i X_i\right) = \sum_i E(X_i)\]
If we have independent random variables \(X_1,X_2\) then \[ Var(X_1+X_2) = Var(X_1) + Var(X_2). \]
For a sequence \(X_1,X_2,\dots, X_n\) of independent random variables:
\[Var\left(\sum_i X_i\right) = \sum_i Var(X_i)\]

Typical Calculations II

For any random variable \(X\) and real number \(a\): \[E(a X) = a E(X),\;Var(a X) = a^2 Var(X)\]
Average of identically distributed random variables:
- Expectation \[E(\bar{X})=E\left(\frac{1}{n} \sum_i X_i\right) = \frac{1}{n}\sum_i E(X_i)=E(X_1)\]
- Variance (only independent variables!) \[Var(\bar{X})=Var\left(\frac{1}{n} \sum_i X_i\right) = \frac{1}{n^2}\sum_i Var(X_i)=\frac{1}{n}Var(X_1)\]
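These rules can be verified exactly for two independent fair dice by enumerating all 36 equally likely pairs. The sketch below assumes fair, independent dice; the object names are illustrative.

```r
# One fair die: expectation and variance
x <- 1:6
p <- rep(1 / 6, 6)
mu <- sum(x * p)
v  <- sum((x - mu)^2 * p)

# Two independent dice: all 36 pairs are equally likely
grid <- expand.grid(x1 = x, x2 = x)
s    <- grid$x1 + grid$x2          # sum of the two dice
xbar <- s / 2                      # average of the two dice

e_sum    <- mean(s)                        # should equal 2 * mu
var_sum  <- mean((s - e_sum)^2)            # should equal 2 * v
e_mean   <- mean(xbar)                     # should equal mu
var_mean <- mean((xbar - e_mean)^2)        # should equal v / 2

print(c(e_sum = e_sum, var_sum = var_sum, e_mean = e_mean, var_mean = var_mean))
```

Because every pair has the same probability, plain averages over the grid are exact expectations here, not estimates.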

Normal Distribution

Properties:
- Symmetric, bell-shaped curve
- Fully described by expectation \(\mu\) and standard deviation \(\sigma\): \(X\sim N(\mu,\sigma)\)
PDF: \[f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\]
Standard Normal Distribution: If \(X \sim N(\mu,\sigma)\), then \(Z = \frac{X - \mu}{\sigma}\) follows a standard normal distribution, i.e., \(Z \sim N(0,1)\)
Relevance: Used to model many natural phenomena, also in medicine (e.g., blood pressure, height)

Interactive graphics

Carl Friedrich Gauss (1777-1855)

German mathematician and physicist
Revolutionized multiple mathematical fields, including probability and statistics
Key contributions:
- Normal distribution (Gaussian distribution)
- Method of least squares for regression analysis
- Gauss-Markov theorem in linear regression
Applications:
- Normal distribution: Used in natural and social sciences
- Least squares: Fundamental in data analysis
Legacy: His work forms the basis of much of modern statistical theory and practice

Visualizing 68-95-99.7 Rule

Normal Distribution: 68-95-99.7 Rule

68-95-99.7 Rule:
- 68% of data within 1 SD of the mean
- 95% within 2 SD
- 99.7% within 3 SD
Application: Determining reference ranges for laboratory values
Example: If adult male height \(\sim N(70 \text{ inches}, 3 \text{ inches})\), what % are over 76 inches?
- \(Z = \frac{76 - 70}{3} = 2\)
- \(P(X > 76) = P(Z > 2) \approx 0.0228\) or 2.28%
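The rule and the height example can be checked with `pnorm()`. This is a minimal sketch using the values from the slide (mean 70 inches, standard deviation 3 inches).

```r
# Probability mass within 1, 2 and 3 standard deviations of the mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)

# Height example: P(X > 76) for X ~ N(mean = 70, sd = 3)
mu    <- 70
sigma <- 3
z     <- (76 - mu) / sigma

p_direct <- pnorm(76, mean = mu, sd = sigma, lower.tail = FALSE)
p_z      <- pnorm(z, lower.tail = FALSE)   # same result after standardizing

print(round(within_k, 4))
print(c(z = z, p_direct = p_direct, p_z = p_z))
```

Note that "95% within 2 SD" is a rounded statement; the exact 95% range is about 1.96 SD.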

Binomial Distribution

Properties:
- Discrete distribution for the number of successes in a fixed number of independent trials
- Two parameters: \(n\) (number of trials) and \(p\) (probability of success on each trial)
- Mean: \(E(X)=np\)
- Variance: \(Var(X) = np(1-p)\)
PMF: \[P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\]
Example: Within a sample of size \(n\), how many individuals get a certain condition within some time period?

Binomial Distribution: Example

Applications in Medicine:
- Analyzing treatment efficacy in clinical trials
- Studying genetic inheritance patterns
- Evaluating diagnostic test accuracy
Example: If a drug has a 70% success rate, what’s the probability of it being effective in exactly 8 out of 10 patients?
- \(n = 10\), \(p = 0.7\), \(k = 8\)
- \(P(X=8) = \binom{10}{8} 0.7^8 (1-0.7)^{10-8} \approx 0.2335\)
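The drug example can be computed directly from the formula or with `dbinom()`. The additional tail probability \(P(X \geq 8)\) is an illustrative extension, not part of the slide.

```r
n <- 10
p <- 0.7
k <- 8

# Direct use of the binomial formula
p_formula <- choose(n, k) * p^k * (1 - p)^(n - k)

# Built-in PMF
p_dbinom <- dbinom(k, size = n, prob = p)

# Illustrative extension: effective in at least 8 of 10 patients
p_at_least_8 <- pbinom(k - 1, size = n, prob = p, lower.tail = FALSE)

print(c(p_formula = p_formula, p_dbinom = p_dbinom, p_at_least_8 = p_at_least_8))
```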

Visualizing Binomial Distribution

Interactive graphics

Poisson Distribution

Properties:
- Discrete distribution for rare events
- Single parameter \(\lambda\) (lambda)
- \(E(X)=\lambda\) and \(Var(X)=\lambda\)
PMF: \[P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
Relevance: Modeling event counts in a fixed interval (e.g., mutations, new cases)

Poisson Distribution: Example

Applications in Medicine:
- Modeling infection rates
- Analyzing rare side effects
- Simplest waiting/survival time model
Example: If the average number of heart attacks in a hospital is 2 per day, what’s the probability of seeing exactly 5 in a day?
- \(\lambda = 2\), \(k = 5\)
- \(P(X=5) = \frac{2^5 e^{-2}}{5!} \approx 0.0361\)
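The same calculation in R, as a minimal sketch. The tail probability \(P(X \geq 5)\) is added for illustration only.

```r
lambda <- 2
k <- 5

# Direct use of the Poisson formula
p_formula <- lambda^k * exp(-lambda) / factorial(k)

# Built-in PMF
p_dpois <- dpois(k, lambda = lambda)

# Illustrative extension: 5 or more events in a day
p_5_or_more <- ppois(k - 1, lambda = lambda, lower.tail = FALSE)

print(c(p_formula = p_formula, p_dpois = p_dpois, p_5_or_more = p_5_or_more))
```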

Visualizing Poisson Distribution

Covariance and Correlation

Aim: We may consider several variables at once. Does information about one (“known”) variable give information about another (“unknown”) variable?
“Independence”: There is no information gain. Covariance and Correlation are measures for linear dependence
Covariance: Measure of joint variability between two random variables
- \(Cov(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]\)
Correlation: Standardized covariance
- \(Corr(X,Y) = \frac{Cov(X,Y)}{SD(X) \cdot SD(Y)}\)
Relevance: Assessing relationships between different medical parameters: e.g., systolic and diastolic blood pressure, dose-response
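A sketch of the sample versions of covariance and correlation, computed by hand and with `cov()` and `cor()`. The blood pressure values are simulated, hypothetical data; the linear relationship and all numeric settings are assumptions made for illustration.

```r
# Hypothetical data: diastolic pressure depends linearly on systolic plus noise
set.seed(42)
n <- 200
systolic  <- rnorm(n, mean = 125, sd = 15)
diastolic <- 30 + 0.4 * systolic + rnorm(n, mean = 0, sd = 6)

# Sample covariance: average product of deviations (with n - 1 in the denominator)
cov_manual <- sum((systolic - mean(systolic)) *
                  (diastolic - mean(diastolic))) / (n - 1)

# Correlation: covariance standardized by both standard deviations
cor_manual <- cov_manual / (sd(systolic) * sd(diastolic))

print(c(cov_manual = cov_manual, cov_builtin = cov(systolic, diastolic)))
print(c(cor_manual = cor_manual, cor_builtin = cor(systolic, diastolic)))
```

Unlike the covariance, the correlation is unit-free and always lies between -1 and 1.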

Visualizing Correlation

Conditional Probability

Definition: Probability of an event given that another event has occurred
Notation: \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
Bayes’ Theorem: \[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}=\frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A)+P(B|\neg A) \cdot P(\neg A)}\]
Relevance: Fundamental for diagnostic testing and risk assessment

Bayes’ Theorem: Medical Test

Scenario: Test for “Condition X”

Given:
- Prevalence: 1% (\(P(X) = 0.01\))
- Sensitivity: 95% (\(P(T|X) = 0.95\))
- Specificity: 90% (\(P(\neg T|\neg X) = 0.90\), so the false-positive rate is \(P(T|\neg X) = 0.10\))

Bayes’ Theorem: \[P(X|T) = \frac{P(T|X) \cdot P(X)}{P(T|X) \cdot P(X) + P(T|\neg X) \cdot P(\neg X)}\]

Calculation: \[\begin{aligned} P(X|T) &= \frac{0.95 \cdot 0.01}{(0.95 \cdot 0.01) + (0.10 \cdot 0.99)} \\ &\approx 0.0876 \end{aligned}\]

Interpretation: A positive test result indicates only an 8.76% chance of having Condition X.
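The calculation can be wrapped in a small helper to see how strongly the result depends on prevalence. This is a sketch; the function name `ppv` and the alternative prevalence values are illustrative assumptions.

```r
# Probability of the condition given a positive test (Bayes' theorem)
ppv <- function(prevalence, sensitivity, specificity) {
  true_pos  <- sensitivity * prevalence
  false_pos <- (1 - specificity) * (1 - prevalence)
  true_pos / (true_pos + false_pos)
}

# Values from the slide
p_slide <- ppv(prevalence = 0.01, sensitivity = 0.95, specificity = 0.90)

# Same test applied at other (hypothetical) prevalences
prev <- c(0.001, 0.01, 0.10, 0.30)
p_by_prev <- ppv(prev, sensitivity = 0.95, specificity = 0.90)

print(round(p_slide, 4))
print(round(setNames(p_by_prev, prev), 4))
```

With the same sensitivity and specificity, the probability of disease after a positive test rises as the condition becomes more common.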

Independence of Events

Definition: Events \(A\) and \(B\) are independent if \(P(A|B) = P(A)\)
Properties:
If \(A\) and \(B\) are independent, then \[P(A \cap B) = P(A) \cdot P(B)\]
Analogously, for independent random variables \(X\) and \(Y\): \[E(X\cdot Y) = E(X) \cdot E(Y)\]
For dependent events and variables, the calculations differ:
- \(P(A \cap B) = P(A) \cdot P(B|A)\)
- calculating \(E(X\cdot Y)\) requires the joint distribution
Many statistical methods (tests etc.) assume data from independent observations

Law of Large Numbers

Statement: For sequences of independent identically distributed random variables \(X_1, X_2,\dots, X_n\): As sample size \(n\) increases, the sample average converges to the expected value
Cumulative mean after \(n\) observations \(x_1,x_2,\dots,x_n\): \(\bar{x}_n=\frac{1}{n}\sum_{i=1}^n x_i\)
Weak Law: \(P(|\bar{X}_n - \mu| > \epsilon) \to 0\) as \(n \to \infty\)
Relevance: Justifies using sample statistics to estimate population parameters
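A simulation sketch of the cumulative mean, assuming repeated rolls of a fair die (expected value 3.5). The seed and the number of rolls are arbitrary choices.

```r
# Simulate die rolls and track the mean after each additional observation
set.seed(123)
n_rolls  <- 10000
rolls    <- sample(1:6, size = n_rolls, replace = TRUE)
cum_mean <- cumsum(rolls) / seq_along(rolls)

# The cumulative mean settles near the expected value as n grows
print(cum_mean[c(10, 100, 1000, 10000)])

plot(cum_mean, type = "l", log = "x",
     main = "Cumulative mean of die rolls",
     xlab = "Number of rolls (log scale)", ylab = "Cumulative mean")
abline(h = 3.5, lty = 2)   # expected value
```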

Visualizing Central Limit Theorem

Central Limit Theorem

Statement: The suitably standardized sum (or average) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed
Formal: \[\frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0,1) \text{, as } n \to \infty\]
Relevance: Often enables the use of normal distribution-based methods when sample size is large
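A simulation sketch of the standardized sum from the formal statement, assuming exponentially distributed variables with rate 1 (so \(\mu = 1\), \(\sigma = 1\)). The distribution, sample size, and number of repetitions are illustrative choices.

```r
set.seed(1)
n     <- 50      # number of variables per sum
reps  <- 5000    # number of simulated sums
mu    <- 1       # mean of Exp(1)
sigma <- 1       # standard deviation of Exp(1)

# Standardized sums: (sum - n * mu) / (sigma * sqrt(n))
sums <- replicate(reps, sum(rexp(n, rate = 1)))
z    <- (sums - n * mu) / (sigma * sqrt(n))

# If z is close to N(0, 1): mean near 0, sd near 1,
# and roughly 95% of values within +/- 1.96
z_mean   <- mean(z)
z_sd     <- sd(z)
coverage <- mean(abs(z) <= 1.96)

print(c(z_mean = z_mean, z_sd = z_sd, coverage = coverage))
```

The individual variables are strongly skewed, yet the standardized sums behave approximately like a standard normal variable.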

Statistics

Probability Theory and Statistics

Statistics: The Inverse Problem

Goal of Statistics:
- Characterize probabilistic models that could explain observed data
- “Work backwards” from data to potential models
Process:
1. Collect and analyze data
2. Infer characteristics of the underlying probabilistic model
3. Make predictions or draw conclusions based on this inferred model
Example in medical research:
- Observed: Treatment outcomes in a clinical trial
- Goal: Infer the probability distribution of treatment effectiveness in the population

Statistics as Random Variables

Data: Realizations of random variables
Statistics: Functions of data, therefore also random variables
- Sample mean: \(\bar{x}=\frac{1}{n}\sum_i x_i\)
- Mean as a function of random variables: \[\bar{X}=\frac{1}{n}\sum_i X_i\]
Implications:
- Each sample yields different value for a statistic

Sample/Sampling Distribution

Sample Distribution:
- Distribution of data points in a single sample
- Describes the spread of observed values
Sampling Distribution:
- Distribution of a statistic as a function of random variables
- alternatively: Empirical distribution of a statistic over many samples
- Describes variability of the statistic itself
- Crucial for inferential statistics
Example:
- Sample Distribution: Heights of 100 patients
- Sampling Distribution: Means of height from 50 samples of 100 patients

Sampling Distribution: Mean

Consider \(n\) independent and identically distributed (i.i.d.) random variables \(X_1, X_2, \ldots, X_n\), following a normal distribution, \(X_i \sim N(\mu, \sigma^2)\) (here the second parameter is the variance \(\sigma^2\)).
The sample mean \(\bar{X}\) is: \(\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\)
Expected Value: \[E[\bar{X}] = \mu\]
Variance: \[Var(\bar{X}) = \frac{\sigma^2}{n}\]
Distribution: Sampling distribution of the mean follows a normal distribution:

\[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\]
If we observe data \(x_i\), which values of \(\mu\) and \(\sigma\) make the model fit the data?
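A simulation sketch of the sampling distribution of the mean, assuming normally distributed measurements with \(\mu = 120\) and \(\sigma = 15\) and samples of size \(n = 25\). All numeric values are hypothetical.

```r
set.seed(10)
mu    <- 120
sigma <- 15
n     <- 25
reps  <- 5000

# Draw many samples of size n and record the mean of each
means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Theory: E[mean] = mu and SD(mean) = sigma / sqrt(n)
sd_theory <- sigma / sqrt(n)

print(c(mean_of_means = mean(means), sd_of_means = sd(means),
        sd_theory = sd_theory))
```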

Resampling

Definition: Resampling involves repeatedly drawing samples from an original dataset
Purpose: To estimate the variability of a statistic or to approximate its sampling distribution
Bootstrap: Sampling with replacement from the original dataset
Advantages:
- Non-parametric: Doesn’t assume a specific underlying distribution
- Flexible: Applicable to various statistics (mean, median, correlation, etc.)
- Robust: Works well with small sample sizes and complex data structures
- Provides empirical insight into sampling distributions
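A minimal bootstrap sketch for the median, using only base R. The data are simulated and hypothetical, and the percentile interval shown is one common choice among several bootstrap interval methods.

```r
# Hypothetical skewed sample (e.g., lengths of stay in days)
set.seed(2024)
x <- rexp(40, rate = 1 / 5)

# Bootstrap: resample with replacement, recompute the statistic each time
B <- 2000
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

# Bootstrap standard error and a simple percentile interval
se_boot <- sd(boot_medians)
ci_boot <- quantile(boot_medians, probs = c(0.025, 0.975))

print(c(median = median(x), se_boot = se_boot))
print(ci_boot)
```

Each resample has the same size as the original data, which is the default behaviour of `sample()` when no size is given.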

Resampling

Standard Error

Definition: Standard deviation of a sampling distribution
For mean: \(SE = \frac{s}{\sqrt{n}}\)
- s: sample standard deviation
- n: sample size
Significance:
1. Measures precision of a sample statistic
2. Used in calculating confidence intervals
3. Crucial for hypothesis testing
Example: Interpreting the standard error of the mean and confidence intervals in a clinical trial
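A sketch of the standard error of the mean and a 95% confidence interval built from it. The data are simulated, hypothetical blood pressure readings, and the t-based interval is an assumption of this example (it presumes approximately normal data).

```r
# Hypothetical sample of 30 systolic blood pressure readings
set.seed(7)
x <- rnorm(30, mean = 120, sd = 15)
n <- length(x)

# Standard error of the mean: s / sqrt(n)
se <- sd(x) / sqrt(n)

# 95% confidence interval based on the t distribution with n - 1 df
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

print(c(mean = mean(x), se = se))
print(ci)
```

The interval matches the one reported by `t.test(x)$conf.int`.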

Overview of Statistical Approaches

Descriptive Statistics:
- Summarize and describe data characteristics
- Examples: Mean, median, standard deviation, graphs
Inferential Statistics:
- Make predictions or inferences about a population
- Examples: Hypothesis tests, confidence intervals
Exploratory Data Analysis:
- Uncover patterns and relationships in data
- Examples: Data visualization, correlation analysis
Predictive Statistics:
- Develop models to predict future outcomes
- Examples: Regression analysis, machine learning models

Study Design

Introduction to Study Design

Study design is the framework that guides data collection and analysis in research, ensuring that the study answers the research question in a valid and reliable manner.
Purpose: To minimize bias, maximize accuracy, and ensure generalizability.
Categories: Observational and Experimental studies.
Note: Study design refers exclusively to new data collection, excluding any analysis of existing data.

Population and Sample

Population: Entire group about which we want to draw conclusions
- The whole population of a state
- All patients with Type 2 Diabetes in the US
- All erythrocytes in a human body
- …
Hard to analyze whole populations (costs, practicality)
Sample: Subset of the population used to make inferences
- 500 persons, participating in a poll
- 1000 Type 2 Diabetes patients in a clinical trial
- 10 ml blood sample, taken from a specific arm vein

- …

Sample Size and Power Considerations

Sample Size: Number of participants needed to detect an effect, influenced by effect size, variability, significance level, and power.
Power: Probability of correctly rejecting the null hypothesis when it is false.
Importance: Adequate sample size and power are critical for the validity of the study results.
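The interplay of effect size, variability, significance level, and power can be explored with `power.t.test()` from base R's stats package. The scenario below (two-sample t-test, difference of 5 units, standard deviation 10) is a hypothetical assumption.

```r
# Required sample size per group for a two-sample t-test
res <- power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(res$n)

# Power actually achieved with that (rounded-up) sample size
achieved <- power.t.test(n = n_per_group, delta = 5, sd = 10,
                         sig.level = 0.05)$power

# Halving the effect size roughly quadruples the required sample size
n_small_effect <- ceiling(power.t.test(delta = 2.5, sd = 10,
                                       sig.level = 0.05, power = 0.80)$n)

print(c(n_per_group = n_per_group, achieved_power = achieved,
        n_small_effect = n_small_effect))
```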

Validity

Internal Validity: The extent to which the study accurately reflects reality within the study context
External Validity: The degree to which the study’s results can be generalized to other settings, populations, or times.

Example: Validity

Study Aim: Assess effectiveness of a new hypertension medication
Measurement: Office blood pressure readings
Validity Concern: White coat hypertension
- Office readings may not accurately represent true blood pressure
- Could lead to overestimation of hypertension prevalence
Improving Validity: Use 24-hour ambulatory blood pressure monitoring
- Provides a more accurate representation of true blood pressure
- Reduces impact of white coat effect (Pickering et al., 2005)

Reliability

Reliability: The consistency of study results across different instances or repetitions under the same conditions.
Importance: Ensures that findings are dependable and can be replicated in similar settings.

Example: Reliability

Example: Depression Scale Assessment

Study Aim: Evaluate efficacy of a new antidepressant
Measurement: Hamilton Depression Rating Scale (HAM-D)
Reliability Concern: Inter-rater variability
- Different clinicians may score the same patient differently
- Could lead to inconsistent results across study sites
Improving Reliability: Standardized training for raters
- Implement rigorous training program for all clinicians
- Conduct regular inter-rater reliability checks (Müller & Szegedi, 2002)

Study Designs: An Overview

Observational Studies: No intervention; the researcher observes and measures outcomes.
Experimental Studies: The researcher manipulates variables to observe effects.
Key Differences:
- Causal Inference: Experimental studies provide stronger evidence.
- Ethical Considerations: Observational studies are often used where experimentation would be unethical or even impossible.

Observational Studies

Cohort Studies: Follows groups over time based on exposure.
Case-Control Studies: Compares those with and without a condition retrospectively.
Cross-Sectional Studies: Analyzes data from a population at a single point in time.
Key problems:
- Confounding Factors: Hidden variables that could influence the results.
- Bias: Potential sources of systematic errors, like selection and information biases.

Main Study Designs: Evidence

from: Somerville et al. (2016), Public Health and Epidemiology at a Glance

Cohort Studies

Design: Groups based on exposure status, followed over time.
Prospective vs. Retrospective:
- Prospective Cohort: Follows participants forward in time.
- Retrospective Cohort: Uses existing data to track participants backward.
Advantages: Establishes temporal sequence, keeps track of individuals
Disadvantages: Time-consuming, expensive
Incidence Rate: \[ \text{Incidence Rate} = \frac{\text{Number of new cases}}{\text{Total person-time at risk}} \]
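A small sketch of the incidence rate calculation. The six participants, their follow-up times, and their outcomes are hypothetical.

```r
# Hypothetical cohort: years at risk and whether a new case occurred
follow_up_years <- c(2.0, 5.0, 3.5, 5.0, 1.5, 4.0)
new_case        <- c(1,   0,   1,   0,   0,   1)

# Incidence rate = new cases / total person-time at risk
person_years   <- sum(follow_up_years)
incidence_rate <- sum(new_case) / person_years

# Often reported per 1000 person-years
rate_per_1000 <- 1000 * incidence_rate

print(c(person_years = person_years, incidence_rate = incidence_rate,
        rate_per_1000 = rate_per_1000))
```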

Retrospective Cohort Studies

Prospective Cohort Studies

Case-Control Studies

Design: Compares participants with a specific condition to those without.
Key Feature: Data collected retrospectively to identify risk factors.
Advantages: Efficient for rare diseases, quicker, and less costly.
Disadvantages: Prone to recall bias, cannot directly measure incidence.
Formula: \[ \text{Odds Ratio (OR)} = \frac{\text{Odds of exposure among cases}}{\text{Odds of exposure among controls}} \]

Case-Control Studies

Cross-Sectional Studies

Design: Data collection at a single point in time to examine prevalence.
Purpose: Assesses the burden of a disease or health condition.
Advantages: Quick, cost-effective, useful for public health surveillance.
Disadvantages: Cannot determine causality, susceptible to bias.

Cross-Sectional Studies

Confounding

Definition: Confounding occurs when the relationship between an exposure and an outcome is distorted by the presence of another variable (confounder). Leads to a biased estimate of the association.
Criteria for a Confounder:
1. Associated with the exposure in the source population
2. Causally related to the outcome
3. Not in the causal pathway between exposure and outcome
Impact: Can create a spurious association or mask a true relationship
Addressing Confounding:
- Study design: Randomization, matching
- Analysis: take into account confounder - regression, stratification …
Note: subject-matter knowledge is needed

Example: Coffee and Lung Cancer

Historical Context: 1960s studies showed higher lung cancer rates among coffee drinkers
Observed Association:
- Higher coffee consumption → Higher lung cancer risk
Confounder: Smoking status
- Smokers tend to drink more coffee
- Smoking is a strong risk factor for lung cancer
True Relationships:
1. Smoking → Increased coffee consumption
2. Smoking → Increased lung cancer risk
Result: Smoking created a spurious association between coffee and lung cancer
Resolution: When smoking was accounted for, no significant association between coffee and lung cancer remained (Guertin et al., 2016)

Bias in Observational Studies

Selection Bias
- Definition: Study population doesn’t represent target population
- Example: Heart disease study recruiting only from cardiology clinic
- Note: Impact depends on study design and selection mechanism:
  - Case-control: Can affect exposure-outcome association
  - Cohort: May influence incidence rates or risk estimates (Hernán et al., 2004)
Information Bias
- Definition: Systematic errors in measuring variables
- Example: Recall bias in retrospective diet and cancer studies
- Note: Can lead to over- or underestimation (Rothman et al., 2008)

Experimental Studies Overview

Design: Involves manipulation of an independent variable to observe effects.
Randomized Controlled Trials: Participants randomly assigned to groups.
Randomization reduces confounding by distributing confounding variables equally across groups on average.
Blinding: Prevents bias by keeping participants and/or researchers unaware of group assignments.
Control Groups: Provide a baseline for comparison with the treatment group.
Advantages: Strong causal evidence, minimizes confounding.
Disadvantages: Expensive, time-consuming, ethical constraints.
Relative Risk: \[ \text{Relative Risk (RR)} = \frac{\text{Risk in treatment group}}{\text{Risk in control group}} \]

Blinding Techniques in RCTs

Single-Blind: Participants unaware of group assignment.
Double-Blind: Both participants and researchers unaware.
Triple-Blind: Participants, researchers, and data analysts unaware.
Purpose:
- Reduces performance bias (systematic differences in exact treatment between groups)
- Reduces detection bias (systematic differences in measurement between groups).

2x2 Contingency Table: Experiment

A 2x2 contingency table is a cross-tabulation that displays frequencies in relation to two categorical variables (factors), each with two levels (without marginal distributions and totals).
Experiment: Variable 1 is considered the independent variable, whose influence on Variable 2 is to be examined. In a controlled experiment:
- The first category of Variable 1 serves as the control group (placebo, or current treatment), the second category defines the experimental group.
- The first category of Variable 2 indicates “successes” in a controlled study. Depending on the subject of investigation, successes can also have a negative meaning (e.g., deaths).
Note that groups have been selected (randomized) w.r.t. the first variable!

Relative Risk

	Success	No Success
Control Group	a	b
Experimental Group	c	d

Estimated probabilities for success:

Control Group	Experimental Group
\(P_C=\frac{a}{a+b}\)	\(P_E=\frac{c}{c+d}\)

The relative risk indicates how many times greater (or smaller) the probability of success is in the experimental group than in the control group. \[R_R=\frac{P_E}{P_C}\]
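A sketch of the relative risk computed from a 2x2 table laid out as on this slide. The counts are hypothetical.

```r
# Hypothetical counts; rows = groups, columns = outcome
tab <- matrix(c(30, 70,
                15, 85),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("Control", "Experimental"),
                              outcome = c("Success", "No success")))

# Estimated success probabilities: a / (a + b) and c / (c + d)
p_c <- tab["Control", "Success"] / sum(tab["Control", ])
p_e <- tab["Experimental", "Success"] / sum(tab["Experimental", ])

# Relative risk: probability in the experimental group relative to control
rr <- p_e / p_c

print(tab)
print(c(p_c = p_c, p_e = p_e, rr = rr))
```

If "success" denotes an adverse event such as death, a relative risk below 1 means the event is less likely in the experimental group.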

2x2 Contingency Table: Case-Control Study

A 2x2 contingency table is a cross-tabulation that displays frequencies in relation to two categorical variables (factors), each with two levels (without marginal distributions and totals). If data on individual observations is given:

Case-Control: Variable 1 is considered the independent variable, whose influence on Variable 2 is to be examined. In a case-control study:

The first category of Variable 2 indicates “cases”, while the second indicates “non-cases”. The first category of Variable 1 refers to “no risk”, the second to “risk”.
Note that groups have been selected w.r.t. the second variable (case status), not randomized!

Odds Ratio

	case	no case
no risk	a	b
risk	c	d

The odds indicate how many times more likely being a case is compared to not being a case, within each of the two groups:

no risk	risk
\(O_C=\frac{a}{b}\)	\(O_E=\frac{c}{d}\)

The ratio of the odds is called the odds ratio and indicates how many times higher (>1) or lower (<1) the odds are in the risk group (\(O_E\)) compared to the no-risk group (\(O_C\)). \[ OR=\frac{O_E}{O_C} \]
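A sketch of the odds ratio computed from a 2x2 table laid out as on this slide. The counts are hypothetical. The last lines check that the same value results from the "odds of exposure among cases versus controls" formulation used on the Case-Control Studies slide.

```r
# Hypothetical counts; rows = exposure, columns = case status
tab <- matrix(c(40, 60,
                60, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("no risk", "risk"),
                              status   = c("case", "no case")))

# Odds of being a case within each exposure group: a / b and c / d
o_c <- tab["no risk", "case"] / tab["no risk", "no case"]
o_e <- tab["risk", "case"] / tab["risk", "no case"]

# Odds ratio
or <- o_e / o_c

# Equivalent: odds of exposure among cases divided by odds among non-cases
odds_exp_cases    <- tab["risk", "case"] / tab["no risk", "case"]
odds_exp_controls <- tab["risk", "no case"] / tab["no risk", "no case"]
or_exposure <- odds_exp_cases / odds_exp_controls

print(c(o_c = o_c, o_e = o_e, or = or, or_exposure = or_exposure))
```

This symmetry is why the odds ratio can be estimated in a case-control study even though incidence cannot be measured directly.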

Decision Tree

Advantages

Ethical Considerations in Study Design

Key Principles:
- Respect for Persons: Acknowledging autonomy and protecting those with diminished autonomy.
- Beneficence: Obligation to minimize harm and maximize benefits.
- Justice: Fair distribution of research benefits and burdens.
Informed Consent: Full disclosure of study purpose, procedures, risks, and benefits.
Institutional Review Boards (IRBs): Review research proposals for ethical compliance.
