Statistics

“The science of collecting, analyzing, presenting, and interpreting data.” (Dodge, 2003, The Oxford Dictionary of Statistical Terms)

“The science of learning from data, and of measuring, controlling, and communicating uncertainty.” (Davidian & Louis, 2012, Why statistics? Science, 336(6077), 12-13)

“The science of making decisions in the face of uncertainty through the collection, analysis, and interpretation of data.” (Wasserman, 2004, All of Statistics: A Concise Course in Statistical Inference)

  • Medical Statistics: “The application of statistical methods to medical and health-related data to answer clinical and epidemiological questions.” (Bland, 2015, An Introduction to Medical Statistics)

  • Biostatistics: “The application of statistics to a wide range of topics in biology, including medicine, public health, and agricultural and environmental sciences.” (Rosner, 2006, Fundamentals of Biostatistics)

Application of Statistical Methods

  • 17th century: Collection of vital statistics (London Bills of Mortality - J. Graunt, 1662)

  • 19th century: Public health statistics (Cholera studies - J. Snow)

  • Early 20th century: Experimental design (Agricultural experiments - R.A. Fisher)

  • Mid 20th century: Randomized controlled trials (Streptomycin trial - A.B. Hill)

  • Late 20th century: Computational epidemiology (HIV/AIDS modeling - R.M. Anderson)

  • 21st century: Big data and AI in healthcare (e.g., W.W. Stead)

History of Relevant Statistical Methods

  • 16th-17th centuries: Early development of probability theory

  • 18th century: Bayesian methods introduced

  • Early 19th century: Least squares method

  • Late 19th century: Correlation and regression analysis

  • Early 20th century: Student’s t-test (1908)

  • 1920s-1930s: ANOVA and hypothesis testing framework

  • Mid 20th century: Randomized controlled trials

  • 1960s-1970s: Cox proportional hazards model for survival analysis

  • Late 20th century: Advancements in computational methods and epidemiology

  • 21st century: Big data analytics and personalized medicine

Probability and Events

  • Sample Space (\(\Omega\)): Set of all possible outcomes of an experiment
  • Event: A subset of the sample space
  • Elementary Event (atom): An individual outcome in the sample space
  • Example: Rolling a die
    • Sample Space: \(\Omega = \{1, 2, 3, 4, 5, 6\}\)
    • Elementary Event: Rolling a 3 = \(\{3\}\)
    • Event: Rolling an even number = \(\{2, 4, 6\}\)

Probability (“Risk”)

  • Definition: A measure of the likelihood of an event occurring
  • Notation: \(P(A)\) denotes the probability (measure) of event \(A\)
  • Properties:
    • \(0 \leq P(A) \leq 1\) for any event \(A\)
    • \(P(\Omega) = 1\) (certain event)
    • \(P(\emptyset) = 0\) (impossible event)
    • \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
  • Example: Rolling a (fair) die: \(P(\{1\})=\dots=P(\{6\})=1/6\) \[P(\text{Rolling an even number}) = P(\{2, 4, 6\}) = \frac{3}{6} = \frac{1}{2}\]
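
The properties listed above can be checked numerically for the fair die. The following is a minimal sketch in R; the helper `prob` and the second event (`at_least_4`) are illustrative assumptions, not part of the original slides.

```r
# Sample space of a fair die; each elementary event has probability 1/6
omega <- 1:6
p <- rep(1 / 6, length(omega))

# Probability of an event (a subset of omega): sum of elementary probabilities
prob <- function(event) sum(p[omega %in% event])

even       <- c(2, 4, 6)   # event A: rolling an even number
at_least_4 <- c(4, 5, 6)   # event B: rolling at least 4 (hypothetical second event)

p_even  <- prob(even)                       # 0.5
p_union <- prob(union(even, at_least_4))    # P(A or B), computed directly

# Addition rule: P(A u B) = P(A) + P(B) - P(A n B)
p_rule <- prob(even) + prob(at_least_4) - prob(intersect(even, at_least_4))
```

Both `p_union` and `p_rule` give the same value, as the addition rule requires.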

Andrei N. Kolmogorov (1903-1987)

  • Soviet mathematician
  • Rigorous foundation for probability theory
  • Some key contributions:
    • Axioms of probability (1933)
    • Contributions to the theory of stochastic processes
    • Kolmogorov complexity in information theory
  • Impact:
    • Unified various approaches to probability
    • Modern probability theory and its applications
  • Legacy: His axioms are now the standard foundation for probability theory and its applications

Random Variable

  • Definition: A function \(X(\cdot)\) that assigns a real number to each outcome \(\omega\in \Omega\) in the sample space: \[\omega\mapsto X(\omega)\]
  • Notation: \(X(\omega)\), abbreviated by a capital letter (e.g., \(X\), \(Y\)). Lowercase letters usually denote concrete values, e.g., \[X(\omega)=x, \; x=1.0\]
  • Types:
    • Discrete: Takes on countable values (e.g., number of patients)
    • Continuous: Can take any value in a range (e.g., blood pressure)
  • Connection to events: \(\{\omega: X(\omega) \leq x\}\) is an event for any \(x\)

Probability Distribution

  • In theory, any probability at the random-variable level can be calculated from the probability measure on the sample space.
  • In practice, however, these probabilities are calculated from certain functions.
  • Types:
    • Discrete values \(x_1,\dots, x_n\): Probability Mass Function (PMF), \(p(x_i)\)
    • Continuous values \(x\in \mathbb{R}\): Probability Density Function (PDF), \(f(x)\)
  • For both types, the term “PDF” is sometimes used
  • Cumulative Distribution Function (CDF): \[F_X(x) = P(X \leq x),\] which equals \(\int_{-\infty}^x f(y)\,dy\) in the continuous case and \(\sum_{x_i \leq x} p(x_i)\) in the discrete case.
  • These functions define a “probability distribution”

Probability Mass Function (PMF)

  • Definition: Gives the probability of each possible value for a discrete random variable
  • Notation: \[p_X(x)=P(X = x)\]
  • Properties:
    • \(p_X(x) \geq 0\) for all \(x\)
    • \(\sum_x p_X(x) = 1\)
  • Calculating probabilities: \(P(a \leq X \leq b) = \sum_{x=a}^b p_X(x)\)
  • Usually, PMFs are parameterized: \(p_X(x|\theta)\). Here, \(\theta\) denotes one or several parameters that determine the shape of the function.
  • Example: Probability of observing different numbers of side effects in a clinical trial

Visualizing PMF

# Example: PMF of a binomial distribution (10 trials, 0.3 success probability)
x <- 0:10
pmf <- dbinom(x, size = 10, prob = 0.3)

# Plot
barplot(pmf, names.arg = x, 
        main = "PMF of Binomial(10, 0.3)",
        xlab = "Number of successes", ylab = "Probability")

Probability Density Function (PDF)

  • Definition: Describes the density of probability for different values of a continuous random variable
  • Notation: \(f_X(x)\)
  • Properties:
    • \(f_X(x) \geq 0\) for all \(x\)
    • \(\int_{-\infty}^{\infty} f_X(x) dx = 1\)
  • Usually, PDFs are parameterized: \(f_X(x|\theta)\). Here, \(\theta\) denotes one or several parameters that determine the shape of the function.
  • Calculating probabilities: \(P(a \leq X \leq b) = \int_a^b f_X(x) dx\)
  • Note: \(f_X(x)\) is not a probability, but rather a density

Visualizing PDF

# Example: PDF of a normal distribution (mean = 0, sd = 1)
x <- seq(-4, 4, length.out = 100)
pdf <- dnorm(x, mean = 0, sd = 1)

# Plot
plot(x, pdf, type = "l", 
     main = "PDF of Standard Normal Distribution",
     xlab = "x", ylab = "Density")

Visualizing Probability Calculations
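
For a continuous random variable, probabilities are areas under the PDF. As a small sketch (the interval \([-1, 1]\) and the standard normal distribution are chosen here only for illustration), the same probability can be obtained from the CDF or by numerically integrating the density:

```r
# P(a <= X <= b) for X ~ N(0, 1), computed in two ways
a <- -1
b <- 1

# Via the CDF: F(b) - F(a)
p_cdf <- pnorm(b) - pnorm(a)

# Via numerical integration of the density over [a, b]
p_int <- integrate(dnorm, lower = a, upper = b)$value

# The density itself is not a probability: with a small sd it exceeds 1
dens_at_zero <- dnorm(0, mean = 0, sd = 0.1)
```

Both routes agree up to numerical error, and `dens_at_zero` is larger than 1, which illustrates that \(f_X(x)\) is a density rather than a probability.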

Expected Value

  • Motivation: The average value of a random variable over many repetitions
  • Notation: \(E[X]\) or \(\mu\)
  • Definition:
    • Discrete: \(E[X] = \sum_x x \cdot p_X(x)\)
    • Continuous: \(E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) dx\)
  • Relevance: Estimation of population parameters from sample data
  • Simple examples: dice, uniform distribution
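
The two simple examples mentioned above can be worked out directly from the definitions. This is a sketch using base R only; the uniform distribution on \([0, 1]\) is an assumed choice of interval.

```r
# Discrete case: fair die, E[X] = sum over x of x * p(x)
x <- 1:6
p <- rep(1 / 6, 6)
e_die <- sum(x * p)   # 3.5

# Continuous case: uniform distribution on [0, 1], E[X] = integral of x * f(x) dx
e_unif <- integrate(function(u) u * dunif(u, min = 0, max = 1),
                    lower = 0, upper = 1)$value   # 0.5
```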

Pierre de Fermat (1607-1665)

  • French lawyer and mathematician
  • Correspondence with Blaise Pascal in 1654: birth of probability theory
  • Key contributions:
    • Solved the “problem of points” with Pascal
      • Question: How to fairly divide stakes in an interrupted game of chance?
      • Solution: Consider all possible outcomes
    • Developed method of expected values
    • Fermat’s Last Theorem (proved in 1994 by Andrew Wiles)
  • Legacy: First calculations for problems under uncertainty

Variance and Standard Deviation

  • Aim: Quantification of variability in (medical) measurements and outcomes
  • Variance: Measure of spread around the expected value
    • \(Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2\)
  • Standard Deviation: Square root of variance
    • \(SD(X) = \sqrt{Var(X)}\)
    • in the same units of measure as the random variable itself
  • Note the difference from the sample mean and sample standard deviation, which are computed from observed data
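
To make the distinction between model quantities and data summaries concrete, the sketch below computes the variance of a fair die from its PMF and compares it with the sample standard deviation of simulated rolls. The simulation size and seed are arbitrary assumptions.

```r
# Model level: variance of a fair die from its PMF
x <- 1:6
p <- rep(1 / 6, 6)
mu <- sum(x * p)
var_def   <- sum((x - mu)^2 * p)      # E[(X - mu)^2]
var_short <- sum(x^2 * p) - mu^2      # E[X^2] - (E[X])^2
sd_die    <- sqrt(var_def)

# Data level: sample mean and sample SD of simulated rolls
set.seed(1)
rolls <- sample(x, size = 1000, replace = TRUE)
sample_mean <- mean(rolls)
sample_sd   <- sd(rolls)
```

The two variance formulas agree exactly, while `sample_mean` and `sample_sd` only approximate `mu` and `sd_die`.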

Typical Calculations I

  • If we have random variables \(X_1,X_2\) then \[ E(X_1+X_2) = E(X_1) + E(X_2). \]
  • For a sequence \(X_1,X_2,\dots, X_n\): \[E\left(\sum_i X_i\right) = \sum_i E(X_i)\]
  • If we have independent random variables \(X_1,X_2\) then \[ Var(X_1+X_2) = Var(X_1) + Var(X_2). \]
  • For a sequence \(X_1,X_2,\dots, X_n\) of independent random variables:
    \[Var\left(\sum_i X_i\right) = \sum_i Var(X_i)\]

Typical Calculations II

  • For any random variable \(X\) and real number \(a\): \[E(a X) = a E(X),\;Var(a X) = a^2 Var(X)\]

  • Average of identically distributed random variables:

    • Expectation \[E(\bar{X})=E\left(\frac{1}{n} \sum_i X_i\right) = \frac{1}{n}\sum_i E(X_i)=E(X_1)\]
    • Variance (only independent variables!) \[Var(\bar{X})=Var\left(\frac{1}{n} \sum_i X_i\right) = \frac{1}{n^2}\sum_i Var(X_i)=\frac{1}{n}Var(X_1)\]
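
The rules for the average can be illustrated by simulation. The sketch below assumes i.i.d. normal variables with mean 10 and standard deviation 2 (arbitrary illustrative values) and checks that the mean of \(\bar{X}\) is close to \(E(X_1)\) and its variance close to \(Var(X_1)/n\).

```r
set.seed(42)
n    <- 25      # size of each sample
reps <- 10000   # number of simulated samples

# Each row is one sample of n i.i.d. N(mean = 10, sd = 2) values
samples <- matrix(rnorm(n * reps, mean = 10, sd = 2), nrow = reps)
means   <- rowMeans(samples)

mean_of_means <- mean(means)   # should be close to 10
var_of_means  <- var(means)    # should be close to 2^2 / 25 = 0.16
```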

Normal Distribution

  • Properties:
    • Symmetric, bell-shaped curve
    • Fully described by expectation \(\mu\) and standard deviation \(\sigma\) (variance \(\sigma^2\)): \(X\sim N(\mu,\sigma^2)\)
  • PDF: \[f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
  • Standard Normal Distribution: If \(X\sim N(\mu,\sigma^2)\), then \(Z = \frac{X - \mu}{\sigma}\) follows a standard normal distribution, i.e., \(Z \sim N(0,1)\)
  • Relevance: Used to model many natural phenomena, also in medicine (e.g., blood pressure, height)


Carl Friedrich Gauss (1777-1855)

  • German mathematician and physicist
  • Revolutionized multiple mathematical fields, including probability and statistics
  • Key contributions:
    • Normal distribution (Gaussian distribution)
    • Method of least squares for regression analysis
    • Gauss-Markov theorem in linear regression
  • Applications:
    • Normal distribution: Used in natural and social sciences
    • Least squares: Fundamental in data analysis
  • Legacy: His work forms the basis of much of modern statistical theory and practice

Visualizing 68-95-99.7 Rule

Normal Distribution: 68-95-99.7 Rule

  • 68-95-99.7 Rule:
    • 68% of data within 1 SD of the mean
    • 95% within 2 SD
    • 99.7% within 3 SD
  • Application: Determining reference ranges for laboratory values
  • Example: If adult male height is normally distributed with mean 70 inches and SD 3 inches, what % are over 76 inches?
    • \(Z = \frac{76 - 70}{3} = 2\)
    • \(P(X > 76) = P(Z > 2) \approx 0.0228\) or 2.28%
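
The rule and the height example can be reproduced with the normal CDF in R. This is a brief sketch; `pnorm` is the base R normal CDF, and the variable names are illustrative.

```r
# Probability mass within 1, 2, and 3 SD of the mean (standard normal)
within_1sd <- pnorm(1) - pnorm(-1)   # about 0.683
within_2sd <- pnorm(2) - pnorm(-2)   # about 0.954
within_3sd <- pnorm(3) - pnorm(-3)   # about 0.997

# Height example: P(X > 76) for mean 70 and SD 3
p_tall <- pnorm(76, mean = 70, sd = 3, lower.tail = FALSE)
```

Note that the "95%" in the rule is a rounded value; exactly 95% of the mass lies within about 1.96 SD.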

Binomial Distribution

  • Properties:
    • Discrete distribution for the number of successes in a fixed number of independent trials
    • Two parameters: \(n\) (number of trials) and \(p\) (probability of success on each trial)
    • Mean: \(E(X)=np\)
    • Variance: \(Var(X) = np(1-p)\)
  • PMF: \[P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\]
  • Example: Within a sample of size \(n\), how many individuals get a certain condition within some time period?

Binomial Distribution: Example

  • Applications in Medicine:
    • Analyzing treatment efficacy in clinical trials
    • Studying genetic inheritance patterns
    • Evaluating diagnostic test accuracy
  • Example: If a drug has a 70% success rate, what’s the probability of it being effective in exactly 8 out of 10 patients?
    • \(n = 10\), \(p = 0.7\), \(k = 8\)
    • \(P(X=8) = \binom{10}{8} 0.7^8 (1-0.7)^{10-8} \approx 0.2335\)
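
The same calculation can be done in R, either from the formula or with the built-in binomial PMF `dbinom`. A minimal sketch:

```r
n <- 10    # number of patients
p <- 0.7   # success probability per patient
k <- 8     # number of successes of interest

# Directly from the PMF formula
p_manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# Using the built-in binomial PMF
p_builtin <- dbinom(k, size = n, prob = p)

# Probability of at least 8 successes, for comparison
p_at_least_8 <- sum(dbinom(8:10, size = n, prob = p))
```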

Visualizing Binomial Distribution


Poisson Distribution

  • Properties:
    • Discrete distribution for rare events
    • Single parameter \(\lambda\) (lambda)
    • \(E(X)=\lambda\) and \(Var(X)=\lambda\)
  • PMF: \[P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
  • Relevance: Modeling event counts in a fixed interval (e.g., mutations, new cases)

Poisson Distribution: Example

  • Applications in Medicine:
    • Modeling infection rates
    • Analyzing rare side effects
    • Simplest waiting/survival time model
  • Example: If the average number of heart attacks in a hospital is 2 per day, what’s the probability of seeing exactly 5 in a day?
    • \(\lambda = 2\), \(k = 5\)
    • \(P(X=5) = \frac{2^5 e^{-2}}{5!} \approx 0.0361\)
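
In R, the example can be evaluated with the Poisson PMF `dpois`; the tail probability via `ppois` is added here only as an illustrative extension.

```r
lambda <- 2   # average number of events per day
k <- 5

# Directly from the PMF formula
p_manual <- lambda^k * exp(-lambda) / factorial(k)

# Using the built-in Poisson PMF
p_builtin <- dpois(k, lambda = lambda)

# Probability of 5 or more events in a day: 1 - P(X <= 4)
p_5_or_more <- ppois(4, lambda = lambda, lower.tail = FALSE)
```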

Visualizing Poisson Distribution

Covariance and Correlation

  • Aim: We may consider several variables at once. Does information about one (“known”) variable give information about another (“unknown”) variable?
  • “Independence”: There is no information gain. Covariance and Correlation are measures for linear dependence
  • Covariance: Measure of joint variability between two random variables
    • \(Cov(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]\)
  • Correlation: Standardized covariance
    • \(Corr(X,Y) = \frac{Cov(X,Y)}{SD(X) \cdot SD(Y)}\)
  • Relevance: Assessing relationships between different medical parameters: e.g., systolic and diastolic blood pressure, dose-response
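
As a sketch of how covariance and correlation relate, the code below simulates two linearly related variables. The blood pressure values are simulated under assumed parameters (purely hypothetical, not real patient data).

```r
set.seed(7)
n <- 500

# Hypothetical systolic values and a linearly related diastolic value plus noise
systolic  <- rnorm(n, mean = 120, sd = 15)
diastolic <- 40 + 0.33 * systolic + rnorm(n, mean = 0, sd = 6)

cov_xy <- cov(systolic, diastolic)   # depends on the units of measurement
cor_xy <- cor(systolic, diastolic)   # unit-free, between -1 and 1

# Correlation is the covariance standardized by both standard deviations
cor_manual <- cov_xy / (sd(systolic) * sd(diastolic))
```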

Visualizing Correlation

Conditional Probability

  • Definition: Probability of an event given that another event has occurred
  • Notation: \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
  • Bayes’ Theorem: \[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}=\frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A)+P(B|\neg A) \cdot P(\neg A)}\]
  • Relevance: Fundamental for diagnostic testing and risk assessment

Bayes’ Theorem: Medical Test

Scenario: Test for “Condition X”

Given:

  • Prevalence: 1% (\(P(X) = 0.01\))
  • Sensitivity: 95% (\(P(T|X) = 0.95\))
  • Specificity: 90% (\(P(\neg T|\neg X) = 0.90\), so the false-positive rate is \(P(T|\neg X) = 0.10\))

Bayes’ Theorem: \[P(X|T) = \frac{P(T|X) \cdot P(X)}{P(T|X) \cdot P(X) + P(T|\neg X) \cdot P(\neg X)}\]

Calculation: \[\begin{aligned} P(X|T) &= \frac{0.95 \cdot 0.01}{(0.95 \cdot 0.01) + (0.10 \cdot 0.99)} \\ &\approx 0.0876 \end{aligned}\]

Interpretation: A positive test result indicates only an 8.76% chance of having Condition X.
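
The calculation can be reproduced in a few lines of R. This is a sketch with illustrative variable names; `ppv` stands for the positive predictive value \(P(X|T)\).

```r
prevalence  <- 0.01   # P(X)
sensitivity <- 0.95   # P(T | X)
specificity <- 0.90   # P(not T | not X)

# Total probability of a positive test
p_positive <- sensitivity * prevalence +
  (1 - specificity) * (1 - prevalence)

# Bayes' theorem: probability of the condition given a positive test
ppv <- sensitivity * prevalence / p_positive
```

Changing `prevalence` shows how strongly the result depends on how common the condition is.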

Independence of Events

  • Definition: Events \(A\) and \(B\) are independent if \(P(A|B) = P(A)\)
  • Properties:
    • If \(A\) and \(B\) are independent, then \[P(A \cap B) = P(A) \cdot P(B)\]
    • For independent random variables \(X\) and \(Y\), correspondingly \[E(X\cdot Y) = E(X) \cdot E(Y)\]
  • For dependent events or variables, calculations are different
    • \(P(A \cap B) = P(A) \cdot P(B|A)\)
    • calculation of the expectation \(E(X\cdot Y)\) needs the joint PDF
  • Many statistical methods (tests etc.) assume data from independent observations

Law of Large Numbers

  • Statement: For sequences of independent identically distributed random variables \(X_1, X_2,\dots, X_n\): As sample size \(n\) increases, the sample average converges to the expected value
  • Cumulative mean after \(n\) observations \(x_1,x_2,\dots,x_n\): \(\bar{x}_n=\frac{1}{n}\sum_{i=1}^n x_i\), a realization of \(\bar{X}_n=\frac{1}{n}\sum_{i=1}^n X_i\)
  • Weak Law: \(P(|\bar{X}_n - \mu| > \epsilon) \to 0\) as \(n \to \infty\)
  • Relevance: Justifies using sample statistics to estimate population parameters
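
A short simulation makes the statement tangible. The sketch below tracks the cumulative mean of simulated fair-die rolls (the number of rolls and the seed are arbitrary assumptions); it should settle near the expected value 3.5.

```r
set.seed(123)
n_rolls <- 10000
rolls <- sample(1:6, size = n_rolls, replace = TRUE)

# Cumulative mean after 1, 2, ..., n_rolls observations
cum_mean <- cumsum(rolls) / seq_along(rolls)

# Snapshots at increasing sample sizes
snapshots <- cum_mean[c(10, 100, 1000, 10000)]
final_mean <- cum_mean[n_rolls]
```

Plotting `cum_mean` against the roll index (e.g. with `plot(cum_mean, type = "l")`) shows the fluctuations shrinking as \(n\) grows.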

Visualizing Central Limit Theorem

Central Limit Theorem

  • Statement: The suitably standardized sum (or average) of a large number of independent, identically distributed random variables with finite variance approaches a normal distribution
  • Formal: \[\frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0,1) \text{, as } n \to \infty\]
  • Relevance: Often enables the use of normal distribution-based methods when sample size is large
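
The formal statement can be explored by simulation. The sketch below assumes exponentially distributed variables with rate 1 (so \(\mu = 1\) and \(\sigma = 1\)), a clearly non-normal choice made only for illustration, and standardizes their sums as in the formula above.

```r
set.seed(2024)
n    <- 50     # number of variables per sum
reps <- 5000   # number of simulated sums
mu    <- 1     # mean of Exp(1)
sigma <- 1     # standard deviation of Exp(1)

# Simulate sums of n i.i.d. Exp(1) variables
sums <- replicate(reps, sum(rexp(n, rate = 1)))

# Standardize: (sum - n * mu) / (sigma * sqrt(n))
z <- (sums - n * mu) / (sigma * sqrt(n))

z_mean <- mean(z)                     # close to 0
z_sd   <- sd(z)                       # close to 1
share_within <- mean(abs(z) <= 1.96)  # close to 0.95 if approximately normal
```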

Probability Theory and Statistics

  • Probability theory (model): random variables \(X\); probability distributions; expectation, variance
  • Observed world (data): data \(x\); empirical distribution; statistics (mean, sample variance)
  • Links in the diagram: “Realize” and “Statistical theory”

Statistics: The Inverse Problem

  • Goal of Statistics:
    • Characterize probabilistic models that could explain observed data
    • “Work backwards” from data to potential models
  • Process:
    1. Collect and analyze data
    2. Infer characteristics of the underlying probabilistic model
    3. Make predictions or draw conclusions based on this inferred model
  • Example in medical research:
    • Observed: Treatment outcomes in a clinical trial
    • Goal: Infer the probability distribution of treatment effectiveness in the population

Statistics as Random Variables

  • Data: Realizations of random variables
  • Statistics: Functions of data, therefore also random variables
    • Sample mean: \(\bar{x}=\frac{1}{n}\sum_i x_i\)
    • Mean as a function of random variables: \[\bar{X}=\frac{1}{n}\sum_i X_i\]
  • Implications:
    • Each sample yields a different value for a statistic

Sample/Sampling Distribution

  • Sample Distribution:
    • Distribution of data points in a single sample
    • Describes the spread of observed values
  • Sampling Distribution:
    • Distribution of a statistic as a function of random variables
    • alternatively: Empirical distribution of a statistic over many samples
    • Describes variability of the statistic itself
    • Crucial for inferential statistics
  • Example:
    • Sample Distribution: Heights of 100 patients
    • Sampling Distribution: Means of height from 50 samples of 100 patients

Sampling Distribution: Mean

  • Consider \(n\) independent and identically distributed (i.i.d.) random variables \(X_1, X_2, \ldots, X_n\), following a normal distribution, \(X_i \sim N(\mu, \sigma^2)\).

  • The sample mean \(\bar{X}\) is: \(\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\)

  • Expected Value: \[E[\bar{X}] = \mu\]

  • Variance: \[Var(\bar{X}) = \frac{\sigma^2}{n}\]

  • Distribution: Sampling distribution of the mean follows a normal distribution:

    \[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\]

  • If we observe data \(x_i\), which values of \(\mu\) and \(\sigma\) make the model fit the data?

Resampling

  • Definition: Resampling involves repeatedly drawing samples from an original dataset
  • Purpose: To estimate the variability of a statistic or to approximate its sampling distribution
  • Bootstrap: Sampling with replacement from the original dataset
  • Advantages:
    • Non-parametric: Doesn’t assume a specific underlying distribution
    • Flexible: Applicable to various statistics (mean, median, correlation, etc.)
    • Robust: Often usable for complex data structures; results from very small samples should be interpreted with caution
    • Provides empirical insight into sampling distributions
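
A minimal bootstrap sketch for the mean is shown below. The data are simulated (hypothetical blood pressure values), and the number of resamples `B` is an arbitrary assumption; the bootstrap standard error is compared with the usual formula \(s/\sqrt{n}\).

```r
set.seed(99)
# Hypothetical sample of 40 measurements
x <- rnorm(40, mean = 120, sd = 15)

B <- 2000   # number of bootstrap resamples
# Resample with replacement from x and compute the mean each time
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

se_boot    <- sd(boot_means)            # bootstrap estimate of the standard error
se_formula <- sd(x) / sqrt(length(x))   # classical standard error of the mean

# Simple percentile interval from the bootstrap distribution
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))
```

The same pattern works for other statistics by replacing `mean` with, for example, `median`.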

Resampling

Standard Error

  • Definition: Standard deviation of a sampling distribution
  • For mean: \(SE = \frac{s}{\sqrt{n}}\)
    • s: sample standard deviation
    • n: sample size
  • Significance:
    1. Measures precision of a sample statistic
    2. Used in calculating confidence intervals
    3. Crucial for hypothesis testing
  • Example: Interpreting the standard error of the mean and confidence intervals in a clinical trial
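
As a sketch of such an example, the code below computes the standard error of the mean and a 95% confidence interval based on the t distribution. The ten values are made up for illustration and do not come from a real trial.

```r
# Hypothetical systolic blood pressure readings (mmHg) from 10 participants
x <- c(128, 135, 121, 142, 130, 125, 138, 133, 127, 131)
n <- length(x)

x_bar <- mean(x)           # sample mean
se    <- sd(x) / sqrt(n)   # standard error of the mean

# 95% confidence interval using the t distribution with n - 1 degrees of freedom
ci <- x_bar + c(-1, 1) * qt(0.975, df = n - 1) * se

# Cross-check against the interval reported by t.test()
ci_ttest <- as.numeric(t.test(x)$conf.int)
```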

Overview of Statistical Approaches

  1. Descriptive Statistics:
    • Summarize and describe data characteristics
    • Examples: Mean, median, standard deviation, graphs
  2. Inferential Statistics:
    • Make predictions or inferences about a population
    • Examples: Hypothesis tests, confidence intervals
  3. Exploratory Data Analysis:
    • Uncover patterns and relationships in data
    • Examples: Data visualization, correlation analysis
  4. Predictive Statistics:
    • Develop models to predict future outcomes
    • Examples: Regression analysis, machine learning models

Introduction to Study Design

  • Study design is the framework that guides data collection and analysis in research, ensuring that the study answers the research question in a valid and reliable manner.
  • Purpose: To minimize bias, maximize accuracy, and ensure generalizability.
  • Categories: Observational and Experimental studies.
  • Note: Study design refers exclusively to new data collection, excluding any analysis of existing data.

Population and Sample

  • Population: Entire group about which we want to draw conclusions

    • The whole population of a state
    • All patients with Type 2 Diabetes in the US
    • All erythrocytes in a human body
  • Hard to analyze whole populations (costs, practicality)

  • Sample: Subset of the population used to make inferences

    • 500 persons, participating in a poll
    • 1000 Type 2 Diabetes patients in a clinical trial
    • 10 ml blood sample, taken from a specific arm vein

Sample Size and Power Considerations

  • Sample Size: Number of participants needed to detect an effect, influenced by effect size, variability, significance level, and power.

  • Power: Probability of correctly rejecting the null hypothesis when it is false.

  • Importance: Adequate sample size and power are critical for the validity of the study results.
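
The interplay of effect size, variability, significance level, and power can be sketched with `power.t.test` from base R's `stats` package. The numbers (a difference of 5 units, SD of 10, two-sided two-sample t-test) are assumptions chosen for illustration.

```r
# Required sample size per group to detect a mean difference of 5 (SD = 10)
# with 80% power at a 5% significance level, two-sample two-sided t-test
res <- power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(res$n)

# Power actually achieved with the rounded-up sample size
achieved <- power.t.test(n = n_per_group, delta = 5, sd = 10,
                         sig.level = 0.05)$power

# Halving the effect size requires a much larger sample
n_small_effect <- ceiling(power.t.test(delta = 2.5, sd = 10,
                                       sig.level = 0.05, power = 0.80)$n)
```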

Validity

  • Internal Validity: The extent to which the study accurately reflects reality within the study context
  • External Validity: The degree to which the study’s results can be generalized to other settings, populations, or times.

Example: Validity

  • Study Aim: Assess effectiveness of a new hypertension medication
  • Measurement: Office blood pressure readings
  • Validity Concern: White coat hypertension
    • Office readings may not accurately represent true blood pressure
    • Could lead to overestimation of hypertension prevalence
  • Improving Validity: Use 24-hour ambulatory blood pressure monitoring
    • Provides a more accurate representation of true blood pressure
    • Reduces impact of white coat effect (Pickering et al., 2005)

Reliability

  • Reliability: The consistency of study results across different instances or repetitions under the same conditions.
  • Importance: Ensures that findings are dependable and can be replicated in similar settings.

Example: Reliability

Example: Depression Scale Assessment

  • Study Aim: Evaluate efficacy of a new antidepressant
  • Measurement: Hamilton Depression Rating Scale (HAM-D)
  • Reliability Concern: Inter-rater variability
    • Different clinicians may score the same patient differently
    • Could lead to inconsistent results across study sites
  • Improving Reliability: Standardized training for raters
    • Implement rigorous training program for all clinicians
    • Conduct regular inter-rater reliability checks (Müller & Szegedi, 2002)

Study Designs: An Overview

  • Observational Studies: No intervention; the researcher observes and measures outcomes.
  • Experimental Studies: The researcher manipulates variables to observe effects.
  • Key Differences:
    • Causal Inference: Experimental studies provide stronger evidence.
    • Ethical Considerations: Observational studies are often used where experimentation would be unethical or even impossible.

Observational Studies

  • Cohort Studies: Follows groups over time based on exposure.
  • Case-Control Studies: Compares those with and without a condition retrospectively.
  • Cross-Sectional Studies: Analyzes data from a population at a single point in time.
  • Key problems:
    • Confounding Factors: Hidden variables that could influence the results.
    • Bias: Potential sources of systematic errors, like selection and information biases.

Main Study Designs: Evidence

from: Somerville et al. (2016), Public Health and Epidemiology at a Glance

Cohort Studies

  • Design: Groups based on exposure status, followed over time.
  • Prospective vs. Retrospective:
    • Prospective Cohort: Follows participants forward in time.
    • Retrospective Cohort: Uses existing data to track participants backward.
  • Advantages: Establishes temporal sequence, keeps track of individuals
  • Disadvantages: Time-consuming, expensive
  • Incidence Rate: \[ \text{Incidence Rate} = \frac{\text{Number of new cases}}{\text{Total person-time at risk}} \]

Retrospective Cohort Studies

Prospective Cohort Studies

Case-Control Studies

  • Design: Compares participants with a specific condition to those without.
  • Key Feature: Data collected retrospectively to identify risk factors.
  • Advantages: Efficient for rare diseases, quicker, and less costly.
  • Disadvantages: Prone to recall bias, cannot directly measure incidence.
  • Formula: \[ \text{Odds Ratio (OR)} = \frac{\text{Odds of exposure among cases}}{\text{Odds of exposure among controls}} \]

Case-Control Studies

Cross-Sectional Studies

  • Design: Data collection at a single point in time to examine prevalence.
  • Purpose: Assesses the burden of a disease or health condition.
  • Advantages: Quick, cost-effective, useful for public health surveillance.
  • Disadvantages: Cannot determine causality, susceptible to bias.

Cross-Sectional Studies

Confounding

  • Definition: Confounding occurs when the relationship between an exposure and an outcome is distorted by the presence of another variable (confounder). This leads to a biased estimate of the association.

  • Criteria for a Confounder:

    1. Associated with the exposure in the source population
    2. Causally related to the outcome
    3. Not in the causal pathway between exposure and outcome
  • Impact: Can create a spurious association or mask a true relationship

  • Addressing Confounding:

    • Study design: Randomization, matching
    • Analysis: Account for the confounder (regression, stratification, …)
  • Note: subject-matter knowledge is needed

Example: Coffee and Lung Cancer

  • Historical Context: 1960s studies showed higher lung cancer rates among coffee drinkers

  • Observed Association:

    • Higher coffee consumption → Higher lung cancer risk
  • Confounder: Smoking status

    • Smokers tend to drink more coffee
    • Smoking is a strong risk factor for lung cancer
  • True Relationships:

    1. Smoking → Increased coffee consumption
    2. Smoking → Increased lung cancer risk
  • Result: Smoking created a spurious association between coffee and lung cancer

  • Resolution: When smoking was accounted for, no significant association between coffee and lung cancer remained (Guertin et al., 2016)
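
The mechanism can be illustrated with a small stratified calculation. The counts below are entirely hypothetical and constructed so that coffee has no effect within either smoking stratum; they are not taken from any of the studies mentioned.

```r
# Hypothetical counts: group sizes and lung cancer cases by smoking and coffee
d <- data.frame(
  smoker = c(TRUE, TRUE, FALSE, FALSE),
  coffee = c(TRUE, FALSE, TRUE, FALSE),
  n      = c(400, 100, 100, 400),
  cases  = c(40, 10, 1, 4)
)

# Risk of lung cancer in the rows selected by a logical vector
risk <- function(rows) sum(d$cases[rows]) / sum(d$n[rows])

# Crude relative risk, ignoring smoking
rr_crude <- risk(d$coffee) / risk(!d$coffee)

# Stratum-specific relative risks, i.e. after accounting for smoking
rr_smokers    <- risk(d$coffee & d$smoker)  / risk(!d$coffee & d$smoker)
rr_nonsmokers <- risk(d$coffee & !d$smoker) / risk(!d$coffee & !d$smoker)
```

In this constructed example the crude relative risk is close to 3, although the relative risk is 1 within both strata.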

Bias in Observational Studies

  • Selection Bias
    • Definition: Study population doesn’t represent target population
    • Example: Heart disease study recruiting only from cardiology clinic
    • Note: Impact depends on study design and selection mechanism:
      • Case-control: Can affect exposure-outcome association
      • Cohort: May influence incidence rates or risk estimates (Hernán et al., 2004)
  • Information Bias
    • Definition: Systematic errors in measuring variables
    • Example: Recall bias in retrospective diet and cancer studies
    • Note: Can lead to over- or underestimation (Rothman et al., 2008)

Experimental Studies Overview

  • Design: Involves manipulation of an independent variable to observe effects.
  • Randomized Controlled Trials: Participants randomly assigned to groups.
  • Randomization reduces confounding by distributing confounding variables equally (on average) across groups.
  • Blinding: Prevents bias by keeping participants and/or researchers unaware of group assignments.
  • Control Groups: Provide a baseline for comparison with the treatment group.
  • Advantages: Strong causal evidence, minimizes confounding.
  • Disadvantages: Expensive, time-consuming, ethical constraints.
  • Relative Risk: \[ \text{Relative Risk (RR)} = \frac{\text{Risk in treatment group}}{\text{Risk in control group}} \]

Blinding Techniques in RCTs

  • Single-Blind: Participants unaware of group assignment.
  • Double-Blind: Both participants and researchers unaware.
  • Triple-Blind: Participants, researchers, and data analysts unaware.
  • Purpose:
    • Reduces performance bias (systematic differences in exact treatment between groups)
    • Reduces detection biases (systematic differences in measurement between groups).

2x2 Contingency Table: Experiment

  • A 2x2 contingency table is a cross-tabulation that displays frequencies in relation to two categorical variables (factors), each with two levels (without marginal distributions and totals).

  • Experiment: Variable 1 is considered the independent variable, whose influence on Variable 2 is to be examined. In a controlled experiment:

    • The first category of Variable 1 serves as the control group (placebo, or current treatment), the second category defines the experimental group.
    • The first category of Variable 2 indicates “successes” in a controlled study. Depending on the subject of investigation, successes can also have a negative connotation (e.g., deaths).
  • Note that groups have been selected (randomized) w.r.t. the first variable!

Relative Risk

                      Success   No Success
Control Group         a         b
Experimental Group    c         d

  • Estimated probabilities for success:
    • Control Group: \(P_C=\frac{a}{a+b}\)
    • Experimental Group: \(P_E=\frac{c}{c+d}\)
  • The relative risk indicates how many times greater (or smaller) the probability of success is in the experimental group than in the control group: \[R_R=\frac{P_E}{P_C}\]
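
As a sketch, the relative risk can be computed from a 2x2 table in R. The counts are hypothetical and only serve to illustrate the formula.

```r
# Hypothetical 2x2 table: rows = groups, columns = outcome
tab <- matrix(c(30, 70,
                15, 85),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("Control", "Experimental"),
                              outcome = c("Success", "No success")))

# Estimated success probabilities per group: a / (a + b) and c / (c + d)
p_c <- tab["Control", "Success"] / sum(tab["Control", ])
p_e <- tab["Experimental", "Success"] / sum(tab["Experimental", ])

# Relative risk: success probability in the experimental vs. control group
rr <- p_e / p_c   # 0.5
```

With these counts, the "success" (which may be an adverse event) is half as likely in the experimental group.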

2x2 Contingency Table: Case-Control Study

  • A 2x2 contingency table is a cross-tabulation that displays frequencies in relation to two categorical variables (factors), each with two levels (without marginal distributions and totals). If data on individual observations is given:

Case-Control: Variable 1 is considered the independent variable, whose influence on Variable 2 is to be examined. In a case-control study:

  • The first category of Variable 2 indicates “cases”, while the second indicates “non-cases”. The first category of Variable 1 refers to “no risk”, the second to “risk”.

  • Note that groups have been selected (not randomized) w.r.t. the second variable, i.e., by case status!

Odds Ratio

             case   no case
no risk      a      b
risk         c      d

The odds indicate how many times more likely a case is compared to a non-case within each group:

  • no risk: \(O_C=\frac{a}{b}\)
  • risk: \(O_E=\frac{c}{d}\)

The ratio of the odds is called the odds ratio and indicates how many times higher (>1) or lower (<1) the odds are in the risk group compared to the no-risk group. \[ OR=\frac{O_E}{O_C} \]
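
The odds ratio can be sketched in R as follows. The counts are hypothetical. The code also computes the ratio of exposure odds among cases and controls, which is the form used in the case-control formula earlier and gives the same number.

```r
# Hypothetical 2x2 table: rows = exposure, columns = case status
tab <- matrix(c(20, 80,
                40, 60),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("no risk", "risk"),
                              status   = c("case", "no case")))

# Odds of being a case within each exposure group: a / b and c / d
o_c <- tab["no risk", "case"] / tab["no risk", "no case"]
o_e <- tab["risk", "case"] / tab["risk", "no case"]
or  <- o_e / o_c

# Equivalent: odds of exposure among cases vs. among controls
odds_exp_cases    <- tab["risk", "case"] / tab["no risk", "case"]
odds_exp_controls <- tab["risk", "no case"] / tab["no risk", "no case"]
or_exposure <- odds_exp_cases / odds_exp_controls
```

Because both versions reduce to \(\frac{c \cdot b}{a \cdot d}\), the odds ratio can be estimated in a case-control study even though incidence cannot.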

Decision Tree

Advantages

Ethical Considerations in Study Design

  • Key Principles:
    • Respect for Persons: Acknowledging autonomy and protecting those with diminished autonomy.
    • Beneficence: Obligation to minimize harm and maximize benefits.
    • Justice: Fair distribution of research benefits and burdens.
  • Informed Consent: Full disclosure of study purpose, procedures, risks, and benefits.
  • Institutional Review Boards (IRBs): Review research proposals for ethical compliance.