---
title: "EPI 553 - Lecture 1: Review of Biostatistical Foundations"
author: "Muntasir Masum"
date: "January 27, 2026"
format:
  html:
    toc: true
    toc-depth: 3
    toc-location: left
    number-sections: true
    theme: readable
    highlight-style: tango
    code-fold: show
    code-tools: true
lang: en
css:
  - style.css
  - accessibility.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.width = 8,
fig.height = 6,
fig.alt = "Statistical visualization" # Placeholder - set specific alt text per chunk
)
```
```{css, echo=FALSE}
/* Accessibility-enhanced CSS for biostatistics lecture */
/* Base typography with sufficient contrast and sizing */
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
line-height: 1.6;
font-size: 16px;
color: #1a1a1a; /* Dark text for 7:1 contrast with white background */
background-color: #ffffff;
}
/* Ensure proper heading hierarchy and contrast */
h1 {
color: #000000;
font-size: 28px;
border-bottom: 3px solid #1a1a1a;
padding-bottom: 10px;
}
h2 {
color: #1a1a1a;
border-left: 4px solid #003d99; /* High contrast blue */
padding-left: 10px;
font-size: 22px;
margin-top: 1.2em;
margin-bottom: 0.8em;
}
h3 {
color: #1a1a1a;
font-size: 18px;
margin-top: 1em;
margin-bottom: 0.6em;
}
/* Code and pre-formatted text */
code {
background-color: #f4f4f4;
padding: 2px 6px;
border-radius: 3px;
font-family: 'Courier New', monospace;
font-size: 15px;
color: #000000;
}
pre {
background-color: #f8f8f8;
border: 1px solid #333333;
border-radius: 4px;
padding: 12px;
overflow-x: auto;
line-height: 1.5;
}
/* Mathematical equations */
.math {
color: #000000;
font-family: 'Times New Roman', serif;
}
/* Table accessibility */
table {
border-collapse: collapse;
width: 100%;
margin: 15px 0;
border: 1px solid #333333;
}
th, td {
border: 1px solid #333333;
padding: 12px;
text-align: left;
}
th {
background-color: #1a1a1a;
color: #ffffff;
font-weight: bold;
}
tr:nth-child(even) {
background-color: #f9f9f9;
}
tr:nth-child(odd) {
background-color: #ffffff;
}
/* List styling for readability */
ul, ol {
margin: 10px 0;
padding-left: 25px;
line-height: 1.8;
}
li {
margin-bottom: 8px;
}
/* Focus styles for keyboard navigation */
a:focus, button:focus {
outline: 2px solid #003d99;
outline-offset: 2px;
}
/* Blockquote styling */
blockquote {
border-left: 4px solid #003d99;
padding-left: 15px;
margin-left: 0;
color: #1a1a1a;
font-style: italic;
}
/* Emphasis styling */
strong, b {
font-weight: bold;
}
em, i {
font-style: italic;
}
```
# Introduction
This lecture reviews the foundational biostatistical concepts covered in HSTA/HEPI 552. These concepts are essential for understanding the advanced statistical modeling techniques we'll explore in EPI 553.
## Learning Objectives
By the end of this review, you should be able to:
- Understand different types of variables and their distributions
- Apply probability theory concepts to statistical inference
- Conduct and interpret various hypothesis tests
- Work with common probability distributions in R
---
# Modeling and Variable Types
## What is Statistical Modeling?
Most research aims to assess relationships among a set of variables. The choice of analysis depends on several factors:
- Research question
- Study design
- Mathematical characteristics of variables collected
- Distribution assumptions about these variables
- Sampling scheme used
## Types of Random Variables
A **random variable** is a variable whose observed values are possible outcomes of a random experiment.
### Role of Variables
Variables can serve different roles depending on your research question:
- **Independent variables** (predictors, covariates, explanatory variables): describe or predict other variables
- **Dependent variables** (response, outcome): are described or predicted by other variables
**Important note:** A variable can be a predictor in one study and a response in another. For example:
- Stroke predicted by systolic blood pressure (SBP)
- SBP predicted by age
### Categorical Variables
**Nominal variables** have categories without a natural hierarchy:
- Gender
- Ethnic identity
- Cancer type
**Ordinal variables** have categories with a natural ordering:
- Level of pain (mild, moderate, severe)
- Response frequency (never, sometimes, always)
**Example in R:**
```{r categorical-example, fig.alt="R output showing factor creation for nominal and ordinal variables"}
# Creating a nominal variable
gender <- factor(c("Male", "Female", "Female", "Male", "Male"))
print(gender)
# Creating an ordinal variable
pain_level <- factor(c("Mild", "Severe", "Moderate", "Mild", "Severe"),
levels = c("Mild", "Moderate", "Severe"),
ordered = TRUE)
print(pain_level)
```
### Quantitative Variables
**Discrete variables** take on only countable distinct values:
- Number of children in a family
- Number of cancers in a population
**Continuous variables** can take any value within an interval (uncountably many possible values):
- Weight, height, time
- Can be **interval** (no true zero) or **ratio** (has true zero)
**Example in R:**
```{r quantitative-example, fig.alt="Histograms showing distributions of discrete and continuous variables"}
# Discrete variable
num_children <- c(0, 2, 1, 3, 2, 1, 0, 4)
hist(num_children,
main = "Distribution of Number of Children",
xlab = "Number of Children",
col = "lightblue",
breaks = seq(-0.5, 4.5, 1))
# Continuous variable
weights <- rnorm(100, mean = 70, sd = 10)
hist(weights,
main = "Distribution of Weights (kg)",
xlab = "Weight (kg)",
col = "lightgreen",
breaks = 20)
```
---
# Probability Distributions
## What is a Probability Distribution?
The **probability distribution** of a random variable gives the relative frequencies associated with all possible values in a population.
Probability distributions are mathematical functions that provide the probability corresponding to each value or range of values taken by a random variable.
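For a concrete (hypothetical) illustration, the empirical distribution of a discrete variable can be approximated in R by tabulating relative frequencies; the data below are made up for demonstration:

```{r empirical-distribution, fig.alt="Bar plot of empirical relative frequencies for number of emergency department visits"}
# Hypothetical counts of emergency department visits per patient
er_visits <- c(0, 1, 0, 2, 1, 0, 3, 1, 0, 2, 0, 1)
# Relative frequencies approximate the probability distribution
rel_freq <- table(er_visits) / length(er_visits)
print(rel_freq)
barplot(rel_freq,
        main = "Empirical Distribution of ED Visits",
        xlab = "Number of Visits",
        ylab = "Relative Frequency",
        col = "gray70")
```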
## Common Discrete Distributions
### Binomial Distribution
A binomial random variable describes the **number of occurrences** of an event in a series of *n* trials.
**Parameters:**
- n: fixed number of trials
- p: probability of success on each trial
**Properties:**
- Each trial outcome is independent
- Dichotomous outcome (success/failure)
- Probability of success remains constant across trials
**Notation:** If X follows a binomial distribution with parameters n and p, the probability mass function is:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \ldots, n\}$$
**R Example:**
```{r binomial-distribution, fig.alt="Bar plot showing binomial probability distribution with n=10, p=0.3"}
# Binomial distribution with n=10, p=0.3
n <- 10
p <- 0.3
x <- 0:10
# Probability mass function
prob <- dbinom(x, size = n, prob = p)
# Visualize
barplot(prob,
names.arg = x,
main = "Binomial Distribution (n=10, p=0.3)",
xlab = "Number of Successes",
ylab = "Probability",
col = "steelblue")
# Example: Probability of exactly 3 successes
dbinom(3, size = 10, prob = 0.3)
# Cumulative probability: P(X <= 3)
pbinom(3, size = 10, prob = 0.3)
# Generate random sample
set.seed(123)
random_sample <- rbinom(1000, size = 10, prob = 0.3)
hist(random_sample,
main = "Random Sample from Binomial(10, 0.3)",
xlab = "Number of Successes",
col = "coral",
breaks = seq(-0.5, 10.5, 1))
```
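To connect the probability mass function above to `dbinom()`, the same probability can be computed by hand with `choose()`; the two approaches should agree:

```{r binomial-pmf-check}
# Manual PMF calculation: P(X = 3) when n = 10 and p = 0.3
choose(10, 3) * 0.3^3 * (1 - 0.3)^(10 - 3)
# Built-in function for comparison
dbinom(3, size = 10, prob = 0.3)
```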
### Poisson Distribution
A Poisson random variable describes the **number of events during a fixed interval** of time.
**Parameter:**
- λ (lambda): expected number of events (λ > 0)
**Properties:**
- Occurrences of events are independent
- Probability of a single event is proportional to interval length
- Two events cannot occur at exactly the same time
**Notation:** If X follows a Poisson distribution with parameter λ, the probability mass function is:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \{0, 1, 2, \ldots\}$$
**R Example:**
```{r poisson-distribution, fig.alt="Bar plot showing Poisson probability distribution with lambda=3"}
# Poisson distribution with lambda = 3
lambda <- 3
x <- 0:10
# Probability mass function
prob <- dpois(x, lambda = lambda)
# Visualize
barplot(prob,
names.arg = x,
main = "Poisson Distribution (λ=3)",
xlab = "Number of Events",
ylab = "Probability",
col = "darkgreen")
# Example: Number of hospital admissions per hour
# If λ = 5 admissions per hour
lambda_admissions <- 5
# Probability of exactly 3 admissions in an hour
dpois(3, lambda = lambda_admissions)
# Probability of 5 or fewer admissions
ppois(5, lambda = lambda_admissions)
# Generate random sample
set.seed(456)
random_admissions <- rpois(1000, lambda = lambda_admissions)
hist(random_admissions,
main = "Random Sample: Hospital Admissions per Hour",
xlab = "Number of Admissions",
col = "purple",
breaks = seq(-0.5, max(random_admissions) + 0.5, 1))
```
## Common Continuous Distributions
### Normal Distribution
The normal distribution is the most common continuous probability distribution.
**Parameters:**
- μ (mu): mean
- σ² (sigma squared): variance
**Properties:**
- Bell-shaped
- Symmetric around the mean
- Completely determined by mean and variance
**Notation:** If X follows a normal distribution with mean μ and variance σ², the probability density function is:
$$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$
**R Functions for Normal Distribution:**
```{r normal-distribution, fig.alt="Line plot showing the standard normal distribution density curve"}
# Standard normal distribution (μ=0, σ=1)
x <- seq(-4, 4, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y,
type = "l",
lwd = 2,
col = "blue",
main = "Standard Normal Distribution",
xlab = "x",
ylab = "Density")
abline(v = 0, lty = 2, col = "red")
# Density at specific point
dnorm(0, mean = 0, sd = 1)
# Cumulative probability: P(Z < 1.96)
pnorm(1.96, mean = 0, sd = 1)
# Find the value z where P(Z < z) = 0.975 (97.5th percentile)
qnorm(0.975, mean = 0, sd = 1)
# Generate random sample
set.seed(789)
random_normal <- rnorm(1000, mean = 100, sd = 15)
hist(random_normal,
probability = TRUE,
main = "Random Sample from N(100, 15²)",
xlab = "Value",
col = "skyblue",
breaks = 30)
curve(dnorm(x, mean = 100, sd = 15),
add = TRUE,
col = "red",
lwd = 2)
```
### Central Limit Theorem
The normal distribution is fundamental because of the **Central Limit Theorem**:
> The sampling distribution of the mean of independent observations from a random variable with finite mean and variance approaches a normal distribution as the sample size increases.

This means that for large samples the sample mean is approximately normally distributed regardless of the shape of the parent distribution, which is why the normal distribution is so central to statistical inference.
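A quick simulation sketch (with an arbitrarily chosen exponential parent distribution and sample size) makes the theorem concrete: even though individual observations are strongly right-skewed, the distribution of their sample means looks approximately normal:

```{r clt-simulation, fig.alt="Histogram of 1000 sample means from an exponential distribution with the normal curve implied by the Central Limit Theorem overlaid"}
set.seed(2026)
# 1000 sample means, each based on n = 50 observations from Exponential(rate = 1)
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))
hist(sample_means,
     probability = TRUE,
     breaks = 30,
     main = "Sampling Distribution of the Mean (Exponential Parent, n = 50)",
     xlab = "Sample Mean",
     col = "lightyellow")
# Normal approximation implied by the CLT: mean = 1, sd = 1 / sqrt(50)
curve(dnorm(x, mean = 1, sd = 1 / sqrt(50)), add = TRUE, col = "red", lwd = 2)
```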
---
# Hypothesis Testing Framework {.section}
## Structure of Hypothesis Tests
All hypothesis tests follow the same framework (a worked example follows these steps):
1. **State hypotheses:**
- Null hypothesis (H₀): No effect or difference
- Alternative hypothesis (H₁): Effect or difference exists
2. **Choose significance level (α):** Usually 0.05
3. **Calculate test statistic:** Depends on the type of test
4. **Find the critical value or p-value:** Using the appropriate distribution
5. **Make a decision:** Reject or fail to reject H₀
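As a minimal sketch of these five steps, consider testing H₀: μ = 50 against H₁: μ ≠ 50 when the population standard deviation is assumed known (a one-sample z-test); the data and the value σ = 8 are simulated here purely for illustration:

```{r hypothesis-framework-example}
# Step 1: H0: mu = 50 vs H1: mu != 50 (two-sided)
# Step 2: significance level
alpha <- 0.05
# Simulated data with an assumed known population sd of 8
set.seed(111)
x <- rnorm(40, mean = 53, sd = 8)
# Step 3: test statistic (z, because sigma is treated as known)
z_stat <- (mean(x) - 50) / (8 / sqrt(length(x)))
z_stat
# Step 4: two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))
p_value
# Step 5: decision (TRUE means reject H0 at level alpha)
p_value < alpha
```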
## Important Definitions
- **Type I error (α):** Reject H₀ when it is true (false positive)
- **Type II error (β):** Fail to reject H₀ when it is false (false negative)
- **Power (1 - β):** Probability of correctly rejecting H₀ when it is false
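These quantities can be explored directly in R. For example, `power.t.test()` from the base stats package relates sample size, effect size, and power for t-tests; the numbers below are arbitrary illustrative values:

```{r power-example}
# Power of a two-sample t-test: n = 50 per group, true difference 5, sd 15, alpha 0.05
power.t.test(n = 50, delta = 5, sd = 15, sig.level = 0.05, type = "two.sample")
# Sample size per group needed for 80% power to detect the same difference
power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.80, type = "two.sample")
```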
---
## One-Sample t-Test
Tests whether a sample mean differs from a hypothesized population mean.
**Hypotheses:**
- H₀: μ = μ₀
- H₁: μ ≠ μ₀ (two-tailed test)
**Assumptions:**
- Sample is random
- Population is approximately normally distributed (or the sample size is large enough for the Central Limit Theorem to apply)
**Test statistic:**
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where:
- $\bar{x}$ is the sample mean
- s is the sample standard deviation
- n is the sample size
- This follows a t-distribution with n - 1 degrees of freedom
**R Example:**
```{r one-sample-t-test, fig.alt="Histogram showing distribution of systolic blood pressure with mean marked"}
# Example: Is average systolic BP different from 130?
set.seed(101)
sbp <- rnorm(50, mean = 125, sd = 15)
# Conduct one-sample t-test
t_test_result <- t.test(sbp, mu = 130)
print(t_test_result)
# Visualize
hist(sbp,
main = "Systolic Blood Pressure",
xlab = "SBP (mmHg)",
col = "lightblue",
breaks = 15)
abline(v = mean(sbp), col = "red", lwd = 2, lty = 2)
abline(v = 130, col = "green", lwd = 2, lty = 3)
legend("topright",
       legend = c("Sample mean", "Hypothesized mean"),
       col = c("red", "green"),
       lty = c(2, 3),
       lwd = 2)
```
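The statistic reported by `t.test()` can be reproduced directly from the formula above (reusing the `sbp` vector from the previous chunk), which is a useful check on the hand calculation:

```{r one-sample-t-manual}
# Manual one-sample t statistic for the SBP data
t_manual <- (mean(sbp) - 130) / (sd(sbp) / sqrt(length(sbp)))
t_manual
# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
2 * pt(-abs(t_manual), df = length(sbp) - 1)
```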
---
## Two-Sample t-Test
Compares means between two independent groups.
**Hypotheses (two-tailed):**
- H₀: μ₁ = μ₂
- H₁: μ₁ ≠ μ₂
**Assumptions:**
- Both samples are random
- Both populations are approximately normally distributed
- Equal population variances (otherwise use the Welch adjustment, which `t.test()` applies by default)
**Test statistic (assuming equal variances):**
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}$$
where $s_p$ is the pooled standard deviation.
**R Example:**
```{r two-sample-t-test, fig.alt="Boxplot comparing blood pressure between treatment and control groups with individual data points overlaid"}
# Generate sample data
set.seed(202)
control_bp <- rnorm(40, mean = 140, sd = 12)
treatment_bp <- rnorm(40, mean = 130, sd = 10)
# Create data frame
bp_comparison <- data.frame(
bp = c(control_bp, treatment_bp),
group = factor(rep(c("Control", "Treatment"), c(40, 40)))
)
# Two-sample t-test
t_test_comparison <- t.test(bp ~ group, data = bp_comparison, var.equal = TRUE)
print(t_test_comparison)
# Boxplot with points
boxplot(bp ~ group,
data = bp_comparison,
main = "Blood Pressure by Group",
ylab = "Systolic BP (mmHg)",
col = c("lightcoral", "lightblue"))
# Add individual data points
stripchart(bp ~ group,
data = bp_comparison,
vertical = TRUE,
method = "jitter",
add = TRUE,
pch = 20,
col = rgb(0, 0, 0, 0.3))
```
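Similarly, the pooled standard deviation and t statistic from the formula can be computed by hand (reusing `control_bp` and `treatment_bp` from the chunk above) to verify the `t.test()` output:

```{r two-sample-t-manual}
# Pooled standard deviation and t statistic for the control vs treatment comparison
n1 <- length(control_bp)
n2 <- length(treatment_bp)
sp <- sqrt(((n1 - 1) * var(control_bp) + (n2 - 1) * var(treatment_bp)) / (n1 + n2 - 2))
t_manual <- (mean(control_bp) - mean(treatment_bp)) / (sp * sqrt(1 / n1 + 1 / n2))
t_manual
# Two-sided p-value with n1 + n2 - 2 degrees of freedom
2 * pt(-abs(t_manual), df = n1 + n2 - 2)
```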
---
## Chi-Square Test for Variance
Tests whether a population variance equals a hypothesized value.
**Hypotheses (two-tailed):**
- H₀: σ² = σ₀²
- H₁: σ² ≠ σ₀²
**Test statistic:**
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
This follows a chi-square distribution with n - 1 degrees of freedom.
**R Example:**
```{r chi-square-variance, fig.alt="Console output showing chi-square test results for variance"}
# Test if population variance differs from 100
set.seed(303)
weight_data <- rnorm(30, mean = 75, sd = 12)
n <- length(weight_data)
s2 <- var(weight_data)
sigma2_0 <- 100
chi_sq_stat <- (n - 1) * s2 / sigma2_0
# Critical values for two-tailed test at α = 0.05
alpha <- 0.05
lower_critical <- qchisq(alpha/2, df = n - 1)
upper_critical <- qchisq(1 - alpha/2, df = n - 1)
cat("Test statistic:", chi_sq_stat, "\n")
cat("Critical values: [", lower_critical, ",", upper_critical, "]\n")
cat("Sample variance:", s2, "\n")
cat("P-value:", 2 * min(pchisq(chi_sq_stat, df = n - 1),
1 - pchisq(chi_sq_stat, df = n - 1)), "\n")
```
### Chi-Square Test of Independence
Tests whether there is a relationship between two categorical variables.
**Hypotheses:**
- H₀: No relationship between variables
- H₁: There is a relationship
**Example: 2×2 Contingency Table**
| | A = 1 | A = 0 | Total |
|-------|:---:|:---:|:---:|
| B = 1 | O₁₁ | O₁₂ | R₁ |
| B = 0 | O₂₁ | O₂₂ | R₂ |
| Total | C₁ | C₂ | N |
**Expected count under H₀:**
$$E_{ij} = \frac{R_i \times C_j}{N}$$
**Test statistic:**
$$\chi^2 = \sum_{i=1}^k \sum_{j=1}^m \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
**Rejection region:** $\chi^2 > \chi^2_{(k-1)(m-1),\, 1-\alpha}$
**Assumption:** At least 80 percent of cells have expected counts ≥ 5, and no expected count is below 1
**R Example:**
```{r chi-square-independence, fig.alt="Mosaic plot showing relationship between smoking status and disease"}
# Example: Relationship between smoking and lung disease
smoking <- c(rep("Smoker", 150), rep("Non-smoker", 150))
disease <- c(rep("Disease", 80), rep("No Disease", 70),
rep("Disease", 30), rep("No Disease", 120))
# Create contingency table
contingency_table <- table(smoking, disease)
print(contingency_table)
# Chi-square test
chi_test <- chisq.test(contingency_table)
print(chi_test)
# Show expected counts
print(chi_test$expected)
# Visualize
mosaicplot(contingency_table,
main = "Smoking Status vs Disease",
color = c("lightcoral", "lightblue"),
xlab = "Smoking Status",
ylab = "Disease Status")
```
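The expected counts reported by `chisq.test()` follow directly from the formula $E_{ij} = R_i C_j / N$; a manual check on a single cell (reusing `contingency_table` and `chi_test` from the chunk above):

```{r expected-counts-manual}
# Expected count for the (Non-smoker, Disease) cell: row total x column total / N
row_totals <- rowSums(contingency_table)
col_totals <- colSums(contingency_table)
N <- sum(contingency_table)
unname(row_totals["Non-smoker"] * col_totals["Disease"] / N)
# Matches the corresponding entry of the chisq.test() output
chi_test$expected["Non-smoker", "Disease"]
```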
---
## F-Distribution and F-Tests
### F-Distribution
The F-distribution is right-skewed and is indexed by two degrees-of-freedom parameters, ν₁ and ν₂.
**Key property:** If X₁ and X₂ are independent chi-square random variables with ν₁ and ν₂ degrees of freedom, respectively, then:
$$F = \frac{X_1/\nu_1}{X_2/\nu_2} \sim F_{\nu_1, \nu_2}$$
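One way to see this property is by simulation: ratios of independent chi-square variables, each divided by its degrees of freedom, should follow the corresponding F-distribution. The degrees of freedom below are arbitrary illustrative choices:

```{r f-ratio-simulation, fig.alt="Histogram of simulated ratios of scaled chi-square variables with the F(5, 10) density curve overlaid"}
set.seed(606)
nu1 <- 5
nu2 <- 10
# Simulate F variables as ratios of scaled, independent chi-square variables
f_sim <- (rchisq(5000, df = nu1) / nu1) / (rchisq(5000, df = nu2) / nu2)
hist(f_sim,
     probability = TRUE,
     breaks = 60,
     xlim = c(0, 6),
     main = "Simulated F(5, 10) Variables",
     xlab = "F",
     col = "lightgray")
# Theoretical F(5, 10) density for comparison
curve(df(x, df1 = nu1, df2 = nu2), add = TRUE, col = "red", lwd = 2)
```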
**Reciprocal property** (used to obtain lower-tail quantiles from upper-tail tables):
$$F_{\nu_1, \nu_2, \alpha} = \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}$$
**R Example:**
```{r f-distribution, fig.alt="Line plot showing three different F-distributions with varying degrees of freedom"}
# F-distributions with different degrees of freedom
x <- seq(0, 5, length = 200)
plot(x, df(x, df1 = 5, df2 = 10),
type = "l",
lwd = 2,
col = "red",
main = "F-Distributions",
xlab = "x",
ylab = "Density",
ylim = c(0, 1))
lines(x, df(x, df1 = 10, df2 = 10), col = "blue", lwd = 2)
lines(x, df(x, df1 = 20, df2 = 30), col = "green", lwd = 2)
legend("topright",
legend = c("F(5,10)", "F(10,10)", "F(20,30)"),
col = c("red", "blue", "green"),
lwd = 2)
# Computing lower percentile using symmetry property
# F(5,10,0.05) = 1/F(10,5,0.95)
qf(0.05, df1 = 5, df2 = 10)
1 / qf(0.95, df1 = 10, df2 = 5)
```
### F-Test for Equal Variances
Tests whether population variances from two samples are equal.
**Hypotheses (two-tailed):**
- H₀: σ₁² = σ₂²
- H₁: σ₁² ≠ σ₂²
**Test statistic:**
$$F = \frac{s_1^2}{s_2^2}$$
**Rejection region:** $F < F_{\nu_1, \nu_2, \alpha/2}$ or $F > F_{\nu_1, \nu_2, 1-\alpha/2}$
**Assumptions:** Random samples from two normally distributed populations
**R Example:**
```{r f-test-variances, fig.alt="Boxplot comparing distributions of two groups with different variances"}
# Compare variances of two groups
set.seed(404)
group1 <- rnorm(30, mean = 50, sd = 10)
group2 <- rnorm(25, mean = 50, sd = 15)
# F-test for equal variances
var_test <- var.test(group1, group2)
print(var_test)
# Manual calculation
f_stat <- var(group1) / var(group2)
cat("F-statistic:", f_stat, "\n")
cat("Group 1 variance:", var(group1), "\n")
cat("Group 2 variance:", var(group2), "\n")
# Visualize
boxplot(group1, group2,
names = c("Group 1", "Group 2"),
main = "Comparison of Two Groups",
ylab = "Value",
col = c("lightblue", "lightgreen"))
```
---
# Practical Considerations
## Planning Your Analysis
It is crucial to plan your analysis **before** you collect data:
1. **State the research question** clearly and concisely
2. **Formulate hypotheses** (null and alternative)
3. **Identify the study design**, which dictates the appropriate statistical analyses
4. **Verify assumptions** - they depend on the study design
## Working with R for Statistical Analysis
```{r practical-example, fig.alt="Boxplot with individual data points comparing blood pressure between control and treatment groups"}
# Complete example: Analyzing a dataset
# Generate sample data: Effect of treatment on blood pressure
set.seed(505)
n_control <- 50
n_treatment <- 50
control_bp <- rnorm(n_control, mean = 140, sd = 15)
treatment_bp <- rnorm(n_treatment, mean = 130, sd = 12)
# Combine into data frame
bp_data <- data.frame(
bp = c(control_bp, treatment_bp),
group = factor(rep(c("Control", "Treatment"), c(n_control, n_treatment)))
)
# 1. Descriptive statistics
library(dplyr)
bp_data %>%
group_by(group) %>%
summarise(
n = n(),
mean_bp = mean(bp),
sd_bp = sd(bp),
se_bp = sd_bp / sqrt(n)
)
# 2. Visualize
boxplot(bp ~ group,
data = bp_data,
main = "Blood Pressure by Treatment Group",
ylab = "Systolic BP (mmHg)",
col = c("lightcoral", "lightblue"))
# Add points
stripchart(bp ~ group,
data = bp_data,
vertical = TRUE,
method = "jitter",
add = TRUE,
pch = 20,
col = "darkgray")
# 3. Test for equal variances
var.test(bp ~ group, data = bp_data)
# 4. Two-sample t-test
t.test(bp ~ group, data = bp_data, var.equal = TRUE)
# 5. Effect size (Cohen's d)
mean_diff <- mean(treatment_bp) - mean(control_bp)
pooled_sd <- sqrt(((n_control - 1) * var(control_bp) +
(n_treatment - 1) * var(treatment_bp)) /
(n_control + n_treatment - 2))
cohens_d <- mean_diff / pooled_sd
cat("Cohen's d =", cohens_d, "\n")
```
---
# Summary
## Key Takeaways
1. **Variable types** determine appropriate statistical methods
2. **Probability distributions** (normal, binomial, Poisson, t, χ², F) are foundational for inference
3. **Hypothesis testing** follows a structured framework
4. **R provides built-in functions** for all common statistical tests
5. **Always plan your analysis** before collecting data
## Functions Covered
**Probability distributions:**
- `dnorm()`, `pnorm()`, `qnorm()`, `rnorm()` - Normal distribution
- `dbinom()`, `pbinom()`, `rbinom()` - Binomial distribution
- `dpois()`, `ppois()`, `rpois()` - Poisson distribution
- `dt()`, `pt()`, `qt()`, `rt()` - t-distribution
- `dchisq()`, `pchisq()`, `qchisq()` - Chi-square distribution
- `df()`, `pf()`, `qf()` - F-distribution
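The `dt()`, `pt()`, `qt()`, and `rt()` functions listed above follow the same d/p/q/r naming convention as the other distributions; a brief demonstration with an arbitrarily chosen 20 degrees of freedom:

```{r t-distribution-functions}
# Density, cumulative probability, quantile, and random draws for t with 20 df
dt(0, df = 20)       # density at 0
pt(2.086, df = 20)   # P(T < 2.086), approximately 0.975
qt(0.975, df = 20)   # 97.5th percentile, approximately 2.086
set.seed(707)
rt(5, df = 20)       # five random draws
```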
**Statistical tests:**
- `t.test()` - One and two-sample t-tests
- `var.test()` - F-test for equal variances
- `chisq.test()` - Chi-square test of independence
## Next Steps
In upcoming lectures, we'll build on these foundations to explore:
- Linear regression models
- Logistic regression for binary outcomes
- Generalized linear models
- Mixed effects models
---
# References
Kleinbaum, D. G., Kupper, L. L., Nizam, A., and Rosenberg, E. S. (2013). *Applied Regression Analysis and Other Multivariable Methods*. Cengage Learning.
---
# Session Information
```{r session-info}
sessionInfo()
```