---
title: "EPI 553 - Lecture 1: Review of Biostatistical Foundations"
author: "Muntasir Masum"
date: "January 27, 2026"
format:
  html:
    toc: true
    toc-depth: 3
    toc-location: left
    number-sections: true
    theme: readable
    highlight-style: tango
    code-fold: show
    code-tools: true
lang: en
css:
  - style.css
  - accessibility.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.width = 8,
fig.height = 6,
fig.alt = "Statistical visualization" # Placeholder - set specific alt text per chunk
)
```
```{css, echo=FALSE}
/* Accessibility-enhanced CSS for biostatistics lecture */
/* Base typography with sufficient contrast and sizing */
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
line-height: 1.6;
font-size: 16px;
color: #1a1a1a; /* Dark text for 7:1 contrast with white background */
background-color: #ffffff;
}
/* Ensure proper heading hierarchy and contrast */
h1 {
color: #000000;
font-size: 28px;
border-bottom: 3px solid #1a1a1a;
padding-bottom: 10px;
}
h2 {
color: #1a1a1a;
border-left: 4px solid #003d99; /* High contrast blue */
padding-left: 10px;
font-size: 22px;
margin-top: 1.2em;
margin-bottom: 0.8em;
}
h3 {
color: #1a1a1a;
font-size: 18px;
margin-top: 1em;
margin-bottom: 0.6em;
}
/* Code and pre-formatted text */
code {
background-color: #f4f4f4;
padding: 2px 6px;
border-radius: 3px;
font-family: 'Courier New', monospace;
font-size: 15px;
color: #000000;
}
pre {
background-color: #f8f8f8;
border: 1px solid #333333;
border-radius: 4px;
padding: 12px;
overflow-x: auto;
line-height: 1.5;
}
/* Mathematical equations */
.math {
color: #000000;
font-family: 'Times New Roman', serif;
}
/* Table accessibility */
table {
border-collapse: collapse;
width: 100%;
margin: 15px 0;
border: 1px solid #333333;
}
th, td {
border: 1px solid #333333;
padding: 12px;
text-align: left;
}
th {
background-color: #1a1a1a;
color: #ffffff;
font-weight: bold;
}
tr:nth-child(even) {
background-color: #f9f9f9;
}
tr:nth-child(odd) {
background-color: #ffffff;
}
/* List styling for readability */
ul, ol {
margin: 10px 0;
padding-left: 25px;
line-height: 1.8;
}
li {
margin-bottom: 8px;
}
/* Focus styles for keyboard navigation */
a:focus, button:focus {
outline: 2px solid #003d99;
outline-offset: 2px;
}
/* Blockquote styling */
blockquote {
border-left: 4px solid #003d99;
padding-left: 15px;
margin-left: 0;
color: #1a1a1a;
font-style: italic;
}
/* Emphasis styling */
strong, b {
font-weight: bold;
}
em, i {
font-style: italic;
}
```
# Introduction
This lecture reviews the foundational biostatistical concepts covered in HSTA/HEPI 552. These concepts are essential for understanding the advanced statistical modeling techniques we'll explore in EPI 553.
## Learning Objectives
By the end of this review, you should be able to:
- Understand different types of variables and their distributions
- Apply probability theory concepts to statistical inference
- Conduct and interpret various hypothesis tests
- Work with common probability distributions in R
---
# Modeling and Variable Types
## What is Statistical Modeling?
Most research aims to assess relationships among a set of variables. The choice of analysis depends on several factors:
- Research question
- Study design
- Mathematical characteristics of variables collected
- Distribution assumptions about these variables
- Sampling scheme used
## Types of Random Variables
A **random variable** is a variable whose observed values are possible outcomes of a random experiment.
### Role of Variables
Variables can serve different roles depending on your research question:
- **Independent variables** (predictors, covariates, explanatory variables): describe or predict other variables
- **Dependent variables** (response, outcome): are described or predicted by other variables
**Important note:** A variable can be a predictor in one study and a response in another. For example:
- Stroke predicted by systolic blood pressure (SBP)
- SBP predicted by age
### Categorical Variables
**Nominal variables** have categories without a natural hierarchy:
- Gender
- Ethnic identity
- Cancer type
**Ordinal variables** have categories with a natural ordering:
- Level of pain (mild, moderate, severe)
- Response frequency (never, sometimes, always)
**Example in R:**
```{r categorical-example, fig.alt="R output showing factor creation for nominal and ordinal variables"}
# Creating a nominal variable
gender <- factor(c("Male", "Female", "Female", "Male", "Male"))
print(gender)
# Creating an ordinal variable
pain_level <- factor(c("Mild", "Severe", "Moderate", "Mild", "Severe"),
levels = c("Mild", "Moderate", "Severe"),
ordered = TRUE)
print(pain_level)
```
### Quantitative Variables
**Discrete variables** take on only countable distinct values:
- Number of children in a family
- Number of cancers in a population
**Continuous variables** can take any value within an interval (uncountably many possible values):
- Weight, height, time
- Can be **interval** (no true zero) or **ratio** (has true zero)
**Example in R:**
```{r quantitative-example, fig.alt="Histograms showing distributions of discrete and continuous variables"}
# Discrete variable
num_children <- c(0, 2, 1, 3, 2, 1, 0, 4)
hist(num_children,
main = "Distribution of Number of Children",
xlab = "Number of Children",
col = "lightblue",
breaks = seq(-0.5, 4.5, 1))
# Continuous variable
weights <- rnorm(100, mean = 70, sd = 10)
hist(weights,
main = "Distribution of Weights (kg)",
xlab = "Weight (kg)",
col = "lightgreen",
breaks = 20)
```
---
# Probability Distributions
## What is a Probability Distribution?
The **probability distribution** of a random variable gives the relative frequencies associated with all possible values in a population.
Probability distributions are mathematical functions that provide the probability corresponding to each value or range of values taken by a random variable.
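For a concrete (hypothetical) illustration, the empirical distribution of a discrete variable can be approximated in R by tabulating relative frequencies; the data below are made up for demonstration:

```{r empirical-distribution, fig.alt="Bar plot of empirical relative frequencies for number of emergency department visits"}
# Hypothetical counts of emergency department visits per patient
er_visits <- c(0, 1, 0, 2, 1, 0, 3, 1, 0, 2, 0, 1)
# Relative frequencies approximate the probability distribution
rel_freq <- table(er_visits) / length(er_visits)
print(rel_freq)
barplot(rel_freq,
        main = "Empirical Distribution of ED Visits",
        xlab = "Number of Visits",
        ylab = "Relative Frequency",
        col = "gray70")
```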
## Common Discrete Distributions
### Binomial Distribution
A binomial random variable describes the **number of occurrences** of an event in a series of *n* trials.
**Parameters:**
- n: fixed number of trials
- p: probability of success on each trial
**Properties:**
- Each trial outcome is independent
- Dichotomous outcome (success/failure)
- Probability of success remains constant across trials
**Notation:** If X follows a binomial distribution with parameters n and p, the probability mass function is:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \ldots, n\}$$
**R Example:**
```{r binomial-distribution, fig.alt="Bar plot showing binomial probability distribution with n=10, p=0.3"}
# Binomial distribution with n=10, p=0.3
n <- 10
p <- 0.3
x <- 0:10
# Probability mass function
prob <- dbinom(x, size = n, prob = p)
# Visualize
barplot(prob,
names.arg = x,
main = "Binomial Distribution (n=10, p=0.3)",
xlab = "Number of Successes",
ylab = "Probability",
col = "steelblue")
# Example: Probability of exactly 3 successes
dbinom(3, size = 10, prob = 0.3)
# Cumulative probability: P(X <= 3)
pbinom(3, size = 10, prob = 0.3)
# Generate random sample
set.seed(123)
random_sample <- rbinom(1000, size = 10, prob = 0.3)
hist(random_sample,
main = "Random Sample from Binomial(10, 0.3)",
xlab = "Number of Successes",
col = "coral",
breaks = seq(-0.5, 10.5, 1))
```
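To connect the probability mass function above to `dbinom()`, the same probability can be computed by hand with `choose()`; the two approaches should agree:

```{r binomial-pmf-check}
# Manual PMF calculation: P(X = 3) when n = 10 and p = 0.3
choose(10, 3) * 0.3^3 * (1 - 0.3)^(10 - 3)
# Built-in function for comparison
dbinom(3, size = 10, prob = 0.3)
```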
### Poisson Distribution
A Poisson random variable describes the **number of events during a fixed interval** of time.
**Parameter:**
- λ (lambda): expected number of events (λ > 0)
**Properties:**
- Occurrences of events are independent
- Probability of a single event is proportional to interval length
- Two events cannot occur at exactly the same time
**Notation:** If X follows a Poisson distribution with parameter λ, the probability mass function is:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \{0, 1, 2, \ldots\}$$
**R Example:**
```{r poisson-distribution, fig.alt="Bar plot showing Poisson probability distribution with lambda=3"}
# Poisson distribution with lambda = 3
lambda <- 3
x <- 0:10
# Probability mass function
prob <- dpois(x, lambda = lambda)
# Visualize
barplot(prob,
names.arg = x,
main = "Poisson Distribution (λ=3)",
xlab = "Number of Events",
ylab = "Probability",
col = "darkgreen")
# Example: Number of hospital admissions per hour
# If λ = 5 admissions per hour
lambda_admissions <- 5
# Probability of exactly 3 admissions in an hour
dpois(3, lambda = lambda_admissions)
# Probability of 5 or fewer admissions
ppois(5, lambda = lambda_admissions)
# Generate random sample
set.seed(456)
random_admissions <- rpois(1000, lambda = lambda_admissions)
hist(random_admissions,
main = "Random Sample: Hospital Admissions per Hour",
xlab = "Number of Admissions",
col = "purple",
breaks = seq(-0.5, max(random_admissions) + 0.5, 1))
```
## Common Continuous Distributions
### Normal Distribution
The normal distribution is the most common continuous probability distribution.
**Parameters:**
- μ (mu): mean
- σ² (sigma squared): variance
**Properties:**
- Bell-shaped
- Symmetric around the mean
- Completely determined by mean and variance
**Notation:** If X follows a normal distribution with mean μ and variance σ², the probability density function is:
$$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$
**R Functions for Normal Distribution:**
```{r normal-distribution, fig.alt="Line plot showing the standard normal distribution density curve"}
# Standard normal distribution (μ=0, σ=1)
x <- seq(-4, 4, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y,
type = "l",
lwd = 2,
col = "blue",
main = "Standard Normal Distribution",
xlab = "x",
ylab = "Density")
abline(v = 0, lty = 2, col = "red")
# Density at specific point
dnorm(0, mean = 0, sd = 1)
# Cumulative probability: P(Z < 1.96)
pnorm(1.96, mean = 0, sd = 1)
# Find the value z where P(Z < z) = 0.975 (97.5th percentile)
qnorm(0.975, mean = 0, sd = 1)
# Generate random sample
set.seed(789)
random_normal <- rnorm(1000, mean = 100, sd = 15)
hist(random_normal,
probability = TRUE,
main = "Random Sample from N(100, 15²)",
xlab = "Value",
col = "skyblue",
breaks = 30)
curve(dnorm(x, mean = 100, sd = 15),
add = TRUE,
col = "red",
lwd = 2)
```
### Central Limit Theorem
The normal distribution is fundamental because of the **Central Limit Theorem**:
> The sampling distribution of the mean of independent observations from a random variable with finite mean and variance approaches a normal distribution as the sample size increases.

This means that for large samples the sample mean is approximately normally distributed regardless of the shape of the parent distribution, which is why the normal distribution is so central to statistical inference.
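A quick simulation sketch (with an arbitrarily chosen exponential parent distribution and sample size) makes the theorem concrete: even though individual observations are strongly right-skewed, the distribution of their sample means looks approximately normal:

```{r clt-simulation, fig.alt="Histogram of 1000 sample means from an exponential distribution with the normal curve implied by the Central Limit Theorem overlaid"}
set.seed(2026)
# 1000 sample means, each based on n = 50 observations from Exponential(rate = 1)
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))
hist(sample_means,
     probability = TRUE,
     breaks = 30,
     main = "Sampling Distribution of the Mean (Exponential Parent, n = 50)",
     xlab = "Sample Mean",
     col = "lightyellow")
# Normal approximation implied by the CLT: mean = 1, sd = 1 / sqrt(50)
curve(dnorm(x, mean = 1, sd = 1 / sqrt(50)), add = TRUE, col = "red", lwd = 2)
```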
---
# Hypothesis Testing Framework {.section}
## Structure of Hypothesis Tests
All hypothesis tests follow the same framework (a worked example follows these steps):
1. **State hypotheses:**
- Null hypothesis (H₀): No effect or difference
- Alternative hypothesis (H₁): Effect or difference exists
2. **Choose significance level (α):** Usually 0.05
3. **Calculate test statistic:** Depends on the type of test
4. **Find the critical value or p-value:** Using the appropriate distribution
5. **Make a decision:** Reject or fail to reject H₀
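As a minimal sketch of these five steps, consider testing H₀: μ = 50 against H₁: μ ≠ 50 when the population standard deviation is assumed known (a one-sample z-test); the data and the value σ = 8 are simulated here purely for illustration:

```{r hypothesis-framework-example}
# Step 1: H0: mu = 50 vs H1: mu != 50 (two-sided)
# Step 2: significance level
alpha <- 0.05
# Simulated data with an assumed known population sd of 8
set.seed(111)
x <- rnorm(40, mean = 53, sd = 8)
# Step 3: test statistic (z, because sigma is treated as known)
z_stat <- (mean(x) - 50) / (8 / sqrt(length(x)))
z_stat
# Step 4: two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))
p_value
# Step 5: decision (TRUE means reject H0 at level alpha)
p_value < alpha
```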
## Important Definitions
- **Type I error (α):** Reject H₀ when it is true (false positive)
- **Type II error (β):** Fail to reject H₀ when it is false (false negative)
- **Power (1 - β):** Probability of correctly rejecting H₀ when it is false
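These quantities can be explored directly in R. For example, `power.t.test()` from the base stats package relates sample size, effect size, and power for t-tests; the numbers below are arbitrary illustrative values:

```{r power-example}
# Power of a two-sample t-test: n = 50 per group, true difference 5, sd 15, alpha 0.05
power.t.test(n = 50, delta = 5, sd = 15, sig.level = 0.05, type = "two.sample")
# Sample size per group needed for 80% power to detect the same difference
power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.80, type = "two.sample")
```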
---
## One-Sample t-Test
Tests whether a sample mean differs from a hypothesized population mean.
**Hypotheses:**
- H₀: μ = μ₀
- H₁: μ ≠ μ₀ (two-tailed test)
**Assumptions:**
- Sample is random
- Population is approximately normally distributed (or the sample size is large enough for the Central Limit Theorem to apply)
**Test statistic:**
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where:
- $\bar{x}$ is the sample mean
- s is the sample standard deviation
- n is the sample size
- This follows a t-distribution with n - 1 degrees of freedom
**R Example:**
```{r one-sample-t-test, fig.alt="Histogram showing distribution of systolic blood pressure with mean marked"}
# Example: Is average systolic BP different from 130?
set.seed(101)
sbp <- rnorm(50, mean = 125, sd = 15)
# Conduct one-sample t-test
t_test_result <- t.test(sbp, mu = 130)
print(t_test_result)
# Visualize
hist(sbp,
main = "Systolic Blood Pressure",
xlab = "SBP (mmHg)",
col = "lightblue",
breaks = 15)
abline(v = mean(sbp), col = "red", lwd = 2, lty = 2)
abline(v = 130, col = "green", lwd = 2, lty = 3)
legend("topright",
       legend = c("Sample mean", "Hypothesized mean"),
       col = c("red", "green"),
       lty = c(2, 3),
       lwd = 2)
```
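The statistic reported by `t.test()` can be reproduced directly from the formula above (reusing the `sbp` vector from the previous chunk), which is a useful check on the hand calculation:

```{r one-sample-t-manual}
# Manual one-sample t statistic for the SBP data
t_manual <- (mean(sbp) - 130) / (sd(sbp) / sqrt(length(sbp)))
t_manual
# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
2 * pt(-abs(t_manual), df = length(sbp) - 1)
```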
---
## Two-Sample t-Test
Compares means between two independent groups.
**Hypotheses (two-tailed):**
- H₀: μ₁ = μ₂
- H₁: μ₁ ≠ μ₂
**Assumptions:**
- Both samples are random
- Both populations are approximately normally distributed
- Equal population variances (otherwise use the Welch adjustment, which `t.test()` applies by default)
**Test statistic (assuming equal variances):**
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}$$
where $s_p$ is the pooled standard deviation.
**R Example:**
```{r two-sample-t-test, fig.alt="Boxplot comparing blood pressure between treatment and control groups with individual data points overlaid"}
# Generate sample data
set.seed(202)
control_bp <- rnorm(40, mean = 140, sd = 12)
treatment_bp <- rnorm(40, mean = 130, sd = 10)
# Create data frame
bp_comparison <- data.frame(
bp = c(control_bp, treatment_bp),
group = factor(rep(c("Control", "Treatment"), c(40, 40)))
)
# Two-sample t-test
t_test_comparison <- t.test(bp ~ group, data = bp_comparison, var.equal = TRUE)
print(t_test_comparison)
# Boxplot with points
boxplot(bp ~ group,
data = bp_comparison,
main = "Blood Pressure by Group",
ylab = "Systolic BP (mmHg)",
col = c("lightcoral", "lightblue"))
# Add individual data points
stripchart(bp ~ group,
data = bp_comparison,
vertical = TRUE,
method = "jitter",
add = TRUE,
pch = 20,
col = rgb(0, 0, 0, 0.3))
```
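Similarly, the pooled standard deviation and t statistic from the formula can be computed by hand (reusing `control_bp` and `treatment_bp` from the chunk above) to verify the `t.test()` output:

```{r two-sample-t-manual}
# Pooled standard deviation and t statistic for the control vs treatment comparison
n1 <- length(control_bp)
n2 <- length(treatment_bp)
sp <- sqrt(((n1 - 1) * var(control_bp) + (n2 - 1) * var(treatment_bp)) / (n1 + n2 - 2))
t_manual <- (mean(control_bp) - mean(treatment_bp)) / (sp * sqrt(1 / n1 + 1 / n2))
t_manual
# Two-sided p-value with n1 + n2 - 2 degrees of freedom
2 * pt(-abs(t_manual), df = n1 + n2 - 2)
```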
---
## Chi-Square Test for Variance
Tests whether a population variance equals a hypothesized value.
**Hypotheses (two-tailed):**
- H₀: σ² = σ₀²
- H₁: σ² ≠ σ₀²
**Test statistic:**
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
This follows a chi-square distribution with n - 1 degrees of freedom.
**R Example:**
```{r chi-square-variance, fig.alt="Console output showing chi-square test results for variance"}
# Test if population variance differs from 100
set.seed(303)
weight_data <- rnorm(30, mean = 75, sd = 12)
n <- length(weight_data)
s2 <- var(weight_data)
sigma2_0 <- 100
chi_sq_stat <- (n - 1) * s2 / sigma2_0
# Critical values for two-tailed test at α = 0.05
alpha <- 0.05
lower_critical <- qchisq(alpha/2, df = n - 1)
upper_critical <- qchisq(1 - alpha/2, df = n - 1)
cat("Test statistic:", chi_sq_stat, "\n")
cat("Critical values: [", lower_critical, ",", upper_critical, "]\n")
cat("Sample variance:", s2, "\n")
cat("P-value:", 2 * min(pchisq(chi_sq_stat, df = n - 1),
1 - pchisq(chi_sq_stat, df = n - 1)), "\n")
```
### Chi-Square Test of Independence
Tests whether there is a relationship between two categorical variables.
**Hypotheses:**
- H₀: No relationship between variables
- H₁: There is a relationship
**Example: 2×2 Contingency Table**
| | A = 1 | A = 0 | Total |
|-------|:---:|:---:|:---:|
| B = 1 | O₁₁ | O₁₂ | R₁ |
| B = 0 | O₂₁ | O₂₂ | R₂ |
| Total | C₁ | C₂ | N |
**Expected count under H₀:**
$$E_{ij} = \frac{R_i \times C_j}{N}$$
**Test statistic:**
$$\chi^2 = \sum_{i=1}^k \sum_{j=1}^m \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
**Rejection region:** $\chi^2 > \chi^2_{(k-1)(m-1),\, 1-\alpha}$
**Assumption:** At least 80 percent of cells have expected counts ≥ 5, and no expected count is below 1
**R Example:**
```{r chi-square-independence, fig.alt="Mosaic plot showing relationship between smoking status and disease"}
# Example: Relationship between smoking and lung disease
smoking <- c(rep("Smoker", 150), rep("Non-smoker", 150))
disease <- c(rep("Disease", 80), rep("No Disease", 70),
rep("Disease", 30), rep("No Disease", 120))
# Create contingency table
contingency_table <- table(smoking, disease)
print(contingency_table)
# Chi-square test
chi_test <- chisq.test(contingency_table)
print(chi_test)
# Show expected counts
print(chi_test$expected)
# Visualize
mosaicplot(contingency_table,
main = "Smoking Status vs Disease",
color = c("lightcoral", "lightblue"),
xlab = "Smoking Status",
ylab = "Disease Status")
```
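The expected counts reported by `chisq.test()` follow directly from the formula $E_{ij} = R_i C_j / N$; a manual check on a single cell (reusing `contingency_table` and `chi_test` from the chunk above):

```{r expected-counts-manual}
# Expected count for the (Non-smoker, Disease) cell: row total x column total / N
row_totals <- rowSums(contingency_table)
col_totals <- colSums(contingency_table)
N <- sum(contingency_table)
unname(row_totals["Non-smoker"] * col_totals["Disease"] / N)
# Matches the corresponding entry of the chisq.test() output
chi_test$expected["Non-smoker", "Disease"]
```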
---
## F-Distribution and F-Tests
### F-Distribution
The F-distribution is right-skewed and is indexed by two degrees-of-freedom parameters, ν₁ and ν₂.
**Key property:** If X₁ and X₂ are independent chi-square random variables with ν₁ and ν₂ degrees of freedom, respectively, then:
$$F = \frac{X_1/\nu_1}{X_2/\nu_2} \sim F_{\nu_1, \nu_2}$$
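One way to see this property is by simulation: ratios of independent chi-square variables, each divided by its degrees of freedom, should follow the corresponding F-distribution. The degrees of freedom below are arbitrary illustrative choices:

```{r f-ratio-simulation, fig.alt="Histogram of simulated ratios of scaled chi-square variables with the F(5, 10) density curve overlaid"}
set.seed(606)
nu1 <- 5
nu2 <- 10
# Simulate F variables as ratios of scaled, independent chi-square variables
f_sim <- (rchisq(5000, df = nu1) / nu1) / (rchisq(5000, df = nu2) / nu2)
hist(f_sim,
     probability = TRUE,
     breaks = 60,
     xlim = c(0, 6),
     main = "Simulated F(5, 10) Variables",
     xlab = "F",
     col = "lightgray")
# Theoretical F(5, 10) density for comparison
curve(df(x, df1 = nu1, df2 = nu2), add = TRUE, col = "red", lwd = 2)
```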
**Reciprocal property** (used to obtain lower-tail quantiles from upper-tail tables):
$$F_{\nu_1, \nu_2, \alpha} = \frac{1}{F_{\nu_2, \nu_1, 1-\alpha}}$$
**R Example:**
```{r f-distribution, fig.alt="Line plot showing three different F-distributions with varying degrees of freedom"}
# F-distributions with different degrees of freedom
x <- seq(0, 5, length = 200)
plot(x, df(x, df1 = 5, df2 = 10),
type = "l",
lwd = 2,
col = "red",
main = "F-Distributions",
xlab = "x",
ylab = "Density",
ylim = c(0, 1))
lines(x, df(x, df1 = 10, df2 = 10), col = "blue", lwd = 2)
lines(x, df(x, df1 = 20, df2 = 30), col = "green", lwd = 2)
legend("topright",
legend = c("F(5,10)", "F(10,10)", "F(20,30)"),
col = c("red", "blue", "green"),
lwd = 2)
# Computing lower percentile using symmetry property
# F(5,10,0.05) = 1/F(10,5,0.95)
qf(0.05, df1 = 5, df2 = 10)
1 / qf(0.95, df1 = 10, df2 = 5)
```
### F-Test for Equal Variances
Tests whether population variances from two samples are equal.
**Hypotheses (two-tailed):**
- H₀: σ₁² = σ₂²
- H₁: σ₁² ≠ σ₂²
**Test statistic:**
$$F = \frac{s_1^2}{s_2^2}$$
**Rejection region:** $F < F_{\nu_1, \nu_2, \alpha/2}$ or $F > F_{\nu_1, \nu_2, 1-\alpha/2}$
**Assumptions:** Random samples from two normally distributed populations
**R Example:**
```{r f-test-variances, fig.alt="Boxplot comparing distributions of two groups with different variances"}
# Compare variances of two groups
set.seed(404)
group1 <- rnorm(30, mean = 50, sd = 10)
group2 <- rnorm(25, mean = 50, sd = 15)
# F-test for equal variances
var_test <- var.test(group1, group2)
print(var_test)
# Manual calculation
f_stat <- var(group1) / var(group2)
cat("F-statistic:", f_stat, "\n")
cat("Group 1 variance:", var(group1), "\n")
cat("Group 2 variance:", var(group2), "\n")
# Visualize
boxplot(group1, group2,
names = c("Group 1", "Group 2"),
main = "Comparison of Two Groups",
ylab = "Value",
col = c("lightblue", "lightgreen"))
```
---
# Practical Considerations
## Planning Your Analysis
It is crucial to plan your analysis **before** you collect data:
1. **State the research question** clearly and concisely
2. **Formulate hypotheses** (null and alternative)
3. **Identify the study design**, which dictates the appropriate statistical analyses
4. **Verify assumptions** - they depend on the study design
## Working with R for Statistical Analysis
```{r practical-example, fig.alt="Boxplot with individual data points comparing blood pressure between control and treatment groups"}
# Complete example: Analyzing a dataset
# Generate sample data: Effect of treatment on blood pressure
set.seed(505)
n_control <- 50
n_treatment <- 50
control_bp <- rnorm(n_control, mean = 140, sd = 15)
treatment_bp <- rnorm(n_treatment, mean = 130, sd = 12)
# Combine into data frame
bp_data <- data.frame(
bp = c(control_bp, treatment_bp),
group = factor(rep(c("Control", "Treatment"), c(n_control, n_treatment)))
)
# 1. Descriptive statistics
library(dplyr)
bp_data %>%
group_by(group) %>%
summarise(
n = n(),
mean_bp = mean(bp),
sd_bp = sd(bp),
se_bp = sd_bp / sqrt(n)
)
# 2. Visualize
boxplot(bp ~ group,
data = bp_data,
main = "Blood Pressure by Treatment Group",
ylab = "Systolic BP (mmHg)",
col = c("lightcoral", "lightblue"))
# Add points
stripchart(bp ~ group,
data = bp_data,
vertical = TRUE,
method = "jitter",
add = TRUE,
pch = 20,
col = "darkgray")
# 3. Test for equal variances
var.test(bp ~ group, data = bp_data)
# 4. Two-sample t-test
t.test(bp ~ group, data = bp_data, var.equal = TRUE)
# 5. Effect size (Cohen's d)
mean_diff <- mean(treatment_bp) - mean(control_bp)
pooled_sd <- sqrt(((n_control - 1) * var(control_bp) +
(n_treatment - 1) * var(treatment_bp)) /
(n_control + n_treatment - 2))
cohens_d <- mean_diff / pooled_sd
cat("Cohen's d =", cohens_d, "\n")
```
---
# Summary
## Key Takeaways
1. **Variable types** determine appropriate statistical methods
2. **Probability distributions** (normal, binomial, Poisson, t, χ², F) are foundational for inference
3. **Hypothesis testing** follows a structured framework
4. **R provides built-in functions** for all common statistical tests
5. **Always plan your analysis** before collecting data
## Functions Covered
**Probability distributions:**
- `dnorm()`, `pnorm()`, `qnorm()`, `rnorm()` - Normal distribution
- `dbinom()`, `pbinom()`, `rbinom()` - Binomial distribution
- `dpois()`, `ppois()`, `rpois()` - Poisson distribution
- `dt()`, `pt()`, `qt()`, `rt()` - t-distribution
- `dchisq()`, `pchisq()`, `qchisq()` - Chi-square distribution
- `df()`, `pf()`, `qf()` - F-distribution
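The `dt()`, `pt()`, `qt()`, and `rt()` functions listed above follow the same d/p/q/r naming convention as the other distributions; a brief demonstration with an arbitrarily chosen 20 degrees of freedom:

```{r t-distribution-functions}
# Density, cumulative probability, quantile, and random draws for t with 20 df
dt(0, df = 20)       # density at 0
pt(2.086, df = 20)   # P(T < 2.086), approximately 0.975
qt(0.975, df = 20)   # 97.5th percentile, approximately 2.086
set.seed(707)
rt(5, df = 20)       # five random draws
```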
**Statistical tests:**
- `t.test()` - One and two-sample t-tests
- `var.test()` - F-test for equal variances
- `chisq.test()` - Chi-square test of independence
## Next Steps
In upcoming lectures, we'll build on these foundations to explore:
- Linear regression models
- Logistic regression for binary outcomes
- Generalized linear models
- Mixed effects models
---
# References
Kleinbaum, D. G., Kupper, L. L., Nizam, A., and Rosenberg, E. S. (2013). *Applied Regression Analysis and Other Multivariable Methods*. Cengage Learning.
---
# Session Information
```{r session-info}
sessionInfo()
```