Hypothesis Testing & p-values

What is Hypothesis Testing?

Hypothesis testing is a statistical framework for making decisions about a population based on sample data.

We start with two competing claims:

Null Hypothesis \(H_0\): The default assumption (e.g., “no effect”, “no difference”)
Alternative Hypothesis \(H_a\): What we are trying to find evidence for

The goal is to use data to decide whether we have enough evidence to reject \(H_0\) in favor of \(H_a\).

Example: Does a new drug lower blood pressure more than a placebo?
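
One way to formalize this example, assuming \(\mu_D\) and \(\mu_P\) denote the mean reduction in blood pressure under the drug and under the placebo (symbols introduced here for illustration only), is as a one-sided test:

\[H_0: \mu_D - \mu_P = 0 \qquad H_a: \mu_D - \mu_P > 0\]

The one-sided alternative follows the directional wording of the question ("lower ... more"). A two-sided alternative, \(\mu_D \neq \mu_P\), would instead ask whether the drug differs from the placebo in either direction.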

The Five Steps

1. State \(H_0\) and \(H_a\)
2. Choose a significance level \(\alpha\) (commonly 0.05)
3. Compute the test statistic from the data
4. Calculate the p-value
5. Conclude: reject \(H_0\) if p-value \(< \alpha\), otherwise fail to reject \(H_0\)
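
The five steps can be sketched in R as follows. This is a minimal illustration, not a prescribed workflow: the data vector `x`, the hypothesized mean `mu0 = 5`, and the variable names are all invented here, and `t.test()` from base R handles steps 3 and 4 together.

```r
# Hypothetical measurements, invented purely for illustration.
x <- c(5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.7)

# Step 1: H0: mu = 5 versus Ha: mu != 5 (two-sided).
mu0 <- 5
# Step 2: choose the significance level.
alpha <- 0.05
# Steps 3 and 4: t.test() computes the test statistic and the p-value.
fit   <- t.test(x, mu = mu0, alternative = "two.sided")
t_obs <- unname(fit$statistic)
p_val <- fit$p.value
# Step 5: apply the decision rule.
decision <- if (p_val < alpha) "reject H0" else "fail to reject H0"
cat("t =", round(t_obs, 3), " p-value =", round(p_val, 4), " ->", decision, "\n")
```

Fixing `alpha` before looking at the p-value keeps the decision rule from being tuned to the data.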

The Math: Test Statistic

For a one-sample t-test, we test whether a population mean \(\mu\) equals some hypothesized value \(\mu_0\).

\[H_0: \mu = \mu_0 \qquad H_a: \mu \neq \mu_0\]

The test statistic is:

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]

where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(n\) is the sample size.

Under \(H_0\), and assuming the observations are independent draws from an approximately normal population, \(t\) follows a t-distribution with \(n - 1\) degrees of freedom.
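
To see the formula at work, the sketch below computes \(t\) directly from its definition and compares it with the statistic reported by base R's `t.test()`. The helper name `t_statistic`, the seed, and the simulated sample are assumptions made for this illustration.

```r
# Test statistic for a one-sample t-test, written directly from the formula
# t = (x_bar - mu0) / (s / sqrt(n)).
t_statistic <- function(x, mu0) {
  n <- length(x)
  (mean(x) - mu0) / (sd(x) / sqrt(n))
}

set.seed(1)                            # arbitrary seed, for reproducibility only
x <- rnorm(15, mean = 10, sd = 2)      # simulated sample (illustrative)

t_manual  <- t_statistic(x, mu0 = 10)
t_builtin <- unname(t.test(x, mu = 10)$statistic)
df        <- length(x) - 1             # degrees of freedom, n - 1

all.equal(t_manual, t_builtin)         # the two computations should agree
```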

The p-value

The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming \(H_0\) is true:

\[p\text{-value} = P(|T| \geq |t_{\text{obs}}| \mid H_0 \text{ is true})\]

\[\text{Reject } H_0 \text{ if } p\text{-value} < \alpha\]

A small p-value means data this extreme would be unlikely if \(H_0\) were true, which counts as evidence against \(H_0\). A large p-value means the data are consistent with \(H_0\); it does not show that \(H_0\) is true.

Important: The p-value is NOT the probability that \(H_0\) is true.
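
The two-sided definition above translates into a tail-area calculation. A minimal sketch, assuming the test statistic follows a t-distribution under \(H_0\) (the function name `two_sided_p` is chosen here for illustration; `pt()` is base R's t-distribution CDF):

```r
# Two-sided p-value: P(|T| >= |t_obs|) when T follows a t-distribution
# with `df` degrees of freedom, i.e. under H0.
two_sided_p <- function(t_obs, df) {
  2 * pt(abs(t_obs), df = df, lower.tail = FALSE)
}

two_sided_p(0, df = 10)      # t_obs = 0 is the least extreme value, so p = 1
two_sided_p(2.5, df = 10)    # identical to two_sided_p(-2.5, df = 10) by symmetry
```

Because the t-distribution is symmetric about zero, doubling one tail gives the same result as adding the two tails separately.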

Visualizing the p-value
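
The figure for this slide is not reproduced here. A base-R sketch of one such picture follows, assuming a t-distribution with 19 degrees of freedom and an observed statistic of 2.0 (both values chosen here for illustration, not taken from the original figure). It shades the two tails whose combined area is the p-value.

```r
df    <- 19     # assumed degrees of freedom (illustrative)
t_obs <- 2.0    # assumed observed statistic (illustrative)

# Draw the density of T under H0.
curve(dt(x, df = df), from = -4, to = 4,
      xlab = "t", ylab = "density", main = "Two-sided p-value as tail area")

# Shade the region under the density between `from` and `to`.
shade_tail <- function(from, to) {
  xs <- seq(from, to, length.out = 200)
  polygon(c(from, xs, to), c(0, dt(xs, df = df), 0), col = "grey70", border = NA)
}
shade_tail(-4, -abs(t_obs))
shade_tail(abs(t_obs), 4)
abline(v = c(-1, 1) * abs(t_obs), lty = 2)

# The shaded area, extended to infinity on both sides, equals the p-value.
p_val <- 2 * pt(-abs(t_obs), df = df)
```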

A Worked Example

Suppose a trainer claims athletes sleep 8 hours per night on average. We sample 20 athletes and find \(\bar{x} = 7.2\) hours, \(s = 1.5\) hours. Test \(H_0: \mu = 8\) against \(H_a: \mu \neq 8\) at \(\alpha = 0.05\).

\[t = \frac{7.2 - 8}{1.5 / \sqrt{20}} = \frac{-0.8}{0.335} \approx -2.39\]
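
As a check on the arithmetic above, the same statistic and its two-sided p-value can be computed from the summary values alone. This is a sketch; the variable names are chosen here for illustration.

```r
x_bar <- 7.2; s <- 1.5; n <- 20; mu0 <- 8   # summary values from the example above

se    <- s / sqrt(n)                 # standard error of the mean
t_obs <- (x_bar - mu0) / se          # test statistic
df    <- n - 1                       # degrees of freedom
p_val <- 2 * pt(-abs(t_obs), df)     # two-sided p-value under H0

cat("SE =", round(se, 3), " t =", round(t_obs, 2), " df =", df,
    " p-value =", signif(p_val, 2), "\n")
```

With 19 degrees of freedom the two-sided 5% critical value is about 2.09, so \(|t| \approx 2.39\) falls in the rejection region.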

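# Simulate one sample of 20 sleep durations (hours). This random draw will not
# reproduce the summary values above (mean 7.2, sd 1.5) exactly.
set.seed(42)
sleep_data <- rnorm(20, mean = 7.2, sd = 1.5)
# Two-sided one-sample t-test of H0: mu = 8
result <- t.test(sleep_data, mu = 8)
# Report the test statistic, p-value, and 95% confidence interval
cat("t =", round(result$statistic, 3),
    "\np-value =", round(result$p.value, 4),
    "\n95% CI: [", round(result$conf.int[1], 2),
    ",", round(result$conf.int[2], 2), "]")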

## t = -1.163 
## p-value = 0.2592 
## 95% CI: [ 6.57 , 8.41 ]

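For the summary statistics, \(t \approx -2.39\) with 19 degrees of freedom gives a two-sided p-value of roughly 0.03. Since this is below 0.05, we reject \(H_0\): the evidence suggests the mean differs from 8 hours, with the sample mean falling below it. The simulated sample in the R output is a different random draw. Its p-value is 0.2592 and its 95% CI contains 8, so for that sample we would fail to reject \(H_0\).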

Distribution of p-values Under \(H_0\) and \(H_a\)

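Under \(H_0\), p-values from a continuous test statistic are uniformly distributed on \([0, 1]\), so a fraction \(\alpha\) of them fall below \(\alpha\) by chance alone. Under \(H_a\), they pile up near zero, and the proportion that falls below \(\alpha\) is the power of the test.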
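
A small simulation can make this concrete. The sketch below assumes normally distributed data, 20 observations per experiment, and a true effect of 0.5 standard deviations under \(H_a\). These settings, the seed, and the helper name `sim_p` are illustrative choices, not values from the text.

```r
set.seed(123)          # arbitrary seed so the simulation is repeatable
n_sims <- 5000         # number of simulated experiments (illustrative)
n      <- 20           # sample size per experiment (illustrative)
alpha  <- 0.05

# p-values from one-sample t-tests of H0: mu = 0 on normal data with the given mean.
sim_p <- function(true_mean) {
  replicate(n_sims, t.test(rnorm(n, mean = true_mean, sd = 1), mu = 0)$p.value)
}

p_null <- sim_p(0)     # H0 is true
p_alt  <- sim_p(0.5)   # H0 is false: the true mean is 0.5 sd away from 0

mean(p_null < alpha)   # rejection rate under H0, expected to be close to alpha
mean(p_alt  < alpha)   # rejection rate under Ha, an estimate of the power

# Optional: hist(p_null) should look roughly flat, hist(p_alt) skewed toward 0.
```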

Power: Effect Size × Sample Size (3D)
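
The surface for this slide is not reproduced here. As a rough stand-in, power can be tabulated over a grid of effect sizes and sample sizes with `power.t.test()` from base R's stats package. The grid values below, and the choice of a two-sided one-sample test at \(\alpha = 0.05\), are assumptions made for illustration.

```r
effect_sizes <- c(0.2, 0.5, 0.8)      # standardized effect sizes (illustrative)
sample_sizes <- c(10, 20, 50, 100)    # sample sizes (illustrative)

# Power of a two-sided one-sample t-test at alpha = 0.05 for each combination.
power_grid <- outer(effect_sizes, sample_sizes, Vectorize(function(d, n) {
  power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
               type = "one.sample", alternative = "two.sided")$power
}))
dimnames(power_grid) <- list(paste0("d=", effect_sizes),
                             paste0("n=", sample_sizes))
round(power_grid, 2)

# One way to draw a 3D view of the same grid:
# persp(effect_sizes, sample_sizes, power_grid)
```

Reading along a row shows power rising with sample size for a fixed effect; reading down a column shows it rising with effect size for a fixed sample size.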

Type I & Type II Errors

Decision	\(H_0\) is True	\(H_0\) is False
Reject \(H_0\)	Type I Error (\(\alpha\))	Correct ✓ (Power)
Fail to Reject \(H_0\)	Correct ✓	Type II Error (\(\beta\))
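
In symbols, the error rates in the table are conditional probabilities (a restatement of the table, using the same \(\alpha\) and \(\beta\)):

\[\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true}) \qquad \beta = P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) \qquad \text{Power} = 1 - \beta\]

The probabilities in each column therefore sum to one: \(1 - \alpha\) is the probability of correctly retaining a true \(H_0\), and \(1 - \beta\) is the probability of correctly rejecting a false one.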

Key Takeaways

Hypothesis testing formalizes how we use data to challenge a default assumption
The p-value measures evidence against \(H_0\) — it is NOT the probability \(H_0\) is true
Choosing \(\alpha = 0.05\) means accepting a 5% chance of a false positive (Type I error) when \(H_0\) is true
Power (\(1 - \beta\)) increases with larger sample sizes and bigger effect sizes
Statistical significance \(\neq\) practical significance — always report effect sizes
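
As an illustration of the last point, one common standardized effect size for a one-sample comparison is Cohen's \(d\) (a measure not named above, introduced here only as an example). For the sleep example:

\[d = \frac{\bar{x} - \mu_0}{s} = \frac{7.2 - 8}{1.5} \approx -0.53\]

The sample mean sits about half a standard deviation below the claimed 8 hours, which conveys the size of the difference in a way the p-value alone does not.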

“Statistical significance is the least interesting thing about the results.” — Andrew Gelman