Understanding P-Values

What is a P-Value?

When we run a statistical test, we start with a null hypothesis \(H_0\) — basically an assumption that nothing interesting is going on.

The p-value is the probability of getting results at least as extreme as what we observed, assuming the null hypothesis is true.

Small p-value → the data is unlikely under \(H_0\) → evidence against \(H_0\)
Large p-value → the data is consistent with \(H_0\) → not enough evidence to reject

A common threshold is \(\alpha = 0.05\). If \(p < \alpha\), we reject \(H_0\).

Important: A p-value is NOT the probability that \(H_0\) is true.

Hypothesis Testing Setup

For a one-sample test about the population mean \(\mu\), we set up:

\[H_0: \mu = \mu_0 \quad \text{vs} \quad H_a: \mu \neq \mu_0\]

The test statistic for a known \(\sigma\) is:

\[z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]

Under \(H_0\), this follows a standard normal distribution: \(z \sim \mathcal{N}(0, 1)\).

For a two-sided test, the p-value is:

\[p\text{-value} = 2 \cdot P(Z \geq |z|) = 2\left(1 - \Phi(|z|)\right)\]

where \(\Phi\) is the CDF of the standard normal distribution.
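As a minimal sketch of the formula above, the two-sided p-value can be computed by hand with pnorm(), R's standard normal CDF. The inputs here (x_bar, mu0, sigma, n) are made-up values chosen only for illustration; they are not taken from the sleep example.

```r
# Hypothetical inputs, chosen only to illustrate the formula
x_bar <- 6.5   # sample mean
mu0   <- 7     # mean under H0
sigma <- 1.2   # known population standard deviation
n     <- 30    # sample size

# z statistic: distance of x_bar from mu0 in standard-error units
z <- (x_bar - mu0) / (sigma / sqrt(n))

# Two-sided p-value: 2 * (1 - Phi(|z|))
p_two_sided <- 2 * (1 - pnorm(abs(z)))

cat("z =", round(z, 3), "\n")
cat("two-sided p-value =", round(p_two_sided, 4), "\n")
```

Taking the absolute value of z before calling pnorm() is what makes the result the same whether the sample mean falls above or below \(\mu_0\).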

Example: Testing Average Sleep

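Suppose a study says the average American gets 7 hours of sleep. We survey 30 college students and want to test whether they get less. Because the population standard deviation \(\sigma\) is unknown here, we estimate it from the sample and use a one-sample t-test instead of the z-test.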

\[H_0: \mu = 7 \quad \text{vs} \quad H_a: \mu < 7\]

set.seed(42)
# simulate 30 students, true mean around 6.3 hrs
sleep_data <- rnorm(30, mean = 6.3, sd = 1.2)

result <- t.test(sleep_data, mu = 7, alternative = "less")
cat("Sample mean:", round(mean(sleep_data), 2), "hours\n")

## Sample mean: 6.38 hours

cat("t-statistic:", round(result$statistic, 3), "\n")

## t-statistic: -2.246

cat("p-value:", round(result$p.value, 4))

## p-value: 0.0162

Since \(p < 0.05\), we reject \(H_0\): the sample gives evidence that college students sleep less than 7 hours on average.
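To see what t.test() is doing, the same statistic and p-value can be rebuilt by hand. This is a sketch that reuses the simulated sleep_data from above; the names n, x_bar, s, t_stat, and p_left are introduced here for illustration.

```r
set.seed(42)
sleep_data <- rnorm(30, mean = 6.3, sd = 1.2)

n     <- length(sleep_data)
x_bar <- mean(sleep_data)
s     <- sd(sleep_data)   # sample standard deviation stands in for sigma

# t statistic: same form as z, with s in place of sigma
t_stat <- (x_bar - 7) / (s / sqrt(n))

# Left-tailed p-value from the t distribution with n - 1 degrees of freedom
p_left <- pt(t_stat, df = n - 1)

# Check both values against t.test()
result <- t.test(sleep_data, mu = 7, alternative = "less")
all.equal(unname(result$statistic), t_stat)
all.equal(result$p.value, p_left)
```

Both all.equal() calls should return TRUE, since t.test() uses this same formula for a one-sample test.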

Visualizing the P-Value

The red shaded area represents the p-value — the probability of getting a test statistic this extreme or more if \(H_0\) were true.
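One way to draw a picture like this is sketched below in base R graphics. It is not necessarily how the original figure was made; it assumes the t-statistic of -2.246 reported in the sleep example and 29 degrees of freedom, and the variable names are illustrative.

```r
t_obs <- -2.246   # t-statistic reported in the sleep example
df    <- 29       # n - 1 for n = 30

# Density of the t distribution under H0
t_grid <- seq(-4, 4, length.out = 400)
dens   <- dt(t_grid, df = df)

plot(t_grid, dens, type = "l",
     xlab = "t", ylab = "Density",
     main = "Left-tail p-value under H0")

# Shade the region at or below the observed statistic
tail_x <- c(t_grid[t_grid < t_obs], t_obs)
polygon(c(tail_x[1], tail_x, t_obs),
        c(0, dt(tail_x, df = df), 0),
        col = "red", border = NA)
abline(v = t_obs, lty = 2)

# Area of the whole left tail (the plot only shows it from t = -4)
p_area <- pt(t_obs, df = df)
p_area
```

The value of p_area should agree with the p-value from t.test() up to the rounding of the t-statistic.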

What Significance Level Means

In a left-tailed test, a smaller \(\alpha\) pushes the critical value further left, which makes it harder to reject \(H_0\). In other words, a smaller \(\alpha\) means we need stronger evidence.
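A quick sketch of this relationship, assuming the left-tailed t-test with 29 degrees of freedom from the sleep example: qt() returns the critical value below which we would reject \(H_0\) at each level. The three \(\alpha\) values are chosen for illustration.

```r
df     <- 29                    # n - 1 for the sleep example
alphas <- c(0.10, 0.05, 0.01)   # illustrative significance levels

# Left-tail critical values: reject H0 when the t-statistic is below crit
crit <- qt(alphas, df = df)

data.frame(alpha = alphas, critical_t = round(crit, 3))
```

The observed t-statistic of -2.246 lies below the critical value at \(\alpha = 0.05\) (about -1.699) but not below the one at \(\alpha = 0.01\) (about -2.462), so the same data would lead to rejection at the 5% level but not at the 1% level.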

Plotly: Simulating Many P-Values

What happens if we run the same test many times? Let’s simulate 500 experiments and see the distribution of p-values.

library(plotly)  # provides plot_ly(), layout(), and the %>% pipe

set.seed(99)
pvals <- replicate(500, {
  samp <- rnorm(30, mean = 6.3, sd = 1.2)
  t.test(samp, mu = 7, alternative = "less")$p.value
})

plot_ly(x = pvals, type = "histogram",
        marker = list(color = "#3498db", 
                      line = list(color = "white", width = 0.5)),
        nbinsx = 30) %>%
  layout(xaxis = list(title = "P-Value"),
         yaxis = list(title = "Count"),
         shapes = list(
           list(type = "line", x0 = 0.05, x1 = 0.05,
                y0 = 0, y1 = 120,
                line = list(color = "#e74c3c", 
                            dash = "dash", width = 2))))

Power of a Test

In the histogram, the red dashed line marks \(\alpha = 0.05\). P-values to the left of it lead to rejecting \(H_0\).

# what fraction of our 500 simulations rejected H0?
power <- mean(pvals < 0.05)
cat("Proportion of simulations that rejected H0:", power, "\n")

## Proportion of simulations that rejected H0: 0.914

cat("This is an estimate of the test's power.")

## This is an estimate of the test's power.

Power is the probability of correctly rejecting \(H_0\) when it’s actually false. Here the true mean is 6.3 (not 7), so \(H_0\) is false, and our test detects it about 91% of the time.
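As a cross-check on the simulation, power can also be computed analytically with power.t.test() from base R's stats package. This sketch assumes the same setup as the simulation (n = 30, true mean 6.3, sd 1.2, one-sided test at \(\alpha = 0.05\)); the name pw is illustrative.

```r
# Analytic power for a one-sided, one-sample t-test
pw <- power.t.test(n = 30,
                   delta = 7 - 6.3,   # size of the true shift from mu0
                   sd = 1.2,
                   sig.level = 0.05,
                   type = "one.sample",
                   alternative = "one.sided")
pw$power
```

The analytic value should land close to the simulated proportion. The two will not match exactly, because 500 simulated experiments carry some Monte Carlo error.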

Common Mistakes with P-Values

A few things people get wrong:

“The p-value is the probability that \(H_0\) is true” — No. It’s the probability of the observed data (or more extreme) given that \(H_0\) is true. Formally:

\[p = P(\text{data at least this extreme} \mid H_0) \quad \neq \quad P(H_0 \mid \text{data})\]

“A smaller p-value means a bigger effect” — Not necessarily. P-values depend on sample size. With a huge sample, even tiny effects give small p-values.

“p > 0.05 means there’s no effect” — No. It means we didn’t find enough evidence. Absence of evidence is not evidence of absence.
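The point about sample size can be made concrete with a small sketch. The summary statistics below are hypothetical: the same 0.05-hour (3-minute) shortfall is held fixed while only n changes.

```r
# Hypothetical summary statistics: a tiny, fixed shortfall of 0.05 hours
x_bar <- 6.95
mu0   <- 7
s     <- 1.2
n     <- c(30, 300, 3000, 30000)

# Same effect size at every n; only the standard error shrinks
t_stat <- (x_bar - mu0) / (s / sqrt(n))
p_left <- pt(t_stat, df = n - 1)

data.frame(n = n, t = round(t_stat, 2), p_value = signif(p_left, 3))
```

The effect never changes, yet the p-value falls steadily as n grows. A small p-value alone therefore says little about how large the effect is.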

R Code: Running a t-Test

Here’s the code for the sleep example — it’s pretty straightforward:

# generate some fake sleep data
set.seed(42)
sleep_data <- rnorm(30, mean = 6.3, sd = 1.2)

# one-sample t-test
# H0: mu = 7 vs Ha: mu < 7
result <- t.test(sleep_data, mu = 7, 
                 alternative = "less")

# check the results
result$statistic  # t-statistic
result$p.value    # p-value
result$conf.int   # confidence interval

The t.test() function handles everything — calculates the test statistic, degrees of freedom, and p-value for you.

Summary

The p-value measures how surprising our data would be if \(H_0\) were true
We reject \(H_0\) when \(p < \alpha\) (commonly \(\alpha = 0.05\))
In our sleep example, \(p \approx 0.016\), so we found evidence that college students average less than 7 hours of sleep
Simulating many tests showed the connection between p-values and statistical power
P-values are useful but they’re often misinterpreted — be careful with what they actually mean