An Introduction to P-Values

2024-03-11

Quick Overview of Hypothesis Testing

Hypothesis testing: Make and test a claim about your data
Null hypothesis: There is no difference (or no effect)
Alternate hypothesis: There is a difference (or an effect)
The null and alternate hypotheses depend largely on the nature of your claim

Example: From the “airquality” dataset, I hypothesize the mean temperature is 80 degrees.

Null hypothesis \(H_0: \mu = 80^{\circ}\)

Alternate hypothesis \(H_a: \mu \neq 80^{\circ}\)

P-Values

P-value: The probability of observing results at least as extreme as ours, assuming the null hypothesis is true
A lower P-value means the data are less compatible with the null hypothesis, i.e. stronger evidence against it
By convention, we reject the null hypothesis in favor of the alternate hypothesis when \(p < 0.05\)

Mathematically, for a two-tailed test the p-value can be expressed as follows: \[p = P(|T| \geq |t_{obs}| \mid H_0)\] where \(T\) is the test statistic, \(t_{obs}\) is its observed value, and \(H_0\) is the null hypothesis.
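One way to make this definition concrete is to simulate it. The sketch below is an illustration only, not part of the analysis in these slides: it assumes the null hypothesis holds, repeatedly draws samples from a normal population with mean 80, and counts how often the resulting z-score is at least as extreme as a chosen observed value. The helper name `simulate_p_value`, the sample size, the standard deviation, and the observed z-score of 2 are all assumptions made for the example.

```r
# Hypothetical helper: estimate a two-sided p-value by simulation.
# All default arguments are illustrative assumptions, not values from the slides.
simulate_p_value <- function(z_obs, mu0 = 80, sigma = 10, n = 100, reps = 20000) {
  z_sim <- replicate(reps, {
    x <- rnorm(n, mean = mu0, sd = sigma)   # one sample drawn with H0 true
    (mean(x) - mu0) / (sd(x) / sqrt(n))     # z-score of that sample
  })
  mean(abs(z_sim) >= abs(z_obs))            # share at least as extreme as z_obs
}

set.seed(1)
p_sim <- simulate_p_value(z_obs = 2)
p_exact <- 2 * pnorm(-2)   # normal-theory two-sided p-value, for comparison
c(simulated = p_sim, exact = p_exact)
```

The two numbers should be close but not identical, because of simulation noise and because each simulated sample estimates its own standard deviation.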

Temperature in “airquality” Data

This is a plot of the temperature over the months May through September.
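The plot itself is not reproduced in this text. A minimal base R sketch of one way to draw such a plot is shown here; the exact plot type, labels, and the dashed reference line at 80 are assumptions, not the original figure. It relies only on the built-in `airquality` data, which were recorded from May to September 1973.

```r
# Built-in dataset; Temp is the daily maximum temperature in degrees Fahrenheit
data(airquality)

# Build a Date column from Month and Day (the measurements are from 1973)
dates <- as.Date(ISOdate(1973, airquality$Month, airquality$Day))

plot(dates, airquality$Temp, type = "l",
     xlab = "Date", ylab = "Temperature (degrees F)",
     main = "Daily temperature, May to September")
abline(h = 80, lty = 2)  # hypothesized mean of 80, for reference
```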

Temperature Distribution

This is a histogram of the temperatures. This suggests that we may assume that the temperature is approximately normally distributed.
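The histogram is likewise not reproduced here. The following sketch, with assumed labels and default binning, shows how a comparable histogram could be drawn, with a normal curve overlaid using the sample mean and standard deviation as a rough visual check of the normality assumption.

```r
data(airquality)
temp <- airquality$Temp

# Density-scaled histogram so that a normal curve can be overlaid
h <- hist(temp, freq = FALSE,
          xlab = "Temperature (degrees F)",
          main = "Distribution of daily temperature")

# Normal density with the sample mean and sample standard deviation
curve(dnorm(x, mean = mean(temp), sd = sd(temp)), add = TRUE, lty = 2)
```

A visual check like this is informal; it supports, but does not prove, approximate normality.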

Test Statistic Calculation

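For our dataset, the population variance is unknown, but the sample size is large (\(n = 153\)), so we can substitute the sample standard deviation \(s\) for \(\sigma\) and use a two-tailed Z-test.

The test statistic for the Z-test has formula \(z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\), where \(\mu_0 = 80\) is the hypothesized mean.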

We calculate the Z-score as follows:

avgtemp = mean(airquality$Temp)
sdtemp = sd(airquality$Temp)

z = (avgtemp - 80) / (sdtemp / sqrt(nrow(airquality)))
z

## [1] -2.767364

P-value Calculation

For other test statistics, the p-value can be found from published tables or from the corresponding distribution function. For a Z-test, we can calculate it directly by integrating the lower tail of the standard normal distribution (which is what pnorm does), then multiplying by two.

p = 2 * pnorm(-abs(z))
p

## [1] 0.00565116
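As a cross-check that is not part of the original analysis, base R's `t.test` runs the closely related one-sample t-test. It uses the same statistic but refers it to a t distribution instead of the normal, so with 153 observations the two p-values should be close, with the t-test's slightly larger.

```r
data(airquality)

# One-sample, two-sided t-test of H0: mu = 80
tt <- t.test(airquality$Temp, mu = 80, alternative = "two.sided")

# Z-test p-value computed as in the text, for comparison
z <- (mean(airquality$Temp) - 80) / (sd(airquality$Temp) / sqrt(nrow(airquality)))
p_z <- 2 * pnorm(-abs(z))

c(t_statistic = unname(tt$statistic), z_statistic = z,
  p_t = tt$p.value, p_z = p_z)
```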

Graphical P-value

To calculate the p-value, the red areas are integrated, with bounds at the Z-scores on both sides of the distribution.
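The figure is not reproduced in this text. The sketch below shows one way such a figure could be drawn in base R and confirms numerically that the shaded area matches the p-value; the plotting range and styling are assumptions.

```r
data(airquality)
z <- (mean(airquality$Temp) - 80) / (sd(airquality$Temp) / sqrt(nrow(airquality)))
zc <- abs(z)

# Standard normal density over an assumed plotting range of -4 to 4
x <- seq(-4, 4, length.out = 801)
plot(x, dnorm(x), type = "l", xlab = "z", ylab = "Density",
     main = "Two-tailed p-value")

# Shade both tails beyond the observed |z|
lower <- x[x <= -zc]
upper <- x[x >= zc]
polygon(c(lower, -zc, -zc, min(x)), c(dnorm(lower), dnorm(-zc), 0, 0),
        col = "red", border = NA)
polygon(c(zc, upper, max(x), zc), c(dnorm(zc), dnorm(upper), 0, 0),
        col = "red", border = NA)

# Numerically integrate the two tails; this should agree with 2 * pnorm(-zc)
area <- integrate(dnorm, -Inf, -zc)$value + integrate(dnorm, zc, Inf)$value
area
```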

Conclusion

We obtain a p-value of \(p = 0.0057\).

In terms of our data, this represents \(P(|\bar{x} - 80| \geq |77.88 - 80| \mid \mu = 80)\): the probability of a sample mean at least as far from 80 as the observed 77.88, if the population mean were really 80.

Since \(p < 0.05\), a sample mean this far from 80 would be very unlikely if the population mean were 80, so we reject the null hypothesis in favor of the alternate hypothesis. Our data therefore give evidence that the population mean differs from 80.