Created on 23 June 2013
Revised on Sun Jun 23 22:01:59 2013
What is a P-value?
Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?
Approach:
Galton data
library(UsingR)
## Warning: package 'UsingR' was built under R version 3.0.1
## Loading required package: MASS
data(galton)
plot(galton$parent, galton$child, pch = 19, col = "blue")
lm1 <- lm(galton$child ~ galton$parent)
abline(lm1, col = "red", lwd = 3)
Null distribution
x <- seq(-20, 20, length = 100)
plot(x, dt(x, df = (928 - 2)), col = "blue", lwd = 3, type = "l")
Null distribution + observed statistic
x <- seq(-20, 20, length = 100)
plot(x, dt(x, df = (928 - 2)), col = "blue", lwd = 3, type = "l")
arrows(summary(lm1)$coeff[2, 3], 0.25, summary(lm1)$coeff[2, 3], 0, col = "red",
lwd = 4)
Calculating p-values
summary(lm1)
##
## Call:
## lm(formula = galton$child ~ galton$parent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.805 -1.366 0.049 1.634 5.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9415 2.8109 8.52 <2e-16 ***
## galton$parent 0.6463 0.0411 15.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.24 on 926 degrees of freedom
## Multiple R-squared: 0.21, Adjusted R-squared: 0.21
## F-statistic: 247 on 1 and 926 DF, p-value: <2e-16
A quick simulated example
set.seed(9898324)
yValues <- rnorm(10)
xValues <- rnorm(10)
lm2 <- lm(yValues ~ xValues)
summary(lm2)
##
## Call:
## lm(formula = yValues ~ xValues)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.546 -0.570 0.136 0.771 1.052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.310 0.351 0.88 0.40
## xValues 0.289 0.389 0.74 0.48
##
## Residual standard error: 0.989 on 8 degrees of freedom
## Multiple R-squared: 0.0644, Adjusted R-squared: -0.0525
## F-statistic: 0.551 on 1 and 8 DF, p-value: 0.479
x <- seq(-5, 5, length = 100)
plot(x, dt(x, df = (10 - 2)), col = "blue", lwd = 3, type = "l")
arrows(summary(lm2)$coeff[2, 3], 0.25, summary(lm2)$coeff[2, 3], 0, col = "red",
lwd = 4)
xCoords <- seq(-5, 5, length = 100)
plot(xCoords, dt(xCoords, df = (10 - 2)), col = "blue", lwd = 3, type = "l")
xSequence <- c(seq(summary(lm2)$coeff[2, 3], 5, length = 10), summary(lm2)$coeff[2,
3])
ySequence <- c(dt(seq(summary(lm2)$coeff[2, 3], 5, length = 10), df = 8), 0)
polygon(xSequence, ySequence, col = "red")
polygon(-xSequence, ySequence, col = "red")
Simulate a ton of data sets with no signal
set.seed(8323)
pValues <- rep(NA, 100)
for (i in 1:100) {
xValues <- rnorm(20)
yValues <- rnorm(20)
pValues[i] <- summary(lm(yValues ~ xValues))$coeff[2, 4]
}
hist(pValues, col = "blue", main = "", freq = F)
abline(h = 1, col = "red", lwd = 3)
set.seed(8323)
pValues <- rep(NA, 100)
for (i in 1:100) {
xValues <- rnorm(20)
yValues <- 0.2 * xValues + rnorm(20)
pValues[i] <- summary(lm(yValues ~ xValues))$coeff[2, 4]
}
hist(pValues, col = "blue", main = "", freq = F, xlim = c(0, 1))
abline(h = 1, col = "red", lwd = 3)
set.seed(8323)
pValues <- rep(NA, 100)
for (i in 1:100) {
xValues <- rnorm(100)
yValues <- 0.2 * xValues + rnorm(100)
pValues[i] <- summary(lm(yValues ~ xValues))$coeff[2, 4]
}
hist(pValues, col = "blue", main = "", freq = F, xlim = c(0, 1))
abline(h = 1, col = "red", lwd = 3)
Some typical values (single test)
In modern analyses, people generally report both the confidence interval and P-value. This is less true if many many hypotheses are tested.
How you interpret the results
summary(lm(galton$child ~ galton$parent))$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9415 2.81088 8.517 6.537e-17
## galton$parent 0.6463 0.04114 15.711 1.733e-49
A one inch increase in parental height is associated with a 0.77 inch increase in child's height (95% CI: 0.42-1.12 inches). This difference was statistically significant (P < 0.001 )