Summary

I show two simulated examples where relying only on the p-value < 0.05 threshold (statistical significance) leads to poor conclusions.

Very low p-value, worthless predictive model

set.seed(123)
nobs <- 1000000 # with a large enough sample, even a very weak true signal yields a tiny p-value

x <- rnorm(nobs)
y <- 50 + 2*x + rnorm(nobs, 0, 100) # true slope is 2; noise sd (100) swamps the signal

df <- data.frame(x=x, y=y)

lm.mod <- lm(y ~ x, data=df)

summary(lm.mod)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -458.92  -67.44    0.03   67.39  455.09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.84441    0.09998  498.57   <2e-16 ***
## x            1.91856    0.09998   19.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99.98 on 999998 degrees of freedom
## Multiple R-squared:  0.0003681,  Adjusted R-squared:  0.0003671 
## F-statistic: 368.2 on 1 and 999998 DF,  p-value: < 2.2e-16

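The coefficient for x is highly statistically significant (p < 2e-16). Practically, this doesn’t mean much. The \(R^2\) is about 0.04%, so x explains almost none of the variation in y, and the residual standard error (99.98) is essentially the noise standard deviation of 100.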
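To make "doesn't mean much" concrete, here is a minimal sketch that rebuilds the same simulation and compares the model's in-sample prediction error with that of simply predicting the mean of y. The names `r2_pop`, `rmse_model`, and `rmse_mean` are introduced here for illustration and are not part of the original analysis.

```r
# R^2 implied by the simulation settings: slope 2, sd(x) = 1, noise sd 100
r2_pop <- 2^2 / (2^2 + 100^2)
r2_pop  # 4 / 10004, roughly 0.0004

# Rebuild the same data and model as above
set.seed(123)
nobs <- 1000000
x <- rnorm(nobs)
y <- 50 + 2*x + rnorm(nobs, 0, 100)
df <- data.frame(x = x, y = y)
lm.mod <- lm(y ~ x, data = df)

# In-sample RMSE of the fitted model vs. always predicting mean(y)
rmse_model <- sqrt(mean(residuals(lm.mod)^2))
rmse_mean  <- sqrt(mean((df$y - mean(df$y))^2))
c(model = rmse_model, mean_only = rmse_mean)

# Relative reduction in RMSE from using x at all
1 - rmse_model / rmse_mean
```

Since the relative reduction in RMSE equals \(1 - \sqrt{1 - R^2}\), an \(R^2\) this small means the model improves on the mean-only prediction by only a small fraction of a percent, no matter how small the p-value is.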

High p-value, useful predictive model

set.seed(123)
nobs <- 20 # with a small sample, p-values can be large even when a real signal exists

x <- rnorm(nobs)
y <- 50 + 2*x + rnorm(nobs, 0, 5) # same true slope of 2, but much less noise (sd 5)

df <- data.frame(x=x, y=y)

lm.mod <- lm(y ~ x, data=df)

summary(lm.mod)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5615 -3.0093 -0.9287  4.1543  6.2955 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.7991     0.9598  51.883   <2e-16 ***
## x             1.6087     1.0013   1.607    0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.245 on 18 degrees of freedom
## Multiple R-squared:  0.1254, Adjusted R-squared:  0.07682 
## F-statistic: 2.581 on 1 and 18 DF,  p-value: 0.1256

The coefficient for x is not statistically significant even at the 10% level (p = 0.126), let alone at 5%. However, the model has some predictive power: the \(R^2\) is about 12.5%, and the estimated slope (1.61) is close to the true value of 2. Combined with a strong narrative, intuition, and policy implications, this may be a viable model despite being “statistically insignificant”.
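A confidence interval says more here than the p-value alone. The sketch below computes a 95% interval for the slope by hand from the estimate, standard error, and residual degrees of freedom reported in the summary output above. The variable names (`est`, `se`, `dof`, `ci`, `r2_pop`) are illustrative only.

```r
# Estimate, standard error, and residual df copied from the summary output above
est <- 1.6087
se  <- 1.0013
dof <- 18

# 95% confidence interval for the slope: estimate +/- t quantile * standard error
ci <- est + c(-1, 1) * qt(0.975, df = dof) * se
ci

# R^2 implied by the simulation settings: slope 2, sd(x) = 1, noise sd 5
r2_pop <- 2^2 / (2^2 + 5^2)
r2_pop  # 4 / 29, roughly 0.14
```

The interval runs from roughly -0.5 to 3.7. It contains zero, which is why the test fails to reject, but it also contains the true slope of 2, so the data are consistent with a real effect. Calling `confint(lm.mod)` on the fitted model should give essentially the same interval.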