Suppose we have data where the response is only weakly correlated with the independent variable, e.g.
\(Y = 0.01X + \text{Norm}(0,1)\)
Note that in the equation above, Y really is a linear function of X plus noise, but the noise term has standard deviation 1, so its magnitude will almost always swamp the 0.01X contribution for any value of X we are likely to see.
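As a quick numeric check of that claim (a sketch, assuming \(X \sim \text{Norm}(0,1)\) as in the simulations below):
# The signal term 0.01*X has standard deviation about 0.01, while the noise
# term has standard deviation 1, roughly 100 times larger.
x = rnorm(1e6, 0, 1)
sd(0.01 * x)          # about 0.01
sd(rnorm(1e6, 0, 1))  # about 1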
Let's imagine we have 10 samples from such a data-generating process. Will the linear regression model be found significant?
# First, let's make some fake small data
N = 10
x_10 = rnorm(N, 0, 1)
y_10 = .01*x_10 + rnorm(N, 0, 1) # More noise than signal! The effect of X is on average about 100 times smaller than the noise
test_10 = data.frame(x = x_10, y = y_10)
model_10 = lm(data = test_10, y ~ x)
summary(model_10) # Not significant; in the run below the p-value is about 0.55
##
## Call:
## lm(formula = y ~ x, data = test_10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75339 -0.19427 0.03669 0.15135 0.78903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8300 0.1739 4.772 0.00141 **
## x 0.1850 0.2966 0.624 0.55006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4806 on 8 degrees of freedom
## Multiple R-squared: 0.0464, Adjusted R-squared: -0.0728
## F-statistic: 0.3893 on 1 and 8 DF, p-value: 0.5501
Above we saw that, with only ten samples, we could not pick up on the weak relationship between X and Y. What if we had 100,000 samples?
# Same code as before but with many more samples
N = 100000
x_100000 = rnorm(N, 0, 1)
y_100000 = .01*x_100000 + rnorm(N, 0, 1) # still far more noise than signal
test_100000 = data.frame(x = x_100000, y = y_100000)
model_100000 = lm(data = test_100000, y ~ x)
summary(model_100000) # Significant! In the run below the p-value is about 0.047
##
## Call:
## lm(formula = y ~ x, data = test_100000)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3224 -0.6737 -0.0020 0.6742 4.1326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001321 0.003168 0.417 0.6768
## x 0.006293 0.003162 1.990 0.0466 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.002 on 99998 degrees of freedom
## Multiple R-squared: 3.961e-05, Adjusted R-squared: 2.961e-05
## F-statistic: 3.961 on 1 and 99998 DF, p-value: 0.04656
In both cases the data is generated from the same weak linear probability model, \(Y = 0.01X + \text{Norm}(0,1)\). When only ten samples are seen, no significance is found, even though the data really does come from a linear model. When there are 100,000 samples, the coefficient on x is found to be statistically significant, and with even more data the p-value would shrink further. But the model is still not interesting: the correlation is extremely weak, and in both fits the signal is dominated by the noise.
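To see the effect of sample size directly, here is a small sketch (numbers will vary from run to run) that refits the same weak model for increasing N:
# As N grows the p-value typically shrinks toward zero, while R^2 stays essentially zero.
for (N in c(10, 1000, 100000)) {
  x = rnorm(N, 0, 1)
  y = .01*x + rnorm(N, 0, 1)
  s = summary(lm(y ~ x))
  f = s$fstatistic
  cat("N =", N, " p-value =", pf(f[1], f[2], f[3], lower.tail = FALSE),
      " R^2 =", s$r.squared, "\n")
}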
Be careful: in the real world almost everything has some sort of weak correlation! That does not mean a regression is appropriate or interesting. Do not present models that are statistically significant but practically pointless.
This problem is easy to guard against: always report the \(R^2\) alongside the p-value.
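For instance, both numbers can be pulled directly out of the fitted object (here reusing model_100000 from above):
# The R^2 tells the real story, even when the p-value looks impressive.
summary(model_100000)$r.squared          # essentially zero
f = summary(model_100000)$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE) # the overall p-value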
# Here we will find a "significant" model on purely random, uncorrelated data by checking many, many models.
N = 15
x_1 = rnorm(N, 0, 1)
x_2 = rnorm(N, 0, 1)
x_3 = rnorm(N, 0, 1)
x_4 = rnorm(N, 0, 1)
x_5 = rnorm(N, 0, 1)
x_6 = rnorm(N, 0, 1)
x_7 = rnorm(N, 0, 1)
x_8 = rnorm(N, 0, 1)
x_9 = rnorm(N, 0, 1)
y = rnorm(N, 0, 1)
test = data.frame(x_1 = x_1, x_2 = x_2, x_3 = x_3, x_4 = x_4, x_5 = x_5, x_6 = x_6, x_7 = x_7, x_8 = x_8, x_9 = x_9, y = y)
full_model = lm(data = test, y ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9) # terrible
library(leaps)
cand_model = regsubsets(data = test, y ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9)
plot(cand_model)
# Best model found (changes every time!)
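# (The formula hard-coded below was read off the regsubsets plot for one particular
#  run; coef(cand_model, id = 3) is one way to extract a chosen subset programmatically.)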
best_model = lm(data = test, y ~ x_4 + x_6 + x_7)
summary(best_model)
##
## Call:
## lm(formula = y ~ x_4 + x_6 + x_7, data = test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6603 -0.4137 -0.2552 0.3416 1.1896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.44418 0.20339 -2.184 0.0515 .
## x_4 -0.30066 0.32923 -0.913 0.3807
## x_6 0.01222 0.23543 0.052 0.9595
## x_7 0.27849 0.23138 1.204 0.2540
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6883 on 11 degrees of freedom
## Multiple R-squared: 0.2199, Adjusted R-squared: 0.007114
## F-statistic: 1.033 on 3 and 11 DF, p-value: 0.4156
# On other random draws this same search has turned up models reporting
# p-value: 0.0006073 and Adjusted R-squared: 0.7207 (on pure noise!)
In the example above I made absolutely, purely random data and then searched over the randomness for a model that looks significant. On many runs the search turns up something that looks extremely significant, like the p-value of 0.0006 quoted above. What happened here? Best subsets considers all \(2^9 = 512\) candidate models, and if you look at enough models you are bound to find one that looks good by pure chance. In practice this often manifests as people swapping features in and out until they find something significant!
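Here is a minimal sketch of the same multiple-testing effect. It is simplified: it checks 9 separate single-variable models instead of all \(2^9\) subsets, but the principle is identical.
# For each replication: generate pure-noise y and 9 pure-noise predictors, fit a
# single-variable regression for each predictor, and keep the smallest of the 9
# p-values. That smallest p-value falls below 0.05 far more often than 5% of the time.
n_reps = 2000
min_p = replicate(n_reps, {
  y = rnorm(15)
  min(sapply(1:9, function(j) {
    x = rnorm(15)
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  }))
})
mean(min_p < 0.05) # roughly 1 - 0.95^9, i.e. around 37%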
You have to be principled in your experimental design, and you have to trust that other analysts are being principled in theirs as well.
# Let's make two totally random (but linear!) datasets
x_1 = rnorm(5, 0, 1)
x_2 = rnorm(5, 0, 1)
training_set = data.frame(x = x_1, y = 2*x_1 + rnorm(5, 0, 1))
testing_set = data.frame(x = x_2, y = 2*x_2 + rnorm(5, 0, 1))
# We'll train our model on the first and use it on the second.
bad_model = lm(data = training_set, y ~ x + I(x^2) + I(x^3) + I(x^4))
summary(bad_model) # R^2 = 1, a perfect fit!
##
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4), data = training_set)
##
## Residuals:
## ALL 5 residuals are 0: no residual degrees of freedom!
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4749 NA NA NA
## x 3.2978 NA NA NA
## I(x^2) -5.9143 NA NA NA
## I(x^3) -1.1950 NA NA NA
## I(x^4) 3.4745 NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 4 and 0 DF, p-value: NA
# Let's plot it!
new_dat = data.frame(x = seq(-3,3,.01))
the_fit = predict(bad_model, new_dat)
plot(data = training_set, y ~ x, col = "blue", xlim = c(-3,3), ylim = c(-15, 15), xlab = "", ylab = "")
par(new = T)
plot(data = testing_set, y ~ x, col = "red", xlim = c(-3,3), ylim = c(-15, 15), xlab = "", ylab = "")
par(new = T)
plot(new_dat$x, the_fit, xlim = c(-3,3), ylim = c(-15, 15), type = "l", xlab = "", ylab = "")
# We see the fitted polynomial is very unnatural, and it does not explain the test data well at all!
Given random data, you can always find a (high-degree polynomial) model that explains every one of the data points exactly. This can be tempting when you really want to present a model with a good \(R^2\), but it is not real data analysis, it is just interpolation. Interpolation does not generalize well, and it has no merit for prediction!
### Solution
Higher-order terms must be carefully justified. After the midterm we'll see methods to mitigate generalization error (i.e. the model not working on data it wasn't trained on).
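As a minimal check of generalization error, we can reuse the training_set and testing_set generated above (numbers vary run to run, but the pattern is typical):
# Compare out-of-sample error of the degree-4 interpolating fit against a plain
# linear fit. The interpolating polynomial usually does far worse on the test set.
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))
simple_model = lm(data = training_set, y ~ x)
rmse(testing_set$y, predict(bad_model, testing_set))    # typically large
rmse(testing_set$y, predict(simple_model, testing_set)) # typically much smaller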