In this notebook we will investigate the trade-off between bias and variance. We will simulate data according to a linear model, \(y=5+x+Normal(0,1)\), and fit three different models to that data. The first model that we will be fitting is correct: \(y=a+bx+\epsilon\). The second model is underparameterized: \(y=ax+\epsilon\). The third model is overparameterized: \(y=a+bx+cx²+\epsilon\).
We generate data according to \(y=5+x+Normal(0,1)\).
linearComponent <- 1:30
base <- 5 + linearComponent
d <-c()
for (i in 1:30) {
d <- c(d, rnorm(n=1, mean=base[i], sd=1))
}
summary(d)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.751 12.352 19.545 20.399 28.086 34.613
plot(1:30, d, pch=20, ylim=c(1,30), xlab="x", ylab="y")
abline(a=5, b=1, col="grey", lwd=2, lty=2)
legend("bottomright", lwd=c(2), lty=c(2), col=c("grey"), legend=c("True model"))
Here we are trying to fit the correct model to the data: estimation of the parameters should be easy.
model1 <- lm (d~linearComponent)
summary(model1)
Call:
lm(formula = d ~ linearComponent)
Residuals:
Min 1Q Median 3Q Max
-1.34710 -0.53114 0.01659 0.72953 1.17090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.74050 0.30465 15.56 2.6e-15 ***
linearComponent 1.01025 0.01716 58.87 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8135 on 28 degrees of freedom
Multiple R-squared: 0.992, Adjusted R-squared: 0.9917
F-statistic: 3466 on 1 and 28 DF, p-value: < 2.2e-16
confint(model1, level=0.95)
2.5 % 97.5 %
(Intercept) 4.1164492 5.364549
linearComponent 0.9750933 1.045397
plot(1:30, d, pch=20, ylim=c(1,30), xlab="x", ylab="y")
abline(a=5, b=1, col="grey", lwd=2, lty=2)
abline(a=model1$coefficients[1], b=model1$coefficients[2], col="red")
conf_interval <- predict(model1, newdata=data.frame(x=linearComponent), interval="confidence",
level = 0.95)
matlines(linearComponent, conf_interval[,2:3], col = "blue", lty=2)
legend("bottomright", lwd=c(2, 1), lty=c(2, 1), col=c("grey", "red", "blue"), legend=c("True model", "Estimated model", "95% confidence interval"))
The slope parameter is well estimated (its value is now \(~1.01\) instead of \(1.0\)). Its confidence interval at \(95%\) is not very large: \(~0.07\).
We fit a model in which we force the intercept to be 0: compared to the true model, it lacks one parameter and thus cannot correctly fit the data.
model0 <- lm (d~linearComponent+0)
summary(model0)
Call:
lm(formula = d ~ linearComponent + 0)
Residuals:
Min 1Q Median 3Q Max
-2.6886 -0.5252 0.9046 2.8900 5.5398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
linearComponent 1.24338 0.02554 48.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.483 on 29 degrees of freedom
Multiple R-squared: 0.9879, Adjusted R-squared: 0.9875
F-statistic: 2371 on 1 and 29 DF, p-value: < 2.2e-16
confint(model0, level=0.95)
2.5 % 97.5 %
linearComponent 1.191159 1.29561
plot(1:30, d, pch=20, ylim=c(1,30), xlab="x", ylab="y")
abline(a=5, b=1, col="grey", lwd=2, lty=2)
abline(a=0, b=model0$coefficients, col="red")
conf_interval <- predict(model0, newdata=data.frame(x=linearComponent), interval="confidence",
level = 0.95)
matlines(linearComponent, conf_interval[,2:3], col = "blue", lty=2)
legend("bottomright", lwd=c(2, 1), lty=c(2, 1), col=c("grey", "red", "blue"), legend=c("True model", "Estimated model", "95% confidence interval"))
We can see that because the model is underparametrized, our estimate of the slope is biased (\(~1.2\) instead of \(1.0\)). The confidence interval at \(95%\) is not very large though: \(~0.1\).
Our model is over-parametrized because it includes a quadratic term.
model2 <- lm (d~linearComponent+I(linearComponent^2))
summary(model2)
Call:
lm(formula = d ~ linearComponent + I(linearComponent^2))
Residuals:
Min 1Q Median 3Q Max
-1.3206 -0.5353 -0.0084 0.7340 1.2289
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8723787 0.4846814 10.053 1.27e-10 ***
linearComponent 0.9855178 0.0720733 13.674 1.18e-13 ***
I(linearComponent^2) 0.0007977 0.0022559 0.354 0.726
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8266 on 27 degrees of freedom
Multiple R-squared: 0.992, Adjusted R-squared: 0.9914
F-statistic: 1679 on 2 and 27 DF, p-value: < 2.2e-16
confint(model2, level=0.95)
2.5 % 97.5 %
(Intercept) 3.87789460 5.866862750
linearComponent 0.83763551 1.133400016
I(linearComponent^2) -0.00383105 0.005426369
plot(1:30, d, pch=20, ylim=c(1,30), xlab="x", ylab="y")
abline(a=5, b=1, col="grey", lwd=2, lty=2)
abline(a=model2$coefficients[1], b=model2$coefficients[2], col="red")
conf_interval <- predict(model2, newdata=data.frame(x=linearComponent), interval="confidence",
level = 0.95)
matlines(linearComponent, conf_interval[,2:3], col = "blue", lty=2)
legend("bottomright", lwd=c(2, 1), lty=c(2, 1), col=c("grey", "red", "blue"), legend=c("True model", "Estimated model", "95% confidence interval"))
Adding the quadratic term has not really biased the estimate on the slope compared to the estimate in the correct model: its estimate is \(~0.99\), not far from \(1.0\). The confidence interval at \(95%\) however has widened a lot: \(~0.3\) (the same observation can be made for the estimate of the intercept parameter).
Here we saw that the under-parameterized model introduced bias in the estimation of the slope, and that the overparameterized model increased the uncertainty we had about our estimate of the slope.