When we fit a statistical model to our data we have to balance the extent to which the model fits the observed data against the parsimony of the model. A more parsimonious model uses fewer parameters to describe the data, but it will also provide a worse fit. We need some balance here! But why is this the case? Why don’t we just pick a more complex model with more parameters?

Let’s look at an example:

set.seed(1)  # we all generate the same data (you can change this if you like)
N <- 10 # we have 10 data points
x <- 1:N # our x variable is a vector from 1 to 10
y <- 9 + -.5 * x + rnorm(N, 0, .75) # here we simulate the y values based on some linear regression model, and add some random error
plot(x, y) # and we plot our results

So, \(x\) and \(y\) are our data here. For example, \(x\) could be the number of alcoholic beverages someone drinks and \(y\) their score on a working memory test.

With the R code lm(y~poly(x,1)) \(^{1}\) we can fit a linear regression model. With the poly function we can easily include higher order polynomials and thereby make our model more complex. For example, with lm(y~poly(x,9)) we fit a model where we predict \(y\) with \(x\), \(x^2\), \(x^3\), up to \(x^9\). This polynomial regression model will show a perfect fit to the data, because we use 10 parameters (the intercept plus nine polynomial terms) to describe 10 data points!

  1. Note: this code fits the same model as we would fit with: lm(y ~ 1 + x) (how can you check?). The only difference is that the poly function transforms x such that, when we include higher order polynomials, all the transformed versions of x are uncorrelated with each other. For example, look at plot(x, poly(x, 1)).
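One way to check the equivalence is to compare the fitted values of both parameterizations, and to verify that the columns produced by poly are uncorrelated (a minimal sketch; the object names fit_raw and fit_poly1 are just illustrative):

fit_raw <- lm(y ~ 1 + x) # raw parameterization
fit_poly1 <- lm(y ~ poly(x, 1)) # orthogonal polynomial of degree 1
all.equal(fitted(fit_raw), fitted(fit_poly1)) # TRUE: same model, different parameterization
round(cor(poly(x, 3)), 10) # the polynomial columns are uncorrelated with each other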

Question 1

Fit both models (a linear model and a model with a ninth order polynomial) to the data (\(x\) and \(y\)). What is the explained variance in both models? Add two lines to the plot that show the predicted values for both models.

Answer
lm_Q1 <- lm(y ~ 1 + x)
summary(lm_Q1)
## 
## Call:
## lm(formula = y ~ 1 + x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7350 -0.4808  0.1754  0.2009  1.1589 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.87338    0.41452   21.41 2.39e-08 ***
## x           -0.45895    0.06681   -6.87 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6068 on 8 degrees of freedom
## Multiple R-squared:  0.8551, Adjusted R-squared:  0.8369 
## F-statistic:  47.2 on 1 and 8 DF,  p-value: 0.0001284
summary(lm_Q1)$r.squared #explained variance 0.8550596
## [1] 0.8550596
#fit a polynomial regression
poly_lm <- lm(y~poly(x,9)) 
summary(poly_lm)
## 
## Call:
## lm(formula = y ~ poly(x, 9))
## 
## Residuals:
## ALL 10 residuals are 0: no residual degrees of freedom!
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.349152         NA      NA       NA
## poly(x, 9)1 -4.168629         NA      NA       NA
## poly(x, 9)2 -0.539152         NA      NA       NA
## poly(x, 9)3  0.008498         NA      NA       NA
## poly(x, 9)4 -0.482650         NA      NA       NA
## poly(x, 9)5 -0.565976         NA      NA       NA
## poly(x, 9)6  0.125583         NA      NA       NA
## poly(x, 9)7  1.013247         NA      NA       NA
## poly(x, 9)8 -1.029141         NA      NA       NA
## poly(x, 9)9  0.005499         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 9 and 0 DF,  p-value: NA
summary(poly_lm)$r.squared #explained variance is 1. 
## [1] 1
#Thus, the proportion of explained variance is 0.855 in the linear model and 1 in the polynomial model
#Add two lines to the plot that show the predicted values for both models 

plot(x = x, y = y)
abline(lm_Q1, col = "red") #adds the regression line of the linear model
lines(fitted(poly_lm) ~ x, col = "green") #adds a line through the fitted values of the polynomial model

Question 2

So the polynomial regression model of the ninth order fits our data perfectly. Why is this the case? The problem is overfitting. Look up a definition of overfitting on the web and explain why overfitting is a problem.

Definition (Wikipedia): An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

What this definition says is that when you fit a model to your data, you should not give the model as many parameters as you have data points. If you do, the \(R^2\) will equal 1.000, which indicates a perfect fit between the model and the data. When this is the case, you no longer model the underlying structure of the data; instead you model the noise pattern that is specific to this particular sample. Once you apply this overfitted model to new data, the fit will no longer be perfect, because the new data contain a different noise pattern. It is better to have a model that may be a little less specific, but that captures the overall pattern in the data rather than the sample-specific noise.
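A quick way to see the problem for yourself is to evaluate the ninth-order fit from Question 1 on a fine grid of x values between the observations (a minimal sketch; the grid object names are just illustrative). The curve passes through every observed point, but typically oscillates wildly in between: that is the noise being modelled, not the trend.

x_grid <- seq(1, 10, by = 0.05) # fine grid between the observed x values
y_grid <- predict(poly_lm, newdata = data.frame(x = x_grid)) # predictions of the poly(9) model on the grid
plot(x, y)
lines(x_grid, y_grid, col = "green") # hits every data point, but wiggles in between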

Question 3

One way to control for overfitting is cross-validation. Let’s look at a simple dataset to examine how cross-validation works:

set.seed(1000)
x <- 1:10 # we have the same x
y_train <- 9 + -.5 * x + rnorm(N, 0, .75) # and simulate a training data set
y_test <- 9 + -.5 * x + rnorm(N, 0, .75) # and a test data set according to the same model

# and we put our data in some data frame:
example <- data.frame(x, y_train, y_test)

plot(example$x, example$y_train,
     pch = 19, col = 'red', # and we plot our data:
     ylim = range(c(y_train, y_test)), ylab = 'y', xlab = 'x')
points(example$x, example$y_test,
       pch = 19, col = 'blue')
legend('topright', legend = c('test', 'train'), col = c('blue', 'red'), pch = 19, bty = 'n')

So in our example dataset we have a variable y_train on which we fit our model, and we will use y_test to check our results. Such a procedure is called cross-validation (we validate the predictions of a model on a second, test data set). So we have y_train and y_test, and the same true model is used to generate both variables.

Fit both models (the linear model and the model with a ninth order polynomial) to the training data. So based on x we want to predict y_train. What are the predicted values?

Answer
set.seed(1000)
x <- 1:10 # we have the same x
y_train <- 9 + -.5 * x + rnorm(N, 0, .75) # and simulate a training data set
y_test <- 9 + -.5 * x + rnorm(N, 0, .75) # and a test data set according to the same model
# and we put our data in some data frame:
example <- data.frame(x, y_train, y_test)
plot(example$x, example$y_train,
     pch = 19, col = 'red', # and we plot our data:
     ylim = range(c(y_train, y_test)), ylab = 'y', xlab = 'x')
points(example$x, example$y_test,
       pch = 19, col = 'blue')
legend('topright', legend = c('test', 'train'), col = c('blue', 'red'), pch = 19, bty = 'n')

#Fit both models to the training data (lm and poly,9)
lm_Q3 <- lm(y_train ~ 1 + x)
poly_lm_Q3 <- lm(y_train ~ poly(x,9)) 

predict(lm_Q3) #8.244718 7.746599 7.248480 6.750361 6.252242 5.754123 5.256004 4.757885 4.259766 3.761647
##        1        2        3        4        5        6        7        8 
## 8.244718 7.746599 7.248480 6.750361 6.252242 5.754123 5.256004 4.757885 
##        9       10 
## 4.259766 3.761647
predict(poly_lm_Q3) #8.165666 7.095608 7.530845 7.479541 5.910084 5.710883 5.143099 5.539813 4.486121 2.970162
##        1        2        3        4        5        6        7        8 
## 8.165666 7.095608 7.530845 7.479541 5.910084 5.710883 5.143099 5.539813 
##        9       10 
## 4.486121 2.970162

Question 4

Plot the test data. Add two lines that show the predictions of both models that are based on the training data. Explain what you see. Did you predict that the prediction line based on the poly(9) model would go through all data points?

Answer
plot(example$x, example$y_test, # we plot the test data
     pch = 19, col = 'blue',
     ylim = range(c(y_train, y_test)), ylab = 'y', xlab = 'x')
abline(lm_Q3, col = "red") # regression line of the linear model (fitted on the training data)
lines(fitted(poly_lm_Q3) ~ x, col = "green") # fitted values of the polynomial model (fitted on the training data)
legend('topright', legend = c('test data', 'linear model', 'poly(9) model'),
       col = c('blue', 'red', 'green'), pch = c(19, NA, NA), lty = c(NA, 1, 1), bty = 'n')

The prediction line of the poly(9) model no longer goes through the data points: the model reproduced the noise in the training data, and the test data contain different noise. The straight line of the linear model still describes the test data well.

Question 5

Now we want to quantify how well both models, fitted on the training data, predict the test data. Compare the predictions based on the training data to the observations in the test data. To quantify the model fit we can, for example, look at the variance of the differences between the predictions and the observations. Calculate this for both models.

Answer
#Linear model
PredErr_LM <- y_test - predict(lm_Q3)
PredErr_LM
##          1          2          3          4          5          6          7 
## -0.4815386 -0.1624652  0.3425562  0.1589851 -0.7542725  0.3734204  0.3603054 
##          8          9         10 
##  0.2608142 -1.2947047  0.3982189
var(PredErr_LM)/var(y_test) #0.1344573 is the proportion of unexplained variance
## [1] 0.1344573
1-0.1344573 #proportion of explained variance
## [1] 0.8655427
#Polynomial model
PredErr_Poly <- y_test - predict(poly_lm_Q3)
PredErr_Poly
##           1           2           3           4           5           6 
## -0.40248717  0.48852590  0.06019116 -0.57019554 -0.41211502  0.41666008 
##           7           8           9          10 
##  0.47320995 -0.52111412 -1.52105984  1.18970390
var(PredErr_Poly)/var(y_test) #0.2286077 is the proportion of unexplained variance
## [1] 0.2286077
1- 0.2286077 #proportion of explained variance
## [1] 0.7713923
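
The same computation can be wrapped in a small helper function, so it is easy to reuse for other models (a minimal sketch; the function name pred_err_prop is just an illustrative choice):

pred_err_prop <- function(model, y_new) {
  # proportion of test-set variance NOT explained by the model's predictions
  var(y_new - predict(model)) / var(y_new)
}
pred_err_prop(lm_Q3, y_test) # linear model
pred_err_prop(poly_lm_Q3, y_test) # poly(9) model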

Question 6

So one way to quantify the model fit is to consider the variance of the differences between the observations and the predictions (these differences are called residuals). However, we don’t always have a cross-validation sample. Fortunately, we have the AIC and BIC. What are the formulas of the AIC and BIC? Clearly explain how both of these fit statistics balance model fit and model complexity.

Answer

\(AIC = -2*\text{log-likelihood} + 2*npar\)
\(BIC = -2*\text{log-likelihood} + \log(n)*npar\)
Both these formulas are criteria for model selection. You can ask for the AIC or BIC of a fitted model and get back a value that indicates the model fit; lower values are better. In contrast with the \(R^2\), both criteria penalize complex models in order to prevent overfitting. The AIC penalizes complexity through the term \(k*npar\), with \(k = 2\) by default and \(npar\) the number of parameters used in the model. Thus, the more parameters, the larger the penalty. The BIC is even stricter in penalizing complex models: instead of \(k = 2\) it uses \(k = \log(n)\), where \(n\) is the number of observations, so in a larger data set (from \(n = 8\) onwards, since \(\log(8) > 2\)) every extra parameter is penalized even more.
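
You can verify these formulas directly in R by computing them by hand from the log-likelihood and comparing the result to AIC() and BIC() (a minimal sketch using lm_Q3 from Question 3; note that for an lm object the parameter count returned by logLik also includes the residual variance):

ll <- logLik(lm_Q3) # log-likelihood of the fitted model
npar <- attr(ll, "df") # number of estimated parameters (intercept, slope, residual variance)
n <- length(y_train) # number of observations
-2 * as.numeric(ll) + 2 * npar # should match AIC(lm_Q3)
-2 * as.numeric(ll) + log(n) * npar # should match BIC(lm_Q3)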

Question 7

In R you can simply get the AIC and BIC values of your fitted model with AIC() and BIC(). Use both fit statistics to compare the linear model with more complex models by increasing the order of the polynomials: e.g. lm(y~poly(x,2)), lm(y~poly(x,3)), up to lm(y~poly(x,4)). What model fits best? Is this what you expected?

The AIC and BIC of the poly(9) model will be -Inf, because the predictions describe the observations perfectly. Therefore the log-likelihood of the model is as high as it can be, and \(-2*\text{log-likelihood}\) is as low as it can be. So we cannot use the AIC and BIC to compare the fit of this model to the other models.
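You can check this yourself on the ninth-order model fitted to the training data; depending on numerical precision the result is -Inf or an extremely large negative number:

AIC(poly_lm_Q3) # poly(9) model with a perfect in-sample fit
BIC(poly_lm_Q3)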

Answer
#AIC
AIC(lm_Q3) 
## [1] 20.36371
AIC(lm(y ~ poly(x,2))) 
## [1] 23.11719
AIC(lm(y ~ poly(x,3))) 
## [1] 25.11692
AIC(lm(y ~ poly(x,4)))
## [1] 26.19857

Thus, using the AIC criterion, the best fit is obtained for the linear model with a single slope parameter.

#BIC
BIC(lm_Q3) 
## [1] 21.27147
BIC(lm(y ~ poly(x,2))) 
## [1] 24.32754
BIC(lm(y ~ poly(x,3))) 
## [1] 26.62985
BIC(lm(y ~ poly(x,4))) 
## [1] 28.01408

Thus, also according to the BIC, the linear model with a single slope parameter fits best. This is what I expected, since y was generated from an ordinary linear regression model, so no complex model is needed to describe it: a single slope parameter suffices. Moreover, it makes sense that the AIC and BIC increase as more parameters are added, because of their complexity penalties, with the BIC being even stricter than the AIC.

I didn’t add the polynomial models of the 5th up to the 9th order, since these are not in the assignment description. Moreover, as was discussed in the tutorial, the AIC and BIC kept increasing with additional polynomial terms, except for the 6th-order model, where the fit happened to look quite good. But as stated before, it doesn’t make sense to use a six-parameter model to describe data generated by a linear model.
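
If you do want to compare several polynomial orders at once, a compact way to do so is the sketch below (the object names orders and fits are just illustrative; the ninth order is left out because its AIC and BIC are -Inf, see above):

orders <- 1:8
fits <- lapply(orders, function(p) lm(y ~ poly(x, p))) # fit all polynomial orders to the same data
data.frame(order = orders,
           AIC = sapply(fits, AIC),
           BIC = sapply(fits, BIC))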

Question 8 (bonus)

We did our simulation for one specific seed, set.seed(1000). It would be better to repeat it a number of times, to check that in most of our simulations the variance of the residuals is indeed higher for the poly(9) model than for the linear model. Do this simulation. Make a histogram of the variances of the residuals for both models. Describe what you see.

Answer
residu_LM <- numeric()
residu_Poly <- numeric()

set.seed(1)

for (i in 1:100) { # repeat the simulation 100 times
  x <- 1:10 # we have the same x
  y_train <- 9 + -.5 * x + rnorm(N, 0, .75) # and simulate a training data set
  y_test <- 9 + -.5 * x + rnorm(N, 0, .75) # and a test data set according to the same model
  
  LM_Q8 <- lm(y_train ~ 1 + x)
  Poly_Q8 <- lm(y_train ~ poly(x, 9))
  
  residu_LM[i] <- var(y_test - predict(LM_Q8))
  residu_Poly[i] <- var(y_test - predict(Poly_Q8))
}

par(mfcol= c(1,2))

hist(residu_LM)
hist(residu_Poly)

Note that the loop has to run over 1:100 (not over the single value 100, which would perform only one simulation). With 100 simulations, the histogram of the residual variances for the poly(9) model lies at clearly larger values than the histogram for the linear model: in most simulations the overfitted model predicts the test data worse, just as in our single example above.
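
To summarize the simulation in a single number, you can also count how often the overfitted model does worse (a short follow-up under the same assumptions as the loop above):

mean(residu_Poly > residu_LM) # proportion of simulations in which the poly(9) model predicts the test data worse than the linear model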