When we fit a statistical model to our data we have to balance the extent to which our model fits our observed data, and the parsimony of our model. A more parsimonious model will use less parameters to describe the data, but will also provide a worse fit. We need some balance here! But why is this the case? Why don’t we just take the more complex model with more parameters that has a better fit?
Let’s look at an example:
# Load libraries (if not already loaded)
library(ggplot2)
# Set a seed so we all generate the same data (you can change this if you like)
set.seed(1)
# Number of subjects:
N <- 10
# Generate data:
myData <- data.frame(x1 = runif(10, -1, 1),
x2 = runif(10, 0, 3),
x3 = rnorm(10),
x4 = rnorm(10, 3, 1),
x5 = runif(10, -3, 1),
x6 = runif(10, -7, 1),
x7 = rbinom(10, 10, .5),
x8 = rbinom(10, 10, .05),
x9 = runif(10, 1, 6),
x10 = rnorm(10))
myData <- myData[order(myData$x4), ]
# Simulate y values based on our linear regression model and add some error:
myData$y <- 9 + -1.1 * myData$x4 + rnorm(N, 0, .9)
# Plot the data of our predictor and y values:
plot(myData$x4, myData$y)
So, \(x4\) and \(y\) are our data here. For example, \(x4\) is the number of alcohol beverages someone drinks and \(y\) is their score on a working memory test.
With the R code lm(y ~ x4, data = myData) we can fit a
linear regression model to myData. We can extend the model
by including many more variables (x1, x2, etc), for example with
lm(y ~ x1 + x2 + x3 + x4) we fit a model were we predict
\(y\) with \(x1\), \(x2\), \(x3\), and \(x4\).
Fit both a simple model with the predictor \(x4\), i.e., lm(y ~ x4), and a
complex model with all 10 predictors, i.e.,
lm(y ~ x1 + x2 + x3 + x4 + .. ), to see if you can explain
the y variable in myData. What is the
explained variance for the simple model, and what is the explained
variance for the complex model?
Add two lines to the plot above that show the predicted values for
both the simple and the complex model. Why does the complex model (the
model with 10 predictors) describe the y variable so
well?
# Simple model
simple_model <- lm(y ~ x4, data = myData)
# Complex model
complex_model <- lm(y ~ ., data = myData)
# Explained variance
simple_r_squared <- summary(simple_model)$r.squared
complex_r_squared <- summary(complex_model)$r.squared
simple_r_squared; complex_r_squared
## [1] 0.4863744
## [1] 1
# Predicted values
simple_pred <- predict(simple_model, newdata = myData)
complex_pred <- predict(complex_model, newdata = myData)
# Plot
ggplot(myData, aes(x = x4, y = y)) +
geom_point(color = "blue") +
geom_line(aes(y = simple_pred), color = "red", linetype = "dashed") +
geom_line(aes(y = complex_pred), color = "green") +
labs(title = "Titel",
x = "x", y = "y")
The explained variance of the simple model is (rounded off to 4
decimals); 0.48634, and for the complex model is 1. The fact is that the
complex model is having a higher explained variance is also due too
higher dimensionality: the complex model includes multiple predictors
(x1 through x10), allowing it to capture more dimensions of variability
in the data. This can be particularly beneficial when the relationship
between the predictors and the response variable is not simple. Improved
Fit: By including more predictors, the complex model is better able to
capture the nuances and complexities of the relationship between the
predictors and the response variable. This results in a higher R-squared
value, indicating a better fit to the data compared to the simple
model.
So the complex model fits perfectly to our data. Look up a definition of overfitting and explain, in your own words, why overfitting is a problem.
Overfitting in linear regression occurs when the model is too complex for the available data. This leads to memorizing the training data instead of learning patterns, resulting in poor performance on unseen data. Mitigating overfitting can be done through regularization, cross-validation, or feature selection techniques to reduce complexity.
One way to control for overfitting is cross-validation. Lets look
again at our simple dataset to examine how cross-validation works. In
order to do cross-validation, we will need to simulate a second set of
data referred to as myData_test. We will use R to simulate
the testdata based on the same model as we used before.
# Generate a test dataset:
myData_test <- data.frame(x1 = runif(10, -1, 1),
x2 = runif(10, 0, 3),
x3 = rnorm(10),
x4 = rnorm(10, 3, 1),
x5 = runif(10, -3, 1),
x6 = runif(10, -7, 1),
x7 = rbinom(10, 10, .5),
x8 = rbinom(10, 10, .05),
x9 = runif(10, 1, 6),
x10 = rnorm(10))
# Simulate y values based on our linear regression model and add some error:
myData_test <- myData_test[order(myData_test$x4), ]
myData_test$y_test <- 9 + -1.1 * myData_test$x4 + rnorm(N, 0, .9)
Let’s plot it to compare both dataset:
# Plot both the training data and test data:
plot(myData$x4, myData$y,
pch = 19, col = 'red',
ylab = 'y', xlab = 'x',
xlim = c(1, 6), ylim = c(2, 8))
points(myData_test$x4, myData_test$y_test,
pch = 19, col = 'blue')
legend('topright', legend = c('test data', 'train data'),
col = c('blue', 'red'), pch = 19, bty = 'n')
So in our first dataset we have a variable y on which we
fit our model. Now we will use y_test in
myData_test to check our results. Following such a
procedure is called cross-validation (we validate our prediction based
on some model on a second test data set). So we have y that
represents the y value for our training data, and y_test
that is the y value in our test data. Note that the same true
model is used to generate both y and
y_test.
Fit both models (the simple model with one predictor and the complex
model with 10 predictors) to the training data (myData).
What are the predicted values according to these models?
myData_test_y_test <- myData_test$y_test
# Predict y values for the test data using the simple model:
predicted_y_simple <- predict(simple_model, newdata = myData)
predicted_y_simple
## 4 8 9 7 6 3 10 5
## 7.978790 7.392189 6.269430 5.904807 5.792071 5.644240 5.255838 5.027481
## 2 1
## 4.843887 4.689103
# Predict y values for the test data using the complex model:
predicted_y_complex <- predict(complex_model, newdata = myData)
predicted_y_complex
## 4 8 9 7 6 3 10 5
## 8.058200 5.693365 7.544964 6.009303 7.717092 6.045937 4.601312 5.567845
## 2 1
## 3.998962 3.560855
# Create a dataframe with predicted y values from both models
predicted_y_df <- data.frame(
predicted_y_simple = predicted_y_simple,
predicted_y_complex = predicted_y_complex,
myData_test_y_test = myData_test_y_test
)
# View the dataframe
predicted_y_df
## predicted_y_simple predicted_y_complex myData_test_y_test
## 4 7.978790 8.058200 5.381328
## 8 7.392189 5.693365 8.105999
## 9 6.269430 7.544964 5.203539
## 7 5.904807 6.009303 6.002866
## 6 5.792071 7.717092 5.378075
## 3 5.644240 6.045937 5.654855
## 10 5.255838 4.601312 8.099191
## 5 5.027481 5.567845 5.669329
## 2 4.843887 3.998962 4.368498
## 1 4.689103 3.560855 3.609120
Plot the test data [myData_test]. Add two lines
that show the predictions for both the simple and the complex model that
are based on the training data. Explain what you see. Did you
predict that the prediction line based on the complex model would go
through all data points of the test data?
# Hint:
# predict(simple_model, newdata = myData_test)
Warning: if you fitted your linear model with
lm(myData$y ~ myData$x4) instead of
lm(y ~ x4, data = myData), the above code will not predict
based on the myData_test, but still based on
myData.
# Predicted values
simple_pred <- predict(simple_model, newdata = myData_test)
complex_pred <- predict(complex_model, newdata = myData_test)
# Plot
ggplot(myData_test, aes(x = x4, y = y_test)) +
geom_point(color = "blue") +
geom_line(aes(y = simple_pred), color = "red", linetype = "dashed") +
geom_line(aes(y = complex_pred), color = "green") +
labs(title = "Titel",
x = "x4", y = "y")
No, you shouldn’t ever expect and/or predict that the prediction line based on the complex model using literally all variables made using the training data would go through all data points of the test data. There are a few key reasons for this:
Error Term: Even complex models with many features will have an error term. This term accounts for the inherent variability in the data that the model cannot perfectly capture. This variability will cause some data points in the test set to fall above or below the predicted line.
Overfitting: Complex models are more susceptible to overfitting. This occurs when the model becomes too focused on fitting the training data perfectly, including noise or random variations. As a result, the model may not generalize well to unseen data like the test set, leading to predictions that deviate from the actual values.
Test Set Independence: The test set is designed to be independent of the training data. This means it wasn’t used to build the model. Therefore, the model’s predictions for the test data shouldn’t perfectly match any specific data point because it’s encountering new information.
In conclusion, a good model should capture the underlying trend in the data but will still have some level of error. The complex model’s prediction line will likely be closer to the test data points on average compared to a simpler model, but it won’t align with every single point.
Now we want to quantify how well the simple and more complex
model based on the train data, predict the test data. Compare the model
predictions based on the training data to the observations in the test
data. To quantify the model fit we can look at the variance of the
differences between the predictions of y for the test data
and the observation of y_test in the test data (i.e., the
variance of the residuals). Calculate this for both the simpel and the
complex model.
# Get residuals for simple model:
simple_residuals <- myData_test$y_test - predicted_y_simple
simple_residuals
## 4 8 9 7 6 3
## -2.59746233 0.71380988 -1.06589179 0.09805903 -0.41399577 0.01061494
## 10 5 2 1
## 2.84335294 0.64184745 -0.47538954 -1.07998294
# Get residuals for complex model:
complex_residuals <- myData_test$y_test - predicted_y_complex
complex_residuals
## 4 8 9 7 6 3
## -2.676871989 2.412634001 -2.341425815 -0.006437315 -2.339016866 -0.391082240
## 10 5 2 1
## 3.497878095 0.101483202 0.369535672 0.048265120
# Calculate variance of residuals for simple model:
simple_var <- var(simple_residuals)
simple_var
## [1] 2.031888
# Calculate variance of residuals for complex model:
complex_var <- var(complex_residuals)
complex_var
## [1] 4.033497
The variance of the residuals for the simple model (rounded to 4 decimals) is 2.0319. The variance of the residuals for the complex model (rounded to 4 decimals) is 4.0335. Based on these values, the simpel model seems to perform better. A lower variance indicates a closer fit between the model’s predictions and the actual y_test values in the test data. In simpler terms, the complex model with all ten features captures less of the the underlying trend in the data than the simpler model with just one feature, resulting in less error (variance) in its predictions on unseen data.
One way to quantify the model fit is to consider the variance between the observations and the predictions (these differences between observations and predictions are called the residuals, or prediction errors). However we don’t always have a cross validation sample. Fortunately, we have the AIC and BIC that we can use as model fit criteria. What are the formula’s of the AIC and BIC? Clearly explain how both of these fit metrics balance model fit and model complexity.
Formulas:
Akaike Information Criterion (AIC):
AIC = -2 * ln(L) + 2 * k
ln(L): Represents the log-likelihood of the model (higher likelihood indicates better fit). k: The number of estimated parameters in the model (including the intercept).
Bayesian Information Criterion (BIC):
BIC = -2 * ln(L) + ln(n) * k
Similar to AIC, ln(L) is the log-likelihood and k is the number of parameters. n: The sample size of the data.
Balancing Model Fit and Complexity:
Both AIC and BIC penalize models for their complexity (number of parameters). Here’s how they achieve this:
Log-likelihood: This term rewards models that fit the data well (higher likelihood). Penalty term: The k (or k * ln(n)) term penalizes models with a larger number of parameters (k). As the number of parameters increases, the penalty term also increases, effectively discouraging models with too much complexity.
Choosing the Right Model:
Lower AIC or BIC values indicate a better balance between model fit and complexity. A model with a high likelihood (good fit) but too many parameters will be penalized by the second term. While both criteria aim for similar goals, BIC generally penalizes complex models more heavily than AIC, favoring simpler models with similar fit.
Another way to quantify model fit is to inspect the R-squared. In the lecture we discussed how the R-squared is a biased population estimator. Explain what it means for an estimator to be a biased population estimator.
R-squared is a biased population estimator. Here’s what that means:
Bias in Estimators:
An estimator is a function or procedure used to estimate an unknown population parameter based on a sample. Bias refers to a systematic difference between the average value of the estimator applied to many samples and the true population parameter it’s trying to estimate.
There are two main types of bias:
R-squared as a Biased Estimator:
R-squared, which measures the proportion of variance in the dependent variable explained by the independent variable(s) in a linear regression model, is known to be a positively biased estimator. This means:
Using the code below, show with a simulation study that the R-squared
is a biased population estimator. Think about what the expected value
for R-squared would be in case of this linear model
lm(babies ~ flowers). Write a simulation study such that
you simulate data, fit the linear model and obtain the R-squared value a
1000 times. Compare what you would expected R-squared to be to the mean
R-squared found in the simulation study.
Hint: Does the predictor flowers effectively explain any of the variance in babies for this particular linear model?
n <- 7
birds <- 1:n
bees <- 1:n
babies <- birds + bees + rnorm(n)
flowers <- rnorm(n)
linearModel <- lm(babies ~ flowers)
# Set simulation parameters
simulations <- 1000
n <- 7 # Sample size (same as your example)
# Expected R-squared (since flowers doesn't explain variance)
expected_r2 <- 0
expected_r2
## [1] 0
# Initialize vector to store R-squared values
sim_r2 <- rep(NA, simulations)
# Run simulations
for (i in 1:simulations) {
# Simulate data (same structure as your example)
birds <- 1:n
bees <- 1:n
babies <- birds + bees + rnorm(n)
flowers <- rnorm(n)
# Fit the linear model
linearModel <- lm(babies ~ flowers)
# Extract R-squared value
sim_r2[i] <- summary(linearModel)$r.squared
}
# Calculate mean R-squared across simulations
mean_sim_r2 <- mean(sim_r2)
mean_sim_r2
## [1] 0.1567817
Running this code shows a mean R-squared value from the simulations that is slightly positive. This demonstrates the bias of R-squared. Even though the flowers variable doesn’t explain any variance, the R-squared values from individual simulations are not all exactly 0, leading to a positive average value when taking the mean.
Continuation of question 3 and 5: We ran our simulation for one specific case of test data. It would be better to repeat the simulation a couple of times to ensure that indeed in many of our simulations the variance of the residuals [see question 5] is higher for the complex model compared to the simple model. Set up a simulation in which we inspect the variance of the differences between the predictions and the observation for both the simple and the complex model for a 1000 simulated test datasets. Plot the variances of the residuals for both the simple and the complex model in a histogram. Describe what you see. Is the variance of the residuals higher for the complex model compared to the simple model?
# Set simulation parameters
simulations <- 1000
n <- 10 # Sample size
# Initialize vectors to store variances
simple_var_residuals <- numeric(simulations)
complex_var_residuals <- numeric(simulations)
# Run simulations
for (i in 1:simulations) {
# Generate test data
myData_test <- data.frame(
x1 = runif(n, -1, 1),
x2 = runif(n, 0, 3),
x3 = rnorm(n),
x4 = rnorm(n, 3, 1),
x5 = runif(n, -3, 1),
x6 = runif(n, -7, 1),
x7 = rbinom(n, 10, .5),
x8 = rbinom(n, 10, .05),
x9 = runif(n, 1, 6),
x10 = rnorm(n)
)
myData_test <- myData_test[order(myData_test$x4), ]
myData_test$y_test <- 9 + -1.1 * myData_test$x4 + rnorm(n, 0, .9)
# Predict y values for the test data using the simple model:
simple_pred <- predict(simple_model, newdata = myData_test)
# Predict y values for the test data using the complex model:
complex_pred <- predict(complex_model, newdata = myData_test)
# Calculate residuals for both models
simple_residuals <- myData_test$y_test - simple_pred
complex_residuals <- myData_test$y_test - complex_pred
# Calculate variance of residuals for both models
simple_var_residuals[i] <- var(simple_residuals)
complex_var_residuals[i] <- var(complex_residuals)
}
# Plot histograms of variance of residuals
hist_data <- data.frame(
Model = rep(c("Simple", "Complex"), each = simulations),
Residual_Variance = c(simple_var_residuals, complex_var_residuals)
)
# Plot histogram
ggplot(hist_data, aes(x = Residual_Variance, fill = Model)) +
geom_histogram( position = "identity") +
labs(title = "Distribution of Residual Variances (Simple vs. Complex Model)",
x = "Residual Variance", y = "Frequency") +
scale_fill_manual(values = c("blue", "red")) +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The variance of the residuals is higher for the complex model compared to the simple model.The distribution of residuals for the complex model (blue) appears to be wider than the distribution for the simple model (red)indicating more residuals falling farther from the predicted values. This wider spread suggests a higher variance in the residuals of the complex model. In simpler terms, the complex model’s predictions tend to deviate more from the actual values (y-test) compared to the simpler model.