The main purpose of linear regression is to use data to quantify relationships between a single response variable and a single, or multiple, predictor variable(s). We are ultimately trying to estimate the “true”, population-level relationship between the response variable and the predictors using the given data. Once we’ve estimated a relationship, it can then be leveraged to predict outcomes given a new set of inputs (values of predictor variables).
Note that predictor variables are also referred to as covariates.
Here’s an example of 30-year Average Fixed Rate Mortgage Rates vs 10-Year Treasury Yields:
data <- read.csv("Regression_Example.csv", header = TRUE)
# make appropriate column names and get rid of dates column
data <- data[,-1]
names(data) <- c('Treasury', 'Mortgage', 'Delinquency')
# make sure numeric values
data$Treasury <- as.numeric(data$Treasury)
data$Mortgage <- as.numeric(data$Mortgage)
data$Delinquency <- as.numeric(data$Delinquency)
# check no NAs
#sum(is.na(data))
# compute the changes
data_diff <- data[2:dim(data)[1],] - data[1:dim(data)[1] - 1,]
# plot quarterly change in mortgage rate vs weekly change in Treasury yield
library(plotly)
plot_ly(data_diff, x = ~Treasury, y = ~Mortgage, type = "scatter", mode = "markers", alpha = 0.5) %>%
layout(title = "Quarterly Change in 30-Year Avg. Mortgage Rate vs 10-Year Treasury Yield",
xaxis = list(title = "Change in Treasury Yield (%)"),
yaxis = list(title = "Change in Mortgage Rate (%)"))From the plot, it seems pretty reasonable to assume there is a positive linear relationship between the Change in Mortgage rate and the Change in Treasury Yield.
Before trying to quantify the observed positive linear relationship from the example, some other statistics need to be defined.
Consider an observed dataset \(\{x_1, x_2, ..., x_n \}\). Then define the following statistics as such:
Note that the z-scores represent the standardized version of the observed dataset since the sample mean of the z-scores is 0 and the sample standard deviation (and variance) is 1. In more advanced modeling techniques, it’s common to first standardize (or “normalize”) the data rather than dealing with the raw values as this effectively strips out the magnitude of the variable, which can be helpful when comparing variables with different units.
Now consider another observed dataset \(\{y_1, y_2, ..., y_n\}\) (we observed the pair \((x_i, y_i)\) together). The strength of the linear relationship between \(y\) and \(x\) is defined to be correlation coefficient:
\[ \begin{aligned} \rho(x, y) :&= \frac{1}{n - 1}\frac{\sum_{i = 1} ^ {n} (x_i - \overline{x})(y_i - \overline{y})}{\hat{\sigma}(x)\hat{\sigma}(y)} \\ &= \frac{1}{n - 1}\sum_{i = 1} ^ {n}z(x_i)z(y_i) \end{aligned} \]
where \(-1 \leq \rho \leq 1\). A positive correlation indicates a positive linear relationship, and a negative correlation indicates a negative linear relationship. The closer the magnitude of the correlation is to 1, the stronger the linear relationship. Note that 0 correlation only indicates an absence of a linear relationship: there can still be a nonlinear relationship between the variables!
All these statistics can be calculated for the example.
# calculate sample statistics, note that sd() uses n-1 in denominator
mort_mean <- mean(data_diff$Mortgage)
tr_mean <- mean(data_diff$Treasury)
mort_sd <- sd(data_diff$Mortgage)
tr_sd <- sd(data_diff$Treasury)
paste("Mortgage's sample mean and standard deviation are ", round(mort_mean, 4), " and ", round(mort_sd, 4))## [1] "Mortgage's sample mean and standard deviation are 0.0173 and 0.3481"
paste("Treasurys's sample mean and standard deviation are ", round(tr_mean, 4), " and ", round(tr_sd, 4))## [1] "Treasurys's sample mean and standard deviation are 0.0053 and 0.3141"
## [1] "The correlation is 0.7953"
Now that we can measure the strength of the linear relationship, how do we express the value of the response variable as a linear function of the predictor variable (only considering 1 predictor for now).
A pure linear relationship between \(y\) and \(x\) is of the form \(y_i = \beta_0 + \beta_1x_i\), but in reality, even if we were correct in specifying the relationship between \(y\) and \(x\) as linear and we also had access to the true values of \(\beta_0\) and \(\beta_1\), we wouldn’t expect to observe each \(y_i\) to be exactly equal to \(\beta_0 + \beta_1x_i\) because there is inherent randomness in nature. So, let \(\epsilon \in \mathbb{R}^n\) capture this randomness. This way \(\beta_0\) and \(\beta_1\) can capture the actual relationship between the variables (i.e., the true trend). Then, a linear relationship between \(y\) and \(x\) is written as
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i. \]
And to clarify again, in the equation above \(\beta_0\) and \(\beta_1\) represent the true population parameters: we don’t know these! And \(\epsilon_i\) is a random variable: we don’t know the true distribution!
So, we are going to estimate \(\beta_0\) and \(\beta_1\) with \(\hat{\beta_0}\) and \(\hat{\beta_1}\), and since \(y\) and \(x\) are observed, we will end up with an estimated model of the form
\[ y_i = \hat{\beta_0} + \hat{\beta_1}x_i + e_i \]
where \(e_i\) is the \(i^{th}\) residual: for a given \(\hat{\beta_0}\), \(\hat{\beta_1}\), \(y_i\), and \(x_i\), \(e_i := y_i - (\hat{\beta_0} + \hat{\beta_1}x_i)\). In other words, \(e_i\) is what we believe is the realization of the random variable \(\epsilon_i\) (we don’t know if it’s the true realization because the residual depends on our estimates for the coefficients, which could differ from the true coefficient values).
Here is some intuition as to how we will estimate \(\hat{\beta_0}\) and \(\hat{\beta_1}\). Even though we know we won’t be able to find coefficients such that \(y_i = \hat{\beta_0} + \hat{\beta_1}x_i\) for all \(i = 1,...n\), we still want to be able to explain as much of the observed \(y\) information using the observed \(x\) information. In other words, we would ideally like to pick estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) such that the resulting residuals are as small as possible in magnitude, on average. But if we tried to minimize the average of the residuals, the positive and negative residuals would cancel each other out. So we instead seek to minimize the sum of the residuals squared (minimizing the sum of squares is the same as minimizing the average of squares), or the Sum of Squared Errors (SSE). So,
\[ \begin{aligned} (\hat{\beta}_0, \hat{\beta}_1) &= \underset{\beta_0, \beta_1 }{\operatorname{argmin}} \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1x_i )) ^ 2 \\ &= \underset{\beta_0, \beta_1 }{\operatorname{argmin}} \sum_{i = 1}^{n}e_i^2, \end{aligned} \]
where \((\hat{\beta}_0, \hat{\beta}_1)\) can be explicitly solved as
\[ \hat{\beta}_1 = \rho \frac{\hat{\sigma}(y)}{\hat{\sigma}(x)} \\ \hat{\beta}_0 = \overline{y} - \hat{\beta}_1\overline{x}. \]
This is the Ordinary Least Squares line (line of best fit) when there is only 1 predictor. Also note that this line will always pass through the point \((\overline{x}, \overline{y})\).
Now that we have estimated the coefficients in the linear model, here is some resulting terminology:
An a remark, even though we have a prediction equation for the value of the response variable given the value of the predictor, it might be useful to see the implications of OLS on z-scores:
\[ \begin{aligned} \hat{y_i} &= \hat{\beta_0} + \hat{\beta_1}x_i \\ &= \overline{y} - \hat{\beta}_1\overline{x} + \rho \frac{\hat{\sigma}(y)}{\hat{\sigma}(x)} x_i \\ &= \overline{y} - \rho \frac{\hat{\sigma}(y)}{\hat{\sigma}(x)}\overline{x} + \rho \frac{\hat{\sigma}(y)}{\hat{\sigma}(x)} x_i \\ \frac{\hat{y_i} - \overline{y}}{\sigma(y)} &= \rho \frac{\hat{x_i} - \overline{x}}{\sigma(x)} \\ z(\hat{y_i}) &= \rho z(x_i). \end{aligned} \]
This means given the z-score of \(x_i\), the predicted z-score of the response variable is \(\rho z(x_i)\). So, for instance, if a particular \(x_i\) was very large compared to the other observed values of \(x\), and the correlation between \(x\) and \(y\) was positive but not perfectly 1, then we would expect the corresponding \(i^{th}\) response value to also be large compared to the other response values, but not as extreme as \(x_i\). In short, if the input is extreme, OLS predicts the response to be extreme, but not as extreme. This is referred to as regressing toward the mean.
We can calculate the OLS line for the example:
# perform linear regression
lm_mortgage <- lm(formula = Mortgage~Treasury, data = data_diff)
paste("The estimated intercept is ", round(lm_mortgage$coefficients[1], 4), " and the estimated slope coefficient is ", round(lm_mortgage$coefficients[2], 4))## [1] "The estimated intercept is 0.0126 and the estimated slope coefficient is 0.8814"
# add OLS line to plot
plot_ly(data_diff, x = ~Treasury, y = ~Mortgage, type = "scatter", mode = "markers", name = "Observed Data", alpha = 0.5) %>%
add_trace(x = data_diff$Treasury, y = lm_mortgage$fitted.values, type = "scatter", mode = "lines", name = "Fitted Line", alpha = 1) %>%
layout(title = "Quarterly Change in 30-Year Avg. Mortgage Rate vs 10-Year Treasury Yield",
xaxis = list(title = "Change in Treasury Yield (%)"),
yaxis = list(title = "Change in Mortgage Rate (%)"))##
## Call:
## lm(formula = Mortgage ~ Treasury, data = data_diff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27707 -0.08600 0.01127 0.09455 0.67938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01260 0.02360 0.534 0.595
## Treasury 0.88142 0.07558 11.662 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2123 on 79 degrees of freedom
## Multiple R-squared: 0.6326, Adjusted R-squared: 0.6279
## F-statistic: 136 on 1 and 79 DF, p-value: < 2.2e-16
We can see that even though OLS minimizes SSE, it still doesn’t fit the observed data perfectly (which we already knew would likely happen). So how do we quantify how well this line fits the data? We could just state the SSE as a measure of fit, but SSE will be influenced by units, so you can’t compare the SSE from a linear model with a certain response variable against the SSE from a linear model dealing with a different response variable. The coefficient of determination, \(R^2\), is meant to be a standardized metric to evaluate the linear model’s predictive power.
The idea behind \(R^2\) is to compare the predictive power of the model with the predictor against the simplest predictive model. The simplest predictive model has no predictors: it is a constant. If we solved OLS without any predictor variables (so just \(\hat{\beta_0}\)), then we’d get the simplest predictive model is \(\hat{\beta_0} = \overline{y}\). In other words, our best guess for the next response value is the average of the so far observed response values. And we call the sum of squared residuals for this simplest model the Sum of Squares Total (SST). Notice that SSE from the model with the predictor will never exceed SST because OLS in both cases is just an optimization problem, but the simplest model forces \(\hat{\beta_1} = 0\): i.e., OLS for the simplest model is a constrained version of the optimization problem for OLS with the predictor variable. So,
\[ R^2 := 1 - \frac{SSE}{SST}, \]
with \(0 \leq R^2 \leq 1\). The closer \(R^2\) is to 1, the better the predictive power. For the example, the \(R^2\) is the “Multiple R-squared”, recorded as 0.5945.
OLS with multiple predictors follows the same format as with 1 predictor, except that it’s typically written in matrix format. For a total of p predictors and n observations, let \(X \in \mathbb{R}^{n \text{ x }p + 1}\) denote the observation matrix. The first column of \(X\) is just 1s, and for the \(i^{th}\) row, the remaining p columns are the values of the the predictors for the \(i^{th}\) observation. For instance, in the case of 2 observations and 2 predictors,
\[ \begin{aligned} X &= \left[\begin{array}{ccc} 1 & x_1^{\text{predictor 1}} & x_1^{\text{predictor 2}}\\ 1 & x_2^{\text{predictor 1}} & x_2^{\text{predictor 2}} \end{array}\right] \\ &= \left[\begin{array}{ccc} 1 & x_{11} & x_{12}\\ 1 & x_{21} & x_{22} \end{array}\right] \end{aligned} \] And \(\beta \in \mathbb{R}^{p+1}\) is a vector of the coefficient values:
\[ \beta = \left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{array}\right]. \]
The least squares coefficients are then
\[ \begin{aligned} \hat{\beta} &= \underset{\beta}{\operatorname{argmin}} \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip})) ^ 2 \\ &= \underset{\beta}{\operatorname{argmin}} (y - X\beta)^T(y - X\beta), \end{aligned} \]
resulting in the solution,
\[ \hat{\beta} = (X^TX)^{-1}X^Ty. \]
Note that we typically assume \(X^TX\) is invertible.
Furthermore, the predicted values are \(\hat{y} = X\hat{\beta} = Hy\), where \(H := X(X^TX)^{-1}X^T\). \(H\) is a projection matrix where it takes the observed response values, \(y\), and finds the vector \(\hat{y}\) that is closest to \(y\) but can also be written as a linear combination of the predictor values and the vector of 1s (i.e., the column space of \(X\)). All these results hold for the case of only 1 predictor (p = 1).
We can calculate the OLS plane in the example using both predictors in the dataset:
# perform linear regression
lm_mortgage2 <- lm(formula = Mortgage~., data = data_diff)
paste("The estimated intercept is ", round(lm_mortgage2$coefficients[1], 4), " and the estimated Treasury slope coefficient is ", round(lm_mortgage2$coefficients[2], 4), "and estimated Delinquency slope coefficient is ", round(lm_mortgage2$coefficients[3], 4))## [1] "The estimated intercept is 0.0128 and the estimated Treasury slope coefficient is 0.866 and estimated Delinquency slope coefficient is -0.0879"
# estimate OLS plane
coeff <- lm_mortgage2$coefficients
treasury_seq <- seq(from = min(data_diff$Treasury), to = max(data_diff$Treasury), by = 0.1)
delinq_seq <- seq(from = min(data_diff$Delinquency), to = max(data_diff$Delinquency), by = 0.1)
z <- t(outer(treasury_seq, delinq_seq, function(x,y) coeff[1]+coeff[2] * x + coeff[3] * y))
# plot observed data and OLS plane
plot_ly(x = ~treasury_seq, y = ~delinq_seq, z=~z, type="surface") %>%
add_trace(data = data_diff, x = ~Treasury, y = ~Delinquency, z =~Mortgage, type = "scatter3d", mode = "markers", name = "Observed Data", alpha = 1) %>%
layout(title = "30-Year Avg. Mortgage Rate vs 10-Year Treasury Yield and Res. Delinquency Rate",
scene = list(xaxis = list(title = "Change in Treasury Yield (%)"),
yaxis = list(title = "Change in Residential Delinquency Rate (%)"),
zaxis = list(title = "Change in Mortgage Rate (%)")))##
## Call:
## lm(formula = Mortgage ~ ., data = data_diff)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14918 -0.08034 0.00195 0.11409 0.68104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01277 0.02342 0.545 0.587
## Treasury 0.86599 0.07573 11.435 <2e-16 ***
## Delinquency -0.08795 0.05937 -1.481 0.143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2107 on 78 degrees of freedom
## Multiple R-squared: 0.6426, Adjusted R-squared: 0.6335
## F-statistic: 70.13 on 2 and 78 DF, p-value: < 2.2e-16
Once we have a prediction equation, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + ... + \hat{\beta}_px_{ip}\), we can interpret the slope coefficients as follows: for 2 observations which differ in \(x_j\) by 1 unit and have all other predictors equal, the expected difference in responses is \(\hat{\beta}_j\). Note that the interpretations makes no mention of causality: we can only talk about the expected difference in responses between 2 sets of inputs where only 1 input differs (all other predictors must be controlled). This also means if we are missing a predictor in the model, we are not explicitly controlling it.
\(\hat{\beta}_0\) typically does not have any interpretation since it is just a vertical placement parameter.
Given a dataset of observations, we can form a prediction equation, but what would happen if we collected a different set of observations? We would get a different prediction equation. This is true even if we held the predictors’ values constant because of the randomness of \(\epsilon\). As a result, for a fixed \(X\), \(\hat{\beta}\) is a random vector, meaning \(\hat{y}\) is also random.
Here is an example illustrating how different datasets generate different OLS lines. Each dataset consists of 20 observations, \((x, y)\), where for each \(i\), \(y_i = 2x_i + \epsilon_i\), and \(\epsilon \sim MVN(0, 1)\) (meaning each \(\epsilon_i \sim N(0, 1)\) and all the \(\epsilon_i\) are independent). There are a total of 5 datasets.
obs <- 20
sets <- 5
# generate 20 x values between 0 and 5
# these x values will be the same for all 10 datasets
x = runif(20, min = 0, max = 5)
# set true model
beta1 = 5
sigma_epsilon = 5
fitted_vals <- matrix(0, nrow = obs, ncol = sets)
for (i in 1:sets) {
# generate y values
y = beta1 * x + rnorm(obs, mean = 0, sd = sigma_epsilon)
# run regression
lm_y = lm(formula = y~x)
# store fitted values
fitted_vals[,i] <- as.vector(lm_y$fitted.values)
}
plot_ly(x = x, y = beta1 * x, type = "scatter", mode = "lines", name = "True Line", line= list(width = 3)) %>%
add_trace(x = x, y = fitted_vals[,1], type = "scatter", mode = "lines", name = "Sample 1", line= list(width = 1.5)) %>%
add_trace(x = x, y = fitted_vals[,2], type = "scatter", mode = "lines", name = "Sample 2", line= list(width = 1.5)) %>%
add_trace(x = x, y = fitted_vals[,3], type = "scatter", mode = "lines", name = "Sample 3", line= list(width = 1.5)) %>%
add_trace(x = x, y = fitted_vals[,4], type = "scatter", mode = "lines", name = "Sample 4", line= list(width = 1.5)) %>%
add_trace(x = x, y = fitted_vals[,5], type = "scatter", mode = "lines", name = "Sample 5", line= list(width = 1.5)) %>%
layout(title = "Different OLS Lines",
xaxis = list(title = "x"),
yaxis = list(title = "y"))Since the predictive function can differ from the true generative function, what inferences can we make about what the true population parameters might actually be? To conduct any form of inference, we need to make 3 key assumptions about the true generative model (in more advanced courses you will also learn when it’s okay to relax some of these assumptions).
Linearity. The linearity assumption states that the true generative model is of the form \(y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip} + \epsilon_i\) with \(\mathbb{E}(\epsilon_i|x_i) = 0\). So, \(\mathbb{E}(y_i|x_i) = \beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip}\). In words, this means that the generative model is in fact linear, with the expected values of the sources of randomness (error terms), \(\epsilon_i\), being 0 once the predictors’ values are given. This assumptions validates the idea of using \(\beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip}\) as the best guess for the response value.
Homoskedasticity. The homoskedasticity assumption states that the error terms have the same variance, conditional on the predictors: \(Var(\epsilon_i|x_i) = Var(\epsilon_j|x_j) = \sigma_\epsilon^2\) for all \(i\) and \(j\). In words, this means that the variability of each response variable, \(y_i\) is the same. This assumption also typically incorporates that the error terms are uncorrelated (so 0 covariance as well), conditional on the predictors: \(Cov(\epsilon_i, \epsilon_j | x_i, x_j) = 0\) for \(i \neq j\). This assumption results in the following variance-covariance matrix for \(\epsilon, y \in \mathbb{R}^n\):
\[ Var(\epsilon) = Var(y) = \left[\begin{array}{cccc} \sigma_\epsilon^2 & 0 & \cdots &0 \\ 0 & \ddots & & 0 \\ 0 & \cdots & 0 & \sigma_\epsilon^2 \end{array}\right] = \sigma_\epsilon^2 I^{n \text{ x } n}. \] Together, the linearity and homoskedasticity assumptions mean the centering of \(y_i\) depends only on \(x_{i1},...,x_{ip}\), but the variability of \(y_i\) depends only on \(\epsilon_i\), with the variability of all \(y_i\) being constant.
Here is a plot illustrating the 3 conditions:
beta0 <- 1
beta1 <- 2
sigma <- 2
x <- seq(from = 0, to = 5, by = 0.02)
sim <- 150
x_seq <- rep(x, sim)
error <- rnorm(length(x_seq), 0, sigma)
y_exp <- beta0 + beta1 * x
y_seq <- beta0 + beta1 * x_seq + error
plot_ly(x = x_seq, y = y_seq, type = "scatter", mode = "marker", alpha = 0.5, marker = list(size = 1), name = "Data") %>%
add_trace(x = x, y = y_exp, type = "scatter", mode = "lines", alpha = 1, line = list(width = 3), name = "True Cond. Exp.") %>%
layout(title = "3 Conditions of Linear Models Holding",
xaxis = list(title = "x"),
yaxis = list(title = "y"))Here is a similar example but the true generative model is heteroskedastic:
beta0 <- 1
beta1 <- 2
sigma <- 2
x <- seq(from = 0, to = 5, by = 0.02)
sim <- 150
x_seq <- rep(x, sim)
error <- rnorm(length(x_seq), 0, sigma * x_seq)
y_exp <- beta0 + beta1 * x
y_seq <- beta0 + beta1 * x_seq + error
plot_ly(x = x_seq, y = y_seq, type = "scatter", mode = "marker", alpha = 0.5, marker = list(size = 1), name = "Data") %>%
add_trace(x = x, y = y_exp, type = "scatter", mode = "lines", alpha = 1, line = list(width = 3), name = "True Cond. Exp.") %>%
layout(title = "Plot of Heteroskedasticity",
xaxis = list(title = "x"),
yaxis = list(title = "y"))The key implication from these assumptions is that \(\epsilon \sim MVN(0, \sigma_\epsilon^2 I^{n \text{ x } n})\) as then for \(y = X\beta + \epsilon\), \(y \sim MVN(X\beta, \sigma_\epsilon^2 I^{n \text{ x } n})\) where \(X\) is fixed (we are always conditioning on predictors). Furthermore, since \(\hat{\beta} = (X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T(X\beta + \epsilon) = (X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\epsilon = \beta + (X^TX)^{-1}X^T\epsilon\), then \(\hat{\beta} \sim MVN(\beta, \sigma_\epsilon^2 (X^TX)^{-1})\). So for any particular \(\hat{\beta}_j\), \(Var(\hat{\beta}_j) = \sigma_\epsilon^2(X^TX)^{-1}_{j+1, j+1}\). We now know the distribution of the estimated coefficients!
In particular, under these assumptions,
\[ \frac{\hat{\beta}_j - \beta_j}{SD(\hat{\beta}_j)} \sim N(0,1), \] where \(SD(\hat{\beta}_j) = \sigma_\epsilon\sqrt{(X^TX)^{-1}_{j+1, j+1}}\) is the standard deviation.
The only issue with with this is that we don’t actually know \(\sigma_\epsilon\): this is the true population parameter, but we only have observed a sample.
To remedy the issue with not knowing \(\sigma_\epsilon\), we replace it with an estimate for the standard deviation of the error terms called the Root Mean Squared Error (RMSE), which is defined as:
\[ \hat{\sigma}_\epsilon := \sqrt{\frac{\sum_{i = 1}^{n}e_i^2}{n - p - 1}} = \sqrt{\frac{\sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}{n - p - 1}}. \]
Since the residuals represent realizations of the random error terms, it makes sense to form an estimator for \(\sigma_\epsilon\) by using the sample standard deviation. Note that we divide by \(n - p - 1\) so that \(\hat{\sigma}_\epsilon^2\) is an unbiased estimator for \(\sigma_\epsilon^2\), just as \(\hat{\beta}\) is an unbiased estimator for \(\beta\). Furthermore, \(n-p-1\) is referred to as the degrees of freedom.
So, \(SD(\hat{\beta}_j)\) is estimated by \(\hat{\sigma}_\epsilon\sqrt{(X^TX)^{-1}_{j+1, j+1}}\). This estimate of the standard deviation is called the standard error: i.e., \(se(\hat{\beta}_j) = \hat{\sigma}_\epsilon\sqrt{(X^TX)^{-1}_{j+1, j+1}}\). These are the exact same coefficient standard errors shown in the summary output in R!
Using the standard error, it can then be shown that
\[ \frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-p-1}(0,1), \]
where \(t_{n-p-1}\) is a t-distribution with \(n-p-1\) degrees of freedom. t-distributions have fatter tails than normal distributions. As \(n-p-1 \rightarrow \infty\), the t-distributions’ tails get skinnier and the distribution converges to a standard normal.
With this, we can conduct inference on the true \(\beta_i\) using hypothesis testing. There are 2 main approaches to hypothesis testing (both are equivalent though):
We can set a null hypothesis that \(\beta_j = \gamma\) and compute \(t_{stat} = \frac{\hat{\beta}_j - \gamma}{se(\hat{\beta}_j)}\). If the null hypothesis is correct, the computed \(t_{stat}\) comes from a \(t_{n-p-1}(0,1)\) distribution. We then decide whether we want to test whether the true \(\beta_j\) differs from \(\gamma\) (2-sided test), whether it is strictly greater than \(\gamma\) (1-sided test), or whether it is strictly less than \(\gamma\) (1-sided test). For each test, we can compute the probability of realizing a more extreme value than the \(t_{stat}\) that was observed. And if this probability is less than the given significance level \(\alpha\), we conclude that there is enough evidence to reject the null hypothesis that \(\beta_j = \gamma\). In practice, the most common null hypothesis is \(\beta_j = 0\): i.e., the true coefficient value is 0, meaning the predictor has no predictive power of the response. The resulting \(t_{stat}\) is just \(\frac{\hat{\beta}_j}{se(\hat{\beta}_j)}\), which is exactly each coefficient’s t value reported in the summary output! And R defaults to testing whether the true coefficient differs from 0 (so 2-sided), and so the probability of observing a t value that has a greater magnitude than the reported \(t_{stat}\) is exactly the reported Pr(>|t|) value in the summary output.
We can construct confidence intervals for \(\beta_j\) based on a significance level \(\alpha\). The most common type of confidence intervals are centered at \(\hat{\beta}_j\), and extend in each direction. Specifically, the confidence interval is \([\hat{\beta}_j - t_{1 - \alpha/2, n-p-1}se(\hat{\beta}_j), \hat{\beta}_j + t_{1 - \alpha/2, n-p-1}se(\hat{\beta}_j)]\). We interpret the confidence interval by saying we are \(100 (1 - \alpha)\%\) confident that the interval contains the true \(\beta_i\). Remember, the interval is random, not \(\beta_j\)! If we had many, many datasets, and we constructed the confidence interval for \(\beta_j\) using each dataset, then \(100 (1 - \alpha)\%\) of the confidence intervals would contain \(\beta_j\). Confidence intervals can also be used to reject a null hypothesis that \(\beta_j = \gamma\): if \(\gamma\) doesn’t lie in the confidence interval, then we reject the null.
It may be useful to understand the interpretation of conducting a hypothesis test with the null \(\beta_j = 0\). This test is effectively comparing the predictive performance of the full predictive model, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + ... + \hat{\beta}_px_{ip}\), to the performance of the predictive model without the \(j^{th}\) predictor, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + ... \hat{\beta}_{j-1}x_{i(j-1)} + \hat{\beta}_{j+1}x_{i(j+1)} + ... + \hat{\beta}_px_{ip}\) (this comparison is explicitly made in the corresponding partial F-test, which is closely linked to the t-test). The test is checking whether the increase in predictive performance with the additional predictor variable is statistically significant (large enough), meaning the additional predictor actually improves upon the model without it.
Instead of determining whether only a single \(\hat{\beta}_j\) is statistically significant (i.e., should we include it in our final model), what if we want to determine whether a group of coefficients are statistically significant? In other words, how do we test whether a set of \(k\) additional predictors significantly improves the predictive power of a model with originally only \(p - k\) predictors (plus an intercept)?
For this, we use the partial F-test. Let \(red\) denote the reduced model with only \(p - k\) predictors and \(full\) denote the full model with all \(p\) predictors. Then,
\[ \begin{aligned} F_{stat} :&= \frac{(SSE_{red} - SSE_{full})/(df_{red} - df_{full})}{SSE_{full}/{df_{full}}}\\ &= \frac{(SSE_{red} - SSE_{full})/k}{SSE_{full}/{(n-p-1)}} \end{aligned} \]
As \(SSE\) represents the amount unexplained by the predictive model, \(F_{stat}\) essentially is a metric to gauge the improvement in predictive performance from the reduced to the full model, but penalized at the increase in number of predictors. We perform this penalization because since OLS is simply an \(SSE\) minimization problem, as more predictors are added, the predictive model will never achieve a worse \(SSE\) than a model with fewer predictors. So, there must be some penalty for adding more predictors when evaluating whether it is worthwhile to add more predictors. As a result, it is natural to claim that if \(F_{stat}\) is sufficiently large enough (it is always non-negative), then it is worthwhile to add the \(k\) additional predictors.
Let \(\beta_{p - k + 1},\dots, \beta_p\) be the additional predictors. The hypothesis test for a partial F-test is:
\(H_0: \beta_{p - k + 1} = \dots = \beta_p = 0\),
\(H_a:\) at least one of \(\beta_{p - k + 1},\dots, \beta_p \neq 0\).
Note that having 0 as the coefficient values for the additional predictors is equivalent to simply not even having the predictors involved in the model, hence the null hypothesis is exactly stating the true model is just the reduced model without the additional \(k\) predictors.
Furthermore, it can be shown that under the null, the \(F_{stat}\) follows the \(F_{k, n - p - 1}\) distribution, where \(k\) is the numerator’s degrees of freedom, and \(n - p - 1\) is the denominator’s degrees of freedom. For any set of degrees of freedom, the \(F\) distribution can only take on non-negative values, and so any hypothesis test conducted with this distribution will be one-sided.
Just as before, for a given significance level \(\alpha\), if the p-value corresponding to the \(F_{stat}\) is below \(\alpha\), we reject the null hypothesis that none of the \(k\) additional predictors are in the true generative model.
There are 2 important specific cases of the partial F-test.
The first case is when \(k = 1\). Under the partial F-test, this means we are testing whether a single coefficient, \(\beta_j\), is non-zero. But, we already have the t-test for this exact test. As it turns out, the \(F_{1, n - p - 1}\) distribution is exactly the same as the \(t_{n - p - 1}^2\) distribution, which then means the partial F-test for \(k = 1\) is exactly the same as doing the t-test for \(\hat{\beta}_j\).
The second case is when \(k = p\). More specifically, for the null hypothesis that \(\beta_1 = \dots = \beta_p = 0\). This means the reduced model is simply the intercept. This test is called the overall F-test, as it is testing whether any of the predictors are actually meaningful. In other words, if you fail to reject the null in an overall F-test, then absolutely none of your predictors can meaningfully explain the response variable (through a linear model). R’s summary output reports the \(F_{stat}\) and corresponding p-value for the overall F-test.
We have now seen \(\hat{\beta}\) is a random vector as it is a function of the observed dataset, and each dataset can be different. We also have seen how to construct confidence intervals for the \(j^{th}\) coefficient, which can then be used to conduct a hypothesis test on the true \(\beta_j\) and also to conduct inference on the expected difference in response values if \(x_j\) differs in 1 unit.
These results can be further generalized to involving multiple coefficients: i.e., linear combinations of the coefficients. Let \(A \in \mathbb{R}^{d \text{ x } (p + 1)}\), and \(c \in \mathbb{R}^d\) be constants. Then, as \(y = X\beta + \epsilon\), where \(\epsilon \sim MVN(0, \sigma_\epsilon I^{n \text{ x } n})\),
As creating confidence intervals and performing hypothesis tests rely on knowing the distributions of the random variables, the above results (in particular the last bullet) provide the distribution for any linear combination of \(\hat{\beta}\). So, for instance, if two observations differ only in predictors \(x_j\) and \(x_m\) by 1 unit each, then the difference in response values follows the \(t_{n-p-1}(A\hat{\beta}, \hat{\sigma}_\epsilon^2A(X^TX)^{-1}A^T)\) distribution where \(A \in \mathbb{R}^{p + 1}\), with the \(j^{th}\) and \(m^{th}\) elements being 1 and the rest being 0.
Another type of inference we can now conduct is on the conditional mean, \(\mu_{y|\tilde{x}}\), at the point \(\tilde{x}\). Remember, for the inputs \(1, \tilde{x}_1, \dots, \tilde{x}_p\), the response variable is \(y = \beta_0 + \beta_1\tilde{x}_1 + \dots + \beta_p\tilde{x}_p + \tilde{\epsilon}\) with \(\mu_{y|\tilde{x}} = \mathbb{E}[y|\tilde{x}] = \beta_0 + \beta_1\tilde{x}_1 + \dots + \beta_p\tilde{x}_p\). And the predictive model outputs \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\tilde{x}_1 + \dots + \hat{\beta}_p\tilde{x}_p\). In other words, the value outputted by the predictive model we created is actually our guess for the conditional expectation of the response variable given the covariates’ values, \(\hat{\mu}_{y|\tilde{x}}\)! We even have the distribution,
\[ \hat{\mu}_{y|\tilde{x}} \sim N(\mu_{y|\tilde{x}}, \sigma_\epsilon^2\tilde{x}^T(X^TX)^{-1}\tilde{x}), \]
or by using the estimator \(\hat{\sigma}_\epsilon^2\) in place of \(\sigma_\epsilon^2\),
\[ \hat{\mu}_{y|\tilde{x}} \sim t_{n-p-1}\left(\mu_{y|\tilde{x}}, \hat{\sigma}_\epsilon^2\tilde{x}^T(X^TX)\tilde{x}\right). \]
So, we can create pointwise confidence intervals for \(\mu_{y|\tilde{x}}\) at each \(\tilde{x}\). It also means we can conduct hypothesis testing on the true \(\mu_{y|\tilde{x}}\).
But if our prediction equation provides an estimate for the conditional mean at the point \(\tilde{x}\), (i.e., \(\hat{\mu}_{y|\tilde{x}}\)), then if we had a new observation at \(\tilde{x}\), what would our guess for the value of the response variable be? It would be exactly the same \(\hat{\mu}_{y|\tilde{x}}\) since the expectation (specifically the conditional expectation as we always know the input values) is our best guess for any new observation. However, our range of error for the prediction is actually going to be a bit different. Here is the intuition: for a new observation \(y^*\) at \(\tilde{x}\), \(y^* = \beta_0 + \beta_1\tilde{x}_1 + \dots + \beta_p\tilde{x}_p + \epsilon = \tilde{x}^T\beta\). So, if we did hypothetically know \(\beta\) and \(\tilde{x}\) (but not the true value \(y^*\)), then our best guess for \(y^*\) would be \(\mu_{y|\tilde{x}} = \tilde{x}^T\beta\). So, our prediction would differ from the true value due to \(\epsilon\). This means if we were to form a prediction interval for \(y^*\) centered at our best guess, the standard deviation would only need the randomness of \(\epsilon\). But in practice, we don’t know \(\beta\) and instead estimate it with \(\hat{\beta}\), leading to a best prediction guess of \(\hat{\mu}_{y|\tilde{x}}\). But this means our prediction would differ from the true \(y^*\) due to both \(\epsilon\) and incorrectly guessing the conditional expectation. So the standard deviation used for the prediction interval has to contain the randomness of \(\epsilon\) as well as the randomness in the deviation of \(\hat{\mu}_{y|\tilde{x}}\) from \(\mu_{y|\tilde{x}}\). And since our observed sample of size \(n\) is independent from the new observation, \(\epsilon\) is independent of \(\hat{\mu}_{y|\tilde{x}}\). So, just based on intuition, we would think that the variance used for prediction intervals should just be the sum of the variance of \(\epsilon\) plus the variance of \(\hat{\mu}_{y|\tilde{x}}\) (remember \(\mu_{y|\tilde{x}}\) is constant so it doesn’t impact variance). And this is exactly what we see when we formally calculate the deviation of \(y^*\) from our best guess, \(\hat{\mu}_{y|\tilde{x}}\):
\[ \begin{aligned} \text{Var}(y^* - \hat{\mu}_{y|\tilde{x}}) &= \text{Var}(\epsilon^*) + \text{Var}(\hat{\mu}_{y|\tilde{x}}) \\ &= \sigma_\epsilon^2 + \sigma_\epsilon^2\tilde{x}^T(X^TX)^{-1}\tilde{x} \\ &= \sigma_\epsilon^2(1 + \tilde{x}^T(X^TX)^{-1}\tilde{x}). \end{aligned} \]
Then,
\[ y^* \sim N\left(\hat{\mu}_{y|\tilde{x}}, \sigma_\epsilon^2(1 + \tilde{x}^T(X^TX)^{-1}\tilde{x})\right), \]
or by using the estimator \(\hat{\sigma}_\epsilon^2\) in place of \(\sigma_\epsilon^2\),
\[ y^* \sim t_{n-p-1}\left(\hat{\mu}_{y|\tilde{x}}, \hat{\sigma}_\epsilon^2(1 + \tilde{x}^T(X^TX)^{-1}\tilde{x})\right). \]
Here is an example of the difference between confidence and prediction intervals:
# using mortgage example with 1 predictor
# calculate the confidence and prediction intervals
intervals <- data.frame(matrix(rep(data_diff$Treasury, 5), ncol = 5))
intervals[,2:3] <- predict(lm_mortgage, newdata = data.frame(Treasury = data_diff$Treasury), interval = "confidence", level = 0.95)[,2:3]
intervals[,4:5] <- predict(lm_mortgage, newdata = data.frame(Treasury = data_diff$Treasury), interval = "prediction", level = 0.95)[,2:3]
intervals <- intervals[order(intervals$X1),]
# plot everything
plot_ly() %>%
add_trace(x = data_diff$Treasury, y = data_diff$Mortgage, type = "scatter", mode = "markers", name = "Observed Data", alpha = 0.5) %>%
add_trace(x = data_diff$Treasury, y = lm_mortgage$fitted.values, type = "scatter", mode = "lines", name = "Fitted Line", alpha = 1) %>%
add_trace(x = intervals$X1, y = intervals$X2, type = "scatter", mode = "lines", name = "Pointwise Confidence Interval (LB)", alpha = 1, line = list(color = 'blue')) %>%
add_trace(x = intervals$X1, y = intervals$X3, type = "scatter", mode = "lines", name = "Pointwise Confidence Interval (UB)", alpha = 0.75, line = list(color = 'blue')) %>%
add_trace(x = intervals$X1, y = intervals$X4, type = "scatter", mode = "lines", name = "Pointwise Prediction Interval (LB)", alpha = 1, line = list(color = 'red')) %>%
add_trace(x = intervals$X1, y = intervals$X5, type = "scatter", mode = "lines", name = "Pointwise Prediction Interval (LB)", alpha = 0.75, line = list(color = 'red')) %>%
layout(title = "Quarterly Change in 30-Year Avg. Mortgage Rate vs 10-Year Treasury Yield",
xaxis = list(title = "Change in Treasury Yield (%)"),
yaxis = list(title = "Change in Mortgage Rate (%)"))Most of the results and inference techniques so far have relied on the assumptions of linearity, homoskedasticity, and normality. As a result, it’s always good to check whether these assumptions actually hold in the dataset. Here are some simple methods of testing the assumptions using diagnostic plots.
Linearity: linearity holds as long as for \(y = X\beta + \epsilon\), \(\mathbb{E}[\epsilon_i|x_i] = 0\). And since we only have the residuals, \(e_1,\dots,e_n\), if there seems to be no relationship between the residuals and the predictors based on plots, then we assume linearity holds. If there are many predictors, then we may just plot the residuals vs. the fitted values.
Homoskedasticity: under homoskedasticity, \(\text{Var}(\epsilon_i|x_i) = \sigma_\epsilon^2\). And again, since we only have the residuals, we will use them to test for constant variance. However, for \(1_i\) being a vector with all 0s except for 1 in the \(i^{th}\) element, note that
\[ \begin{aligned} \text{Var}(e_i|x_i) &= \text{Var}(y_i - x_i^T\hat{\beta}|x_i) \\ &= \text{Var}(1_i^T(y - Hy)|x_i) \\ &= 1_i^T(I^{n \text{ x } n} - H)\text{Var}(y|x_i)(I^{n \text{ x } n}-H)^T1_i \\ &= \sigma_\epsilon^21_i^T(I^{n \text{ x } n} - H)1_i \\ &= \sigma_\epsilon^2(1 - H_{i,i}), \end{aligned} \]
where the fourth equality is due to \((I^{n \text{ x } n} - H)I^{n \text{ x } n}-H)^T = (I^{n \text{ x } n} - H)\). As \(H_{i,i}\) is not constant across all \(i\), \(\text{Var}(e_i|x_i)\) is actually not constant across all \(i\). This means the realized dispersion of the residuals cannot be directly used to assess homoskedasticity and the residuals themselves aren’t homoskedastic. But it can be seen that the modified residuals \(\frac{e_i}{\sqrt{1 - H_{i,i}}}\) are in fact homoskedastic with a constant conditional variance of \(\sigma_\epsilon^2\). But more commonly, the standardized residuals are used to remove the impact of units:
\[ r_i := \frac{e_i}{\hat{\sigma}_\epsilon\sqrt{1 - H_{i,i}}}. \]
The standardized residuals are plotted against each covariate, or if there are too many covariate, they are plotted against the fitted values. Under homoskedasticity, there should be no clear trend in these plots.
Here is an example of the diagnostic plots:
# using mortgage example with 2 predictors
X <- model.matrix(lm_mortgage2)
H <- X%*%solve(t(X)%*%X)%*%t(X)
H_diag <- diag(H)
RMSE <- summary(lm_mortgage2)$sigma
y_hat <- lm_mortgage2$fitted.values
res <- lm_mortgage2$residuals
stand_res <- lm_mortgage2$residuals/(RMSE * (1 - H_diag) ^ 0.5)
par(mfrow = c(1,3), mar=c(3,3,2,1), mgp = c(1.8,.5, 0))
# linearity diagnostic plot
plot(y_hat, res, pch = 20, cex = 0.75,
main = "Linearity Diag. Plot",
cex.main = 1,
xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, col = "blue", lty = 2)
# homoskedasticity diagnostic plot
plot(y_hat, stand_res, pch = 20, cex = 0.75,
main = "Homoskedasticity Diag. Plot",
cex.main = 1,
xlab = "Fitted Values",
ylab = "Standardized Residuals")
abline(h = 0, col = "blue", lty = 2)
# normality diagnostic plot
qqnorm(stand_res, pch = 20,
main = "Normality Diag. Plot",
cex.main = 1,
xlab = "Quantiles of Standard Normal",
ylab = "Quantiles of Standardized Residuals")
qqline(stand_res, col = "blue", lty = 2)We have collected n observations for a total of \(p\) covariates and the response variable. Model selection refers to optimally selecting which covariates to use in the final model. Initially, it might seem as though having more covariates is only more helpful as it enables the model to be more flexible, so it can fit the data better. Generally, this additional flexibility by having more complexity (more covariates) is called lower model bias: it the predictive model is more able to capture any trend that the true generative model produces. However, there is a tradeoff: too much complexity may result in too much model flexibility, meaning the model will overfit to the training data. Overfitting can result in the model focusing too much on perfectly predicting response values rather than capturing the true trend. Remember, there is always the randomness due to \(\epsilon\), so even if we had the true \(\beta\), we still wouldn’t be able to perfectly predict the exact value of a new \(y^*\) given the appropriate covariate values, so the goal of building a linear model is not to be able to perfectly predict each response value, but instead to capture the true trend, which is the \(\beta\). When a predictive model has too much complexity, we say it has high variance: the model’s parameters which are based on the dataset become too sensitive to the dataset as the model is trying to fit to noise rather than capturing the true trend, and noise can vary wildly across different datasets. In short, model selection is about balancing the bias-variance tradeoff.
Here is an example of the bias-variance tradeoff using polynomial regression (as the degree initially increases, the model seems to better capture the true trend, but eventually once the degree is too high, the model is fitting to noise and losing the true trend):
# true model is y = sin(x) + epsilon, where epsilon ~ N(0, 0.25)
x <- seq(0, 5, length = 100)
set.seed(1234)
y <- sin(x) + rnorm(100, mean = 0, sd = 0.25)
# run regression of y = beta0 + beta1 * x
lm_1 <- lm(formula = y~x)
predict1 <- predict(lm_1, new_data = x)
# run regression of y = beta0 + beta1 * x + beta2 * x^2
lm_2 <- lm(formula = y~poly(x, degree = 2, raw = TRUE))
predict2 <- predict(lm_2, new_data = x)
# run regression of y = beta0 + beta1 * x + ... + beta5 * x^5
lm_5 <- lm(formula = y~poly(x, degree = 5, raw = TRUE))
predict5 <- predict(lm_5, new_data = x)
# run regression of y = beta0 + beta1 * x + ... + beta15 * x^15
lm_15 <- lm(formula = y~poly(x, degree = 15, raw = TRUE))
predict15 <- predict(lm_15, new_data = x)
# plot results
plot_ly(x = x, y = y, type = "scatter", mode = "markers", name = "Observed Data", alpha = 0.5) %>%
add_trace(x = x, y = predict1, type = "scatter", mode = "lines", name = "Degree 1", alpha = 1) %>%
add_trace(x = x, y = predict2, type = "scatter", mode = "lines", name = "Degree 2", alpha = 1) %>%
add_trace(x = x, y = predict5, type = "scatter", mode = "lines", name = "Degree 5", alpha = 1) %>%
add_trace(x = x, y = predict15, type = "scatter", mode = "lines", name = "Degree 15", alpha = 1) %>%
layout(title = "Comparison between Models of Different Complexities",
xaxis = list(title = "x"),
yaxis = list(title = "y"))So how do we actually construct a model? In total, there are \(2^p\) total possible models, as for each predictor, it can either be included or not (2 choices), and we always include the intercept. For even a modest \(p\), like 10, there are already way too many models to sift through. So, we need a smarter algorithm of testing models. There are 3 main iterative methods to model selection, all of which are termed stepwise regression:
Backward Selection: start with the model with all \(p\) predictors . Then choose the “worst” predictor to remove, if any, based on some prespecified condition. If a predictor is removed, now there \(p - 1\) predictors left. We then choose the “worst” remaining predictor to remove, if any. Keep repeating this process until you are left with a model where none of the remaining predictors should be removed based on the condition. Note that in backward selection, if a predictor was removed in a step, it has been permanently removed: it can never be added back in any future step.
Forward Selection: start with the model with only an intercept. Then choose the “best” predictor to add, if any, based on some prespecified condition. If a predictor is added, then the model has 1 predictor. From the remaining covariates that haven’t been added, choose the “best” one to add, if any. Keep repeating this process until you are left with a model where none of the non-selected predictors should be added based on the condition. Note that in forward selection, if a predictor was added in a step, it has been permanently added: it can never be removed in any future step.
Hybrid Selection: this is combination of forward and backward selection. Start with either all \(p\) predictors, just an intercept, or some subset of the predictors. If starting with all predictors, then the first step is same as in backward selection. If starting with just an intercept, then the first step is the same as in forward selection. Once the first step has been made, the algorithm then decides based on the prespecified condition whether it is more optimal to add a variable - and if so, which one - remove a variable - and if so, which one - or keep the model as it is and terminate the iteration. This process repeats until the algorithm decides to stop changing the model. In other words, at each future step, a predictor could be added or removed, making hybrid selection more flexible as it doesn’t constrain a predictor remain in/out of the final model based on past steps.
Note that the 3 different algorithms are not guaranteed to produce the same final model. Also, these algorithms do not necessarily find the most optimal model based on the condition, as this would require actually going through all \(2^p\) possible models: the algorithms find a model that is “good enough.”
So far, we’ve seen how to use t-tests and partial F-tests to determine whether to include certain predictors in a model. More generally, these types of model selection techniques are generally referred to as hypothesis test-based methods. How do apply this to the selection techniques (specifically backward and forward selection)? The core of hypothesis test-based methods is using p-values to determine whether a covariate meaningfully improves predictive power. So, if using backward selection, we look at the p-values for each predictor, and pick the greatest p-value. If this value is above the significance level, then we remove the predictor, rerun the regression to get new coefficients and p-values, and just repeat the removal decision. If using forward selection, then when starting with just an intercept, we consider a total of \(p\) models, where each one is just 1 predictor added to the intercept. The p-value for each covariate is calculated, and we select the largest one. If this p-value is below the significance level, then the covariate is added. Then we run a total of \(p - 1\) models, where each model is the intercept and the previously added predictor plus a new remaining covariate. We select the covariate with the lowest p-value, and if this is below the significance level, it is added to the model, and this process repeats.
Although hypothesis test-based methods are used in practice, there is a pretty big problem with them: they rely on comparing p-values across many, many different models. As a p-value just represents the probability of observing a more extreme \(\hat{\beta}_j\) when doing just 1 covariate addition/removal, p-values can be very high/low just due to chance, so aren’t reliable. In other words, just due to random chance of what the observed data was, \(\hat{\beta}_j\) can be very high, even if \(\beta_j = 0\). If it’s very high, the p-value is then very low, which based on hypothesis testing would lead to the conclusion that the \(j^{th}\) variable is meaningful, even though it actually is not. When we just have a dataset and run 1 regression, we are fully aware of this error of incorrectly rejecting the null that \(\beta_j = 0\) (called Type 1 Error), which we control through choose the significance level. However, when we keep just slightly changing the selected covariates and run new regressions, we begin to increase the chance of getting an exceptionally high \(\hat{\beta}_j\) (the same idea applies for if \(\beta_j \neq 0\) but we falsely accept the null that \(\beta_j = 0\) because of a really low \(\hat{\beta}_j\) in one of the multiple regressions that occurred just by chance). In short, when doing multiple comparisons, the p-values aren’t really a reliable method of determining whether to include covariates.
Criterion-based methods tend to be better methods for model selection than hypothesis test-based methods. A criterion-based method searches for
\[ \underset{\text{all models}}{\text{argmin}} \text{ }Criterion. \]
The simplest criterion is
\[ \underset{\beta}{\text{argmin}} \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2 = \underset{\beta}{\text{argmin}} \text{ }SSE, \]
for which the solution is just the model using all \(p\) predictors (as there is no penalty for complexity). Also note that minimizing \(SSE\) is equivalent to maximizing \(R^2\).
For a candidate model with \(m\) predictors, some other commonly used criterion that penalize complexity are:
Adjusted \(R^2 = 1 - \frac{SSE/(n-m-1)}{SST/(n-1)}\). The optimization problem would be to maximize Adjusted \(R^2\), which is equivalent to minimizing \(\frac{SSE}{n-m-1}\).
Mallow’s \(C_p = \frac{SSE}{\hat{\sigma}_\epsilon(p)} + 2(m+1) - n\), where \(\hat{\sigma}_\epsilon(p)\) is the RMSE with all \(p\) predictors. The optimization problem would be to minimize Mallow’s \(C_p\).
Akaike Information Criterion \((AIC) = n\ln(\frac{SSE}{n}) + 2m\). The optimization problem would be to minimize \(AIC\).
Bayesian Information Criterion \((BIC) = n\ln(\frac{SSE}{n}) + \ln(n)m\). The optimization problem would be to minimize \(BIC\).
\(AIC\) and \(BIC\) are both theoretically derived, and have very similar forms: they just differ in the penalizer for model complexity as \(AIC\) uses 2 whereas \(BIC\) uses \(\ln(n)\). For a fixed model size \(m\), all five of the criterion would produce the same set of optimal covariates. However, they differ in which model size is optimal. The goal of finding the best subset of covariates using a criterion-based approach is called Best Subset Selection. And using any of the criterion with a forward, backward, or hybrid selection is very straightforward: at each step, the decision that decision minimizes the criterion the most is chosen.
With different types of criterion, how do we decide which final model to use? For this, we consider maximizing out-of-sample performance as the goal of model building is to ultimately be able to use it to form reasonable predictions of response values when new inputs are given (in-sample performance refers to predictive performance on the data used to train the model). To evaluate out-of-sample performance of different models, we randomly split our observed dataset into a training set (which will be used to build the model) and a disjoint testing set (which will be used to test the performance of the model on new data: i.e., data that wasn’t used to train the model). Typically, the size of the training set (\(n_{train}\)) is larger than the size of the testing set (\(n_{test}\)). And a common out-of-sample performance metric is Out-of-Sample \(R^2 (OSR^2)\) :
\[ OSR^2 := 1 - \frac{\sum_{i = 1}^{n_{test}}(y_i - \hat{y}_i)^2}{\sum_{i = 1}^{n_{test}}(y_i - \overline{y}_{train})^2}. \]
Note that just like the traditional \(R^2\) (now referred to as in-sample \(R^2\)), \(OSR^2\) compares the performance of the predictive model on the testing set against the simplest predictive model, which just the sample mean of the response values in the training set. Also, unlike in-sample \(R^2\), \(OSR^2\) can be negative.
Here is an example of applying the AIC and BIC and evaluating out-of-sample performance:
# create dataframe with the first couple columns being x ^ col, and remaining columns being functions of x
x_full <- data.frame(matrix(0, nrow = 100, ncol = 10))
for (col in 1:7){
x_full[,col] <- x ** col
}
x_full[,8] <- sin(sqrt(x))
x_full[,9] <- log(x + 0.1)
x_full[,10] <- cos(x)
set.seed(134)
y <- sin(x) + rnorm(100, mean = 0, sd = 0.25)
data_full <- cbind(y, x_full)
# randomly select 70% of rows for training set and remaining 30% for testing set
set.seed(134)
ntrain <- 70
ntest <- 100 - ntrain
train <- data_full[sample(1:100, ntrain, replace = F),]
test <- data_full[-sample(1:100, ntrain, replace = F),]
# first fit the empty (only intercept) and full models
lm_empty <- lm(y~1, data = train)
lm_full <- lm(y~., data = train)
lm_half <- lm(y~X1+X2+X3+X4+X5, data = train)
# summary dataframe: note that number of predictors does not include intercept
summary_df <- data.frame(matrix(0, nrow = 3, ncol = 6))
colnames(summary_df) <- c("Foward # Predictors", "Foward ORS^2", "Backward # Predictors","Backward ORS^2", "Hybrid # Predictors", "Hybrid ORS^2")
rownames(summary_df) <- c("R^2", "AIC", "BIC")
simplest_SSE <- sum((test$y - mean(train$y)) **2)
# stepwise with R^2 (note this will provide exact same results as applying lin reg to full model)
lm_f <- step(lm_empty, scope = list(lower = lm_empty, upper = lm_full), direction = "forward", k = 0, trace = 0)
new_SSE <- sum((test$y - predict(lm_f, newdata = test)) ** 2)
f_osr <- 1 - new_SSE/simplest_SSE
lm_b <- step(lm_full, scope = list(lower = lm_empty, upper = lm_full), direction = "backward", k = 0, trace = 0)
new_SSE <- sum((test$y - predict(lm_b, newdata = test)) ** 2)
b_osr <- 1 - new_SSE/simplest_SSE
lm_h <- step(lm_half, scope = list(lower = lm_empty, upper = lm_full), direction = "both", k = 0, trace = 0)
new_SSE <- sum((test$y - predict(lm_h, newdata = test)) ** 2)
h_osr <- 1 - new_SSE/simplest_SSE
# stepwise with AIC (note that AIC is default so not specifying in function)
lm_fAIC <- step(lm_empty, scope = list(lower = lm_empty, upper = lm_full), direction = "forward", trace = 0)
new_SSE <- sum((test$y - predict(lm_fAIC, newdata = test)) ** 2)
fAIC_osr <- 1 - new_SSE/simplest_SSE
lm_bAIC <- step(lm_full, scope = list(lower = lm_empty, upper = lm_full), direction = "backward", trace = 0)
new_SSE <- sum((test$y - predict(lm_bAIC, newdata = test)) ** 2)
bAIC_osr <- 1 - new_SSE/simplest_SSE
lm_hAIC <- step(lm_half, scope = list(lower = lm_empty, upper = lm_full), direction = "both", trace = 0)
new_SSE <- sum((test$y - predict(lm_hAIC, newdata = test)) ** 2)
hAIC_osr <- 1 - new_SSE/simplest_SSE
# stepwise with BIC
lm_fBIC <- step(lm_empty, scope = list(lower = lm_empty, upper = lm_full), direction = "forward", k = log(ntrain), trace = 0)
new_SSE <- sum((test$y - predict(lm_fBIC, newdata = test)) ** 2)
fBIC_osr <- 1 - new_SSE/simplest_SSE
lm_bBIC <- step(lm_full, scope = list(lower = lm_empty, upper = lm_full), direction = "backward", k = log(ntrain), trace = 0)
new_SSE <- sum((test$y - predict(lm_bBIC, newdata = test)) ** 2)
bBIC_osr <- 1 - new_SSE/simplest_SSE
lm_hBIC <- step(lm_half, scope = list(lower = lm_empty, upper = lm_full), direction = "both", k = log(ntrain), trace = 0)
new_SSE <- sum((test$y - predict(lm_hBIC, newdata = test)) ** 2)
hBIC_osr <- 1 - new_SSE/simplest_SSE
# put everything in summary dataframe
summary_df$`Foward # Predictors` <- c(length(lm_f$coef) - 1, length(lm_fAIC$coef) - 1, length(lm_fBIC$coef) - 1)
summary_df$`Backward # Predictors` <- c(length(lm_b$coef) - 1, length(lm_bAIC$coef) - 1, length(lm_bBIC$coef) - 1)
summary_df$`Hybrid # Predictors` <- c(length(lm_h$coef) - 1, length(lm_hAIC$coef) - 1, length(lm_hBIC$coef) - 1)
summary_df$`Foward ORS^2` <- c(f_osr, fAIC_osr, fBIC_osr)
summary_df$`Backward ORS^2` <- c(b_osr, bAIC_osr, bBIC_osr)
summary_df$`Hybrid ORS^2` <- c(h_osr, hAIC_osr, hBIC_osr)
summary_df## Foward # Predictors Foward ORS^2 Backward # Predictors Backward ORS^2
## R^2 10 -746.9570952 10 -746.9570953
## AIC 3 0.8406896 7 -131.4288066
## BIC 3 0.8406896 3 0.8519249
## Hybrid # Predictors Hybrid ORS^2
## R^2 10 -746.9570953
## AIC 3 0.8570114
## BIC 3 0.8570114
To potentially improve out of sample performance even further, we might also consider choosing our criterion more smartly. Criterion are generally of the form (or can be shown to converge to the form) \(n\ln(\frac{SSE}{n}) + \lambda m\), where \(\lambda \geq 0\). For \(\lambda = 0\), minimizing the criterion is equivalent to minimizing the in-sample \(R^2\) (low bias, high variance). For \(\lambda = 2\), the criterion is \(AIC\), and for \(\lambda = \ln(n)\), the criterion is \(BIC\). And as \(\lambda \rightarrow \infty\), the criterion penalizes additional predictors so much that the final model would just be the intercept (high bias, low variance). The goal is to choose an appropriate \(\lambda\) using the training set such that out-of-sample performance is maximized: in other words, by choosing an optimal \(\lambda\), we can better balance the bias-variance tradeoff. To do this, we set a grid of possible \(\lambda\) values, and for each value in the grid, run a selection method (forward, backward, or hybrid) on the training set to get an final model: i.e., for each \(\lambda\) in the grid, it has a corresponding final model. But the issue is now we somehow have to evaluate the performance of each model on a new dataset to determine which model produces the greatest \(OSR^2\). We can’t use the testing set to do this because we are still in the stage of determining the absolute final model as we are selecting between a set of candidate final models. This introduces the idea of k-fold cross validation.
In k-fold cross validation, we randomly split the training set into k disjoint sets (folds). We then set aside the \(1^{st}\) fold and train the model on only the remaining the k-1 folds. Then the \(1^{st}\) fold is used a “preliminary testing set” (even though each fold is technically classified in the training set, it is not actually part of the official testing set). We record the performance, and now restart by setting the \(2^{nd}\) fold aside, and using the remaining k-1 folds to train the model from scratch, and evaluate the performance on the \(2^{nd}\) fold. We repeat this process until all k folds have each been used to evaluate the performance of the model. Then all of the “out-of-sample” performances can be considered to make a final decision for the final model before actually applying it on the true testing set.
Applying k-fold cross validation to our problem of finding the optimal \(\lambda\), for each \(\lambda\) in the grid, the chosen selection method will run on k-1 folds at a time, and the \(OSR^2\) on the left-out fold will be saved. So, in total, for each \(\lambda\), there will be k saved \(OSR^2\) values, each of which correspond to a different optimal model (even for a fixed \(\lambda\), the optimal model when trained on folds 1,…,k-1 may be different than the optimal model when trained on folds 2,…,k). Then, the \(\lambda\) with the greatest average \(OSR^2\) is selected, call it \(\lambda^*\). Now, the criterion is set as \(n\ln(\frac{SSE}{n}) + \lambda^* m\). Using this criterion, the selection method is run on the entire training set (i.e., all k folds are put back together and run the algorithm on the entire set) to generate the absolute final model. This model’s performance is then evaluated on the actual testing set.
So far, we’ve only explicitly mentioned cases where the trends immediately looks linear (except for some of the examples in Model Selection). But linear models are way more versatile. A linear model just means it is a linear combination (plus intercept) of the predictors: it says nothing about the form of the predictors. All of the following are examples of linear models:
Remember, we always assume the values of the predictors are known beforehand. So, if we define a new predictor as some function of other predictors, like \(x_{i1}x_{i2}\), then we know the value it takes on in the \(i^{th}\) observation as we have \(x_{1i}\) and \(x_{2i}\). Then, just define \(x_{i3} := x_{i1}x_{i2}\), so the last example is just \(y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \epsilon_i\). In other words, linear models can capture nonlinear trends! However, when new predictors are defined as functions of other predictors, there is an explicit dependence among the predictors. As a result, we may lose interpretability of the \(\hat{\beta}\) since we can’t control all other predictors to be constant in the interpretation (e.g., if two individuals differ in \(x_{i1}x_{i2}\) by 1 unit, they must differ in at least one of \(x_{i1}\) and \(x_{i2}\) as well).