This note is a complete practical companion for a postgraduate lecture on multiple linear regression. It is written so that a student can read the theory, run the code, reproduce the calculations and understand how regression is actually used.
Core reference for the lecture: Montgomery, Peck and Vining, Introduction to Linear Regression Analysis, 6th edition, Wiley.
We will study the model \[ Y_i=\beta_0+\beta_1x_{i1}+\cdots+\beta_kx_{ik}+\varepsilon_i, \qquad i=1,\ldots,n, \] or, in matrix notation, \[ \mathbf y=X\boldsymbol\beta+\boldsymbol\varepsilon. \]
The main ideas are the least-squares estimator and the normal equations, the geometry of projection through the hat matrix, exact \(t\)- and \(F\)-inference, interval estimation and prediction, diagnostics and influence, multicollinearity, and regularization when ordinary least squares breaks down.

Assignment 1 (Multiple Linear Regression Theory) accompanies this note. It contains theory, derivations, interpretation-based questions, and matrix-form exercises on multiple linear regression.
We use the built-in mtcars dataset. The response is
miles per gallon (mpg). We use car weight
(wt), horsepower (hp) and quarter-mile time
(qsec) as predictors. The purpose is not automotive science; it is to learn the regression workflow with reproducible data.
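The chunk that produced the structure, first rows and summaries shown below is not included in this extract; a minimal equivalent, using only base R, would be:

str(mtcars)
head(mtcars)
summary(mtcars[, c("mpg", "wt", "hp", "qsec")])
summary(mtcars[, c("disp", "cyl")])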
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#> mpg wt hp qsec
#> Min. :10.40 Min. :1.513 Min. : 52.0 Min. :14.50
#> 1st Qu.:15.43 1st Qu.:2.581 1st Qu.: 96.5 1st Qu.:16.89
#> Median :19.20 Median :3.325 Median :123.0 Median :17.71
#> Mean :20.09 Mean :3.217 Mean :146.7 Mean :17.85
#> 3rd Qu.:22.80 3rd Qu.:3.610 3rd Qu.:180.0 3rd Qu.:18.90
#> Max. :33.90 Max. :5.424 Max. :335.0 Max. :22.90
#> disp cyl
#> Min. : 71.1 Min. :4.000
#> 1st Qu.:120.8 1st Qu.:4.000
#> Median :196.3 Median :6.000
#> Mean :230.7 Mean :6.188
#> 3rd Qu.:326.0 3rd Qu.:8.000
#> Max. :472.0 Max. :8.000
A quick scatterplot matrix helps us see pairwise relationships.
pairs(mtcars[, c("mpg", "wt", "hp", "qsec", "disp")],
main = "Pairwise view of response and candidate predictors",
pch = 19, col = "steelblue")

Let \(p=k+1\), counting the intercept. Define \[ \mathbf y=\begin{bmatrix}y_1\\y_2\\ \vdots\\y_n\end{bmatrix},\quad X=\begin{bmatrix} 1&x_{11}&x_{12}&\cdots&x_{1k}\\ 1&x_{21}&x_{22}&\cdots&x_{2k}\\ \vdots&\vdots&\vdots&&\vdots\\ 1&x_{n1}&x_{n2}&\cdots&x_{nk} \end{bmatrix},\quad \boldsymbol\beta=\begin{bmatrix}\beta_0\\\beta_1\\ \vdots\\\beta_k\end{bmatrix}. \] Then \[ \mathbf y=X\boldsymbol\beta+\boldsymbol\varepsilon. \]
The standard assumptions for the classical linear model are \[ \mathbb E(\boldsymbol\varepsilon\mid X)=0,\qquad \operatorname{Var}(\boldsymbol\varepsilon\mid X)=\sigma^2 I_n. \] For exact finite-sample \(t\)- and \(F\)-inference, we additionally assume \[ \boldsymbol\varepsilon\mid X\sim N_n(0,\sigma^2 I_n). \]
Important warning: if \(p\ge n\), or if the columns of \(X\) are nearly linearly dependent, then \(X^\top X\) is singular or ill-conditioned. OLS can become non-unique or numerically unstable. This motivates variable reduction, principal components, ridge regression or other regularization methods.
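The working model is fitted next. The original chunk is not shown here, but the summary output that follows is consistent with:

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(fit)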
#>
#> Call:
#> lm(formula = mpg ~ wt + hp + qsec, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.8591 -1.6418 -0.4636 1.1940 5.6092
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 27.61053 8.41993 3.279 0.00278 **
#> wt -4.35880 0.75270 -5.791 3.22e-06 ***
#> hp -0.01782 0.01498 -1.190 0.24418
#> qsec 0.51083 0.43922 1.163 0.25463
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.578 on 28 degrees of freedom
#> Multiple R-squared: 0.8348, Adjusted R-squared: 0.8171
#> F-statistic: 47.15 on 3 and 28 DF, p-value: 4.506e-11
The fitted model is \[ \widehat{mpg}=\widehat\beta_0+ \widehat\beta_1 wt+ \widehat\beta_2 hp+ \widehat\beta_3 qsec. \]
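The tables in this note are produced by a helper called kable_num, whose definition is not shown in this extract. Presumably it is a thin wrapper around knitr::kable created in a setup chunk, along the lines of this sketch:

# Assumed helper (actual definition not shown): render a rounded markdown table
kable_num <- function(x, digits = 3, caption = NULL) {
  knitr::kable(x, digits = digits, caption = caption)
}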
coef_table <- summary(fit)$coefficients
kable_num(coef_table, digits = 4, caption = "Coefficient table for the model mpg ~ wt + hp + qsec")

| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 27.6105 | 8.4199 | 3.2792 | 0.0028 |
| wt | -4.3588 | 0.7527 | -5.7909 | 0.0000 |
| hp | -0.0178 | 0.0150 | -1.1896 | 0.2442 |
| qsec | 0.5108 | 0.4392 | 1.1630 | 0.2546 |
Interpretation of slopes should always be conditional. For example, the coefficient of wt is the expected change in mpg for a one-unit increase in wt, holding horsepower and quarter-mile time fixed.
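A quick way to see this conditional interpretation is to compare two predictions that differ only in wt. The objects base_car and heavier below are illustrative and not part of the original note; the difference in predictions reproduces the wt coefficient exactly because the model is linear:

base_car <- data.frame(wt = 3.0, hp = 120, qsec = 18)
heavier <- transform(base_car, wt = wt + 1)    # one more unit of weight, hp and qsec unchanged
predict(fit, heavier) - predict(fit, base_car) # equals coef(fit)["wt"]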
The least-squares estimator minimizes \[ S(\boldsymbol\beta)=\|\mathbf y-X\boldsymbol\beta\|^2. \] Expanding the quadratic, \[ S(\boldsymbol\beta)=\mathbf y^\top\mathbf y -2\boldsymbol\beta^\top X^\top\mathbf y +\boldsymbol\beta^\top X^\top X\boldsymbol\beta. \] Differentiating with respect to \(\boldsymbol\beta\), \[ \frac{\partial S}{\partial\boldsymbol\beta} =-2X^\top(\mathbf y-X\boldsymbol\beta). \] Setting the gradient to zero gives the normal equations \[ X^\top X\widehat{\boldsymbol\beta}=X^\top\mathbf y. \] If \(X\) has full column rank, then \[ \widehat{\boldsymbol\beta}=(X^\top X)^{-1}X^\top\mathbf y. \]
Let us verify this directly in R.
y <- mtcars$mpg
X <- model.matrix(fit)
beta_matrix <- solve(t(X) %*% X) %*% t(X) %*% y
beta_lm <- coef(fit)
kable_num(
cbind(Matrix_Formula = as.vector(beta_matrix), lm_Output = beta_lm),
digits = 6,
caption = "OLS coefficients from matrix formula and from lm()"
)

| | Matrix_Formula | lm_Output |
|---|---|---|
| (Intercept) | 27.610527 | 27.610527 |
| wt | -4.358797 | -4.358797 |
| hp | -0.017822 | -0.017822 |
| qsec | 0.510834 | 0.510834 |
In multiple linear regression,
\[ \mathbf y = X\boldsymbol\beta+\boldsymbol\varepsilon, \qquad \widehat{\boldsymbol\beta}=(X^\top X)^{-1}X^\top \mathbf y. \]
Therefore,
\[ \widehat{\mathbf y}=X\widehat{\boldsymbol\beta} = X(X^\top X)^{-1}X^\top \mathbf y = H\mathbf y, \]
where
\[ H=X(X^\top X)^{-1}X^\top \]
is called the hat matrix: it puts the "hat" on \(\mathbf y\) by mapping the observed response vector \(\mathbf y\) to the fitted vector \(\widehat{\mathbf y}\).
Geometrically, \(H\) projects \(\mathbf y\) onto the column space \(\mathcal C(X)\). The fitted vector \(\widehat{\mathbf y}=H\mathbf y\) lies inside \(\mathcal C(X)\), while the residual vector
\[ \mathbf e=\mathbf y-\widehat{\mathbf y}=(I-H)\mathbf y \]
is orthogonal to \(\mathcal C(X)\). Hence,
\[ X^\top \mathbf e=\mathbf 0. \]
This is the geometric heart of ordinary least squares.
The diagonal elements \(h_{ii}\) of \(H\) are called leverages. A large \(h_{ii}\) means the \(i\)-th observation has an unusual predictor pattern and can strongly influence the fitted regression surface.
Define the hat matrix \[ H=X(X^\top X)^{-1}X^\top. \] Then \[ \widehat{\mathbf y}=H\mathbf y, \qquad \mathbf e=\mathbf y-\widehat{\mathbf y}=(I-H)\mathbf y. \] The matrix \(H\) is symmetric and idempotent: \[ H^\top=H, \qquad H^2=H. \] Thus it is an orthogonal projection matrix onto the column space of \(X\).
H <- X %*% solve(t(X) %*% X) %*% t(X)
yhat <- H %*% y
e <- y - yhat
checks <- c(
max_abs_yhat_difference = max(abs(as.vector(yhat) - fitted(fit))),
symmetry_error = max(abs(H - t(H))),
idempotence_error = max(abs(H %*% H - H)),
normal_equation_error = max(abs(t(X) %*% e))
)
round(checks, 10)

#> max_abs_yhat_difference symmetry_error idempotence_error
#> 0.0e+00 0.0e+00 0.0e+00
#> normal_equation_error
#> 8.1e-09
The last number verifies \(X^\top\mathbf e=0\): residuals are orthogonal to every column of the design matrix.
Regression geometry is most precise in \(\mathbb R^n\), not in the two- or three-dimensional predictor plot. The vector \(\mathbf y\in\mathbb R^n\) is projected onto the column space \(\mathcal C(X)\). Its projection is \(\widehat{\mathbf y}=H\mathbf y\), and the residual vector \(\mathbf e=\mathbf y-\widehat{\mathbf y}\) is perpendicular to \(\mathcal C(X)\).
Because the intercept column \(\mathbf 1_n\) belongs to \(\mathcal C(X)\), we get \[ \mathbf 1_n^\top \mathbf e=0. \] Therefore residuals sum to zero and the observed and fitted values have the same mean: \[ \sum_{i=1}^n e_i=0, \qquad \overline{y}=\overline{\widehat y}. \]
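The check below verifies this numerically; assuming the fitted object fit from above, a matching command is:

c(sum_residuals = sum(resid(fit)),
  mean_y = mean(mtcars$mpg),
  mean_fitted = mean(fitted(fit)))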
#> sum_residuals mean_y mean_fitted
#> -3.608225e-16 2.009062e+01 2.009062e+01
With an intercept in the model, \[ \mathbf y-\overline y\mathbf 1_n = (\widehat{\mathbf y}-\overline y\mathbf 1_n)+\mathbf e. \] Here \(\widehat{\mathbf y}-\overline y\mathbf 1_n\in\mathcal C(X)\), while \(\mathbf e\perp\mathcal C(X)\). Hence these three vectors form a right triangle.
Thus \[ \|\mathbf y-\overline y\mathbf 1_n\|^2 = \|\widehat{\mathbf y}-\overline y\mathbf 1_n\|^2 + \|\mathbf e\|^2. \] Equivalently, \[ SST=SSR+SSE. \]
The plot visualizes the Pythagorean decomposition
\[ SST=SSR+SSE, \]
or equivalently,
\[ \|y-\bar y\mathbf 1_n\|^2 = \|\hat y-\bar y\mathbf 1_n\|^2+\|e\|^2. \]
Thus \(R^2=SSR/SST=\cos^2(\theta)\), where \(\theta\) is the angle between \(y-\bar y\mathbf 1_n\) and \(\hat y-\bar y\mathbf 1_n\).
yvec <- as.vector(y)
yhat_vec <- as.vector(fitted(fit))
evec <- as.vector(resid(fit))
ybar1 <- mean(yvec) * rep(1, length(yvec))
SST <- sum((yvec - ybar1)^2)
SSR <- sum((yhat_vec - ybar1)^2)
SSE <- sum((yvec - yhat_vec)^2)
R2_length <- SSR / SST
R2_def <- 1 - SSE / SST
R_multiple <- cor(yvec, yhat_vec)
kable_num(
data.frame(SST = SST, SSR = SSR, SSE = SSE,
R2_from_lengths = R2_length,
R2_from_definition = R2_def,
multiple_R = R_multiple,
multiple_R_squared = R_multiple^2),
digits = 5,
caption = "Geometric decomposition of sums of squares and relation to multiple R"
)

| SST | SSR | SSE | R2_from_lengths | R2_from_definition | multiple_R | multiple_R_squared |
|---|---|---|---|---|---|---|
| 1126.047 | 939.9879 | 186.0593 | 0.83477 | 0.83477 | 0.91366 | 0.83477 |
The following plot is not the original \(n\)-dimensional vector space. It is a two-dimensional diagram with the same side lengths as the right triangle in \(\mathbb R^n\).
len_total <- sqrt(SST)
len_reg <- sqrt(SSR)
len_res <- sqrt(SSE)
theta <- acos(R_multiple)
plot(c(0, len_total * 1.05), c(0, len_res * 1.25), type = "n",
xlab = "Length along fitted centred vector",
ylab = "Residual length",
main = expression("Geometry of "*R^2*": right triangle in subject space"))
segments(0, 0, len_reg, 0, lwd = 3, col = "darkgreen")
segments(len_reg, 0, len_reg, len_res, lwd = 3, col = "firebrick")
segments(0, 0, len_reg, len_res, lwd = 3, col = "steelblue")
points(c(0, len_reg, len_reg), c(0, 0, len_res), pch = 19)
text(len_reg/2, -0.08 * len_res, "|| yhat - ybar 1 ||", col = "darkgreen")
text(len_reg * 1.02, len_res/2, "|| y - yhat ||", col = "firebrick", pos = 4)
text(len_reg/2, len_res/2 + 0.08 * len_res, "|| y - ybar 1 ||", col = "steelblue")
legend("topleft",
legend = c(paste0("R = cor(y, fitted) = ", round(R_multiple, 3)),
paste0("R^2 = ", round(R2_def, 3)),
paste0("theta = ", round(theta * 180/pi, 2), " degrees")),
bty = "n")The multiple correlation coefficient is \[ R=\operatorname{corr}(\mathbf y,\widehat{\mathbf y}) =\cos\theta. \] Therefore \[ R^2=\cos^2\theta =\frac{\|\widehat{\mathbf y}-\overline y\mathbf 1_n\|^2} {\|\mathbf y-\overline y\mathbf 1_n\|^2} =\frac{SSR}{SST}. \] This equivalence depends on the intercept. Without the intercept, the residuals need not sum to zero, the Pythagorean decomposition may fail in the usual centred form, and some software uses a different definition of \(R^2\).
With two predictors the fitted response surface is a plane: \[
\widehat y=\widehat\beta_0+\widehat\beta_1 x_1+\widehat\beta_2 x_2.
\] The next plot is a visual representation of the model
mpg ~ wt + hp. It is useful pedagogically, but remember
that the exact projection geometry above lives in \(\mathbb R^n\).
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
wt_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 25)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 25)
z_mat <- outer(wt_seq, hp_seq,
function(wt, hp) coef(fit2)[1] + coef(fit2)[2] * wt + coef(fit2)[3] * hp)
pmat <- persp(wt_seq, hp_seq, z_mat,
theta = 40, phi = 20,
col = "lightblue", border = "gray70",
xlab = "Weight", ylab = "Horsepower", zlab = "MPG",
main = "Fitted regression plane: mpg ~ wt + hp")
points(trans3d(mtcars$wt, mtcars$hp, mtcars$mpg, pmat),
pch = 19, col = "firebrick")Since \[ \widehat{\boldsymbol\beta}=\boldsymbol\beta+(X^\top X)^{-1}X^\top\boldsymbol\varepsilon, \] we have \[ \mathbb E(\widehat{\boldsymbol\beta}\mid X)=\boldsymbol\beta. \] If \(\operatorname{Var}(\boldsymbol\varepsilon\mid X)=\sigma^2I_n\), then \[ \operatorname{Var}(\widehat{\boldsymbol\beta}\mid X)=\sigma^2(X^\top X)^{-1}. \] An unbiased estimator of \(\sigma^2\) is \[ \widehat\sigma^2=MSE=\frac{SSE}{n-p}. \]
n <- nrow(X)
p <- ncol(X)
SSE <- sum(resid(fit)^2)
MSE <- SSE / (n - p)
C <- solve(t(X) %*% X)
se_manual <- sqrt(MSE * diag(C))
kable_num(
data.frame(Estimate = coef(fit), SE_manual = se_manual,
SE_summary = summary(fit)$coefficients[, "Std. Error"]),
digits = 5,
caption = "Manual standard errors compared with summary(lm)"
)

| | Estimate | SE_manual | SE_summary |
|---|---|---|---|
| (Intercept) | 27.61053 | 8.41993 | 8.41993 |
| wt | -4.35880 | 0.75270 | 0.75270 |
| hp | -0.01782 | 0.01498 | 0.01498 |
| qsec | 0.51083 | 0.43922 | 0.43922 |
The global test asks whether the regression explains any linear signal beyond the intercept: \[ H_0:\beta_1=\cdots=\beta_k=0. \] The test statistic is \[ F=\frac{SSR/k}{SSE/(n-p)}. \]
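The overall \(F\) statistic and the sequential ANOVA table below are standard lm outputs; the chunk is not shown in this extract, but presumably it is:

summary(fit)$fstatistic
anova(fit)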
#> value numdf dendf
#> 47.15282 3.00000 28.00000
#> Analysis of Variance Table
#>
#> Response: mpg
#> Df Sum Sq Mean Sq F value Pr(>F)
#> wt 1 847.73 847.73 127.5739 6.131e-12 ***
#> hp 1 83.27 83.27 12.5319 0.00142 **
#> qsec 1 8.99 8.99 1.3527 0.25463
#> Residuals 28 186.06 6.64
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For a single coefficient, \[ H_0:\beta_j=0, \qquad T_j=\frac{\widehat\beta_j}{\sqrt{MSE\,C_{jj}}}\sim t_{n-p} \] under the normal error model.
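The coefficient table below is the usual matrix of estimates, standard errors, \(t\) values and p-values, presumably printed with:

summary(fit)$coefficients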
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 27.61052686 8.41992848 3.279188 2.784556e-03
#> wt -4.35879720 0.75270039 -5.790879 3.217222e-06
#> hp -0.01782227 0.01498117 -1.189645 2.441762e-01
#> qsec 0.51083369 0.43922153 1.163043 2.546284e-01
A non-significant marginal coefficient does not automatically imply that the variable is scientifically useless. Collinearity, interactions, design constraints and subject-matter interpretation must be considered.
Suppose the reduced model has design matrix \(X_1\) and the full model has \(X=[X_1,X_2]\). To test whether the block \(X_2\) adds explanatory power after \(X_1\), use \[ F=\frac{(SSE_R-SSE_F)/r}{SSE_F/(n-p_F)}. \] Here \(r\) is the number of added coefficients.
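In R this partial \(F\)-test is a nested-model comparison. The table below is consistent with fitting the reduced model and calling anova(); the name fit_reduced is illustrative:

fit_reduced <- lm(mpg ~ wt + hp, data = mtcars)
anova(fit_reduced, fit)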
#> Analysis of Variance Table
#>
#> Model 1: mpg ~ wt + hp
#> Model 2: mpg ~ wt + hp + qsec
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 29 195.05
#> 2 28 186.06 1 8.9885 1.3527 0.2546
For a new vector \(x_0=(1,x_{01},\ldots,x_{0k})^\top\), the fitted mean is \[ \widehat\mu(x_0)=x_0^\top\widehat{\boldsymbol\beta}. \]
A confidence interval for the mean response is \[ \widehat\mu(x_0)\pm t_{\alpha/2,n-p}\sqrt{MSE\,x_0^\top(X^\top X)^{-1}x_0}. \]
A prediction interval for one future observation is \[ \widehat\mu(x_0)\pm t_{\alpha/2,n-p}\sqrt{MSE\{1+x_0^\top(X^\top X)^{-1}x_0\}}. \] The extra \(1\) makes the prediction interval wider because a future observation has its own random error.
new_car <- data.frame(wt = 3.0, hp = 120, qsec = 18)
predict(fit, newdata = new_car, interval = "confidence")

#> fit lwr upr
#> 1 21.59047 20.53118 22.64976
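The wider interval printed next is the prediction interval for a single new car, presumably obtained by switching the interval type:

predict(fit, newdata = new_car, interval = "prediction")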
#> fit lwr upr
#> 1 21.59047 16.20491 26.97603
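The coefficient confidence intervals printed next are what confint() returns for the fitted model; presumably:

confint(fit, level = 0.95)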
#> 2.5 % 97.5 %
#> (Intercept) 10.3630852 44.85796848
#> wt -5.9006341 -2.81696034
#> hp -0.0485098 0.01286526
#> qsec -0.3888708 1.41053822
The ordinary \(R^2\) never decreases when new predictors are added. Adjusted \(R^2\), AIC and BIC penalize complexity.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_mid <- lm(mpg ~ wt + hp, data = mtcars)
fit_big <- lm(mpg ~ wt + hp + qsec + disp + cyl, data = mtcars)
model_table <- data.frame(
model = c("wt", "wt + hp", "wt + hp + qsec", "larger model"),
R2 = c(summary(fit_small)$r.squared,
summary(fit_mid)$r.squared,
summary(fit)$r.squared,
summary(fit_big)$r.squared),
adj_R2 = c(summary(fit_small)$adj.r.squared,
summary(fit_mid)$adj.r.squared,
summary(fit)$adj.r.squared,
summary(fit_big)$adj.r.squared),
AIC = c(AIC(fit_small), AIC(fit_mid), AIC(fit), AIC(fit_big)),
BIC = c(BIC(fit_small), BIC(fit_mid), BIC(fit), BIC(fit_big))
)
kable_num(model_table, digits = 4,
caption = "Model comparison by R2, adjusted R2, AIC and BIC")| model | R2 | adj_R2 | AIC | BIC |
|---|---|---|---|---|
| wt | 0.7528 | 0.7446 | 166.0294 | 170.4266 |
| wt + hp | 0.8268 | 0.8148 | 156.6523 | 162.5153 |
| wt + hp + qsec | 0.8348 | 0.8171 | 157.1426 | 164.4713 |
| larger model | 0.8502 | 0.8214 | 158.0056 | 168.2658 |
Good model selection balances fit, parsimony, interpretability and scientific plausibility.
Regression diagnostics ask whether the assumptions are reasonable.
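A standard starting point, assuming only base R graphics (the original chunk is not shown in this extract), is the default set of lm diagnostic plots:

op <- par(mfrow = c(2, 2))   # arrange the four diagnostic panels in a 2 x 2 grid
plot(fit)                    # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(op)                      # restore the previous graphics settings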
Interpretation: in the residuals-versus-fitted plot we look for curvature (nonlinearity) and funnel shapes (non-constant variance), the normal Q-Q plot checks the normality assumption behind exact \(t\)- and \(F\)-inference, the scale-location plot re-examines homoscedasticity, and the residuals-versus-leverage plot flags observations that may be influential.
The diagonal entries of \(H\) are leverages: \[ h_{ii}=x_i^\top(X^\top X)^{-1}x_i. \] A high-leverage point has an unusual predictor combination. Cook’s distance combines residual size and leverage.
infl <- data.frame(
car = rownames(mtcars),
fitted = fitted(fit),
residual = resid(fit),
leverage = hatvalues(fit),
studentized = rstudent(fit),
cooksD = cooks.distance(fit)
)
infl_top <- infl[order(-infl$cooksD), ][1:8, ]
kable_num(infl_top, digits = 4,
caption = "Most influential observations by Cook's distance")| car | fitted | residual | leverage | studentized | cooksD | |
|---|---|---|---|---|---|---|
| Chrysler Imperial | Chrysler Imperial | 9.1124 | 5.5876 | 0.1872 | 2.6504 | 0.3328 |
| Toyota Corolla | Toyota Corolla | 28.6193 | 5.2807 | 0.1360 | 2.3805 | 0.1912 |
| Maserati Bora | Maserati Bora | 13.5373 | 1.4627 | 0.4650 | 0.7701 | 0.1308 |
| Fiat 128 | Fiat 128 | 26.7908 | 5.6092 | 0.0840 | 2.4724 | 0.1185 |
| Toyota Corona | Toyota Corona | 25.3591 | -3.8591 | 0.1195 | -1.6431 | 0.0863 |
| Merc 230 | Merc 230 | 23.8853 | -1.0853 | 0.4581 | -0.5649 | 0.0691 |
| Lotus Europa | Lotus Europa | 27.6348 | 2.7652 | 0.1557 | 1.1753 | 0.0628 |
| Valiant | Valiant | 20.9868 | -2.8868 | 0.0964 | -1.1867 | 0.0370 |
plot(hatvalues(fit), rstudent(fit),
xlab = "Leverage", ylab = "Externally studentized residual",
main = "Leverage and studentized residuals",
pch = 19, col = "steelblue")
abline(h = c(-2, 2), lty = 2, col = "firebrick")
abline(v = 2 * p / n, lty = 2, col = "darkgreen")
text(hatvalues(fit), rstudent(fit), labels = rownames(mtcars), pos = 4, cex = 0.7)

A point is influential when it is unusual in predictor space and has a substantial residual effect on the fitted model.
Multicollinearity means that some predictors are nearly linearly dependent. It makes coefficient estimates unstable and standard errors large.
For predictor \(x_j\), define \[ VIF_j=\frac{1}{1-R_j^2}, \] where \(R_j^2\) is obtained by regressing \(x_j\) on the remaining predictors.
vif_manual <- function(fit) {
X <- model.matrix(fit)
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
if (ncol(X) < 2) {
return(data.frame(term = colnames(X), VIF = NA_real_))
}
out <- numeric(ncol(X))
for (j in seq_len(ncol(X))) {
yj <- X[, j]
others <- X[, -j, drop = FALSE]
r2 <- summary(lm(yj ~ others))$r.squared
out[j] <- 1 / (1 - r2)
}
data.frame(term = colnames(X), VIF = out)
}
kable_num(vif_manual(fit_big), digits = 3,
caption = "Manual variance inflation factors for the larger model")| term | VIF |
|---|---|
| wt | 7.175 |
| hp | 5.235 |
| qsec | 3.625 |
| disp | 10.407 |
| cyl | 7.796 |
A rule of thumb is that VIF above 5 or 10 should trigger concern, though the threshold is context-dependent.
Regression coefficients depend on the units of measurement. To compare relative effects, we may standardize variables.
Unit normal scaling uses \[ z_{ij}=\frac{x_{ij}-\overline{x}_j}{s_j}, \qquad y_i^*=\frac{y_i-\overline y}{s_y}. \] In the regression of \(y^*\) on the standardized predictors, the fitted intercept is zero apart from numerical rounding.
mt_std <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp", "qsec")]))
fit_std <- lm(mpg ~ wt + hp + qsec, data = mt_std)
summary(fit_std)

#>
#> Call:
#> lm(formula = mpg ~ wt + hp + qsec, data = mt_std)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.64031 -0.27240 -0.07691 0.19811 0.93068
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.615e-17 7.561e-02 0.000 1.000
#> wt -7.076e-01 1.222e-01 -5.791 3.22e-06 ***
#> hp -2.027e-01 1.704e-01 -1.190 0.244
#> qsec 1.515e-01 1.302e-01 1.163 0.255
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4277 on 28 degrees of freedom
#> Multiple R-squared: 0.8348, Adjusted R-squared: 0.8171
#> F-statistic: 47.15 on 3 and 28 DF, p-value: 4.506e-11
The standardized coefficients are useful for comparing relative linear association within the fitted model, but they should not replace subject-matter interpretation.
A model may be linear in coefficients while nonlinear in predictors: \[ Y=\beta_0+\beta_1x+\beta_2x^2+\varepsilon. \] Interactions are also linear models: \[ Y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2+\varepsilon. \]
fit_inter <- lm(mpg ~ wt * hp + qsec, data = mtcars)
fit_poly <- lm(mpg ~ wt + I(wt^2) + hp + qsec, data = mtcars)
anova(fit, fit_inter)

#> Analysis of Variance Table
#>
#> Model 1: mpg ~ wt + hp + qsec
#> Model 2: mpg ~ wt * hp + qsec
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 28 186.06
#> 2 27 121.04 1 65.018 14.503 0.0007333 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
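The second comparison below tests the quadratic term in weight against the baseline model; presumably it comes from:

anova(fit, fit_poly)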
#> Analysis of Variance Table
#>
#> Model 1: mpg ~ wt + hp + qsec
#> Model 2: mpg ~ wt + I(wt^2) + hp + qsec
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 28 186.06
#> 2 27 130.59 1 55.466 11.468 0.002185 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When using interactions or polynomial terms, centering predictors can make interpretation more stable.
If errors are heteroscedastic, OLS coefficients may remain unbiased under \(\mathbb E(\varepsilon\mid X)=0\), but the usual standard errors may be wrong.
A simple White-type auxiliary regression regresses squared residuals on predictors, squares and interactions.
e2 <- resid(fit)^2
white_aux <- lm(e2 ~ wt + hp + qsec + I(wt^2) + I(hp^2) + I(qsec^2) + wt:hp + wt:qsec + hp:qsec,
data = mtcars)
LM <- length(e2) * summary(white_aux)$r.squared
df_white <- length(coef(white_aux)) - 1
p_white <- 1 - pchisq(LM, df = df_white)
c(LM_statistic = LM, df = df_white, p_value = p_white)

#> LM_statistic df p_value
#> 12.5312885 9.0000000 0.1849867
A heteroscedasticity-consistent covariance matrix can be computed manually. HC3 uses \[ D_{ii}=\frac{e_i^2}{(1-h_{ii})^2}, \qquad \widehat{\operatorname{Var}}_{HC3}(\widehat\beta) =(X^\top X)^{-1}X^\top D X(X^\top X)^{-1}. \]
X <- model.matrix(fit)
e <- resid(fit)
h <- hatvalues(fit)
D <- diag((e^2) / (1 - h)^2)
V_hc3 <- solve(t(X) %*% X) %*% t(X) %*% D %*% X %*% solve(t(X) %*% X)
se_hc3 <- sqrt(diag(V_hc3))
compare_se <- data.frame(
Estimate = coef(fit),
Classical_SE = summary(fit)$coefficients[, "Std. Error"],
HC3_SE = se_hc3,
Classical_t = coef(fit) / summary(fit)$coefficients[, "Std. Error"],
HC3_t = coef(fit) / se_hc3
)
kable_num(compare_se, digits = 4,
caption = "Classical versus HC3 robust standard errors")| Estimate | Classical_SE | HC3_SE | Classical_t | HC3_t | |
|---|---|---|---|---|---|
| (Intercept) | 27.6105 | 8.4199 | 7.5473 | 3.2792 | 3.6583 |
| wt | -4.3588 | 0.7527 | 0.9501 | -5.7909 | -4.5879 |
| hp | -0.0178 | 0.0150 | 0.0138 | -1.1896 | -1.2897 |
| qsec | 0.5108 | 0.4392 | 0.4336 | 1.1630 | 1.1781 |
We now verify the classical theory by simulation. The design matrix is fixed, the true coefficient vector is known, and we repeatedly generate Gaussian errors.
B <- 5000
n_sim <- 80
x1 <- rnorm(n_sim)
x2 <- 0.4 * x1 + rnorm(n_sim)
X_sim <- cbind(1, x1, x2)
beta_true <- c(1, 2, -1)
sigma_true <- 1
p_sim <- ncol(X_sim)
est <- matrix(NA_real_, B, p_sim)
cover <- matrix(NA, B, p_sim)
for (b in 1:B) {
y_sim <- as.vector(X_sim %*% beta_true + rnorm(n_sim, sd = sigma_true))
fit_sim <- lm(y_sim ~ x1 + x2)
est[b, ] <- coef(fit_sim)
ci <- confint(fit_sim)
cover[b, ] <- beta_true >= ci[, 1] & beta_true <= ci[, 2]
}
sim_table <- data.frame(
coefficient = c("beta0", "beta1", "beta2"),
true = beta_true,
mean_estimate = colMeans(est),
empirical_variance = apply(est, 2, var),
coverage_95 = colMeans(cover)
)
kable_num(sim_table, digits = 4,
caption = "Monte Carlo check: unbiasedness and 95 percent interval coverage")| coefficient | true | mean_estimate | empirical_variance | coverage_95 |
|---|---|---|---|---|
| beta0 | 1 | 1.0011 | 0.0129 | 0.9474 |
| beta1 | 2 | 1.9994 | 0.0154 | 0.9496 |
| beta2 | -1 | -1.0005 | 0.0118 | 0.9480 |
The estimates are centered near the truth and the confidence interval coverage is near 0.95 when the model assumptions hold.
We compare two designs: one with weak collinearity and one with strong collinearity.
sim_collinearity <- function(rho, B = 3000, n = 80) {
est <- matrix(NA_real_, B, 3)
for (b in 1:B) {
x1 <- rnorm(n)
x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
est[b, ] <- coef(lm(y ~ x1 + x2))
}
apply(est, 2, var)
}
var_low <- sim_collinearity(rho = 0.2)
var_high <- sim_collinearity(rho = 0.98)
kable_num(
rbind(low_collinearity = var_low, high_collinearity = var_high),
digits = 4,
caption = "Sampling variance of OLS estimates under low versus high collinearity"
)

| | Var(beta0_hat) | Var(beta1_hat) | Var(beta2_hat) |
|---|---|---|---|
| low_collinearity | 0.0131 | 0.0143 | 0.0136 |
| high_collinearity | 0.0124 | 0.3398 | 0.3393 |
Multicollinearity may leave prediction reasonably good but makes individual coefficient interpretation unstable.
If \(p>n\), the matrix \(X^\top X\) cannot have full rank. Classical OLS is not uniquely determined.
n_hd <- 20
p_hd <- 35
X_hd <- matrix(rnorm(n_hd * p_hd), nrow = n_hd, ncol = p_hd)
y_hd <- rnorm(n_hd)
rank_X <- qr(X_hd)$rank
rank_XtX <- qr(t(X_hd) %*% X_hd)$rank
c(n = n_hd, p = p_hd, rank_X = rank_X, rank_XtX = rank_XtX)

#> n p rank_X rank_XtX
#> 20 35 20 20
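Attempting to invert \(X^\top X\) fails, as the captured error below shows. A matching call, wrapped in try() so the document keeps running, is presumably:

try(solve(t(X_hd) %*% X_hd))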
#> [1] "Error in solve.default(t(X_hd) %*% X_hd) : \n system is computationally singular: reciprocal condition number = 3.84338e-19\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in solve.default(t(X_hd) %*% X_hd): system is computationally singular: reciprocal condition number = 3.84338e-19>
A ridge-type estimator replaces \((X^\top X)^{-1}\) by \((X^\top X+\lambda I)^{-1}\): \[ \widehat\beta_{ridge}(\lambda)=(X^\top X+\lambda I)^{-1}X^\top y. \]
lambda <- 1
beta_ridge <- solve(t(X_hd) %*% X_hd + lambda * diag(p_hd), t(X_hd) %*% y_hd)
length(beta_ridge)

#> [1] 35
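The first few ridge coefficients shown below are presumably displayed with:

head(beta_ridge)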
#> [,1]
#> [1,] 0.25594043
#> [2,] 0.18178068
#> [3,] 0.01720601
#> [4,] -0.06574177
#> [5,] -0.18636388
#> [6,] -0.09152578
This demonstrates why regularization is not just a computational trick; it is necessary when the regression problem is ill-posed.
Before reporting a multiple linear regression model, check the following: linearity and residual behaviour, constant error variance (or use robust standard errors), approximate normality if exact \(t\)- and \(F\)-inference is needed, influential observations and leverage, multicollinearity through VIFs, and whether adjusted \(R^2\), AIC and BIC support the chosen specification.
The function below gives students a compact automatic summary for many classroom datasets.
mlr_report <- function(formula, data) {
fit <- lm(formula, data = data)
X <- model.matrix(fit)
y <- model.response(model.frame(fit))
n <- nrow(X); p <- ncol(X)
H <- X %*% solve(t(X) %*% X) %*% t(X)
e <- resid(fit)
SSE <- sum(e^2)
MSE <- SSE / (n - p)
out <- list(
fit = fit,
coefficients = summary(fit)$coefficients,
fit_measures = c(R2 = summary(fit)$r.squared,
adj_R2 = summary(fit)$adj.r.squared,
sigma_hat = sqrt(MSE),
AIC = AIC(fit),
BIC = BIC(fit)),
VIF = try(vif_manual(fit), silent = TRUE),
max_leverage = max(diag(H)),
largest_cooks = sort(cooks.distance(fit), decreasing = TRUE)[1:min(5, n)]
)
return(out)
}
rep1 <- mlr_report(mpg ~ wt + hp + qsec, data = mtcars)
rep1$fit_measures

#> R2 adj_R2 sigma_hat AIC BIC
#> 0.8347678 0.8170643 2.5777849 157.1426108 164.4712904
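The coefficient table, the VIF table and the largest Cook's distances that follow are the remaining components of rep1; presumably they are rendered with:

kable_num(rep1$coefficients, digits = 4)
kable_num(rep1$VIF, digits = 3)
rep1$largest_cooks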
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 27.6105 | 8.4199 | 3.2792 | 0.0028 |
| wt | -4.3588 | 0.7527 | -5.7909 | 0.0000 |
| hp | -0.0178 | 0.0150 | -1.1896 | 0.2442 |
| qsec | 0.5108 | 0.4392 | 1.1630 | 0.2546 |
| term | VIF |
|---|---|
| wt | 2.530 |
| hp | 4.922 |
| qsec | 2.874 |
#> Chrysler Imperial Toyota Corolla Maserati Bora Fiat 128
#> 0.33284943 0.19121065 0.13078514 0.11850846
#> Toyota Corona
#> 0.08631978
Suggested exercises:

1. Fit mpg ~ wt + hp + disp + cyl and compute VIFs. Which variable creates the strongest collinearity issue?
2. Compare mpg ~ wt + hp with mpg ~ wt * hp. Is the interaction useful by partial \(F\)-test?
3. Consider a new car with wt = 2, hp = 300. Compute the leverage of this new point and decide whether hidden extrapolation is a concern.

References:

Montgomery, D. C., Peck, E. A., and Vining, G. G. (2021). Introduction to Linear Regression Analysis, 6th edition. Wiley.
Draper, N. R. and Smith, H. (1998). Applied Regression Analysis, 3rd edition. Wiley.
Seber, G. A. F. and Lee, A. J. (2003). Linear Regression Analysis, 2nd edition. Wiley.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B, 26(2), 211–252.