1 Aim of this RPubs note

This note is a complete practical companion for a postgraduate lecture on multiple linear regression. It is written so that a student can read the theory, run the code, reproduce the calculations and understand how regression is actually used.

Core reference for the lecture: Montgomery, Peck and Vining, Introduction to Linear Regression Analysis, 6th edition, Wiley.

We will study the model \[ Y_i=\beta_0+\beta_1x_{i1}+\cdots+\beta_kx_{ik}+\varepsilon_i, \qquad i=1,\ldots,n, \] or, in matrix notation, \[ \mathbf y=X\boldsymbol\beta+\boldsymbol\varepsilon. \]

The main ideas — the matrix formulation of least squares, exact inference under normal errors, the projection geometry behind fitting, diagnostics and influence measures, multicollinearity, and model comparison — are developed in the numbered sections below.

1.1 Assignment PDF

Assignment 1: Download Multiple Linear Regression Theory Assignment 1



This assignment contains theory, derivations, interpretation-based questions, and matrix-form exercises on multiple linear regression.

2 A real data example

We use the built-in mtcars dataset. The response is miles per gallon (mpg). We use car weight (wt), horsepower (hp) and quarter-mile time (qsec) as predictors. The purpose is not automotive science; the purpose is to learn regression workflow with reproducible data.

data(mtcars)
str(mtcars)
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars[, c("mpg", "wt", "hp", "qsec", "disp", "cyl")])
#>       mpg              wt              hp             qsec      
#>  Min.   :10.40   Min.   :1.513   Min.   : 52.0   Min.   :14.50  
#>  1st Qu.:15.43   1st Qu.:2.581   1st Qu.: 96.5   1st Qu.:16.89  
#>  Median :19.20   Median :3.325   Median :123.0   Median :17.71  
#>  Mean   :20.09   Mean   :3.217   Mean   :146.7   Mean   :17.85  
#>  3rd Qu.:22.80   3rd Qu.:3.610   3rd Qu.:180.0   3rd Qu.:18.90  
#>  Max.   :33.90   Max.   :5.424   Max.   :335.0   Max.   :22.90  
#>       disp            cyl       
#>  Min.   : 71.1   Min.   :4.000  
#>  1st Qu.:120.8   1st Qu.:4.000  
#>  Median :196.3   Median :6.000  
#>  Mean   :230.7   Mean   :6.188  
#>  3rd Qu.:326.0   3rd Qu.:8.000  
#>  Max.   :472.0   Max.   :8.000

A quick scatterplot matrix helps us see pairwise relationships.

pairs(mtcars[, c("mpg", "wt", "hp", "qsec", "disp")],
      main = "Pairwise view of response and candidate predictors",
      pch = 19, col = "steelblue")

3 Model formulation

Let \(p=k+1\), counting the intercept. Define \[ \mathbf y=\begin{bmatrix}y_1\\y_2\\ \vdots\\y_n\end{bmatrix},\quad X=\begin{bmatrix} 1&x_{11}&x_{12}&\cdots&x_{1k}\\ 1&x_{21}&x_{22}&\cdots&x_{2k}\\ \vdots&\vdots&\vdots&&\vdots\\ 1&x_{n1}&x_{n2}&\cdots&x_{nk} \end{bmatrix},\quad \boldsymbol\beta=\begin{bmatrix}\beta_0\\\beta_1\\ \vdots\\\beta_k\end{bmatrix}. \] Then \[ \mathbf y=X\boldsymbol\beta+\boldsymbol\varepsilon. \]

The standard assumptions for the classical linear model are \[ \mathbb E(\boldsymbol\varepsilon\mid X)=0,\qquad \operatorname{Var}(\boldsymbol\varepsilon\mid X)=\sigma^2 I_n. \] For exact finite-sample \(t\)- and \(F\)-inference, we additionally assume \[ \boldsymbol\varepsilon\mid X\sim N_n(0,\sigma^2 I_n). \]

Important warning: if \(p\ge n\) there are no residual degrees of freedom left for estimating \(\sigma^2\), and if \(p>n\), or if the columns of \(X\) are nearly linearly dependent, then \(X^\top X\) is singular or ill-conditioned. OLS then becomes non-unique or numerically unstable. This motivates variable reduction, principal components, ridge regression and other regularization methods.
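As a quick numerical companion to this warning, we can build the design matrix for a larger candidate model on mtcars and inspect its rank and the condition number of \(X^\top X\). This is a minimal sketch using base R; kappa() only returns an estimate of the condition number.

# Design matrix for a larger candidate model (intercept column is added automatically)
X_cand <- model.matrix(~ wt + hp + qsec + disp + cyl, data = mtcars)
dim(X_cand)                # n rows and p columns
qr(X_cand)$rank            # equals p when X has full column rank
kappa(crossprod(X_cand))   # estimated condition number of X'X; very large values signal ill-conditioning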

4 Fitting an initial multiple linear regression model

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(fit)
#> 
#> Call:
#> lm(formula = mpg ~ wt + hp + qsec, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.8591 -1.6418 -0.4636  1.1940  5.6092 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 27.61053    8.41993   3.279  0.00278 ** 
#> wt          -4.35880    0.75270  -5.791 3.22e-06 ***
#> hp          -0.01782    0.01498  -1.190  0.24418    
#> qsec         0.51083    0.43922   1.163  0.25463    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.578 on 28 degrees of freedom
#> Multiple R-squared:  0.8348, Adjusted R-squared:  0.8171 
#> F-statistic: 47.15 on 3 and 28 DF,  p-value: 4.506e-11

The fitted model is \[ \widehat{mpg}=\widehat\beta_0+ \widehat\beta_1 wt+ \widehat\beta_2 hp+ \widehat\beta_3 qsec. \]
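Throughout this note the numeric tables are rendered with a small helper called kable_num(), whose definition lives in the setup chunk and is not shown here. If you are rerunning the code, a minimal stand-in (an assumption about the original helper, which appears to be a thin wrapper around knitr::kable()) is:

# Minimal stand-in for the kable_num() helper used below
# (assumed behaviour: round the numbers and pass the caption through to knitr::kable)
kable_num <- function(x, digits = 4, caption = NULL) {
  knitr::kable(x, digits = digits, caption = caption)
}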

coef_table <- summary(fit)$coefficients
kable_num(coef_table, digits = 4, caption = "Coefficient table for the model mpg ~ wt + hp + qsec")
Coefficient table for the model mpg ~ wt + hp + qsec
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.6105 8.4199 3.2792 0.0028
wt -4.3588 0.7527 -5.7909 0.0000
hp -0.0178 0.0150 -1.1896 0.2442
qsec 0.5108 0.4392 1.1630 0.2546

Interpretation of slopes should always be conditional. For example, the coefficient of wt is the expected change in mpg for a one-unit increase in wt (1000 lb of car weight), holding horsepower and quarter-mile time fixed.
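A small numerical illustration of this conditional reading: predict mpg for two hypothetical cars that share the same hp and qsec but differ by exactly one unit of wt; the difference between the two predictions is the wt coefficient. (The two cars below are invented purely for illustration.)

# Two hypothetical cars: identical hp and qsec, weights differing by one unit (1000 lb)
two_cars <- data.frame(wt = c(3.0, 4.0), hp = c(120, 120), qsec = c(18, 18))
pred <- predict(fit, newdata = two_cars)
diff(pred)        # equals the estimated wt coefficient
coef(fit)["wt"]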

5 OLS from matrix algebra

The least-squares estimator minimizes \[ S(\boldsymbol\beta)=\|\mathbf y-X\boldsymbol\beta\|^2. \] Expanding the quadratic, \[ S(\boldsymbol\beta)=\mathbf y^\top\mathbf y -2\boldsymbol\beta^\top X^\top\mathbf y +\boldsymbol\beta^\top X^\top X\boldsymbol\beta. \] Differentiating with respect to \(\boldsymbol\beta\), \[ \frac{\partial S}{\partial\boldsymbol\beta} =-2X^\top(\mathbf y-X\boldsymbol\beta). \] Setting the gradient to zero gives the normal equations \[ X^\top X\widehat{\boldsymbol\beta}=X^\top\mathbf y. \] If \(X\) has full column rank, then \[ \widehat{\boldsymbol\beta}=(X^\top X)^{-1}X^\top\mathbf y. \]

Let us verify this directly in R.

y <- mtcars$mpg
X <- model.matrix(fit)

beta_matrix <- solve(t(X) %*% X) %*% t(X) %*% y
beta_lm <- coef(fit)

kable_num(
  cbind(Matrix_Formula = as.vector(beta_matrix), lm_Output = beta_lm),
  digits = 6,
  caption = "OLS coefficients from matrix formula and from lm()"
)
OLS coefficients from matrix formula and from lm()
Matrix_Formula lm_Output
(Intercept) 27.610527 27.610527
wt -4.358797 -4.358797
hp -0.017822 -0.017822
qsec 0.510834 0.510834

5.1 Importance of the hat matrix

In multiple linear regression,

\[ \mathbf y = X\boldsymbol\beta+\boldsymbol\varepsilon, \qquad \widehat{\boldsymbol\beta}=(X^\top X)^{-1}X^\top \mathbf y. \]

Therefore,

\[ \widehat{\mathbf y}=X\widehat{\boldsymbol\beta} = X(X^\top X)^{-1}X^\top \mathbf y = H\mathbf y, \]

where

\[ H=X(X^\top X)^{-1}X^\top \]

is called the hat matrix. It is called the hat matrix because it maps the observed response vector \(\mathbf y\) to the fitted vector \(\widehat{\mathbf y}\).

Geometrically, \(H\) projects \(\mathbf y\) onto the column space \(\mathcal C(X)\). The fitted vector \(\widehat{\mathbf y}=H\mathbf y\) lies inside \(\mathcal C(X)\), while the residual vector

\[ \mathbf e=\mathbf y-\widehat{\mathbf y}=(I-H)\mathbf y \]

is orthogonal to \(\mathcal C(X)\). Hence,

\[ X^\top \mathbf e=\mathbf 0. \]

This is the geometric heart of ordinary least squares.

The diagonal elements \(h_{ii}\) of \(H\) are called leverages. A large \(h_{ii}\) means the \(i\)-th observation has an unusual predictor pattern and can strongly influence the fitted regression surface.

6 The fitted values, residuals and the hat matrix

Recall the hat matrix \[ H=X(X^\top X)^{-1}X^\top. \] Then \[ \widehat{\mathbf y}=H\mathbf y, \qquad \mathbf e=\mathbf y-\widehat{\mathbf y}=(I-H)\mathbf y. \] The matrix \(H\) is symmetric and idempotent: \[ H^\top=H, \qquad H^2=H. \] Thus it is the orthogonal projection matrix onto the column space of \(X\).

H <- X %*% solve(t(X) %*% X) %*% t(X)
yhat <- H %*% y
e <- y - yhat

checks <- c(
  max_abs_yhat_difference = max(abs(as.vector(yhat) - fitted(fit))),
  symmetry_error = max(abs(H - t(H))),
  idempotence_error = max(abs(H %*% H - H)),
  normal_equation_error = max(abs(t(X) %*% e))
)
round(checks, 10)
#> max_abs_yhat_difference          symmetry_error       idempotence_error 
#>                 0.0e+00                 0.0e+00                 0.0e+00 
#>   normal_equation_error 
#>                 8.1e-09

The last number verifies \(X^\top\mathbf e=\mathbf 0\) up to floating-point rounding: the residuals are orthogonal to every column of the design matrix.

7 Geometry of OLS in subject space

Regression geometry is most precise in \(\mathbb R^n\), not in the two- or three-dimensional predictor plot. The vector \(\mathbf y\in\mathbb R^n\) is projected onto the column space \(\mathcal C(X)\). Its projection is \(\widehat{\mathbf y}=H\mathbf y\), and the residual vector \(\mathbf e=\mathbf y-\widehat{\mathbf y}\) is perpendicular to \(\mathcal C(X)\).

Because the intercept column \(\mathbf 1_n\) belongs to \(\mathcal C(X)\), we get \[ \mathbf 1_n^\top \mathbf e=0. \] Therefore residuals sum to zero and the observed and fitted values have the same mean: \[ \sum_{i=1}^n e_i=0, \qquad \overline{y}=\overline{\widehat y}. \]

c(sum_residuals = sum(resid(fit)), mean_y = mean(y), mean_fitted = mean(fitted(fit)))
#> sum_residuals        mean_y   mean_fitted 
#> -3.608225e-16  2.009062e+01  2.009062e+01

7.1 A geometric right triangle for \(R^2\)

With an intercept in the model, \[ \mathbf y-\overline y\mathbf 1_n = (\widehat{\mathbf y}-\overline y\mathbf 1_n)+\mathbf e. \] Here \(\widehat{\mathbf y}-\overline y\mathbf 1_n\in\mathcal C(X)\), while \(\mathbf e\perp\mathcal C(X)\). Hence these three vectors form a right triangle.

Thus \[ \|\mathbf y-\overline y\mathbf 1_n\|^2 = \|\widehat{\mathbf y}-\overline y\mathbf 1_n\|^2 + \|\mathbf e\|^2. \] Equivalently, \[ SST=SSR+SSE. \]

The code below computes this decomposition numerically, and the plot in Section 7.2 displays the corresponding right triangle. Thus \(R^2=SSR/SST=\cos^2\theta\), where \(\theta\) is the angle between the centred response \(\mathbf y-\overline y\mathbf 1_n\) and the centred fitted vector \(\widehat{\mathbf y}-\overline y\mathbf 1_n\).

yvec <- as.vector(y)
yhat_vec <- as.vector(fitted(fit))
evec <- as.vector(resid(fit))
ybar1 <- mean(yvec) * rep(1, length(yvec))

SST <- sum((yvec - ybar1)^2)
SSR <- sum((yhat_vec - ybar1)^2)
SSE <- sum((yvec - yhat_vec)^2)

R2_length <- SSR / SST
R2_def <- 1 - SSE / SST
R_multiple <- cor(yvec, yhat_vec)

kable_num(
  data.frame(SST = SST, SSR = SSR, SSE = SSE,
             R2_from_lengths = R2_length,
             R2_from_definition = R2_def,
             multiple_R = R_multiple,
             multiple_R_squared = R_multiple^2),
  digits = 5,
  caption = "Geometric decomposition of sums of squares and relation to multiple R"
)
Geometric decomposition of sums of squares and relation to multiple R
SST SSR SSE R2_from_lengths R2_from_definition multiple_R multiple_R_squared
1126.047 939.9879 186.0593 0.83477 0.83477 0.91366 0.83477

7.2 Plotting the \(R^2\) triangle

The following plot is not the original \(n\)-dimensional vector space. It is a two-dimensional diagram with the same side lengths as the right triangle in \(\mathbb R^n\).

len_total <- sqrt(SST)
len_reg <- sqrt(SSR)
len_res <- sqrt(SSE)
theta <- acos(R_multiple)

plot(c(0, len_total * 1.05), c(0, len_res * 1.25), type = "n",
     xlab = "Length along fitted centred vector",
     ylab = "Residual length",
     main = expression("Geometry of "*R^2*": right triangle in subject space"))
segments(0, 0, len_reg, 0, lwd = 3, col = "darkgreen")
segments(len_reg, 0, len_reg, len_res, lwd = 3, col = "firebrick")
segments(0, 0, len_reg, len_res, lwd = 3, col = "steelblue")
points(c(0, len_reg, len_reg), c(0, 0, len_res), pch = 19)
text(len_reg/2, -0.08 * len_res, "|| yhat - ybar 1 ||", col = "darkgreen")
text(len_reg * 1.02, len_res/2, "|| y - yhat ||", col = "firebrick", pos = 4)
text(len_reg/2, len_res/2 + 0.08 * len_res, "|| y - ybar 1 ||", col = "steelblue")
legend("topleft",
       legend = c(paste0("R = cor(y, fitted) = ", round(R_multiple, 3)),
                  paste0("R^2 = ", round(R2_def, 3)),
                  paste0("theta = ", round(theta * 180/pi, 2), " degrees")),
       bty = "n")

The multiple correlation coefficient is \[ R=\operatorname{corr}(\mathbf y,\widehat{\mathbf y}) =\cos\theta. \] Therefore \[ R^2=\cos^2\theta =\frac{\|\widehat{\mathbf y}-\overline y\mathbf 1_n\|^2} {\|\mathbf y-\overline y\mathbf 1_n\|^2} =\frac{SSR}{SST}. \] This equivalence depends on the intercept. Without the intercept, the residuals need not sum to zero, the Pythagorean decomposition may fail in the usual centred form, and some software uses a different definition of \(R^2\).
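A quick demonstration of the role of the intercept, as a sketch: refit the same model without the intercept column and observe that the residuals no longer sum to zero and that R switches to an uncentred definition of \(R^2\).

# Same predictors, intercept removed
fit_noint <- lm(mpg ~ wt + hp + qsec - 1, data = mtcars)
sum(resid(fit_noint))          # generally nonzero without the 1_n column
summary(fit_noint)$r.squared   # uncentred R-squared; not comparable with the value above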

8 Regression plane for two predictors

With two predictors the fitted response surface is a plane: \[ \widehat y=\widehat\beta_0+\widehat\beta_1 x_1+\widehat\beta_2 x_2. \] The next plot is a visual representation of the model mpg ~ wt + hp. It is useful pedagogically, but remember that the exact projection geometry above lives in \(\mathbb R^n\).

fit2 <- lm(mpg ~ wt + hp, data = mtcars)

wt_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 25)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 25)
z_mat <- outer(wt_seq, hp_seq,
               function(wt, hp) coef(fit2)[1] + coef(fit2)[2] * wt + coef(fit2)[3] * hp)

pmat <- persp(wt_seq, hp_seq, z_mat,
              theta = 40, phi = 20,
              col = "lightblue", border = "gray70",
              xlab = "Weight", ylab = "Horsepower", zlab = "MPG",
              main = "Fitted regression plane: mpg ~ wt + hp")
points(trans3d(mtcars$wt, mtcars$hp, mtcars$mpg, pmat),
       pch = 19, col = "firebrick")

9 Expectation and variance of OLS

Since \[ \widehat{\boldsymbol\beta}=\boldsymbol\beta+(X^\top X)^{-1}X^\top\boldsymbol\varepsilon, \] we have \[ \mathbb E(\widehat{\boldsymbol\beta}\mid X)=\boldsymbol\beta. \] If \(\operatorname{Var}(\boldsymbol\varepsilon\mid X)=\sigma^2I_n\), then \[ \operatorname{Var}(\widehat{\boldsymbol\beta}\mid X)=\sigma^2(X^\top X)^{-1}. \] An unbiased estimator of \(\sigma^2\) is \[ \widehat\sigma^2=MSE=\frac{SSE}{n-p}. \]

n <- nrow(X)
p <- ncol(X)
SSE <- sum(resid(fit)^2)
MSE <- SSE / (n - p)
C <- solve(t(X) %*% X)
se_manual <- sqrt(MSE * diag(C))

kable_num(
  data.frame(Estimate = coef(fit), SE_manual = se_manual,
             SE_summary = summary(fit)$coefficients[, "Std. Error"]),
  digits = 5,
  caption = "Manual standard errors compared with summary(lm)"
)
Manual standard errors compared with summary(lm)
Estimate SE_manual SE_summary
(Intercept) 27.61053 8.41993 8.41993
wt -4.35880 0.75270 0.75270
hp -0.01782 0.01498 0.01498
qsec 0.51083 0.43922 0.43922

10 Hypothesis testing

10.1 Global ANOVA F-test

The global test asks whether the regression explains any linear signal beyond the intercept: \[ H_0:\beta_1=\cdots=\beta_k=0. \] The test statistic is \[ F=\frac{SSR/k}{SSE/(n-p)}. \]

summary(fit)$fstatistic
#>    value    numdf    dendf 
#> 47.15282  3.00000 28.00000
anova(fit)
#> Analysis of Variance Table
#> 
#> Response: mpg
#>           Df Sum Sq Mean Sq  F value    Pr(>F)    
#> wt         1 847.73  847.73 127.5739 6.131e-12 ***
#> hp         1  83.27   83.27  12.5319   0.00142 ** 
#> qsec       1   8.99    8.99   1.3527   0.25463    
#> Residuals 28 186.06    6.64                       
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
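Note that anova(fit) reports sequential (Type I) sums of squares, so its rows depend on the order in which the predictors enter the formula. The single global F statistic, however, depends only on the overall decomposition and can be reproduced by hand from the quantities computed in Sections 7.1 and 9 (a sketch):

# Manual global F test: F = (SSR / k) / (SSE / (n - p)), with k = p - 1 slopes
k <- p - 1
F_manual <- (SSR / k) / (SSE / (n - p))
c(F_manual = F_manual,
  F_from_summary = unname(summary(fit)$fstatistic["value"]),
  p_value = pf(F_manual, k, n - p, lower.tail = FALSE))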

10.2 Marginal t-test for one coefficient

For a single coefficient, \[ H_0:\beta_j=0, \qquad T_j=\frac{\widehat\beta_j}{\sqrt{MSE\,C_{jj}}}\sim t_{n-p} \] under the normal error model.

summary(fit)$coefficients
#>                Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept) 27.61052686 8.41992848  3.279188 2.784556e-03
#> wt          -4.35879720 0.75270039 -5.790879 3.217222e-06
#> hp          -0.01782227 0.01498117 -1.189645 2.441762e-01
#> qsec         0.51083369 0.43922153  1.163043 2.546284e-01
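As a check on the formula, the t statistic for wt can be rebuilt from MSE and the corresponding diagonal element of \(C=(X^\top X)^{-1}\) computed in Section 9 (a sketch):

# Manual t statistic and two-sided p-value for the wt coefficient
j <- which(colnames(X) == "wt")
t_wt <- coef(fit)["wt"] / sqrt(MSE * diag(C)[j])
p_wt <- 2 * pt(abs(t_wt), df = n - p, lower.tail = FALSE)
c(t_manual = unname(t_wt), p_manual = unname(p_wt))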

A non-significant marginal coefficient does not automatically imply that the variable is scientifically useless. Collinearity, interactions, design constraints and subject-matter interpretation must be considered.

10.3 Partial F-test for adding a block of predictors

Suppose the reduced model has design matrix \(X_1\) and the full model has \(X=[X_1,X_2]\). To test whether the block \(X_2\) adds explanatory power after \(X_1\), use \[ F=\frac{(SSE_R-SSE_F)/r}{SSE_F/(n-p_F)}. \] Here \(r\) is the number of added coefficients.

fit_red <- lm(mpg ~ wt + hp, data = mtcars)
anova(fit_red, fit)
#> Analysis of Variance Table
#> 
#> Model 1: mpg ~ wt + hp
#> Model 2: mpg ~ wt + hp + qsec
#>   Res.Df    RSS Df Sum of Sq      F Pr(>F)
#> 1     29 195.05                           
#> 2     28 186.06  1    8.9885 1.3527 0.2546
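The same partial F statistic can be rebuilt directly from the two residual sums of squares; here \(r=1\) because only qsec is added (a sketch):

# Manual partial F test for adding qsec to the model containing wt and hp
SSE_R <- sum(resid(fit_red)^2)   # reduced model
SSE_F <- sum(resid(fit)^2)       # full model
r <- 1                           # number of added coefficients
F_partial <- ((SSE_R - SSE_F) / r) / (SSE_F / (n - p))
c(F_partial = F_partial,
  p_value = pf(F_partial, r, n - p, lower.tail = FALSE))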

11 Confidence intervals and prediction intervals

For a new vector \(x_0=(1,x_{01},\ldots,x_{0k})^\top\), the fitted mean is \[ \widehat\mu(x_0)=x_0^\top\widehat{\boldsymbol\beta}. \]

A confidence interval for the mean response is \[ \widehat\mu(x_0)\pm t_{\alpha/2,n-p}\sqrt{MSE\,x_0^\top(X^\top X)^{-1}x_0}. \]

A prediction interval for one future observation is \[ \widehat\mu(x_0)\pm t_{\alpha/2,n-p}\sqrt{MSE\{1+x_0^\top(X^\top X)^{-1}x_0\}}. \] The extra \(1\) makes the prediction interval wider because a future observation has its own random error.

new_car <- data.frame(wt = 3.0, hp = 120, qsec = 18)
predict(fit, newdata = new_car, interval = "confidence")
#>        fit      lwr      upr
#> 1 21.59047 20.53118 22.64976
predict(fit, newdata = new_car, interval = "prediction")
#>        fit      lwr      upr
#> 1 21.59047 16.20491 26.97603
confint(fit)
#>                  2.5 %      97.5 %
#> (Intercept) 10.3630852 44.85796848
#> wt          -5.9006341 -2.81696034
#> hp          -0.0485098  0.01286526
#> qsec        -0.3888708  1.41053822
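To connect the formulas to the predict() output above, both intervals for the same new car can be rebuilt from the matrices of Section 9 (a sketch; x0_new is simply the predictor vector of new_car with a leading 1):

# Manual confidence and prediction intervals at x0
x0_new <- c(1, wt = 3.0, hp = 120, qsec = 18)
mu0 <- sum(x0_new * coef(fit))                   # point estimate x0' beta_hat
v0 <- as.numeric(t(x0_new) %*% C %*% x0_new)     # x0' (X'X)^{-1} x0
tcrit <- qt(0.975, df = n - p)
ci_mean <- mu0 + c(-1, 1) * tcrit * sqrt(MSE * v0)        # mean response
pi_new  <- mu0 + c(-1, 1) * tcrit * sqrt(MSE * (1 + v0))  # single future observation
rbind(confidence = c(fit = mu0, lwr = ci_mean[1], upr = ci_mean[2]),
      prediction = c(fit = mu0, lwr = pi_new[1], upr = pi_new[2]))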

12 Adjusted \(R^2\), AIC and BIC

The ordinary \(R^2\) never decreases when new predictors are added. Adjusted \(R^2\), AIC and BIC penalize complexity.

fit_small <- lm(mpg ~ wt, data = mtcars)
fit_mid <- lm(mpg ~ wt + hp, data = mtcars)
fit_big <- lm(mpg ~ wt + hp + qsec + disp + cyl, data = mtcars)

model_table <- data.frame(
  model = c("wt", "wt + hp", "wt + hp + qsec", "larger model"),
  R2 = c(summary(fit_small)$r.squared,
         summary(fit_mid)$r.squared,
         summary(fit)$r.squared,
         summary(fit_big)$r.squared),
  adj_R2 = c(summary(fit_small)$adj.r.squared,
             summary(fit_mid)$adj.r.squared,
             summary(fit)$adj.r.squared,
             summary(fit_big)$adj.r.squared),
  AIC = c(AIC(fit_small), AIC(fit_mid), AIC(fit), AIC(fit_big)),
  BIC = c(BIC(fit_small), BIC(fit_mid), BIC(fit), BIC(fit_big))
)
kable_num(model_table, digits = 4,
          caption = "Model comparison by R2, adjusted R2, AIC and BIC")
Model comparison by R2, adjusted R2, AIC and BIC
model R2 adj_R2 AIC BIC
wt 0.7528 0.7446 166.0294 170.4266
wt + hp 0.8268 0.8148 156.6523 162.5153
wt + hp + qsec 0.8348 0.8171 157.1426 164.4713
larger model 0.8502 0.8214 158.0056 168.2658
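Adjusted \(R^2\) has the closed form \(R^2_{adj}=1-(1-R^2)\,(n-1)/(n-p)\), which makes the complexity penalty explicit. A quick manual check for the three-predictor model (a sketch):

# Manual adjusted R-squared for mpg ~ wt + hp + qsec
R2_fit <- summary(fit)$r.squared
adjR2_manual <- 1 - (1 - R2_fit) * (n - 1) / (n - p)
c(adjR2_manual = adjR2_manual,
  adjR2_summary = summary(fit)$adj.r.squared)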

Good model selection balances fit, parsimony, interpretability and scientific plausibility.

13 Diagnostics

Regression diagnostics ask whether the assumptions are reasonable.

par(mfrow = c(2, 2))
plot(fit)

par(mfrow = c(1, 1))

Interpretation: the Residuals vs Fitted panel checks linearity and roughly constant spread, the Normal Q-Q panel checks approximate normality of the residuals, the Scale-Location panel checks homoscedasticity more directly, and the Residuals vs Leverage panel flags observations that combine large residuals with high leverage.

14 Leverage, studentized residuals and Cook’s distance

The diagonal entries of \(H\) are leverages: \[ h_{ii}=x_i^\top(X^\top X)^{-1}x_i. \] A high-leverage point has an unusual predictor combination. Cook’s distance combines residual size and leverage.

infl <- data.frame(
  car = rownames(mtcars),
  fitted = fitted(fit),
  residual = resid(fit),
  leverage = hatvalues(fit),
  studentized = rstudent(fit),
  cooksD = cooks.distance(fit)
)

infl_top <- infl[order(-infl$cooksD), ][1:8, ]
kable_num(infl_top, digits = 4,
          caption = "Most influential observations by Cook's distance")
Most influential observations by Cook’s distance
car fitted residual leverage studentized cooksD
Chrysler Imperial Chrysler Imperial 9.1124 5.5876 0.1872 2.6504 0.3328
Toyota Corolla Toyota Corolla 28.6193 5.2807 0.1360 2.3805 0.1912
Maserati Bora Maserati Bora 13.5373 1.4627 0.4650 0.7701 0.1308
Fiat 128 Fiat 128 26.7908 5.6092 0.0840 2.4724 0.1185
Toyota Corona Toyota Corona 25.3591 -3.8591 0.1195 -1.6431 0.0863
Merc 230 Merc 230 23.8853 -1.0853 0.4581 -0.5649 0.0691
Lotus Europa Lotus Europa 27.6348 2.7652 0.1557 1.1753 0.0628
Valiant Valiant 20.9868 -2.8868 0.0964 -1.1867 0.0370
plot(hatvalues(fit), rstudent(fit),
     xlab = "Leverage", ylab = "Externally studentized residual",
     main = "Leverage and studentized residuals",
     pch = 19, col = "steelblue")
abline(h = c(-2, 2), lty = 2, col = "firebrick")
abline(v = 2 * p / n, lty = 2, col = "darkgreen")
text(hatvalues(fit), rstudent(fit), labels = rownames(mtcars), pos = 4, cex = 0.7)
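As a check on how leverage and residual size combine, Cook's distance can be reproduced from the residuals, the leverages and MSE via \(D_i = e_i^2\,h_{ii}\,/\,\{p\,MSE\,(1-h_{ii})^2\}\) (a sketch; \(p\) is the number of estimated coefficients):

# Manual Cook's distance compared with the built-in function
lev <- hatvalues(fit)
cooksD_manual <- resid(fit)^2 * lev / (p * MSE * (1 - lev)^2)
max(abs(cooksD_manual - cooks.distance(fit)))   # should be numerically zero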

A point is influential when it combines an unusual position in predictor space (high leverage) with a residual large enough to pull the fitted regression surface toward it; Cook's distance summarizes both ingredients.

15 Multicollinearity

Multicollinearity means that some predictors are nearly linearly dependent. It makes coefficient estimates unstable and standard errors large.

For predictor \(x_j\), define \[ VIF_j=\frac{1}{1-R_j^2}, \] where \(R_j^2\) is obtained by regressing \(x_j\) on the remaining predictors.

vif_manual <- function(fit) {
  X <- model.matrix(fit)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  if (ncol(X) < 2) {
    return(data.frame(term = colnames(X), VIF = NA_real_))
  }
  out <- numeric(ncol(X))
  for (j in seq_len(ncol(X))) {
    yj <- X[, j]
    others <- X[, -j, drop = FALSE]
    r2 <- summary(lm(yj ~ others))$r.squared
    out[j] <- 1 / (1 - r2)
  }
  data.frame(term = colnames(X), VIF = out)
}

kable_num(vif_manual(fit_big), digits = 3,
          caption = "Manual variance inflation factors for the larger model")
Manual variance inflation factors for the larger model
term VIF
wt 7.175
hp 5.235
qsec 3.625
disp 10.407
cyl 7.796

A rule of thumb is that VIF above 5 or 10 should trigger concern, though the threshold is context-dependent.
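If the car package happens to be installed, its vif() function should agree with the manual calculation above; this optional cross-check is a sketch and requires that package.

# Optional cross-check against the car package, if available
if (requireNamespace("car", quietly = TRUE)) {
  car::vif(fit_big)
}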

16 Hidden extrapolation

In multiple regression, a point can be inside the marginal range of each predictor but outside the joint cloud of observed predictor combinations. This is called hidden extrapolation.

For a new point \(x_0\), the quantity \[ h_0=x_0^\top(X^\top X)^{-1}x_0 \] is the leverage of the new point. A useful warning rule is to compare \(h_0\) with the maximum observed leverage.

fit2 <- lm(mpg ~ wt + hp, data = mtcars)
X2 <- model.matrix(fit2)
hmax <- max(hatvalues(fit2))

x0 <- c(1, wt = 2.0, hp = 300)  # inside separate ranges, but unusual jointly
h0 <- as.numeric(t(x0) %*% solve(t(X2) %*% X2) %*% x0)

c(max_observed_leverage = hmax, new_point_leverage = h0, hidden_extrapolation_warning = h0 > hmax)
#>        max_observed_leverage           new_point_leverage 
#>                    0.3942082                    0.6132535 
#> hidden_extrapolation_warning 
#>                    1.0000000
plot(mtcars$wt, mtcars$hp,
     xlab = "Weight", ylab = "Horsepower",
     main = "Hidden extrapolation: marginally plausible, jointly unusual",
     pch = 19, col = "steelblue")
points(x0["wt"], x0["hp"], pch = 19, col = "firebrick", cex = 1.5)
legend("topleft", legend = c("observed cars", "new point"),
       pch = 19, col = c("steelblue", "firebrick"), bty = "n")

17 Units and standardized regression coefficients

Regression coefficients depend on the units of measurement. To compare relative effects, we may standardize variables.

Unit normal scaling uses \[ z_{ij}=\frac{x_{ij}-\overline{x}_j}{s_j}, \qquad y_i^*=\frac{y_i-\overline y}{s_y}. \] The regression of \(y^*\) on the standardized predictors has an intercept of zero, up to numerical rounding.

mt_std <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp", "qsec")]))
fit_std <- lm(mpg ~ wt + hp + qsec, data = mt_std)
summary(fit_std)
#> 
#> Call:
#> lm(formula = mpg ~ wt + hp + qsec, data = mt_std)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.64031 -0.27240 -0.07691  0.19811  0.93068 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -1.615e-17  7.561e-02   0.000    1.000    
#> wt          -7.076e-01  1.222e-01  -5.791 3.22e-06 ***
#> hp          -2.027e-01  1.704e-01  -1.190    0.244    
#> qsec         1.515e-01  1.302e-01   1.163    0.255    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4277 on 28 degrees of freedom
#> Multiple R-squared:  0.8348, Adjusted R-squared:  0.8171 
#> F-statistic: 47.15 on 3 and 28 DF,  p-value: 4.506e-11

The standardized coefficients are useful for comparing relative linear association within the fitted model, but they should not replace subject-matter interpretation.
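The standardized and original slopes are linked by \(\widehat\beta_j=\widehat\beta_j^{*}\,s_y/s_{x_j}\). A quick back-transformation check for wt (a sketch):

# Convert the standardized wt slope back to the original units of mpg per 1000 lb
b_std_wt <- coef(fit_std)["wt"]
b_wt_recovered <- b_std_wt * sd(mtcars$mpg) / sd(mtcars$wt)
c(recovered = unname(b_wt_recovered),
  original  = unname(coef(fit)["wt"]))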

18 Polynomial and interaction terms

A model may be linear in coefficients while nonlinear in predictors: \[ Y=\beta_0+\beta_1x+\beta_2x^2+\varepsilon. \] Interactions are also linear models: \[ Y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2+\varepsilon. \]

fit_inter <- lm(mpg ~ wt * hp + qsec, data = mtcars)
fit_poly <- lm(mpg ~ wt + I(wt^2) + hp + qsec, data = mtcars)

anova(fit, fit_inter)
#> Analysis of Variance Table
#> 
#> Model 1: mpg ~ wt + hp + qsec
#> Model 2: mpg ~ wt * hp + qsec
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
#> 1     28 186.06                                  
#> 2     27 121.04  1    65.018 14.503 0.0007333 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit, fit_poly)
#> Analysis of Variance Table
#> 
#> Model 1: mpg ~ wt + hp + qsec
#> Model 2: mpg ~ wt + I(wt^2) + hp + qsec
#>   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
#> 1     28 186.06                                
#> 2     27 130.59  1    55.466 11.468 0.002185 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When using interactions or polynomial terms, centering predictors can make interpretation more stable.
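A minimal sketch of such centering for the interaction model: after subtracting the sample means of wt and hp, the main-effect coefficients describe a car of average weight and average horsepower, while the interaction coefficient and the fitted values are unchanged.

# Refit the interaction model with mean-centred predictors
mt_c <- transform(mtcars,
                  wt_c = wt - mean(wt),
                  hp_c = hp - mean(hp))
fit_inter_c <- lm(mpg ~ wt_c * hp_c + qsec, data = mt_c)
coef(summary(fit_inter_c))   # main effects are now evaluated at average wt and hp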

19 Heteroscedasticity and robust standard errors

If the errors are heteroscedastic, the OLS coefficients remain unbiased as long as \(\mathbb E(\boldsymbol\varepsilon\mid X)=0\), but the usual standard errors are no longer reliable, and t- and F-tests based on them can be misleading.

A simple White-type auxiliary regression regresses squared residuals on predictors, squares and interactions.

e2 <- resid(fit)^2
white_aux <- lm(e2 ~ wt + hp + qsec + I(wt^2) + I(hp^2) + I(qsec^2) + wt:hp + wt:qsec + hp:qsec,
                data = mtcars)
LM <- length(e2) * summary(white_aux)$r.squared
df_white <- length(coef(white_aux)) - 1
p_white <- 1 - pchisq(LM, df = df_white)

c(LM_statistic = LM, df = df_white, p_value = p_white)
#> LM_statistic           df      p_value 
#>   12.5312885    9.0000000    0.1849867

A heteroscedasticity-consistent covariance matrix can be computed manually. HC3 uses \[ D_{ii}=\frac{e_i^2}{(1-h_{ii})^2}, \qquad \widehat{\operatorname{Var}}_{HC3}(\widehat\beta) =(X^\top X)^{-1}X^\top D X(X^\top X)^{-1}. \]

X <- model.matrix(fit)
e <- resid(fit)
h <- hatvalues(fit)
D <- diag((e^2) / (1 - h)^2)
V_hc3 <- solve(t(X) %*% X) %*% t(X) %*% D %*% X %*% solve(t(X) %*% X)
se_hc3 <- sqrt(diag(V_hc3))

compare_se <- data.frame(
  Estimate = coef(fit),
  Classical_SE = summary(fit)$coefficients[, "Std. Error"],
  HC3_SE = se_hc3,
  Classical_t = coef(fit) / summary(fit)$coefficients[, "Std. Error"],
  HC3_t = coef(fit) / se_hc3
)
kable_num(compare_se, digits = 4,
          caption = "Classical versus HC3 robust standard errors")
Classical versus HC3 robust standard errors
Estimate Classical_SE HC3_SE Classical_t HC3_t
(Intercept) 27.6105 8.4199 7.5473 3.2792 3.6583
wt -4.3588 0.7527 0.9501 -5.7909 -4.5879
hp -0.0178 0.0150 0.0138 -1.1896 -1.2897
qsec 0.5108 0.4392 0.4336 1.1630 1.1781

20 Simulation 1: unbiasedness and interval coverage

We now verify the classical theory by simulation. The design matrix is fixed, the true coefficient vector is known, and we repeatedly generate Gaussian errors.

B <- 5000
n_sim <- 80
x1 <- rnorm(n_sim)
x2 <- 0.4 * x1 + rnorm(n_sim)
X_sim <- cbind(1, x1, x2)
beta_true <- c(1, 2, -1)
sigma_true <- 1
p_sim <- ncol(X_sim)

est <- matrix(NA_real_, B, p_sim)
cover <- matrix(NA, B, p_sim)

for (b in 1:B) {
  y_sim <- as.vector(X_sim %*% beta_true + rnorm(n_sim, sd = sigma_true))
  fit_sim <- lm(y_sim ~ x1 + x2)
  est[b, ] <- coef(fit_sim)
  ci <- confint(fit_sim)
  cover[b, ] <- beta_true >= ci[, 1] & beta_true <= ci[, 2]
}

sim_table <- data.frame(
  coefficient = c("beta0", "beta1", "beta2"),
  true = beta_true,
  mean_estimate = colMeans(est),
  empirical_variance = apply(est, 2, var),
  coverage_95 = colMeans(cover)
)
kable_num(sim_table, digits = 4,
          caption = "Monte Carlo check: unbiasedness and 95 percent interval coverage")
Monte Carlo check: unbiasedness and 95 percent interval coverage
coefficient true mean_estimate empirical_variance coverage_95
beta0 1 1.0011 0.0129 0.9474
beta1 2 1.9994 0.0154 0.9496
beta2 -1 -1.0005 0.0118 0.9480

The estimates are centered near the truth and the confidence interval coverage is near 0.95 when the model assumptions hold.

21 Simulation 2: multicollinearity inflates uncertainty

We compare two designs: one with weak collinearity and one with strong collinearity.

sim_collinearity <- function(rho, B = 3000, n = 80) {
  est <- matrix(NA_real_, B, 3)
  for (b in 1:B) {
    x1 <- rnorm(n)
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
    y <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
    est[b, ] <- coef(lm(y ~ x1 + x2))
  }
  apply(est, 2, var)
}

var_low <- sim_collinearity(rho = 0.2)
var_high <- sim_collinearity(rho = 0.98)

kable_num(
  rbind(low_collinearity = var_low, high_collinearity = var_high),
  digits = 4,
  caption = "Sampling variance of OLS estimates under low versus high collinearity"
)
Sampling variance of OLS estimates under low versus high collinearity
 var_beta0 var_beta1 var_beta2
low_collinearity 0.0131 0.0143 0.0136
high_collinearity 0.0124 0.3398 0.3393

Multicollinearity may leave prediction reasonably good but makes individual coefficient interpretation unstable.

22 Simulation 3: ill-posed high-dimensional regression

If \(p>n\), the matrix \(X^\top X\) cannot have full rank. Classical OLS is not uniquely determined.

n_hd <- 20
p_hd <- 35
X_hd <- matrix(rnorm(n_hd * p_hd), nrow = n_hd, ncol = p_hd)
y_hd <- rnorm(n_hd)

rank_X <- qr(X_hd)$rank
rank_XtX <- qr(t(X_hd) %*% X_hd)$rank
c(n = n_hd, p = p_hd, rank_X = rank_X, rank_XtX = rank_XtX)
#>        n        p   rank_X rank_XtX 
#>       20       35       20       20
try_solve <- try(solve(t(X_hd) %*% X_hd), silent = TRUE)
try_solve
#> [1] "Error in solve.default(t(X_hd) %*% X_hd) : \n  system is computationally singular: reciprocal condition number = 3.84338e-19\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in solve.default(t(X_hd) %*% X_hd): system is computationally singular: reciprocal condition number = 3.84338e-19>

A ridge-type estimator replaces \((X^\top X)^{-1}\) by \((X^\top X+\lambda I)^{-1}\): \[ \widehat\beta_{ridge}(\lambda)=(X^\top X+\lambda I)^{-1}X^\top y. \]

lambda <- 1
beta_ridge <- solve(t(X_hd) %*% X_hd + lambda * diag(p_hd), t(X_hd) %*% y_hd)
length(beta_ridge)
#> [1] 35
head(beta_ridge)
#>             [,1]
#> [1,]  0.25594043
#> [2,]  0.18178068
#> [3,]  0.01720601
#> [4,] -0.06574177
#> [5,] -0.18636388
#> [6,] -0.09152578

This demonstrates why regularization is not just a computational trick; it is necessary when the regression problem is ill-posed.

23 Practical checklist for students

Before reporting a multiple linear regression model, check the following.

  1. Meaning of the model. Are the response and predictors scientifically sensible?
  2. Rank and collinearity. Is \(X\) full rank? Are VIFs reasonable?
  3. Coefficient interpretation. Are slopes interpreted conditionally?
  4. Overall model. Is the global \(F\)-test meaningful for the scientific question?
  5. Partial contribution. Are nested model comparisons used when needed?
  6. Diagnostics. Are residual plots, leverage and Cook’s distance checked?
  7. Prediction. Are prediction intervals used when predicting future observations?
  8. Extrapolation. Is the new point inside the joint predictor cloud?
  9. Model revision. Are nonlinear terms, interactions, transformations or robust SE needed?
  10. Reproducibility. Can the analysis be rerun from raw data and code?

24 A reusable MLR workflow function

The function below gives students a compact automatic summary for many classroom datasets.

mlr_report <- function(formula, data) {
  fit <- lm(formula, data = data)
  X <- model.matrix(fit)
  y <- model.response(model.frame(fit))
  n <- nrow(X); p <- ncol(X)
  H <- X %*% solve(t(X) %*% X) %*% t(X)
  e <- resid(fit)
  SSE <- sum(e^2)
  MSE <- SSE / (n - p)
  out <- list(
    fit = fit,
    coefficients = summary(fit)$coefficients,
    fit_measures = c(R2 = summary(fit)$r.squared,
                     adj_R2 = summary(fit)$adj.r.squared,
                     sigma_hat = sqrt(MSE),
                     AIC = AIC(fit),
                     BIC = BIC(fit)),
    VIF = try(vif_manual(fit), silent = TRUE),
    max_leverage = max(diag(H)),
    largest_cooks = sort(cooks.distance(fit), decreasing = TRUE)[1:min(5, n)]
  )
  return(out)
}

rep1 <- mlr_report(mpg ~ wt + hp + qsec, data = mtcars)
rep1$fit_measures
#>          R2      adj_R2   sigma_hat         AIC         BIC 
#>   0.8347678   0.8170643   2.5777849 157.1426108 164.4712904
kable_num(rep1$coefficients, digits = 4, caption = "Report function: coefficients")
Report function: coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.6105 8.4199 3.2792 0.0028
wt -4.3588 0.7527 -5.7909 0.0000
hp -0.0178 0.0150 -1.1896 0.2442
qsec 0.5108 0.4392 1.1630 0.2546
kable_num(rep1$VIF, digits = 3, caption = "Report function: VIF")
Report function: VIF
term VIF
wt 2.530
hp 4.922
qsec 2.874
rep1$largest_cooks
#> Chrysler Imperial    Toyota Corolla     Maserati Bora          Fiat 128 
#>        0.33284943        0.19121065        0.13078514        0.11850846 
#>     Toyota Corona 
#>        0.08631978

25 Short exercises

  1. Fit mpg ~ wt + hp + disp + cyl and compute VIFs. Which variable creates the strongest collinearity issue?
  2. Compare mpg ~ wt + hp with mpg ~ wt * hp. Is the interaction useful by partial \(F\)-test?
  3. Pick a new car with wt = 2, hp = 300. Compute the leverage of this new point and decide whether hidden extrapolation is a concern.
  4. Simulate data with heteroscedastic errors \(\varepsilon_i\sim N(0,\sigma_i^2)\), where \(\sigma_i=1+|x_i|\). Compare classical and HC3 standard errors.
  5. Repeat the Monte Carlo experiment with \(\rho=0.99\) between predictors and explain why slope estimation becomes unstable.

26 References

Montgomery, D. C., Peck, E. A., and Vining, G. G. (2021). Introduction to Linear Regression Analysis, 6th edition. Wiley.

Draper, N. R. and Smith, H. (1998). Applied Regression Analysis, 3rd edition. Wiley.

Seber, G. A. F. and Lee, A. J. (2003). Linear Regression Analysis, 2nd edition. Wiley.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.

Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B, 26(2), 211–252.