Replicating OLS output

1 Setup

# Clear the workspace
rm(list = ls()) # Clear environment
gc()            # Clear unused memory
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  609750 32.6    1384338   74         NA   715608 38.3
Vcells 1124703  8.6    8388608   64      49152  2010814 15.4
cat("\f")       # Clear the console

2 Multivariate Regression with mtcars

We begin by fitting a multiple regression model predicting miles per gallon (mpg) from the number of cylinders (cyl) and engine displacement (disp) using the built-in mtcars dataset.

df   <- mtcars
reg1 <- lm(data = df, formula = mpg ~ cyl + disp)
summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

3 Residuals

3.1 Inspecting Residuals

Residuals are the differences between observed and fitted values: \(e_i = y_i - \hat{y}_i\). A key property of OLS residuals is that they sum to zero (within rounding error).

reg1$residuals
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
        -0.84395255         -0.84395255         -3.28885510          1.57324352 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
         4.14732774         -2.40601638         -0.25267226         -0.89226849 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
        -2.61371193         -2.48751693         -3.88751693          0.11418581 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
         1.01418581         -1.08581419         -1.84730532         -2.09430892 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
         1.79401841          5.70804444          3.64629354          7.05160883 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
        -4.33979314          0.08281514         -0.50535572         -1.45850859 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
         5.47067308          0.61421953          0.16432359          4.04561603 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
         1.06207504         -2.45270705         -0.76710662         -4.42126787 
length(reg1$residuals)
[1] 32
# Residuals sum to (essentially) zero
round(x = sum(reg1$residuals),   digits = 14)
[1] 0
# Sum of squared residuals (SSR)
round(x = sum(reg1$residuals^2), digits = 14)
[1] 270.7403

3.2 Summary Statistics on Residuals

sum(reg1$residuals)
[1] 2.664535e-15
psych::describe(reg1$residuals)
   vars  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
X1    1 32    0 2.96  -0.64   -0.21 2.57 -4.42 7.05 11.47 0.67    -0.35 0.52

4 Replicating Regression Output by Hand

4.1 Replicating the t-value for \(\hat{\beta}_1\)

The t-statistic for a coefficient is its estimate divided by its standard error:

\[t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\]

# Manually
-1.58728 / 0.71184
[1] -2.229827
# Using stored coefficient
round(x = reg1$coefficients[2] / 0.71184, digits = 3)
  cyl 
-2.23 

4.2 Replicating the p-value for \(\hat{\beta}_1\)

The two-tailed p-value uses the t-distribution with \(n - (k + 1)\) degrees of freedom, where \(k\) is the number of predictors.

\[p = 2 \times P\!\left(T \leq t_{\text{obs}}\right)\]

2 * pt(q = -2.230, df = 32 - (2 + 1), lower.tail = TRUE)
[1] 0.03365088

4.3 Replicating the Residual Standard Error

The residual standard error (RSE) is the square root of the mean squared error:

\[\hat{\sigma} = \sqrt{\frac{\sum e_i^2}{n - (k+1)}} = \sqrt{\frac{SSR}{df_{\text{error}}}}\]

Because OLS residuals have a mean of exactly zero, the deviation from the mean equals the residual itself, simplifying the calculation.

residuals <- as.data.frame(reg1$residuals)

# Standard deviation of residuals (adjusts for df automatically)
sd(as.numeric(residuals$`reg1$residuals`))
[1] 2.955259
# Step-by-step manual calculation
residuals$`reg1$residuals` - mean(residuals$`reg1$residuals`)          # deviation from mean
 [1] -0.84395255 -0.84395255 -3.28885510  1.57324352  4.14732774 -2.40601638
 [7] -0.25267226 -0.89226849 -2.61371193 -2.48751693 -3.88751693  0.11418581
[13]  1.01418581 -1.08581419 -1.84730532 -2.09430892  1.79401841  5.70804444
[19]  3.64629354  7.05160883 -4.33979314  0.08281514 -0.50535572 -1.45850859
[25]  5.47067308  0.61421953  0.16432359  4.04561603  1.06207504 -2.45270705
[31] -0.76710662 -4.42126787
(residuals$`reg1$residuals` - mean(residuals$`reg1$residuals`))^2      # squared deviations
 [1]  0.712255903  0.712255903 10.816567877  2.475095181 17.200327418
 [6]  5.788914815  0.063843269  0.796143058  6.831490036  6.187740499
[11] 15.112787915  0.013038400  1.028572866  1.178992446  3.412536939
[16]  4.386129847  3.218502069 32.581771340 13.295456586 49.725187054
[21] 18.833804476  0.006858348  0.255384408  2.127247306 29.928263937
[26]  0.377265632  0.027002242 16.367009048  1.128003399  6.015771866
[31]  0.588452574 19.547609556
sum((residuals$`reg1$residuals` - mean(residuals$`reg1$residuals`))^2) # SSR
[1] 270.7403
# Variance of residuals
sum((residuals$`reg1$residuals` - mean(residuals$`reg1$residuals`))^2) / (32 - (2 + 1))
[1] 9.335872
# RSE
sqrt(sum((residuals$`reg1$residuals` - mean(residuals$`reg1$residuals`))^2) / (32 - (2 + 1)))
[1] 3.055466
# Shortcut: since mean(residuals) = 0, we can skip the mean-centering step
sqrt(sum(reg1$residuals^2) / 29)
[1] 3.055466
summary(reg1) # confirm

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

5 Matrix Approach to OLS

5.1 Building the Design Matrix \(\mathbf{X}\)

In matrix notation, the OLS model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{e}\). The design matrix \(\mathbf{X}\) contains a leading column of ones (for the intercept) followed by the predictor columns.

X <- as.matrix(cbind(df$cyl, df$disp))

# Prepend a column of 1s for the intercept
additional_column <- matrix(data = 1, nrow = nrow(X), ncol = 1)
X <- cbind(additional_column, X)

head(X)
     [,1] [,2] [,3]
[1,]    1    6  160
[2,]    1    6  160
[3,]    1    4  108
[4,]    1    6  258
[5,]    1    8  360
[6,]    1    6  225

5.2 What is \(\mathbf{X}\)?

\(\mathbf{X}\) is the design matrix — an \(n \times (k+1)\) matrix where:

  • \(n\) = number of observations (rows)
  • \(k\) = number of predictors
  • The first column is all 1s, anchoring the intercept
  • Each subsequent column holds the observed values of a predictor

5.3 What is \(\mathbf{X}^\top\mathbf{X}\)?

\(\mathbf{X}^\top\mathbf{X}\) is the Sum of Squares and Cross Products (SSCP) matrix — a \((k+1) \times (k+1)\) symmetric matrix where:

  • The diagonal entries are sums of squares for each column of \(\mathbf{X}\)
  • The off-diagonal entries are cross products between pairs of columns

\[\mathbf{X}^\top\mathbf{X} = \begin{bmatrix} \sum 1 & \sum x_1 & \sum x_2 \\ \sum x_1 & \sum x_1^2 & \sum x_1 x_2 \\ \sum x_2 & \sum x_1 x_2 & \sum x_2^2 \end{bmatrix}\]

Variations on this theme reveal different matrix interpretations depending on how \(\mathbf{X}\) is scaled:

Transformation of \(\mathbf{X}\) \(\frac{1}{n-1}\,\mathbf{X}^\top\mathbf{X}\) produces… Diagonal Off-diagonal
Raw \(\mathbf{X}\) SSCP matrix Sums of squares Cross products
Centered \(\mathbf{X}\) ÷ \((n-1)\) Variance–Covariance matrix Variances Covariances
Standardized \(\mathbf{X}\) ÷ \((n-1)\) Pearson Correlation matrix 1s Correlations
Unit-scaled \(\mathbf{X}\) (no ÷) Cosine Similarity matrix 1s Cosine similarities
X_transposed <- t(X)
XtX          <- X_transposed %*% X
XtX
       [,1]    [,2]      [,3]
[1,]   32.0   198.0    7383.1
[2,]  198.0  1324.0   51872.4
[3,] 7383.1 51872.4 2179627.5

5.4 Printing the Four \(\mathbf{X}^\top\mathbf{X}\) Matrices

To keep things clean, all four transformations use only the predictor columns (no intercept column of 1s), since centering, standardizing, or unit-scaling a constant column is undefined.

5.4.1 1. Raw \(\mathbf{X}\): SSCP Matrix

\(\mathbf{X}_{\text{raw}}^\top \mathbf{X}_{\text{raw}}\) gives the Sum of Squares and Cross Products (SSCP) matrix — raw sums of squares on the diagonal, raw cross products off the diagonal.

X_raw <- as.matrix(cbind(df$cyl, df$disp))
colnames(X_raw) <- c("cyl", "disp")

SSCP <- t(X_raw) %*% X_raw
cat("--- Raw X'X: SSCP Matrix ---\n")
--- Raw X'X: SSCP Matrix ---
print(SSCP)
         cyl      disp
cyl   1324.0   51872.4
disp 51872.4 2179627.5

5.4.2 2. Centered \(\mathbf{X}\): Variance–Covariance Matrix

Centering means subtracting the column mean from every entry in that column:

\[x_{ij}^{\text{centered}} = x_{ij} - \bar{x}_j\]

This shifts each variable so it has a mean of zero. When you compute \(\mathbf{X}_c^\top \mathbf{X}_c\) on the centered matrix and divide by \(n - 1\), the diagonal entries become variances and the off-diagonal entries become covariances — identical to cov().

\[\frac{1}{n-1}\,\mathbf{X}_c^\top \mathbf{X}_c = \mathbf{\Sigma} \quad \text{(Variance–Covariance Matrix)}\]

# Center each column by subtracting its mean
X_centered <- scale(X_raw, center = TRUE, scale = FALSE)
cat("Column means of X_centered (should be ~0):\n")
Column means of X_centered (should be ~0):
print(colMeans(X_centered))
          cyl          disp 
 0.000000e+00 -3.907985e-14 
# X_c'X_c / (n-1) = Variance-Covariance matrix
n <- nrow(X_raw)
VarCov <- (t(X_centered) %*% X_centered) / (n - 1)

cat("\n--- Centered X'X / (n-1): Variance-Covariance Matrix ---\n")

--- Centered X'X / (n-1): Variance-Covariance Matrix ---
print(VarCov)
            cyl       disp
cyl    3.189516   199.6603
disp 199.660282 15360.7998
# Confirm with cov()
cat("\n--- Confirm with cov() ---\n")

--- Confirm with cov() ---
print(cov(X_raw))
            cyl       disp
cyl    3.189516   199.6603
disp 199.660282 15360.7998

5.4.3 3. Standardized \(\mathbf{X}\): Pearson Correlation Matrix

Standardizing means centering AND dividing by the standard deviation:

\[x_{ij}^{\text{std}} = \frac{x_{ij} - \bar{x}_j}{s_j}\]

This gives each variable a mean of 0 and a standard deviation of 1. Dividing \(\mathbf{X}_s^\top \mathbf{X}_s\) by \(n - 1\) yields the Pearson correlation matrix — 1s on the diagonal, correlations off the diagonal.

\[\frac{1}{n-1}\,\mathbf{X}_s^\top \mathbf{X}_s = \mathbf{R} \quad \text{(Correlation Matrix)}\]

X_std <- scale(X_raw, center = TRUE, scale = TRUE)

CorrMat <- (t(X_std) %*% X_std) / (n - 1)

cat("--- Standardized X'X / (n-1): Pearson Correlation Matrix ---\n")
--- Standardized X'X / (n-1): Pearson Correlation Matrix ---
print(CorrMat)
           cyl      disp
cyl  1.0000000 0.9020329
disp 0.9020329 1.0000000
# Confirm with cor()
cat("\n--- Confirm with cor() ---\n")

--- Confirm with cor() ---
print(cor(X_raw))
           cyl      disp
cyl  1.0000000 0.9020329
disp 0.9020329 1.0000000

5.4.4 4. Unit-scaled \(\mathbf{X}\): Cosine Similarity Matrix

Unit-scaling (L2 normalization) divides each column by its Euclidean norm (length):

\[x_{ij}^{\text{unit}} = \frac{x_{ij}}{\|\mathbf{x}_j\|} \quad \text{where} \quad \|\mathbf{x}_j\| = \sqrt{\sum_i x_{ij}^2}\]

\(\mathbf{X}_u^\top \mathbf{X}_u\) (no division by \(n-1\)) gives the cosine similarity matrix — 1s on the diagonal, cosine similarities off the diagonal.

\[\mathbf{X}_u^\top \mathbf{X}_u = \text{Cosine Similarity Matrix}\]

col_norms <- sqrt(colSums(X_raw^2))
X_unit    <- sweep(X_raw, 2, col_norms, FUN = "/")

CosineMat <- t(X_unit) %*% X_unit

cat("--- Unit-scaled X'X: Cosine Similarity Matrix ---\n")
--- Unit-scaled X'X: Cosine Similarity Matrix ---
print(CosineMat)
           cyl      disp
cyl  1.0000000 0.9656088
disp 0.9656088 1.0000000

5.5 What is \((\mathbf{X}^\top\mathbf{X})^{-1}\)?

\((\mathbf{X}^\top\mathbf{X})^{-1}\) is the matrix inverse of the SSCP matrix. It plays the role of “dividing” in matrix algebra and is central to the OLS solution:

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

XtX_inv <- solve(XtX)
XtX_inv
             [,1]          [,2]          [,3]
[1,]  0.694871232 -0.1730656141  1.764992e-03
[2,] -0.173065614  0.0542769093 -7.054934e-04
[3,]  0.001764992 -0.0007054934  1.127006e-05

5.5.1 How is \((\mathbf{X}^\top\mathbf{X})^{-1}\) Centered?

In practice, OLS is often computed on the centered design matrix \(\mathbf{X}_c\) (predictors mean-centered, intercept column dropped) because centering:

  1. Removes the intercept from the coefficient estimation problem, making \((\mathbf{X}_c^\top \mathbf{X}_c)^{-1}\) easier to interpret
  2. Improves numerical stability — raw predictors with large magnitudes can produce ill-conditioned \(\mathbf{X}^\top\mathbf{X}\) matrices, making inversion unreliable
  3. Makes \(\mathbf{X}_c^\top \mathbf{X}_c \propto \boldsymbol{\Sigma}\) (the variance–covariance matrix), so its inverse \((\mathbf{X}_c^\top \mathbf{X}_c)^{-1} \propto \boldsymbol{\Sigma}^{-1}\) — a quantity that appears throughout multivariate statistics

The relationship is:

\[(\mathbf{X}_c^\top \mathbf{X}_c)^{-1} = \left[(n-1)\,\boldsymbol{\Sigma}\right]^{-1} = \frac{1}{n-1}\,\boldsymbol{\Sigma}^{-1}\]

# Centered X (predictors only, no intercept column)
X_c   <- scale(as.matrix(cbind(df$cyl, df$disp)), center = TRUE, scale = FALSE)
colnames(X_c) <- c("cyl", "disp")

XtX_c     <- t(X_c) %*% X_c
XtX_c_inv <- solve(XtX_c)

cat("--- Centered X'X ---\n")
--- Centered X'X ---
print(XtX_c)
          cyl       disp
cyl    98.875   6189.469
disp 6189.469 476184.795
cat("\n--- (Centered X'X)^{-1} ---\n")

--- (Centered X'X)^{-1} ---
print(XtX_c_inv)
               cyl          disp
cyl   0.0542769093 -7.054934e-04
disp -0.0007054934  1.127006e-05
# Verify: (n-1) * Sigma = XtX_c, so XtX_c_inv = (1/(n-1)) * Sigma_inv
Sigma     <- cov(cbind(df$cyl, df$disp))
Sigma_inv <- solve(Sigma)

cat("\n--- (1/(n-1)) * Sigma^{-1}: should match (Centered X'X)^{-1} ---\n")

--- (1/(n-1)) * Sigma^{-1}: should match (Centered X'X)^{-1} ---
print((1 / (nrow(X_c) - 1)) * Sigma_inv)
              [,1]          [,2]
[1,]  0.0542769093 -7.054934e-04
[2,] -0.0007054934  1.127006e-05

5.6 How to Get Standard Errors

The variance–covariance matrix of the coefficient estimates is:

\[\text{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 \,(\mathbf{X}^\top\mathbf{X})^{-1}\]

where \(\hat{\sigma}^2 = \frac{\sum e_i^2}{n-(k+1)}\) is the estimated error variance (mean squared error). The standard errors are the square roots of the diagonal entries of this matrix.

\[SE(\hat{\beta}_j) = \sqrt{\left[\hat{\sigma}^2 (\mathbf{X}^\top\mathbf{X})^{-1}\right]_{jj}}\]

# Estimated error variance (sigma-hat squared)
sigma2_hat <- sum(reg1$residuals^2) / 29

# Variance-covariance matrix of beta-hat
vcov_beta <- XtX_inv * sigma2_hat
vcov_beta
            [,1]         [,2]          [,3]
[1,]  6.48722874 -1.615718386  0.0164777387
[2,] -1.61571839  0.506722267 -0.0065863960
[3,]  0.01647774 -0.006586396  0.0001052158
# Standard errors = sqrt of diagonal entries
se_betas <- sqrt(diag(vcov_beta))
se_betas
[1] 2.54700388 0.71184427 0.01025748
# Confirm against summary output
summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09
# Manual checks
sqrt(6.48722874)    # SE for intercept
[1] 2.547004
sqrt(0.506722267)   # SE for cyl
[1] 0.7118443
sqrt(0.0001052158)  # SE for disp
[1] 0.01025748

The values match the standard errors reported by summary(reg1), confirming the matrix derivation.