Multiple Regression: Reconciling manual computations with the lm() function in R

PSYC 4111 Fall 2024

Chris Cox

R Environment

library(dplyr)
library(purrr)
library(tidyr)
options(digits = 3)
knitr::opts_chunk$set(echo = TRUE)
set.seed(411124)

standardize <- function(x) {
  return((x - mean(x)) / sd(x))
}

sumsquares <- function(x) {
  return(sum((x - mean(x))^2))
}

Generate correlated variables

First, we will generate three variables: a dependent variable and two predictor variables. All three variables are correlated, and all three are standardized.

The standardize() function is defined on the R Environment slide.

shared_variance <- rnorm(100)
d <- tibble(
  x1 = -shared_variance   + rnorm(100),
  x2 = +shared_variance   + rnorm(100),
  y  = (-.5*x1) + (.5*x2) + rnorm(100)
) |>
  mutate(across(everything(), standardize))
r_full <- c(
  yx1  = cor(d$y , d$x1),
  yx2  = cor(d$y , d$x2),
  x1x2 = cor(d$x1, d$x2)
)
print(r_full)
   yx1    yx2   x1x2 
-0.592  0.638 -0.441 

Note on variable generation

I wanted \(\beta_1\) and \(\beta_2\) to have opposing signs in this example. This is why I flip the sign of the shared variance when generating \(x_1\) and then weight \(x_1\) by \(-0.5\) when generating \(y\). This just makes the example a little more interesting.

What is the goal?

For the linear model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\), we want to do the following:

  1. Summarize the unique variance in \(y\) explained by \(x_1\) and \(x_2\), respectively.
  2. Determine if these quantities are significantly different from zero.

Of course, this assumes that the optimal values for \(\beta_0\), \(\beta_1\), and \(\beta_2\) can be found. The general case of finding these values is beyond the scope of this example or this course, but in this simple case of two standardized predictor variables, we can solve for them. This is done here primarily to demonstrate that there is an important relationship between the semi-partial correlation and a regression coefficient (\(\beta\)).

Semi-partial correlation

The semi-partial correlation summarizes the relationship between \(y\) and the variance in a single predictor variable \(x_k\), after removing the variance from \(x_k\) that can be explained by whatever other predictor variables are in the model. Remember, our model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\).

We can obtain one semi-partial correlation between \(x_1\) and \(y\), and a second between \(x_2\) and \(y\).

Compute semi-partial correlations

Notice that the semi-partial correlations are smaller in magnitude than the full correlations. Grab pen and paper and write down an explanation for why this makes sense before checking yourself.

Reveal answer … The predictor variables \(x_1\) and \(x_2\) share some variance with each other, and this variance also exists in \(y\). Once this shared variance is removed from \(x_1\) and \(x_2\), the variance that remains in each is less correlated with \(y\).
mreg <- list(
  x1 = lm(x1 ~ x2, data = d),
  x2 = lm(x2 ~ x1, data = d)
)
r_semi <- c(
  x1 = cor(d$y, resid(mreg$x1)),
  x2 = cor(d$y, resid(mreg$x2))
)
rbind(semi = r_semi, full = r_full[1:2])
         x1    x2
semi -0.345 0.421
full -0.592 0.638

Estimate standardized \(\beta\) coefficients

Both equations below apply ONLY to linear models with two predictor variables; the \(\beta_{x_1}\) produced by the second equation is a standardized coefficient.

\[ r^{\textrm{semi-partial}}_{x_1} = \frac{\textrm{cor}(y, x_1) - \left(\textrm{cor}\left(y, x_2\right)\times\textrm{cor}\left(x_1, x_2\right)\right)}{\sqrt{1 - \textrm{cor}(x_1, x_2)^2}} \] \[ \beta_{x_1} = \frac{\textrm{cor}(y, x_1) - \left(\textrm{cor}\left(y, x_2\right)\times\textrm{cor}\left(x_1, x_2\right)\right)}{1 - \textrm{cor}(x_1, x_2)^2} \]

num <- c(
  x1 = (r_full["yx1"] - (r_full["yx2"] * r_full["x1x2"])) |> unname(),
  x2 = (r_full["yx2"] - (r_full["yx1"] * r_full["x1x2"])) |> unname()
)
den <- (1 - r_full["x1x2"]^2) |> unname()
betas <- c(num["x1"] / den, num["x2"] / den)
x <- rbind(
  x1 = c(num["x1"] / sqrt(den), r_semi["x1"], betas["x1"]),
  x2 = c(num["x2"] / sqrt(den), r_semi["x2"], betas["x2"])
)
colnames(x) <- c("semi (eq)", "semi (lm)", "beta (eq)")

Check semi-partial correlations and \(\beta_{x_1}\) and \(\beta_{x_2}\)

Our estimates for the beta coefficients, based on an equation very similar to the one for the semi-partial correlation, correspond exactly with those obtained by lm(y ~ x1 + x2, data = d).

m <- lm(y ~ x1 + x2, data = d)
print(cbind(x, "beta (lm)" = coef(m)[c("x1", "x2")]))
   semi (eq) semi (lm) beta (eq) beta (lm)
x1    -0.345    -0.345    -0.385    -0.385
x2     0.421     0.421     0.469     0.469

Important

The coefficients from the equation and from lm() only correspond because \(y\), \(x_1\), and \(x_2\) are standardized. This means lm() is returning standardized \(\beta\), which is what the equation will produce.
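If the variables had not been standardized, we could still recover the standardized \(\beta\) by rescaling the unstandardized coefficients by \(s_x / s_y\). Here is a minimal sketch; the means and standard deviations used to "unstandardize" the data are arbitrary and chosen only for illustration.

d_raw <- d |>
  mutate(
    x1 = 10 + 2 * x1,   # arbitrary mean and SD, for illustration only
    x2 = 50 + 5 * x2,
    y  = 100 + 15 * y
  )
m_raw <- lm(y ~ x1 + x2, data = d_raw)
# Rescaling each unstandardized coefficient by s_x / s_y recovers the standardized betas
coef(m_raw)[c("x1", "x2")] * c(sd(d_raw$x1), sd(d_raw$x2)) / sd(d_raw$y)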

Explore individual regressors

Do \(x_1\) and \(x_2\) make significant unique contributions to reducing the sum of squared errors?

  • We can assess this by testing the null hypothesis for each regression coefficient (that it is sampled from a distribution with mean equal to zero).

  • This will involve a t-test for each \(\beta\).

  • Thus, we need to find the standard error for each.

Standard error of a regression coefficient \(\beta_k\)

Note

\(N\) is the number of observations, \(K\) is the number of predictor variables, and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\).

\[ \textrm{se}_{\beta_k} = \sqrt{\frac{1-R^2}{(1-R^2_k) \times df_{residual}}} \times \frac{s_y}{s_x} \quad \textrm{where} \quad df_{residual}=N-K-1 \]

Important

\(\beta_0\) is a special case, and the above equation will not work for estimating its standard error. Since \(\beta_0\) rarely corresponds to a hypothesis of interest, we will just bypass that. Its formulation is more complicated.

Variance Inflation Factor

The following is the same equation, squared and slightly reorganized: \[ \textrm{se}_{\beta_k}^2 = \frac{1-R^2}{df_\textrm{residual}} \times \frac{1}{1-R^2_k} \times \left(\frac{s_y}{s_x}\right)^2 \] Notice that it has four major components:

  1. The residual variance of the full model (\(1 - R^2\)).
  2. The number of observations minus the number of model parameters (\(df_{\textrm{residual}}\)).
  3. The residual variance in predictor variable \(k\) after regressing it on all other predictor variables (\(1-R^2_k\)).
  4. A scaling term, \((s_y / s_x)^2\), which converts the result to the (squared) units of \(y\) per unit of \(x\); it equals 1 when the variables are standardized.

Notice that component (3) contributes to the overall equation by inflating the variance for the estimate of \(\beta_k\) proportional to how correlated \(x_k\) is with the other predictor variables. Thus, \(1/(1-R^2_k)\) is the variance inflation factor (VIF) for \(\beta_k\). Its reciprocal, \(1-R^2_k\), is called the tolerance.
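As a quick sketch, the tolerance and VIF for each predictor can be computed directly from the auxiliary regressions stored in mreg on the earlier slide (with only two predictors, both VIFs are identical).

R2_k <- c(
  x1 = summary(mreg$x1)$r.squared, # x1 regressed on x2
  x2 = summary(mreg$x2)$r.squared  # x2 regressed on x1
)
rbind(tolerance = 1 - R2_k, vif = 1 / (1 - R2_k))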

Compute standard errors

N <- 100 # number of observations
K <- 2 # number of regressors (i.e., predictor variables)
ss_total <- sumsquares(d$y)
ss_explained <- sumsquares(fitted(m))
R2 <- ss_explained / ss_total
se <- numeric(2)
for (k in 1:2) {
  # Auxiliary regression: predictor k regressed on the other predictor
  mk <- if (k == 1) {
    lm(x1 ~ x2, data = d)
  } else {
    lm(x2 ~ x1, data = d)
  }
  # Because every variable is standardized, the total sum of squares is the
  # same (N - 1) for y, x1, and x2, so ss_total can serve as the denominator.
  R2k <- sumsquares(fitted(mk)) / ss_total
  # The s_y / s_x term from the equation is omitted because it equals 1
  # for standardized variables.
  se[k] <- sqrt((1 - R2) / ((1-R2k) * (N - K - 1)))
}
names(se) <- c("x1", "x2")
print(se)
    x1     x2 
0.0778 0.0778 

Compare standard errors with lm()

print(se)
    x1     x2 
0.0778 0.0778 
summary(m)

Call:
lm(formula = y ~ x1 + x2, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8637 -0.4260 -0.0046  0.5224  1.7468 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.95e-17   6.95e-02    0.00        1    
x1          -3.85e-01   7.78e-02   -4.94  3.2e-06 ***
x2           4.69e-01   7.78e-02    6.02  3.1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.695 on 97 degrees of freedom
Multiple R-squared:  0.527, Adjusted R-squared:  0.517 
F-statistic:   54 on 2 and 97 DF,  p-value: <2e-16

Perform t-tests of \(\beta\) coefficients

Since the \(\beta\) coefficients and standard errors we computed ourselves match what lm() returns, it follows that we will obtain the same t-values:

print(cbind(
  beta = betas,
  tval = betas / se,
  pval = pt(abs(betas / se), df = N-K-1, lower.tail = FALSE) * 2
))
     beta  tval     pval
x1 -0.385 -4.94 3.23e-06
x2  0.469  6.02 3.09e-08
# Print only the coefficient-table lines from summary(m) for comparison
walk(10:15, \(i, out) {cat(out[i], "\n")}, out = capture.output(summary(m)))
             Estimate Std. Error t value Pr(>|t|)     
(Intercept)  3.95e-17   6.95e-02    0.00        1     
x1          -3.85e-01   7.78e-02   -4.94  3.2e-06 *** 
x2           4.69e-01   7.78e-02    6.02  3.1e-08 *** 
--- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Standard error of a semi-partial correlation

Note

\(N\) is the number of observations and \(K\) is the number of predictor variables.

\[ \textrm{se}^{\textrm{semi-partial}}_{x_k} = \sqrt{\frac{1-R^2}{df_{residual}}} \quad \textrm{where} \quad df_{residual} = N-K-1 \]

Perform t-tests of semi-partial correlations

Notice that the t-values for the semi-partial correlations and the \(\beta\) coefficients are exactly the same!

df_residual <- N-K-1
se <- sqrt((1 - R2) / df_residual)
print(cbind(
  r_semi = r_semi,
  tval = r_semi / se,
  pval = pt(abs(r_semi / se), df = df_residual, lower.tail = FALSE) * 2
))
   r_semi  tval     pval
x1 -0.345 -4.94 3.23e-06
x2  0.421  6.02 3.09e-08
# Print only the coefficient-table lines from summary(m) for comparison
walk(10:13, \(i, out) {cat(out[i], "\n")}, out = capture.output(summary(m)))
             Estimate Std. Error t value Pr(>|t|)     
(Intercept)  3.95e-17   6.95e-02    0.00        1     
x1          -3.85e-01   7.78e-02   -4.94  3.2e-06 *** 
x2           4.69e-01   7.78e-02    6.02  3.1e-08 *** 

How well does this model fit the data?

The linear model predicts \(\hat{y} =\beta_0 + \beta_1 x_1 + \beta_2 x_2\). What proportion of the total variance in \(y\) do these predictions account for?

The sumsquares() function is defined on the R Environment slide.

ss_total <- sumsquares(d$y)
ss_explained <- sumsquares(fitted(m))
ss_unexplained <- sumsquares(residuals(m))
print(c(total = ss_total, explained = ss_explained, unexplained = ss_unexplained))
      total   explained unexplained 
       99.0        52.2        46.8 
all.equal(ss_total, ss_explained + ss_unexplained)
[1] TRUE
print(c("R^2 (eq)" = ss_explained / ss_total, "R^2 (lm)" = summary(m)$r.squared))
R^2 (eq) R^2 (lm) 
   0.527    0.527 

Model degrees of freedom

This model has three regression coefficients (i.e., parameters; \(\beta_0\), \(\beta_1\), and \(\beta_2\)) and two regressors (i.e., predictor variables; \(x_1\) and \(x_2\)). There are 100 observations.

How many denominator (i.e., residual) degrees of freedom are there? Work it out for yourself before checking the answer.

Reveal answer … \(100 - 3 = 97\)

How many numerator degrees of freedom are there? Work it out for yourself before checking the answer.

Reveal answer … \(\textrm{number of parameters} - 1 = 2\)
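We can confirm both answers against the fitted model: df.residual() returns the residual degrees of freedom, and the fstatistic element of summary() stores the numerator and denominator degrees of freedom alongside the F value.

df.residual(m)        # residual (denominator) degrees of freedom: 100 - 3 = 97
summary(m)$fstatistic # F value, with numdf = 2 and dendf = 97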

The full-model F-statistic

The relevant F-statistic can be formulated several ways.

\[ F = \frac{MS_{explained}}{MS_{residual}} = \frac{SS_{explained} / df_{explained}}{SS_{residual}/df_{residual}} = \frac{SS_{explained}}{SS_{residual}} \times \frac{df_{residual}}{df_{explained}} \] After separating the sums of squares and degrees of freedom, \(SS_{explained}\) and \(SS_{residual}\) can be divided by the total sum of squares to cast the equation in terms of \(R^2\).

\[ F = \frac{R^2}{1-R^2} \times \frac{df_{residual}}{df_{explained}} = \frac{R^2}{1-R^2} \times \frac{1}{df_{explained}/df_{residual}} \] Which ultimately results in the equation presented in the lecture slides: \[ F = \frac{R^2}{MS_{residual}} \quad \textrm{where} \quad MS_{residual}=\frac{df_{explained}(1-R^2)}{df_{residual}} \]

Manual full-model F-test

Does this model, with these two predictor variables and 100 observations, explain significantly more variance in \(y\) than we would expect by chance (i.e., from predictor variables unrelated to \(y\))? (Yes, and we match the lm() solution up to rounding in the printed output.)

N <- 100 # number of observations
K <- 2 # number of regressors (i.e., predictor variables)
R2 <- ss_explained / ss_total
df_residual <- (N - K - 1)
MS_residual <- (K * (1 - R2)) / df_residual
Fval <- R2 / MS_residual
pval <- pf(Fval, df1 = K, df2 = df_residual, lower.tail = FALSE)
cat(paste("F-statistic:", round(Fval, 3), "on", K, "and", df_residual, "DF,", "p-value:", sprintf("%.2g", pval)))
F-statistic: 53.992 on 2 and 97 DF, p-value: 1.7e-16
x <- capture.output(summary(m))
cat(x[[grep("F-stat", x)]])
F-statistic:   54 on 2 and 97 DF,  p-value: <2e-16

Full-model F-test as a comparison of two models

The F-test conducted on the previous slide can be thought of as a comparison of two models:

  1. \(y = \beta_0 + \epsilon\)
  2. \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)

The residual sum of squares for model (1) is the total sum of squares, and so \(R^2_{model_1} = 0\). Thus, in this case: \[ R^2_{model_2} = R^2_{model_2} - R^2_{model_1} = \Delta R^2_{model_2 - model_1} \] where \(\Delta\) means “change” or “difference”.
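As a sanity check (not part of the manual computations above), R's anova() function performs exactly this comparison of nested models and should reproduce the F-statistic from the previous slide.

m0 <- lm(y ~ 1, data = d) # model (1): intercept only
anova(m0, m)              # model (2) is the full model fit earlier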

Full-model comparisons as tests of \(\Delta R^2\)

The null hypothesis is that \(\Delta R^2 = 0\) between two models.

The “restricted” model includes a subset of the predictor variables from the “full” model.

\[ D = K_{\textrm{full}} - K_{\textrm{restricted}} \]

\[ MS_{\Delta R^2} = \frac{D \times (1 - R^2_{\textrm{full}})}{N - K_{\textrm{full}} - 1} \]

\[ F(D, N - K_{\textrm{full}} - 1) = \frac{R^2_{\textrm{full}} - R^2_{\textrm{restricted}}}{MS_{\Delta R^2}} = \frac{\Delta R^2}{MS_{\Delta R^2}} \]
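Here is a minimal sketch of this \(\Delta R^2\) test, using y ~ x1 as a hypothetical restricted model; anova() on the two nested models should agree with the manual result.

m_restricted <- lm(y ~ x1, data = d)
R2_full <- summary(m)$r.squared
R2_restricted <- summary(m_restricted)$r.squared
D <- K - 1                                    # K_full = 2, K_restricted = 1
MS_delta <- (D * (1 - R2_full)) / (N - K - 1) # denominator df uses the full model
Fval <- (R2_full - R2_restricted) / MS_delta
pval <- pf(Fval, df1 = D, df2 = N - K - 1, lower.tail = FALSE)
c(F = Fval, p = pval)
anova(m_restricted, m) # should match the F and p computed above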