The lm() function in R (PSYC 4111, Fall 2024)
First, we will generate three variables: a dependent variable and two predictor variables. All three variables are correlated, and all three are standardized.
The standardize() function is defined on the R Environment slide.
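That slide also defines the sumsquares() helper used later and loads the packages the code relies on. A minimal sketch of what it presumably contains (these exact definitions are my assumption) is:
library(tibble)  # tibble()
library(dplyr)   # mutate(), across()

# z-score: center at the mean and scale to unit standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

# sum of squared deviations from the mean
sumsquares <- function(x) sum((x - mean(x))^2)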
# shared_variance induces the correlation among x1, x2, and y
shared_variance <- rnorm(100)
d <- tibble(
  x1 = -shared_variance   + rnorm(100),
  x2 = +shared_variance   + rnorm(100),
  y  = (-.5*x1) + (.5*x2) + rnorm(100)
) |>
  mutate(across(everything(), standardize))
r_full <- c(
  yx1  = cor(d$y , d$x1),
  yx2  = cor(d$y , d$x2),
  x1x2 = cor(d$x1, d$x2)
)
print(r_full)
   yx1    yx2   x1x2 
-0.592  0.638 -0.441 
I wanted \(\beta_1\) and \(\beta_2\) to have opposing signs in this example, so I sign-flip the shared variance when generating \(x_1\) and then multiply \(x_1\) by \(-0.5\) when generating \(y\). This just makes the example a little more interesting.
For the linear model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\), we want to find the values of \(\beta_0\), \(\beta_1\), and \(\beta_2\) that minimize the sum of squared errors.
Of course, this assumes that the optimal values for \(\beta_0\), \(\beta_1\), and \(\beta_2\) can be found. The general case of finding these values is beyond the scope of this example or this course, but in this simple case of two standardized predictor variables, we can solve for them. This is done here primarily to demonstrate that there is an important relationship between the semi-partial correlation and a regression coefficient (\(\beta\)).
The semi-partial correlation summarizes the relationship between \(y\) and the variance in a single predictor variable \(x_k\), after removing the variance from \(x_k\) that can be explained by whatever other predictor variables are in the model. Remember, our model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\).
We can obtain one semi-partial correlation between \(x_1\) and \(y\), and a second between \(x_2\) and \(y\).
Notice that the semi-partial correlations are smaller than the full correlations. Grab pen and paper and write down an explanation for why this makes sense before checking yourself.
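The r_semi vector used in the comparison below comes from a slide not shown here. A minimal sketch of how it could be obtained with lm() (my assumption, consistent with the "semi (lm)" column label) is to residualize each predictor on the other and then correlate those residuals with \(y\):
# semi-partial correlation of x_k: correlate y with the part of x_k
# not explained by the other predictor
r_semi <- c(
  x1 = cor(d$y, resid(lm(x1 ~ x2, data = d))),
  x2 = cor(d$y, resid(lm(x2 ~ x1, data = d)))
)
# these values (-0.345 and 0.421) appear in the comparison table below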
Both of the following equations apply ONLY to linear models with two predictor variables, and \(\beta_{x_1}\) is the standardized coefficient.
\[ r^{\textrm{semi-partial}}_{x_1} = \frac{\textrm{cor}(y, x_1) - \left(\textrm{cor}\left(y, x_2\right)\times\textrm{cor}\left(x_1, x_2\right)\right)}{\sqrt{1 - \textrm{cor}(x_1, x_2)^2}} \] \[ \beta_{x_1} = \frac{\textrm{cor}(y, x_1) - \left(\textrm{cor}\left(y, x_2\right)\times\textrm{cor}\left(x_1, x_2\right)\right)}{1 - \textrm{cor}(x_1, x_2)^2} \]
num <- c(
  x1 = (r_full["yx1"] - (r_full["yx2"] * r_full["x1x2"])) |> unname(),
  x2 = (r_full["yx2"] - (r_full["yx1"] * r_full["x1x2"])) |> unname()
)
den <- (1 - r_full["x1x2"]^2) |> unname()
betas <- c(num["x1"] / den, num["x2"] / den)
m <- lm(y ~ x1 + x2, data = d)
x <- rbind(
  x1 = c(num["x1"] / sqrt(den), r_semi["x1"], betas["x1"], coef(m)["x1"]),
  x2 = c(num["x2"] / sqrt(den), r_semi["x2"], betas["x2"], coef(m)["x2"])
)
colnames(x) <- c("semi (eq)", "semi (lm)", "beta (eq)", "beta (lm)")
Our estimates for the beta coefficients, based on an equation very similar to that for the semi-partial correlation, correspond exactly with those obtained by lm(y ~ x1 + x2, data = d).
   semi (eq) semi (lm) beta (eq) beta (lm)
x1    -0.345    -0.345    -0.385    -0.385
x2     0.421     0.421     0.469     0.469
Important
The coefficients from the equation and from lm() only correspond because \(y\), \(x_1\), and \(x_2\) are standardized. This means lm() is returning standardized \(\beta\), which is what the equation will produce.
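If the variables had not been standardized, lm() would return unstandardized slopes; each can be converted to a standardized \(\beta\) by multiplying by \(s_x / s_y\). A hypothetical sketch (the rescaled data frame d_raw and its means and SDs are invented purely for illustration):
# rescale the standardized data to arbitrary, made-up means and SDs
d_raw <- transform(as.data.frame(d),
  x1 = 10 + 2 * x1,
  x2 = 50 + 5 * x2,
  y  = 100 + 15 * y
)
b_raw <- coef(lm(y ~ x1 + x2, data = d_raw))[c("x1", "x2")]
# unstandardized slope times (s_x / s_y) recovers the standardized beta
b_raw * c(sd(d_raw$x1), sd(d_raw$x2)) / sd(d_raw$y)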
Do \(x_1\) and \(x_2\) make significant unique contributions to reducing the sum of squared errors?
We can assess this by testing the null hypothesis for each regression coefficient (that it is sampled from a distribution with mean equal to zero).
This will involve a t-test for each \(\beta\).
Thus, we need to find the standard error for each.
Note
\(N\) is the number of observations, \(K\) is the number of predictor variables, and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\).
\[ \textrm{se}_{\beta_k} = \sqrt{\frac{1-R^2}{(1-R^2_k) \times df_{residual}}} \times \frac{s_y}{s_x} \quad \textrm{where} \quad df_{residual}=N-K-1 \]
Important
\(\beta_0\) is a special case, and the above equation will not work for estimating its standard error; its formulation is more complicated. Since \(\beta_0\) rarely corresponds to a hypothesis of interest, we will simply bypass it here.
The following is the same equation with minor reorganization: \[ \textrm{se}_{\beta_k}^2 = \frac{1-R^2}{df_\textrm{residual}} \times \frac{1}{1-R^2_k} \times \left(\frac{s_y}{s_x}\right)^2 \] Notice that it has four major components: (1) the unexplained variance in \(y\), \(1-R^2\); (2) the residual degrees of freedom, \(df_\textrm{residual}\); (3) \(1/(1-R^2_k)\), where \(R^2_k\) is the proportion of variance in \(x_k\) explained by the other predictor variables; and (4) the ratio of standard deviations, \(s_y/s_x\) (squared here).
Notice that component (3) contributes to the overall equation by inflating the variance for the estimate of \(\beta_k\) proportional to how correlated \(x_k\) is with the other predictor variables. Thus, \(1/(1-R^2_k)\) is the variance inflation factor (VIF) for \(\beta_k\). Its reciprocal, \(1-R^2_k\), is called the tolerance.
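As a small illustration (the variable names here are mine), the VIF and tolerance for \(x_1\) can be computed directly; the vif() function in the car package returns the same quantity from a fitted model:
# R^2 for x1 predicted from the other predictor(s)
r2_x1 <- summary(lm(x1 ~ x2, data = d))$r.squared

vif_x1 <- 1 / (1 - r2_x1)  # variance inflation factor for beta_x1
tol_x1 <- 1 - r2_x1        # tolerance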
N <- 100 # number of observations
K <- 2 # number of regressors (i.e., predictor variables)
ss_total <- sumsquares(d$y)
ss_explained <- sumsquares(fitted(m))
R2 <- ss_explained / ss_total
se <- numeric(2)
for (k in 1:2) {
  # regress each predictor on the other predictor
  mk <- if (k == 1) {
    lm(x1 ~ x2, data = d)
  } else {
    lm(x2 ~ x1, data = d)
  }
  # R^2_k: proportion of variance in x_k explained by the other predictor
  R2k <- sumsquares(fitted(mk)) / sumsquares(d[[paste0("x", k)]])
  # s_y / s_x is omitted because it equals 1 for standardized variables
  se[k] <- sqrt((1 - R2) / ((1 - R2k) * (N - K - 1)))
}
names(se) <- c("x1", "x2")
print(se)
    x1     x2 
0.0778 0.0778 
For comparison, the standard errors reported by lm():
    x1     x2 
0.0778 0.0778 
Here is the full summary of the fitted model:
Call:
lm(formula = y ~ x1 + x2, data = d)
Residuals:
    Min      1Q  Median      3Q     Max 
-1.8637 -0.4260 -0.0046  0.5224  1.7468 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.95e-17   6.95e-02    0.00        1    
x1          -3.85e-01   7.78e-02   -4.94  3.2e-06 ***
x2           4.69e-01   7.78e-02    6.02  3.1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.695 on 97 degrees of freedom
Multiple R-squared:  0.527, Adjusted R-squared:  0.517 
F-statistic:   54 on 2 and 97 DF,  p-value: <2e-16
Since the \(\beta\) coefficients and standard errors we computed ourselves match what lm() returns, it follows that we will obtain the same t-values:
print(cbind(
  beta = betas,
  tval = betas / se,
  pval = pt(abs(betas / se), df = N-K-1, lower.tail = FALSE) * 2
))
     beta  tval     pval
x1 -0.385 -4.94 3.23e-06
x2  0.469  6.02 3.09e-08
These match the corresponding rows of the lm() coefficient table:
             Estimate Std. Error t value Pr(>|t|)     
(Intercept)  3.95e-17   6.95e-02    0.00        1     
x1          -3.85e-01   7.78e-02   -4.94  3.2e-06 *** 
x2           4.69e-01   7.78e-02    6.02  3.1e-08 *** 
--- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Note
\(N\) is the number of observations and \(K\) is the number of predictor variables.
\[ \textrm{se}^{\textrm{semi-partial}}_{x_k} = \sqrt{\frac{1-R^2}{df_{residual}}} \quad \textrm{where} \quad df_{residual} = N-K-1 \]
Notice that the t-values for the semi-partial correlations and the \(\beta\) coefficients are exactly the same!
df_residual <- N-K-1
se <- sqrt((1 - R2) / df_residual)
print(cbind(
  r_semi = r_semi,
  tval = r_semi / se,
  pval = pt(abs(r_semi / se), df = df_residual, lower.tail = FALSE) * 2
))
   r_semi  tval     pval
x1 -0.345 -4.94 3.23e-06
x2  0.421  6.02 3.09e-08
Again, these match the t-values from the lm() coefficient table:
             Estimate Std. Error t value Pr(>|t|)     
(Intercept)  3.95e-17   6.95e-02    0.00        1     
x1          -3.85e-01   7.78e-02   -4.94  3.2e-06 *** 
x2           4.69e-01   7.78e-02    6.02  3.1e-08 *** 
The linear model predicts \(\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2\). What proportion of the total variance in \(y\) do these predictions account for?
The sumsquares() function is defined on the R Environment slide.
This model has three regression coefficients (i.e., parameters; \(\beta_0\), \(\beta_1\), and \(\beta_2\)) and two regressors (i.e., predictor variables; \(x_1\) and \(x_2\)). There are 100 observations.
How many denominator (i.e., residual) degrees of freedom are there? Work it out for yourself before checking the answer.
How many numerator degrees of freedom are there? Work it out for yourself before checking the answer.
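One way to check your answers against R itself, using the model m fitted earlier (the fstatistic element of summary() stores the numerator and denominator degrees of freedom):
# numerator df = K (number of predictors); denominator df = N - K - 1
summary(m)$fstatistic  # named vector: value, numdf, dendf
df.residual(m)         # 97 = 100 - 2 - 1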
The relevant F-statistic can be formulated several ways.
\[ F = \frac{MS_{explained}}{MS_{residual}} = \frac{SS_{explained} / df_{explained}}{SS_{residual}/df_{residual}} = \frac{SS_{explained}}{SS_{residual}} \times \frac{df_{residual}}{df_{explained}} \] After separating the sums of squares and degrees of freedom, \(SS_{explained}\) and \(SS_{residual}\) can be divided by the total sum of squares to cast the equation in terms of \(R^2\).
\[ F = \frac{R^2}{1-R^2} \times \frac{df_{residual}}{df_{explained}} = \frac{R^2}{1-R^2} \times \frac{1}{df_{explained}/df_{residual}} \] This ultimately results in the equation presented in the lecture slides: \[ F = \frac{R^2}{MS_{residual}} \quad \textrm{where} \quad MS_{residual}=\frac{df_{explained}(1-R^2)}{df_{residual}} \]
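Before computing \(F\) from \(R^2\) below, here is a quick sanity check using the sums-of-squares formulation (this reuses m, ss_explained, N, K, and the sumsquares() helper from earlier):
# F = (SS_explained / df_explained) / (SS_residual / df_residual)
ss_residual <- sumsquares(resid(m))
(ss_explained / K) / (ss_residual / (N - K - 1))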
Does this model, with these two predictor variables and 100 observations, explain more variance than a model consisting of two randomly-selected predictor variables? (Yes, and we match the lm() solution, ignoring rounding differences.)
N <- 100 # number of observations
K <- 2 # number of regressors (i.e., predictor variables)
R2 <- ss_explained / ss_total
df_residual <- (N - K - 1)
MS_residual <- (K * (1 - R2)) / df_residual
Fval <- R2 / MS_residual
pval <- pf(Fval, df1 = K, df2 = df_residual, lower.tail = FALSE)
cat(paste("F-statistic:", round(Fval, 3), "on", K, "and", df_residual, "DF,", "p-value:", sprintf("%.2g", pval)))
F-statistic: 53.992 on 2 and 97 DF, p-value: 1.7e-16
For comparison, lm() reports: F-statistic:   54 on 2 and 97 DF,  p-value: <2e-16
The F-test conducted on the previous slide can be thought of as a comparison of two models: (1) an intercept-only model with no predictor variables, and (2) the full model with \(x_1\) and \(x_2\).
The residual sum of squares for model (1) is the total sum of squares, and so \(R^2_{model_1} = 0\). Thus, in this case: \[ R^2_{model_2} = R^2_{model_2} - R^2_{model_1} = \Delta R^2_{model_2 - model_1} \] where \(\Delta\) means “change” or “difference”.
The null hypothesis is that \(\Delta R^2 = 0\) between two models.
The “restricted” model includes a subset of the predictor variables from the “full” model.
\[ D = K_{\textrm{full}} - K_{\textrm{restricted}} \]
\[ MS_{\Delta R^2} = \frac{D \times (1 - R^2_{\textrm{full}})}{N - K_{\textrm{full}} - 1} \]
\[ F(D, N - K_{\textrm{full}} - 1) = \frac{R^2_{\textrm{full}} - R^2_{\textrm{restricted}}}{MS_{\Delta R^2}} = \frac{\Delta R^2}{MS_{\Delta R^2}} \]
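As a concrete (hypothetical) illustration, here is a sketch comparing the full model against a restricted model containing only \(x_1\); the built-in anova() function performs the same comparison:
m_full <- lm(y ~ x1 + x2, data = d)  # K_full = 2
m_rest <- lm(y ~ x1, data = d)       # K_restricted = 1
R2_full <- summary(m_full)$r.squared
R2_rest <- summary(m_rest)$r.squared
D <- 2 - 1                           # K_full - K_restricted
df_res <- nrow(d) - 2 - 1            # N - K_full - 1
MS_dR2 <- (D * (1 - R2_full)) / df_res
Fval <- (R2_full - R2_rest) / MS_dR2
pval <- pf(Fval, df1 = D, df2 = df_res, lower.tail = FALSE)
anova(m_rest, m_full)                # same F-value and p-value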