Regression Models

This note is a reorganization of Dr. Brian Caffo's lecture notes for the Coursera course Regression Models.

Module II : Multivariable Regression

Introduction

An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.

They want to use the enormous amount of data contained in claims to predict a single number. Simple linear regression (SLR) is not equipped to handle more than one predictor, so a multivariable regression analysis is needed to answer such questions.


Multiple Linear Model

The general linear model extends SLR by adding terms linearly into the model as \[ Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_{p} X_{pi} + \epsilon_{i} = \sum_{k=1}^p \beta_k X_{ki} + \epsilon_{i} \] When \( X_{1i}=1 \), an intercept is included. Least squares (and hence ML estimates under iid Gaussianity of the errors) minimizes \[ \sum_{i=1}^n \left(Y_i - \sum_{k=1}^p \beta_k X_{ki}\right)^2 \] We note that the important linearity is linearity in the coefficients. Thus \( Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots + \beta_{p} X_{pi}^2 + \epsilon_{i} \) is still a linear model. (We've just squared the elements of the predictor variables.)
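For instance, here is a minimal sketch (simulated data, variable names made up for illustration) showing that a model with squared regressors is still fit by ordinary least squares via lm:

n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1^2 - 3 * x2^2 + rnorm(n)
# I() protects the squaring inside the formula; the model is still linear in the betas
fit <- lm(y ~ I(x1^2) + I(x2^2))
coef(fit)  # estimates should be close to 1, 2 and -3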

SLR with Two Variables

Recall that the least squares (LS) estimate for regression through the origin in Module I, \( E[Y_i] = \beta_1 X_{1i} \), was \( \hat \beta_1 = \sum X_{1i} Y_i / \sum X_{1i}^2 \).

Let's consider two regressors, \( E[Y_i] = X_{1i}\beta_1 + X_{2i}\beta_2 = \mu_i \). Also recall that if \( \hat \mu_i \) satisfies \[ \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0 \] for all possible values of \( \mu_i \), then we've found the LS estimates.

\[ \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = \sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) \left\{X_{1i}(\hat \beta_1 - \beta_1) + X_{2i}(\hat \beta_2 - \beta_2) \right\} \] Thus we need \[ \sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{1i} = 0 ~~\mbox{ and }~~ \sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{2i} = 0 \]

Holding \( \hat \beta_1 \) fixed in the second equation, we get \[ \hat \beta_2 = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1)X_{2i}}{\sum_{i=1}^n X_{2i}^2} \] Substituting \( \hat \beta_2 \) into the first equation then gives \[ 0 = \sum_{i=1}^n \left\{Y_i - \frac{\sum_j X_{2j}Y_j}{\sum_j X_{2j}^2}X_{2i} - \hat \beta_1 \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)\right\} X_{1i} = \sum_{i=1}^n \left\{ e_{i, Y | X_2} - \hat \beta_1 e_{i, X_1 | X_2} \right\} X_{1i} \] where \( e_{i, a | b} := a_i - \frac{\sum_{j=1}^n a_j b_j }{\sum_{j=1}^n b_j^2} b_i \) is the residual from regressing \( a \) on \( b \) without an intercept. Solving for \( \hat \beta_1 \) gives \[ \hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2} X_{1i}} \]

But note that \[ \sum_{i=1}^n e_{i, X_1 | X_2}^2 = \sum_{i=1}^n e_{i, X_1 | X_2} \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right) = \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i} = \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i} \] since \( \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i} = 0 \). So finally we have \[ \hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2} \]

Discussion.

Example. Consider \( Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i} \) where \( X_{2i} = 1 \) is an intercept term. Then \( \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2}X_{2i} = \frac{\sum_j X_{1j}}{n} = \bar X_1 \) and \( e_{i, X_1 | X_2} = X_{1i} - \bar X_1 \). Similarly \( e_{i, Y | X_2} = Y_i - \bar Y \). Thus \[ \hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2} = \frac{\sum_{i=1}^n (X_{1i} - \bar X_1)(Y_i - \bar Y)}{\sum_{i=1}^n (X_{1i} - \bar X_1)^2} = Cor(X_1, Y) \frac{Sd(Y)}{Sd(X_1)} \]
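As a quick numerical check of this identity (a sketch with simulated data, not from the lecture):

n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
c(coef(lm(y ~ x))[2], cor(x, y) * sd(y)/sd(x))  # the slope equals Cor(X, Y) Sd(Y)/Sd(X)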

The General Case

Similarly, for the general case, we need to solve the equations \[ \sum_{i=1}^n (Y_i -\beta_1 X_{1i} - \beta_2 X_{2i} - \ldots - \beta_{p} X_{pi}) X_{ki} = 0 \] for \( k = 1, \ldots, p \), which yields \( p \) equations with \( p \) unknowns.
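In matrix form these are the normal equations \( X^T X \hat \beta = X^T Y \). A minimal sketch (simulated data) solving them directly and comparing with lm:

n <- 100
p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, 2, 3)) + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)                    # solve the p equations in p unknowns
cbind(normal_eq = drop(beta_hat), lm = coef(lm(y ~ X - 1)))  # the two columns should agree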

Analysis.

Just so I don't leave you hanging, let's show a way to get the estimates. Recall the equations: \[ \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{pi}\hat \beta_p) X_{ki} = 0 \] Holding \( \hat \beta_1, \ldots, \hat \beta_{p-1} \) fixed, we get \[ \hat \beta_p = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{p-1,i}\hat \beta_{p-1}) X_{pi} }{\sum_{i=1}^n X_{pi}^2} \] Substituting this back into the remaining equations, we wind up with \[ \sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_{ki} = 0 \] Note that \[ X_{ki} = e_{i,X_k|X_p} + \frac{\sum_{j=1}^n X_{kj} X_{pj}}{\sum_{j=1}^n X_{pj}^2} X_{pi} \] and \( \sum_{i=1}^n e_{i,X_k | X_p} X_{pi} = 0 \). Thus \[ \sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_{ki} = \sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) e_{i,X_k|X_p} = 0 \]

Summary

We have reduced \( p \) LS equations and \( p \) unknowns to \( p-1 \) LS equations and \( p-1 \) unknowns, where every variable has been replaced by its residual after regression against \( X_p \). This process can be iterated until only one equation and one unknown remain, i.e. regression through the origin, which we already know how to solve.

Example

n <- 100
x <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- x + x2 + x3 + rnorm(n, sd = 0.1)
e <- function(a, b) a - sum(a * b)/sum(b^2) * b  # residual from regressing a on b through the origin
ey <- e(e(y, x2), e(x3, x2))
ex <- e(e(x, x2), e(x3, x2))
sum(ey * ex)/sum(ex^2)
## [1] 1.003
summary(lm(y ~ x + x2 + x3 - 1))$coefficients  # The -1 removes the intercept term
##    Estimate Std. Error t value   Pr(>|t|)
## x    1.0034   0.009649   104.0 2.701e-101
## x2   1.0087   0.009992   100.9 4.679e-100
## x3   0.9859   0.009727   101.4 3.183e-100

Let's show that order doesn't matter

ey <- e(e(y, x3), e(x2, x3))
ex <- e(e(x, x3), e(x2, x3))
sum(ey * ex)/sum(ex^2)
## [1] 1.003
coef(lm(y ~ x + x2 + x3 - 1))  # The -1 removes the intercept term
##      x     x2     x3 
## 1.0034 1.0087 0.9859

We can also get the same residuals directly from lm

ey <- resid(lm(y ~ x2 + x3 - 1))
ex <- resid(lm(x ~ x2 + x3 - 1))
sum(ey * ex)/sum(ex^2)
## [1] 1.003
coef(lm(y ~ x + x2 + x3 - 1))  #the -1 removes the intercept term
##      x     x2     x3 
## 1.0034 1.0087 0.9859

We can also interpret the coefficients. Since \[ E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k \] we have \[ E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p] \] \[ = (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k}\beta_k - \sum_{k=1}^p x_{k} \beta_k = \beta_1 \] So the interpretation of a multivariable regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
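A quick check of this interpretation (a sketch with simulated data): increasing one regressor by one unit while holding the others fixed changes the fitted mean by exactly that regressor's coefficient.

n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
d0 <- data.frame(x1 = 0.5, x2 = 1)
d1 <- data.frame(x1 = 1.5, x2 = 1)                       # x1 increased by one unit, x2 held fixed
c(predict(fit, d1) - predict(fit, d0), coef(fit)["x1"])  # the two numbers are identical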

All of our SLR quantities (estimates and their standard errors, fitted values, residuals, confidence and prediction intervals) extend to general linear models.
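For instance (again a sketch with simulated data), these quantities come from the same lm machinery as in SLR:

n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
summary(fit)$coefficients   # estimates, standard errors, t statistics, p-values
confint(fit)                # confidence intervals for the coefficients
predict(fit, data.frame(x1 = 0, x2 = 0), interval = "confidence")  # interval for the mean response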

Linear models

Linear models are the single most important applied statistical and machine learning technique, by far. There are some amazing things that you can accomplish with linear models.


Multivariable Regression Examples

Let's take a look at the Swiss fertility data

require(stats)
require(graphics)
library(datasets)
data(swiss)
pairs(swiss, panel = panel.smooth, main = "Swiss data", col = 3 + (swiss$Catholic > 
    50))

[Figure: pairs plot of the swiss data, points colored by Catholic > 50]

The dataset swiss contains 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].

All variables but ‘Fertility’ give proportions of the population.

summary(lm(Fertility ~ ., data = swiss))$coefficients
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)       66.9152   10.70604   6.250 1.906e-07
## Agriculture       -0.1721    0.07030  -2.448 1.873e-02
## Examination       -0.2580    0.25388  -1.016 3.155e-01
## Education         -0.8709    0.18303  -4.758 2.431e-05
## Catholic           0.1041    0.03526   2.953 5.190e-03
## Infant.Mortality   1.0770    0.38172   2.822 7.336e-03

We can interpret each of these coefficients as the expected change in Fertility per unit change in that regressor, holding the other regressors fixed. Compare with the unadjusted regression of Fertility on Agriculture alone

summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  60.3044    4.25126  14.185 3.216e-18
## Agriculture   0.1942    0.07671   2.532 1.492e-02

How can adjustment reverse the sign of an effect?

Let's try a simulation.

n <- 100
x2 <- 1:n
x1 <- 0.01 * x2 + runif(n, -0.1, 0.1)
y <- -x1 + x2 + rnorm(n, sd = 0.01)
summary(lm(y ~ x1))$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    2.527      1.131   2.235 2.769e-02
## x1            94.413      1.937  48.740 1.624e-70
summary(lm(y ~ x1 + x2))$coef
##               Estimate Std. Error   t value   Pr(>|t|)
## (Intercept) -0.0007991  0.0019155   -0.4172  6.775e-01
## x1          -0.9781867  0.0162410  -60.2296  1.186e-78
## x2           0.9997786  0.0001669 5990.9776 7.056e-272
par(mfrow = c(1, 2))
plot(x1, y, pch = 21, col = "black", bg = topo.colors(n)[x2], frame = FALSE, 
    cex = 1.5)
title("Unadjusted, color is X2")
abline(lm(y ~ x1), lwd = 2)
plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), pch = 21, col = "black", bg = "lightblue", 
    frame = FALSE, cex = 1.5)
title("Adjusted")
abline(0, coef(lm(y ~ x1 + x2))[2], lwd = 2)

[Figure: left panel "Unadjusted, color is X2"; right panel "Adjusted" residual-versus-residual plot]

If we go back to the swiss data set, we note the same phenomenon: the Agriculture coefficient is positive when Fertility is regressed on Agriculture alone, but negative once the other variables are adjusted for.

What if we include an unnecessary variable?

z adds no new linear information, since it's a linear combination of variables already included. R just drops terms that are linear combinations of other terms.

z <- swiss$Agriculture + swiss$Education
lm(Fertility ~ . + z, data = swiss)
## 
## Call:
## lm(formula = Fertility ~ . + z, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture       Examination         Education  
##           66.915            -0.172            -0.258            -0.871  
##         Catholic  Infant.Mortality                 z  
##            0.104             1.077                NA

Dummy Variables

Sometimes dummy variables are smart. Consider the linear model \( Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i} \) where each \( X_{i1} \) is binary so that it is a \( 1 \) if measurement \( i \) is in a group and \( 0 \) otherwise.

Then for people in the group \( E[Y_i] = \beta_0 + \beta_1 \) and for people not in the group \( E[Y_i] = \beta_0 \). The LS fit works out so that \( \hat \beta_0 + \hat \beta_1 \) is the mean for those in the group and \( \hat \beta_0 \) is the mean for those not in the group.
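A minimal sketch of this with simulated group data (the names are illustrative only):

n <- 100
group <- rbinom(n, 1, 0.5)              # 1 if measurement i is in the group, 0 otherwise
y <- 5 + 2 * group + rnorm(n)
fit <- lm(y ~ group)
c(coef(fit)[1], mean(y[group == 0]))    # intercept = mean of those not in the group
c(sum(coef(fit)), mean(y[group == 1]))  # intercept + slope = mean of those in the group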

More than 2 levels.

Consider a multilevel factor. For didactic reasons, let's say a three-level factor (for example, US political party affiliation: Republican, Democrat, Independent)

Another example with the dataset InsectSprays

The data frame InsectSprays contains 72 observations on 2 variables

data(InsectSprays)
str(InsectSprays)
## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
boxplot(count ~ spray, data = InsectSprays, xlab = "Type of spray", ylab = "Insect count", 
    main = "InsectSprays data", varwidth = TRUE, col = "lightgray")

[Figure: boxplots of insect count by type of spray, "InsectSprays data"]

If we take group A as the reference level, the linear model fit is

summary(lm(count ~ spray, data = InsectSprays))$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  14.5000      1.132 12.8074 1.471e-19
## sprayB        0.8333      1.601  0.5205 6.045e-01
## sprayC      -12.4167      1.601 -7.7550 7.267e-11
## sprayD       -9.5833      1.601 -5.9854 9.817e-08
## sprayE      -11.0000      1.601 -6.8702 2.754e-09
## sprayF        2.1667      1.601  1.3532 1.806e-01

Hard coding the dummy variables

summary(lm(count ~ I(1 * (spray == "B")) + I(1 * (spray == "C")) + I(1 * (spray == 
    "D")) + I(1 * (spray == "E")) + I(1 * (spray == "F")), data = InsectSprays))$coef
##                       Estimate Std. Error t value  Pr(>|t|)
## (Intercept)            14.5000      1.132 12.8074 1.471e-19
## I(1 * (spray == "B"))   0.8333      1.601  0.5205 6.045e-01
## I(1 * (spray == "C")) -12.4167      1.601 -7.7550 7.267e-11
## I(1 * (spray == "D"))  -9.5833      1.601 -5.9854 9.817e-08
## I(1 * (spray == "E")) -11.0000      1.601 -6.8702 2.754e-09
## I(1 * (spray == "F"))   2.1667      1.601  1.3532 1.806e-01

What if we include all 6?

lm(count ~ I(1 * (spray == "B")) + I(1 * (spray == "C")) + I(1 * (spray == "D")) + 
    I(1 * (spray == "E")) + I(1 * (spray == "F")) + I(1 * (spray == "A")), data = InsectSprays)
## 
## Call:
## lm(formula = count ~ I(1 * (spray == "B")) + I(1 * (spray == 
##     "C")) + I(1 * (spray == "D")) + I(1 * (spray == "E")) + I(1 * 
##     (spray == "F")) + I(1 * (spray == "A")), data = InsectSprays)
## 
## Coefficients:
##           (Intercept)  I(1 * (spray == "B"))  I(1 * (spray == "C"))  
##                14.500                  0.833                -12.417  
## I(1 * (spray == "D"))  I(1 * (spray == "E"))  I(1 * (spray == "F"))  
##                -9.583                -11.000                  2.167  
## I(1 * (spray == "A"))  
##                    NA

What if we omit the intercept?

summary(lm(count ~ spray - 1, data = InsectSprays))$coef
##        Estimate Std. Error t value  Pr(>|t|)
## sprayA   14.500      1.132  12.807 1.471e-19
## sprayB   15.333      1.132  13.543 1.002e-20
## sprayC    2.083      1.132   1.840 7.024e-02
## sprayD    4.917      1.132   4.343 4.953e-05
## sprayE    3.500      1.132   3.091 2.917e-03
## sprayF   16.667      1.132  14.721 1.573e-22
unique(ave(InsectSprays$count, InsectSprays$spray))
## [1] 14.500 15.333  2.083  4.917  3.500 16.667

We note that the coefficients in the no-intercept fit are exactly the six group means (compare with the ave output above), and each test is now of whether a group mean is zero rather than a comparison with the reference level A.

If we want comparisons between, say, Spray B and Spray C, we could refit the model with C (or B) as the reference level. For example

spray2 <- relevel(InsectSprays$spray, "C")
summary(lm(count ~ spray2, data = InsectSprays))$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    2.083      1.132  1.8401 7.024e-02
## spray2A       12.417      1.601  7.7550 7.267e-11
## spray2B       13.250      1.601  8.2755 8.510e-12
## spray2D        2.833      1.601  1.7696 8.141e-02
## spray2E        1.417      1.601  0.8848 3.795e-01
## spray2F       14.583      1.601  9.1083 2.794e-13

We can also do it manually \[ Var(\hat \beta_B - \hat \beta_C) = Var(\hat \beta_B) + Var(\hat \beta_C) - 2 Cov(\hat \beta_B, \hat \beta_C) \]

fit <- lm(count ~ spray, data = InsectSprays)  #A is ref
bbmbc <- coef(fit)[2] - coef(fit)[3]  #B - C
temp <- summary(fit)
se <- temp$sigma * sqrt(temp$cov.unscaled[2, 2] + temp$cov.unscaled[3, 3] - 
    2 * temp$cov.unscaled[2, 3])
t <- (bbmbc)/se
p <- pt(-abs(t), df = fit$df)
out <- c(bbmbc, se, t, p)
names(out) <- c("B - C", "SE", "T", "P")
round(out, 3)
##  B - C     SE      T      P 
## 13.250  1.601  8.276  0.000

Some other thoughts on this data.


Adjustment

Simulation 1

n <- 100
t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2), runif(n/2))
beta0 <- 0
beta1 <- 2
tau <- 1
sigma <- 0.2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1:(n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1):n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1:(n/2)], y[1:(n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1):n], y[(n/2 + 1):n], pch = 21, col = "black", bg = "salmon", 
    cex = 2)

[Figure: Simulation 1, y versus x with group means and the unadjusted and adjusted fitted lines]

Discussion.

Simulation 2

n <- 100
t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2), 1.5 + runif(n/2))
beta0 <- 0
beta1 <- 2
tau <- 0
sigma <- 0.2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1:(n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1):n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1:(n/2)], y[1:(n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1):n], y[(n/2 + 1):n], pch = 21, col = "black", bg = "salmon", 
    cex = 2)

[Figure: Simulation 2, y versus x with group means and the unadjusted and adjusted fitted lines]

Discussion.

Simulation 3

n <- 100
t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2), 0.9 + runif(n/2))
beta0 <- 0
beta1 <- 2
tau <- -1
sigma <- 0.2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1:(n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1):n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1:(n/2)], y[1:(n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1):n], y[(n/2 + 1):n], pch = 21, col = "black", bg = "salmon", 
    cex = 2)

[Figure: Simulation 3, y versus x with group means and the unadjusted and adjusted fitted lines]

Discussion.

Simulation 4

n <- 100
t <- rep(c(0, 1), c(n/2, n/2))
x <- c(0.5 + runif(n/2), runif(n/2))
beta0 <- 0
beta1 <- 2
tau <- 1
sigma <- 0.2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1:(n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1):n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1:(n/2)], y[1:(n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1):n], y[(n/2 + 1):n], pch = 21, col = "black", bg = "salmon", 
    cex = 2)

[Figure: Simulation 4, y versus x with group means and the unadjusted and adjusted fitted lines]

Discussion.

Simulation 5

n <- 100
t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2, -1, 1), runif(n/2, -1, 1))
beta0 <- 0
beta1 <- 2
tau <- 0
tau1 <- -4
sigma <- 0.2
y <- beta0 + x * beta1 + t * tau + t * x * tau1 + rnorm(n, sd = sigma)
plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1:(n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1):n]), lwd = 3)
fit <- lm(y ~ x + t + I(x * t))
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2] + coef(fit)[4], lwd = 3)
points(x[1:(n/2)], y[1:(n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1):n], y[(n/2 + 1):n], pch = 21, col = "black", bg = "salmon", 
    cex = 2)

[Figure: Simulation 5, y versus x with group means and fitted lines including the x-by-t interaction]

Discussion.

Simulation 6

[Figure: Simulation 6]

We can use an interactive 3D plot to investigate the joint relationship of y with x1 and x2

library(rgl)
plot3d(x1, x2, y)

Residual relationship

plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), frame = FALSE, col = "black", bg = "lightblue", 
    pch = 21, cex = 2)
abline(lm(resid(lm(y ~ x2)) ~ resid(lm(x1 ~ x2))), lwd = 2)  # regress the y residuals on the x1 residuals

[Figure: residuals of x1 given x2 versus residuals of y given x2, with fitted line]

Discussion.

Some final thoughts


Residuals, diagnostics, variation

Consider \( Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i} \). We'll also assume here that \( \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2) \) and define the residuals as \( e_i = Y_i - \hat Y_i = Y_i - \sum_{k=1}^p X_{ik} \hat \beta_k \)

Our estimate of residual variation is \( \hat \sigma^2 = \frac{\sum_{i=1}^n e_i^2}{n-p} \); we divide by \( n-p \) rather than \( n \) so that \( E[\hat \sigma^2] = \sigma^2 \)
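A quick check of this formula (a sketch using the swiss fit; summary()$sigma reports the same quantity):

data(swiss)
fit <- lm(Fertility ~ ., data = swiss)
n <- nrow(swiss)
p <- length(coef(fit))
sigma_hat <- sqrt(sum(resid(fit)^2)/(n - p))  # divide by n - p, not n
c(sigma_hat, summary(fit)$sigma)              # the two agree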

data(swiss)
par(mfrow = c(2, 2))
fit <- lm(Fertility ~ ., data = swiss)
plot(fit)

[Figure: the four standard lm diagnostic plots for the swiss fit]

A plot of influential, high-leverage and outlying points

n <- 100
x <- rnorm(n)
y <- x + rnorm(n, sd = 0.3)
plot(c(-3, 6), c(-3, 6), type = "n", frame = FALSE, xlab = "X", ylab = "Y")
abline(lm(y ~ x), lwd = 2)
points(x, y, cex = 2, bg = "lightblue", col = "black", pch = 21)
points(0, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)
points(0, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
points(5, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
points(5, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)

[Figure: fitted line through the simulated data with four orange points illustrating leverage and influence]

Calling a point an outlier is vague.

Influence measures

Do ?influence.measures to see the full suite of influence measures in stats. The measures include rstandard (standardized residuals), rstudent (externally Studentized residuals), hatvalues (leverage), dfbetas, dffits and cooks.distance.
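A brief sketch of how to pull these out for a fit (using the swiss fit; influence.measures and hatvalues are standard stats functions):

fit <- lm(Fertility ~ ., data = swiss)
im <- influence.measures(fit)   # dfbetas, dffit, covratio, Cook's distance, hat values
summary(im)                     # prints only the observations flagged as potentially influential
round(head(hatvalues(fit)), 3)  # leverage for the first few cantons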

How to use all of these things?

Examples

Case 1


n <- 100
x <- c(10, rnorm(n))
y <- c(10, c(rnorm(n)))
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
abline(lm(y ~ x))

[Figure: Case 1, scatterplot with the outlier at (10, 10) and the fitted line]

We notice that the point c(10, 10) has created a strong regression relationship where there shouldn't be one.


Showing a couple of the diagnostic values

fit <- lm(y ~ x)
round(dfbetas(fit)[1:10, 2], 3)
##      1      2      3      4      5      6      7      8      9     10 
##  6.496 -0.022  0.100  0.016  0.016 -0.026 -0.124 -0.010  0.016  0.039
round(hatvalues(fit)[1:10], 3)
##     1     2     3     4     5     6     7     8     9    10 
## 0.479 0.036 0.033 0.010 0.011 0.017 0.046 0.012 0.011 0.012

Case 2

x <- rnorm(n)
y <- x + rnorm(n, sd = 0.3)
x <- c(5, x)
y <- c(5, y)
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
fit2 <- lm(y ~ x)
abline(fit2)

[Figure: Case 2, scatterplot with the outlier at (5, 5) and the fitted line]

Look at some of the diagnostics

round(dfbetas(fit2)[1:10, 2], 3)
##      1      2      3      4      5      6      7      8      9     10 
## -0.009 -0.076  0.062  0.127 -0.232  0.003  0.063  0.000  0.011  0.001
round(hatvalues(fit2)[1:10], 3)
##     1     2     3     4     5     6     7     8     9    10 
## 0.202 0.011 0.039 0.015 0.021 0.013 0.013 0.011 0.010 0.010

An example described by Stefanski (The American Statistician, 2007, Vol. 61).

## Don't everyone hit this server at once.  Read the paper first.
dat <- read.table("http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt", 
    header = FALSE)
pairs(dat)

[Figure: pairs plot of the Stefanski data]

Once we have our P-values, should we bother to do a residual plot?

summary(lm(V1 ~ . - 1, data = dat))$coef
##    Estimate Std. Error t value  Pr(>|t|)
## V2   0.9856    0.12798   7.701 1.989e-14
## V3   0.9715    0.12664   7.671 2.500e-14
## V4   0.8606    0.11958   7.197 8.301e-13
## V5   0.9267    0.08328  11.127 4.778e-28

This is the residual plot, p-values significant, O RLY?

fit <- lm(V1 ~ . - 1, data = dat)
plot(predict(fit), resid(fit), pch = ".")

[Figure: residuals versus fitted values for the Stefanski data]


Multivariable Regression

We have an entire class on prediction and machine learning, so we'll focus on modeling. Prediction has a different set of criteria, needs for interpretability and standards for generalizability. In modeling, our interest lies in parsimonious, interpretable representations of the data that enhance our understanding of the phenomena under study.

A model is a lens through which to look at your data. (Scott Zeger)

Under this philosophy, what's the right model? Whatever model connects the data to a true, parsimonious statement about what you're studying. There are nearly uncountable ways that a model can be wrong; in this lecture, we'll focus on variable inclusion and exclusion. Like nearly all aspects of statistics, good modeling decisions are context dependent: a good model for prediction, one for studying mechanisms, and one for trying to establish causal effects may not be the same.

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know. (Donald Rumsfeld)

In our context: the known knowns are regressors that we know we should check for inclusion in the model and have; the known unknowns are regressors that we would like to include but don't have; the unknown unknowns are regressors that we don't even know about, and so can't include.

General Rules for multivariable regression

We can plot \( R^2 \) versus the number of regressors \( p \). In each simulation below, the number of included variables increases one at a time up to \( n = 100 \); no actual regression relationship exists in any simulation, yet \( R^2 \) climbs toward 1.

n <- 100
plot(c(1, n), 0:1, type = "n", frame = FALSE, xlab = "p", ylab = "R^2")
r <- sapply(1:n, function(p) {
    y <- rnorm(n)
    x <- matrix(rnorm(n * p), n, p)
    summary(lm(y ~ x))$r.squared
})
lines(1:n, r, lwd = 2)
abline(h = 1)

[Figure: R^2 versus the number of regressors p]

Variance inflation

n <- 100
nosim <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
betas <- sapply(1:nosim, function(i) {
    y <- x1 + rnorm(n, sd = 0.3)
    c(coef(lm(y ~ x1))[2], coef(lm(y ~ x1 + x2))[2], coef(lm(y ~ x1 + x2 + x3))[2])
})
round(apply(betas, 1, sd), 5)
##      x1      x1      x1 
## 0.02628 0.02661 0.02689

The variance inflation factor (VIF) for a regressor is the factor by which the variance of its estimated coefficient is inflated relative to the ideal setting in which that regressor is orthogonal to all of the others.

We can revisit our previous simulation

## doesn't depend on which y you use,
y <- x1 + rnorm(n, sd = 0.3)
a <- summary(lm(y ~ x1))$cov.unscaled[2, 2]
c(summary(lm(y ~ x1 + x2))$cov.unscaled[2, 2], summary(lm(y ~ x1 + x2 + x3))$cov.unscaled[2, 
    2])/a
## [1] 1.017 1.044
temp <- apply(betas, 1, var)
temp[2:3]/temp[1]
##    x1    x1 
## 1.026 1.047

For the swiss data

data(swiss)
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
a <- summary(fit1)$cov.unscaled[2, 2]
fit2 <- update(fit1, Fertility ~ Agriculture + Examination, data = swiss)
fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education, data = swiss)
c(summary(fit2)$cov.unscaled[2, 2], summary(fit3)$cov.unscaled[2, 2])/a
## [1] 1.892 2.089
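A closely related summary is the variance inflation factor, \( VIF_j = 1/(1 - R_j^2) \), where \( R_j^2 \) comes from regressing the \( j \)th regressor on all of the others. A sketch for the full swiss fit, assuming the car package is installed (vif comes from car, not from stats):

library(car)   # assumed to be installed
fit <- lm(Fertility ~ ., data = swiss)
vif(fit)       # variance inflation factor for each regressor
sqrt(vif(fit)) # inflation on the standard error scale
# manual check for Agriculture
r2 <- summary(lm(Agriculture ~ Examination + Education + Catholic + Infant.Mortality,
                 data = swiss))$r.squared
1/(1 - r2)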

Previous Module. Module I : Least Squares and Linear Regression

Next Module. Module III : Generalized Linear Models