Regression Models

This note is a reorganization of Dr. Brian Caffo's lecture notes for the Coursera course Regression Models.

Module I : Least Squares and Linear Regression

Introduction

Questions concerning regression models: How can we use parents' heights to predict their children's heights? How do we quantify the strength of that relationship, and how do we perform inference about it?

Galton's Data

require(MASS)
## Loading required package: MASS
library(UsingR)
str(galton)
## 'data.frame':    928 obs. of  2 variables:
##  $ child : num  61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
##  $ parent: num  70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...
par(mfrow = c(1, 2))
hist(galton$child, col = "blue", breaks = 100, xlab = "Child", ylab = "Count")
hist(galton$parent, col = "red", breaks = 100, xlab = "Parent", ylab = "Count")

[Figure: histograms of the children's (left) and parents' (right) heights]

Finding the middle via least squares

Consider only the children's heights. The "middle" in the least squares sense is the value \( \mu \) that minimizes \( \sum_{i=1}^n (Y_i - \mu)^2 \), and this minimizer is the empirical mean \( \bar Y \).

Proof.

\[ \begin{align} \sum_{i=1}^n (Y_i - \mu)^2 & = \ \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \ 2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\ \sum_{i=1}^n (\bar Y - \mu)^2 \\ & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2\\ & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \ \end{align} \]

So \( \sum_{i=1}^n (Y_i - \mu)^2 \) attains its minimum if and only if \( \mu = \bar Y \).
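
As a quick numerical check (a minimal sketch, reusing the galton data loaded above), optimize confirms that the minimizer of the sum of squared deviations coincides with the sample mean:

sse <- function(mu) sum((galton$child - mu)^2)  # sum of squared deviations from mu
opt <- optimize(sse, interval = range(galton$child))
c(optimum = opt$minimum, sampleMean = mean(galton$child))  # the two should agree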

Regression through the origin

First, let's compare the children's heights and their parents' heights.

In the right figure, the size of each point represents the number of observations at that (X, Y) combination.

par(mfrow = c(1, 2))
plot(x = galton$parent, y = galton$child, pch = 19, col = "blue", xlab = "Parent", 
    ylab = "Child")

freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
plot(as.numeric(as.vector(freqData$parent)), as.numeric(as.vector(freqData$child)), 
    pch = 21, col = "black", bg = "lightblue", cex = 0.15 * freqData$freq, xlab = "Parent", 
    ylab = "Child")

[Figure: child vs. parent heights (left); the same data with point size proportional to frequency (right)]

Suppose that the \( X_i \) are the parents' heights (the predictor, or explanatory variable) and the \( Y_i \) are the children's heights (the response). Consider picking the slope \( \beta \) that minimizes \[ \sum_{i=1}^n (Y_i - X_i \beta)^2 \] This uses the origin as a pivot point and picks the line through the origin that minimizes the sum of the squared vertical distances of the points to the line. Setting the derivative with respect to \( \beta \) to zero gives the closed-form solution \( \hat \beta = \sum_{i=1}^n X_i Y_i / \sum_{i=1}^n X_i^2 \).

If we subtract the means first, then the origin is the mean of the parents' and children's heights. In that case, the solution is

lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)
## 
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) - 
##     1, data = galton)
## 
## Coefficients:
## I(parent - mean(parent))  
##                    0.646
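
A minimal check of the closed form above, reusing the galton data with centered variables:

xc <- galton$parent - mean(galton$parent)
yc <- galton$child - mean(galton$child)
sum(xc * yc)/sum(xc^2)  # should match the lm coefficient above (about 0.646)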

The best fit line is

freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
plot(as.numeric(as.vector(freqData$parent)), as.numeric(as.vector(freqData$child)), 
    pch = 21, col = "black", bg = "lightblue", cex = 0.05 * freqData$freq, xlab = "Parent", 
    ylab = "Child")
lm1 <- lm(galton$child ~ galton$parent)
lines(galton$parent, lm1$fitted, col = "red", lwd = 3)

[Figure: Galton data (point size proportional to frequency) with the fitted regression line in red]


Ordinary Least Squares

This time, let's consider finding the best line, Child's Height = \( \beta_0 \) + \( \beta_1 \times \) Parent's Height, i.e. \[ Y = \beta_0 + \beta_1 X \] that minimizes the sum of squared residuals \[ \dagger := \sum_{i=1}^n [ Y_i - (\beta_0 + \beta_1 X_i)]^2 \]

Analysis.

Let \( \mu_i := \beta_0 + \beta_1 X_i \) and our estimates be \( \hat \mu_i := \hat \beta_0 + \hat \beta_1 X_i \).

Then we want to minimize \[ \dagger = \sum_{i=1}^n (Y_i - \mu_i)^2 = \sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2 + 2 \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2 \] Suppose that \( \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0 \) then \[ \dagger =\sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2 + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2\geq \sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2 \]

So that if \( \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0 \), then the line \[ Y = \hat \beta_0 + \hat \beta_1 X \] is the least squares line.

Conclusions.

The cross term vanishes for every candidate line when \( \sum_{i=1}^n (Y_i - \hat \mu_i) = 0 \) and \( \sum_{i=1}^n (Y_i - \hat \mu_i) X_i = 0 \) (the normal equations), whose solution is \[ \hat \beta_1 = \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2} = Cor(Y, X) \frac{Sd(Y)}{Sd(X)}, \qquad \hat \beta_0 = \bar Y - \hat \beta_1 \bar X. \] In particular, the least squares line passes through the point \( (\bar X, \bar Y) \); the slope of the regression of \( X \) on \( Y \) is \( Cor(Y, X) Sd(X)/Sd(Y) \); if the variables are centered first, the slope agrees with regression through the origin; and if both variables are normalized, the slope is the correlation \( Cor(Y, X) \). The code below verifies each of these facts.

Example with Galton's Data

y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) * sd(y)/sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x)))
##      (Intercept)      x
## [1,]       23.94 0.6463
## [2,]       23.94 0.6463
beta1 <- cor(y, x) * sd(x)/sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))
##      (Intercept)      y
## [1,]       46.14 0.3256
## [2,]       46.14 0.3256
yc <- y - mean(y)
xc <- x - mean(x)
beta1 <- sum(yc * xc)/sum(xc^2)
c(beta1, coef(lm(y ~ x))[2])
##             x 
## 0.6463 0.6463
yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
##                   xn 
## 0.4588 0.4588 0.4588
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
plot(as.numeric(as.vector(freqData$parent)), as.numeric(as.vector(freqData$child)), 
    pch = 21, col = "black", bg = "lightblue", cex = 0.05 * freqData$freq, xlab = "Parent", 
    ylab = "Child", xlim = c(62, 74), ylim = c(62, 74))
abline(mean(y) - mean(x) * cor(y, x) * sd(y)/sd(x), sd(y)/sd(x) * cor(y, x), 
    lwd = 3, col = "red")
abline(mean(y) - mean(x) * sd(y)/sd(x)/cor(y, x), sd(y)/sd(x)/cor(y, x), lwd = 3, 
    col = "blue")
abline(mean(y) - mean(x) * sd(y)/sd(x), sd(y)/sd(x), lwd = 2)
points(mean(x), mean(y), cex = 2, pch = 19)

[Figure: Galton data with the regression line of child on parent (red), the regression line of parent on child (blue), the SD line (black), and the point of means]


Regression to the Mean

Regression to the Mean is a historically famous idea. It answers questions such as why the children of tall parents tend to be tall, but not as tall as their parents (and, symmetrically, why the children of short parents tend to be taller than their parents).

These phenomena are all examples of so-called regression to the mean, an idea introduced by Francis Galton in the paper Regression towards mediocrity in hereditary stature. The main idea can be seen by imagining what would happen if you simulated pairs of correlated random normals.
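
A minimal simulation sketch of this idea (using MASS, which is already loaded, and an assumed correlation of 0.5 purely for illustration): among pairs whose first coordinate is extreme, the second coordinate is on average much closer to the overall mean.

set.seed(1)
rho <- 0.5  # assumed correlation, for illustration only
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
xy <- mvrnorm(n = 10000, mu = c(0, 0), Sigma = Sigma)  # correlated standard normal pairs
mean(xy[xy[, 1] > 2, 1])  # average of the extreme first coordinates
mean(xy[xy[, 1] > 2, 2])  # the paired second coordinates regress back toward 0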

First, we normalize \( X \) (the fathers' heights) and \( Y \) (the sons' heights) so that they both have mean 0 and variance 1. Recall that the regression line then passes through the origin \( (0, 0) \) and has slope \( \rho = Cor(Y, X) \).

data(father.son)
str(father.son)
## 'data.frame':    1078 obs. of  2 variables:
##  $ fheight: num  65 63.3 65 65.8 61.1 ...
##  $ sheight: num  59.8 63.2 63.3 62.8 64.3 ...
y <- (father.son$sheight - mean(father.son$sheight))/sd(father.son$sheight)
x <- (father.son$fheight - mean(father.son$fheight))/sd(father.son$fheight)
rho <- cor(x, y)
myPlot <- function(x, y) {
    plot(x, y, xlab = "Father's height, normalized", ylab = "Son's height, normalized", 
        xlim = c(-3, 3), ylim = c(-3, 3), bg = "lightblue", col = "black", cex = 1.1, 
        pch = 21, frame = FALSE)
}
myPlot(x, y)
abline(0, 1)
abline(0, rho, lwd = 2, col = "red")
abline(0, 1/rho, lwd = 2, col = "blue")
abline(h = 0)
abline(v = 0)

[Figure: normalized father and son heights with the identity line (black), the regression line of son on father with slope \( \rho \) (red), and the line with slope \( 1/\rho \) (blue)]

Conclusions.

If we predict a son's normalized height from his father's, the prediction is \( \rho X_i \) (the red line); predicting a father's normalized height from his son's corresponds to the blue line with slope \( 1/\rho \) in this plot. Since \( |\rho| < 1 \), both predictions are shrunk toward the mean; only with perfect correlation (the identity line) would there be no regression to the mean.


Linear Regression

As least squares is an estimation tool, how do we do inference? We can develop a probabilistic model for linear regression \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i} \] where the \( \epsilon_i \) are assumed to be iid \( N(0, \sigma^2) \).
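
A minimal simulation sketch of this model (with made-up parameter values, purely for illustration) shows that data generated this way and fit with lm yields estimates close to the assumed coefficients:

set.seed(2)
n <- 100
beta0 <- 1; beta1 <- 2; sigma <- 0.5  # assumed true values, for illustration only
x <- runif(n, 0, 5)
y <- beta0 + beta1 * x + rnorm(n, sd = sigma)  # simulate from the probabilistic model
coef(lm(y ~ x))  # should be close to c(1, 2)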

We also need to interpret the two regression coefficients: the intercept \( \beta_0 \) is the expected response when the predictor is 0, and the slope \( \beta_1 \) is the expected change in the response for a one-unit change in the predictor.

Example with the data diamond

The data frame diamond (from UsingR) contains 48 observations on 2 variables: the mass of the diamond in carats and its price in Singapore dollars.

str(diamond)
## 'data.frame':    48 obs. of  2 variables:
##  $ carat: num  0.17 0.16 0.17 0.18 0.25 0.16 0.15 0.19 0.21 0.15 ...
##  $ price: int  355 328 350 325 642 342 322 485 483 323 ...

Let's first plot the data and the fitted regression line

data(diamond)
plot(diamond$carat, diamond$price, xlab = "Mass (carats)", ylab = "Price (SIN $)", 
    bg = "lightblue", col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(lm(price ~ carat, data = diamond), lwd = 2)

[Figure: diamond price vs. mass with the fitted regression line]

The coefficients of the fitted line are

fit <- lm(price ~ carat, data = diamond)
coef(fit)
## (Intercept)       carat 
##      -259.6      3721.0

To get a more interpretable intercept, we can mean-center the predictor:

fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
##            (Intercept) I(carat - mean(carat)) 
##                  500.1                 3721.0

Thus $500.08 is the expected price for the average sized diamond of the data (0.2042 carats).

We can also change the scale. For example, since a one carat increase in a diamond is pretty big, what about changing the units to 1/10th of a carat? We can do this by simply dividing the coefficient by 10.

We then expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in the mass of the diamond. We get the same answer if we rescale the Xs and refit:

fit3 <- lm(price ~ I(carat * 10), data = diamond)
coef(fit3)
##   (Intercept) I(carat * 10) 
##        -259.6         372.1

If we want to predict the price of a diamond, we can do

newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
## [1]  335.7  745.1 1005.5
predict(fit, newdata = data.frame(carat = newx))
##      1      2      3 
##  335.7  745.1 1005.5

The fitted values at the observed Xs are shown in red; the predictions at the new Xs are marked by the vertical lines and their labels.

plot(diamond$carat, diamond$price, xlab = "Mass (carats)", ylab = "Price (SIN $)", 
    bg = "lightblue", col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
points(diamond$carat, predict(fit), pch = 19, col = "red")
lines(c(0.16, 0.16, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.16, coef(fit)[1] + 
    coef(fit)[2] * 0.16))
lines(c(0.27, 0.27, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.27, coef(fit)[1] + 
    coef(fit)[2] * 0.27))
lines(c(0.34, 0.34, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.34, coef(fit)[1] + 
    coef(fit)[2] * 0.34))
text(newx, rep(250, 3), labels = newx, pos = 2)

[Figure: diamond data with fitted values at the observed masses (red) and predictions at the new masses marked by lines and labels]


Residuals

Consider the model \( Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \), where \( \epsilon_i \sim N(0, \sigma^2) \).

The residual is defined by Data = Fit + Residual, i.e. \[ e_i = Y_i - \hat Y_i \] In other words, it is the vertical distance between the observed data point and the regression line. Since least squares minimizes \( \sum_{i=1}^n e_i^2 \), the \( e_i \) can be thought of as estimates of the \( \epsilon_i \).

Properties of the residuals

Under this model, \( E[e_i] = 0 \). If an intercept is included in the model, then \( \sum_{i=1}^n e_i = 0 \); if a regressor \( X_i \) is included, then \( \sum_{i=1}^n e_i X_i = 0 \). Residuals are also useful for investigating poor model fit, as the examples below illustrate.

If we take a look at the dataset diamond in the previous section

y <- diamond$price
x <- diamond$carat
n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)  # residual of the fit model
yhat <- predict(fit)
max(abs(e - (y - yhat)))  # To show that e = y - yhat
## [1] 9.486e-13
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
## [1] 9.486e-13
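
As a quick check of the properties above (a minimal sketch reusing the fit, e, and x just computed):

sum(e)  # approximately 0, because the model includes an intercept
sum(e * x)  # approximately 0, because x is included as a regressor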

In the left figure below, the residuals are the signed lengths of the red lines; the right figure is the residual plot for the fitted model.

par(mfrow = c(1, 2))
plot(diamond$carat, diamond$price, xlab = "Mass (carats)", ylab = "Price (SIN $)", 
    bg = "lightblue", col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
for (i in 1:n) {
    lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red", lwd = 2)
}
plot(diamond$carat, e, xlab = "Mass (carats)", ylab = "Residuals (SIN $)", bg = "lightblue", 
    col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1:n) {
    lines(c(x[i], x[i]), c(e[i], 0), col = "red", lwd = 2)
}

[Figure: residuals shown as red segments around the fitted line (left) and the corresponding residual plot (right)]

We also notice that for nonlinear data, as shown in the left figure below, fitting a linear model leaves a clear pattern in the residual plot, shown in the right figure below.

par(mfrow = c(1, 2))
x <- runif(100, -3, 3)
y <- x + sin(x) + rnorm(100, sd = 0.2)
plot(x, y)
abline(lm(y ~ x))
plot(x, resid(lm(y ~ x)))
abline(h = 0)

[Figure: nonlinear data with a linear fit (left) and its residual plot (right)]

For some data the fit looks very good, as shown in the left figure below, but if we plot the residuals a pattern suddenly becomes obvious: the variability of the residuals grows with \( x \). This is called heteroskedasticity.

par(mfrow = c(1, 2))
x <- runif(100, 0, 6)
y <- x + rnorm(100, mean = 0, sd = 0.001 * x)
plot(x, y)
abline(lm(y ~ x))
plot(x, resid(lm(y ~ x)))
abline(h = 0)

[Figure: heteroskedastic data with a linear fit (left) and its residual plot (right)]

Residual Variation

Consider the model \( Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \), where \( \epsilon_i \sim N(0, \sigma^2) \). The ML estimate of \( \sigma^2 \) is \( \frac{1}{n}\sum_{i=1}^n e_i^2 \), the average squared residual. Most people instead use \[ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2, \] where the divisor \( n-2 \) (rather than \( n \)) makes the estimator unbiased, i.e. \( E[\hat \sigma^2] = \sigma^2 \). In our diamond example:

y <- diamond$price
x <- diamond$carat
n <- length(y)
fit <- lm(y ~ x)
summary(fit)$sigma
## [1] 31.84
sqrt(sum(resid(fit)^2)/(n - 2))
## [1] 31.84

Synthesis of variation

We notice that the total variation of responses \[ \begin{align} \sum_{i=1}^n (Y_i - \bar Y)^2 & = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \\ & = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ \end{align} \]

As \( \bar Y = \hat \beta_0 + \hat \beta_1 \bar X \), we can have \[ \begin{align} Y_i - \hat Y_i &= Y_i - (\bar Y - \hat \beta_1 \bar X) - \hat \beta_1 X_i = (Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X) \\ \hat Y_i - \bar Y &= \bar Y - \hat \beta_1 \bar X + \hat \beta_1 X_i - \bar Y = \hat \beta_1 (X_i - \bar X) \end{align} \] Then \[ \begin{align} \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) &= \sum_{i=1}^n [(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)][\hat \beta_1 (X_i - \bar X)] \\ &= \hat \beta_1 \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) -\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2 \\ &= \hat \beta_1^2 \sum_{i=1}^n (X_i - \bar X)^2-\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2 = 0 \end{align} \] Thus \[ \sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \]

i.e. Total Variation = Residual Variation + Regression Variation

Recall that the correlation between the two variables is defined as \[ R = \frac{1}{n-1} \sum_{i=1}^n \frac{X_i - \bar X}{s_X} \frac{Y_i - \bar Y}{s_Y} \] Because \( \hat Y_i - \bar Y = \hat \beta_1 (X_i - \bar X) \) and \( \hat \beta_1 = R \, s_Y / s_X \), we have \[ R^2 = \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \] i.e. \( R^2 \) is the proportion of the total variation explained by the regression model, or \( R^2 \) = 1 - Residual Variation\( / \) Total Variation.
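
As a quick numerical check of the decomposition and of \( R^2 \) (a minimal sketch reusing y, x, and fit from the diamond example just above):

yhat <- fitted(fit)
c(total = sum((y - mean(y))^2), residPlusReg = sum((y - yhat)^2) + sum((yhat - mean(y))^2))  # should agree
c(R2 = sum((yhat - mean(y))^2)/sum((y - mean(y))^2), summaryR2 = summary(fit)$r.squared, corSquared = cor(x, y)^2)  # should all agree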

Properties of \( R^2 \)

\( R^2 \) is the percentage of variation explained by the regression model and satisfies \( 0 \leq R^2 \leq 1 \). In simple linear regression it equals the squared sample correlation between the predictor and the response. On its own, however, \( R^2 \) can be a misleading summary of model fit, so residual plots should be examined as well.



Inference in regression - Understanding the regression output

Consider the model \( Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \) with \( \epsilon_i \sim N(0, \sigma^2) \); we assume that this true model holds.

Recall that statistics like \( \dfrac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}} \) often have the following properties:

  1. Is normally distributed, and has a finite-sample Student's t distribution if the true variance is replaced with a sample estimate (under normality assumptions).
  2. Can be used to test \( H_0 : \theta = \theta_0 \) versus \( H_a : \theta >, <, \neq \theta_0 \).
  3. Can be used to create a confidence interval for \( \theta \) via \( \hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta} \) where \( Q_{1-\alpha/2} \) is the relevant quantile from either a normal or T distribution.

The variance of \( \hat \beta_1 \), conditional on \( X \), is \[ \begin{align} Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y) (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X) \right) }{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2}\\ & = \frac{\sum_{i=1}^n \sigma^2(X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} \end{align} \] so the standard error is \( \sigma_{\hat \beta_1} = \sigma / \sqrt{\sum_{i=1}^n (X_i - \bar X)^2} \).

Similarly, for \( \hat \beta_0 \), the variance conditional on \( X \) is \[ Var(\hat \beta_0) = \left( \frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2} \right) \sigma^2. \]

In practice, \( \sigma \) is replaced by its estimate. It's probably not surprising that, under iid Gaussian errors, \( \dfrac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}} \) follows a \( t \) distribution with \( n-2 \) degrees of freedom, and approximately a normal distribution for large \( n \). This can be used to create confidence intervals and perform hypothesis tests.

Example with the diamond dataset

y <- diamond$price
x <- diamond$carat
n <- length(y)
beta1 <- cor(y, x) * sd(y)/sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x  # Residual
sigma <- sqrt(sum(e^2)/(n - 2))
ssx <- sum((x - mean(x))^2)
seBeta0 <- sqrt(1/n + mean(x)^2/ssx) * sigma  # sigma_beta_0
seBeta1 <- sigma/sqrt(ssx)  # sigma_beta_1

tBeta0 <- beta0/seBeta0
tBeta1 <- beta1/seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, 
    pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")

We can then compare the coefficients obtained by our manual computation with those from the lm function in R:

coefTable
##             Estimate Std. Error t value   P(>|t|)
## (Intercept)   -259.6      17.32  -14.99 2.523e-19
## x             3721.0      81.79   45.50 6.751e-40
fit <- lm(y ~ x)
summary(fit)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   -259.6      17.32  -14.99 2.523e-19
## x             3721.0      81.79   45.50 6.751e-40

To get confidence intervals for the coefficients \( \beta_0 \) and \( \beta_1 \), we can do

sumCoef <- summary(fit)$coefficients
sumCoef[1, 1] + c(-1, 1) * qt(0.975, df = fit$df) * sumCoef[1, 2]  # beta_0
## [1] -294.5 -224.8
sumCoef[2, 1] + c(-1, 1) * qt(0.975, df = fit$df) * sumCoef[2, 2]  # beta_1
## [1] 3556 3886

With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.

Confidence and prediction intervals

Consider predicting \( Y \) at a value of \( X \). The obvious estimate at a point \( x_0 \) is \( \hat \beta_0 + \hat \beta_1 x_0 \). The standard error for the estimated regression line at \( x_0 \) is \[ \hat \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}, \] while the standard error for predicting a new observation at \( x_0 \) is \[ \hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}. \] The first gives a confidence interval for the line; the second, wider one gives a prediction interval for a new response.
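
As a quick numerical sketch at a single, hypothetical mass of 0.2 carats, the manual formulas reproduce what predict returns with interval = "confidence" and interval = "prediction" (reusing fit, sigma, n, ssx, beta0, beta1, and x from above):

x0 <- 0.2  # a hypothetical mass, for illustration only
seLine <- sigma * sqrt(1/n + (x0 - mean(x))^2/ssx)
sePred <- sigma * sqrt(1 + 1/n + (x0 - mean(x))^2/ssx)
est <- beta0 + beta1 * x0
est + c(-1, 1) * qt(0.975, df = n - 2) * seLine  # confidence interval for the line at x0
est + c(-1, 1) * qt(0.975, df = n - 2) * sePred  # prediction interval for a new diamond at x0
predict(fit, data.frame(x = x0), interval = "confidence")
predict(fit, data.frame(x = x0), interval = "prediction")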

We can plot both the confidence and prediction intervals as

par(mfrow = c(1, 2))
plot(x, y, frame = FALSE, xlab = "Carat", ylab = "Dollars", pch = 21, col = "black", 
    bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = 0.01)
yVals <- beta0 + beta1 * xVals
se1 <- sigma * sqrt(1/n + (xVals - mean(x))^2/ssx)
se2 <- sigma * sqrt(1 + 1/n + (xVals - mean(x))^2/ssx)
lines(xVals, yVals + 2 * se1)
lines(xVals, yVals - 2 * se1)
lines(xVals, yVals + 2 * se2)
lines(xVals, yVals - 2 * se2)

newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = ("confidence"))
p2 <- predict(fit, newdata, interval = ("prediction"))
plot(x, y, frame = FALSE, xlab = "Carat", ylab = "Dollars", pch = 21, col = "black", 
    bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
lines(xVals, p1[, 2])
lines(xVals, p1[, 3])
lines(xVals, p2[, 2])
lines(xVals, p2[, 3])

[Figure: manually computed interval bands (left) and the confidence and prediction intervals from predict (right)]

Discussion.

Both intervals are narrowest at \( x_0 = \bar X \) and widen as \( x_0 \) moves away from the center of the observed data. The prediction interval is wider than the confidence interval because it accounts for the variability of a new observation around the line (the extra 1 under the square root), not just the uncertainty in the estimated line itself.


Next Module. Module II : Multivariable Regression