LM Model

  1. Suppose that we have data \(\{Y_{i}\mid i=1,\dots,n\}\). We will show that the sum of squared deviations \(\sum_i\left(Y_i-\mu\right)^2\) is minimized at \(\mu=\bar{Y}\); this is the simplest least squares problem (a quick numerical check follows this list). \[\begin{eqnarray*} \sum_{i=1}^{n}\left(Y_i-\mu\right)^2&=&\sum_{i=1}^{n}\left(Y_i-\bar{Y}+\bar{Y} -\mu\right)^2\\ &=&\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2- \underbrace{2\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)\left(\mu-\bar{Y}\right)}_{=0} + \sum_{i=1}^{n}\left(\mu-\bar{Y}\right)^2\\ &=& \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^2+ \sum_{i=1}^{n}\left(\mu-\bar{Y}\right)^2\\ &\geq& \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^2,\qquad \text{with equality when }\mu=\bar{Y}\qquad[1] \end{eqnarray*}\]
  2. Suppose that we have data \(\{(X_i,Y_{i})\mid i=1,\dots,n\}\) and we want to fit a line through the 2-d cloud of points that minimizes the squared vertical distances (measured parallel to the y-axis) between the data points and the line. We will first fit a line with no intercept and then reuse this result for the general linear model with an intercept.
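As a quick numerical check of [1] (on simulated data, so the specific numbers are purely illustrative), we can evaluate the sum of squares over a grid of candidate values of \(\mu\) and verify that the minimizer agrees with the sample mean.

## numerical check of [1]: the sum of squares is minimized at the sample mean
set.seed(1)
y   <- rnorm(50, mean = 3, sd = 2)
mu  <- seq(0, 6, by = 0.001)
sse <- sapply(mu, function(m) sum((y - m)^2))
c(grid.minimizer = mu[which.min(sse)], sample.mean = mean(y))
## the two values agree up to the grid resolution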

Linear Model With No Intercept:

We will find the slope \(\beta_1\) that minimizes the following sum of squares, \[\begin{eqnarray*} \sum_{i=1}^{n}\left(Y_{i}-\beta_{1}X_{i}\right)^{2} &=& \sum_{i=1}^{n}\left(Y_{i}-\hat{\beta_{1}}X_{i}+ \hat{\beta_{1}}X_{i}-\beta_{1}X_{i}\right)^{2}\\ &=& \sum_{i=1}^{n}\left(Y_{i}-\hat{\beta_{1}}X_{i}\right)^{2} + 2 \sum_{i=1}^{n}\left(Y_i-\hat{\beta_1}X_{i}\right)\left(\hat{\beta_1}- \beta_1\right)X_{i} + \left(\hat{\beta_1}-\beta_1\right)^2\sum_{i=1}^{n} X_{i}^{2} \end{eqnarray*}\] If we choose \(\hat{\beta_1}\) so that the middle term vanishes for every \(\beta_1\), the whole expression is minimized at \(\beta_1=\hat{\beta_1}\), since the last term is non-negative and equals zero there. Setting the middle term equal to zero gives \[\begin{eqnarray*} \sum_{i=1}^{n}\left(Y_i-\hat{\beta_1}X_{i}\right)X_{i} &=& 0 \\ \sum_{i=1}^{n}\left(Y_iX_i-\hat{\beta_1}X_{i}^{2}\right) &=& 0 \\ \hat{\beta_{1}} &=& \frac{\sum\limits_{i=1}^{n}X_iY_i}{\sum\limits_{i=1}^{n}X_i^2 }\qquad [2] \end{eqnarray*} \]
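A small sanity check of [2] on simulated data (the particular slope and noise level are arbitrary): the closed-form estimate should coincide with what `lm` returns when the intercept is suppressed with `y ~ x - 1`.

## check of [2]: slope of the no-intercept fit equals sum(x*y)/sum(x^2)
set.seed(2)
x <- runif(40, 1, 5)
y <- 2 * x + rnorm(40, sd = 0.5)
c(formula = sum(x * y) / sum(x^2),
  lm      = unname(coef(lm(y ~ x - 1))))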

General Linear Model:

We will fit a line with an intercept by minimizing the following sum of squares. \[\begin{eqnarray*} \sum_{i=1}^{n}\left(Y_{i}-\beta_0-\beta_1 X_{i}\right)^{2} &=& \sum_{i=1}^{n}\left(\underbrace{Y_{i}-\beta_1X_i}_{Y_{i}^{*}}-\beta_0\right)^{2} \\ &=& \sum_{i=1}^{n}\left(Y_{i}^{*}-\beta_0\right)^{2} \qquad\text{by } [1]\\ \hat{\beta_0} &=& \bar{Y^{*}}=\overline{Y-\beta_{1}X}= \bar{Y}-\beta_{1}\bar{X} \end{eqnarray*} \] Now that we have expressed \(\hat{\beta_0}\) in terms of \(\beta_1\), we substitute it back and minimize over \(\beta_1\). Writing \(Y_{i}^{c}=Y_i-\bar{Y}\) and \(X_{i}^{c}=X_i-\bar{X}\) for the centered variables, \[ \begin{eqnarray*} \sum_{i=1}^{n}\left(Y_i^{*}-\hat{\beta_0}\right)^{2} &=& \sum_{i=1}^{n}\left(Y_i^{*}-\bar{Y}+\beta_{1}\bar{X}\right)^{2}\\ &=&\sum_{i=1}^{n}\left(Y_{i}-\beta_{1}X_i-\bar{Y}+\beta_1\bar{X}\right)^{2}\\ &=&\sum_{i=1}^{n}\left(\underbrace{Y_{i}-\bar{Y}}_{Y^{c}_{i}} - \beta_1\underbrace{\left(X_i-\bar{X}\right)}_{X_i^{c}} \right)^{2}\\ &=& \sum_{i=1}^{n}\left(Y_i^{c}-\beta_1X_i^{c}\right)^{2} \qquad\text{from [2], the model with no intercept}\\ \hat{\beta_{1}} &=& \frac{\sum\limits_{i=1}^{n}X_{i}^{c}Y_{i}^{c}} {\sum\limits_{i=1}^{n}\left[X_{i}^{c}\right]^2} \end{eqnarray*} \] From the last result, \[ \begin{eqnarray*} \hat{\beta_{1}} &=& \frac{\sum\limits_{i=1}^{n}X_{i}^{c}Y_{i}^{c}} {\sum\limits_{i=1}^{n}\left[X_{i}^{c}\right]^2}\\ &=& \frac{\sum\limits_{i=1}^{n}\left(X_i-\bar{X}\right) \left(Y_i-\bar{Y}\right)}{\sum\limits_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}} \qquad\text{dividing numerator and denominator by } n-1\\ &=&\frac{\text{cov}\left(X,Y\right)}{s_{X}^2}= \frac{\text{cov}\left(X,Y\right)}{s_{X}s_{Y}}\frac{s_Y}{s_X}=\text{cor} \left(X,Y\right)\frac{s_Y}{s_X} \end{eqnarray*} \]
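The same kind of check, again on simulated data chosen only for illustration, confirms that the closed-form intercept and slope match the coefficients reported by `lm`.

## check: intercept and slope formulas against lm()
set.seed(3)
x  <- runif(40, 1, 5)
y  <- 1 + 2 * x + rnorm(40, sd = 0.5)
b1 <- cor(x, y) * sd(y) / sd(x)      # slope from the formula above
b0 <- mean(y) - b1 * mean(x)         # intercept from the formula above
rbind(formulas = c(b0, b1),
      lm       = unname(coef(lm(y ~ x))))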

Residuals

We will load the “UsingR” package, which provides the data sets used below, and set a seed for reproducibility.

if(!require(UsingR)){install.packages('UsingR')}
library(UsingR)
set.seed(666)

Example With Diamond Data

We will plot the data “diamond” and fit a simple linear regression model.

carat <- diamond$carat
price <- diamond$price
fit   <- lm(price ~ carat,diamond)
plot(carat,price,main = 'Diamond Data With Regression Line',
     col = rgb(0.8,0.2,0,0.6),pch = 19,cex=1.3)
abline(fit,lwd = 2,col = 'red')
segments(carat,price,carat,predict(fit),col='black')

We will now plot the residuals against the carat feature.

x <- seq(min(carat),max(carat)+0.02,0.01)
f <- resid(fit)
plot(x,rep(0,length(x)),col='red',lwd = 2,type = 'l',xlab = 'carat',
     ylab = 'residuals',main = 'residuals against carats',
     ylim = c(min(f),max(f)))
points(carat,f,col = rgb(1,0,0,0.5),cex = 1.2,pch = 19)
segments(carat,0,carat,f)

From the plot we see that the residuals vary randomly around zero, which supports the choice of a linear fit.

Let us now focus on some nonlinear cases, in which a look at the residuals shows that a linear fit is inappropriate.

x <- runif(100,-3,3)
y <- x + sin(x) + rnorm(100,sd=0.2)
plot(x,y,cex=1.2,col=rgb(1,0,0,0.4),pch=19)
fit <- lm(y ~ x,data.frame(x=x,y=y))
abline(fit,lwd = 2 , col = 'cyan')

and now let’s add the residuals

plot(x,y,cex=1.2,col=rgb(1,0,0,0.4),pch=19)
abline(fit,lwd = 2 , col = 'cyan')
segments(x,y,x,predict(fit))

Let’s plot the residuals against the predictor x. From the following figure it is evident that the linear fit misses part of the structure in the data, since the residuals show a periodic pattern rather than random scatter.

f <- resid(fit)
plot(seq(-3,3,0.01),rep(0,601),type = 'l',col = 'red',
     xlab = 'x',ylab = 'residuals',main = 'residuals against predictor x',
     ylim = c(min(f),max(f)))
points(x,f,cex=1.2,col=rgb(1,0,0,0.5),pch=19)
segments(x,0,x,f)

Another example in which the standard linear model assumptions do not hold is the following,

x <- runif(100,0,6)
y <- x + rnorm(100,mean=0,sd=.001*x)
plot(x,y,cex=1.2,col=rgb(1,0,0,0.4),pch=19)
fit <- lm(y ~ x,data.frame(x=x,y=y))
abline(fit,lwd = 2 , col = 'cyan')

we see that the linear model captures the trend, but if we zoom in on the residuals we see that their variability increases with x.

f <- resid(fit)
plot(seq(0,6,0.01),rep(0,601),type = 'l',col = 'red',
     xlab = 'x',ylab = 'residuals',main = 'residuals against predictor x',
     ylim = c(min(f),max(f)))
points(x,f,cex=1.2,col=rgb(1,0,0,0.5),pch=19)
segments(x,0,x,f)

This is called “heteroscedasticity”.

Some Notes On Error Variation

Suppose our linear model is \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i}\), with \(\epsilon_i\sim\mathcal{N}\left(0,\sigma^2\right)\). The estimate of the population error variation naturally comes from the residual variation; in fact the maximum likelihood estimate of \(\sigma^2\) is the average squared residual \(\displaystyle\frac{1}{n}\sum_{i=1}^{n}e_{i}^2\). If an intercept is included (so the residuals sum to zero), we divide by \(n-2\) instead of \(n\) to get an unbiased estimate, since two degrees of freedom were “burnt” estimating the intercept and the slope. So \[\hat{\sigma}^{2}=\frac{1}{n-2}\sum_{i=1}^{n}e_{i}^{2}\] and this estimator is unbiased: \(\displaystyle \mathbb{E}[\hat{\sigma}^{2}]=\sigma^2\). Let’s go back to the diamond example to see this in action,

y <- diamond$price
x <- diamond$carat
n <- length(y)
fit <- lm(y ~ x)
## σ Estimate From The Model
summary(fit)$sigma
[1] 31.84052
## Direct Biased Calculation
sqrt(sum(resid(fit)^2)/n)
[1] 31.17012
## Unbiased Direct Calculation
sqrt(sum(resid(fit)^2)/(n-2))
[1] 31.84052

As a last remark, it is worth noting that the total variability of the data equals the sum of the error variability and the regression variability, \[\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2} = \sum_{i=1}^{n}\left(Y_{i}-\hat{Y}_{i}\right)^{2} + \sum_{i=1}^{n}\left(\hat{Y}_{i}-\bar{Y}\right)^{2},\] where the two terms on the right-hand side are the residual and regression sums of squares respectively.
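We can verify this decomposition numerically for the diamond fit; the two quantities below should agree up to floating-point error.

## check: total SS = residual SS + regression SS for the diamond fit
fit   <- lm(price ~ carat, diamond)
ssTot <- sum((diamond$price - mean(diamond$price))^2)   # total variability
ssRes <- sum(resid(fit)^2)                              # error variability
ssReg <- sum((fitted(fit) - mean(diamond$price))^2)     # regression variability
c(total = ssTot, residual.plus.regression = ssRes + ssReg)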

if(!require(beeswarm)){install.packages('beeswarm')}
Loading required package: beeswarm
residuals <- c(resid(lm(diamond$price ~ 1)),
               resid(lm(diamond$price~diamond$carat)))
model     <- as.factor(c(rep('price only',48),rep('price and carat',48)))
beeswarm(residuals ~ model,col = c(rgb(1,0,0,0.5),rgb(0.6,0,.4,0.5)),
         xlab = 'fitting approach',pch=19,cex=1.2)
boxplot (residuals ~ model,col = c(rgb(1,0,0,0.2),rgb(0,1,0,0.2)),add = TRUE)
title(main = 'Variation Explained By Two Fitting Approaches - Linear Model')

It is worth noting that there is a substantial decrease in residual variation when we include the diamond mass (carat) in the model.
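To put a number on that decrease, we can compare the spread of the residuals under the two fitting approaches (sd() divides by n - 1, so these are simple descriptive summaries rather than the model estimate of sigma).

## residual spread: intercept-only model versus model with carat
c(price.only      = sd(resid(lm(price ~ 1, diamond))),
  price.and.carat = sd(resid(lm(price ~ carat, diamond))))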

R-Squared

The R-squared is the proportion of the total variability which is explained by the fitted model. \[ \mathcal{R}^{2}=\frac{\sum\limits_{i=1}^{n}\left(\hat{Y_i}-\bar{Y}\right)^2} {\sum\limits_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^2} \] It follows that \(\mathcal{R}^{2}\in[0,1]\) and, for simple linear regression, that it equals the squared sample correlation between \(X\) and \(Y\). We must be very careful in how we interpret \(\mathcal{R}^{2}\), since

  1. deleting data points can inflate \(\mathcal{R}^{2}\), and
  2. adding regressors to a model can never decrease \(\mathcal{R}^{2}\), however irrelevant they are,

and so for the above two reasons we see that it may become misleading. A very good example of this is shown in the following graphs.

data("anscombe")
par(mfrow = c(2,2))
for (ii in 1:4){
  x <- anscombe[,ii]
  y <- anscombe[,ii+4]
  plot(x,y,col = 'red',pch=19,cex=1.2,xlab='x',ylab='y')
  abline(lm(y~x))
}
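The point becomes concrete if we also print the R-squared of each Anscombe fit: the four values are nearly identical (and equal to the squared sample correlation) even though only the first panel looks like a sensible linear fit.

## R-squared and squared correlation for the four Anscombe fits
sapply(1:4, function(ii) {
  x <- anscombe[, ii]
  y <- anscombe[, ii + 4]
  c(r.squared   = summary(lm(y ~ x))$r.squared,
    cor.squared = cor(x, y)^2)
})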

Regression Inference

Our regression model is \[Y_{i}=\beta_0+\beta_1 X_i + \epsilon_i,\qquad \epsilon_i\sim\mathcal{N}(0,\sigma^2)\] and \(\hat{\beta_0}=\bar{Y}-\hat{\beta_1}\bar{X}\) and \(\displaystyle \hat{\beta_1}=\text{cor}(X,Y)\frac{s_Y}{s_X}\) give us the best fit. The statistic \[\frac{\hat{\theta}-\theta}{\sigma_{ \hat{\theta}}}\] consists of \(\hat{\theta}\), the estimate of interest, \(\theta\), the estimand of interest, and \(\sigma_{\hat{\theta}}\), the standard error of \(\hat{\theta}\). In many cases these statistics have the following properties:

  1. they follow a Student’s t distribution (here with \(n-2\) degrees of freedom) when the standard error is estimated from the data, and are approximately standard normal for large \(n\);
  2. they can be used to test null hypotheses about \(\theta\) and to construct confidence intervals for \(\theta\).

We will use the above information to give confidence intervals and p-values for the intercept and slope of a linear regression model.

Inference For The Regression Parameters

Variance Estimation:

  1. \(\sigma_{\hat{\beta_1}}^2\) estimation: \[ \begin{eqnarray*} \sigma_{\hat{\beta_1}}^2 &=& \text{var}\left(\frac{\sum\limits_{i=1}^{n} [X_i-\bar{X}][Y_i-\bar{Y}]} {\sum\limits_{i=1}^{n}\left[X_{i}-\bar{X}\right]^2}\right)\\ &=& \text{var}\left(\frac{\sum\limits_{i=1}^{n} [X_i-\bar{X}]Y_i} {\sum\limits_{i=1}^{n}\left[X_{i}-\bar{X}\right]^2}\right)\\ &=&\text{var}\left(\frac{\sum\limits_{i=1}^{n} [X_i-\bar{X}][\beta_0+\beta_1X_i+\epsilon_i]} {\sum\limits_{i=1}^{n}\left[X_{i}-\bar{X}\right]^2}\right)\\ &=&\frac{\text{var}\left(\sum\limits_{i=1}^{n}\left[X_i-\bar{X}\right] \epsilon_i\right)} {\left[\sum\limits_{i=1}^n\left[X_i-\bar{X}\right]^2\right]^2}\qquad\text{the non-random terms do not contribute}\\&=& \frac{\sum\limits_{i=1}^{n}\text{var}\left(\left[X_i-\bar{X}\right] \epsilon_i\right)} {\left[\sum\limits_{i=1}^n\left[X_i-\bar{X}\right]^2\right]^2}\qquad\text{the }\epsilon_i\text{ are independent}\\ &=&\frac{\sigma^2\sum\limits_{i=1}^{n}\left[X_i-\bar{X}\right]^2}{ \left[\sum\limits_{i=1}^n\left[X_i-\bar{X}\right]^2\right]^2 }\\ &=&\frac{\sigma^2}{\sum\limits_{i=1}^{n}\left[X_i-\bar{X}\right]^2} \end{eqnarray*} \]

  2. \(\sigma_{\hat{\beta_0}}^2\) estimation (using the fact that \(\bar{Y}\) and \(\hat{\beta_{1}}\) are uncorrelated): \[ \begin{eqnarray*} \sigma_{\hat{\beta_0}}^{2}&=&\text{var}\left(\bar{Y}-\hat{\beta_{1}} \bar{X}\right)\\ &=& \text{var}\left(\bar{Y}\right) + \text{var}\left(\hat{\beta_{1}} \bar{X}\right)\\ &=& \frac{\sigma^2}{n} + \frac{\sigma^2\bar{X}^2}{\sum\limits_{i=1}^{n} [X_i-\bar{X}]^2}\\ &=&\left(\frac{1}{n}+\frac{\bar{X}^2}{\sum\limits_{i=1}^{n} [X_i-\bar{X}]^2}\right)\sigma^2 \end{eqnarray*} \]

We should note that in both formulas \(\sigma\) is unknown and is estimated by the residual standard deviation \(\hat{\sigma}\) defined above.

Example From The Diamond Set

y     <- diamond$price
x     <- diamond$carat
n     <- length(y)
beta1 <- cor(x,y)*sd(y)/sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e     <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2)/(n-2))
ssx   <- sum((x-mean(x))^2)

Now we are ready to test the null hypothesis \(\mathcal{H}_{0}:\beta_{j}=0\),

seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0
tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0),
                   c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "carat")
coefTable
             Estimate Std. Error   t value      P(>|t|)
(Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
carat       3721.0249   81.78588  45.49715 6.751260e-40

In both rows we see the estimate, its standard error, the t-statistic and the p-value. Since both p-values are essentially zero we can safely reject the null hypotheses and conclude that both coefficients differ from zero. An easy way to obtain the same table in R, without the calculations above, is shown below.

summary(lm(price~carat,diamond))$coefficients
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
carat       3721.0249   81.78588  45.49715 6.751260e-40

Confidence Intervals For The Coefficients And Prediction Intervals (c.i. & p.i.)

We should recall that a c.i. has the form \(\hat{\theta}\pm t_{1-\alpha/2,df}\cdot \text{se}\), where \(\hat{\theta}\) is the estimate, df the degrees of freedom, \(\alpha\) the alpha level and se the standard error of the estimate. A \((1-\alpha)\cdot 100 \%\) confidence interval means that if we repeatedly drew samples of size n and constructed the interval each time, we would expect about \((1-\alpha)\cdot 100\%\) of those intervals to contain the true value of the parameter we are interested in. Let’s construct these intervals for the linear model fit.
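This interpretation can be checked by simulation. The sketch below repeatedly generates data from a model with a known slope of 2 (an arbitrary choice), builds the 95% interval for the slope each time, and reports the observed coverage, which should be close to 0.95.

## simulated coverage of the 95% c.i. for the slope (true slope = 2)
set.seed(42)
covered <- replicate(2000, {
  x  <- runif(30, 0, 10)
  y  <- 1 + 2 * x + rnorm(30)              # true slope is 2
  cf <- summary(lm(y ~ x))$coefficients
  ci <- cf[2, 1] + c(-1, 1) * qt(.975, df = 30 - 2) * cf[2, 2]
  ci[1] < 2 & 2 < ci[2]                    # does the interval cover the truth?
})
mean(covered)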

fit          <- lm(price ~ carat,diamond)
coefs        <- summary(fit)$coefficients
## c.i. for intercept
coefs[1,1] + c(-1,1)*qt(.975,fit$df)*coefs[1,2]
[1] -294.4870 -224.7649
## c.i. for slope, divided by 10 to show the effect of a 0.1-carat change
(coefs[2,1] + c(-1,1)*qt(.975,fit$df)*coefs[2,2])/10
[1] 355.6398 388.5651

The prediction interval is something different: it describes where a single new observation is expected to fall, rather than where the mean response lies. Its standard error therefore contains an extra \(\hat{\sigma}^2\) term for the variability of the individual observation around the regression line, and a \((1-\alpha)\cdot 100\%\) prediction interval is expected to contain the value of a new observation with probability \(1-\alpha\).

Prediction Interval: \(\hat{Y}_0 \pm t_{1-\alpha/2,df}\cdot \sqrt{\hat{\sigma}^{2}+\text{se}^2}\), where \(\hat{Y}_0\) is the fitted value at the new point and se is the standard error of that fitted mean.

In the case of the linear model for the “diamond” data set, for every carat value \(x_0\) within the range of the data, the se (standard error) of the estimated mean price is \[\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x_0-\bar{X})^2}{\sum\limits_{i=1}^{n} (X_i-\bar{X})^{2}}}\] and the se for the prediction interval around the fitted line is \[\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar{X})^2}{\sum\limits_{i=1}^{n} (X_i-\bar{X})^{2}}}\] Let us incorporate these two intervals into our linear regression model for the “diamond” data set; “ggplot” lets us show this without any complex calculations.
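Before plotting, we can check these two formulas at a single new point, say x0 = 0.3 carat (an arbitrary value inside the observed range), against the intervals produced by `predict`; the quantities beta0, beta1, sigma, ssx and n are the ones computed in the diamond example above.

## manual c.i. and p.i. at x0 = 0.3 carat versus predict()
x0    <- 0.3
yhat0 <- beta0 + beta1 * x0
seCI  <- sigma * sqrt(1/n + (x0 - mean(x))^2 / ssx)       # se of the estimated mean
sePI  <- sigma * sqrt(1 + 1/n + (x0 - mean(x))^2 / ssx)   # se for a new observation
rbind(manual.ci = yhat0 + c(-1, 1) * qt(.975, n - 2) * seCI,
      manual.pi = yhat0 + c(-1, 1) * qt(.975, n - 2) * sePI)
predict(fit, data.frame(carat = x0), interval = "confidence")
predict(fit, data.frame(carat = x0), interval = "prediction")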

library(ggplot2)   # attach ggplot2 explicitly; it is not necessarily attached by UsingR
newx <- data.frame(carat = seq(min(x), max(x), length = 100))
p1   <- data.frame(predict(fit,newx,interval = ("confidence")))
p2   <- data.frame(predict(fit,newx,interval = ("prediction")))
p1$interval <- "confidence"
p2$interval <- "prediction"
p1$carat <- newx$carat
p2$carat <- newx$carat
dat  <- rbind(p1, p2)
names(dat)[1] <- "price"
g <- ggplot(dat, aes(x=carat,y = price))
g <- g + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = interval),
                     alpha = 0.5)
g <- g + geom_line()
g <- g + geom_point(data = data.frame(x = x, y=y), aes(x = x, y = y),
                    colour=rgb(1,0,0,0.4),size = 3)
g + theme_bw()

We see the confidence interval in red and the prediction interval in cyan; the latter shows graphically the range within which the price of a new diamond, with a given carat value inside the observed range (even one not present in the data set), is predicted to fall along the price axis with 95% probability, the default level used by predict.