
Source

Adapted from the course on regression models offered by Brian Caffo at the Johns Hopkins Bloomberg School of Public Health

Introduction to least squares and linear regression

Consider trying to answer questions such as: how do children’s heights relate to their parents’ heights, and can we use the parents’ heights to predict their children’s heights?

Let’s look first at the data, used by Francis Galton in 1885. Galton was a statistician who invented the term and concepts of regression and correlation, founded the journal Biometrika, and was a cousin of Charles Darwin. You may need to run install.packages("UsingR") if the UsingR library is not installed. Let’s look at the first few records of the dataset:

library(UsingR) 
library(reshape)
library(knitr)
library(printr)
data(galton)
kable(head(galton),align = 'c')
child parent
61.7 70.5
61.7 68.5
61.7 65.5
61.7 64.5
61.7 64.0
62.2 67.5

Changing the structure of the data to a long format:

long <- melt(galton)
kable(head(long),align = 'c')
variable value
child 61.7
child 61.7
child 61.7
child 61.7
child 61.7
child 62.2

Let’s look at the marginal distributions first (parents disregarding children, and children disregarding parents). The parent distribution is based on all heterosexual couples, with a correction for gender made by multiplying female heights by 1.08. Overplotting is an issue because the heights were discretized.

library(ggplot2)
g <- ggplot(long, aes(x = value, fill = variable)) 
g <- g + geom_histogram(colour = "black", binwidth=1) 
g <- g + facet_grid(. ~ variable)
g

Finding the middle via least squares

In this section we show how to find the middle of the distribution using least squares. Consider only the children’s heights. How could one describe the “middle”? One definition: let \(Y_i\) be the height of child \(i\) for \(i = 1, \ldots, n = 928\), and define the middle as the value of \(\mu\) that minimizes

\[\sum_{i=1}^n (Y_i - \mu)^2\]

This is the physical center of mass of the histogram. You might have guessed that the answer is \(\mu = \bar Y\). Indeed, the least squares estimate is the empirical mean, as shown below:

g <- ggplot(galton, aes(x = child)) + 
        geom_histogram(fill = "salmon", colour = "black", binwidth=1)+ 
        geom_vline(xintercept = mean(galton$child), size = 3)
g

We can prove this mathematically, as shown below:

\[ \begin{align} \sum_{i=1}^n \left(Y_i - \mu\right)^2 & = \ \sum_{i=1}^n \left(Y_i - \bar Y + \bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ 2 \sum_{i=1}^n \left(Y_i - \bar Y\right) \left(\bar Y - \mu\right) +\ \sum_{i=1}^n \left(\bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ 2 \left(\bar Y - \mu\right) \sum_{i=1}^n \left(Y_i - \bar Y\right) +\ \sum_{i=1}^n \left(\bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ 2 \left(\bar Y - \mu\right) \left(\left(\sum_{i=1}^n Y_i\right) -\ n \bar Y\right) +\ \sum_{i=1}^n \left(\bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ \sum_{i=1}^n \left(\bar Y - \mu\right)^2\\ & \geq \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 \ \end{align} \]
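As a quick numerical sketch (our own addition, not part of the original notes), we can minimize the criterion directly over \(\mu\) and confirm that the minimizer matches the sample mean; the helper mse is a name we introduce for illustration:

# Minimize the sum of squared deviations numerically and compare with the mean.
mse <- function(mu, y) sum((y - mu)^2)
opt <- optimize(mse, interval = range(galton$child), y = galton$child)
c(optimal_mu = opt$minimum, sample_mean = mean(galton$child))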

Plotting the data

Now let’s create a scatterplot comparing children’s heights and their parents’ heights.

ggplot(galton, aes(x = parent, y = child)) + geom_point()

Unfortunately, many of the points overlap in this plot, so in the new plot below the size of each point represents the number of observations at that (X, Y) combination.

library(dplyr)
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour = "grey50", aes(size = freq + 20), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")                    
g

Regression through the origin

Suppose that \(X_i\) are the parents’ heights. Consider picking the slope \(\beta\) that minimizes

\[\sum_{i=1}^n (Y_i - X_i \beta)^2\]

This is exactly using the origin as a pivot point and picking the line through it that minimizes the sum of the squared vertical distances from the points to the line. In the code below we subtract the means so that the origin is the mean of the parents’ and children’s heights. In the lm call below we subtract 1 to remove the intercept. Later we’ll talk more about why this is the solution.
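In the meantime, for reference, here is the standard closed form (a textbook result, not derived in the original notes): setting the derivative of the criterion with respect to \(\beta\) to zero gives

\[ \frac{d}{d\beta} \sum_{i=1}^n (Y_i - X_i \beta)^2 = -2 \sum_{i=1}^n X_i (Y_i - X_i \beta) = 0 \quad \Rightarrow \quad \hat \beta = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2} \]

which is exactly the quantity computed by hand on the centered data later in this section.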

lm(I(child - mean(child))~ I(parent - mean(parent)) - 1, data = galton)
## 
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) - 
##     1, data = galton)
## 
## Coefficients:
## I(parent - mean(parent))  
##                   0.6463

Now we re-create the scatterplot with the fitted line, which is forced to pass through the mean of the parents’ heights and the mean of the children’s heights. The slope of this line is 0.646.

g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour = "grey50", aes(size = freq + 20), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")  
g <- g + geom_smooth(method="lm", formula=y~x)
g

Least squares estimation of regression lines

Fitting the best line

Let \(Y_i\) be the \(i^{th}\) child’s height and \(X_i\) be the \(i^{th}\) (average over the pair of) parents’ heights. Consider finding the best line: Child’s Height = \(\beta_0\) + \(\beta_1 \times\) Parent’s Height.

Using least squares, we minimize: \[ \sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2 \]

The least squares model fit to the line \(Y = \beta_0 + \beta_1 X\) through the data pairs \((X_i, Y_i)\) with \(Y_i\) as the outcome obtains the line \(Y = \hat \beta_0 + \hat \beta_1 X\) where

\[\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\]

\(\hat \beta_1\) has the units of \(Y / X\), and \(\hat \beta_0\) has the units of \(Y\). The line passes through the point \((\bar X, \bar Y)\).

The slope of the regression line with \(X\) as the outcome and \(Y\) as the predictor is \(Cor(Y, X) Sd(X)/ Sd(Y)\).

The slope is the same one you would get if you centered the data, \((X_i - \bar X, Y_i - \bar Y)\), and did regression through the origin.

If you normalized the data, \(\{ \frac{X_i - \bar X}{Sd(X)}, \frac{Y_i - \bar Y}{Sd(Y)}\}\), the slope is \(Cor(Y, X)\).

Now let’s double-check these calculations using R on Galton’s data.

y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) *  sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x)))
(Intercept) x
23.94153 0.6462906
23.94153 0.6462906

As you can see, the values from R’s lm function and our calculations are identical. Now let’s see what happens if we reverse the outcome/predictor relationship.

beta1 <- cor(y, x) *  sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))
(Intercept) y
46.13535 0.3256475
46.13535 0.3256475

You can also see below that regression through the origin yields an equivalent slope if you center the data first:

yc <- y - mean(y)
xc <- x - mean(x)
beta1 <- sum(yc * xc) / sum(xc ^ 2)
kable(c(beta1, coef(lm(y ~ x))[2]))
0.6462906
x 0.6462906

Another key point mentioned above is that normalizing the variables results in the slope being the correlation, as you can see below:

yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
kable(c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2]))
0.4587624
0.4587624
0.4587624

Finally, below is the scatterplot with the fitted linear regression line.

g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour = "grey50", aes(size = freq + 20), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")  
g <- g + geom_smooth(method="lm", formula=y~x)
g

Regression to the mean

Regression to the mean is a historically famous idea. It concerns questions such as why the children of very tall parents tend to be tall, yet not as tall as their parents, and why the children of very short parents tend to be short, yet not as short as their parents.

These phenomena are all examples of so-called regression to the mean. It was introduced by Francis Galton in the paper “Regression towards mediocrity in hereditary stature”, The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15 (1886).

Think of it this way: imagine simulating pairs of correlated random normals. The largest values of the first variable are the largest partly by chance, so the corresponding values of the second variable are likely to be smaller. In other words, \(P(Y < x | X = x)\) gets bigger as \(x\) heads into the very large values. Similarly, \(P(Y > x | X = x)\) gets bigger as \(x\) heads to very small values.
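Here is a small simulation sketch of this idea (our own illustration, not from the original notes; the correlation value 0.5 and the 95th percentile cutoff are arbitrary choices):

# Simulate correlated pairs of standard normals; among the most extreme first
# draws, the second draws are on average closer to the overall mean of zero.
set.seed(1)
sims <- 10000
r <- 0.5
first <- rnorm(sims)
second <- r * first + sqrt(1 - r^2) * rnorm(sims)  # Cor(first, second) is about r
top <- first > quantile(first, 0.95)               # the most extreme first draws
c(mean_first_top = mean(first[top]), mean_second_top = mean(second[top]))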

Think of the regression line as the intrinsic part of the relationship. Unless \(Cor(Y, X) = 1\), the intrinsic part isn’t perfect.

Suppose that we normalize \(X\) (the parents’ height) and \(Y\) (the child’s height) so that they both have mean 0 and variance 1. Then, recall, our regression line passes through \((0, 0)\), the mean of \(X\) and \(Y\). The slope of the regression line is \(Cor(Y,X)\) regardless of which variable is the outcome (recall, both standard deviations are 1). Notice that if \(X\) is the outcome and you create a plot where \(X\) is the horizontal axis, the slope of the least squares line that you plot is \(1/Cor(Y, X)\).

Below is an example of this:

library(UsingR)
data(father.son)
y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
rho <- cor(x, y)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_point(size = 5, alpha = .2, colour = "black")
g = g + geom_point(size = 4, alpha = .2, colour = "red")
g = g + xlim(-4, 4) + ylim(-4, 4)
g = g + geom_vline(xintercept = 0)
g = g + geom_hline(yintercept = 0)
g = g + geom_abline(intercept = 0, slope = 1)                   # identity line
g = g + geom_abline(intercept = 0, slope = rho, size = 2)       # regression of y on x
g = g + geom_abline(intercept = 0, slope = 1 / rho, size = 2)   # regression of x on y, plotted on these axes
g = g + xlab("Father's height, normalized")
g = g + ylab("Son's height, normalized")
g
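As a quick numerical check (our own addition), both regression slopes on the normalized father and son heights equal the sample correlation:

# With normalized data, the slope of y on x and the slope of x on y
# both equal the correlation rho.
c(rho = rho,
  slope_y_on_x = unname(coef(lm(y ~ x))["x"]),
  slope_x_on_y = unname(coef(lm(x ~ y))["y"]))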

Therefore we can conclude that, for normalized variables, the slope of the regression line is the correlation, which is less than one in absolute value unless the variables are perfectly correlated. Extreme values of one variable are therefore associated with less extreme expected values of the other; this is regression to the mean.

Statistical linear regression models

Basic regression model with additive Gaussian errors

In the previous sections we saw that least squares is an estimation tool, but how do we do inference?

Consider developing a probabilistic model for linear regression \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i} \]

Where the \(\epsilon_{i}\) are assumed iid \(N(0, \sigma^2)\).

Please note that \(E[Y_i ~|~ X_i = x_i] = \mu_i = \beta_0 + \beta_1 x_i\) and \(Var(Y_i ~|~ X_i = x_i) = \sigma^2\).

Also, the maximum likelihood (ML) estimates of \(\beta_0\) and \(\beta_1\) are the least squares estimates, given by

\[\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\]

Interpreting the intercept

Now let's talk about interpreting the intercept coefficient \(\beta_0\).

\(\beta_0\) is the expected value of the response when the predictor is 0 \[ E[Y | X = 0] = \beta_0 + \beta_1 \times 0 = \beta_0 \]

However, this isn’t always of interest; for example, when \(X = 0\) is impossible or far outside the range of the data (e.g., when \(X\) is blood pressure or height).

Shifting your \(X\) values by a value \(a\) changes the intercept, but not the slope, as shown below:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + a \beta_1 + \beta_1 (X_i - a) + \epsilon_i = \tilde \beta_0 + \beta_1 (X_i - a) + \epsilon_i \]

Therefore, \(a\) is often set to \(\bar X\) so that the intercept is interpreted as the expected response at the average \(X\) value.

Interpreting the slope

\(\beta_1\) is the expected change in response for a 1 unit change in the predictor : \[ E[Y ~|~ X = x+1] - E[Y ~|~ X = x] = \beta_0 + \beta_1 (x + 1) - (\beta_0 + \beta_1 x ) = \beta_1 \]
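As a quick illustration (our own example; the fit object fit_g and the parent heights 68 and 69 inches are arbitrary choices), predictions one unit apart differ by exactly the estimated slope:

# Predictions at parent heights one inch apart differ by the fitted slope.
fit_g <- lm(child ~ parent, data = galton)
diff(predict(fit_g, newdata = data.frame(parent = c(68, 69))))  # equals coef(fit_g)["parent"]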

Another important point is the impact of changing the units of \(X\). Multiplying \(X\) by a factor \(a\) results in dividing the coefficient by a factor of \(a\), as shown in the following calculation:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \frac{\beta_1}{a} (X_i a) + \epsilon_i = \beta_0 + \tilde \beta_1 (X_i a) + \epsilon_i \]

As an example, assume that \(X\) is height in \(m\) and \(Y\) is weight in \(kg\). Then \(\beta_1\) has units of \(kg/m\). Converting \(X\) to \(cm\) means multiplying \(X\) by \(100 cm/m\), so to keep the fitted values unchanged we must divide \(\beta_1\) by \(100 cm/m\), giving units of \(kg/cm\):

\[ X m \times \frac{100cm}{m} = (100 X) cm ~~\mbox{and}~~ \beta_1 \frac{kg}{m} \times\frac{1 m}{100cm} = \left(\frac{\beta_1}{100}\right)\frac{kg}{cm} \]

Using regression coefficients for prediction

If we would like to guess the outcome at a particular value of the predictor, say \(X\), the regression model’s guess is:

\[ \hat \beta_0 + \hat \beta_1 X \]

Let’s look at an example using the diamond data set from UsingR. The data consist of diamond prices (in Singapore dollars) and diamond weights in carats (a standard measure of diamond mass, 0.2 \(g\)). The first few records of the data are shown below:

data(diamond)
kable (head(diamond),align = 'c')
carat price
0.17 355
0.16 328
0.17 350
0.18 325
0.25 642
0.16 342

We first create a plot of the data with a fitted linear line.
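The plotting code is not included in the source; a minimal sketch using ggplot2 (loaded earlier) might look like this:

# Scatterplot of price versus carat with a fitted least squares line.
g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + geom_point(size = 4, colour = "blue", alpha = 0.3)
g <- g + geom_smooth(method = "lm", formula = y ~ x, colour = "black")
g <- g + xlab("Mass (carats)") + ylab("Price (SIN $)")
g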

Now we fit the linear regression model and find the regression coefficients:

fit <- lm(price ~ carat, data = diamond)
kable(coef(fit), align='c')
(Intercept) -259.6259
carat 3721.0249

Therefore, we estimate an expected 3721.02 (SIN) dollar increase in price for every carat increase in mass of diamond. The intercept -259.63 is the expected price of a 0 carat diamond.

Let’s get a more interpretable intercept, as described above:

fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
kable(coef(fit2), align='c')
(Intercept) 500.0833
I(carat - mean(carat)) 3721.0249

Thus $500.1 is the expected price for a diamond of average size in the data (0.2041667 carats).

In addition, with regard to the slope, a one carat increase in a diamond is pretty big. What about changing the units to 1/10th of a carat?

We can do this by simply dividing the coefficient by 10: we expect a 372.102 (SIN) dollar change in price for every 1/10th carat increase in the mass of the diamond. Below we show that the result is the same if we rescale the Xs and refit:

fit3 <- lm(price ~ I(carat * 10), data = diamond)
kable(coef(fit3), align='c')
(Intercept) -259.6259
I(carat * 10) 372.1025

Now it’s time to predict the price of a diamond. Below we predict with two different methods; both give identical results.

newx <- c(0.16, 0.27, 0.34)
Method_1 <- coef(fit)[1] + coef(fit)[2] * newx
Method_2  <- predict(fit, newdata = data.frame(carat = newx))
rbind(Method_1, Method_2)
1 2 3
Method_1 335.7381 745.0508 1005.523
Method_2 335.7381 745.0508 1005.523

The graph below presents the predicted values at the observed Xs (red) and at the new Xs (vertical lines).
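The plotting code is not shown in the source; one way to produce such a figure (a sketch assuming fit and newx from above are still in the workspace) is:

# Data, fitted line, fitted values at the observed carats (red points),
# and vertical lines marking the new carat values being predicted.
plot(diamond$carat, diamond$price,
     xlab = "Mass (carats)", ylab = "Price (SIN $)",
     bg = "lightblue", col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
points(diamond$carat, predict(fit), pch = 19, col = "red")
for (nx in newx) abline(v = nx, lwd = 2, col = "grey")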

Residuals and residual variation

Residuals

Consider the linear regression model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\).

The observed outcome \(i\) is \(Y_i\) at predictor value \(X_i\). We denote the predicted outcome \(i\) at predictor value \(X_i\) by \(\hat Y_i\), which is:

\[ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i \]

We define the residual as the difference between the observed and predicted outcome. It is the vertical distance between the observed data point and the regression line:

\[ e_i = Y_i - \hat Y_i \]

The \(e_i\) can be thought of as estimates of the \(\epsilon_i\). The least squares method minimizes \(\sum_{i=1}^n e_i^2\), and one can show that \(E[e_i] = 0\).

If an intercept is included, \(\sum_{i=1}^n e_i = 0\)

Also, if a regressor variable, \(X_i\), is included in the model \(\sum_{i=1}^n e_i X_i = 0\).
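A quick numerical check of these two properties on the Galton fit (our own illustration; both sums are zero up to floating point error):

# Residuals sum to zero and are orthogonal to the included regressor.
fit_g <- lm(child ~ parent, data = galton)
c(sum_e = sum(resid(fit_g)), sum_e_x = sum(resid(fit_g) * galton$parent))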

Residuals are useful for investigating poor model fit. Positive residuals are above the line, negative residuals are below. Residuals can be thought of as the outcome (\(Y\)) with the linear association of the predictor (\(X\)) removed. One differentiates residual variation (variation after removing the predictor) from systematic variation (variation explained by the regression model). Residual plots can highlight poor model fit.

Let’s consider the diamond data set again, whose scatterplot we saw earlier.

Below we calculate the residuals with two different methods:

y <- diamond$price 
x <- diamond$carat
n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e -(y - yhat)))
## [1] 9.485746e-13
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
## [1] 9.485746e-13

In the figure below, the residuals are the signed lengths of the red lines:

plot(diamond$carat, diamond$price,  
     xlab = "Mass (carats)", 
     ylab = "Price (SIN $)", 
     bg = "lightblue", 
     col = "black", cex = 2, pch = 21,frame = FALSE)
abline(fit, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)

Now let’s plot the residuals versus the values of X:

plot(x, e,  
     xlab = "Mass (carats)", 
     ylab = "Residuals (SIN $)", 
     bg = "lightblue", 
     col = "black", cex = 2, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)

We don’t observe any specific pattern in the residuals, and that is good!

Residual analysis for non-linear data

Now let’s repeat this exercise, this time with non-linear data, and see how the residuals behave. Consider the following generated non-linear data:

x = runif(100, -3, 3); y = x + sin(x) + rnorm(100, sd = .2); 
library(ggplot2)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g

Now let’s look at the residual plot:

g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), 
           aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2); 
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g

Clearly the residual plot highlights the cyclic pattern that exists in the data.

Heteroskedasticity

Sometimes an unusual pattern cannot be observed in the scatterplot, but the residual plot clearly highlights it. One such phenomenon is heteroskedasticity, where the variability of the errors changes with \(X\), as in the example below:

x <- runif(100, 0, 6); y <- x + rnorm(100,  mean = 0, sd = .001 * x); 
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g

Here is the corresponding residual plot:

g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), 
           aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2); 
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g

Estimating the residual variation (residual standard error)

Consider the model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\).

The ML estimate of \(\sigma^2\) is \(\frac{1}{n}\sum_{i=1}^n e_i^2\), the average squared residual.

However most people use \[ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2. \]

where the \(n-2\) instead of \(n\) makes the estimator unbiased, i.e. \(E[\hat \sigma^2] = \sigma^2\).

Let’s find the residual variation in the diamond example with two methods:

y <- diamond$price
x <- diamond$carat
n <- length(y)
fit <- lm(y ~ x)
summary(fit)$sigma
## [1] 31.84052
sqrt(sum(resid(fit)^2) / (n - 2))
## [1] 31.84052

In the model summary below it is reported as the residual standard error:

summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.159 -21.448  -0.869  18.972  79.370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -259.63      17.32  -14.99   <2e-16 ***
## x            3721.02      81.79   45.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778 
## F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

Summarizing variation

Consider the diamond data residual plot once again:

diamond$e <- resid(lm(price ~ carat, data = diamond))
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g

Below we plot, first, the variation around the average price (intercept-only model) and, second, the variation around the regression line:

e = c(resid(lm(price ~ 1, data = diamond)),
      resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
               rep("Itc, slope", nrow(diamond))))
g = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
g = g + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
g = g + xlab("Fitting approach")
g = g + ylab("Residual price")
g


The total variability in our response is the variability around an intercept (think mean-only regression): \(\sum_{i=1}^n (Y_i - \bar Y)^2\)

The regression variability is the variability explained by adding the predictor: \(\sum_{i=1}^n (\hat Y_i - \bar Y)^2\)

The error variability is what is left over around the regression line: \(\sum_{i=1}^n (Y_i - \hat Y_i)^2\)

A neat fact is that \[ \sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \]
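We can verify this decomposition numerically on the diamond fit (an illustrative check, not in the original notes):

# Total sum of squares equals residual plus regression sums of squares.
fit_d <- lm(price ~ carat, data = diamond)
total_ss <- sum((diamond$price - mean(diamond$price))^2)
resid_ss <- sum(resid(fit_d)^2)
reg_ss <- sum((fitted(fit_d) - mean(diamond$price))^2)
c(total = total_ss, resid_plus_reg = resid_ss + reg_ss)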

R squared

R squared is the percentage of the total variability that is explained by the linear relationship with the predictor \[ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \]

\(R^2\) is the percentage of variation explained by the regression model.

Always \(0 \leq R^2 \leq 1\)

\(R^2\) is the sample correlation squared.

\(R^2\) can be a misleading summary of model fit. Deleting data can inflate \(R^2\). Adding terms to a regression model always increases \(R^2\).

Consider the following example using the anscombe data set; all four data sets produce essentially the same regression fit and \(R^2\), yet the plots show how different the underlying data can be:

require(stats)
require(graphics)
data(anscombe)
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  #print(anova(lmi))
}


## Now, do what you should have done in the first place: PLOTS
op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
#par(op)

How to derive R squared

\[ \begin{align} \sum_{i=1}^n (Y_i - \bar Y)^2 & = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \\ & = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ \end{align} \]

The cross term is zero: \(\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) = \sum_{i=1}^n e_i (\hat \beta_0 + \hat \beta_1 X_i - \bar Y) = \hat \beta_0 \sum_{i=1}^n e_i + \hat \beta_1 \sum_{i=1}^n e_i X_i - \bar Y \sum_{i=1}^n e_i = 0\) by the residual properties stated earlier, which yields the decomposition of the total variability into error and regression variability.

The relation between R squared and r

Recall that \((\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)\) so that \[ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = Cor(Y, X)^2 \] Since, recall, \[ \hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)} \] So, \(R^2\) is literally \(r\) squared.
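A quick check on the diamond data confirms this (our own addition; fit_d names a simple price-on-carat fit):

# R^2 from the model summary equals the squared sample correlation.
fit_d <- lm(price ~ carat, data = diamond)
c(r_squared = summary(fit_d)$r.squared,
  cor_squared = cor(diamond$price, diamond$carat)^2)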

Inference in regression

Inference

Consider the model that we have discussed so far: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] where \(\epsilon_i \sim N(0, \sigma^2)\) and

\[ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X \] \[ \hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} \]

From statistical inference we know that statistics like \(\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}\) often have the following properties:

  1. Is normally distributed and has a finite sample Student’s T distribution if the variance is replaced with a sample estimate (under normality assumptions).

  2. Can be used to test \(H_0 : \theta = \theta_0\) versus \(H_a : \theta >, <, \neq \theta_0\).

  3. Can be used to create a confidence interval for \(\theta\) via \(\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}\) where \(Q_{1-\alpha/2}\) is the relevant quantile from either a normal or T distribution.

Suffice it to say that, under assumptions on the way in which the \(X\) values are collected, the iid sampling model, and the mean model, the usual normal-theory results hold and allow us to create confidence intervals and perform hypothesis tests.

Consider the variances of the coefficient estimates:

\[ \sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2 \] \[ \sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2 \]

In practice, \(\sigma\) is replaced by its estimate. It is probably not surprising that, under iid Gaussian errors, \[ \frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}} \]

follows a \(t\) distribution with \(n-2\) degrees of freedom (and approximately a normal distribution for large \(n\)). This can be used to create confidence intervals and perform hypothesis tests.

Consider the following example using the diamond data set:

data(diamond)
y <- diamond$price; 
x <- diamond$carat; 
n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2)) 
ssx <- sum((x - mean(x))^2)
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma 
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; 
tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")

We create a coefficient table:

coefTable
Estimate Std. Error t value P(>|t|)
(Intercept) -259.6259 17.31886 -14.99094 0
x 3721.0249 81.78588 45.49715 0

The same result can be obtained from the lm fit:

fit <- lm(y ~ x); 
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -259.6259 17.31886 -14.99094 0
x 3721.0249 81.78588 45.49715 0

Below we compute 95% confidence intervals for the intercept and the slope (the slope interval is rescaled to a 0.1 carat change):

sumCoef <- summary(fit)$coefficients
Intercept <- sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
slope <- (sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]) / 10
rbind(Intercept,slope)
Intercept -294.4870 -224.7649
slope 355.6398 388.5651

We can conclude that with 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.

Prediction

Consider predicting \(Y\) at a value of \(X\); for example, predicting the price of a diamond given its carat weight.

The obvious estimate for prediction at point \(x_0\) is \[ \hat \beta_0 + \hat \beta_1 x_0 \]

However a standard error is needed to create a prediction interval.

There’s a distinction between intervals for the regression line at point \(x_0\) and the prediction of what a \(y\) would be at point \(x_0\). Specifically:

Line at \(x_0\) se, \(\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\)

Prediction interval se at \(x_0\), \(\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\)
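As a sketch of how these two standard errors can be computed by hand (x0 = 0.3 carats is an arbitrary illustrative value; x, y, n and fit are the diamond quantities defined above):

# Standard error of the fitted line and of a new prediction at x0.
x0 <- 0.3
sigma_hat <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)
se_line <- sigma_hat * sqrt(1 / n + (x0 - mean(x))^2 / ssx)
se_pred <- sigma_hat * sqrt(1 + 1 / n + (x0 - mean(x))^2 / ssx)
c(se_line = se_line, se_pred = se_pred)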

Let’s plot the confidence and prediction intervals:

newx = data.frame(x = seq(min(x), max(x), length = 100))
p1 = data.frame(predict(fit, newdata= newx,interval = ("confidence")))
p2 = data.frame(predict(fit, newdata = newx,interval = ("prediction")))
p1$interval = "confidence"
p2$interval = "prediction"
p1$x = newx$x
p2$x = newx$x
dat = rbind(p1, p2)
names(dat)[1] = "y"

g = ggplot(dat, aes(x = x, y = y))
g = g + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = interval), alpha = 0.2) 
g = g + geom_line()
g = g + geom_point(data = data.frame(x = x, y=y), aes(x = x, y = y), size = 4)
g

We can conclude the following:

Both intervals have varying widths, with the least width at the mean of the Xs.

We are quite confident in the regression line, so that interval is very narrow. If we knew \(\beta_0\) and \(\beta_1\) this interval would have zero width.

The prediction interval must incorporate the variability in the data around the line. Even if we knew \(\beta_0\) and \(\beta_1\), this interval would still have width.