
Source

Adapted from the course on regression models offered by Brian Caffo at the Johns Hopkins Bloomberg School of Public Health

Introduction to least squares and linear regression

Consider trying to answer questions such as: how do children’s heights relate to their parents’ heights, and can we use the parents’ heights to predict their children’s heights?

Let’s look first at the data, used by Francis Galton in 1885. Galton was a statistician who invented the term and concepts of regression and correlation, founded the journal Biometrika, and was a cousin of Charles Darwin. You may need to run install.packages("UsingR") if the UsingR library is not installed. Let’s look at the first few records of the dataset:

library(UsingR) 
library(reshape)
library(knitr)
library(printr)
data(galton)
kable(head(galton),align = 'c')
child parent
61.7 70.5
61.7 68.5
61.7 65.5
61.7 64.5
61.7 64.0
62.2 67.5

Changing the structure of the data to a long format:

long <- melt(galton)
kable(head(long),align = 'c')
variable value
child 61.7
child 61.7
child 61.7
child 61.7
child 61.7
child 62.2

Let’s look at the marginal distributions first (parents disregarding children, and children disregarding parents). The parent distribution is based on all heterosexual couples, with a correction for gender made by multiplying female heights by 1.08. Overplotting is an issue because the heights were discretized.

library(ggplot2)
g <- ggplot(long, aes(x = value, fill = variable)) 
g <- g + geom_histogram(colour = "black", binwidth=1) 
g <- g + facet_grid(. ~ variable)
g

Finding the middle via least squares

In this section we show how to find the middle of the distribution using least squares. Consider only the children’s heights. How could one describe the “middle”? One definition: let \(Y_i\) be the height of child \(i\) for \(i = 1, \ldots, n = 928\), and define the middle as the value of \(\mu\) that minimizes

\[\sum_{i=1}^n (Y_i - \mu)^2\]

This is the physical center of mass of the histogram. You might have guessed that the answer is \(\mu = \bar Y\). Indeed, the least squares estimate is the empirical mean, as shown below:

g <- ggplot(galton, aes(x = child)) + 
        geom_histogram(fill = "salmon", colour = "black", binwidth=1)+ 
        geom_vline(xintercept = mean(galton$child), size = 3)
g

We can prove this mathematically, as shown below:

\[ \begin{align} \sum_{i=1}^n \left(Y_i - \mu\right)^2 & = \ \sum_{i=1}^n \left(Y_i - \bar Y + \bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ 2 \sum_{i=1}^n \left(Y_i - \bar Y\right) \left(\bar Y - \mu\right) +\ \sum_{i=1}^n \left(\bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ 2 \left(\bar Y - \mu\right) \sum_{i=1}^n \left(Y_i - \bar Y\right) +\ \sum_{i=1}^n \left(\bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ 2 \left(\bar Y - \mu\right) \left(\left(\sum_{i=1}^n Y_i\right) -\ n \bar Y\right) +\ \sum_{i=1}^n \left(\bar Y - \mu\right)^2 \\ & = \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 + \ \sum_{i=1}^n \left(\bar Y - \mu\right)^2\\ & \geq \sum_{i=1}^n \left(Y_i - \bar Y\right)^2 \ \end{align} \]
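As a quick numerical sketch (our own addition, not part of the original notes), we can minimize the criterion directly over \(\mu\) and confirm that the minimizer matches the sample mean; the helper mse is a name we introduce for illustration:

# Minimize the sum of squared deviations numerically and compare with the mean.
mse <- function(mu, y) sum((y - mu)^2)
opt <- optimize(mse, interval = range(galton$child), y = galton$child)
c(optimal_mu = opt$minimum, sample_mean = mean(galton$child))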

Plotting the data

Now let’s create a scatterplot comparing children’s heights and their parents’ heights.

ggplot(galton, aes(x = parent, y = child)) + geom_point()

Unfortunately, many of the points overlap in this plot, so in the new plot below the size of each point represents the number of observations at that (X, Y) combination.

library(dplyr)
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour = "grey50", aes(size = freq + 20), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")                    
g

Regression through the origin

Suppose that \(X_i\) are the parents’ heights. Consider picking the slope \(\beta\) that minimizes

\[\sum_{i=1}^n (Y_i - X_i \beta)^2\]

This is exactly using the origin as a pivot point and picking the line through it that minimizes the sum of the squared vertical distances from the points to the line. In the code below we subtract the means so that the origin is the mean of the parents’ and children’s heights. In the lm call below we subtract 1 to remove the intercept. Later we’ll talk more about why this is the solution.
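In the meantime, for reference, here is the standard closed form (a textbook result, not derived in the original notes): setting the derivative of the criterion with respect to \(\beta\) to zero gives

\[ \frac{d}{d\beta} \sum_{i=1}^n (Y_i - X_i \beta)^2 = -2 \sum_{i=1}^n X_i (Y_i - X_i \beta) = 0 \quad \Rightarrow \quad \hat \beta = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2} \]

which is exactly the quantity computed by hand on the centered data later in this section.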

lm(I(child - mean(child))~ I(parent - mean(parent)) - 1, data = galton)
## 
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) - 
##     1, data = galton)
## 
## Coefficients:
## I(parent - mean(parent))  
##                   0.6463

Now we re-create the scatterplot with the fitted line, which is forced to pass through the mean of the parents’ heights and the mean of the children’s heights. The slope of this line is 0.646.

g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour = "grey50", aes(size = freq + 20), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")  
g <- g + geom_smooth(method="lm", formula=y~x)
g

Least squares estimation of regression lines

Fitting the best line

Let \(Y_i\) be the \(i^{th}\) child’s height and \(X_i\) be the \(i^{th}\) (average over the pair of) parents’ heights. Consider finding the best line: Child’s Height = \(\beta_0\) + \(\beta_1 \times\) Parent’s Height.

Using least squares, we minimize: \[ \sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2 \]

The least squares model fit to the line \(Y = \beta_0 + \beta_1 X\) through the data pairs \((X_i, Y_i)\) with \(Y_i\) as the outcome obtains the line \(Y = \hat \beta_0 + \hat \beta_1 X\) where

\[\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\]

\(\hat \beta_1\) has the units of \(Y / X\), and \(\hat \beta_0\) has the units of \(Y\). The line passes through the point \((\bar X, \bar Y)\).

The slope of the regression line with \(X\) as the outcome and \(Y\) as the predictor is \(Cor(Y, X) Sd(X)/ Sd(Y)\).

The slope is the same one you would get if you centered the data, \((X_i - \bar X, Y_i - \bar Y)\), and did regression through the origin.

If you normalized the data, \(\{ \frac{X_i - \bar X}{Sd(X)}, \frac{Y_i - \bar Y}{Sd(Y)}\}\), the slope is \(Cor(Y, X)\).

Now let’s double-check these calculations using R on Galton’s data.

y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) *  sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x)))
(Intercept) x
23.94153 0.6462906
23.94153 0.6462906

As you can see, the values from R’s lm function and our calculations are identical. Now let’s see what happens if we reverse the outcome/predictor relationship.

beta1 <- cor(y, x) *  sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))
(Intercept) y
46.13535 0.3256475
46.13535 0.3256475

You can also see below that regression through the origin yields an equivalent slope if you center the data first:

yc <- y - mean(y)
xc <- x - mean(x)
beta1 <- sum(yc * xc) / sum(xc ^ 2)
kable(c(beta1, coef(lm(y ~ x))[2]))
0.6462906
x 0.6462906

Another key point mentioned above is that normalizing the variables results in the slope being the correlation, as you can see below:

yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
kable(c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2]))
0.4587624
0.4587624
0.4587624

Finally, below is the scatterplot with the fitted linear regression line.

g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour = "grey50", aes(size = freq + 20), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")  
g <- g + geom_smooth(method="lm", formula=y~x)
g

Regression to the mean

Regression to the mean is a historically famous idea. It concerns questions such as why the children of very tall parents tend to be tall, yet not as tall as their parents, and why the children of very short parents tend to be short, yet not as short as their parents.

These phenomena are all examples of so-called regression to the mean. It was introduced by Francis Galton in the paper “Regression towards mediocrity in hereditary stature”, The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15 (1886).

Think of it this way: imagine simulating pairs of correlated random normals. The largest values of the first variable are the largest partly by chance, so the corresponding values of the second variable are likely to be smaller. In other words, \(P(Y < x | X = x)\) gets bigger as \(x\) heads into the very large values. Similarly, \(P(Y > x | X = x)\) gets bigger as \(x\) heads to very small values.
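Here is a small simulation sketch of this idea (our own illustration, not from the original notes; the correlation value 0.5 and the 95th percentile cutoff are arbitrary choices):

# Simulate correlated pairs of standard normals; among the most extreme first
# draws, the second draws are on average closer to the overall mean of zero.
set.seed(1)
sims <- 10000
r <- 0.5
first <- rnorm(sims)
second <- r * first + sqrt(1 - r^2) * rnorm(sims)  # Cor(first, second) is about r
top <- first > quantile(first, 0.95)               # the most extreme first draws
c(mean_first_top = mean(first[top]), mean_second_top = mean(second[top]))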

Think of the regression line as the intrinsic part of the relationship. Unless \(Cor(Y, X) = 1\), the intrinsic part isn’t perfect.

Suppose that we normalize \(X\) (the parents’ height) and \(Y\) (the child’s height) so that they both have mean 0 and variance 1. Then, recall, our regression line passes through \((0, 0)\), the mean of \(X\) and \(Y\). The slope of the regression line is \(Cor(Y,X)\) regardless of which variable is the outcome (recall, both standard deviations are 1). Notice that if \(X\) is the outcome and you create a plot where \(X\) is the horizontal axis, the slope of the least squares line that you plot is \(1/Cor(Y, X)\).

Below is an example of this:

library(UsingR)
data(father.son)
y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
rho <- cor(x, y)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_point(size = 5, alpha = .2, colour = "black")
g = g + geom_point(size = 4, alpha = .2, colour = "red")
g = g + xlim(-4, 4) + ylim(-4, 4)
g = g + geom_vline(xintercept = 0)
g = g + geom_hline(yintercept = 0)
g = g + geom_abline(intercept = 0, slope = 1)                   # identity line
g = g + geom_abline(intercept = 0, slope = rho, size = 2)       # regression of y on x
g = g + geom_abline(intercept = 0, slope = 1 / rho, size = 2)   # regression of x on y, plotted on these axes
g = g + xlab("Father's height, normalized")
g = g + ylab("Son's height, normalized")
g
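As a quick numerical check (our own addition), both regression slopes on the normalized father and son heights equal the sample correlation:

# With normalized data, the slope of y on x and the slope of x on y
# both equal the correlation rho.
c(rho = rho,
  slope_y_on_x = unname(coef(lm(y ~ x))["x"]),
  slope_x_on_y = unname(coef(lm(x ~ y))["y"]))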

Therefore we can conclude that, for normalized variables, the slope of the regression line is the correlation, which is less than one in absolute value unless the variables are perfectly correlated. Extreme values of one variable are therefore associated with less extreme expected values of the other; this is regression to the mean.

Statistical linear regression models

Basic regression model with additive Gaussian errors

In the previous sections we saw that least squares is an estimation tool, but how do we do inference?

Consider developing a probabilistic model for linear regression \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i} \]

Where the \(\epsilon_{i}\) are assumed iid \(N(0, \sigma^2)\).

Please note that \(E[Y_i ~|~ X_i = x_i] = \mu_i = \beta_0 + \beta_1 x_i\) and \(Var(Y_i ~|~ X_i = x_i) = \sigma^2\).

Also, the maximum likelihood (ML) estimates of \(\beta_0\) and \(\beta_1\) are the least squares estimates, given by

\[\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\]

Interpreting the intercept

Now let's talk about interpreting the intercept coefficient \(\beta_0\).

\(\beta_0\) is the expected value of the response when the predictor is 0 \[ E[Y | X = 0] = \beta_0 + \beta_1 \times 0 = \beta_0 \]

However, this isn’t always of interest; for example, when \(X = 0\) is impossible or far outside the range of the data (e.g., when \(X\) is blood pressure or height).

Shifting your \(X\) values by a value \(a\) changes the intercept, but not the slope, as shown below:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + a \beta_1 + \beta_1 (X_i - a) + \epsilon_i = \tilde \beta_0 + \beta_1 (X_i - a) + \epsilon_i \]

Therefore, \(a\) is often set to \(\bar X\) so that the intercept is interpreted as the expected response at the average \(X\) value.

Interpreting the slope

\(\beta_1\) is the expected change in response for a 1 unit change in the predictor : \[ E[Y ~|~ X = x+1] - E[Y ~|~ X = x] = \beta_0 + \beta_1 (x + 1) - (\beta_0 + \beta_1 x ) = \beta_1 \]
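As a quick illustration (our own example; the fit object fit_g and the parent heights 68 and 69 inches are arbitrary choices), predictions one unit apart differ by exactly the estimated slope:

# Predictions at parent heights one inch apart differ by the fitted slope.
fit_g <- lm(child ~ parent, data = galton)
diff(predict(fit_g, newdata = data.frame(parent = c(68, 69))))  # equals coef(fit_g)["parent"]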

Another important point is the impact of changing the units of \(X\). Multiplying \(X\) by a factor \(a\) results in dividing the coefficient by a factor of \(a\), as shown in the following calculation:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \frac{\beta_1}{a} (X_i a) + \epsilon_i = \beta_0 + \tilde \beta_1 (X_i a) + \epsilon_i \]

As an example, assume that \(X\) is height in \(m\) and \(Y\) is weight in \(kg\). Then \(\beta_1\) has units of \(kg/m\). Converting \(X\) to \(cm\) means multiplying \(X\) by \(100 cm/m\), so to keep the fitted values unchanged we must divide \(\beta_1\) by \(100 cm/m\), giving units of \(kg/cm\):

\[ X m \times \frac{100cm}{m} = (100 X) cm ~~\mbox{and}~~ \beta_1 \frac{kg}{m} \times\frac{1 m}{100cm} = \left(\frac{\beta_1}{100}\right)\frac{kg}{cm} \]

Using regression coefficients for prediction

If we would like to guess the outcome at a particular value of the predictor, say \(X\), the regression model’s guess is:

\[ \hat \beta_0 + \hat \beta_1 X \]

Let’s look at an example using the diamond data set from UsingR. The data consist of diamond prices (in Singapore dollars) and diamond weights in carats (a standard measure of diamond mass, 0.2 \(g\)). The first few records of the data are shown below:

data(diamond)
kable (head(diamond),align = 'c')
carat price
0.17 355
0.16 328
0.17 350
0.18 325
0.25 642
0.16 342

We first create a plot of the data with a fitted linear line.
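The plotting code is not included in the source; a minimal sketch using ggplot2 (loaded earlier) might look like this:

# Scatterplot of price versus carat with a fitted least squares line.
g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + geom_point(size = 4, colour = "blue", alpha = 0.3)
g <- g + geom_smooth(method = "lm", formula = y ~ x, colour = "black")
g <- g + xlab("Mass (carats)") + ylab("Price (SIN $)")
g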

Now we fit the linear regression model and find the regression coefficients:

fit <- lm(price ~ carat, data = diamond)
kable(coef(fit), align='c')
(Intercept) -259.6259
carat 3721.0249

Therefore, we estimate an expected 3721.02 (SIN) dollar increase in price for every carat increase in mass of diamond. The intercept -259.63 is the expected price of a 0 carat diamond.

Let’s get a more interpretable intercept, as described above:

fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
kable(coef(fit2), align='c')
(Intercept) 500.0833
I(carat - mean(carat)) 3721.0249

Thus $500.1 is the expected price for a diamond of average size in the data (0.2041667 carats).

In addition, with regard to the slope, a one carat increase in a diamond is pretty big. What about changing the units to 1/10th of a carat?

We can do this by simply dividing the coefficient by 10: we expect a 372.102 (SIN) dollar change in price for every 1/10th carat increase in the mass of the diamond. Below we show that the result is the same if we rescale the Xs and refit:

fit3 <- lm(price ~ I(carat * 10), data = diamond)
kable(coef(fit3), align='c')
(Intercept) -259.6259
I(carat * 10) 372.1025

Now it’s time to predict the price of a diamond. Below we predict with two different methods; both give identical results.

newx <- c(0.16, 0.27, 0.34)
Method_1 <- coef(fit)[1] + coef(fit)[2] * newx
Method_2  <- predict(fit, newdata = data.frame(carat = newx))
rbind(Method_1, Method_2)
1 2 3
Method_1 335.7381 745.0508 1005.523
Method_2 335.7381 745.0508 1005.523

The graph below presents the predicted values at the observed Xs (red) and at the new Xs (vertical lines).
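The plotting code is not shown in the source; one way to produce such a figure (a sketch assuming fit and newx from above are still in the workspace) is:

# Data, fitted line, fitted values at the observed carats (red points),
# and vertical lines marking the new carat values being predicted.
plot(diamond$carat, diamond$price,
     xlab = "Mass (carats)", ylab = "Price (SIN $)",
     bg = "lightblue", col = "black", cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
points(diamond$carat, predict(fit), pch = 19, col = "red")
for (nx in newx) abline(v = nx, lwd = 2, col = "grey")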

Residuals and residual variation

Residuals

Consider the linear regression model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\).

The observed outcome \(i\) is \(Y_i\) at predictor value \(X_i\). We denote the predicted outcome \(i\) at predictor value \(X_i\) by \(\hat Y_i\), which is:

\[ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i \]

We define the residual as the difference between the observed and predicted outcome. It is the vertical distance between the observed data point and the regression line:

\[ e_i = Y_i - \hat Y_i \]

The \(e_i\) can be thought of as estimates of the \(\epsilon_i\). The least squares method minimizes \(\sum_{i=1}^n e_i^2\), and one can show that \(E[e_i] = 0\).

If an intercept is included, \(\sum_{i=1}^n e_i = 0\)

Also, if a regressor variable, \(X_i\), is included in the model \(\sum_{i=1}^n e_i X_i = 0\).
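A quick numerical check of these two properties on the Galton fit (our own illustration; both sums are zero up to floating point error):

# Residuals sum to zero and are orthogonal to the included regressor.
fit_g <- lm(child ~ parent, data = galton)
c(sum_e = sum(resid(fit_g)), sum_e_x = sum(resid(fit_g) * galton$parent))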

Residuals are useful for investigating poor model fit. Positive residuals are above the line, negative residuals are below. Residuals can be thought of as the outcome (\(Y\)) with the linear association of the predictor (\(X\)) removed. One differentiates residual variation (variation after removing the predictor) from systematic variation (variation explained by the regression model). Residual plots can highlight poor model fit.

Let’s consider the diamond data set again, whose scatterplot we saw earlier.

Below we calculate the residuals with two different methods:

y <- diamond$price 
x <- diamond$carat
n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e -(y - yhat)))
## [1] 9.485746e-13
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
## [1] 9.485746e-13

In the figure below, the residuals are the signed lengths of the red lines:

plot(diamond$carat, diamond$price,  
     xlab = "Mass (carats)", 
     ylab = "Price (SIN $)", 
     bg = "lightblue", 
     col = "black", cex = 2, pch = 21,frame = FALSE)
abline(fit, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)

Now let’s plot the residuals versus the values of X:

plot(x, e,  
     xlab = "Mass (carats)", 
     ylab = "Residuals (SIN $)", 
     bg = "lightblue", 
     col = "black", cex = 2, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)

We don’t observe any specific pattern in the residuals, and that is good!

Residual analysis for non-linear data

Now let’s repeat this exercise, this time with non-linear data, and see how the residuals behave. Consider the following generated non-linear data:

x = runif(100, -3, 3); y = x + sin(x) + rnorm(100, sd = .2); 
library(ggplot2)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g

Now let’s look at the residual plot:

g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), 
           aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2); 
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g

Clearly the residual plot highlights the cyclic pattern that exists in the data.

Heteroskedasticity

Sometimes an unusual pattern cannot be observed in the scatterplot, but the residual plot clearly highlights it. One such phenomenon is heteroskedasticity, where the variability of the errors changes with \(X\), as in the example below:

x <- runif(100, 0, 6); y <- x + rnorm(100,  mean = 0, sd = .001 * x); 
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g

Here is the corresponding residual plot:

g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), 
           aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2); 
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g

Estimating the residual variation (residual standard error)

Consider the model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\).

The ML estimate of \(\sigma^2\) is \(\frac{1}{n}\sum_{i=1}^n e_i^2\), the average squared residual.

However most people use \[ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2. \]

where the \(n-2\) instead of \(n\) makes the estimator unbiased, i.e. \(E[\hat \sigma^2] = \sigma^2\).

Let’s find the residual variation in the diamond example with two methods:

y <- diamond$price
x <- diamond$carat
n <- length(y)
fit <- lm(y ~ x)
summary(fit)$sigma
## [1] 31.84052
sqrt(sum(resid(fit)^2) / (n - 2))
## [1] 31.84052

In the model summary below it is reported as the residual standard error:

summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.159 -21.448  -0.869  18.972  79.370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -259.63      17.32  -14.99   <2e-16 ***
## x            3721.02      81.79   45.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778 
## F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

Summarizing variation

Consider the diamond data residual plot once again:

diamond$e <- resid(lm(price ~ carat, data = diamond))
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g

Below we plot, first, the variation around the average price (intercept-only model) and, second, the variation around the regression line:

e = c(resid(lm(price ~ 1, data = diamond)),
      resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
               rep("Itc, slope", nrow(diamond))))
g = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
g = g + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
g = g + xlab("Fitting approach")
g = g + ylab("Residual price")
g


The total variability in our response is the variability around an intercept (think mean-only regression): \(\sum_{i=1}^n (Y_i - \bar Y)^2\)

The regression variability is the variability explained by adding the predictor: \(\sum_{i=1}^n (\hat Y_i - \bar Y)^2\)

The error variability is what is left over around the regression line: \(\sum_{i=1}^n (Y_i - \hat Y_i)^2\)

A neat fact is that \[ \sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \]
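We can verify this decomposition numerically on the diamond fit (an illustrative check, not in the original notes):

# Total sum of squares equals residual plus regression sums of squares.
fit_d <- lm(price ~ carat, data = diamond)
total_ss <- sum((diamond$price - mean(diamond$price))^2)
resid_ss <- sum(resid(fit_d)^2)
reg_ss <- sum((fitted(fit_d) - mean(diamond$price))^2)
c(total = total_ss, resid_plus_reg = resid_ss + reg_ss)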

R squared

R squared is the percentage of the total variability that is explained by the linear relationship with the predictor \[ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \]

\(R^2\) is the percentage of variation explained by the regression model.

Always \(0 \leq R^2 \leq 1\)

\(R^2\) is the sample correlation squared.

\(R^2\) can be a misleading summary of model fit. Deleting data can inflate \(R^2\). Adding terms to a regression model always increases \(R^2\).

Consider the following example using the anscombe data set; all four data sets produce essentially the same regression fit and \(R^2\), yet the plots show how different the underlying data can be:

require(stats)
require(graphics)
data(anscombe)
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  #print(anova(lmi))
}


## Now, do what you should have done in the first place: PLOTS
op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
#par(op)

How to derive R squared

\[ \begin{align} \sum_{i=1}^n (Y_i - \bar Y)^2 & = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \\ & = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ \end{align} \]

The cross term is zero: \(\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) = \sum_{i=1}^n e_i (\hat \beta_0 + \hat \beta_1 X_i - \bar Y) = \hat \beta_0 \sum_{i=1}^n e_i + \hat \beta_1 \sum_{i=1}^n e_i X_i - \bar Y \sum_{i=1}^n e_i = 0\) by the residual properties stated earlier, which yields the decomposition of the total variability into error and regression variability.

The relation between R squared and r

Recall that \((\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)\) so that \[ R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = Cor(Y, X)^2 \] Since, recall, \[ \hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)} \] So, \(R^2\) is literally \(r\) squared.
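A quick check on the diamond data confirms this (our own addition; fit_d names a simple price-on-carat fit):

# R^2 from the model summary equals the squared sample correlation.
fit_d <- lm(price ~ carat, data = diamond)
c(r_squared = summary(fit_d)$r.squared,
  cor_squared = cor(diamond$price, diamond$carat)^2)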

Inference in regression

Inference

Consider the model that we have discussed so far: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \] where \(\epsilon_i \sim N(0, \sigma^2)\) and

\[ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X \] \[ \hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} \]

From statistical inference we know that statistics like \(\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}\) often have the following properties:

  1. Is normally distributed and has a finite sample Student’s T distribution if the variance is replaced with a sample estimate (under normality assumptions).

  2. Can be used to test \(H_0 : \theta = \theta_0\) versus \(H_a : \theta >, <, \neq \theta_0\).

  3. Can be used to create a confidence interval for \(\theta\) via \(\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}\) where \(Q_{1-\alpha/2}\) is the relevant quantile from either a normal or T distribution.

Suffice it to say that, under assumptions on the way in which the \(X\) values are collected, the iid sampling model, and the mean model, the usual normal-theory results hold and allow us to create confidence intervals and perform hypothesis tests.

Consider the variances of the coefficient estimates:

\[ \sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2 \] \[ \sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2 \]

In practice, \(\sigma\) is replaced by its estimate. It is probably not surprising that, under iid Gaussian errors, \[ \frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}} \]

follows a \(t\) distribution with \(n-2\) degrees of freedom (and approximately a normal distribution for large \(n\)). This can be used to create confidence intervals and perform hypothesis tests.

Consider the following example using the diamond data set:

data(diamond)
y <- diamond$price; 
x <- diamond$carat; 
n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2)) 
ssx <- sum((x - mean(x))^2)
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma 
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; 
tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")

We create a coefficient table:

coefTable
Estimate Std. Error t value P(>|t|)
(Intercept) -259.6259 17.31886 -14.99094 0
x 3721.0249 81.78588 45.49715 0

The same result can be obtained from the lm fit:

fit <- lm(y ~ x); 
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -259.6259 17.31886 -14.99094 0
x 3721.0249 81.78588 45.49715 0

Below we compute 95% confidence intervals for the intercept and the slope (the slope interval is rescaled to a 0.1 carat change):

sumCoef <- summary(fit)$coefficients
Intercept <- sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
slope <- (sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]) / 10
rbind(Intercept,slope)
Intercept -294.4870 -224.7649
slope 355.6398 388.5651

We can conclude that with 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.

Prediction

Consider predicting \(Y\) at a value of \(X\); for example, predicting the price of a diamond given its carat weight.

The obvious estimate for prediction at point \(x_0\) is \[ \hat \beta_0 + \hat \beta_1 x_0 \]

However a standard error is needed to create a prediction interval.

There’s a distinction between intervals for the regression line at point \(x_0\) and the prediction of what a \(y\) would be at point \(x_0\). Specifically:

Line at \(x_0\) se, \(\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\)

Prediction interval se at \(x_0\), \(\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\)
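As a sketch of how these two standard errors can be computed by hand (x0 = 0.3 carats is an arbitrary illustrative value; x, y, n and fit are the diamond quantities defined above):

# Standard error of the fitted line and of a new prediction at x0.
x0 <- 0.3
sigma_hat <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)
se_line <- sigma_hat * sqrt(1 / n + (x0 - mean(x))^2 / ssx)
se_pred <- sigma_hat * sqrt(1 + 1 / n + (x0 - mean(x))^2 / ssx)
c(se_line = se_line, se_pred = se_pred)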

Let’s plot the confidence and prediction intervals:

newx = data.frame(x = seq(min(x), max(x), length = 100))
p1 = data.frame(predict(fit, newdata= newx,interval = ("confidence")))
p2 = data.frame(predict(fit, newdata = newx,interval = ("prediction")))
p1$interval = "confidence"
p2$interval = "prediction"
p1$x = newx$x
p2$x = newx$x
dat = rbind(p1, p2)
names(dat)[1] = "y"

g = ggplot(dat, aes(x = x, y = y))
g = g + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = interval), alpha = 0.2) 
g = g + geom_line()
g = g + geom_point(data = data.frame(x = x, y=y), aes(x = x, y = y), size = 4)
g

We can conclude the following:

Both intervals have varying widths, with the least width at the mean of the Xs.

We are quite confident in the regression line, so that interval is very narrow. If we knew \(\beta_0\) and \(\beta_1\) this interval would have zero width.

The prediction interval must incorporate the variability in the data around the line. Even if we knew \(\beta_0\) and \(\beta_1\), this interval would still have width.