Mathematical Underpinnings

This section will eventually, I hope, walk through all of the math beneath linear regression.

A First Look at Response and Explanatory Variables

Let’s assume we have a data set with two variables, \(x\) and \(y\). We suspect there is some kind of dependency, or relationship, or to use the statistical term, correlation between them. First, let’s plot them.

testdata <- data.frame(x=c(1,2,3,4,5,6), y=c(-2,6.6,1.9,10.7,6.3,13.7))
with(testdata, plot(x, y, ylim=c(-3,14), main="A Linear Relationship"))

[Plot: A Linear Relationship]

It looks like there is, in fact, some kind of relationship between them. Note how the general pattern of the data points moves from the bottom left to the top right, and how this pattern appears to be more or less a straight line. For example, it doesn’t look like this, where the relationship appears to be curved:

plot(1:6, (1:6)^2,xlab="x",ylab="y",main="A Non-Linear Relationship")

[Plot: A Non-Linear Relationship]

Before moving on, keep in mind that linear regression will not work for the data in this plot. That is because the relationship between the explanatory and response variables is not linear. There are things we can do, like transform the data so that the transformed data has a linear relationship, or it may be that we need to use a technique other than linear regression. But put all of that aside for now, and forget this curved plot.
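(Just as a one-line sketch before we set it aside: for this particular curved example, regressing \(y\) on \(x^2\) instead of on \(x\) would give a perfectly linear relationship. The data frame name below is only for illustration.)

# A sketch only: the curved data above becomes linear if we
# use x^2 as the explanatory variable instead of x.
curved <- data.frame(x=1:6, y=(1:6)^2)
lm(y ~ I(x^2), data=curved)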

So we believe we have a linear relationship between our \(x\) and \(y\) in our first plot. In fact, it looks like we can draw a straight line to estimate this relationship. Let’s just take a close-your-eyes, pin-the-tail-on-the-donkey try at it:

with(testdata, plot(x, y, ylim=c(-3,14), main="A Linear Relationship\nA very poor attempt to fit the data"))
abline(1, 0.8)

[Plot: A Linear Relationship, with a very poor attempt to fit the data]

We end up with only two data points below our line, and four above. Conclusion: Don’t draw lines with your eyes closed. It annoys the donkey. Let’s make another attempt.

with(testdata, plot(x, y, ylim=c(-3,14), main="A Linear Relationship\nA better fitting line"))
abline(a=0,b=1.8)

[Plot: A Linear Relationship, with a better fitting line]

Our eyeball estimate looks pretty good. But we don’t know if it’s the best possible line. Maybe we could move it up or down a bit. Maybe we could tilt it a bit more, or less, steeply. How do we quantify one line against another? For example, which of the following lines is better?

with(testdata, plot(x, y, ylim=c(-3,14), main="A Linear Relationship\nA better fitting line"))
abline(a=0,b=1.8, col="red")
abline(a=-0.3,b=1.9, col="blue")

[Plot: A Linear Relationship, with the two candidate lines in red and blue]

Our eyes can’t decide. We need an analytical approach. Using such an approach is the first big goal of linear regression: Calculating the best possible fit for a linear model to the data. The second big goal is determining whether that fit is good enough to be useful.

Determining the Cost of a Linear Fit

Let’s go back to the first of the two lines we just compared, the one with intercept \(0\) and slope \(1.8\). What is the “cost” of this line, compared to other lines? We can calculate any given line’s cost by finding out how far each observed value of our response variable, \(y\), is from the estimated value as provided by our line.

with(testdata, plot(x, y, ylim=c(-3,14), main="A Linear Relationship\nA better fitting line"))
abline(a=0,b=1.8)
# Draw a dashed vertical segment from each observed y to its estimate on the line
for(x in 1:6) {
  lines(c(x,x), c(testdata$y[x], testdata$x[x] * 1.8), lty=3)
}

[Plot: A Linear Relationship, with dashed segments from each point to the line]

The dashed lines show how far each observed value of \(y\) is from our estimate of it. We can sum up all of those distances.

sum(testdata$y - testdata$x*1.8)
## [1] -0.6

This shows that the total distance is only \(-0.6\). That doesn’t seem right! The first data point alone is about \(-3.8\) (the observed value is 3.8 below its estimate). What went wrong? Note that some of the points are above the line, and some are below. Those above the line give positive differences (the observed value is greater than the estimate), and those below the line give negative differences (the observed value is less than the estimate), so they tend to cancel each other out. An analogy would be saying the grocery store is a mile north and the school is a mile south, so the total distance for both is zero!
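We can see the cancellation directly by printing the individual differences (each observed \(y\) minus the line’s estimate, \(1.8x\)):

# Individual differences; the positive and negative values largely cancel when summed.
testdata$y - testdata$x*1.8
## [1] -3.8  3.0 -3.5  3.5 -2.7  2.9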

So maybe we can use absolute values of the distances?

sum(abs(testdata$y - testdata$x*1.8))
## [1] 19.4

This seems more reasonable, at least for our current data set. But note that each of our points is roughly the same distance above or below the line. What if the data set looked different, with the data points farther away from the line?

testdata2 <- data.frame(x=c(1,2,3,4,5,6), y=c(-6.6,9.6,-1.9,13.8,3.1,15.4))
with(testdata2, plot(x, y, ylim=c(-6,16), main="Another Linear Relationship\nData more spread out"))
abline(a=0,b=1.8)
for(x in 1:6) {
  lines(c(x,x), c(testdata2$y[x], testdata2$x[x] * 1.8), lty=3)
}

[Plot: Another Linear Relationship, with the data more spread out]

Here is the total cost, using our current technique of summing up the absolute values of the differences.

sum(abs(testdata2$y - testdata2$x*1.8))
## [1] 38.8

So the distance of each individual point is about twice as great, and our estimate of the total cost is also about twice what it was. But without going into the mathematical details now (not sure if I’ll ever be able to!), the cost in this case should be more than double: points that sit far from the line ought to be penalized much more heavily than points that sit close to it. To accomplish that, the distance of each point from its estimate should be squared before summing. That gives us the true cost.

sum((testdata2$y - testdata2$x*1.8)^2)
## [1] 259.4

Note that we no longer need to use absolute distances, because the square of any number is never negative. Now let’s also get the cost of our original data set, using the sum of squared distances.

sum((testdata$y - testdata$x*1.8)^2)
## [1] 63.64

So, the cost of our second data set is about four times the cost of our original data set. Why? Well, we roughly doubled the distance of each \(y\) value from its estimate, and squaring a doubled distance multiplies it by \(2^2 = 4\).
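Here is a tiny illustration of that scaling, using some made-up distances just for this purpose:

# Squaring means that doubling every distance multiplies the total cost by four.
r <- c(1, -2, 3)   # example distances from a line
sum(r^2)
## [1] 14
sum((2*r)^2)
## [1] 56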

OK, let’s go back to our original data set. Here is the plot again.

with(testdata, plot(x, y, ylim=c(-3,14), main="A Linear Relationship"))
abline(a=0,b=1.8)
for(x in 1:6) {
  lines(c(x,x), c(testdata$y[x], testdata$x[x] * 1.8), lty=3)
}

[Plot: A Linear Relationship, with dashed segments from each point to the line]

We have determined that the cost of this estimated line, using the sum of squared differences, is about 63. Now let’s try that other estimated line and see what the cost is. The new line is in blue; I’ve included the old line in gray just for reference.

with(testdata, plot(x, y, ylim=c(-3,14), main="A Linear Relationship"))
abline(a=0,b=1.8,col="gray")
abline(a=-0.3,b=1.9, col="blue")
for(x in 1:6) {
  lines(c(x,x), c(testdata$y[x], testdata$x[x] * 1.9 - 0.3), lty=3)
}

[Plot: A Linear Relationship, with the new line in blue and the old line in gray]

And here is the cost for the new line.

sum((testdata$y - (testdata$x*1.9 - 0.3))^2)
## [1] 61.55

The cost for this line is just a bit lower, so the blue line fits slightly better, but only slightly. We could try more micro-adjustments, make more plots, calculate more costs…this could take all day! We need a way of determining the best possible line. Any line is made up of two components: its y-intercept and its slope. So we need to determine the best combination of y-intercept and slope.
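Rather than retyping the sum for every candidate line, we can wrap the calculation in a small helper function (a sketch; the name cost is arbitrary) and compare any intercept/slope combination directly:

# Sum of squared distances of the data from the line y = a + b*x
cost <- function(a, b) sum((testdata$y - (a + b*testdata$x))^2)
cost(0, 1.8)     # the gray line
## [1] 63.64
cost(-0.3, 1.9)  # the blue line
## [1] 61.55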

Let’s begin by taking one of these items out of consideration. Let’s assume that our y-intercept will be \(0\). In the real world, it will almost never be exactly that, but it keeps the problem simple. So for the moment, we only need to concern ourselves with slope.
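As a quick brute-force illustration of what we are trying to do (not the analytical approach we are after), we can fix the intercept at \(0\), compute the cost over a whole grid of candidate slopes, and keep the cheapest one. The grid values below are arbitrary.

# Try many candidate slopes, with the intercept fixed at 0, and pick the cheapest.
slopes <- seq(1.5, 2.5, by=0.01)
costs <- sapply(slopes, function(b) sum((testdata$y - b*testdata$x)^2))
slopes[which.min(costs)]   # roughly 1.9 for this data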

Minimizing Squared Differences

Finding the best fit for a model between two variables means minimizing the cost over all possible models for the data. To do this, we need to measure how far each observed value of the response variable lies from a given model; the total squared distance of the response variable from the model is its cost. Assume that the response variable is \(Y\), and start with the simplest possible model, a single constant \(\mu\). We then wish to find the value of \(\mu\) that minimizes the following sum of squares:

\[\sum_{i=1}^n (Y_i - \mu)^2\]

If you are accustomed to thinking of \(\mu\) as a mean, particularly a population mean, drop that notion here. It is only an arbitrary constant. The question to consider is: what is the best possible value for \(\mu\)? The following proves that the value of \(\mu\) that minimizes this sum is \(\bar{Y}\), the mean of \(Y\).

The first step is to both add and subtract \(\bar{Y}\) inside the squared term. This changes nothing, since the two cancel; the reason for doing it will become clear shortly.

\[\sum_{i=1}^n (Y_i - \mu)^2 = \sum_{i=1}^n (Y_i -\bar{Y} + \bar{Y} - \mu)^2\]

We can now group the left two and the right two terms. \[= \sum_{i=1}^n ((Y_i - \bar{Y}) + (\bar{Y} - \mu))^2\]

Now we expand the square. \[= \sum_{i=1}^n \left[(Y_i - \bar{Y})^2 + 2(Y_i - \bar{Y})(\bar{Y} - \mu) + (\bar{Y} - \mu)^2\right]\]

The above sum of all the terms can be rewritten as sums of the individual terms. \[= \sum_{i=1}^n (Y_i - \bar{Y})^2 + \sum_{i=1}^n 2(Y_i - \bar{Y})(\bar{Y} - \mu) + \sum_{i=1}^n (\bar{Y} - \mu)^2\]

In the middle term, both \(2\) and \((\bar{Y} - \mu)\) are effectively constants, as they do not involve \(i\). (Their values do not change as \(i\) changes.) Therefore, they can be moved in front of the \(\sum\). \[= \sum_{i=1}^n (Y_i - \bar{Y})^2 + 2(\bar{Y} - \mu)\sum_{i=1}^n (Y_i - \bar{Y}) + \sum_{i=1}^n (\bar{Y} - \mu)^2\]

Now consider the middle \(\sum\). Because \(\bar{Y}\) does not change as \(i\) changes, summing \(Y_i - \bar{Y}\) is the same as summing the \(Y_i\) and then subtracting \(\bar{Y}\) once for each of the \(n\) values, i.e., subtracting \(n\bar{Y}\). \[= \sum_{i=1}^n (Y_i - \bar{Y})^2 + 2(\bar{Y} - \mu)\left[\sum_{i=1}^n Y_i - n\bar{Y}\right] + \sum_{i=1}^n (\bar{Y} - \mu)^2\]

Note that \(\sum_{i=1}^n Y_i\) is equal to \(n\bar{Y}\). If you have three numbers, 1, 2 and 3, then their mean (\(\bar{Y}\)) is \(2\), so \(n\bar{Y} = 3 \times 2 = 6\), and the sum of the numbers is also \(6\). That means \(\sum_{i=1}^n Y_i - n\bar{Y}\) is \(0\), so the entire middle term is zero and drops out.

\[= \sum_{i=1}^n (Y_i - \bar{Y})^2 + \sum_{i=1}^n (\bar{Y} - \mu)^2\]

The right-hand term, \(\sum_{i=1}^n (\bar{Y} - \mu)^2 = n(\bar{Y} - \mu)^2\), is a sum of squares, so it is always zero or greater; it can never be negative, and it is exactly zero only when \(\mu = \bar{Y}\). Therefore:

\[ \sum_{i=1}^n (Y_i - \mu)^2 \ge \sum_{i=1}^n (Y_i - \bar{Y})^2\]

From this, we can conclude that the value of \(\mu\) that gives us the lowest possible sum of squared differences is \(\bar{Y}\).
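As a quick numerical sanity check of this result, using our original \(y\) values: any value of \(\mu\) other than the mean should give a larger sum of squares.

# The sum of squared differences is smallest when mu equals the mean of Y.
Y <- testdata$y
ss <- function(mu) sum((Y - mu)^2)
ss(mean(Y))
## [1] 162.4
ss(mean(Y) + 1)
## [1] 168.4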

Linear Regression in R

This section will explore linear regression in R. The ultimate goal is to walk through all of the steps individually, from scratch, and then work toward using the built-in functions in R to make life faster and easier.

Using the lm function

The data set used is mtcars, built into R. We will consider engine displacement versus mileage. Here is a simple plot.

plot(mtcars$mpg, mtcars$disp, cex=0.5)

[Plot: mtcars displacement versus mileage]

There is apparently a correlation between the two variables. It is negative, and fairly strong. It is not entirely apparent from the plot that the relationship is linear, and if it is not, linear regression would not be appropriate. But for the purpose of illustration, we will assume that it is.
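We can put a number on that correlation (a quick check; the exact figure is not important here):

# Correlation between mileage and displacement: negative and reasonably strong
# (roughly -0.85 for this data set).
cor(mtcars$mpg, mtcars$disp)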

We might try to pick a line to fit the data by eye.

plot(mtcars$mpg, mtcars$disp, cex=0.5)
lines(x=c(10,30), y=c(380,65))

[Plot: mtcars displacement versus mileage, with an eyeballed line]

Now we can build a linear model of the data, and use that to plot a least-squares regression line. Our guess is in red, the actual regression line is in blue. We weren’t too far off!

mtcars.model <- lm(disp ~ mpg, data=mtcars)
plot(mtcars$mpg, mtcars$disp, cex=0.5)
lines(x=c(10,30), y=c(380,65), col="red")
abline(reg=mtcars.model, col="blue")

[Plot: mtcars displacement versus mileage, with the eyeballed line in red and the regression line in blue]

Note that, in the abline call, the fitted model is passed via the reg parameter. abline accepts any object that can be passed to the coef function, which for our model returns a vector of two values. The first is the line’s intercept (on the y-axis), and the second is the slope.
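For example, a quick look at those two values (this is just an inspection step, separate from the plotting code):

# The fitted intercept and slope, in that order.
coef(mtcars.model)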

You could also plot it like this:

plot(mtcars$mpg, mtcars$disp, cex=0.5)
abline(mtcars.model$coefficients[1], mtcars.model$coefficients[2], col="blue")

[Plot: mtcars displacement versus mileage, with the regression line in blue]
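Finally, tying back to the goal of working through the steps from scratch: the least-squares intercept and slope can also be computed directly from the data using the standard closed-form formulas (stated here without derivation; this is a sketch, not a replacement for lm), and the result should agree with what lm reports.

# Least-squares slope and intercept computed by hand from the data
b <- with(mtcars, sum((mpg - mean(mpg)) * (disp - mean(disp))) /
                  sum((mpg - mean(mpg))^2))
a <- with(mtcars, mean(disp) - b * mean(mpg))
c(intercept=a, slope=b)
coef(mtcars.model)   # should match the hand calculation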