This blog post will attempt a mathematically rigorous derivation of the least squares solution for a single independent variable, along with related ancillary issues.
The Problem
We are presented with two random variables of \(n\) cases each, \(X = (x_1, x_2, \ldots, x_n)\) and \(Y = (y_1, y_2, \ldots, y_n)\). We theorize that there is a relationship between them, and would like to 1) quantify the strength of this relationship, and 2) test whether this relationship could be due merely to random chance.
We further theorize this relationship has the form:
\[y = \beta_0 + \beta_1 x + \epsilon\]
The end result of this model gives us the expected value of \(Y\) given that \(X = x\):
\[E(Y ~|~ X = x)\]
Playing God
God Himself has handed us the population parameters directly, and they are as follows:
set.seed(1804)
n <- 100
x <- rnorm(n, mean=10, sd=2)
beta_0 <- 5
beta_1 <- 1.56
epsilon <- rnorm(n, mean=0, sd=sqrt(3))
y <- beta_0 + x * beta_1 + epsilon
These are not the ‘hat’ variables; these are the real variables. This practice will allow us to come to precise conclusions about the machinery of regression.
Note that:
- \(Y\) really is related to \(X\), simply and linearly
- The values of our randomness vector epsilon are independent of each other
- The vector epsilon has a single, common variance
- And the vector epsilon is normally distributed with a mean of 0 and variance \(\sigma^2 = 3\)
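We can also confirm these properties numerically. This is a minimal sketch using base R; shapiro.test is just one of several reasonable normality checks, and the exact values depend on the seed set above:
mean(epsilon)          # should be close to 0
var(epsilon)           # should be close to sigma^2 = 3
shapiro.test(epsilon)  # a Shapiro-Wilk test should not reject normality here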
Each of these properties is readily apparent in the graphs below:
par(mfrow=c(2, 2))
plot(density(x), main='Distribution of x')
plot(density(y), main='Distribution of Y')
plot(density(epsilon), main='Distribution of randomness component')
plot(x, y, main='Relationship of x, y')
The Solution
The foundation for our work is set. Our next task is to estimate \(\beta_0\) and \(\beta_1\). Our estimates are notated by a ‘hat’, i.e., \(\hat{\beta_0}\) and \(\hat{\beta_1}\). Since the simulated dataset is set up so nicely, our estimates should get pretty close.
How are these estimates made? Obviously, we want the model we’re developing, i.e.,
\[\hat{y} = \hat{\beta_0} + \hat{\beta_1} x,\]
to be as close to the real \(Y\) as possible. That is, we want to minimize the errors, or residuals,
\[\hat{e} = y - \hat{y}\] \[= y - (\hat{\beta_0} + \hat{\beta_1} x),\]
the difference between the actual data we have \(y\), and the predictions of our model \(\hat{y}\).
What are we minimizing?
As you could guess, we are dealing with an optimization problem, specifically, minimization. This will involve calculus. We know we need to minimize something to get our fitted parameters:
\[\hat{\beta_0} = \mathrm{argmin}_{\hat{\beta_0}} ~ ??? \] \[\hat{\beta_1} = \mathrm{argmin}_{\hat{\beta_1}} ~ ??? \]
The first thought is to minimize these errors directly, i.e., the sum of errors:
\[\sum_{i=1}^n \hat{e_i}\]
The problem with this arises when we realize that our predictions can be over or under reality, producing negative or positive errors, respectively (recall \(\hat{e} = y - \hat{y}\)). Minimizing the raw sum of errors would simply reward over-predicting the data as much as possible, since, e.g., an error of -100 is smaller than an error of -1. Worse, errors of +3 and -3 cancel, suggesting a perfect fit where none exists.
Of course, this is ridiculous. Instead, we will try to minimize the square of the errors, \(\hat{e_i}^2\). This has the effect of making all errors positive, avoiding the above issues. (It also has the effect of penalizing larger errors disproportionately more than smaller errors, which is often desirable.) Thus we arrive at the quantity we want to minimize, also known as the residual sum of squares:
\[RSS = \sum_{i=1}^n \hat{e_i}^2\]
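To make the objective concrete, here is a small illustrative helper (not part of the derivation; rss, b0, and b1 are throwaway names) that computes the RSS for any candidate intercept and slope, reusing the simulated x and y from above:
# RSS for a candidate intercept b0 and slope b1
rss <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)
rss(5, 1.56, x, y)  # RSS at the true population parameters
rss(0, 0, x, y)     # RSS for a poor candidate; much larger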
Minimizing for the intercept \(\hat{\beta_0}\)
Solve the problem:
\[\hat{\beta_0} = \mathrm{argmin}_{\hat{\beta_0}} ~ \sum_{i=1}^n \hat{e_i}^2 \]
in the usual way.
First, use the alternative definition of residuals given above:
\[\hat{\beta_0} = \mathrm{argmin}_{\hat{\beta_0}} ~ \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2\]
Step 1: Find partial derivative
Find the partial derivative:
\[\frac{\partial RSS}{\partial \hat{\beta_0}} = \frac{\partial}{\partial \hat{\beta_0}} \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2\]
Via the chain rule, set \(u = y_i - \hat{\beta_0} - \hat{\beta_1} x_i\) and let \(f = u^2\). The derivative of \(f\) with respect to \(u\) is \(2u\), and the partial derivative of \(u\) with respect to \(\hat{\beta_0}\) is \(-1\). This gives us:
\[ = 2 \cdot -1 \cdot \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)\] \[ = -2 \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)\]
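As a quick sanity check on this formula, we can compare it to a numerical derivative at an arbitrary point. This is purely illustrative; rss_fn, b0, and b1 are throwaway names, and x and y are the simulated vectors from above:
rss_fn <- function(b0, b1) sum((y - b0 - b1 * x)^2)
b0 <- 1; b1 <- 1; h <- 1e-6
(rss_fn(b0 + h, b1) - rss_fn(b0 - h, b1)) / (2 * h)  # numerical derivative at (b0, b1)
-2 * sum(y - b0 - b1 * x)                            # analytic derivative; should agree to several decimal places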
Step 2: Set partial derivative to 0 and solve
\[-2 \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0\] \[\Rightarrow ~ 0 = \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)\] \[ = \sum_{i=1}^n y_i - \sum_{i=1}^n \hat{\beta_0} - \sum_{i=1}^n \hat{\beta_1} x_i\]
Since \(\hat{\beta_0}\) is a constant, and we know that adding it to itself \(n\) times is equivalent to multiplying it by \(n\), we can write:
\[ = \sum_{i=1}^n y_i - n \hat{\beta_0} - \sum_{i=1}^n \hat{\beta_1} x_i\]
Add the middle term to both sides, and then divide both sides by \(n\):
\[ n \hat{\beta_0} = \sum_{i=1}^n y_i - \sum_{i=1}^n \hat{\beta_1} x_i\]
\[\Rightarrow ~ \hat{\beta_0} = \frac{\sum_{i=1}^n y_i}{n} - \hat{\beta_1} \frac{\sum_{i=1}^n x_i}{n}\]
These fractions are, of course, means. Finally:
\[\Rightarrow ~ \hat{\beta_0} = \overline{y} - \hat{\beta_1} \overline{x}\]
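Jumping ahead slightly, we can sanity-check this identity against the coefficients returned by R's built-in lm(), which fits the same model (fit, b1_hat are throwaway names):
fit <- coef(lm(y ~ x))
b1_hat <- unname(fit["x"])
all.equal(unname(fit["(Intercept)"]), mean(y) - b1_hat * mean(x))  # should be TRUE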
Minimizing for the slope \(\hat{\beta_1}\)
Solve the problem:
\[\hat{\beta_1} = \mathrm{argmin}_{\hat{\beta_1}} ~ \sum_{i=1}^n \hat{e_i}^2 \]
\[ = \mathrm{argmin}_{\hat{\beta_1}} ~ \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 \]
Step 1: Find partial derivative
Find the partial derivative:
\[\frac{\partial RSS}{\partial \hat{\beta_1}} = \frac{\partial}{\partial \hat{\beta_1}} \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2\]
Via the chain rule, again set \(u = y_i - \hat{\beta_0} - \hat{\beta_1} x_i\) and let \(f = u^2\). The derivative of \(f\) with respect to \(u\) is \(2u\), and the partial derivative of \(u\) with respect to \(\hat{\beta_1}\) is \(-x_i\). This gives us:
\[ = -2 \sum_{i=1}^n x_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)\]
Note that \(x_i\) must stay inside the sum, since it varies with \(i\).
Step 2: Set partial derivative to 0 and solve
\[-2 \sum_{i=1}^n x_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0\]
\[\Rightarrow ~ 0 = \sum_{i=1}^n x_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)\]
\[ = \sum_{i=1}^n (x y - \hat{\beta_0} x - \hat{\beta_1} x^2)\]
(From here on we drop the \(i\) subscripts inside the sums for brevity.)
\[ = \sum_{i=1}^n x y - \hat{\beta_0} \sum_{i=1}^n x - \hat{\beta_1} \sum_{i=1}^n x^2\]
Substitute in the result from above, \(\hat{\beta_0} = \overline{y} - \hat{\beta_1} \overline{x}\):
\[ = \sum_{i=1}^n x y - (\overline{y} - \hat{\beta_1} \overline{x}) \sum_{i=1}^n x - \hat{\beta_1} \sum_{i=1}^n x^2\]
\[ = \sum_{i=1}^n x y - \overline{y} \sum_{i=1}^n x + \hat{\beta_1} \overline{x} \sum_{i=1}^n x - \hat{\beta_1} \sum_{i=1}^n x^2\]
Factor out \(\hat{\beta_1}\) from the last two terms:
\[ = \sum_{i=1}^n x y - \overline{y} \sum_{i=1}^n x + \hat{\beta_1} \left( \overline{x} \sum_{i=1}^n x - \sum_{i=1}^n x^2 \right)\]
Subtract the latter term from both sides:
\[\Rightarrow ~ -\hat{\beta_1} \left( \overline{x} \sum_{i=1}^n x - \sum_{i=1}^n x^2 \right) = \sum_{i=1}^n x y - \overline{y} \sum_{i=1}^n x\]
Divide both sides by \(-\left( \overline{x} \sum_{i=1}^n x - \sum_{i=1}^n x^2 \right)\):
\[\Rightarrow ~ \hat{\beta_1} = \frac{\sum_{i=1}^n x y - \overline{y} \sum_{i=1}^n x }{\sum_{i=1}^n x^2 - \overline{x} \sum_{i=1}^n x}\]
Step 3: A more interpretable \(\hat{\beta_1}\)
This expression can be reworked into a more interpretable form:
\[\frac{SS_{xy}}{SS_{x}} = \frac{\sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y})}{ \sum_{i=1}^n (x_i - \overline{x})^2 }\]
The two forms are equivalent because \(\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) = \sum x y - \overline{y} \sum x\) and \(\sum_{i=1}^n (x_i - \overline{x})^2 = \sum x^2 - \overline{x} \sum x\). Here \(SS_{xy}\) is the sum of the products of deviations from \(\overline{x}\) and \(\overline{y}\), and \(SS_{x}\) is the sum of squared deviations from \(\overline{x}\).
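The equivalence is easy to confirm numerically with the simulated data (an illustrative check; b1_raw and b1_dev are throwaway names):
b1_raw <- (sum(x * y) - mean(y) * sum(x)) / (sum(x^2) - mean(x) * sum(x))
b1_dev <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
all.equal(b1_raw, b1_dev)  # should be TRUE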
Testing the Solution
Now we can apply the derived solution to our contrived data set:
# Accepts two numeric vectors of the same length n, and returns the slope and
# intercept parameters of the best fitting line, via the least squares method.
#
estimate_least_squares <- function(x, y){
  sxy <- sum((x - mean(x)) * (y - mean(y)))
  sxx <- sum((x - mean(x))^2)
  beta_1 <- sxy / sxx
  beta_0 <- mean(y) - beta_1 * mean(x)
  return(list('intercept'=beta_0,
              'slope'=beta_1,
              'sxy'=sxy,
              'sxx'=sxx))
}
m <- estimate_least_squares(x, y)
m
## $intercept
## [1] 2.767302
##
## $slope
## [1] 1.807502
##
## $sxy
## [1] 701.6798
##
## $sxx
## [1] 388.2041
Verify that our math and implementation are correct by comparing these results to R’s built-in lm():
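The original code chunk for this comparison isn't shown; a call along the following lines would produce the output below:
coef(lm(y ~ x))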
## (Intercept) x
## 2.767302 1.807502
Success! Examine how close our fitted line (red) is to the actual population line (blue):
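The plotting code for that figure isn't included above; a minimal sketch that reproduces it, using the m object from estimate_least_squares and the true beta_0 and beta_1 from the simulation, would be:
plot(x, y, main='Fitted vs. population line')
abline(a=m$intercept, b=m$slope, col='red')  # fitted least squares line
abline(a=beta_0, b=beta_1, col='blue')       # true population line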
We see our regression model does not predict exactly—both estimated parameters are off from what we know are the real parameters. However, it does seem to have mostly captured the systematic variation of the population.
Let’s write another function to generate our model’s predictions, from which we can compute the residuals. We observe the residuals are approximately normal and, as expected, appear consistent with the assumptions necessary for valid inference via regression analysis.
I plan to examine residual analysis in a future post.
# Accepts the results of estimate_least_squares, describing a simple linear
# model, including estimated slope and intercept parameters, as well as the
# numeric vector to predict on.
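# Returns the vector of predicted values (y-hat) for the given x.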
ls_predict <- function(estimate, x) {
return(estimate$intercept + estimate$slope * x)
}
y_hat <- ls_predict(m, x)
resid <- y - y_hat
plot(density(resid), main='Residuals')
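As a final, lightweight check on the claims above (a fuller residual analysis is deferred to that future post), the residuals' mean and variance can be inspected directly:
mean(resid)  # essentially 0 by construction of the least squares fit
var(resid)   # should be in the neighborhood of the true sigma^2 = 3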