Hypothetical data

We start with two data vectors, x and y.

x <- c(9, 3, 12, 1, 4)
y <- c(6, 7, 10, 4, 4)
plot(x, y)

Calculus explanation

We are trying to draw a straight line with the formula:

\[ y = mx + b \]

or equivalently:

\[ y = b_0 + b_1 x \]

Upon finding b_0 and b_1, we can plug x into the right-hand side of the equation to find “y_predicted.” To find the best b_0 and b_1, the least squares method minimizes the squared error. In other words, we are trying to minimize:

\[ y_{predicted} = b_0 + b_1 x \] \[ error = \sum(y_{predicted} - y)^2 \]

If we picture the error plotted against the candidate parameter values (i.e., the different errors produced by different choices of b_0 and b_1), the graph is a parabola in each parameter. We want the bottom of the parabola, the point at which the error is minimized.
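We can sketch this parabola in R. A minimal illustration (the helper name sse and the fixed value for b_0 are ours, chosen just for this plot): hold b_0 constant and compute the squared error over a grid of candidate slopes.

sse <- function(b0, b1) sum((b0 + b1 * x - y)^2)  # squared error for one candidate line
b1_grid <- seq(-1, 2, by = 0.01)                  # grid of candidate slopes
plot(b1_grid, sapply(b1_grid, function(b1) sse(3.73, b1)),
     type = "l", xlab = "candidate b_1", ylab = "squared error")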

Using calculus, we can find that minimum by taking the first derivative and setting it to zero. Since the error depends on two unknowns, b_0 and b_1, we have to do this twice: we take the partial derivative with respect to b_0 and the partial derivative with respect to b_1, and then solve the two resulting equations.

The partial derivative of (b_0 + b_1 * x - y)^2 with respect to b_1 is:

\[ \sum 2 (b_0 + b_1 x - y) x \] \[ = (\sum 2 x) b_0 + (\sum 2 x^2) b_1 - (\sum 2 x y) \]

The partial derivative of (b_0 + b_1 * x - y)^2 with respect to b_0 is:

\[ \sum 2 (b_0 + b_1 x - y) \] \[ = (\sum 2 x^0) b_0 + (\sum 2 x) b_1 - (\sum 2 y) \]

(i.e., you take the derivative with respect to b_0 or b_1, treating the other as a constant)

Plugging in our actual x and y values from above:

Setting the partial derivative with respect to b_1 to zero:

\[ 0 = (2*b_0*1 + 2*b_1*1^2 - 2*4*1) + (2*b_0*3 + 2*b_1*3^2 - 2*7*3) + (2*b_0*4 + 2*b_1*4^2 - 2*4*4) + (2*b_0*9 + 2*b_1*9^2 - 2*6*9) + (2*b_0*12 + 2*b_1*12^2 - 2*10*12) \] \[ 58*b_0 + 502*b_1 - 430 = 0 \]

Setting the partial derivative with respect to b_0 to zero:

\[ 0 = 2*(b_0 + b_1 * 1 - 4) + 2*(b_0 + b_1 * 3 - 7) + 2*(b_0 + b_1 * 4 - 4) + 2*(b_0 + b_1 * 9 - 6) + 2*(b_0 + b_1 * 12 - 10) \] \[ 10*b_0 + 58*b_1 - 62 = 0 \]
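As a quick sanity check on the arithmetic, the coefficients of both equations can be computed directly from x and y in R:

c(sum(2 * x), sum(2 * x^2), sum(2 * x * y))  # b_1 equation: 58, 502, 430
c(2 * length(x), sum(2 * x), sum(2 * y))     # b_0 equation: 10, 58, 62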

So now we just solve this system of two equations:

\[ 58 b_0 + 502 b_1 - 430 = 0 \] \[ 10 b_0 + 58 b_1 - 62 = 0 \]

Doing that we get 0.4251208 for b_1 and 3.7343 for b_0.
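Rather than eliminating by hand, we can also let R's solve() handle the 2 x 2 system (A_sys is our name for the coefficient matrix):

A_sys <- rbind(c(58, 502),  # 58*b_0 + 502*b_1 = 430
               c(10, 58))   # 10*b_0 +  58*b_1 = 62
solve(A_sys, c(430, 62))    # b_0 = 3.7343, b_1 = 0.4251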

We check our answer against the lm() function in R:

lm(y ~ x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      3.7343       0.4251

We see they are the same.
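With b_0 and b_1 in hand, we can also draw the fitted line over the original scatterplot; abline() takes the intercept as a and the slope as b:

plot(x, y)
abline(a = 3.7343, b = 0.4251208)  # intercept b_0, slope b_1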

Linear algebra explanation

We can think of x as a column vector and y as another column vector (cbind() turns each into a one-column matrix in R).

x <- cbind(x)
y <- cbind(y)
x
##       x
## [1,]  9
## [2,]  3
## [3,] 12
## [4,]  1
## [5,]  4
y
##       y
## [1,]  6
## [2,]  7
## [3,] 10
## [4,]  4
## [5,]  4

If y were a perfect linear combination of x, there would be some scalar, we’ll call it a_1, by which we could multiply x in order to get y. In other words:

\[ a_1 x = y \]

To allow for an intercept, we can add a_0 to the equation:

\[ a_0 + a_1 x = y \]

In other words:

\[ a_0 \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ \end{bmatrix} + a_1 \begin{bmatrix} 9 \\ 3 \\ 12 \\ 1 \\ 4 \\ \end{bmatrix} = \begin{bmatrix} 6 \\ 7 \\ 10 \\ 4 \\ 4 \\ \end{bmatrix} \]

or equivalently:

\[ \begin{bmatrix} 1 & 9 \\ 1 & 3 \\ 1 & 12 \\ 1 & 1 \\ 1 & 4 \\ \end{bmatrix} \begin{bmatrix}a_0\\a_1\end{bmatrix} = \begin{bmatrix} 6 \\ 7 \\ 10 \\ 4 \\ 4 \\ \end{bmatrix} \]

So we are essentially solving the classic Ax = b equation. To solve it, we would normally multiply both sides on the left by A^-1, but the problem here is that A is not invertible: it is not even square (it has 5 rows and 2 columns). So we first multiply both sides by A^T and then by the inverse of A^T A, which is a square, invertible matrix.

Step by step (note that x in this standard formulation stands for the unknown coefficient vector, not our data vector):

\[ A x = b \] \[ A^T A x = A^T b \] \[ (A^T A)^{-1} A^T A x = (A^T A)^{-1} A^T b \] \[ x = (A^T A)^{-1} A^T b \]

We do this in R to see if we get the same answer as before.

A <- cbind(rep(1,5), x)  # design matrix: a column of 1s plus the x column
b <- y
solve(t(A) %*% A) %*% t(A) %*% b
##           y
##   3.7342995
## x 0.4251208

We do.
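As an aside, explicitly inverting t(A) %*% A is usually avoided in numerical work. An equivalent formulation solves the normal equations directly (crossprod(A) computes t(A) %*% A, and crossprod(A, b) computes t(A) %*% b):

solve(crossprod(A), crossprod(A, b))  # same coefficients, no explicit inverse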