Proof for Regression Line

This follows the videos at Khan Academy for proving the formula for a linear regression line. Thanks, Sal!

\[SE_{line} = \sum\limits_{i=1}^n (y_i - (mx_i+b))^2\]

\[SE_{line} = (y_1 - (mx_1+b))^2 + (y_2 - (mx_2+b))^2 + \dots + (y_n - (mx_n+b))^2\]

\[= y_1^2 - 2y_1(mx_1+b) + (mx_1+b)^2\] \[+ y_2^2 - 2y_2(mx_2+b) + (mx_2+b)^2\] \[\vdots\] \[+ y_n^2 - 2y_n(mx_n+b) + (mx_n+b)^2\]

\[=y_1^2 - 2y_1mx_1 - 2y_1b + m^2x_1^2 + 2mx_1b + b^2\] \[+y_2^2 - 2y_2mx_2 - 2y_2b + m^2x_2^2 + 2mx_2b + b^2\] \[\vdots\] \[+y_n^2 - 2y_nmx_n - 2y_nb + m^2x_n^2 + 2mx_nb + b^2\]

\[=(y_1^2 + y_2^2+\dots+y_n^2) -2m(y_1x_1 + y_2x_2 + \dots + y_nx_n) - 2b(y_1+y_2+\dots+y_n) + m^2(x_1^2+x_2^2+\dots+x_n^2) + 2mb(x_1+x_2+\dots+x_n) + nb^2\]

Note that:

\[\overline{y^2} = \frac{y_1^2+y_2^2+\dots+y_n^2}{n}\]

And so:

\[y_1^2+y_2^2+\dots+y_n^2 = n\overline{y^2}\]

Similarly:

\[x_1y_1 + x_2y_2 + \dots+x_ny_n = n\overline{xy}\]

Replacing the remaining sums with \(n\) times their means in the same way gives:

\[SE_{line} = n\overline{y^2} - 2mn\overline{xy} -2bn\overline{y} + m^2n\overline{x^2} + 2mbn\overline{x} + nb^2\]

Another way of working out the above:

\[SE_{line} = \sum\limits_{i=1}^n (y_i - (mx_i + b))^2\] \[= \sum\limits_{i=1}^n \left(y_i^2 - 2y_i(mx_i + b) + (mx_i + b)^2\right)\] \[=\sum\limits_{i=1}^n \left(y_i^2 -2my_ix_i - 2y_ib + m^2x_i^2 + 2mx_ib + b^2\right)\] \[= n\overline{y^2} - 2mn\overline{xy} -2bn\overline{y} + m^2n\overline{x^2} + 2mnb\overline{x} + nb^2\]
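As a quick numerical sanity check of the expansion, the sketch below evaluates the squared error two ways: directly from the definition, and from the form written in terms of means. The two values should agree up to floating-point rounding. The sample data, the trial values of \(m\) and \(b\), and the function names are assumptions made up for illustration; they do not come from the videos.

```python
# Hypothetical sample data, made up for illustration only.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 8.0]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def se_direct(m, b, xs, ys):
    """Squared error from the definition: sum of (y_i - (m*x_i + b))^2."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def se_from_means(m, b, xs, ys):
    """Squared error from the expanded form written in terms of means."""
    n = len(xs)
    mean_x = mean(xs)
    mean_y = mean(ys)
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    mean_x2 = mean([x * x for x in xs])
    mean_y2 = mean([y * y for y in ys])
    return (n * mean_y2 - 2 * m * n * mean_xy - 2 * b * n * mean_y
            + m ** 2 * n * mean_x2 + 2 * m * b * n * mean_x + n * b ** 2)


# Arbitrary trial line; any m and b should give matching values.
m_trial, b_trial = 1.5, -0.5
print(se_direct(m_trial, b_trial, xs, ys))
print(se_from_means(m_trial, b_trial, xs, ys))
```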

Now we can minimize the above expression. The data are fixed, so everything in it can be treated as a constant except \(m\) and \(b\). Letting those two vary traces out a surface in three dimensions: \(m\) and \(b\) are two of the axes, and the squared error is the third. The surface is a bowl-shaped paraboloid (a three-dimensional parabola). The goal is to find the lowest point of this surface, which is where both partial derivatives are zero, i.e.:

\[\frac{\partial SE}{\partial m} = 0\]

(this is the partial derivative with respect to the slope)

And:

\[\frac{\partial SE}{\partial b} = 0\]

(this is the partial derivative with respect to the y-intercept)

So the next step is to take the partial derivative of \(SE_{line}\) with respect to \(m\).

\[SE_{line} = n\overline{y^2} - 2mn\overline{xy} -2bn\overline{y} + m^2n\overline{x^2} + 2mbn\overline{x} + nb^2\]

The first term, \(n\overline{y^2}\), has no \(m\) in it, so it is a constant with respect to \(m\) and its derivative is zero. This is also true of the third term, \(-2bn\overline{y}\), and the last term, \(nb^2\).

So:

\[\frac{\partial SE}{\partial m} = -2n\overline{xy} + 2mn\overline{x^2} +2bn\overline{x}\]

Likewise, taking the partial derivative with respect to \(b\):

\[\frac{\partial SE}{\partial b} = -2n\overline{y} + 2mn\overline{x} + 2nb\]
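One way to gain confidence in these two derivatives is to compare them with a numerical estimate. The sketch below uses a central finite difference on the squared error itself; because the squared error is quadratic in \(m\) and \(b\), the estimate should match the formulas very closely. The sample data, the point \((m_0, b_0)\), the step size and all function names are assumptions chosen for illustration.

```python
# Hypothetical sample data, made up for illustration only.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 8.0]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def squared_error(m, b):
    """SE_line for the sample data above."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def dse_dm(m, b):
    """Partial derivative with respect to m, using the formula in the text."""
    n = len(xs)
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    mean_x2 = mean([x * x for x in xs])
    return -2 * n * mean_xy + 2 * m * n * mean_x2 + 2 * b * n * mean(xs)


def dse_db(m, b):
    """Partial derivative with respect to b, using the formula in the text."""
    n = len(xs)
    return -2 * n * mean(ys) + 2 * m * n * mean(xs) + 2 * n * b


def central_difference(f, t, h=1e-4):
    """Numerical derivative of a one-argument function f at t."""
    return (f(t + h) - f(t - h)) / (2 * h)


m0, b0 = 1.5, -0.5  # arbitrary point on the surface
numeric_dm = central_difference(lambda m: squared_error(m, b0), m0)
numeric_db = central_difference(lambda b: squared_error(m0, b), b0)
print(dse_dm(m0, b0), numeric_dm)
print(dse_db(m0, b0), numeric_db)
```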

Now we set each of these partial derivatives equal to 0 and simplify.

First, for m:

\[-2n\overline{xy} + 2mn\overline{x^2} +2bn\overline{x} = 0\] \[2n(-\overline{xy} + \overline{x^2}m + b\overline{x}) = 0\] \[-\overline{xy} + \overline{x^2}m + b\overline{x} = 0\]

Second, for \(b\) (the first, second and fourth terms of \(SE_{line}\) are constants with respect to \(b\), so they dropped out of the derivative):

\[-2n\overline{y} + 2mn\overline{x} + 2nb = 0\] \[-\overline{y} + m\overline{x} + b = 0\]

Now rewrite, moving toward \(mx+b\) form:

\[m\overline{x^2} + b\overline{x}= \overline{xy}\] \[m\overline{x} + b = \overline{y}\]

We want both of these in \(mx + b\) form, and the second is already there: it says that the point \((\overline{x},\overline{y})\) lies on the line that minimizes the squared error.

So for the first, divide both sides by \(\overline{x}\) (this assumes \(\overline{x} \neq 0\)):

\[m\frac{\overline{x^2}}{\overline{x}} + b = \frac{\overline{xy}}{\overline{x}}\]

Now we have another point on the minimizing line, \(\left(\frac{\overline{x^2}}{\overline{x}}, \frac{\overline{xy}}{\overline{x}}\right)\).

Now we can finish the problem in two ways: 1) use the two points to find the line, or 2) solve the two equations simultaneously.

Method 1: Solve for m

Subtract one equation from the other (multiply one by -1 first, then add them):

\[m\overline{x} + b = \overline{y}\] \[-m\frac{\overline{x^2}}{\overline{x}} - b = -\frac{\overline{xy}}{\overline{x}}\]

This results in:

\[m(\overline{x} - \frac{\overline{x^2}}{\overline{x}}) = \overline{y} - \frac{\overline{xy}}{\overline{x}}\] \[m = \frac{\overline{y} - \frac{\overline{xy}}{\overline{x}}}{\overline{x} - \frac{\overline{x^2}}{\overline{x}}}\]

If you compare this to the two points we found earlier, you can see that it is exactly the slope we would get from those two points: the change in \(y\) over the change in \(x\).

Next, simplify by multiplying numerator and denominator by \(\overline{x}\):

\[m = \frac{\overline{y} - \frac{\overline{xy}}{\overline{x}}}{\overline{x} - \frac{\overline{x^2}}{\overline{x}}} \times \frac{\overline{x}}{\overline{x}} = \frac{\overline{x}\,\overline{y}-\overline{xy}}{(\overline{x})^2 - \overline{x^2}}\]
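To see that the simplified formula agrees with the rise-over-run slope through the two points found earlier, the sketch below computes both for a small data set. The sample data and variable names are assumptions made up for illustration, and the second point is only defined when the mean of the x values is not zero.

```python
# Hypothetical sample data, made up for illustration only.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 8.0]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


mean_x = mean(xs)
mean_y = mean(ys)
mean_xy = mean([x * y for x, y in zip(xs, ys)])
mean_x2 = mean([x * x for x in xs])

# First point on the line: (mean of x, mean of y).
p1 = (mean_x, mean_y)
# Second point, from dividing the first equation by mean_x (needs mean_x != 0).
p2 = (mean_x2 / mean_x, mean_xy / mean_x)

# Slope as change in y over change in x between the two points.
slope_from_points = (p1[1] - p2[1]) / (p1[0] - p2[0])

# Slope from the simplified formula.
slope_simplified = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)

print(slope_from_points, slope_simplified)
```

For this data both slopes come out at about 1.4.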

Now you can plug in the actual values to find \(m\), and then use it to solve for \(b\) in \(m\overline{x} + b = \overline{y}\), or:

\[b = \overline{y} - m\overline{x}\]
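Putting the two results together, here is a minimal sketch of the whole calculation in plain Python. The function name `fit_line`, the input checks and the sample data are assumptions added for illustration; the formulas for \(m\) and \(b\) are the ones derived above.

```python
def fit_line(xs, ys):
    """Return (m, b) for the line y = m*x + b that minimizes the squared error."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    if len(set(xs)) < 2:
        # All x values are identical, so the denominator of m would be zero.
        raise ValueError("x values must not all be equal")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n

    # Slope: (mean_x * mean_y - mean_xy) / (mean_x^2 - mean of x^2).
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)
    # Intercept: the line passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical sample data, made up for illustration only.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 8.0]

m, b = fit_line(xs, ys)
print(m, b)
```

For this sample data the slope is about 1.4 and the intercept is about 0.3.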