Assumptions and Definitions

Assume that the relationship between \(X\) and \(Y\) is determined by a hypothetical model of the form:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

We collect a sample of paired observations \((x_i, y_i)\) to use as a training set, and we use the training set to create a fitted model. The fitted model is an approximation of the hypothetical model, and can be written in either of the following (equivalent) forms:

\[\hat Y = \hat \beta_0 + \hat \beta_1 X\] \[Y = \hat \beta_0 + \hat \beta_1 X + \hat \varepsilon\]

Given a paired observation \(\left(x_i, y_i \right)\), the fitted value of \(y\) at \(x_i\) is:

\[\hat y_i = \hat\beta_0 + \hat\beta_1 x_i\]

The residual associated with the observation is given by:

\[\hat \varepsilon_i = y_i - \hat y_i\]
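To make the definitions of fitted values and residuals concrete, here is a minimal sketch in Python with NumPy. The data arrays and the candidate estimates `b0_hat` and `b1_hat` are hypothetical values chosen for illustration; they do not come from the text.

```python
import numpy as np

# Hypothetical training data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Hypothetical candidate parameter estimates, picked by eye rather than fitted.
b0_hat, b1_hat = 0.0, 2.0

# Fitted values: y_hat_i = b0_hat + b1_hat * x_i
y_hat = b0_hat + b1_hat * x

# Residuals: eps_hat_i = y_i - y_hat_i
residuals = y - y_hat

print(y_hat)
print(residuals)
```

Each residual is the vertical distance between an observed \(y_i\) and the line evaluated at \(x_i\); a different choice of estimates would give a different set of residuals.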

Least Squares Regression

We will select our fitted model \(\hat Y = \hat \beta_0 + \hat \beta_1 X\) by choosing parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\) so that the following quantities are minimized on the training set:

\[SSE = \sum\limits_{i=1}^n \hat \varepsilon_i^2 \hspace{2 em} MSE = \frac{1}{n}\sum\limits_{i=1}^n \hat \varepsilon_i^2\]
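As a rough illustration, the two quantities above can be computed for any proposed pair of estimates. The function names `sse` and `mse` and the data below are assumptions made for this sketch, not part of the original text.

```python
import numpy as np

def sse(b0_hat, b1_hat, x, y):
    """Sum of squared residuals for the proposed estimates."""
    residuals = y - (b0_hat + b1_hat * x)
    return float(np.sum(residuals ** 2))

def mse(b0_hat, b1_hat, x, y):
    """Mean squared residual: SSE divided by the sample size n."""
    return sse(b0_hat, b1_hat, x, y) / len(x)

# Hypothetical training data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

sse_value = sse(0.0, 2.0, x, y)
mse_value = mse(0.0, 2.0, x, y)
print(sse_value, mse_value)
```

Because \(MSE\) is just \(SSE\) scaled by the constant \(1/n\), the same pair of estimates minimizes both.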

At present, we will work with \(SSE\). To emphasize that \(SSE\) is a function of the proposed parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\), we sometimes write:

\[SSE\left ( \hat\beta_0, \hat\beta_1 \right) = \sum\limits_{i=1}^n \hat \varepsilon_i^2 = \sum\limits_{i=1}^n \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) ^2\]

Preliminary Derivations

Before we derive the optimal values for our parameter estimates, we need to establish two preliminary results. Establishing these results now will make our derivation of \(\hat\beta_0\) and \(\hat\beta_1\) go a bit more smoothly. Assume that we have a sample of \(n\) paired observations \((x_i, y_i)\). We define the quantities \(S_{XX}\) and \(S_{XY}\) as follows:

\[S_{XX} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)^2\] \[S_{XY} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)\left(y_i - \bar y \right)\] Notice that we can rewrite the expression for \(S_{XY}\) as follows:

\[S_{XY} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)\left(y_i - \bar y \right) = \sum\limits_{i=1}^n \left(x_i y_i - \bar x y_i - \bar y x_i + \bar x \bar y \right) \] \[= \sum\limits_{i=1}^n x_i y_i - \bar x \sum\limits_{i=1}^n y_i - \bar y \sum\limits_{i=1}^n x_i + \sum\limits_{i=1}^n\bar x \bar y\] \[ = \sum\limits_{i=1}^n x_i y_i - n\bar x \bar y - n\bar y \bar x + n\bar x \bar y = \sum\limits_{i=1}^n x_i y_i - n\bar x \bar y \]

In summary, we have shown that: \[S_{XY} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)\left(y_i - \bar y \right) = \sum\limits_{i=1}^n x_i y_i - n\bar x \bar y\] A very similar argument shows that: \[S_{XX} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)^2 = \sum\limits_{i=1}^n x_i^2 - n\bar x^2\] We will need both of these expressions in the near future.
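The two identities can be sanity-checked numerically. The sketch below uses a small hypothetical data set (not from the text) and compares the centered definitions with the expanded forms.

```python
import numpy as np

# Hypothetical sample (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Centered definitions.
s_xx_def = np.sum((x - x_bar) ** 2)
s_xy_def = np.sum((x - x_bar) * (y - y_bar))

# Expanded forms derived above.
s_xx_alt = np.sum(x ** 2) - n * x_bar ** 2
s_xy_alt = np.sum(x * y) - n * x_bar * y_bar

print(s_xx_def, s_xx_alt)
print(s_xy_def, s_xy_alt)
```

The two versions of each quantity should agree up to floating-point rounding. In floating-point arithmetic the centered definitions are generally the safer choice, since the expanded forms subtract two potentially large numbers.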

Derivation of Parameter Estimates

We will now derive formulas for \(\hat\beta_0\) and \(\hat\beta_1\) that will minimize \(SSE\left ( \hat\beta_0, \hat\beta_1 \right)\). To do this, we will need to differentiate this function with respect to both \(\hat\beta_0\) and \(\hat\beta_1\), and then set the resulting expressions to 0.

\[\frac{\partial}{\partial\hat\beta_0} SSE\left ( \hat\beta_0, \hat\beta_1 \right) = -2 \sum\limits_{i=1}^n \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right)\] \[\frac{\partial}{\partial\hat\beta_1} SSE\left ( \hat\beta_0, \hat\beta_1 \right) = -2 \sum\limits_{i=1}^n x_i\left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right)\]

Setting these two expressions to zero and dividing both sides by -2, we get:

\[\sum\limits_{i=1}^n \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = 0\] \[\sum\limits_{i=1}^n x_i\left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = 0\]

The two equations above are referred to as the normal equations. We need to solve these equations for \(\hat\beta_0\) and \(\hat\beta_1\).
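The partial derivatives that give rise to the normal equations can be checked against a finite-difference approximation. This is only a sketch: the data, the evaluation point, and the helper names `sse_gradient` and `numeric_gradient` are hypothetical.

```python
import numpy as np

# Hypothetical training data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sse(b0_hat, b1_hat):
    return np.sum((y - b0_hat - b1_hat * x) ** 2)

def sse_gradient(b0_hat, b1_hat):
    """Analytic partial derivatives of SSE, as derived above."""
    residuals = y - b0_hat - b1_hat * x
    return np.array([-2.0 * np.sum(residuals),
                     -2.0 * np.sum(x * residuals)])

def numeric_gradient(b0_hat, b1_hat, h=1e-6):
    """Central-difference approximation of the same partial derivatives."""
    d_b0 = (sse(b0_hat + h, b1_hat) - sse(b0_hat - h, b1_hat)) / (2 * h)
    d_b1 = (sse(b0_hat, b1_hat + h) - sse(b0_hat, b1_hat - h)) / (2 * h)
    return np.array([d_b0, d_b1])

analytic = sse_gradient(0.0, 2.0)
numeric = numeric_gradient(0.0, 2.0)
print(analytic, numeric)
```

At a point that is not the minimizer, such as the one used here, the gradient is nonzero; the normal equations single out the point where both components vanish.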

Notice that we can rewrite the first normal equation as follows:

\[\sum\limits_{i=1}^n y_i - \sum\limits_{i=1}^n\hat\beta_0 - \sum\limits_{i=1}^n\hat\beta_1 x_i = 0\]

\[n \bar y - n\hat\beta_0 -n \hat\beta_1 \bar x = 0\] \[\hat\beta_0 = \bar y - \hat\beta_1 \bar x\]

We will now work on the second normal equation:

\[\sum\limits_{i=1}^n \left(x_i y_i - \hat\beta_0 x_i - \hat\beta_1 x_i^2 \right) = 0\] \[ \sum\limits_{i=1}^n x_i y_i - \sum\limits_{i=1}^n \hat\beta_0 x_i - \sum\limits_{i=1}^n \hat\beta_1 x_i^2 = 0\] \[ \sum\limits_{i=1}^n x_i y_i - n\bar x \hat\beta_0 - \hat\beta_1\sum\limits_{i=1}^n x_i^2 = 0\] \[ n \bar x \hat\beta_0 + \hat\beta_1\sum\limits_{i=1}^n x_i^2 = \sum\limits_{i=1}^n x_i y_i\]

When working with the first normal equation above, we showed that \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\). We use this expression to substitute out \(\hat\beta_0\) in the last expression we derived from the second normal equation.

\[ n \bar x \left(\bar y - \hat\beta_1 \bar x \right) + \hat\beta_1\sum\limits_{i=1}^n x_i^2 = \sum\limits_{i=1}^n x_i y_i\] \[ n \bar x \bar y - n \hat\beta_1 \bar x^2 + \hat\beta_1\sum\limits_{i=1}^n x_i^2 = \sum\limits_{i=1}^n x_i y_i\]

Collecting the \(\hat\beta_1\) terms together gives:

\[ \hat\beta_1 \left( \sum\limits_{i=1}^n x_i^2 - n \bar x^2 \right) = \sum\limits_{i=1}^n x_i y_i - n \bar x \bar y\] Solving for \(\hat\beta_1\) yields:

\[ \hat\beta_1 = \frac{\sum\limits_{i=1}^n x_i y_i - n \bar x \bar y}{\sum\limits_{i=1}^n x_i^2 - n \bar x^2}\] Applying the results from our “preliminaries” section above, we conclude that:

\[\hat\beta_1= \frac{S_{XY}}{S_{XX}} \hspace{20px} \mathrm{and} \hspace{20px}\hat\beta_0 = \bar y - \hat\beta_1 \bar x\]

These formulas give the parameter estimates that minimize \(SSE\), provided that \(S_{XX} > 0\) (that is, the \(x_i\) are not all equal).
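A minimal sketch of these formulas in Python follows. The function name `fit_least_squares` and the data are hypothetical; the comparison against NumPy's `np.polyfit` is included only as an independent cross-check.

```python
import numpy as np

def fit_least_squares(x, y):
    """Return (b0_hat, b1_hat) using b1 = S_XY / S_XX and b0 = y_bar - b1 * x_bar."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    s_xy = np.sum((x - x_bar) * (y - y_bar))
    if s_xx == 0:
        # All x values are identical, so the slope is not defined.
        raise ValueError("S_XX is zero; the slope estimate is undefined")
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat

# Hypothetical training data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0_hat, b1_hat = fit_least_squares(x, y)

# np.polyfit with deg=1 returns coefficients ordered [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
print(b0_hat, b1_hat)
print(intercept, slope)
```

Both approaches should produce the same line up to rounding, since a degree-one polynomial fit solves the same least squares problem.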

Additional Comments

Alternate form for \(\hat\beta_1\)

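There are many different ways to write the formula for \(\hat\beta_1\). One commonly encountered formula is \(\hat\beta_1 = \frac{\mathrm{cov}[X,Y]}{s_X^2}\). This formula can be derived from our previous formula for \(\hat\beta_1\) by multiplying the top and bottom of the expression by the same normalizing constant: \(1/(n-1)\) if the sample covariance and sample variance use the usual \(n-1\) denominator, or \(1/n\) if both use \(n\). The constant cancels in the ratio, so either convention gives the same slope.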
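The equivalence of the two forms can be illustrated with NumPy's covariance and variance routines. The data are hypothetical; the sketch computes the ratio under both the \(n-1\) and the \(n\) normalization and compares each with \(S_{XY}/S_{XX}\).

```python
import numpy as np

# Hypothetical sample (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Slope from the S_XY / S_XX form.
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
slope_s = s_xy / s_xx

# Slope from covariance over variance, both normalized by n - 1.
slope_unbiased = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Same ratio with both quantities normalized by n.
slope_biased = np.cov(x, y, bias=True)[0, 1] / np.var(x)

print(slope_s, slope_unbiased, slope_biased)
```

The one thing to avoid is mixing conventions, for example dividing an \(n-1\) covariance by an \(n\) variance, which would scale the slope by \(n/(n-1)\).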

Normal Equations

Notice that the normal equations:

\[\sum\limits_{i=1}^n \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = 0 \hspace{10px} \mathrm{and} \hspace{10px}\sum\limits_{i=1}^n x_i\left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = 0\]

can be written as:

\[\sum\limits_{i=1}^n \hat \varepsilon_i = 0\hspace{10px} \mathrm{and} \hspace{10px}\sum\limits_{i=1}^n x_i \hat \varepsilon_i = 0\]

These versions of the normal equations will be useful when deriving certain results later.
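The residual form of the normal equations is easy to verify numerically for a fitted line. The sketch below uses hypothetical data and the closed-form estimates; both sums should be zero up to floating-point rounding.

```python
import numpy as np

# Hypothetical training data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates from the closed-form formulas.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

residuals = y - (b0_hat + b1_hat * x)

sum_resid = np.sum(residuals)        # first normal equation
sum_x_resid = np.sum(x * residuals)  # second normal equation
print(sum_resid, sum_x_resid)
```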

Sample Means

Recall that our formula for the estimate of the intercept was given by \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\). We can rewrite this equation as \(\bar y = \hat\beta_0 + \hat\beta_1 \bar x\). This demonstrates that the point \((\bar x, \bar y)\) lies on the least squares regression line.
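As a quick check that does not rely on the intercept formula itself, the sketch below fits a line with NumPy's `np.polyfit` on hypothetical data and evaluates it at \(\bar x\).

```python
import numpy as np

# Hypothetical training data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Independent least squares fit; coefficients are ordered [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line at the sample mean of x.
y_at_x_bar = intercept + slope * x.mean()
print(y_at_x_bar, y.mean())
```

The fitted value at \(\bar x\) should match \(\bar y\) up to rounding.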

Summary

Given a sample of \(n\) paired observations \((x_i, y_i)\), we define \(S_{XX}\) and \(S_{XY}\) as follows: \[S_{XX} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)^2\] \[S_{XY} = \sum\limits_{i=1}^n \left(x_i - \bar x \right)\left(y_i - \bar y \right)\]

The least squares regression line \(\hat y = \hat\beta_0 + \hat\beta_1 x\) is obtained by using the following parameter estimates:

\[\hat\beta_1= \frac{S_{XY}}{S_{XX}} = \frac{\mathrm{cov}[X,Y]}{s_X^2} \hspace{20px} \mathrm{and} \hspace{20px}\hat\beta_0 = \bar y - \hat\beta_1 \bar x\]

The least squares regression line satisfies the following properties:

  1. \(SSE = \sum\limits_{i=1}^n \hat \varepsilon_i^2\) is minimized.

  2. \(\sum\limits_{i=1}^n \hat \varepsilon_i = 0\hspace{5px}\) and \(\hspace{5px}\sum\limits_{i=1}^n x_i \hat \varepsilon_i = 0\)

  3. The line passes through the point \((\bar x, \bar y)\).

