Session Overview


Session 1: Calculating Residuals, \(R^2\), and Error Variance (45 minutes)

1. Refresher on Parameter Selection by Least Squares Method

The goal of the least squares method is to select the parameters \(\hat{\beta}_1\) and \(\hat{\beta}_2\) such that the sum of squared residuals (SSR) is minimized. The residuals (\(e_i\)) are defined as:

\[ e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_i) \]

The sum of squared residuals (SSR) is:

\[ SSR = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_i) \right)^2 \]

To minimize \(SSR\), we take the partial derivatives of \(SSR\) with respect to \(\hat{\beta}_1\) and \(\hat{\beta}_2\), set them to zero, and solve for \(\hat{\beta}_1\) and \(\hat{\beta}_2\). This gives us the normal equations:

\[ \frac{\partial SSR}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^n (y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i) = 0 \] \[ \frac{\partial SSR}{\partial \hat{\beta}_2} = -2 \sum_{i=1}^n x_i (y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i) = 0 \]

Solving these equations yields the least squares estimators:

\[ \hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} \] \[ \hat{\beta}_2 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

Here:

  • \(\bar{y}\) is the mean of \(y\),
  • \(\bar{x}\) is the mean of \(x\),
  • \(\hat{\beta}_1\) is the intercept,
  • \(\hat{\beta}_2\) is the slope.
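
As a minimal sketch of the closed-form estimators above, the Python snippet below computes the slope and intercept on a small made-up sample and checks that both normal equations hold. The data values and the variable names (`x`, `y`, `beta1_hat`, `beta2_hat`) are illustrative assumptions; they are not taken from the Advertising-Sales dataset.

```python
# Hypothetical sample for illustration only (not the Advertising-Sales data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: sum of cross-deviations divided by the sum of squared x-deviations.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta2_hat = s_xy / s_xx

# Intercept: forces the fitted line through the point (x_bar, y_bar).
beta1_hat = y_bar - beta2_hat * x_bar

# Both normal equations should hold up to floating-point rounding.
residuals = [yi - (beta1_hat + beta2_hat * xi) for xi, yi in zip(x, y)]
normal_eq_1 = sum(residuals)
normal_eq_2 = sum(xi * ei for xi, ei in zip(x, residuals))

print(f"intercept = {beta1_hat:.2f}, slope = {beta2_hat:.2f}")
print(f"normal equations: {normal_eq_1:.1e}, {normal_eq_2:.1e}")
```

For this sample the intercept is 2.20 and the slope is 0.60, and both normal-equation sums are zero up to rounding error.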


2. Refresher on Residuals and \(R^2\) (10 minutes)

  • Objective: Recap the concepts of residuals and \(R^2\) and their interpretations.
  • Key Points:
    • Residuals (\(e_i\)): \[ e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_i) \]
      • What is it?: The difference between the actual value (\(y_i\)) and the predicted value (\(\hat{y}_i\)).
      • Why is it important?: Residuals help us measure the error in our predictions. Smaller residuals indicate a better-fitting model.
    • \(R^2\) (Coefficient of Determination): \[ R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}=1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2} \]
Step 1: Define Total Variation

Total variation is the sum of squared deviations of \(y_i\) from the mean \(\bar{y}\):

\[ \text{Total Variation} = \sum_{i=1}^n (y_i - \bar{y})^2 \]

Step 2: Define Unexplained Variation

Unexplained variation is the sum of squared residuals (\(e_i\)):

\[ \text{Unexplained Variation} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 \]

  • What is it?: \(R^2\) measures the proportion of variation in \(y\) that is explained by \(x\).
  • Why is it important?: A high \(R^2\) (close to 1) indicates that the model explains a large portion of the variation in \(y\). A low \(R^2\) (close to 0) indicates that the model does not explain much of the variation.
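
The two forms of \(R^2\) given above agree because, for a least squares line that includes an intercept, total variation splits exactly into an explained part and an unexplained part. A short derivation sketch, using the same symbols as above:

\[ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n e_i^2 \]

The cross term \(2 \sum (\hat{y}_i - \bar{y}) e_i\) vanishes because the normal equations imply \(\sum e_i = 0\) and \(\sum x_i e_i = 0\). Dividing both sides by total variation gives:

\[ 1 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} + \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2} \quad \Rightarrow \quad R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2} \]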

3. Hands-On Calculation of Residuals and \(R^2\) (15 minutes)

  • Objective: Calculate residuals and \(R^2\) manually using Google Sheets.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: We calculated the squared residuals \(e_i^2\) in the previous session. Now compute \((y_i - \bar{y})^2\) for each observation.
      • Instruction: Subtract the mean \(\bar{y}\) from each \(y_i\) and square the result.
    • Step 2: Sum \((y_i - \bar{y})^2\) to get \(\sum (y_i - \bar{y})^2\).
      • Instruction: Use the SUM function in Google Sheets.
      • Why?: These sums are used to calculate \(R^2\).
    • Step 3: Calculate \(R^2\) using the formula. \[ R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2} \]
      • Instruction: Divide \(\sum e_i^2\) by \(\sum (y_i - \bar{y})^2\) and subtract the result from 1.
      • Why?: \(R^2\) tells us how well the model explains the variation in \(y\). For example, if \(R^2 = 0.85\), it means that 85% of the variation in sales is explained by advertising spending.
    • Discussion: What does \(R^2\) tell us about the model’s explanatory power? How well does advertising spending explain sales? Are there other factors that might influence sales?
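
The spreadsheet steps above can be mirrored in a few lines of Python, which may help when checking the Google Sheets results. This is only a sketch on a made-up sample: the data and the names `sq_resid`, `sq_dev`, `ssr`, `sst` and `r_squared` are assumptions for illustration, not part of the Advertising-Sales dataset.

```python
# Hypothetical sample for illustration only (not the Advertising-Sales data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta2_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta1_hat = y_bar - beta2_hat * x_bar

# One row per observation, like the columns of the spreadsheet.
print(f"{'x':>4} {'y':>4} {'y_hat':>6} {'e^2':>6} {'(y-ybar)^2':>11}")
sq_resid, sq_dev = [], []
for xi, yi in zip(x, y):
    y_hat = beta1_hat + beta2_hat * xi
    sq_resid.append((yi - y_hat) ** 2)
    sq_dev.append((yi - y_bar) ** 2)
    print(f"{xi:>4.0f} {yi:>4.0f} {y_hat:>6.2f} "
          f"{sq_resid[-1]:>6.2f} {sq_dev[-1]:>11.2f}")

ssr = sum(sq_resid)        # unexplained variation (sum of squared residuals)
sst = sum(sq_dev)          # total variation
r_squared = 1 - ssr / sst
print(f"SSR = {ssr:.2f}, total variation = {sst:.2f}, R^2 = {r_squared:.2f}")
```

For this sample \(R^2\) is 0.60, meaning 60% of the variation in the made-up \(y\) values is explained by \(x\).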

4. Estimate of Error Variance (\(s^2\)) and Why We Use \(s\) (15 minutes)

  • Objective: Calculate the estimate of error variance (\(s^2\)) and explain why we use \(s\) (the standard error of the regression) instead of the variance of the regression error term.
  • Key Points:
    • Estimate of Error Variance (\(s^2\)): \[ s^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 \]
      • What is it?: \(s^2\) estimates the variance of the error term (\(\varepsilon_i\)) in the regression model.
      • Why is it important?: It measures the variability of the residuals, which helps us assess the accuracy of the regression model.
    • Why We Use \(s\) Instead of the Variance of the Regression Error Term:
      • The true variance of the regression error term (\(\sigma^2\)) is unknown in practice. We estimate it using \(s^2\), which is based on the residuals (\(e_i\)).
      • \(s\) (the standard error of the regression) is the square root of \(s^2\): \[ s = \sqrt{s^2} \]
      • \(s\) is used in hypothesis testing, confidence intervals, and prediction intervals because it provides a measure of the spread of the residuals around the regression line.
      • Using \(s\) instead of the true variance (\(\sigma^2\)) accounts for the fact that we are working with sample data and need to estimate the variability of the errors.
    • Why \(n-2\)?:
      • \(n-2\) is the degrees of freedom, where \(n\) is the number of observations and 2 is the number of parameters estimated (\(\hat{\beta}_1\) and \(\hat{\beta}_2\)).
      • Why is it important?: Dividing by \(n-2\) instead of \(n\) gives us an unbiased estimate of the error variance.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Compute \(s^2\) using the formula. \[ s^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 \]
      • Instruction: Divide the sum of squared residuals by \(n-2\).
      • Why?: This gives us an unbiased estimate of the error variance.
    • Step 2: Compute \(s\) (the standard error of the regression).
      • Instruction: Take the square root of \(s^2\).
      • Why?: \(s\) is used in hypothesis testing and prediction intervals.
    • Discussion: Why do we use \(s\) instead of the true variance (\(\sigma^2\))? How does \(s\) help us understand the uncertainty in our regression model?
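
A minimal sketch of the two exercise steps, again on a made-up sample rather than the Advertising-Sales data. The names `s_squared`, `s` and `naive_variance` are illustrative; the last one is included only to show how dividing by \(n\) instead of \(n-2\) gives a smaller number.

```python
import math

# Hypothetical sample for illustration only (not the Advertising-Sales data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta2_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta1_hat = y_bar - beta2_hat * x_bar

# Sum of squared residuals around the fitted line.
ssr = sum((yi - (beta1_hat + beta2_hat * xi)) ** 2 for xi, yi in zip(x, y))

# Step 1: divide by the degrees of freedom (n - 2), not by n.
s_squared = ssr / (n - 2)

# Step 2: the standard error of the regression is the square root.
s = math.sqrt(s_squared)

# For comparison only: dividing by n gives a smaller value.
naive_variance = ssr / n

print(f"s^2 = {s_squared:.3f}, s = {s:.3f}, SSR/n = {naive_variance:.3f}")
```

With only five observations the gap between the two divisors is large (0.800 versus 0.480); it shrinks as \(n\) grows.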

Session 2: Prediction Interval, Variance of \(\hat{\beta}_2\) and Hypothesis Testing

2.1 Prediction Interval for \(y\) (10 minutes)

  • Objective: Calculate the prediction interval for \(y\) manually using Google Sheets.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Predict sales for a week with $7,000 and $8,000 spent on advertising.
      • Instruction: Use the regression equation \(\hat{y} = \hat{\beta}_1 + \hat{\beta}_2 x\) to make the predictions.
      • Why?: Predictions help us understand how the model can be applied in real-world scenarios.
    • Step 2: Calculate the prediction interval for \(y\).
      • Instruction: Use the formula: \[ \text{Prediction Interval} = (\hat{y}_0 - ks, \hat{y}_0 + ks) \] where \(k = 2\) for an approximate 95% prediction interval.
      • Why?: The prediction interval gives us a range of values within which we expect the actual value of \(y\) to fall.
    • Discussion: What does the prediction interval tell us about the uncertainty in our predictions? How can we use this interval in decision-making?
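
The approximate interval \(\hat{y}_0 \pm ks\) can be sketched as follows. The sample, the new value `x0` and the variable names are assumptions for illustration only. Note that this rough interval uses only \(s\), so it does not reflect the additional uncertainty from estimating \(\hat{\beta}_1\) and \(\hat{\beta}_2\).

```python
import math

# Hypothetical sample for illustration only (not the Advertising-Sales data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta2_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta1_hat = y_bar - beta2_hat * x_bar

# Standard error of the regression.
ssr = sum((yi - (beta1_hat + beta2_hat * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(ssr / (n - 2))

# Point prediction at a hypothetical new x value.
x0 = 6.0
y0_hat = beta1_hat + beta2_hat * x0

# Approximate 95% prediction interval with k = 2.
k = 2
lower, upper = y0_hat - k * s, y0_hat + k * s

print(f"prediction = {y0_hat:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

For this sample the point prediction at `x0 = 6` is 5.800 and the approximate interval is roughly (4.011, 7.589).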

2.2 Variance of \(\hat{\beta}_2\)

  • Objective: Calculate the estimated variance of the slope coefficient (\(\sigma_{\hat{\beta}_2}^2\)), using \(s^2\) in place of the unknown error variance \(\sigma^2\).
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Compute \((x_i - \bar{x})^2\) for each observation.
      • Instruction: Square the deviations from the mean.
      • Why?: These calculations are part of the formula for the variance of the slope coefficient.
    • Step 2: Sum \((x_i - \bar{x})^2\) to get \(\sum (x_i - \bar{x})^2\).
      • Instruction: Use the SUM function in Google Sheets.
      • Why?: This sum is used to calculate the variance of the slope coefficient.
    • Step 3: Calculate \(\sigma_{\hat{\beta}_2}^2\) using the formula. \[ \sigma_{\hat{\beta}_2}^2 = \frac{s^2}{\sum (x_i - \bar{x})^2} \]
      • Instruction: Divide \(s^2\) by the sum of squared deviations.
      • Why?: \(\sigma_{\hat{\beta}_2}^2\) tells us the variability of the slope coefficient, which helps us assess the precision of our estimate.

Interpretation of the Formula

1. Numerator: Estimate of Error Variance (\(s^2\))

  • What it represents:
    \(s^2\) is the estimated variance of the regression errors (\(\varepsilon_i\)). It measures, roughly, the average squared deviation of the actual data points (\(y_i\)) from the predicted values (\(\hat{y}_i\)).
  • Why it matters:
    • A larger \(s^2\) means the model has higher prediction error (residuals are large), leading to greater uncertainty in \(\hat{\beta}_2\).
    • A smaller \(s^2\) implies the regression line fits the data tightly, reducing uncertainty in the slope estimate.

2. Denominator: Sum of Squared Deviations of \(x\) (\(\sum (x_i - \bar{x})^2\))

  • What it represents:
    This term measures the spread/variability of the independent variable (\(x\)) around its mean (\(\bar{x}\)).
  • Why it matters:
    • A larger denominator (more spread in \(x\)) means the slope estimate \(\hat{\beta}_2\) is more precise (lower variance). Intuitively, if \(x\) varies widely, it’s easier to detect its relationship with \(y\).
    • A smaller denominator (less spread in \(x\)) makes \(\hat{\beta}_2\) less precise (higher variance). If all \(x_i\) are close to \(\bar{x}\), small changes in \(y\) could drastically alter the slope.
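
The roles of the numerator and denominator can be seen in a short sketch. The sample and variable names are made up for illustration. The last step is a thought experiment, not a refit: it holds \(s^2\) fixed and asks what the variance would be if every \(x\) deviation were twice as large, which quadruples \(\sum (x_i - \bar{x})^2\).

```python
# Hypothetical sample for illustration only (not the Advertising-Sales data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)   # spread of x (denominator)
beta2_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta1_hat = y_bar - beta2_hat * x_bar

# Estimated error variance (numerator).
ssr = sum((yi - (beta1_hat + beta2_hat * xi)) ** 2 for xi, yi in zip(x, y))
s_squared = ssr / (n - 2)

# Estimated variance of the slope coefficient.
var_beta2 = s_squared / s_xx

# Thought experiment: same s^2, but x deviations doubled (s_xx is 4x larger).
var_beta2_wider_x = s_squared / (4 * s_xx)

print(f"sum of squared x deviations = {s_xx:.1f}")
print(f"estimated slope variance    = {var_beta2:.3f}")
print(f"with doubled x spread       = {var_beta2_wider_x:.3f}")
```

Here the estimated slope variance is 0.080, and it would fall to 0.020 under the doubled spread, in line with the interpretation above.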

2.3 Hypothesis Testing for \(\hat{\beta}_2\) (15 minutes)

  • Objective: Perform a hypothesis test for the slope parameter (\(\beta_2\)), based on its estimate \(\hat{\beta}_2\), manually using Google Sheets.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Compute the standard error of the slope coefficient (\(s_{\hat{\beta}_2}\)).
      • Instruction: Take the square root of \(\sigma_{\hat{\beta}_2}^2\).
      • Why?: \(s_{\hat{\beta}_2}\) estimates the standard deviation of the slope estimate, which is used in hypothesis testing.
    • Step 2: Calculate the t-statistic for the slope coefficient.
      • Instruction: Use the formula: \[ t_{\hat{\beta}_2} = \frac{\hat{\beta}_2}{s_{\hat{\beta}_2}} \]
      • Why?: The t-statistic helps us test whether the slope coefficient is significantly different from zero.
    • Step 3: Compare the t-statistic to the critical value.
      • Instruction: Use a t-distribution table to find the two-sided critical value at the 5% significance level with \(n-2\) degrees of freedom.
      • Why?: If the absolute value of the t-statistic exceeds the critical value, we reject the null hypothesis (\(H_0: \beta_2 = 0\)).
      • Rule of thumb for large \(n\): reject \(H_0\) if \(t_{\hat{\beta}_2} < -2\) or \(t_{\hat{\beta}_2} > 2\).

    • Discussion: What does the t-statistic tell us about the significance of the slope coefficient? How does this affect our interpretation of the regression model?
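
The three steps can be sketched in Python as below. The sample is made up, and the critical value `T_CRIT = 3.182` is the standard two-sided 5% table value for 3 degrees of freedom, hard-coded here because this sample has \(n = 5\); a different sample size would need a different table entry.

```python
import math

# Hypothetical sample for illustration only (not the Advertising-Sales data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta2_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta1_hat = y_bar - beta2_hat * x_bar
ssr = sum((yi - (beta1_hat + beta2_hat * xi)) ** 2 for xi, yi in zip(x, y))
s_squared = ssr / (n - 2)

# Step 1: standard error of the slope coefficient.
se_beta2 = math.sqrt(s_squared / s_xx)

# Step 2: t-statistic for H0: beta_2 = 0.
t_stat = beta2_hat / se_beta2

# Step 3: compare |t| with the table value (two-sided, 5%, n - 2 = 3 d.f.).
T_CRIT = 3.182
reject_h0 = abs(t_stat) > T_CRIT

print(f"se = {se_beta2:.4f}, t = {t_stat:.4f}, reject H0: {reject_h0}")
```

For this tiny sample the t-statistic is about 2.12, which is above the large-sample threshold of 2 but below the exact critical value of 3.182, so \(H_0\) is not rejected. This illustrates why the rule of thumb should be reserved for large \(n\).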

2.4 Confidence Interval for \(\hat{\beta}_2\)

  • Step 4: Calculate the 95% confidence interval for the slope coefficient manually using Google Sheets.
    • Instruction: Use the formula: \[ \text{95\% Confidence Interval} = \hat{\beta}_2 \pm t_{\alpha/2, n-2} \cdot s_{\hat{\beta}_2} \] where \(t_{\alpha/2, n-2}\) is the critical value from the t-distribution table with \(\alpha = 0.05\) and \(n-2\) degrees of freedom.
    • Why?: The confidence interval gives us a range of values within which we expect the true slope coefficient to lie.
  • Discussion: What does \(\sigma_{\hat{\beta}_2}^2\) tell us about the precision of the slope coefficient? How does it affect our confidence in the regression model?
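
A minimal sketch of the confidence interval calculation on a made-up sample. As an assumption tied to this sample size (\(n = 5\)), the critical value `T_CRIT = 3.182` is the two-sided 5% t-table entry for 3 degrees of freedom; the data and variable names are illustrative only.

```python
import math

# Hypothetical sample for illustration only (not the Advertising-Sales data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta2_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta1_hat = y_bar - beta2_hat * x_bar
ssr = sum((yi - (beta1_hat + beta2_hat * xi)) ** 2 for xi, yi in zip(x, y))

# Standard error of the slope coefficient.
se_beta2 = math.sqrt((ssr / (n - 2)) / s_xx)

# Two-sided 5% critical value for n - 2 = 3 degrees of freedom (t-table).
T_CRIT = 3.182

# 95% confidence interval: estimate plus/minus critical value times s.e.
margin = T_CRIT * se_beta2
ci_lower, ci_upper = beta2_hat - margin, beta2_hat + margin

print(f"95% CI for the slope: ({ci_lower:.3f}, {ci_upper:.3f})")
```

The interval for this sample is roughly (-0.300, 1.500). It contains zero, which agrees with a two-sided test at the 5% level that does not reject \(H_0: \beta_2 = 0\).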

Why This Matters

Homework/Follow-Up