
Session Overview


Session 1: Calculating Residuals, \(R^2\), and Error Variance (45 minutes)

1. Refresher on Parameter Selection by Least Squares Method

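The goal of the least squares method is to select the parameters \(\hat{\beta}_1\) and \(\hat{\beta}_2\) such that the sum of squared residuals (SSR) is minimized. The residuals (\(e_i\)) are defined as:

\[ e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_i) \]

The sum of squared residuals (SSR) is:

\[ \mathrm{SSR} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_i)\bigr)^2 \]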

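To minimize SSR, we take the partial derivatives of SSR with respect to \(\hat{\beta}_1\) and \(\hat{\beta}_2\), set them to zero, and solve for \(\hat{\beta}_1\) and \(\hat{\beta}_2\). This gives us the normal equations:

\[ \frac{\partial\, \mathrm{SSR}}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i) = 0 \]

\[ \frac{\partial\, \mathrm{SSR}}{\partial \hat{\beta}_2} = -2\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i) = 0 \]

Solving these equations yields the least squares estimators:

\[ \hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} \]

\[ \hat{\beta}_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Here:

  • \(\bar{y}\) is the mean of \(y\),
  • \(\bar{x}\) is the mean of \(x\),
  • \(\hat{\beta}_1\) is the intercept,
  • \(\hat{\beta}_2\) is the slope.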
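
The estimator formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the session's Google Sheets workflow: the function name `least_squares` is an assumption, and the five data points are hypothetical stand-ins, not the Advertising-Sales dataset.

```python
def least_squares(x, y):
    """Return (b1, b2): the intercept and slope that minimize the SSR."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b1 = y_bar - b2 * x_bar
    return b1, b2


# Hypothetical data for illustration only (NOT the Advertising-Sales dataset).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, b2 = least_squares(x, y)
print(f"intercept = {b1:.3f}, slope = {b2:.3f}")
```

For this hypothetical data the script prints `intercept = 2.200, slope = 0.600`, which can be checked by hand: the cross-deviations sum to 6, the squared x-deviations sum to 10, and the means are 3 and 4.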


2. Refresher on Residuals and \(R^2\) (10 minutes)

  • Objective: Recap the concepts of residuals and \(R^2\) and their interpretations.
  • Key Points:
    • Residuals (\(e_i\)): \(e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_i)\)
      • What is it?: The difference between the actual value (\(y_i\)) and the predicted value (\(\hat{y}_i\)).
      • Why is it important?: Residuals help us measure the error in our predictions. Smaller residuals indicate a better-fitting model.
    • \(R^2\) (Coefficient of Determination): \(R^2 = \dfrac{\text{Explained Variation}}{\text{Total Variation}} = 1 - \dfrac{\text{Unexplained Variation}}{\text{Total Variation}} = 1 - \dfrac{\sum e_i^2}{\sum (y_i - \bar{y})^2}\)
Step 1: Define Total Variation

Total variation is the sum of squared deviations of \(y_i\) from the mean \(\bar{y}\):

\[ \text{Total Variation} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

Step 2: Define Unexplained Variation

Unexplained variation is the sum of squared residuals (\(e_i\)):

\[ \text{Unexplained Variation} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

  • What is it?: \(R^2\) measures the proportion of variation in \(y\) that is explained by \(x\).
  • Why is it important?: A high \(R^2\) (close to 1) indicates that the model explains a large portion of the variation in \(y\). A low \(R^2\) (close to 0) indicates that the model does not explain much of the variation.
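
The link between the "explained" and "unexplained" forms of \(R^2\) can be written as a worked equation. This sketch assumes a least squares line fitted with an intercept, in which case total variation splits exactly into an explained part and the sum of squared residuals:

```latex
\[
\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{Total Variation}}
= \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{Explained Variation}}
+ \underbrace{\sum_{i=1}^{n} e_i^2}_{\text{Unexplained Variation}}
\]
\[
R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
```

Dividing the first line by total variation gives the second, which is why \(R^2\) always lies between 0 and 1 for such a model.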

3. Hands-On Calculation of Residuals and \(R^2\) (15 minutes)

  • Objective: Calculate residuals and \(R^2\) manually using Google Sheets.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: We calculated the sum of squared residuals \(\sum e_i^2\) in the previous session, so here compute \((y_i - \bar{y})^2\) for each observation.
      • Instruction: Square each deviation of \(y_i\) from the mean \(\bar{y}\).
    • Step 2: Sum \((y_i - \bar{y})^2\) to get \(\sum (y_i - \bar{y})^2\).
      • Instruction: Use the SUM function in Google Sheets.
      • Why?: This sum, together with \(\sum e_i^2\), is used to calculate \(R^2\).
    • Step 3: Calculate \(R^2\) using the formula. \(R^2 = 1 - \dfrac{\sum e_i^2}{\sum (y_i - \bar{y})^2}\)
      • Instruction: Divide \(\sum e_i^2\) by \(\sum (y_i - \bar{y})^2\) and subtract the result from 1.
      • Why?: \(R^2\) tells us how well the model explains the variation in \(y\). For example, if \(R^2 = 0.85\), it means that 85% of the variation in sales is explained by advertising spending.
    • Discussion: What does \(R^2\) tell us about the model’s explanatory power? How well does advertising spending explain sales? Are there other factors that might influence sales?
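
The three spreadsheet steps can be cross-checked with a short Python sketch. The function name `r_squared` and the data are assumptions made for illustration; the numbers are hypothetical and are not the Advertising-Sales dataset.

```python
def r_squared(x, y):
    """Return R^2 = 1 - (sum of squared residuals) / (total variation)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least squares slope and intercept.
    b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b1 = y_bar - b2 * x_bar
    # Unexplained variation: squared residuals e_i = y_i - (b1 + b2 * x_i).
    unexplained = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation: squared deviations of y_i from the mean of y.
    total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - unexplained / total


# Hypothetical data for illustration only (NOT the Advertising-Sales dataset).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2 = r_squared(x, y)
print(f"R^2 = {r2:.3f}")
```

With these hypothetical values the squared residuals sum to 2.4 and total variation is 6, so \(R^2 = 1 - 2.4/6 = 0.6\).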

4. Estimate of Error Variance (\(s^2\)) and Why We Use \(s\) (15 minutes)

  • Objective: Calculate the estimate of error variance (\(s^2\)) and explain why we use \(s\) (the standard error of the regression) instead of the variance of the regression error term.
  • Key Points:
    • Estimate of Error Variance (\(s^2\)): \(s^2 = \dfrac{1}{n-2} \sum_{i=1}^{n} e_i^2\)
      • What is it?: \(s^2\) estimates the variance of the error term (\(\varepsilon_i\)) in the regression model.
      • Why is it important?: It measures the variability of the residuals, which helps us assess the accuracy of the regression model.
    • Why We Use \(s\) Instead of the Variance of the Regression Error Term:
      • The true variance of the regression error term (\(\sigma^2\)) is unknown in practice. We estimate it using \(s^2\), which is based on the residuals (\(e_i\)).
      • \(s\) (the standard error of the regression) is the square root of \(s^2\): \(s = \sqrt{s^2}\)
      • \(s\) is used in hypothesis testing, confidence intervals, and prediction intervals because it provides a measure of the spread of the residuals around the regression line.
      • Using \(s\) instead of the true variance (\(\sigma^2\)) accounts for the fact that we are working with sample data and need to estimate the variability of the errors.
    • Why \(n-2\)?:
      • \(n-2\) is the degrees of freedom, where \(n\) is the number of observations and 2 is the number of parameters estimated (\(\hat{\beta}_1\) and \(\hat{\beta}_2\)).
      • Why is it important?: Dividing by \(n-2\) instead of \(n\) gives us an unbiased estimate of the error variance.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Compute \(s^2\) using the formula. \(s^2 = \dfrac{1}{n-2} \sum_{i=1}^{n} e_i^2\)
      • Instruction: Divide the sum of squared residuals by \(n-2\).
      • Why?: This gives us an unbiased estimate of the error variance.
    • Step 2: Compute \(s\) (the standard error of the regression).
      • Instruction: Take the square root of \(s^2\).
      • Why?: \(s\) is used in hypothesis testing and prediction intervals.
    • Discussion: Why do we use \(s\) instead of the true variance (\(\sigma^2\))? How does \(s\) help us understand the uncertainty in our regression model?
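
The two exercise steps can be sketched in Python as follows. The function name `error_variance` and the data are illustrative assumptions; the values are hypothetical and are not the Advertising-Sales dataset.

```python
import math


def error_variance(x, y):
    """Return (s2, s): the error variance estimate and its square root."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least squares slope and intercept.
    b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b1 = y_bar - b2 * x_bar
    # Sum of squared residuals e_i = y_i - (b1 + b2 * x_i).
    ssr = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    # Two parameters were estimated, so divide by n - 2 rather than n.
    s2 = ssr / (n - 2)
    return s2, math.sqrt(s2)


# Hypothetical data for illustration only (NOT the Advertising-Sales dataset).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s2, s = error_variance(x, y)
print(f"s^2 = {s2:.3f}, s = {s:.3f}")
```

Here the sum of squared residuals is 2.4 and \(n - 2 = 3\), so \(s^2 = 0.8\). Dividing by \(n = 5\) instead would give 0.48, which understates the error variance.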

Session 2: Prediction Interval, Variance of \(\hat{\beta}_2\), and Hypothesis Testing

2.1 Prediction Interval for y (10 minutes)

  • Objective: Calculate the prediction interval for y manually using Google Sheets.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Predict sales for a week with $7,000 and $8,000 spent on advertising.
      • Instruction: Use the regression equation \(\hat{y} = \hat{\beta}_1 + \hat{\beta}_2 x\) to make the predictions.
      • Why?: Predictions help us understand how the model can be applied in real-world scenarios.
    • Step 2: Calculate the prediction interval for y.
      • Instruction: Use the formula: \(\text{Prediction Interval} = (\hat{y}_0 - k s,\ \hat{y}_0 + k s)\), where \(k = 2\) for an approximate 95% prediction interval.
      • Why?: The prediction interval gives us a range of values within which we expect the actual value of y to fall.
    • Discussion: What does the prediction interval tell us about the uncertainty in our predictions? How can we use this interval in decision-making?
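
A minimal Python sketch of the approximate interval \(\hat{y}_0 \pm k s\) follows. It implements only the simplified formula used in this session, which ignores the extra uncertainty from estimating the coefficients. The function name `prediction_interval`, the data, and the assumption that \(x\) is measured in thousands of dollars are all hypothetical; this is not the Advertising-Sales dataset.

```python
import math


def prediction_interval(x, y, x0, k=2.0):
    """Return (lower, y0_hat, upper) using the approximation y0_hat +/- k * s."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least squares slope and intercept.
    b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b1 = y_bar - b2 * x_bar
    # Standard error of the regression, s = sqrt(SSR / (n - 2)).
    ssr = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ssr / (n - 2))
    # Point prediction at x0, then the band of k standard errors around it.
    y0_hat = b1 + b2 * x0
    return y0_hat - k * s, y0_hat, y0_hat + k * s


# Hypothetical data for illustration only (NOT the Advertising-Sales dataset);
# x is assumed to be advertising spend in thousands of dollars.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

for x0 in (7, 8):
    lower, y0_hat, upper = prediction_interval(x, y, x0)
    print(f"x0 = {x0}: prediction = {y0_hat:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Note that 7 and 8 lie outside the range of the hypothetical \(x\) values, so these predictions are extrapolations and should be read with extra caution.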

2.2 Variance of \(\hat{\beta}_2\)

  • Objective: Calculate the variance of the slope coefficient (\(\sigma^2_{\hat{\beta}_2}\)).
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Compute \((x_i - \bar{x})^2\) for each observation.
      • Instruction: Square the deviations from the mean.
      • Why?: These calculations are part of the formula for the variance of the slope coefficient.
    • Step 2: Sum \((x_i - \bar{x})^2\) to get \(\sum (x_i - \bar{x})^2\).
      • Instruction: Use the SUM function in Google Sheets.
      • Why?: This sum is used to calculate the variance of the slope coefficient.
    • Step 3: Calculate \(\sigma^2_{\hat{\beta}_2}\) using the formula. \(\sigma^2_{\hat{\beta}_2} = \dfrac{s^2}{\sum (x_i - \bar{x})^2}\)
      • Instruction: Divide \(s^2\) by the sum of squared deviations.
      • Why?: \(\sigma^2_{\hat{\beta}_2}\) tells us the variability of the slope coefficient, which helps us assess the precision of our estimate.
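
The same three steps can be sketched in Python. The function name `slope_variance` and the data are illustrative assumptions; the values are hypothetical and are not the Advertising-Sales dataset.

```python
def slope_variance(x, y):
    """Return the estimated variance of the slope: s^2 / sum((x_i - x_bar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Steps 1 and 2: squared deviations of x from its mean, then their sum.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # Least squares fit, needed to obtain the residuals behind s^2.
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b1 = y_bar - b2 * x_bar
    ssr = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (n - 2)
    # Step 3: divide s^2 by the sum of squared deviations.
    return s2 / s_xx


# Hypothetical data for illustration only (NOT the Advertising-Sales dataset).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

var_b2 = slope_variance(x, y)
print(f"estimated variance of the slope = {var_b2:.3f}")
```

For this data \(s^2 = 0.8\) and \(\sum (x_i - \bar{x})^2 = 10\), so the estimated variance of the slope is 0.08.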

Interpretation of the Formula

1. Numerator: Estimate of Error Variance (\(s^2\))

  • What it represents:
    \(s^2\) is the estimated variance of the regression errors (\(\varepsilon_i\)). It measures how much the actual data points (\(y_i\)) deviate from the predicted values (\(\hat{y}_i\)) on average.
  • Why it matters:
    • A larger \(s^2\) means the model has higher prediction error (residuals are large), leading to greater uncertainty in \(\hat{\beta}_2\).
    • A smaller \(s^2\) implies the regression line fits the data tightly, reducing uncertainty in the slope estimate.

2. Denominator: Sum of Squared Deviations of \(x\) (\(\sum (x_i - \bar{x})^2\))

  • What it represents:
    This term measures the spread/variability of the independent variable (\(x\)) around its mean (\(\bar{x}\)).
  • Why it matters:
    • A larger denominator (more spread in \(x\)) means the slope estimate \(\hat{\beta}_2\) is more precise (lower variance). Intuitively, if \(x\) varies widely, it’s easier to detect its relationship with \(y\).
    • A smaller denominator (less spread in \(x\)) makes \(\hat{\beta}_2\) less precise (higher variance). If all \(x_i\) are close to \(\bar{x}\), small changes in \(y\) could drastically alter the slope.
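
The effect of the denominator can be seen numerically with a small sketch. It holds \(s^2\) fixed at an assumed value and compares two hypothetical sets of \(x\) values with the same mean but different spread. The function name `slope_variance_given_s2` and all numbers are illustrative assumptions.

```python
def slope_variance_given_s2(s2, x):
    """Return s2 / sum((x_i - x_bar)^2) for a given error variance estimate."""
    x_bar = sum(x) / len(x)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s2 / s_xx


# Hypothetical values for illustration only.
s2 = 0.8                               # assumed error variance estimate
narrow_x = [2.8, 2.9, 3.0, 3.1, 3.2]   # x values bunched around the mean of 3
wide_x = [1, 2, 3, 4, 5]               # x values spread out around the same mean

var_narrow = slope_variance_given_s2(s2, narrow_x)
var_wide = slope_variance_given_s2(s2, wide_x)
print(f"narrow spread: {var_narrow:.2f}, wide spread: {var_wide:.2f}")
```

The narrow design has a sum of squared deviations of about 0.1 against 10 for the wide design, so with the same \(s^2\) its slope variance is roughly 100 times larger.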

2.3 Hypothesis Testing for \(\hat{\beta}_2\) (15 minutes)

  • Objective: Perform a hypothesis test for the slope parameter (\(\beta_2\)) manually using Google Sheets.
  • Exercise: Use the Advertising-Sales dataset.
    • Step 1: Compute the standard error of the slope coefficient (\(s_{\hat{\beta}_2}\)).
      • Instruction: Take the square root of \(\sigma^2_{\hat{\beta}_2}\).
      • Why?: \(s_{\hat{\beta}_2}\) estimates the standard deviation of the slope coefficient, which is used in hypothesis testing.
    • Step 2: Calculate the t-statistic for the slope coefficient.
      • Instruction: Use the formula: \(t_{\hat{\beta}_2} = \dfrac{\hat{\beta}_2}{s_{\hat{\beta}_2}}\)
      • Why?: The t-statistic helps us test whether the slope coefficient is significantly different from zero.
    • Step 3: Compare the t-statistic to the critical value.
      • Instruction: Use a t-distribution table to find the critical value for a 95% confidence level with \(n-2\) degrees of freedom.
      • Why?: If the absolute value of the t-statistic exceeds the critical value, we reject the null hypothesis (\(H_0: \beta_2 = 0\)).
      • Rule-of-thumb for large \(n\): reject \(H_0\) if \(t_{\hat{\beta}_2} < -2\) or \(t_{\hat{\beta}_2} > 2\).

    • Discussion: What does the t-statistic tell us about the significance of the slope coefficient? How does this affect our interpretation of the regression model?
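
The test can be sketched in Python as below. The function name `slope_t_statistic` and the data are illustrative assumptions; the values are hypothetical and are not the Advertising-Sales dataset. The critical value 3.182 is the standard two-sided 5% table value for 3 degrees of freedom, which matches this five-point example only; for another sample size it must be looked up again.

```python
import math


def slope_t_statistic(x, y):
    """Return the t-statistic for H0: beta_2 = 0 in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # Least squares slope and intercept.
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b1 = y_bar - b2 * x_bar
    # Error variance estimate with n - 2 degrees of freedom.
    ssr = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (n - 2)
    # Step 1: standard error of the slope. Step 2: slope divided by it.
    se_b2 = math.sqrt(s2 / s_xx)
    return b2 / se_b2


# Hypothetical data for illustration only (NOT the Advertising-Sales dataset).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t_stat = slope_t_statistic(x, y)
t_crit = 3.182  # t-table value: two-sided 5% test, n - 2 = 3 degrees of freedom
reject_h0 = abs(t_stat) > t_crit  # Step 3: compare |t| with the critical value
print(f"t = {t_stat:.3f}, critical value = {t_crit}, reject H0: {reject_h0}")
```

In this example the t-statistic is about 2.12. It exceeds the large-sample rule-of-thumb cutoff of 2 but not the exact critical value for 3 degrees of freedom, which illustrates why the rule of thumb should be reserved for large \(n\).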

2.4 Confidence Interval for \(\beta_2\)

  • Step 4: Calculate the 95% confidence interval for the slope coefficient manually using Google Sheets.
    • Instruction: Use the formula: \(\text{95\% Confidence Interval} = \hat{\beta}_2 \pm t_{\alpha/2,\,n-2}\; s_{\hat{\beta}_2}\), where \(t_{\alpha/2,\,n-2}\) is the critical value from the t-distribution table.
    • Why?: The confidence interval gives us a range of values within which we expect the true slope coefficient to lie.
  • Discussion: What does \(\sigma^2_{\hat{\beta}_2}\) tell us about the precision of the slope coefficient? How does it affect our confidence in the regression model?
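
A minimal Python sketch of the interval calculation follows. The function name `slope_confidence_interval` and the data are illustrative assumptions; the values are hypothetical and are not the Advertising-Sales dataset. The critical value is passed in as an argument because it comes from a t-table; 3.182 is the two-sided 5% value for the 3 degrees of freedom in this five-point example.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Return (lower, upper) for the slope: b2 -/+ t_crit * se(b2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # Least squares slope and intercept.
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b1 = y_bar - b2 * x_bar
    # Standard error of the slope, built from s^2 = SSR / (n - 2).
    ssr = sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b2 = math.sqrt((ssr / (n - 2)) / s_xx)
    margin = t_crit * se_b2
    return b2 - margin, b2 + margin


# Hypothetical data for illustration only (NOT the Advertising-Sales dataset).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

lower, upper = slope_confidence_interval(x, y, t_crit=3.182)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

The interval for this hypothetical data runs from about -0.30 to 1.50. Because it contains zero, it agrees with a two-sided test at the 5% level that does not reject \(H_0: \beta_2 = 0\).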

Why This Matters

Homework/Follow-Up