Regression Model

  • Simple linear regression models a linear relationship between two variables: a predictor \(x\) and a response \(y\)
  • The model equation is:
    • \(y = \beta_0 + \beta_1x + \epsilon\)
      • \(\beta_0\) is the intercept
      • \(\beta_1\) is the slope
      • \(\epsilon\) is the error term

The Best Fit

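  • The line of best fit is the one line that minimizes the sum of squared residuals
    • Each residual is the observed value minus the fitted value: \(e_i = y_i - \hat{y}_i\), where \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\)
    • Minimize \(\sum e_i^2 = \sum(y_i - \hat{y}_i)^2\)
      • The minimizing values are the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\)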
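
For a single predictor, this minimization has a well-known closed-form solution. The sketch below applies it to R's built-in cars data (the data set used in the rest of this deck) and cross-checks the result against lm(). The variable names x, y, b0_hat and b1_hat are chosen here for illustration.

```r
# Closed-form least-squares estimates for the built-in cars data
x <- cars$speed
y <- cars$dist

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: makes the line pass through the point (mean(x), mean(y))
b0_hat <- mean(y) - b1_hat * mean(x)

c(intercept = b0_hat, slope = b1_hat)

# Cross-check against R's built-in fit; the two should agree
coef(lm(dist ~ speed, data = cars))
```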

Example Data Set

  • To introduce linear regression visually, we can start with a simple data set: R's built-in cars data, which records speed and stopping distance for 50 cars
library(plotly)  # provides plot_ly(), layout() and the %>% pipe
plot_ly(
  data = cars,
  x=~speed,
  y=~dist,
  type="scatter",
  mode="markers"
) %>%
layout(
  title="Car Speed vs Stopping Distance",
  xaxis=list(title="Speed (mph)"),
  yaxis=list(title="Stopping Distance (ft)")
)

Regression Line

  • A linear regression line is added in blue
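
The code behind this plot is not shown on the slide. One way it could be produced is sketched below, assuming the model is fitted with lm(dist ~ speed, data = cars) and the line is drawn with plotly's add_lines(). The object name p and the trace names are illustrative.

```r
library(plotly)

# Fit the simple linear regression of stopping distance on speed
model <- lm(dist ~ speed, data = cars)

# Scatter plot of the data with the fitted line overlaid in blue
p <- plot_ly(
  data = cars,
  x = ~speed,
  y = ~dist,
  type = "scatter",
  mode = "markers",
  name = "Observed"
) %>%
  add_lines(
    y = ~fitted(model),            # fitted values from the model above
    line = list(color = "blue"),
    name = "Regression line"
  ) %>%
  layout(
    title = "Car Speed vs Stopping Distance",
    xaxis = list(title = "Speed (mph)"),
    yaxis = list(title = "Stopping Distance (ft)")
  )
p
```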

Residuals

  • Each point lies some vertical distance from the regression line; these distances, the residuals, are shown in red
  • A residual is the y position of the point minus the y position of the line at the same x
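
A minimal sketch of this calculation, assuming the same lm() fit on the cars data. The name e is illustrative.

```r
model <- lm(dist ~ speed, data = cars)

# Residual = observed y minus the y value of the fitted line at the same x
e <- cars$dist - fitted(model)

# These match the values returned by R's residuals() accessor
all.equal(unname(e), unname(residuals(model)))

# Because the model includes an intercept, the residuals sum to
# (numerically) zero: points above and below the line balance out
sum(e)
```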

Squared Residuals

  • Each square has area \(e_i^2\)
  • Note: the squares look distorted because the two axes use different scales. Squares that would extend beyond the plot have been removed, but their residuals are still shown.

How the Regression Line is Chosen

  • The regression line is chosen to minimize the total area of all the squares, known as the sum of squared errors (SSE)
    • \(SSE = \sum(y_i-\hat{y}_i)^2\)
    • The SSE in this example is:
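# Fit stopping distance on speed (the model behind the blue line)
model=lm(dist ~ speed, data=cars)
b0=coef(model)[1]  # fitted intercept
b1=coef(model)[2]  # fitted slope
# Sum of squared differences between observed and fitted distances
sum((cars$dist - (b0+b1*cars$speed))^2)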
## [1] 11353.52

Testing a Different Line

  • To check that the line R found really has the smallest total error, we can compute the SSE of a different line and compare
  • First we compare numerically, then visually
b0_test=-25
b1_test=4.3
sum((cars$dist - (b0_test+b1_test*cars$speed))^2)
## [1] 11693.52
  • Indeed, even slightly different parameters give a higher total error (11693.52 vs. 11353.52)

Visual Demonstration

  • Here is the test line from the previous slide, shown visually

Least Squares Regression is a Function Minimum

  • Computing the SSE for many different slopes and plotting SSE against slope gives a parabola with a single minimum
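
The slide does not show how the curve was computed. The sketch below is one possibility. It assumes the intercept is held at its fitted value while the slope varies over a grid. The helper sse(), the grid range 3 to 5, and the step size 0.01 are illustrative choices, not taken from the original.

```r
model <- lm(dist ~ speed, data = cars)
b0 <- coef(model)[[1]]  # assumption: intercept held at its fitted value

# Hypothetical helper: SSE of the line b0 + b1 * speed on the cars data
sse <- function(b0, b1) sum((cars$dist - (b0 + b1 * cars$speed))^2)

# Evaluate the SSE on a grid of candidate slopes
slopes <- seq(3, 5, by = 0.01)
sse_by_slope <- sapply(slopes, function(b1) sse(b0, b1))

# Grid slope with the smallest SSE; it should sit next to coef(model)[2]
best_slope <- slopes[which.min(sse_by_slope)]
best_slope
```

Plotting sse_by_slope against slopes traces out the parabola. A finer grid moves best_slope closer to the fitted slope.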

Minimum Value cont.

  • The slope at which this function reaches its minimum (~3.93) is the slope of the regression line: it minimizes total error and is the value R chooses automatically
    • The same process can be repeated for the intercept \(\beta_0\)
    • The process can also be applied to the SSE as a function of both intercept and slope, which forms a 3-D bowl shape whose minimum lies at the parameters of the best-fit line.
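
To illustrate the two-parameter version, the sketch below searches the bowl numerically with R's general-purpose optimizer optim(). This is not how lm() computes its estimates. It only demonstrates that the bottom of the bowl coincides with them. The function name sse_surface, the starting point c(0, 0), and the tolerance setting are assumptions made for this example.

```r
# SSE as a function of both parameters: par[1] = intercept, par[2] = slope
sse_surface <- function(par) {
  sum((cars$dist - (par[1] + par[2] * cars$speed))^2)
}

# Numerically search for the bottom of the bowl, starting from (0, 0)
fit <- optim(
  par = c(0, 0),
  fn = sse_surface,
  method = "BFGS",
  control = list(reltol = 1e-12)
)

fit$par    # intercept and slope found by the search
fit$value  # SSE at that point

# For comparison: the least-squares estimates from lm()
coef(lm(dist ~ speed, data = cars))
```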