Simple Linear Regression

  • Goal: Model how one variable changes with another
  • Example: Predict height from age, sales from advertising

What is Simple Linear Regression?

  • One predictor x, one outcome y
  • We assume a straight-line pattern plus some random noise
  • We will:
    • Choose a line
    • Measure errors
    • Pick the line with the smallest total squared error

Basic Equation

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

  • \(y\): response (what we want to predict)
  • \(x\): predictor (what we use to predict)
  • \(\beta_0\): intercept (expected value of \(y\) when \(x = 0\))
  • \(\beta_1\): slope (average change in \(y\) when \(x\) increases by 1)
  • \(\varepsilon\): random noise
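
To make the equation concrete, here is a minimal sketch that simulates data from it. The intercept, slope, sample size, and the choice of normally distributed noise are all assumptions made for illustration; none of them come from a real data set.

```r
# Simulate data from y = beta_0 + beta_1 * x + noise.
# Every numeric value below is made up for illustration.
set.seed(1)                              # make the random draws reproducible
n      <- 50                             # assumed sample size
beta_0 <- 2                              # assumed intercept
beta_1 <- 0.5                            # assumed slope
x      <- runif(n, min = 0, max = 10)    # predictor values
eps    <- rnorm(n, mean = 0, sd = 1)     # random noise (assumed normal)
y      <- beta_0 + beta_1 * x + eps      # response

data <- data.frame(x = x, y = y)
head(data)                               # first few simulated rows
```

A data frame built this way has the columns x and y that a scatter plot of y against x needs.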

How to find the best slope and intercept

Any candidate line gives a predicted value for each point:

\[ \hat y_i = \hat\beta_0 + \hat\beta_1 x_i \]

The residual for point \(i\) is:

\[ \text{residual}_i = y_i - \hat y_i \]

Least squares chooses \(\hat\beta_0, \hat\beta_1\) to make

\[ \sum_{i=1}^n \text{residual}_i^2 \]

as small as possible.
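
For a single predictor, the minimizing values have a standard closed form, where \(\bar x\) and \(\bar y\) are the sample means:

\[ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]

The sketch below computes these by hand on a small made-up data set and compares them with R's built-in lm(). The data values and the variable names b0, b1, and rss are illustrative choices only.

```r
# Small made-up data set, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Predicted values, residuals, and the sum of squared residuals
y_hat <- b0 + b1 * x
rss   <- sum((y - y_hat)^2)

# lm() should report the same intercept and slope
fit <- lm(y ~ x)
coef(fit)
c(intercept = b0, slope = b1)
```

Shifting b0 or b1 away from these values and recomputing rss should always give a larger sum.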

Resulting plot

How to code it

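library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

# `data` is a data frame with numeric columns x and y
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 3) +                    # raw observations
  geom_smooth(method = "lm", se = FALSE, color = "red") +   # least-squares line
  labs(x = "X",
       y = "Y") +
  theme_minimal()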

Explanation:

geom_smooth(        # Adds a fitted line to the plot
  method = "lm",    # "lm" = linear model; fits y on x by least squares
  se = FALSE,       # "se" toggles the confidence band; FALSE hides it
  color = "red"     # Draws the line in red
)
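
The line that geom_smooth(method = "lm") draws can also be obtained as numbers by calling lm() directly. This is a minimal sketch on a hypothetical data frame; the values are invented for illustration.

```r
# Hypothetical data frame with columns x and y (values invented for illustration)
data <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8)
)

fit <- lm(y ~ x, data = data)               # straight-line fit of y on x
coef(fit)                                   # estimated intercept and slope
residuals(fit)                              # y_i minus fitted value, one per point
predict(fit, newdata = data.frame(x = 7))   # predicted y at a new x
```

The two numbers from coef(fit) are the intercept and slope of the red line in the plot.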

Limitations

  • Not everything can be modeled with a straight line
  • Outliers can pull the fitted line toward them, because squaring makes large residuals count heavily
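
The effect of a single outlier can be sketched as follows. The data are made up: ten points lying exactly on a line, with one of them then replaced by an extreme value.

```r
# Made-up data lying exactly on the line y = 1 + 2x
x <- 1:10
y <- 1 + 2 * x

slope_clean <- coef(lm(y ~ x))[["x"]]         # slope without any outlier

# Replace the last point with an extreme value (illustrative choice)
y_out <- y
y_out[10] <- 60

slope_outlier <- coef(lm(y_out ~ x))[["x"]]   # slope with the outlier

c(clean = slope_clean, with_outlier = slope_outlier)
```

One changed point out of ten is enough to make the fitted slope noticeably steeper than the slope of the other nine points.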

Multiple dimensions

Linear regression can also work with more than one predictor…

Look forward to the next lesson, where you will learn to work with 3D data!