2025-10-16

Slide 1: Introduction to Linear Regression

What is Linear Regression?

  • models the relationship between two quantifiable variables
  • assumes the data are normally distributed and that the relationship is linear
  • focus: statistical relationships

Slide 2: The mathematical equation

If you’re familiar with the equation for a line, then this should be familiar:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

where:

  • \(y_i\) = dependent variable, often the observed result

  • \(x_i\) = independent variable, often the thing we change

  • \(\beta_0\) = intercept (when \(x\) is 0, what is the observed result, \(y\)?)

  • \(\beta_1\) = slope (change in \(y\) for a unit change in \(x\))

  • \(\varepsilon_i\) = the error value for the equation (noise, variation in the data)…

but we’ll get back to epsilon. Ignore that for now.
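
To make the equation concrete, here is a minimal sketch (the values \(\beta_0 = 2\), \(\beta_1 = 0.5\), and the sample size are made up for illustration) that simulates data from this equation and lets R's lm() try to recover the coefficients:

set.seed(42)
x <- runif(50, min = 0, max = 10)        # independent variable
epsilon <- rnorm(50, mean = 0, sd = 1)   # the noise term we are ignoring for now
y <- 2 + 0.5 * x + epsilon               # dependent variable: beta_0 = 2, beta_1 = 0.5
fit <- lm(y ~ x)                         # fit a simple linear regression
coef(fit)                                # the estimates should land close to 2 and 0.5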

Slide 3: Basic Example 1

# ggplot() is used to plot graphs. aes() stands for aesthetic, and is
# used to map your axes to columns of the specified data. geom_point()
# plots your data values, and geom_smooth() adds your line of fit (lm
# means linear model). se represents standard error, but we'll get
# back to that.
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
ggplot(data = trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkorchid", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black")
## `geom_smooth()` using formula = 'y ~ x'

Slide 4: What was that?

The black line represents the fitted simple linear regression model: \[ y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \varepsilon_i \]

In this case, our x value is girth.

The other values in the equation (a short sketch after this list shows the estimates in R):

  • \(\hat{\beta}_0\): intercept — what is the predicted volume when girth is 0?

  • \(\hat{\beta}_1\): slope - what is the average rate of change of volume with increasing values of girth?

  • \(y_i\): output - the observed volume of the tree for that girth value.

  • \(\varepsilon_i\)?
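
One quick way to see those estimates as numbers is the same lm() call used later on Slide 7:

model <- lm(Volume ~ Girth, data = trees)  # the same fit that geom_smooth(method = "lm") draws
coef(model)                                # estimated intercept (beta_0 hat) and slope (beta_1 hat)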

Slide 5: But what about the noise?

When we fit a line to data, not all points lie exactly on the line. The more scattered the points are, the more noise there is.

The differences between the actual data and the predicted values on the line are called residuals.

\[ \varepsilon_i = y_i - \hat{y}_i \]

What residuals tell us (a quick numeric check follows this list):

  • How far each point is from the regression line (like the black one)
  • Whether our model underestimates or overestimates the relationship between two variables
  • Whether linear regression models the data accurately
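
That definition can be checked directly in R; a minimal sketch comparing residuals() with the differences computed by hand:

model <- lm(Volume ~ Girth, data = trees)
manual <- trees$Volume - predict(model)               # y_i minus y-hat_i, by hand
all.equal(unname(manual), unname(residuals(model)))   # TRUE: they are the same values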

Slide 6: Confidence Intervals

A confidence interval shows the range of values within which the true regression line is likely to lie.

For the purposes of this presentation, linear models use the conventional 95% confidence interval (a numeric sketch follows the list below).

But,

  • Narrower intervals → more confidence in the predicted values.
  • Wider intervals → more uncertainty, especially at the edges of the data.
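
As a rough sketch of what those intervals look like as numbers (the girth values 8, 13, and 20 below are just illustrative picks from the range of the data):

model <- lm(Volume ~ Girth, data = trees)
confint(model, level = 0.95)   # 95% interval for the intercept and slope
# interval for the fitted line itself; it is narrowest near the middle
# of the data and widens toward the edges
predict(model, newdata = data.frame(Girth = c(8, 13, 20)),
        interval = "confidence", level = 0.95)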

Let’s look at the example on the next slide to see what that really means.

Slide 7:

model <- lm(Volume ~ Girth, data = trees)   # specifying our model again
trees$predicted <- predict(model)           # the fitted points on the line
trees$residuals <- residuals(model)         # the differences between actual and predicted

ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkorchid", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  geom_segment(aes(xend = Girth, yend = predicted), color = "cornflowerblue")  # draw each residual
## `geom_smooth()` using formula = 'y ~ x'

The shaded area around the line represents the 95% confidence interval.
  • It accounts for any uncertainty in our predictions.
The blue lines represent the residuals.
  • The larger and more scattered the residuals, the more noise there is in the data.

Slide 8: A Complex Example

  • Let’s look at a more complex example from the diamonds dataset.

As you can see, the line of best fit here could be linear. But is linear the best choice? What other kinds of best fit models could work?
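
The code for that plot is not reproduced here; a minimal sketch of one plausible version, assuming the slide plots carat against price (the actual variables may differ):

library(ggplot2)
ggplot(diamonds, aes(x = carat, y = price)) +            # diamonds ships with ggplot2
  geom_point(alpha = 0.1, color = "darkorchid") +        # ~54,000 points, so fade them
  geom_smooth(method = "lm", se = TRUE, color = "black")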
