Understanding Linear Regression

Why study linear regression?

Linear regression is one of the most common tools in statistics for describing and predicting the relationship between a quantitative response and a quantitative predictor.

This presentation uses R's built-in mtcars data set (32 cars) and studies how car weight (wt, in units of 1000 pounds) is related to fuel efficiency (mpg, miles per gallon).

Questions answered by regression:

Is there a clear linear trend?
How can we estimate the slope of that trend?
How well does the fitted line explain the data?
What mpg should we predict for a car of a given weight?

The simple linear regression model

For one predictor variable, the model is

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,2,\dots,n \]

where:

\(Y_i\) is the response value
\(x_i\) is the predictor value
\(\beta_0\) is the intercept
\(\beta_1\) is the slope
\(\varepsilon_i\) is random error

In this example:

\[ \text{mpg}_i = \beta_0 + \beta_1\,\text{wt}_i + \varepsilon_i \]

A negative slope would mean heavier cars tend to get lower gas mileage.

Least squares estimation

The regression line is chosen by minimizing the sum of squared errors:

\[ S(\beta_0,\beta_1) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 \]

The least-squares estimates are

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \]

So the fitted line is

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

For the mtcars example, the fitted line is

\[ \widehat{\text{mpg}} = 37.29 - 5.34\,\text{wt} \]
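
As a check on the formulas above, the short R sketch below computes the two estimates directly from the sums and compares them with what lm() reports. The variable names (x, y, b0, b1) are illustrative and not taken from the original slides.

```r
# Least-squares estimates for mpg ~ wt from the closed-form formulas.
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-deviations divided by the sum of squared x-deviations.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point of means (x-bar, y-bar).
b0 <- mean(y) - b1 * mean(x)

# The same fit via lm(), for comparison.
model <- lm(mpg ~ wt, data = mtcars)

round(c(intercept = b0, slope = b1), 2)
round(coef(model), 2)
```

Both calls should agree with the fitted line, giving 37.29 and -5.34 after rounding.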

What the slope means

The estimated slope is

\[ \hat{\beta}_1 = -5.34 \]

Interpretation:

For each increase of 1 unit in wt (that is, 1000 pounds), the predicted fuel efficiency decreases by about 5.34 mpg.
Because the slope is negative, heavier cars are predicted to have lower mpg.

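A useful goodness-of-fit measure is the coefficient of determination,

\[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} \]

where \(\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) is the sum of squared residuals and \(\text{SST} = \sum_{i=1}^{n}(y_i - \bar{y})^2\) is the total sum of squares.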

For this model,

\[ R^2 = 0.753 \]

So about 75.3% of the variability in mpg is explained by car weight alone.
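
The R-squared value can be reproduced by computing the residual and total sums of squares directly. This is a minimal sketch; the names sse, sst, and r_squared are chosen here for illustration.

```r
model <- lm(mpg ~ wt, data = mtcars)

# SSE: variation left over after fitting the line.
sse <- sum(residuals(model)^2)

# SST: total variation of mpg around its mean.
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r_squared <- 1 - sse / sst
round(r_squared, 3)

# Value reported by summary(), for comparison.
round(summary(model)$r.squared, 3)
```

The two printed values should match each other and the 0.753 quoted above.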

Example: scatterplot with fitted line

The downward trend shows a clear negative linear association between weight and mpg.

Example: residual plot

A good residual plot should look roughly pattern-free and centered around 0.
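
One way to draw such a residual plot is sketched below. It uses base R graphics to stay dependency-free, which is an assumption; the original figure may have been produced differently (for example with ggplot2).

```r
model <- lm(mpg ~ wt, data = mtcars)
fitted_mpg <- fitted(model)
res <- residuals(model)

# Residuals against fitted values, with a dashed reference line at 0.
plot(fitted_mpg, res,
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# With an intercept in the model, least-squares residuals average to zero
# (up to floating-point rounding), so the cloud is centered on the line.
mean(res)
```

Curvature or a funnel shape in this plot would suggest that a straight line or a constant error variance is not a good description of the data.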

Example: distribution of residuals

This plot helps check whether the errors are reasonably centered and roughly symmetric.
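
A histogram is one simple way to look at the residual distribution. The sketch below again uses base R graphics as an assumption, and the number of bins is an arbitrary illustrative choice.

```r
model <- lm(mpg ~ wt, data = mtcars)
res <- residuals(model)

# Histogram: look for a single peak near 0 and rough symmetry.
# breaks = 10 is only a suggestion to hist(), not an exact bin count.
hist(res, breaks = 10,
     xlab = "Residual", main = "Distribution of residuals")

# Numeric summary to read alongside the plot: a median close to the
# mean (which is 0 here) is consistent with rough symmetry.
summary(res)
```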

Interactive plotly view

This interactive figure lets the viewer inspect individual cars and compare observed values with the fitted trend.
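
A figure of this kind could be built roughly as follows. This is a sketch that assumes the plotly package is installed; the original plotting code is not shown in the slides, so the column names car and fitted_mpg and the trace labels are illustrative.

```r
library(plotly)

cars_df <- mtcars
cars_df$car <- rownames(mtcars)           # car names for the hover labels
model <- lm(mpg ~ wt, data = cars_df)
cars_df$fitted_mpg <- fitted(model)
cars_df <- cars_df[order(cars_df$wt), ]   # sort so the line is drawn left to right

# Observed points; hovering shows the car name with its weight and mpg.
fig <- plot_ly(cars_df, x = ~wt, y = ~mpg,
               type = "scatter", mode = "markers",
               text = ~car, name = "Observed")

# Fitted least-squares line on top of the points.
fig <- add_lines(fig, y = ~fitted_mpg, name = "Fitted line")

fig <- layout(fig,
              xaxis = list(title = "Weight (1000 pounds)"),
              yaxis = list(title = "Miles per gallon"))
fig
```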

Prediction example

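Using the fitted line, the predicted mpg for a car with weight

\[ \text{wt} = 3 \]

(that is, 3000 pounds) is

\[ \widehat{\text{mpg}} = 21.25 \]

(The value 21.25 comes from the unrounded coefficients; plugging 3 into the rounded line above gives 21.27.)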

A 95% confidence interval for the mean mpg at this weight is

\[ (20.12,\ 22.38) \]

This gives a practical example of how regression can be used for estimation and prediction.
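
The prediction and its interval can be reproduced with predict(). The sketch below also requests a prediction interval for a single new car, which is not shown in the slides and is included only for contrast; the names new_car, ci, and pi_new are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)  # 3000 pounds

# Confidence interval for the MEAN mpg of all cars at this weight.
ci <- predict(model, newdata = new_car, interval = "confidence", level = 0.95)
round(ci, 2)

# Prediction interval for ONE new car at this weight; it is wider because
# it also accounts for car-to-car scatter around the mean.
pi_new <- predict(model, newdata = new_car, interval = "prediction", level = 0.95)
round(pi_new, 2)
```

The first call should reproduce the fitted value 21.25 and the interval (20.12, 22.38) quoted above.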

R code used to create Fuel Efficiency vs Car Weight

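library(ggplot2)

# Work on a copy of the built-in data set.
cars_df <- mtcars
# Fitted model object (coefficients, residuals, predictions).
# The plot itself refits the same line through geom_smooth().
model <- lm(mpg ~ wt, data = cars_df)

# Scatterplot with the least-squares line and its 95% confidence band.
ggplot(cars_df, aes(x = wt, y = mpg)) +
  geom_point(size = 2.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Fuel Efficiency vs Car Weight",
    x = "Weight (1000 pounds)",
    y = "Miles per gallon"
  ) +
  theme_minimal(base_size = 18)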

Key takeaways

Linear regression models how a response changes with a predictor.
The least-squares line minimizes the sum of squared residuals.
In the mtcars example, heavier cars tend to have lower mpg.
Plots and residuals help assess whether the model is reasonable.
Regression is useful for both explanation and prediction.