Why Linear Regression?

Simple linear regression estimates how one quantitative variable changes when another quantitative variable changes.

In this presentation, the goal is to predict fuel efficiency using the built-in mtcars dataset:

  • Response variable: miles per gallon (mpg)
  • Predictor variable: car weight (wt), measured in thousands of pounds

The practical question is simple: Do heavier cars tend to use more fuel?

Research Question and Hypothesis

Research question: Can car weight help predict miles per gallon?

A reasonable expectation is that heavier cars have lower MPG because more mass usually requires more energy to move.

For this example, we expect a negative slope:

\[ \beta_1 < 0 \]

That means predicted MPG should decrease as weight increases.

Math Slide 1: Population Model

The simple linear regression model is:

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i \]

where:

  • \(y_i\) is observed MPG for car \(i\)
  • \(x_i\) is weight for car \(i\)
  • \(\beta_0\) is the population intercept
  • \(\beta_1\) is the population slope
  • \(\epsilon_i\) is random error not explained by weight

Math Slide 2: Estimated Line and Least Squares

Because \(\beta_0\) and \(\beta_1\) are unknown, we estimate them from the sample:

\[ \hat{y}_i = b_0 + b_1x_i \]

The least-squares method chooses \(b_0\) and \(b_1\) to minimize total squared error:

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Smaller residuals mean the fitted line is closer to the observed data.
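
The minimization above has a well-known closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations in \(x\), and the intercept makes the line pass through \((\bar{x}, \bar{y})\). The following is a minimal sketch in base R that computes both by hand on mtcars and compares them with lm(). The variable names (x, y, b0, b1, sse, fit) are chosen for illustration only.

```r
# Closed-form least-squares estimates for mpg ~ wt, using base R only.
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-deviations divided by sum of squared x-deviations.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: chosen so the line passes through the point of means.
b0 <- mean(y) - b1 * mean(x)

# SSE for the hand-computed line.
sse <- sum((y - (b0 + b1 * x))^2)

# Compare with lm(); the two sets of estimates should agree.
fit <- lm(mpg ~ wt, data = mtcars)
c(b0 = b0, b1 = b1, sse = sse)
coef(fit)
```

Any other choice of intercept and slope would give a larger SSE on this sample, which is what "least squares" means in practice.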

Data Used

The mtcars dataset contains measurements for 32 cars.

First 8 rows used in the analysis

car                 mpg    wt     hp   cyl
Mazda RX4           21.0   2.620  110  6
Mazda RX4 Wag       21.0   2.875  110  6
Datsun 710          22.8   2.320   93  4
Hornet 4 Drive      21.4   3.215  110  6
Hornet Sportabout   18.7   3.440  175  8
Valiant             18.1   3.460  105  6
Duster 360          14.3   3.570  245  8
Merc 240D           24.4   3.190   62  4

This dataset is useful because weight and fuel efficiency are both recorded as numeric variables and show a clear relationship.

ggplot 1: Scatterplot with Regression Line

The downward slope shows that heavier cars generally have lower MPG.

ggplot 2: Residual Plot

Residuals are prediction errors: observed MPG minus predicted MPG.
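
A residual plot along these lines could be drawn as follows. This is a sketch, not necessarily the code behind the slide: it assumes residuals are plotted against predicted MPG (the original may have used weight on the x-axis), and the names resid_df and p_resid are illustrative.

```r
library(ggplot2)

model <- lm(mpg ~ wt, data = mtcars)

# One row per car: predicted MPG and residual (observed minus predicted).
resid_df <- data.frame(
  fitted   = fitted(model),
  residual = resid(model)
)

# Assumption: x-axis is predicted MPG; the dashed line marks zero error.
p_resid <- ggplot(resid_df, aes(x = fitted, y = residual)) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted MPG", y = "Residual (observed - predicted)") +
  theme_minimal()

p_resid
```

Points scattered evenly above and below the dashed line, with no curve or funnel shape, are what a well-behaved linear fit would look like.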

ggplot 3: Prediction Interval View

This extra plot goes beyond the minimum requirement and shows uncertainty around predictions.
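
One way to build such a view is to ask predict() for prediction intervals over a grid of weights and draw them as a ribbon. The sketch below assumes a 95% level and a 100-point grid; both choices, and the names grid, pred, and p_pred, are illustrative rather than taken from the original slide.

```r
library(ggplot2)

model <- lm(mpg ~ wt, data = mtcars)

# Grid of weights spanning the observed range (grid size is arbitrary).
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))

# predict() returns columns fit, lwr, upr when an interval is requested.
pred <- cbind(
  grid,
  predict(model, newdata = grid, interval = "prediction", level = 0.95)
)

# Ribbon = prediction interval, line = fitted values, points = observed cars.
p_pred <- ggplot(pred, aes(x = wt)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line(aes(y = fit)) +
  geom_point(data = mtcars, aes(y = mpg), size = 2) +
  labs(x = "Weight (1000 lbs)", y = "MPG") +
  theme_minimal()

p_pred
```

A prediction interval describes where a single new car's MPG is likely to fall, so it is wider than the confidence band that geom_smooth(se = TRUE) draws around the mean line.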

Plotly: Interactive 3D Plot

This interactive plot uses weight, horsepower, and MPG together.

The interactive view helps reveal whether the two-dimensional regression pattern still makes sense when horsepower is considered.

Estimated Regression Equation

The fitted model is:

\[\widehat{mpg} = 37.29 - 5.34(wt)\]

The model's \(R^2\) value is 0.753, so weight accounts for about 75% of the variability in MPG in this sample.

Interpretation:

  • The slope is negative, so predicted MPG decreases as weight increases.
  • \(R^2\) measures the proportion of MPG variability explained by weight.
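
The "proportion of variability explained" can be computed directly from the SSE defined earlier and the total sum of squares around the mean, \(R^2 = 1 - SSE/SST\). The sketch below does this in base R and compares the result with the value reported by summary(). The names sse, sst, and r2 are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Unexplained variation: squared residuals around the fitted line.
sse <- sum(resid(model)^2)

# Total variation: squared deviations of mpg around its mean.
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Proportion of total variation explained by the line.
r2 <- 1 - sse / sst

r2
summary(model)$r.squared
```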

Prediction Example

For a car weighing 3.0 thousand pounds, the model predicts:

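\[\widehat{mpg} = 37.29 - 5.34(3.0) \approx 21.25\]

So a 3,000-pound car is predicted to get about 21.25 MPG. This value comes from the unrounded coefficients; plugging in the rounded coefficients shown above gives 21.27.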

This is a prediction, not a guarantee. Real cars can differ because of horsepower, engine design, aerodynamics, driving behavior, and other factors.
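
The same prediction can be reproduced with predict(), which uses the unrounded coefficients. The sketch below also requests a 95% prediction interval to show how far an individual car might plausibly fall from the point estimate. The 95% level and the names new_car, point_pred, and interval_pred are assumptions made for illustration.

```r
model <- lm(mpg ~ wt, data = mtcars)

# A hypothetical new car weighing 3.0 thousand pounds.
new_car <- data.frame(wt = 3.0)

# Point prediction from the fitted line.
point_pred <- predict(model, newdata = new_car)

# Point prediction plus lower and upper bounds for a single new car.
interval_pred <- predict(model, newdata = new_car,
                         interval = "prediction", level = 0.95)

point_pred
interval_pred
```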

R Code Used

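# Load the packages used in the presentation
library(ggplot2)
library(plotly)
library(knitr)

# Copy mtcars, keep car names as a column, and treat cylinders as a category
data(mtcars)
mtcars2 <- mtcars
mtcars2$car <- rownames(mtcars)
mtcars2$cyl <- factor(mtcars2$cyl)

# Fit mpg on weight, then store fitted values and residuals
model <- lm(mpg ~ wt, data = mtcars2)
mtcars2$predicted_mpg <- fitted(model)
mtcars2$residual <- resid(model)

# Coefficients, standard errors, and R-squared
summary(model)

# Scatterplot with the least-squares line and its confidence band
ggplot(mtcars2, aes(x = wt, y = mpg)) +
  geom_point(aes(shape = cyl), size = 3) +
  geom_smooth(method = "lm", se = TRUE) +
  theme_minimal()

# Interactive 3D scatterplot of weight, horsepower, and MPG, colored by cylinders
fig3d <- plot_ly(mtcars2, x = ~wt, y = ~hp, z = ~mpg,
                 color = ~cyl, type = "scatter3d", mode = "markers")
layout(fig3d, title = "Weight, Horsepower, and MPG")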

Model Assumptions

Simple linear regression works best when these assumptions are reasonable:

  • The relationship between \(x\) and \(y\) is approximately linear.
  • Residuals are centered around zero.
  • Residuals have roughly constant spread.
  • Observations are independent.
  • No extreme outliers or high-leverage points dominate the fit.

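The residual plot is one way to check linearity, centering around zero, and constant spread visually. Independence depends on how the data were collected and cannot be read off the plot alone.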
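
A couple of numeric checks can complement the visual one. The sketch below uses Cook's distance from base R to rank cars by their influence on the fit. Treating the largest values as "worth a closer look" is a rule of thumb rather than a formal test, and the names cooks and most_influential are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)

# With an intercept in the model, residuals average to zero by construction,
# so the useful check is whether they stay centered across the range of wt.
mean(resid(model))

# Cook's distance: how much the fitted values would change if a car were dropped.
cooks <- cooks.distance(model)

# The three cars with the most influence on the fitted line.
most_influential <- head(sort(cooks, decreasing = TRUE), 3)
most_influential
```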

Conclusion

This example shows how simple linear regression connects statistics, visualization, and prediction.

Main takeaways:

  • Car weight is a strong predictor of fuel efficiency in this dataset (\(R^2 = 0.753\)).
  • The fitted slope is negative, meaning heavier cars tend to have lower MPG.
  • ggplot helps visualize the fitted line and residuals.
  • Plotly adds interactive exploration with a third variable.
  • LaTeX equations make the statistical model clear and precise.