Simple Linear Regression: Predicting Fuel Efficiency

Why Linear Regression?

Simple linear regression estimates how one quantitative variable changes when another quantitative variable changes.

In this presentation, the goal is to predict fuel efficiency using the built-in mtcars dataset:

Response variable: miles per gallon (mpg)
Predictor variable: car weight (wt), measured in thousands of pounds

The practical question is simple: Do heavier cars tend to use more fuel, that is, get fewer miles per gallon?

Research Question and Hypothesis

Research question: Can car weight help predict miles per gallon?

A reasonable expectation is that heavier cars have lower MPG because more mass usually requires more energy to move.

For this example, we expect a negative slope:

\[ \beta_1 < 0 \]

That means predicted MPG should decrease as weight increases.
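As a quick check of this expectation, the sketch below looks at the sign of the sample correlation and runs a one-sided test on the slope. The one-sided alternative \(\beta_1 < 0\) and the variable names (`r`, `fit`, `p_one_sided`) are choices made for this illustration, not part of the original analysis.

```r
# Base R only; mtcars ships with R.
r <- cor(mtcars$wt, mtcars$mpg)          # sample correlation, about -0.87

fit <- lm(mpg ~ wt, data = mtcars)
slope_row <- coef(summary(fit))["wt", ]  # Estimate, Std. Error, t value, Pr(>|t|)

# One-sided p-value for H1: beta_1 < 0 (lower tail of the t distribution)
p_one_sided <- pt(slope_row["t value"], df = df.residual(fit))

r
slope_row
p_one_sided
```

A negative correlation and a very small one-sided p-value are both consistent with the expected negative slope.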

Math Slide 1: Population Model

The simple linear regression model is:

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i \]

where:

\(y_i\) is observed MPG for car \(i\)
\(x_i\) is weight for car \(i\)
\(\beta_0\) is the population intercept
\(\beta_1\) is the population slope
\(\epsilon_i\) is random error not explained by weight

Math Slide 2: Estimated Line and Least Squares

Because \(\beta_0\) and \(\beta_1\) are unknown, we estimate them from the sample:

\[ \hat{y}_i = b_0 + b_1x_i \]

The least-squares method chooses \(b_0\) and \(b_1\) to minimize the sum of squared errors (SSE), where each error is the residual \(y_i - \hat{y}_i\):

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Smaller residuals mean the fitted line is closer to the observed data.
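For one predictor, the least-squares estimates have a standard closed form: \(b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(b_0 = \bar{y} - b_1\bar{x}\). The sketch below computes them by hand and compares the result with `lm()`. The variable names are illustrative only.

```r
# Base R only; mtcars ships with R.
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared errors at the fitted line
sse <- sum((y - (b0 + b1 * x))^2)

# lm() should give the same coefficients
fit <- lm(mpg ~ wt, data = mtcars)
same_as_lm <- isTRUE(all.equal(unname(coef(fit)), c(b0, b1)))

c(b0 = b0, b1 = b1, sse = sse)  # b0 is about 37.285, b1 about -5.344
same_as_lm
```

Any other choice of intercept and slope gives a larger SSE on this sample, which is what "least squares" means.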

Data Used

The mtcars dataset contains measurements for 32 cars.

First 8 rows used in the analysis
car	mpg	wt	hp	cyl
Mazda RX4	21.0	2.620	110	6
Mazda RX4 Wag	21.0	2.875	110	6
Datsun 710	22.8	2.320	93	4
Hornet 4 Drive	21.4	3.215	110	6
Hornet Sportabout	18.7	3.440	175	8
Valiant	18.1	3.460	105	6
Duster 360	14.3	3.570	245	8
Merc 240D	24.4	3.190	62	4

This dataset is useful for this example because weight and fuel efficiency are both numeric and show a clear, roughly linear relationship.
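A table like the one above can be rebuilt from the built-in data. This is a minimal sketch; the object names `cars` and `first8` are assumptions, and the slides may have produced the table differently.

```r
# Base R only; mtcars ships with R.
cars <- mtcars
cars$car <- rownames(cars)        # car names are stored as row names

# Keep the five columns shown on the slide and the first 8 rows
first8 <- head(cars[, c("car", "mpg", "wt", "hp", "cyl")], 8)
rownames(first8) <- NULL          # avoid printing the car name twice

print(first8)
```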

ggplot 1: Scatterplot with Regression Line

The downward slope shows that heavier cars generally have lower MPG.

ggplot 2: Residual Plot

Residuals are prediction errors: observed MPG minus predicted MPG.
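One way to draw a residual plot of this kind is sketched below. Plotting residuals against predicted MPG is an assumption; the original figure may have used weight on the horizontal axis instead.

```r
library(ggplot2)

fit <- lm(mpg ~ wt, data = mtcars)

# Residual = observed MPG - predicted MPG
resid_df <- data.frame(
  predicted = fitted(fit),
  residual  = mtcars$mpg - fitted(fit)
)

p_resid <- ggplot(resid_df, aes(x = predicted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero
  geom_point(size = 3) +
  labs(x = "Predicted MPG", y = "Residual (observed - predicted)") +
  theme_minimal()

print(p_resid)
```

Points scattered evenly above and below the dashed line, with no curve or funnel shape, support a straight-line model.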

ggplot 3: Prediction Interval View

This extra plot goes beyond the minimum requirement and shows the uncertainty around individual predictions, which is wider than the uncertainty around the fitted line itself.
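A prediction band can be computed with `predict(..., interval = "prediction")` and drawn as a ribbon. The sketch below is one possible version; the 95% level, the 100-point grid, and the object names are assumptions rather than the settings used for the original figure.

```r
library(ggplot2)

fit <- lm(mpg ~ wt, data = mtcars)

# Evenly spaced weights across the observed range
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))

# predict() returns a matrix with columns fit, lwr, upr
pred <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)
band <- cbind(grid, as.data.frame(pred))

p_pi <- ggplot() +
  geom_ribbon(data = band, aes(x = wt, ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line(data = band, aes(x = wt, y = fit)) +
  geom_point(data = mtcars, aes(x = wt, y = mpg), size = 3) +
  labs(x = "Weight (1000 lb)", y = "Miles per gallon") +
  theme_minimal()

print(p_pi)
```

Note that `geom_smooth(se = TRUE)` draws a confidence band for the mean line, which is narrower than this prediction band.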

Plotly: Interactive 3D Plot

This interactive plot uses weight, horsepower, and MPG together.

The interactive view helps reveal whether the two-dimensional regression pattern still makes sense when horsepower is considered.

Estimated Regression Equation

The fitted model is:

\[\widehat{mpg} = 37.29 - 5.34\,(wt)\]

The model's \(R^2\) value is 0.753.

Interpretation:

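The slope is negative, so predicted MPG decreases as weight increases: each additional 1,000 pounds is associated with about 5.34 fewer predicted MPG.
\(R^2\) measures the proportion of MPG variability explained by weight, which is about 75% in this sample.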
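The coefficients and \(R^2\) can be read directly from the fitted model object. A minimal sketch, with illustrative object names:

```r
# Base R only; mtcars ships with R.
fit <- lm(mpg ~ wt, data = mtcars)

coefs <- round(coef(fit), 2)               # (Intercept) 37.29, wt -5.34
r_sq  <- round(summary(fit)$r.squared, 3)  # 0.753

coefs
r_sq
```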

Prediction Example

For a car weighing 3.0 thousand pounds, the model predicts:

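\[\widehat{mpg} = 37.285 - 5.344(3.0) \approx 21.25\]

So a 3,000-pound car is predicted to get about 21.25 MPG. (The coefficients are shown to three decimals here; the two-decimal values 37.29 and 5.34 would give 21.27.)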

This is a prediction, not a guarantee. Real cars can differ because of horsepower, engine design, aerodynamics, driving behavior, and other factors.
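The same prediction can be obtained with `predict()`, which uses the unrounded coefficients. The sketch below also requests a 95% prediction interval to show how far an individual car might plausibly fall from the point prediction. The 95% level and the name `new_car` are assumptions for illustration.

```r
# Base R only; mtcars ships with R.
fit <- lm(mpg ~ wt, data = mtcars)

new_car <- data.frame(wt = 3.0)   # weight in thousands of pounds

point_pred <- predict(fit, newdata = new_car)   # about 21.25
pred_int   <- predict(fit, newdata = new_car,
                      interval = "prediction", level = 0.95)

point_pred
pred_int   # columns: fit, lwr, upr
```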

R Code Used

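library(ggplot2)   # static plots
library(plotly)    # interactive 3D plot
library(knitr)     # loaded for table output (e.g. kable())

# Copy the built-in data and add helper columns
data(mtcars)
mtcars2 <- mtcars
mtcars2$car <- rownames(mtcars)             # car names are stored as row names
mtcars2$cyl <- factor(mtcars2$cyl)          # treat cylinder count as a category

# Fit the simple linear regression of MPG on weight
model <- lm(mpg ~ wt, data = mtcars2)
mtcars2$predicted_mpg <- fitted(model)      # predicted MPG for each car
mtcars2$residual <- resid(model)            # observed minus predicted MPG

# Coefficients, standard errors, and R-squared
summary(model)

# Scatterplot with the fitted line and its confidence band
ggplot(mtcars2, aes(x = wt, y = mpg)) +
  geom_point(aes(shape = cyl), size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  theme_minimal()

# Interactive 3D scatterplot of weight, horsepower, and MPG
fig3d <- plot_ly(mtcars2, x = ~wt, y = ~hp, z = ~mpg,
                 color = ~cyl, type = "scatter3d", mode = "markers")
layout(fig3d, title = "Weight, Horsepower, and MPG")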

Model Assumptions

Simple linear regression works best when these assumptions are reasonable:

The relationship between \(x\) and \(y\) is approximately linear.
Residuals are centered around zero.
Residuals have roughly constant spread.
Observations are independent.
There are no extreme outliers controlling the model.

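The residual plot is one way to check linearity, centering around zero, constant spread, and outliers visually. Independence depends on how the data were collected and cannot be judged from that plot alone.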
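A few numeric checks can complement the visual one. The sketch below is only a rough screen: splitting cars at the median weight to compare residual spread, and using Cook's distance to flag influential points, are choices made for this illustration and were not part of the original analysis.

```r
# Base R only; mtcars ships with R.
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)

# Centering: with an intercept in the model, residuals average to zero
mean_res <- mean(res)

# Spread: compare residual standard deviation for lighter vs heavier cars
heavier <- mtcars$wt > median(mtcars$wt)
spread_by_group <- tapply(res, heavier, sd)

# Influence: the three cars with the largest Cook's distance
cooks <- cooks.distance(fit)
most_influential <- head(sort(cooks, decreasing = TRUE), 3)

mean_res
spread_by_group
most_influential
```

Similar spreads in the two groups and no single car with a much larger Cook's distance than the rest would suggest the constant-spread and no-dominant-outlier assumptions are reasonable.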

Conclusion

This example shows how simple linear regression connects statistics, visualization, and prediction.

Main takeaways:

Car weight is a strong predictor of fuel efficiency in this dataset.
The fitted slope is negative, meaning heavier cars tend to have lower MPG.
ggplot helps visualize the fitted line and residuals.
Plotly adds interactive exploration with a third variable.
LaTeX equations make the statistical model clear and precise.