Regression and Prediction

Overview

Linear regression is one of the most common tools in statistics.

It is used when we want to understand how one numerical variable changes as another numerical variable changes.

This presentation uses the mtcars dataset to study the connection between:

wt = vehicle weight (in 1000 lbs)
mpg = fuel efficiency (miles per gallon)

The goal is to see whether heavier cars tend to use more fuel, that is, whether mpg falls as weight rises.

Model Formula

A simple linear regression model is written as

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

where

\(Y_i\) is the response
\(X_i\) is the predictor
\(\beta_0\) is the intercept
\(\beta_1\) is the slope
\(\varepsilon_i\) is the error term

For this dataset, the model becomes

\[ mpg_i = \beta_0 + \beta_1 wt_i + \varepsilon_i \]
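As a minimal sketch, the model above can be fitted in R with lm(). The data frame name cars_data matches the regression call shown later in this presentation, but how it was built is not shown, so the subsetting of mtcars below is an assumption.

```r
# Assumption: cars_data is the mpg and wt columns of the built-in mtcars data
cars_data <- mtcars[, c("mpg", "wt")]

# Fit mpg as a linear function of wt by ordinary least squares
fit <- lm(mpg ~ wt, data = cars_data)

# Estimated intercept (b0) and slope (b1)
coef(fit)
```

The two values returned by coef(fit) are the estimates of \(\beta_0\) and \(\beta_1\).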

What the Coefficients Mean

Once the model is estimated, the fitted line can be written as

\[ \hat{Y} = b_0 + b_1 X \]

In this example,

\[ \widehat{mpg} = 37.29 - 5.34(wt) \]

This tells us two important things:

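the intercept (37.29) is the predicted mpg when weight is 0, which is an extrapolation far outside the observed weights and not a realistic car
the slope (-5.34) is the predicted change in mpg when weight increases by 1 unit (1000 lbs)

Since the slope is negative, predicted mpg decreases as weight increases: about 5.34 mpg for every additional 1000 lbs.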
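As a worked example of reading the fitted line, the sketch below predicts mpg for a hypothetical car with wt = 3 (3000 lbs). The weight of 3 and the construction of cars_data are assumptions chosen for illustration.

```r
# Assumption: cars_data is the mpg and wt columns of the built-in mtcars data
cars_data <- mtcars[, c("mpg", "wt")]
fit <- lm(mpg ~ wt, data = cars_data)

# Hypothetical car weighing 3000 lbs (wt is measured in 1000 lbs)
new_car <- data.frame(wt = 3)

# Prediction from the fitted model
pred <- predict(fit, newdata = new_car)

# The same prediction computed by hand from b0 + b1 * wt
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * 3

pred
by_hand
```

Both values agree with the rounded equation: 37.29 - 5.34(3) is roughly 21.3 mpg.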

Looking at the Data

The first few rows of the data are shown below.

##                    mpg    wt
## Mazda RX4         21.0 2.620
## Mazda RX4 Wag     21.0 2.875
## Datsun 710        22.8 2.320
## Hornet 4 Drive    21.4 3.215
## Hornet Sportabout 18.7 3.440
## Valiant           18.1 3.460
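The rows above could be produced with head(). This is a sketch: the definition of cars_data is not shown in this presentation, so selecting the two columns from mtcars is an assumption.

```r
# Assumption: cars_data is the mpg and wt columns of the built-in mtcars data
cars_data <- mtcars[, c("mpg", "wt")]

# Show the first six rows
head(cars_data)
```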

Even from a quick look, weight and fuel economy seem related.

A scatterplot helps reveal that pattern more clearly.

ggplot: Scatterplot with Fitted Line

This graph shows a clear downward trend.

Cars with greater weight usually have lower fuel efficiency.

Summary of the Regression

The regression output gives estimates for the intercept, slope, and measures of model fit.

## 
## Call:
## lm(formula = mpg ~ wt, data = cars_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The slope is negative and its p-value (1.29e-10) is far below 0.05, which supports the visual pattern from the scatterplot.

The \(R^2\) value of 0.7528 means that weight explains about 75% of the variation in mpg.
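The \(R^2\) value can also be pulled out of the fitted model directly. The sketch below assumes cars_data is built from mtcars and checks the standard fact that, in simple linear regression, \(R^2\) equals the squared correlation between the two variables.

```r
# Assumption: cars_data is the mpg and wt columns of the built-in mtcars data
cars_data <- mtcars[, c("mpg", "wt")]
fit <- lm(mpg ~ wt, data = cars_data)

# R-squared as reported by summary()
r2 <- summary(fit)$r.squared

# Squared Pearson correlation between wt and mpg
r2_from_cor <- cor(cars_data$wt, cars_data$mpg)^2

r2
r2_from_cor
```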

Another Mathematical View

The line of best fit is chosen by minimizing the sum of squared residuals:

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

A residual is the difference between an observed value and a predicted value:

\[ e_i = y_i - \hat{y}_i \]

Smaller residuals mean the model predictions are closer to the observed data.
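To make the least-squares idea concrete, the sketch below computes the slope and intercept from the standard closed-form formulas \(b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(b_0 = \bar{y} - b_1 \bar{x}\), then the residuals and SSE, and compares the result with lm(). The variable names and the construction of cars_data are assumptions for illustration.

```r
# Assumption: cars_data is the mpg and wt columns of the built-in mtcars data
cars_data <- mtcars[, c("mpg", "wt")]
x <- cars_data$wt
y <- cars_data$mpg
n <- length(y)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Fitted values, residuals, and the sum of squared residuals
y_hat <- b0 + b1 * x
e <- y - y_hat
sse <- sum(e^2)

# Residual standard error uses n - 2 degrees of freedom
rse <- sqrt(sse / (n - 2))

# Compare the hand-computed estimates with lm()
fit <- lm(mpg ~ wt, data = cars_data)
all.equal(c(b0, b1), unname(coef(fit)))
```

The value of rse corresponds to the residual standard error line in the regression summary.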

ggplot: Residual Histogram

This plot helps us look at the distribution of prediction errors.

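Least-squares residuals always average to 0 when the model includes an intercept, so the useful check is whether they are roughly symmetric around 0 with no extreme values.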
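A residual histogram of this kind might be drawn as follows. This is a sketch, not the code behind the original figure: the number of bins, the labels, and the construction of cars_data are all assumptions.

```r
library(ggplot2)

# Assumption: cars_data is the mpg and wt columns of the built-in mtcars data
cars_data <- mtcars[, c("mpg", "wt")]
fit <- lm(mpg ~ wt, data = cars_data)

# Collect the residuals in a data frame for plotting
resid_data <- data.frame(residual = resid(fit))

# Histogram of residuals with a dashed reference line at 0
p_hist <- ggplot(resid_data, aes(x = residual)) +
  geom_histogram(bins = 10, color = "white") +   # bins = 10 is an arbitrary choice
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(
    title = "Distribution of Residuals",
    x = "Residual (mpg)",
    y = "Count"
  ) +
  theme_minimal()

p_hist
```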

Interactive Plotly Display

This interactive graph makes it easier to inspect each car individually.
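One way to build such an interactive display is with plot_ly() from the plotly package. The original plotting code is not shown, so everything below is an assumption, including the car column added to hold the row names for the hover text.

```r
library(plotly)

# Assumption: cars_data is the mpg and wt columns of the built-in mtcars data
cars_data <- mtcars[, c("mpg", "wt")]

# Hypothetical helper column: car names taken from the row names
cars_data$car <- rownames(cars_data)

# Interactive scatterplot; hovering over a point shows the car name, wt, and mpg
fig <- plot_ly(
  data = cars_data,
  x = ~wt,
  y = ~mpg,
  text = ~car,
  type = "scatter",
  mode = "markers",
  hoverinfo = "text+x+y"
)

# Axis titles and plot title
fig <- layout(
  fig,
  title = "Fuel Efficiency and Vehicle Weight",
  xaxis = list(title = "Weight (1000 pounds)"),
  yaxis = list(title = "Miles per Gallon")
)

fig
```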

Example of R Code

The code below produces the scatterplot with the regression line.

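library(ggplot2)

# Assumption: cars_data holds the mpg and wt columns of mtcars (its definition is not shown here)
cars_data <- mtcars[, c("mpg", "wt")]

# Scatterplot of mpg against weight
ggplot(cars_data, aes(x = wt, y = mpg)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE) +   # least-squares line with its confidence band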
  labs(
    title = "Fuel Efficiency and Vehicle Weight",
    x = "Weight (1000 pounds)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

This is a simple example of how statistical graphics can be created directly in R.

Final Thoughts

Simple linear regression is useful because it helps explain and predict relationships between variables.

From this example, we learned that:

weight and fuel efficiency are strongly related: weight explains about 75% of the variation in mpg
the relationship is negative
regression gives both a visual and mathematical summary of the data

This method is widely used in statistics, business, engineering, and data science.