What is Linear Regression?

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.

Simple linear regression involves:

  • One dependent variable (response): \(Y\)
  • One independent variable (predictor): \(X\)
  • A linear relationship between them

The Model:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the error term.
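
To make the roles of \(\beta_0\), \(\beta_1\), and \(\varepsilon\) concrete, here is a minimal simulation sketch; the values 5 and 2 and the noise level are arbitrary illustrative choices:

set.seed(123)                       # reproducibility
n   <- 100
x   <- runif(n, 0, 10)              # predictor X
eps <- rnorm(n, mean = 0, sd = 2)   # error term with mean zero
y   <- 5 + 2 * x + eps              # Y = beta0 + beta1*X + eps, with beta0 = 5, beta1 = 2
coef(lm(y ~ x))                     # fitted estimates land close to (5, 2)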

The Mathematical Foundation

The population regression line represents the true relationship:

\[E(Y|X) = \beta_0 + \beta_1 X\]

The sample regression line estimates this relationship:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]

Residuals measure the difference between observed and predicted values:

\[e_i = Y_i - \hat{Y}_i\]

Least Squares Criterion: the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are chosen to minimize the sum of squared residuals:

\[\text{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\]
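
The minimizing values have well-known closed forms, \(\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). A short sketch verifying them against lm(), reusing the simulated x and y from above:

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)                           # agrees with coef(lm(y ~ x))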

Example Dataset

This analysis uses the mtcars dataset to explore the relationship between car weight and fuel efficiency.

data(mtcars)
head(mtcars[, c("wt", "mpg")], 5)
##                      wt  mpg
## Mazda RX4         2.620 21.0
## Mazda RX4 Wag     2.875 21.0
## Datsun 710        2.320 22.8
## Hornet 4 Drive    3.215 21.4
## Hornet Sportabout 3.440 18.7
cat("Correlation:", round(cor(mtcars$wt, mtcars$mpg), 3))
## Correlation: -0.868

Question: Is greater weight associated with lower fuel efficiency?

Visualizing the Relationship: Code

library(ggplot2)  # required for the ggplot plotting functions below

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "steelblue", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, 
              color = "red", fill = "pink") +
  labs(title = "Car Weight vs. Fuel Efficiency",
       x = "Weight (1000 lbs)", y = "MPG") +
  theme_minimal(base_size = 12)

Visualizing the Relationship: Results

Observation: The scatterplot shows a clear negative linear relationship: heavier cars have lower MPG.

Fitting the Linear Model

model <- lm(mpg ~ wt, data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Interpreting the Results

Regression Equation:

\[\widehat{\text{MPG}} = 37.29 - 5.34 \times \text{Weight}\]
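
For example, a hypothetical 3,000 lb car (\(\text{Weight} = 3\)) has predicted fuel efficiency \(37.29 - 5.34 \times 3 \approx 21.3\) MPG; the same prediction, with intervals, is computed with predict() later in this section.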

Key Findings (each value can also be extracted from the model object, as sketched below):

  • Intercept (37.29): Expected MPG when weight = 0; this is an extrapolation, since no car weighs 0, so it mainly anchors the line
  • Slope (-5.34): Each additional 1000 lbs is associated with a decrease of 5.34 MPG, on average
  • R² = 0.753: Weight explains 75.3% of the variation in MPG; for simple regression R² = r², and (-0.868)² ≈ 0.753
  • p-value < 0.001: The slope differs significantly from zero
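
A short sketch of that extraction (model is the lm fit from the previous slide):

coef(model)                   # estimated intercept and slope
confint(model, level = 0.95)  # 95% confidence intervals for the coefficients
summary(model)$r.squared      # R-squared (0.7528)
summary(model)$sigma          # residual standard error (3.046)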

3D Visualization of Regression

library(plotly)  # required for plot_ly() and the %>% pipe

wt_range <- seq(min(mtcars$wt), max(mtcars$wt), 
                length.out = 30)
mpg_pred <- predict(model, 
                    newdata = data.frame(wt = wt_range))

plot_ly() %>%
  # observed cars: actual MPG on y, fitted MPG on z
  add_trace(x = mtcars$wt, y = mtcars$mpg, 
            z = fitted(model), type = "scatter3d", 
            mode = "markers", 
            marker = list(size = 5, color = "blue"),
            name = "Observed") %>%
  # regression line: lies in the plane y = z, since predicted = fitted
  add_trace(x = wt_range, y = mpg_pred, z = mpg_pred,
            type = "scatter3d", mode = "lines",
            line = list(color = "red", width = 5),
            name = "Regression") %>%
  layout(scene = list(
    xaxis = list(title = "Weight (1000 lbs)"),
    yaxis = list(title = "MPG"),
    zaxis = list(title = "Fitted MPG")),
    title = "3D View: Observed vs. Fitted")

Model Diagnostics: Code

diag_data <- data.frame(
  fitted = fitted(model),
  residuals = residuals(model),
  standardized = rstandard(model)
)

p1 <- ggplot(diag_data, 
             aes(x = fitted, y = residuals)) +
  geom_point(color = "steelblue", size = 2) +
  geom_hline(yintercept = 0, color = "red", 
             linetype = "dashed") +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted", y = "Residuals") +
  theme_minimal(base_size = 10)

p2 <- ggplot(diag_data, 
             aes(sample = standardized)) +
  stat_qq(color = "steelblue", size = 2) +
  stat_qq_line(color = "red") +
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical", y = "Standardized") +
  theme_minimal(base_size = 10)

gridExtra::grid.arrange(p1, p2, ncol = 2)

Model Diagnostics: Results

Conclusion: The residuals scatter randomly around zero with no clear pattern, supporting the linearity and constant-variance assumptions, and the Q-Q plot shows approximate normality. The model assumptions are reasonably satisfied.
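
The same checks are also available through base R; a minimal sketch using the built-in lm diagnostics and a formal normality test:

plot(model, which = 1)          # Residuals vs. Fitted
plot(model, which = 2)          # Normal Q-Q
shapiro.test(residuals(model))  # H0: residuals are normally distributed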

Making Predictions

Prediction Formula:

For a new observation with \(X = x_0\):

\[\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0\]

Confidence Interval (for mean response):

\[\hat{Y}_0 \pm t_{\alpha/2, n-2} \cdot SE(\hat{Y}_0)\]

Prediction Interval (for individual response):

\[\hat{Y}_0 \pm t_{\alpha/2, n-2} \cdot SE(\text{pred})\]

Note: \(SE(\text{pred}) > SE(\hat{Y}_0)\) because an individual response carries its own error \(\varepsilon\) in addition to the uncertainty in the estimated line. With \(s\) the residual standard error and \(S_{xx} = \sum_{i=1}^{n} (X_i - \bar{X})^2\):

\[SE(\hat{Y}_0) = s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{X})^2}{S_{xx}}}, \qquad SE(\text{pred}) = s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{S_{xx}}}\]
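
Both intervals can be computed with predict(); a minimal sketch for the hypothetical 3,000 lb car used earlier (wt = 3 is an illustrative value):

new_car <- data.frame(wt = 3)
predict(model, newdata = new_car, interval = "confidence")  # CI for the mean MPG at wt = 3
predict(model, newdata = new_car, interval = "prediction")  # wider PI for one individual car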

Prediction Bands: Code

# Create prediction data
new_data <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))

# Get confidence and prediction intervals
pred_conf <- predict(model, newdata = new_data, 
                     interval = "confidence", level = 0.95)
pred_pred <- predict(model, newdata = new_data, 
                     interval = "prediction", level = 0.95)

# Combine into data frame
plot_data <- data.frame(
  wt = new_data$wt, fit = pred_conf[, "fit"],
  conf_lwr = pred_conf[, "lwr"], conf_upr = pred_conf[, "upr"],
  pred_lwr = pred_pred[, "lwr"], pred_upr = pred_pred[, "upr"])

# Create plot with prediction bands
ggplot() +
  geom_ribbon(data = plot_data, 
              aes(x = wt, ymin = pred_lwr, ymax = pred_upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = plot_data, 
              aes(x = wt, ymin = conf_lwr, ymax = conf_upr),
              fill = "blue", alpha = 0.4) +
  geom_line(data = plot_data, aes(x = wt, y = fit), 
            color = "red", linewidth = 1.2) +
  geom_point(data = mtcars, aes(x = wt, y = mpg), size = 2) +
  labs(title = "Confidence and Prediction Bands",
       x = "Weight (1000 lbs)", y = "MPG") +
  theme_minimal(base_size = 11)

Prediction Bands: Results

The darker blue band represents the 95% confidence interval for the mean response, while the wider, lighter blue band shows the 95% prediction interval for individual observations.

Key Takeaways

When to Use:

  • Exploring relationships between two continuous variables
  • Making predictions based on a single predictor
  • Testing hypotheses about linear associations

Limitations:

  • Assumes linear relationship
  • Sensitive to outliers
  • Only one predictor

Extensions (sketched in code after this list):

  • Multiple Linear Regression (multiple predictors)
  • Polynomial Regression (non-linear patterns)
  • Robust Regression (handling outliers)

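A minimal sketch of the extensions above, assuming the MASS package is available for rlm():

fit_multi  <- lm(mpg ~ wt + hp, data = mtcars)      # multiple linear regression
fit_poly   <- lm(mpg ~ poly(wt, 2), data = mtcars)  # quadratic (polynomial) regression
fit_robust <- MASS::rlm(mpg ~ wt, data = mtcars)    # robust regression via M-estimation
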
Tip: Always check assumptions and visualize your data!