# Global options for R chunks
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = TRUE)

# Load essential libraries
library(ggplot2)
library(dplyr)

Introduction: Modeling Relationships

Welcome to this module on Simple Linear Regression (SLR). Our goal in regression is to formally model the linear relationship between two quantitative variables:

  1. Response Variable (\(y\)): The variable we are trying to predict or explain.

  2. Explanatory Variable (\(x\)): The variable used to predict the response.

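We will use built-in R datasets for demonstration. The scatterplot section uses mtcars, examining the relationship between vehicle weight (wt, in 1000s of lbs) and fuel efficiency (mpg, miles per gallon); the correlation and regression sections use cars, which records speed (speed, in mph) and stopping distance (dist, in ft).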

# Display the first few rows of the relevant variables
mtcars %>% 
  select(wt, mpg) %>%
  head()

1. Visualization: Scatterplots

Before modeling, we must always visualize the relationship. A scatterplot is the foundational tool for this, allowing us to assess the form (linear, curved), direction (positive, negative), and strength of the association, as well as identify potential outliers.

We use ggplot2 to create an elegant and informative scatterplot.

# R Code: Creating the Scatterplot
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "#0D47A1") +
  labs(
    title = "Fuel Efficiency vs. Vehicle Weight",
    x = "Vehicle Weight (1000s of lbs)",
    y = "Miles Per Gallon (MPG)"
  ) +
  theme_minimal(base_size = 14)

Observation: The plot suggests a strong, negative, and reasonably linear relationship. As weight increases, MPG tends to decrease.

2. Quantifying Association: Correlation

The Pearson product-moment correlation coefficient, r, quantifies the strength and direction of the linear association between two quantitative variables.

  • r is always between −1 and +1.

  • r > 0 indicates a positive association.

  • r < 0 indicates a negative association.

  • |r| close to 1 indicates a strong linear relationship.

  • |r| close to 0 indicates a weak linear relationship.

Formula:

\(r_{xy}=\frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2}\sum{(Y_i - \bar{Y})^2}}}\)

Equivalently, r is the sum of the products of the standardized values (z-scores) of x and y, divided by \(n - 1\), which makes it very nearly their average product.
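The z-score description can be checked numerically. The following is a minimal sketch, assuming the built-in cars data (speed and dist) and standardization with the sample standard deviation; the variable names are illustrative only.

```r
# Sketch: two hand computations of r, compared against cor()
x <- cars$speed
y <- cars$dist
n <- length(x)

# Standardize each variable using the sample standard deviation (divisor n - 1)
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum of z-score products, divided by n - 1
r_from_z <- sum(z_x * z_y) / (n - 1)

# Direct evaluation of the sums-of-deviations formula
r_from_sums <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Both should agree with cor() up to floating-point error
all.equal(r_from_z, cor(x, y))
all.equal(r_from_sums, cor(x, y))
```

Dividing by n instead of \(n - 1\) would give a slightly smaller value, which is why "average product" is only approximately right for sample data.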

Correlation does not imply causation!

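Let’s calculate r, switching to the built-in cars dataset, which records speed (speed, in mph) and stopping distance (dist, in ft) for 50 cars.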

# Calculate the correlation coefficient
correlation_r <- cor(cars$speed, cars$dist)

cat("The correlation coefficient (r) is:", round(correlation_r, 4), "\n")
The correlation coefficient (r) is: 0.8069 

Interpretation: An r of about 0.81 indicates a strong, positive linear relationship: faster cars tend to need longer stopping distances.

3. Simple Linear Regression (SLR): Modeling the Relationship

A Simple Linear Regression model fits a straight line to the data to predict a response variable (Y) from an explanatory variable (X).

The model is defined by the Least-Squares Regression Line (LSRL):

\[ \hat{y} = b_0 + b_1 x \]

  • \(\hat{y}\) is the predicted value of the response variable.

  • \(b_0\) is the y-intercept (the predicted value of Y when X=0).

  • \(b_1\) is the slope (the change in \(\hat{y}\) for a one-unit increase in \(X\)).

The line is found by minimizing the sum of the squared vertical distances (the residuals, \(e_i = y_i - \hat{y}_i\)) from the points to the line.
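To make the "least-squares" idea concrete, the sketch below compares the sum of squared residuals for the fitted line against two slightly different lines. It assumes the cars data and uses lm() (covered later in this module) to obtain the least-squares coefficients; the helper sse() and the nudge amounts are hypothetical choices for illustration.

```r
# Sum of squared residuals for a candidate line y-hat = b0 + b1 * x
sse <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# Least-squares coefficients from lm()
fit <- lm(dist ~ speed, data = cars)
b <- unname(coef(fit))

# SSE at the least-squares line
sse_ls <- sse(b[1], b[2], cars$speed, cars$dist)

# SSE for two nearby lines (arbitrary nudges to slope and intercept)
sse_steeper <- sse(b[1], b[2] + 0.5, cars$speed, cars$dist)
sse_shifted <- sse(b[1] + 5, b[2], cars$speed, cars$dist)

c(least_squares = sse_ls, steeper = sse_steeper, shifted = sse_shifted)
```

Any other line gives a larger sum of squared residuals than the least-squares line, which is what defines it.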

Calculating the Coefficients

The least-squares estimates for the slope (\(b_1\)) and intercept (\(b_0\)) are:

  • \(b_1 = r\left(\frac{s_y}{s_x}\right)\)

  • \(b_0 = \bar{y} - b_1\bar{x}\)
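These estimates can be computed by hand from summary statistics. The sketch below assumes the cars data, with the slope taken as r times the ratio of the standard deviation of y to that of x; the variable names are illustrative.

```r
# Sketch: slope and intercept from summary statistics
x <- cars$speed   # explanatory variable
y <- cars$dist    # response variable

r  <- cor(x, y)
b1 <- r * sd(y) / sd(x)        # slope: r times (s_y / s_x)
b0 <- mean(y) - b1 * mean(x)   # intercept: line passes through (x-bar, y-bar)

c(intercept = b0, slope = b1)

# Compare with the coefficients returned by lm()
coef(lm(dist ~ speed, data = cars))
```

The two sets of coefficients should agree up to floating-point error.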

Running the Model in R

In R, we use the lm() (linear model) function.

# Fit the linear model: dist is predicted by speed
slr_model <- lm(dist ~ speed, data = cars)

# View the model summary
summary(slr_model)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Focus on the Coefficients:

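  • The Intercept (\(b_0 = -17.58\)) is the predicted stopping distance at a speed of 0 mph. A negative distance is not physically meaningful: 0 mph lies outside the observed speeds (4 to 25 mph), so the intercept mainly positions the line rather than describing a real car.

  • The Slope (\(b_1 = 3.93\)) is the estimated increase in mean stopping distance, about 3.93 ft, for every 1 mph increase in speed.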

Prediction Example

What is the predicted stopping distance for a car traveling at 15 mph?

\[ \hat{y} = b_0 + b_1(15) \]

# Use the predict function for a 15 mph car
new_data <- data.frame(speed = 15)
predicted_dist <- predict(slr_model, new_data)

cat("Predicted stopping distance at 15 mph:", round(predicted_dist, 2), "ft\n")
Predicted stopping distance at 15 mph: 41.41 ft
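As a cross-check, the same prediction can be reproduced by plugging the fitted coefficients into the regression equation. This is a small sketch that refits the model so it runs on its own; manual_pred and predict_pred are illustrative names.

```r
# Refit the model so the example is self-contained
slr_model <- lm(dist ~ speed, data = cars)

# Extract b0 and b1 and evaluate y-hat = b0 + b1 * 15 by hand
b <- unname(coef(slr_model))
manual_pred <- b[1] + b[2] * 15

# Same prediction via predict()
predict_pred <- unname(predict(slr_model, data.frame(speed = 15)))

all.equal(manual_pred, predict_pred)
round(manual_pred, 2)
```

A speed of 15 mph lies inside the range of the observed data, so this is interpolation rather than extrapolation.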

Visualizing the Model

It’s good practice to add the regression line to the scatterplot.

# R Markdown Chunk: Plot with Regression Line
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "darkblue", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "red") + # Add the LSRL
  labs(
    title = "SLR of Stopping Distance on Speed",
    x = "Speed (mph)",
    y = "Stopping Distance (ft)"
  ) +
  theme_minimal()

4. Inference for Regression: Making Generalizations

Inference lets us move beyond describing the sample data to drawing conclusions about the larger population.

Assumptions (The “LINE” Conditions)

For valid inference, the simple linear regression model assumes:

  1. Linearity: The relationship between X and Y is linear.

  2. Independence: The observations are independent.

  3. Normality: For any fixed X, the distribution of Y (and thus the residuals) is normal.

  4. Equal Variance (Homoscedasticity): The variability of Y (and thus the residuals) is the same across all X values.

We check these assumptions primarily through residual plots and Normal Q-Q plots.
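One possible way to produce these diagnostic plots is sketched below. It assumes the cars model from earlier and uses base R graphics to keep dependencies minimal (ggplot2 would work equally well); diag_df is an illustrative name.

```r
# Refit the model so the example is self-contained
slr_model <- lm(dist ~ speed, data = cars)

# Collect fitted values and residuals
diag_df <- data.frame(
  fitted = fitted(slr_model),
  resid  = resid(slr_model)
)

old_par <- par(mfrow = c(1, 2))  # two plots side by side

# Residuals vs. fitted: look for curvature (linearity) and
# fanning (unequal variance) around the zero line
plot(diag_df$fitted, diag_df$resid,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line support normality
qqnorm(diag_df$resid, main = "Normal Q-Q Plot of Residuals")
qqline(diag_df$resid)

par(old_par)  # restore previous graphics settings
```

Independence usually cannot be read off these plots; it is judged from how the data were collected.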

Hypothesis Testing for the Slope

The most common inference test is whether the slope in the population, \(\beta_1\), is zero. If \(\beta_1 = 0\), there is no linear relationship between \(X\) and \(Y\).

  • Null Hypothesis (\(H_0\)): \(\beta_1 = 0\) (No linear relationship)

  • Alternative Hypothesis (\(H_a\)): \(\beta_1 \ne 0\) (A linear relationship exists)

The summary(slr_model) output provides the t-statistic and p-value for this test.

T-statistic for \(\beta_1\)

\[ t = \frac{(b_1 - \beta_{1,0})}{SE_{b_1}}\]

where \(\beta_{1,0}\) is the hypothesized value (usually 0).

In the summary() output, look at the row for speed:

# R Markdown Chunk: Inference Interpretation
# Focus on the 'Coefficients' table from summary(slr_model)
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -17.5791   6.7584  -2.601   0.0123 *
# speed         3.9324   0.4155   9.464   1.49e-12 ***

  • t-value (9.464): The number of standard errors the sample slope (\(b_1\)) is from zero. A large absolute t is evidence against \(H_0\).

  • Pr(>|t|) (\(1.49 \times 10^{-12}\)): The p-value. Since this is extremely small (much less than \(\alpha = 0.05\)), we reject \(H_0\).

  • Conclusion: We have strong evidence that the true population slope (\(\beta_1\)) is not zero, meaning there is a statistically significant linear relationship between speed and stopping distance.
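The t-value and p-value in that row can be reproduced from the estimate and its standard error. The sketch below assumes a two-sided test with \(n - 2\) residual degrees of freedom, matching the summary() output; the variable names are illustrative.

```r
# Refit the model so the example is self-contained
slr_model <- lm(dist ~ speed, data = cars)
coefs <- summary(slr_model)$coefficients

b1       <- coefs["speed", "Estimate"]
se_b1    <- coefs["speed", "Std. Error"]
df_resid <- df.residual(slr_model)   # n - 2 = 48

# t = (b1 - hypothesized value) / SE, with hypothesized value 0
t_stat <- (b1 - 0) / se_b1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

c(t = t_stat, df = df_resid, p = p_value)
```

Replacing the 0 in t_stat with another hypothesized slope would test a different null value, which summary() does not report directly.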

Confidence Intervals for the Slope

A Confidence Interval (CI) provides a range of plausible values for the true population slope \(\beta_1\).

# Calculate the 95% Confidence Interval for the coefficients
confint(slr_model, level = 0.95)
                 2.5 %    97.5 %
(Intercept) -31.167850 -3.990340
speed         3.096964  4.767853

Interpretation: We are 95% confident that for every 1 mph increase in speed, the true mean stopping distance increases by between about 3.10 and 4.77 ft. The interval does not contain 0, which agrees with rejecting \(H_0\) at \(\alpha = 0.05\).
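The interval for the slope can also be built by hand as estimate plus or minus a critical t-value times the standard error. This is a sketch assuming a 95% level and the residual degrees of freedom of the cars model; ci_manual and t_star are illustrative names.

```r
# Refit the model so the example is self-contained
slr_model <- lm(dist ~ speed, data = cars)
coefs <- summary(slr_model)$coefficients

b1    <- coefs["speed", "Estimate"]
se_b1 <- coefs["speed", "Std. Error"]

# Critical value t* for a 95% interval (2.5% in each tail)
t_star <- qt(0.975, df = df.residual(slr_model))

# b1 +/- t* x SE
ci_manual <- c(lower = b1 - t_star * se_b1,
               upper = b1 + t_star * se_b1)
ci_manual

# Should match the 'speed' row of confint()
confint(slr_model, "speed", level = 0.95)
```

Changing 0.975 to 0.995 (and level to 0.99) would give a wider 99% interval.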

Conclusion

We’ve covered the full framework of Simple Linear Regression:

  1. Visualize with a scatterplot.

  2. Quantify the linear relationship with correlation (r).

  3. Model the relationship with the LSRL using lm().

  4. Infer about the population slope (\(\beta_1\)) using t-tests and confidence intervals.

The next step is to apply this workflow to a new dataset and practice interpreting the summary() output.

