What is Simple Linear Regression?

Simple Linear Regression (SLR) is a statistical method used to model the relationship between two continuous variables:

  • One independent variable (or predictor), denoted as \(X\).
  • One dependent variable (or response), denoted as \(Y\).

The Goal: To find a linear equation that best predicts the value of \(Y\) given a value of \(X\).

For example, can we predict…

  • A student’s exam score (\(Y\)) based on the hours they studied (\(X\))?
  • A car’s fuel efficiency (\(Y\)) based on its weight (\(X\))?

Finding the “Best” Line

We find the “best” values for \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) of the fitted line \(\hat{y}_i = \beta_0 + \beta_1 x_i\) by using the Method of Least Squares.

Each difference between an actual observed value (\(y_i\)) and its predicted value (\(\hat{y}_i\)) is called a residual. The method minimizes the sum of the squared residuals, so our goal is to find the \(\beta_0\) and \(\beta_1\) that minimize this function:

\[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2\]
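As a rough illustration of what least squares computes, the base-R sketch below evaluates the standard closed-form solution for a single predictor, \(\beta_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\beta_0 = \bar{y} - \beta_1 \bar{x}\), and compares it with `lm()`. The names `x`, `y`, `b0`, `b1`, `sse`, and `fit_check` are illustrative choices, and the built-in mtcars data (used in the example that follows) serves only as sample input.

```r
# Sample data: car weight (x) and fuel efficiency (y) from the built-in mtcars dataset
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least-squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

# Sum of squared residuals for any candidate line (the function being minimized)
sse <- function(beta0, beta1) sum((y - (beta0 + beta1 * x))^2)

# lm() should arrive at the same estimates
fit_check <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit_check))

# Moving the slope away from b1 should increase the sum of squares
print(sse(b0, b1) < sse(b0, b1 + 0.1))
```

Comparing `sse()` at the fitted values with a slightly perturbed slope is a quick sanity check that the estimates sit at the minimum; it is not a proof.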

Example: mtcars Dataset

Let’s use the built-in R dataset mtcars to answer a question: “Can we predict a car’s fuel efficiency (mpg) based on its weight (wt)?”

  • Dependent Variable (\(Y\)): mpg (Miles Per Gallon)
  • Independent Variable (\(X\)): wt (Weight in 1000s of lbs)

We will fit the model:

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{wt}\]
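Later steps refer to objects named `fit` and `aug` created in a setup chunk that is not shown here. The following is one plausible reconstruction, not the original code: it assumes `fit` is an ordinary `lm()` fit and builds `aug` in base R as a data frame with `.fitted` and `.resid` columns. The original may well have used a helper package for this.

```r
# Hypothetical reconstruction of the setup chunk (the original is not shown)
fit <- lm(mpg ~ wt, data = mtcars)

# Base-R stand-in for an "augmented" data frame:
# the model variables plus fitted values and residuals
aug <- data.frame(
  mtcars[, c("mpg", "wt")],
  .fitted = fitted(fit),  # predicted mpg for each car
  .resid  = resid(fit)    # actual mpg minus predicted mpg
)

print(head(aug))
```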

Step 1: Plot the Data (ggplot 1)
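The plot for this step is not reproduced here. A minimal sketch of a scatter plot of mpg against wt, assuming the ggplot2 package is installed, might look like this. The object name `p1`, the title, and the styling are illustrative choices rather than the original code.

```r
library(ggplot2)

# Scatter plot: each point is one car in mtcars
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.7, color = "#007A87") +
  labs(title = "Fuel Efficiency vs. Weight",
       x = "Weight (1000s of lbs)",
       y = "Miles Per Gallon (mpg)") +
  theme_minimal(base_size = 14)

print(p1)
```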

Step 2: Add the Regression Line (ggplot 2)
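The plot for this step is also missing. One common way to overlay the least-squares line, assuming ggplot2 is installed, is `geom_smooth(method = "lm")`, which fits the same kind of model as `lm(mpg ~ wt)` behind the scenes. The object name `p2` and the styling are illustrative.

```r
library(ggplot2)

# Scatter plot plus the fitted least-squares line
p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.7, color = "#007A87") +
  # method = "lm" draws the linear fit; se = TRUE adds a confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red") +
  labs(title = "Fuel Efficiency vs. Weight with Regression Line",
       x = "Weight (1000s of lbs)",
       y = "Miles Per Gallon (mpg)") +
  theme_minimal(base_size = 14)

print(p2)
```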

Step 3: Make it Interactive (plotly)
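The interactive version is not reproduced here either. A minimal sketch, assuming both ggplot2 and plotly are installed, passes a ggplot object to `plotly::ggplotly()` to get hover tooltips, zooming, and panning. The names `p` and `p_interactive` are illustrative.

```r
library(ggplot2)
library(plotly)

# Build the static plot first
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.7, color = "#007A87") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red") +
  labs(x = "Weight (1000s of lbs)", y = "Miles Per Gallon (mpg)") +
  theme_minimal(base_size = 14)

# Convert it to an interactive htmlwidget
p_interactive <- ggplotly(p)
p_interactive
```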

Step 4: Interpret the Results

# 'fit' was calculated in the setup chunk
summary(fit)$coefficients

  • Intercept (\(\beta_0\)), approximately 37.3: A (hypothetical) car weighing 0 lbs would be predicted to get 37.3 MPG. This point anchors the line but isn't physically meaningful.
  • wt (\(\beta_1\)), approximately -5.34: For every 1 unit increase in wt (i.e., for every 1000 lbs heavier), the car's predicted fuel efficiency decreases by 5.34 MPG on average.
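To see the two coefficients working together, consider a car weighing 3000 lbs (wt = 3). With the rounded figures quoted above, the predicted efficiency is roughly 37.3 - 5.34 × 3 ≈ 21.3 MPG. The sketch below does the same calculation with the unrounded coefficients and with `predict()`. It refits the model so that it runs on its own, and the names `manual` and `pred` are illustrative.

```r
# Refit here so the example is self-contained
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)

# Manual prediction: intercept + slope * weight (in 1000s of lbs)
manual <- unname(b[1] + b[2] * 3)

# Same prediction via predict()
pred <- unname(predict(fit, newdata = data.frame(wt = 3)))

print(round(c(manual = manual, predict = pred), 2))
```

A weight of 3 lies well inside the range of weights in mtcars, so this is an interpolation. Predictions far outside the observed range (such as wt = 0) deserve much less trust.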

Step 5: Check Assumptions

# The 'aug' object was created in the setup chunk
# .fitted = predicted values
# .resid = residual values
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.7, color = "#007A87") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Fitted Values",
       x = "Fitted (Predicted) MPG",
       y = "Residuals (Actual - Predicted)") +
  theme_minimal(base_size = 14)