2025-03-15

What is Simple Linear Regression?

Simple linear regression (SLR) is a statistical method used to study and estimate relationships between two continuous variables, i.e., the predictor variable (x) and the response variable (y).

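The chief type of relationship SLR addresses is the statistical relationship, in which the relationship between the variables is not exact: the response tends to change with the predictor, but individual observations scatter around the trend.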

Examples:

  • Caloric Intake and Weight Gain
  • Hours of Study and Test Scores
  • Drug Dosage and Blood Pressure

Simple Linear Regression Model

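The SLR model relates two continuous variables by: \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]

The “best fitting line” that summarizes the trend is the estimated version of this model, which has no error term: \[\hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 x_i\]

Where:

  • \(y_i\) is the observed response for a unit \(i\)
  • \(\hat{y_i}\) is the predicted response for a unit \(i\)
  • \(x_i\) is the predictor value for a unit \(i\)
  • \(\varepsilon_i\) is the error term, the deviation of \(y_i\) from the true line
  • \(\beta_0\) is the intercept parameter, estimated by \(\hat{\beta}_0\)
  • \(\beta_1\) is the slope parameter, estimated by \(\hat{\beta}_1\)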

Assumptions

The SLR model relies on four key assumptions:

  1. Linearity - The mean of the response variable must be a linear (straight-line) function of the predictor variable
  2. Independence - The error terms (\(\varepsilon_i\)) must be independent of one another across observations
  3. Normality - The error terms (\(\varepsilon_i\)) must follow a normal distribution with mean zero at each value of the predictor variable
  4. Equal Variance - The error terms (\(\varepsilon_i\)) must maintain constant variance across all values of the predictor variable
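
To make the four assumptions concrete, the sketch below simulates data that satisfy them and then fits a line to the result. The parameter values (intercept 2, slope 0.5, error standard deviation 1), the sample size, and all variable names are illustrative assumptions, not values taken from a real data-set.

```r
set.seed(42)                        # Make the simulation reproducible

n      <- 1000                      # Hypothetical sample size
beta_0 <- 2                         # Hypothetical true intercept
beta_1 <- 0.5                       # Hypothetical true slope

x   <- runif(n, min = 0, max = 10)  # Predictor values
eps <- rnorm(n, mean = 0, sd = 1)   # Independent, normal, constant-variance errors
y   <- beta_0 + beta_1 * x + eps    # Linear mean plus error

sim_fit <- lm(y ~ x)                # Fit the SLR model to the simulated data
coef(sim_fit)                       # Estimates should land close to 2 and 0.5
```

Because the data were generated exactly as the model describes, the estimated intercept and slope should be close to the values used in the simulation.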

Least Squares Method

The Least Squares Method is used to choose the line that “best fits” the data. It picks the intercept and slope that minimize the sum of the squared prediction errors, \(Q\).

\[Q=\sum_{i=1}^{n}(y_i-\hat{y_i})^2\]

Where:

  • \(e_i=y_i-\hat{y_i}\) is the prediction error for observation \(i\)
  • \(e^2_i=(y_i-\hat{y_i})^2\) is the squared prediction error for observation \(i\)
  • \(\sum_{i=1}^{n}\) sums the squared prediction errors over all \(n\) observations
  • \(Q\) is the sum of squared prediction errors
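
For SLR, the values that minimize \(Q\) have a well-known closed form: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the two means. A minimal sketch in R, using a small made-up data-set (the numbers and variable names are hypothetical), compares these formulas with R's built-in lm():

```r
# Hypothetical toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared prediction errors Q for a candidate line
Q <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Closed-form least squares estimates of slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() should return the same intercept and slope
lm_coef <- unname(coef(lm(y ~ x)))
all.equal(c(b0, b1), lm_coef)

# Nudging either coefficient away from the estimate increases Q
Q(b0, b1) < Q(b0 + 0.1, b1)
Q(b0, b1) < Q(b0, b1 + 0.1)
```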

The mtcars Data-set

For an example of SLR in practice, we can use the programming language R. Built into R is the mtcars data-set, which we will use to create a plot and add a linear regression line.

data(mtcars) # Load the data-set
head(mtcars) # Display a few rows
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Application Using ggplot

The ggplot2 library allows easy creation of visualizations in R. Below is example code to plot horsepower (HP) against miles per gallon (MPG). HP and MPG are the hp and mpg columns of the mtcars data-set.

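library(ggplot2) # Provides ggplot(), aes(), geom_point(), etc.

# Scatter plot with horsepower on the x-axis and MPG on the y-axis
plot = ggplot(mtcars, aes(x=hp, y=mpg)) + 
  geom_point() +
  ggtitle("Horsepower versus Miles Per Gallon") + 
  xlab("Horsepower") + 
  ylab("Miles Per Gallon")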

Plot of HP vs MPG

We can observe a negative (inverse) relationship between HP and MPG: cars with more horsepower tend to get fewer miles per gallon.
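
The direction of this relationship can also be checked numerically. As a small sketch, the Pearson correlation coefficient from base R's cor() is below zero when one variable tends to fall as the other rises (the variable name hp_mpg_cor is just illustrative):

```r
data(mtcars)

# Pearson correlation between horsepower and miles per gallon
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)
hp_mpg_cor   # A value below zero indicates a negative relationship
```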

Adding Linear Regression

Similarly, it is easy to add a line of best fit with ggplot2. Below is sample code to add a linear regression line to the plot previously created. By default, geom_smooth() shades a 95% confidence interval around the line in gray. To display the equation for the line of best fit, use the extension library ‘ggpmisc’.

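library(ggpmisc) # Extension providing stat_poly_eq() and use_label()

model = plot + 
  geom_smooth(method="lm") + # Fit and draw the least squares line
  # Use 'ggpmisc' library to show equation
  stat_poly_eq(
    use_label("eq"), 
    parse = TRUE, label.x = 0.95, label.y = 0.95
  ) +
  geom_point() # Redraw the points on top of the line and band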

Resulting Plot
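
The line drawn by geom_smooth(method="lm") comes from a linear model fit, so the same intercept, slope, and confidence band can be obtained directly with lm() and predict() from base R. A short sketch (the names hp_fit, new_cars, and band, and the horsepower values 100, 150, and 200, are illustrative choices):

```r
data(mtcars)

hp_fit <- lm(mpg ~ hp, data = mtcars)   # Regress MPG on horsepower
coef(hp_fit)                            # Intercept and slope of the fitted line

# Predicted mean MPG with a 95% confidence interval at a few horsepower values
new_cars <- data.frame(hp = c(100, 150, 200))
band <- predict(hp_fit, newdata = new_cars,
                interval = "confidence", level = 0.95)
band                                    # Columns: fit, lwr, upr
```

The lwr and upr columns correspond to the lower and upper edges of the band at those horsepower values.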

Assessing Normality: Residuals (Using plotly)

To check whether the error terms follow an approximately normal distribution, we can use another visualization library, plotly. Below is an example of how we can use plotly to create a histogram of the estimated error terms (residuals).

library(plotly) # Provides plot_ly(), layout(), and the %>% pipe

fit = lm(mpg ~ hp, data = mtcars) # Fit the SLR model
res = residuals(fit) # Calc. residuals (avoids reusing the function name)
res_plot = plot_ly(x = ~res, type = "histogram") %>%
  layout(title = "Histogram of Residuals (HP vs. MPG)",
         xaxis = list(title = "Residuals"),
         yaxis = list(title = "Count"))

Resulting Histogram
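
A histogram's shape depends on the bin width, so a normal quantile-quantile (Q-Q) plot is a common complement. The sketch below uses base R's qqnorm() and qqline() rather than plotly, a substitution made here for brevity. Points lying close to the reference line suggest approximately normal residuals.

```r
data(mtcars)

# Residuals from the same HP vs. MPG model used for the histogram
qq_res <- residuals(lm(mpg ~ hp, data = mtcars))

# Sample quantiles of the residuals against theoretical normal quantiles
qqnorm(qq_res, main = "Normal Q-Q Plot of Residuals (HP vs. MPG)")
qqline(qq_res)   # Reference line through the first and third quartiles

# Residuals from a least squares fit with an intercept average to zero
mean(qq_res)
```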

Changing Predictor Variable

Let’s do another example, this time analyzing the relationship between weight and miles per gallon.

plot_2 = ggplot(mtcars, aes(x=wt, y=mpg)) + 
  geom_point() +
  ggtitle("Weight versus Miles Per Gallon") + 
  xlab("Weight (1000 lbs)") + 
  ylab("Miles per Gallon")

Plot of Weight vs. MPG

Weight vs. MPG (Linear Regression)
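
As a sketch of how a weight-based fit could be reproduced without extra libraries, the code below uses base R graphics instead of ggplot2 (a substitution made here, with wt_fit as an illustrative name). abline() accepts a fitted lm object and draws its line on the current plot.

```r
data(mtcars)

wt_fit <- lm(mpg ~ wt, data = mtcars)   # Regress MPG on weight

# Base R scatter plot with the fitted line drawn on top
plot(mtcars$wt, mtcars$mpg,
     main = "Weight versus Miles Per Gallon",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
abline(wt_fit)                          # Adds the least squares line

coef(wt_fit)                            # Intercept and slope of the line
```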

Summary

  • Simple linear regression can be used to analyze and estimate relationships between two continuous variables
  • There are four assumptions to consider before performing SLR on data:
    • Linearity
    • Independence
    • Normality
    • Equal Variance
  • While you can implement SLR yourself, R has numerous libraries (ggplot2, plotly) that assist with fitting and visualizing it