2025-09-17

Introducing: Simple Linear Regression

Linear regression is a key tool for understanding data sets with predictor variables. In the real world, most data is influenced by several factors, but do not let that scare you away from analyzing the world around you. For this example we will build a simple linear regression model for a data set, using a single predictor variable. Once you get the hang of it, linear regression makes your data all the more interesting!

How to Calculate a Simple Linear Regression

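The formula for simple linear regression looks like this: \(Y = \beta_0 + \beta_1 X_1 + \epsilon\), where \(Y\) is the outcome variable, \(X_1\) is the predictor variable, \(\beta_0\) is the intercept (the expected value of \(Y\) when \(X_1 = 0\)), \(\beta_1\) represents the change in \(Y\) for every one-unit change in \(X_1\), and \(\epsilon\) is the random error. With only one predictor there are no other variables to hold constant; that qualification matters in multiple linear regression, where each coefficient describes how its predictor affects the outcome with the other predictors held fixed.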
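In practice \(\beta_0\) and \(\beta_1\) are unknown and have to be estimated from the data. As a brief sketch, assuming the ordinary least squares approach (the method R's `lm()` function uses), the estimates for \(n\) observations \((x_i, y_i)\) with sample means \(\bar{x}\) and \(\bar{y}\) are given below. The symbols \(x_i\), \(y_i\), \(n\), and the hats marking estimates are notation introduced for this sketch.

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

The slope is the sample covariance of the predictor and the outcome divided by the sample variance of the predictor, and the intercept is chosen so the fitted line passes through the point \((\bar{x}, \bar{y})\).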

Meet our data

The data set we are using in this example is called “mtcars”. It comes from the 1974 ‘Motor Trend’ magazine and has data on fuel consumption, horsepower, weight, and other variables that affect a car's performance.
For example, we can use this data to look at how miles per gallon varies with a car's weight.

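# plotly provides plot_ly(), layout(), and the %>% pipe
library(plotly)

# Interactive scatter plot of weight against miles per gallon
plot_ly(
  data = mtcars,
  x = ~wt, y = ~mpg,
  type = "scatter",
  mode = "markers"
) %>%
  layout(
    title = "Distribution of MPG by Car Weight",
    xaxis = list(title = "Weight (1,000 lb)"),
    yaxis = list(title = "Miles per Gallon (MPG)")
  )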

[Figure: scatter plot, “Distribution of MPG by Car Weight”]

Example of a Simple Linear Regression

This model summary shows the relationship between a car's weight and its miles per gallon.

# Fit mpg as a linear function of weight, then print the model summary
mtcars_wt <- lm(mpg ~ wt, data = mtcars)
summary(mtcars_wt)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
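
The two values in the “Estimate” column can be reproduced by hand, which is a useful check on what `lm()` is doing. The following is a minimal sketch, assuming the standard closed-form least squares formulas (slope equals covariance over variance); the variable names `x`, `y`, `b0`, `b1`, `by_hand`, and `from_lm` are chosen here for illustration.

```r
# Reproduce the lm() estimates by hand with the closed-form
# ordinary least squares formulas (mtcars ships with base R).
x <- mtcars$wt   # predictor: weight in 1,000 lb
y <- mtcars$mpg  # outcome: miles per gallon

# Slope: sample covariance of x and y divided by the sample variance of x
b1 <- cov(x, y) / var(x)

# Intercept: makes the line pass through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

by_hand <- c(intercept = b0, slope = b1)
from_lm <- unname(coef(lm(mpg ~ wt, data = mtcars)))

print(round(by_hand, 4))

# all.equal() compares the two results while tolerating floating-point noise
print(all.equal(unname(by_hand), from_lm))
```

If the hand calculation and `lm()` agree, the last line prints `TRUE`, and the rounded values should match the intercept and `wt` rows of the summary.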

Plotting Your Linear Regression

Let's analyze how weight affects a car's miles per gallon. Our equation can be found by looking at the values in the column titled “Estimate”: \(\hat{Y} = 37.2851 - 5.3445 X_1\), where \(X_1\) represents the car's weight in units of 1,000 lb and \(\hat{Y}\) is the predicted miles per gallon. Weight is a significant predictor variable for a car's miles per gallon (\(p \approx 1.29 \times 10^{-10}\)), with every 1,000 lb increase in weight lowering the expected miles per gallon by approximately 5.34. The R-squared of 0.7528 means that weight alone explains about 75% of the variation in miles per gallon.

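# ggplot2 provides ggplot(), geom_point(), geom_smooth(), and labs()
library(ggplot2)

# Scatter plot with the fitted least squares line overlaid
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(color = "red") +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    labs(
        title = "Simple Linear Regression: MPG by Weight",
        x = "Weight (1,000 lb)",
        y = "Miles per Gallon (MPG)"
    )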

[Figure: scatter plot with fitted line, “Simple Linear Regression: MPG by Weight”]

`geom_smooth()` using formula = 'y ~ x'
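
The fitted line is also a prediction rule: plug in a weight and read off the expected miles per gallon. Here is a small sketch using base R's `predict()`; the data frame `new_cars` and its three weights are hypothetical values made up for illustration, and `fit` is refitted so the block runs on its own.

```r
# Fit the same model as above so this block is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical cars weighing 2,500, 3,000 and 3,500 lb (wt is in 1,000 lb)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# predict() evaluates intercept + slope * wt for each row of new_cars
new_cars$predicted_mpg <- predict(fit, newdata = new_cars)

print(new_cars)
```

For the 3,000 lb car this works out to roughly 37.2851 - 5.3445 × 3 ≈ 21.25 miles per gallon. Predictions are most trustworthy for weights inside the range the model was fitted on; extrapolating far beyond the weights in mtcars can give unrealistic values.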

Simple Linear Regression: Non-Significant Variable

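The number of cylinders a car has is a non-significant variable in this data set once other predictors are taken into account (p ≈ 0.18 in the multiple regression later in this post). Fitted on its own, however, a simple linear regression of miles per gallon on the number of cylinders still produces a clear downward line, because cars with more cylinders also tend to be heavier. The graph below overlays that line on a scatter plot; note that the number of cylinders takes only three values (4, 6, and 8), so the points fall into three vertical bands. A simple linear regression can be useless, or even misleading, if the relationship it reports is not significant or does not hold up once other variables are considered.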

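library(ggplot2)

# Fit mpg as a linear function of the number of cylinders
mtcars_cyl <- lm(mpg ~ cyl, data = mtcars)

# Scatter plot with the fitted least squares line overlaid
ggplot(mtcars, aes(x = cyl, y = mpg)) +
    geom_point(color = "red") +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    labs(
        title = "Simple Linear Regression: MPG by Number of Cylinders",
        x = "Number of Cylinders",
        y = "Miles per Gallon (MPG)"
    )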

[Figure: scatter plot with fitted line, “Simple Linear Regression: MPG by Number of Cylinders”]

`geom_smooth()` using formula = 'y ~ x'
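
A plot alone cannot establish whether a predictor is significant. One way to check, sketched here with base R only, is to read the slope's p-value and the model's R-squared directly from the model summary. The 0.05 cutoff is a conventional threshold assumed for illustration, and the names `fit_cyl`, `coefs`, `p_cyl`, and `r2` are introduced for this example.

```r
# Refit the cylinders model so this block runs on its own
fit_cyl <- lm(mpg ~ cyl, data = mtcars)

# coef(summary(...)) returns the coefficient table as a numeric matrix
coefs <- coef(summary(fit_cyl))

# p-value for the slope on cyl, and the share of variance explained
p_cyl <- coefs["cyl", "Pr(>|t|)"]
r2    <- summary(fit_cyl)$r.squared

cat("slope p-value:", signif(p_cyl, 3), "\n")
cat("R-squared:", round(r2, 3), "\n")
cat("significant at the 0.05 level:", p_cyl < 0.05, "\n")
```

Running the same check on each candidate predictor, and then again inside a multiple regression, shows whether a variable's apparent effect survives once other variables are included.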

Multiple Linear Regression

When data sets get larger and the relationships become more complicated, it can be useful to analyze how several variables together affect an outcome. Multiple linear regression is a helpful tool when there are multiple significant variables, and it can also help determine which variables are the significant ones. Here is an example of a multiple linear regression that models miles per gallon using the car's weight, type of transmission, engine shape, and number of cylinders, gears, and carburetors.

# Fit mpg on six predictors at once, then print the model summary
mtcars_multi <- lm(mpg ~ wt + cyl + vs + am + gear + carb, data = mtcars)
summary(mtcars_multi)
Call:
lm(formula = mpg ~ wt + cyl + vs + am + gear + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6224 -1.1054 -0.3032  1.5267  5.3178 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.7632     8.0153   4.212 0.000287 ***
wt           -2.3913     1.0083  -2.372 0.025721 *  
cyl          -0.9629     0.7006  -1.374 0.181558    
vs            0.6684     1.8410   0.363 0.719622    
am            1.8291     1.8772   0.974 0.339187    
gear          0.3484     1.4096   0.247 0.806809    
carb         -0.8326     0.5508  -1.512 0.143157    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.588 on 25 degrees of freedom
Multiple R-squared:  0.8513,    Adjusted R-squared:  0.8157 
F-statistic: 23.86 on 6 and 25 DF,  p-value: 3.268e-09
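
With several predictors, scanning the stars in the summary by eye gets tedious. The sketch below pulls out the coefficients whose p-value falls under a chosen threshold. The threshold `alpha` of 0.05 is an assumption for illustration, as are the names `fit_multi`, `coef_table`, and `significant`.

```r
# Refit the multiple regression so this block is self-contained
fit_multi <- lm(mpg ~ wt + cyl + vs + am + gear + carb, data = mtcars)

# Coefficient table: one row per term, with estimate, SE, t value, p-value
coef_table <- coef(summary(fit_multi))

# Keep the terms whose p-value is below the chosen significance level
alpha <- 0.05
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < alpha]

print(significant)
print(round(coef_table[significant, , drop = FALSE], 4))
```

Going by the summary above, only the intercept and `wt` fall under 0.05. Note also that the `wt` estimate shrinks from -5.3445 in the simple model to -2.3913 here, because part of the effect attributed to weight alone is shared with the other predictors.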

Linear Regression: Can’t Live Without It

Simple linear regression helps us understand how one predictor variable relates to an outcome. For example, we determined that car weight has a strong negative relationship with miles per gallon. However, not every predictor stays significant: the number of cylinders a car has, for example, is not significant once weight and the other variables are included in the model (p ≈ 0.18).
When an outcome is influenced by multiple factors, multiple linear regression provides a more accurate analysis by taking several predictor variables into account at once.
As you can see, linear regression is a powerful tool for identifying significant relationships and making predictions from data.