Simple Linear Regression

2026-04-12

What is Linear Regression?

Linear Regression is used to model the association between two variables by fitting a straight line to the data
The fitted line can also indicate the direction and strength of the relationship between the variables
While well suited to linear data sets, Linear Regression is a poor fit for non-linear data such as hyperbolic or exponential relationships
The Linear Regression Model can also be used to predict new data points at a chosen level of statistical confidence

The Linear Regression Model

\[y = \beta_0 + \beta_1 x + \varepsilon\]

\(y\): The dependent variable
\(\beta_0\): The y-intercept
\(\beta_1\): The slope of the line
\(x\): The independent variable
\(\varepsilon\): The error term
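
To make the roles of the terms concrete, here is a minimal sketch that simulates data from the model and then recovers the coefficients with lm(). The intercept, slope, sample size and error spread are all assumed values chosen for illustration, not anything taken from a real dataset.

```r
# Simulate data from y = beta_0 + beta_1 * x + epsilon.
# Every number below is an assumed, illustrative value.
set.seed(42)
n      <- 500
beta_0 <- 2      # assumed y-intercept
beta_1 <- 0.5    # assumed slope
x      <- runif(n, min = 0, max = 10)   # independent variable
eps    <- rnorm(n, mean = 0, sd = 1)    # error term
y      <- beta_0 + beta_1 * x + eps     # dependent variable

# Fit the model; the estimates should land near the assumed 2 and 0.5
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because of the error term, the estimates will be close to the assumed coefficients but not exactly equal to them.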

Least Squares Formulas

\[\beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\]

\[\beta_0 = \bar{y} - \beta_1\bar{x}\]

\(\beta_1\): Calculated first since \(\beta_0\) depends on it
\(\bar{x}\) and \(\bar{y}\): The means of the \(x\) and \(y\) values in the dataset, respectively
These formulas give the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals
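
As a quick check, the two formulas can be applied by hand in R and compared with what lm() returns. This sketch assumes R's built-in mtcars data, with wt as \(x\) and mpg as \(y\); the variable names beta_0 and beta_1 are just labels for this example.

```r
# Apply the least squares formulas directly (x = wt, y = mpg from mtcars)
x <- mtcars$wt
y <- mtcars$mpg

# Slope first, since the intercept depends on it
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

c(beta_0 = beta_0, beta_1 = beta_1)

# lm() should report the same two numbers
coef(lm(mpg ~ wt, data = mtcars))
```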

mtcars dataset

We will be using the built-in mtcars dataset to demonstrate the Linear Regression Model in R
For the two variables, we will choose wt (weight of the car, in units of 1000 lbs) and mpg (average miles per gallon)
Intuitively, we expect the relationship to be negative (heavier cars having lower average mpg)
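
Before fitting anything, it can help to look at the two columns. The sketch below only uses base R; the name cars_subset is a made-up label for this example.

```r
# Keep only the two variables of interest from the built-in dataset
cars_subset <- mtcars[, c("wt", "mpg")]

nrow(cars_subset)      # number of cars in the dataset
head(cars_subset)      # first few rows
summary(cars_subset)   # range, quartiles and mean of each variable
```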

R function explanation

The lm() function takes the two variables (mpg as the response and wt as the predictor) and fits the Linear Regression
summary(model) provides a detailed summary of the Linear Regression and how well it fits the given data
The summary includes:
- The estimated values of \(\beta_0\) and \(\beta_1\)
- The \(R^2\) value: A number between 0 and 1 giving the proportion of the variance in mpg that the model explains
- The p-values: a p-value below 0.05 is conventionally treated as statistically significant
- Residual standard error: How far the data points typically fall from the Linear Regression line
- As well as other statistical information about the model
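
The individual pieces listed above can also be pulled out of the summary object instead of being read off the printed output. This is a small sketch using the standard components of a summary.lm object; the variable names on the left are only labels for this example.

```r
model <- lm(mpg ~ wt, data = mtcars)
model_summary <- summary(model)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(model_summary)
coef_table

r_squared <- model_summary$r.squared         # R^2
slope_p   <- coef_table["wt", "Pr(>|t|)"]    # p-value for the slope
rse       <- model_summary$sigma             # residual standard error

c(r_squared = r_squared, slope_p = slope_p, rse = rse)
```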

R Code

model <- lm(mpg ~ wt, data = mtcars)
summary(model)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
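
Once the model is fitted, predict() can be used to estimate mpg for a new weight. The sketch below assumes a hypothetical car weighing 3000 lbs (wt = 3) and a 95% level; both are illustrative choices. It requests a confidence interval for the mean mpg at that weight and a prediction interval for a single new car.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical new car: wt = 3 means 3000 lbs
new_car <- data.frame(wt = 3)

# Interval for the mean mpg of cars at this weight
conf_int <- predict(model, newdata = new_car, interval = "confidence", level = 0.95)

# Interval for the mpg of one individual new car at this weight
pred_int <- predict(model, newdata = new_car, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return the same fitted value (about 21.25 mpg, from 37.2851 - 5.3445 × 3), but the prediction interval is wider because it also accounts for the scatter of individual cars around the line.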

Model shown on a scatter plot

The Linear Regression line is shown in red and gives the best linear estimate of mpg from weight
The shaded area shows the confidence interval for the fitted line, that is, for the mean mpg at each weight (a narrower band means the line is estimated more precisely); it is not a prediction interval for individual new data points, which would be wider
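
A plot like the one described could be produced along these lines. This is a sketch that assumes the ggplot2 package is installed; the axis labels and the object name p are choices made for this example and may differ from the original plot.

```r
library(ggplot2)

# Scatter plot of the data with the fitted regression line in red.
# se = TRUE draws the shaded confidence band around the line.
# With no formula argument, geom_smooth() reports the default it uses.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p
```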

## `geom_smooth()` using formula = 'y ~ x'

Interactive Plot

This interactive plot_ly graph shows the individual values of the data points. Give it a try and see how well the Regression Model fits visually
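
An interactive version could be built roughly as follows. This sketch assumes the plotly package is installed; the trace names, hover text, axis titles and the helper column fit are illustrative choices, not necessarily what the original plot used.

```r
library(plotly)

model <- lm(mpg ~ wt, data = mtcars)

# Copy the data and add the fitted values plus the car names for hover text
plot_data <- mtcars
plot_data$fit <- fitted(model)
plot_data$car <- rownames(mtcars)

fig <- plot_ly(plot_data, x = ~wt, y = ~mpg, text = ~car,
               type = "scatter", mode = "markers", name = "Cars") %>%
  add_lines(y = ~fit, name = "Fitted line", line = list(color = "red")) %>%
  layout(xaxis = list(title = "Weight (1000 lbs)"),
         yaxis = list(title = "Miles per gallon"))

fig
```

Hovering over a point shows the car's name along with its weight and mpg, which makes it easy to spot the cars that sit furthest from the line.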

Conclusion

The Linear Regression Model is extremely useful for determining the relationship between two variables and allows us to test our initial hypotheses against the actual results
As we suspected, the relationship between weight and mpg is negative and strong: \(R^2\) = 0.7528, so weight explains about 75% of the variation in mpg
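
For a regression with a single predictor, \(R^2\) is the square of the correlation coefficient between the two variables, which is why \(R^2\) can be read as a statement about correlation here. A short sketch checking this on the same data (the variable names are labels for this example):

```r
model <- lm(mpg ~ wt, data = mtcars)

r         <- cor(mtcars$wt, mtcars$mpg)   # correlation; its sign gives the direction
r_squared <- summary(model)$r.squared     # R^2 reported by the model

c(r = r, r_squared_from_r = r^2, r_squared = r_squared)
```

The correlation itself is negative, matching the negative slope, while its square matches the \(R^2\) from the summary.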