2026-04-12

What is Linear Regression?

  • Linear Regression models the relationship between two variables by fitting a straight line to the data
  • The fitted line shows the direction of the relationship, and the fit statistics indicate how strong it is
  • Linear Regression works well for linear data sets, but it can fit poorly when the relationship is non-linear, such as a hyperbolic or exponential one
  • The Linear Regression Model can also be used to predict new data points, with intervals at a chosen level of statistical confidence
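
The point about non-linear data can be illustrated with a small sketch. The data here is synthetic and made up purely for illustration (exponential growth with some noise); it is not part of the mtcars example used later. The two \(R^2\) values are computed on different scales, so treat the comparison as a rough indication only.

```r
# Synthetic data (an assumption for illustration): exponential growth with noise
set.seed(1)
x <- 1:20
y <- exp(0.3 * x) * exp(rnorm(20, sd = 0.1))

# Straight-line fit on the raw scale
fit_raw <- lm(y ~ x)
# Straight-line fit after a log transform, which linearizes exponential growth
fit_log <- lm(log(y) ~ x)

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log = r2_log)
```

The straight line fits the log-transformed values far better than the raw ones, which is one common way to handle an exponential relationship.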

The Linear Regression Model

\[y = \beta_0 + \beta_1 x + \varepsilon\]

  • \(y\): The dependent variable
  • \(\beta_0\): The y-intercept
  • \(\beta_1\): The slope of the line
  • \(x\): The independent variable
  • \(\varepsilon\): The error term

Least Squares Formulas

\[\beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\]

\[\beta_0 = \bar{y} - \beta_1\bar{x}\]

  • \(\beta_1\): Calculated first, since the formula for \(\beta_0\) depends on it
  • \(\bar{x}\) and \(\bar{y}\): The means of the \(x\) and \(y\) values in the dataset, respectively
  • The values given by these formulas minimize the sum of squared residuals, \(\sum(y_i - \beta_0 - \beta_1 x_i)^2\)
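
As a rough sketch, the two formulas above can be applied by hand in R and checked against `lm()`. This uses the `wt` and `mpg` columns of the built-in mtcars dataset; the variable names `beta_0` and `beta_1` are chosen here only to mirror the notation.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-deviations over sum of squared x-deviations
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: depends on the slope, so it is computed second
beta_0 <- mean(y) - beta_1 * mean(x)

c(intercept = beta_0, slope = beta_1)

# Cross-check against R's built-in fit
coef(lm(mpg ~ wt, data = mtcars))
```

Both approaches should give the same intercept and slope, up to floating-point rounding.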

mtcars dataset

  • We will use the built-in mtcars dataset to demonstrate the Linear Regression Model in R
  • For the two variables, we choose wt (weight of the car, in units of 1000 lbs) and mpg (miles per gallon)
  • Intuitively, we expect the relationship to be negative (heavier cars having lower mpg)
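
Before fitting anything, a quick look at the two columns can confirm that intuition. This is a minimal sketch; the variable name `r` is just a placeholder for the sample correlation.

```r
# First few rows of the two columns used in this example
head(mtcars[, c("wt", "mpg")])

# Sample correlation; a negative value matches the intuition above
r <- cor(mtcars$wt, mtcars$mpg)
r
```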

R function explanation

  • The lm() function takes the two variables (wt and mpg) as a formula, mpg ~ wt, and fits the Linear Regression
  • summary(model) provides a detailed summary of the Linear Regression and how well it fits the given data
  • The summary includes:
    • The estimated values of \(\beta_0\) (the intercept) and \(\beta_1\) (the slope for wt)
    • The \(R^2\) value: A number between 0 and 1 giving the proportion of the variance in mpg that the model explains
    • The p-values: a p-value below 0.05 is conventionally treated as statistically significant
    • Residual standard error: The typical distance of the data points from the Linear Regression line
    • As well as other statistical information about the model

R Code

model <- lm(mpg ~ wt, data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
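
The individual pieces of this summary can also be pulled out programmatically, and the fitted model can be used for prediction. The sketch below assumes a hypothetical new car with wt = 3 (3,000 lbs); the names `fit_summary`, `new_car`, and `pred` are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(model)

coef(model)                 # intercept and slope
fit_summary$r.squared       # multiple R-squared
fit_summary$sigma           # residual standard error
fit_summary$coefficients    # estimates, std. errors, t values, p-values

# Predicted mpg for a hypothetical car with wt = 3 (3,000 lbs),
# with a 95% prediction interval for an individual car
new_car <- data.frame(wt = 3)
pred <- predict(model, newdata = new_car, interval = "prediction", level = 0.95)
pred
```

From the coefficients above, the point prediction is 37.2851 - 5.3445 × 3 ≈ 21.25 mpg; the `lwr` and `upr` columns give the interval around it.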

Model shown on a scatter plot

  • The Linear Regression line is shown in red and gives the best-fitting linear estimate of mpg from weight
  • The shaded area shows the confidence interval for the line (a narrower band means the average mpg at that weight is estimated more precisely; it is not a prediction interval for individual cars)
## `geom_smooth()` using formula = 'y ~ x'
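
The `geom_smooth()` message suggests the plot was built with ggplot2. The original plotting code is not shown, so the following is only a sketch of how a similar plot might be produced, assuming the ggplot2 package is installed; the axis labels are guesses.

```r
library(ggplot2)  # assumed to be installed

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # method = "lm" fits the same straight line as lm(mpg ~ wt);
  # se = TRUE draws the shaded confidence band (95% by default)
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p
```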

Interactive Plot

  • This interactive plot_ly graph shows the individual values of the data points. Give it a try and see how well the Regression Model fits visually
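
The code behind the interactive plot is not included here. One possible sketch, assuming the plotly package is installed, is shown below; the helper columns `fitted_mpg` and `car` and the axis titles are illustrative choices, not the original ones.

```r
library(plotly)  # assumed to be installed

model <- lm(mpg ~ wt, data = mtcars)
# Add the fitted values and car names as columns so plotly can reference them
plot_data <- transform(mtcars, fitted_mpg = fitted(model), car = rownames(mtcars))

# Scatter of the raw data; hovering shows each car's name and values
fig <- plot_ly(plot_data, x = ~wt, y = ~mpg, text = ~car,
               type = "scatter", mode = "markers", name = "Cars")
# Overlay the fitted regression line
fig <- add_lines(fig, y = ~fitted_mpg, name = "Fitted line")
fig <- plotly::layout(fig,
                      xaxis = list(title = "Weight (1000 lbs)"),
                      yaxis = list(title = "Miles per gallon"))
fig
```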

Conclusion

  • The Linear Regression Model is extremely useful for examining the relationship between two variables, and it allows us to test our initial hypotheses against the actual results
  • As we suspected, the relationship between weight and mpg is negative (slope of -5.3445) and fairly strong: \(R^2\) = 0.7528, so weight explains about 75% of the variance in mpg