2025-03-27

What is Simple Linear Regression?

  • A method in statistics used to model a relationship between two quantitative variables

  • Allows us to estimate how a dependent variable changes as an independent variable changes

  • Assumptions of linear regression:

  • The method assumes a linear relationship between the independent and dependent variables

  • Observations are independent of one another

  • The errors (residuals) follow a normal distribution

  • The size of the error in a prediction doesn’t change significantly across values of the independent variable (constant variance)
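
  • These assumptions are usually checked on the residuals of a fitted model. The following is a minimal sketch, assuming the built-in airquality data used later in these slides; the names cleanData and fit are illustrative, and the choice of diagnostics is one common option rather than the only one.

```r
# Fit a simple linear model on complete rows of the built-in airquality data
cleanData <- na.omit(airquality)
fit <- lm(Ozone ~ Temp, data = cleanData)

# Residuals vs fitted values: curvature suggests non-linearity,
# a funnel shape suggests non-constant error variance
plot(fit, which = 1)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
plot(fit, which = 2)

# Shapiro-Wilk test as a numeric companion to the Q-Q plot
normalityCheck <- shapiro.test(residuals(fit))
print(normalityCheck)
```

  • Independence of observations cannot be read off these plots; it depends mostly on how the data were collected.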

How to perform Simple Linear Regression

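  • The goal of linear regression is to find the best-fitting line through the data

  • This is done by finding the intercept \(\beta_0\) and regression coefficient \(\beta_1\) that minimize the sum of the squared errors \(\epsilon\) (the squared vertical distances between the data points and the line)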

  • The formula to perform a simple linear regression is:

  • \(Y = \beta_0 + \beta_1X + \epsilon\)

  • \(Y\) is the dependent variable (outcome)

  • \(X\) is the independent variable (predictor)

  • \(\beta_0\) is the y-intercept (or value of Y when X is 0)

  • \(\beta_1\) is the regression coefficient (or how much we expect Y to change when X increases by one unit)

  • \(\epsilon\) is the error term (the variation in Y that is not explained by the linear relationship with X)
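
  • For a single predictor, the least-squares estimates have a closed form: \(\beta_1 = cov(X, Y) / var(X)\) and \(\beta_0 = \bar{Y} - \beta_1\bar{X}\). The sketch below checks this against lm() on a small made-up dataset; the x and y values are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Hypothetical toy data, not taken from the slides
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Closed-form least-squares estimates for one predictor
b1 <- cov(x, y) / var(x)        # slope: 0.6
b0 <- mean(y) - b1 * mean(x)    # intercept: 2.2

# The same estimates from R's linear model function
fit <- lm(y ~ x)
print(coef(fit))

# Sum of squared residuals, the quantity the fit minimizes
sse <- sum(residuals(fit)^2)
print(sse)
```

  • Any other line through these points gives a larger sum of squared residuals than the one found by lm().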

Example with the airquality Dataset in R (overview)

  • This dataset contains 6 variables.
  • The first 6 (of 153) observations are:
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
  • We can model the linear regression between Ozone and Temp like: \(Ozone = \beta_0 + \beta_1(Temp) + \epsilon\)

  • If a dataset in R has NA values, we can remove the rows that contain them with na.omit(datasetName)
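
  • Applied to airquality, this could look like the sketch below. The name removedNAData matches the data frame used in the ggplot2 code on a later slide; note that na.omit() drops a row if any column is NA, so rows missing only Solar.R are removed as well.

```r
# Drop every row that has an NA in any column
removedNAData <- na.omit(airquality)

nrow(airquality)      # 153 rows before
nrow(removedNAData)   # 111 complete rows after

# Confirm that no missing values remain
anyNA(removedNAData)  # FALSE
```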

Example with the airquality Dataset in R (plotly)
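
  • A scatter plot of Ozone against Temp could be drawn with plotly roughly as follows. This is a sketch under assumptions, not necessarily the code behind the original slide: it assumes the plotly package is installed, that NA rows were dropped with na.omit(), and the axis titles are illustrative.

```r
library(plotly)

# airquality with incomplete rows removed
removedNAData <- na.omit(airquality)

# Interactive scatter plot: one marker per daily observation
p <- plot_ly(removedNAData, x = ~Temp, y = ~Ozone,
             type = "scatter", mode = "markers")
p <- layout(p,
            xaxis = list(title = "Temp"),
            yaxis = list(title = "Ozone"))
p
```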

Now with ggplot2

  • Next, we can generate the plot with ggplot2
  • This plot is a little better because it shows the fitted regression line together with a shaded band, which by default in geom_smooth is the 95% confidence interval around the fitted line

The R code (previous plot)

library(ggplot2)
# First argument: the NA-omitted data; method = "lm" fits a linear model and adds the fit line
ggplot(removedNAData, aes(x = Temp, y = Ozone)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(0, 165))

Understanding and Interpreting Linear Regression

  • Our formula: \(Ozone = \beta_0 + \beta_1(Temp) + \epsilon\) can be interpreted in a more understandable way

  • Based on the data and linear regression model(s), we can get the regression coefficient \(\beta_1\):

## B1 is:  2.43911
  • Since the X variable (or Temp) is multiplied by the Regression Coefficient B1, we can interpret the following:

  • Every 1 degree (Fahrenheit) increase in Temp is associated with an expected increase of about 2.44 units (parts per billion) in Ozone, on average
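
  • The coefficient reported above can be obtained with lm(). The sketch below assumes the model was fitted on the NA-omitted data (removedNAData); the variable names fit, b0, b1 and pred are illustrative.

```r
# Fit Ozone = B0 + B1 * Temp on the complete rows
removedNAData <- na.omit(airquality)
fit <- lm(Ozone ~ Temp, data = removedNAData)

# Extract the intercept and the regression coefficient
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["Temp"]]
cat("B1 is: ", b1, "\n")   # the slides report 2.43911

# Predicted Ozone at two temperatures one degree apart;
# the difference between the predictions equals B1
pred <- predict(fit, newdata = data.frame(Temp = c(80, 81)))
print(diff(pred))
```

  • Predictions are most trustworthy inside the range of temperatures present in the data; the negative intercept shows that the line should not be extended to very low temperatures.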

Conclusion

  • Simple Linear Regression is used to predict one quantitative variable from another when a linear relationship is suspected, whether it is weather-related or not
  • In these slides (using the airquality dataset), we’ve shown that Temperature is positively associated with Ozone concentration
  • We can say that this model helps predict ozone concentration based on Temperature (in New York)