Introduction

Simple linear regression is defined by Google as “a statistical method used to model the relationship between one dependent variable and one independent variable, aiming to find a straight line that best represents the data points.”

  • It finds a linear relationship by using the following model: \[ Y = \beta_0 + \beta_1 X \]

  • Linear regression estimates \(\beta_0\), the intercept, and \(\beta_1\), the slope, from the data.

  • It is used in engineering, economics, biology, and many other fields.

Mathematical Formula Breakdown

Simple linear regression describes the relationship between the independent variable \(X\) and the dependent variable \(Y\), and is modeled by the equation: \[ Y = \beta_0 + \beta_1 X \] where:

  • \(\beta_0\) is the \(Y\)-intercept, the value of \(Y\) when \(X = 0\)

  • \(\beta_1\) is the slope, the change in \(Y\) per one-unit change in \(X\)
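For reference, the usual least-squares estimates of these two quantities have a closed form: \[ \hat\beta_1 = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X \] The sketch below is one way to check this in R. It assumes the built-in faithful data set (used in the following sections), and the names x, y, b0, b1, and fit are illustrative only.

```r
# Closed-form least-squares estimates for Y = b0 + b1 * X.
# Assumption: X = waiting, Y = eruptions from the built-in faithful data.
x <- faithful$waiting
y <- faithful$eruptions

b1 <- cov(x, y) / var(x)       # slope: covariance of X and Y over variance of X
b0 <- mean(y) - b1 * mean(x)   # intercept: the line passes through (mean(x), mean(y))

# Compare with the estimates returned by lm()
fit <- lm(y ~ x)
print(c(intercept = b0, slope = b1))
print(coef(fit))
```

The two printed pairs should agree up to floating-point error, since lm() solves the same least-squares problem.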

Visualizing a Simple Linear Relationship

To visualize a simple linear relationship, let’s take a look at the data from Yellowstone National Park’s “Old Faithful” geyser, which records each eruption’s duration and the waiting time until the next eruption:
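Before plotting, a quick numeric look at the data can suggest whether a straight line is a reasonable description. This is a minimal sketch using R's built-in faithful data set; the variable name r is illustrative.

```r
# Quick numeric look at the built-in faithful data.
str(faithful)   # two numeric columns: eruptions (minutes), waiting (minutes)

# Pearson correlation between waiting time and eruption duration
r <- cor(faithful$waiting, faithful$eruptions)
print(r)        # values near +1 suggest a strong positive linear trend
```

A high correlation alone does not prove linearity, so it is best read alongside the scatter plot rather than instead of it.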

Model Fit and Interpretation

Let’s take a look at the same data from the previous slide with a linear model that shows the regression line, slope, and intercept:
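The numbers behind the regression line can also be read directly from the fitted model. This is a minimal sketch, assuming the same faithful data; the names intercept, slope, and r_squared are illustrative.

```r
model <- lm(eruptions ~ waiting, data = faithful)

intercept <- coef(model)[["(Intercept)"]]  # predicted eruption length at waiting = 0
slope     <- coef(model)[["waiting"]]      # change in eruption length per extra minute of waiting
r_squared <- summary(model)$r.squared      # share of variance in eruptions explained by waiting

cat("intercept:", round(intercept, 3), "\n")
cat("slope:    ", round(slope, 4), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
```

The intercept comes out negative here. It describes a waiting time of zero minutes, which lies outside the observed data, so it is better read as the line's anchor point than as a physical prediction.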

Code Breakdown

Now, let’s walk through the code that was used to create the plot and show the regression line, y-intercept, and slope in the simple linear regression formula:

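library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth(), annotate()

# Fit eruption duration as a linear function of waiting time
model <- lm(eruptions ~ waiting, data = faithful)
coeffs <- coef(model)  # coeffs[1] = intercept, coeffs[2] = slope

# Scatter plot with the fitted line and the regression equation as a label
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  geom_smooth(method = "lm") +
  annotate("text", x = 50, y = 5,
           label = paste0("y = ", round(coeffs[2], 2), "x + ",
                          round(coeffs[1], 2)))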

Code Breakdown (continued)

  • lm(eruptions ~ waiting, data = faithful): “lm” stands for linear model. This line fits a simple linear regression with eruptions as the dependent variable and waiting as the independent variable

  • coef(model) gets the intercept and slope coefficients from the model

  • ggplot(faithful, aes(x = waiting, y = eruptions)) creates a plot using the built-in Old Faithful data set, and sets the x and y axis variables

  • geom_point() draws each observation as a point, producing the scatter plot

  • geom_smooth(method = "lm") adds the fitted regression line to the plot; method = "lm" tells it to use a linear model, and by default a shaded 95% confidence band is drawn around the line

  • annotate("text", x = 50, y = 5, label = paste0("y = ", round(coeffs[2], 2), "x + ", round(coeffs[1], 2))) displays the regression equation on the plot at position (50, 5), with the slope and intercept rounded to two decimal places
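One detail of the label is worth noting: the fitted intercept for this data is negative, so pasting it after "x + " produces a label ending in "+ -1.87". The sketch below shows one possible way to format the sign separately. It is a variation, not the code used for the plot, and eq_label and sign_chr are hypothetical names.

```r
model <- lm(eruptions ~ waiting, data = faithful)
coeffs <- coef(model)

# Choose the sign separately so a negative intercept prints as "- 1.87"
# rather than "+ -1.87".
sign_chr <- if (coeffs[1] < 0) "-" else "+"
eq_label <- sprintf("y = %.2fx %s %.2f", coeffs[2], sign_chr, abs(coeffs[1]))
print(eq_label)
```

The resulting string can be passed as the label argument of annotate() in place of the paste0() call.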

Tree Volume as a Function of Width

Finally, using the built-in trees data set, we can see that wider trees (those with a larger Girth) tend to have more volume:
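A fit like this can be sketched in a few lines. The example assumes that "width" corresponds to the Girth column of the trees data set (tree diameter in inches), and it uses base R graphics rather than ggplot2 to stay dependency-free. The names tree_model and tree_coeffs are illustrative.

```r
# Assumption: "width" is the Girth column (diameter, inches);
# Volume is timber volume in cubic feet.
tree_model <- lm(Volume ~ Girth, data = trees)
tree_coeffs <- coef(tree_model)
print(tree_coeffs)  # a positive Girth coefficient means wider trees have more volume

# Base-graphics scatter plot with the fitted line overlaid
plot(trees$Girth, trees$Volume,
     xlab = "Girth (in)", ylab = "Volume (cubic ft)")
abline(tree_model)
```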

Conclusion

Simple linear regression is a foundational statistical model that is widely used to describe the relationship between two variables.

  • It estimates how much the dependent variable changes, on average, for a one-unit change in the independent variable

  • This can be seen in the graphs of the Old Faithful data and the tree data shown previously

  • The fitted slope and y-intercept let us summarize roughly linear data and predict new values, which is how the model is applied in real-world settings
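As an illustration of the prediction step, the sketch below applies the fitted Old Faithful model to a few new waiting times. The values in new_waits are hypothetical inputs chosen inside the range of the observed data, not measurements.

```r
model <- lm(eruptions ~ waiting, data = faithful)

# Hypothetical new waiting times (minutes), inside the observed range
new_waits <- data.frame(waiting = c(60, 75, 90))

# Point predictions with 95% prediction intervals
preds <- predict(model, newdata = new_waits, interval = "prediction")
print(cbind(new_waits, preds))
```

The fit column holds the predicted eruption durations, while lwr and upr bound where a single new observation would be expected to fall. Predictions far outside the observed waiting times should be treated with caution.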