Simple Linear Regression

Introduction

Simple linear regression is defined by Google as “a statistical method used to model the relationship between one dependent variable and one independent variable, aiming to find a straight line that best represents the data points.”

It finds a linear relationship by using the following model: \[ Y = \beta_0 + \beta_1 X \]
Fitting the model means estimating two values: \(\beta_0\), the intercept, and \(\beta_1\), the slope.
Simple linear regression is used in engineering, economics, biology, and many other fields.

Mathematical Formula Breakdown

Simple linear regression describes the relationship between the independent variable \(X\) and the dependent variable \(Y\), and is modeled by the equation: \[ Y = \beta_0 + \beta_1 X \] where:

\(\beta_0\) is the \(Y\)-intercept, the value of \(Y\) when \(X = 0\)
\(\beta_1\) is the slope, the change in \(Y\) per one-unit change in \(X\)
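
To make these two definitions concrete, here is a minimal sketch in R. The x and y values are made up for illustration and lie exactly on the line Y = 2 + 3X, so the fitted intercept and slope should come back as 2 and 3. The variable names are hypothetical; lm() fits the line by ordinary least squares.

```r
# Made-up illustrative data lying exactly on Y = 2 + 3X (not real measurements)
x <- 1:5
y <- 2 + 3 * x

fit <- lm(y ~ x)
b <- coef(fit)

b0 <- unname(b[1])  # intercept: value of Y when X = 0
b1 <- unname(b[2])  # slope: change in Y per one-unit change in X

# Moving X from 1 to 2 changes Y by exactly the slope
step_change <- y[2] - y[1]

# The least-squares slope can also be computed directly as cov(x, y) / var(x)
slope_by_hand <- cov(x, y) / var(x)
```

Because the toy data has no noise, the fitted values match the true intercept and slope; with real data the estimates describe the average trend instead.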

Visualizing a Simple Linear Relationship

To see a simple linear relationship in practice, let’s take a look at the data from Yellowstone National Park’s “Old Faithful” geyser, available in R as the built-in faithful data set:
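
A scatter plot of this data could be drawn roughly as follows. This is a sketch that assumes the ggplot2 package is installed; the object name p is hypothetical, and the original plot may have been styled differently.

```r
library(ggplot2)

# faithful is a built-in R data set with two columns:
#   eruptions = eruption duration in minutes
#   waiting   = waiting time to the next eruption in minutes
p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

# Correlation between the two variables; a value near 1 suggests a
# strong positive linear relationship
r <- cor(faithful$waiting, faithful$eruptions)

print(p)
```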

Model Fit and Interpretation

Let’s take a look at the same data from the previous slide with a linear model that shows the regression line, slope, and intercept:

Code Breakdown

Now, let’s walk through the code that was used to create the plot and show the regression line, y-intercept, and slope in the simple linear regression formula:

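library(ggplot2)

# Fit eruption duration as a linear function of waiting time
model <- lm(eruptions ~ waiting, data = faithful)
coeffs <- coef(model)  # coeffs[1] = intercept, coeffs[2] = slope

# Scatter plot with the fitted line and the equation as a text label
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  geom_smooth(method = "lm") +
  annotate("text", x = 50, y = 5,
           label = paste0("y = ", round(coeffs[2], 2), "x + ",
                          round(coeffs[1], 2)))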

Code Breakdown (continued)

lm(eruptions ~ waiting, data = faithful): “lm” stands for linear model. This line fits a simple regression model with eruptions as the dependent variable and waiting as the independent variable
coef(model): extracts the intercept and slope coefficients from the model, in that order
ggplot(faithful, aes(x = waiting, y = eruptions)): creates a plot from the built-in Old Faithful data set and sets the x- and y-axis variables
geom_point(): draws the scatter plot layer of the graph
geom_smooth(method = "lm"): adds the regression line to the plot and specifies that it is fitted with a linear model; by default it also draws a shaded confidence band around the line
annotate("text", x = 50, y = 5, label = paste0("y = ", round(coeffs[2], 2), "x + ", round(coeffs[1], 2))): displays the regression equation on the plot, with the slope and intercept rounded to two decimal places
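
One cosmetic detail: the fitted intercept for this data is negative, so pasting it after "x + " produces a label ending in "+ -1.87". A possible alternative, sketched below, builds the label with sprintf() and picks the sign separately. The names slope, intercept, sign_chr, and eq_label are hypothetical and not part of the original code.

```r
model <- lm(eruptions ~ waiting, data = faithful)
coeffs <- coef(model)

slope <- coeffs[["waiting"]]
intercept <- coeffs[["(Intercept)"]]

# Choose the sign explicitly so a negative intercept prints as "- 1.87"
sign_chr <- if (intercept < 0) "-" else "+"
eq_label <- sprintf("y = %.2fx %s %.2f", slope, sign_chr, abs(intercept))

print(eq_label)
```

The resulting string can be passed to annotate() as the label argument in place of the paste0() call.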

Tree Volume as a Function of Width

Finally, using R’s built-in trees data set, we can see that wider trees (those with a larger Girth) tend to have more volume:
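
The code for this plot is not shown in the slides, so the following is only a sketch of how it might be produced. It assumes that “width” corresponds to the Girth column (tree diameter in inches) and that ggplot2 is installed; the names tree_model, tree_coeffs, and tree_plot are hypothetical.

```r
library(ggplot2)

# trees is a built-in R data set with columns Girth, Height, and Volume
tree_model <- lm(Volume ~ Girth, data = trees)
tree_coeffs <- coef(tree_model)

# A positive Girth coefficient means volume rises as trees get wider
tree_slope <- tree_coeffs[["Girth"]]

tree_plot <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm")

print(tree_plot)
```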

Conclusion

Simple linear regression is a foundational statistical model that is widely used to describe the relationship between two variables.

It estimates how much the dependent variable changes, on average, for a one-unit change in the independent variable.
This can be seen in the graphs of the Old Faithful data and the tree data shown previously.
We can use the slope and y-intercept to describe linear data and to predict new values, which is one common real-world application.
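
As a brief illustration of prediction, the sketch below refits the Old Faithful model and estimates the eruption duration for a waiting time of 80 minutes. The value 80 is a hypothetical input chosen from within the observed range of waiting times, and the names new_wait, predicted, and by_hand are illustrative.

```r
model <- lm(eruptions ~ waiting, data = faithful)

# Hypothetical new observation: 80 minutes of waiting
new_wait <- data.frame(waiting = 80)
predicted <- predict(model, newdata = new_wait)

# The same prediction computed directly from intercept + slope * X
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["waiting"]] * 80

print(predicted)
```

Predictions are most trustworthy for X values inside the range of the data used to fit the model.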