March 13, 2023

What is Simple Linear Regression?

Simple linear regression is a statistical method used to model the relationship between a dependent variable and a single independent variable. It assumes a linear relationship between the two variables and estimates the slope and intercept of the line that best fits the data.

Simple linear regression can be used to:

  1. Show the strength of a relationship between two variables
  2. Predict the value of the dependent variable at a certain value of the independent variable

These slides will use the Iris dataset to show how to perform a simple linear regression and how to interpret the results.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
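The preview above has the shape of what `head()` prints for the built-in `iris` data frame. A minimal sketch, assuming base R with the default `datasets` package available:

```r
# iris ships with base R in the datasets package, so no download is needed.
data(iris)

# First six rows, as in the preview above.
head(iris)

# Dimensions and column types of the full dataset.
dim(iris)
str(iris)
```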

How to Perform a Simple Linear Regression

Formula: \(y = \beta_0 + \beta_1 x + \epsilon\)

Where:

\(y\) is the value of the dependent variable

\(\beta_0\) is the y-intercept

\(\beta_1\) is the regression coefficient

\(x\) is the independent variable

\(\epsilon\) is the error term: the part of \(y\) that the line does not explain

Calculating a Simple Linear Regression Using R

Linear regression calculates the line of best fit by finding the intercept (\(\beta_0\)) and regression coefficient (\(\beta_1\)) that minimize the sum of the squared errors (\(\epsilon\)) of the model.
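With a single predictor, the least-squares estimates have a closed form, so they can be checked by hand. The sketch below assumes the ordinary least-squares criterion (which is what `lm()` uses); the names `x`, `y`, `b0`, and `b1` are illustrative only.

```r
x <- iris$Petal.Length
y <- iris$Sepal.Length

# Slope: sample covariance of x and y divided by the sample variance of x.
b1 <- cov(x, y) / var(x)

# Intercept: forces the line through the point (mean(x), mean(y)).
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# The same estimates from lm(), for comparison.
coef(lm(Sepal.Length ~ Petal.Length, data = iris))
```

Both approaches should agree up to floating-point rounding.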

R has built-in functions for performing a simple linear regression. Here we use them to explore the relationship between Sepal Length (dependent variable) and Petal Length (independent variable).
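A summary like the one on this slide can be produced with `lm()` and `summary()`. The formula and `data` argument below are taken from the `Call:` line of the output; the object name `fit` is an assumption.

```r
# Fit Sepal.Length as a linear function of Petal.Length.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Coefficients, standard errors, p-values, and R-squared.
summary(fit)
```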

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24675 -0.29657 -0.01515  0.27676  1.00269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.30660    0.07839   54.94   <2e-16 ***
## Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

Plugging in Values to the Simple Linear Regression Equation

Plugging in the values obtained from the summary of the linear model on the previous slide:

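\(y = \text{Sepal Length}\)

\(\beta_0 = 4.31\)

\(\beta_1 = 0.41\)

\(x = \text{Petal Length}\)

\(\epsilon\) has a residual standard error of \(0.407\) (the \(0.019\) in the summary is the standard error of the slope estimate, not the error of the model)

For the equation: \(\text{Sepal Length} = 4.31 + 0.41 \times \text{Petal Length} + \epsilon\)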

Also keep in mind the p-value for the slope (p < 0.001) obtained from the previous slide.
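As a worked example of using the fitted equation, the sketch below predicts Sepal Length for a hypothetical flower with a Petal Length of 4 cm. The value 4 and the name `new_flower` are assumptions chosen for illustration.

```r
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Hypothetical flower with a petal length of 4 cm.
new_flower <- data.frame(Petal.Length = 4)

# Point prediction: roughly 4.31 + 0.41 * 4, or about 5.94.
pred <- predict(fit, newdata = new_flower)
pred

# 95% prediction interval for a single new observation.
predict(fit, newdata = new_flower, interval = "prediction", level = 0.95)
```

The prediction interval is wider than the point estimate suggests because it accounts for the scatter of individual flowers around the line.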

Plotting the Simple Linear Regression Line of Best Fit Over the Iris Dataset

By plotting the line of best fit calculated on the previous slide over the entire Iris dataset, we will better visualize the relationship between sepal length and petal length:

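library(ggplot2)

# Scatter plot of the data with the least-squares line overlaid (no confidence band)
sl_pl <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() +
  ggtitle("Sepal Length in Relation to Petal Length Within Iris Flowers") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)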

The resulting graph will be shown on the next slide.

Simple Linear Regression Graph

Interpreting the Results

It is important to note that a regression model should be used to predict the dependent variable only within the range of independent-variable values that were actually observed. That is why, on the previous slide, the line of best fit starts at a Petal Length of 1 rather than 0: there were no petal lengths of 0 in the Iris dataset.
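One way to guard against extrapolation is to check a new value against the observed range before predicting. This is a minimal sketch; `in_observed_range` is a hypothetical helper, not part of base R.

```r
# Observed range of the predictor in the data.
pl_range <- range(iris$Petal.Length)
pl_range

# Hypothetical helper: TRUE when x falls inside the observed range.
in_observed_range <- function(x, rng = pl_range) {
  x >= rng[1] & x <= rng[2]
}

# Only the middle value (4) lies inside the observed petal lengths.
in_observed_range(c(0, 4, 10))
```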

Remember the p-value of this model from a previous slide (p < 0.001)? This value, along with the visual aid of the graph on the previous slide, indicates that Petal Length has a statistically significant relationship with Sepal Length.
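The numbers behind this interpretation can be pulled out of the model summary directly instead of being read off the printed table. A sketch, with `fit`, `s`, `p_slope`, and `r2` as assumed names:

```r
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
s <- summary(fit)

# p-value for the slope (tests whether the slope differs from zero).
p_slope <- s$coefficients["Petal.Length", "Pr(>|t|)"]
p_slope

# Proportion of variance in Sepal.Length explained by the model.
r2 <- s$r.squared
r2

# 95% confidence intervals for the intercept and slope.
confint(fit, level = 0.95)
```

The p-value addresses whether a relationship exists, while R-squared describes how strong it is, so it may help to report both.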

Exploring Other Relationships in the Iris Dataset

Not all relationships between variables are linear. To visualize this point, let's explore the relationship between petal width and sepal width with a plot.

library(plotly)

# Fit the linear model, then overlay its fitted values on an interactive scatter plot
fit <- lm(Sepal.Width ~ Petal.Width, data = iris)
sw_pw <- plot_ly(data = iris, x = ~Petal.Width) %>%
  add_markers(y = ~Sepal.Width, name = "Point") %>%
  layout(title = "Sepal Width in Relation to Petal Width Within Iris Flowers") %>%
  add_lines(x = ~Petal.Width, y = fitted(fit), name = "Regression Line")

The resulting graph will be featured on the next slide. Notice that the points do not follow a clear linear pattern.
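The visual impression can be backed up with a number and a diagnostic plot. The sketch below uses base R only; `fit_width` and `r2_width` are assumed names.

```r
fit_width <- lm(Sepal.Width ~ Petal.Width, data = iris)

# R-squared: share of variance in Sepal.Width explained by the straight line.
r2_width <- summary(fit_width)$r.squared
r2_width

# Residuals vs fitted values: visible structure in this plot suggests
# that a single straight line describes the data poorly.
plot(fitted(fit_width), resid(fit_width),
     xlab = "Fitted Sepal.Width", ylab = "Residual")
abline(h = 0, lty = 2)
```

A low R-squared here, compared with 0.76 for the Sepal Length model, would be consistent with the weak fit seen in the plot.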

Graph of Petal Width and Sepal Width

Exploring Other Fits of the Iris Dataset

While simple linear regression models have many practical use cases, they are not the only way to show the strength of a relationship between two variables. Let's revisit the relationship between Petal Length and Sepal Length and fit a curved line to the data.

library(ggplot2)

# method = "gam" fits a smooth curve using the mgcv package, which must be installed
sl_pl_curve <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() +
  ggtitle("Sepal Length in Relation to Petal Length Within Iris Flowers") +
  geom_smooth(se = FALSE, method = "gam", formula = y ~ s(log(x)))

The resulting curved fit graph is shown on the next slide.
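To compare the straight line and the curve numerically rather than only by eye, both models can be fit outside of ggplot2. This sketch assumes the `mgcv` package is installed (it is what `method = "gam"` relies on); the names `linear_fit` and `curved_fit` are illustrative.

```r
library(mgcv)  # provides gam() and s()

# Straight-line model and smooth model on the same variables.
linear_fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
curved_fit <- gam(Sepal.Length ~ s(log(Petal.Length)), data = iris)

# A lower AIC suggests a better trade-off between fit and model complexity.
AIC(linear_fit, curved_fit)
```

AIC is only one possible criterion; it is used here as a convenient default, not as the definitive way to choose between the two fits.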

Curved Line Fitted to Iris Data

Closing

Summary:

Simple linear regression can be used to:

  1. Show the strength of a relationship between two variables
  2. Predict the value of the dependent variable at a certain value of the independent variable

While simple linear regression is a powerful tool for the above use cases, a straight line is not the only function that can be fit to a dataset, and it is not appropriate for every relationship between variables.