2025-02-06

Introduction to Simple Linear Regression

What is simple linear regression?

Simple linear regression is a statistical method for modeling the relationship between two continuous variables: one independent (explanatory) variable and one dependent (response) variable. When there are two or more independent variables, the method is called multiple linear regression instead.

The fitted line shows whether the relationship between the variables is positive or negative by describing how the dependent variable changes as the independent variable changes.

The Math Behind the Scenes

Simple linear regression uses the following formula:

\(Y = \beta_0 + \beta_1 X + \varepsilon\)

This equation models the dependent variable (Y) as a linear function of the independent/explanatory variable (X), plus a random error.

What does each variable represent?

Simple Linear Regression Function Explained

\(Y = \beta_0 + \beta_1 X + \varepsilon\)

Y = dependent variable

\(\beta_0\) = the intercept (the expected value of Y when X = 0)

\(\beta_1\) = the slope (the expected change in Y for a one-unit increase in X)

X = independent variable

\(\varepsilon\) = the error term (the variation in Y that X does not explain; in a fitted model it is estimated by the residual, the difference between an observed and a predicted value)
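To make the roles of these terms concrete, here is a minimal sketch that simulates data from the formula and then fits a regression to it. The intercept, slope, sample size, and noise level are all made-up values chosen for illustration; they are not taken from any real data set.

```r
# Simulate data from Y = beta_0 + beta_1 * X + epsilon.
# Every number below is a hypothetical value chosen for illustration.
set.seed(42)                            # make the random draws reproducible
n      <- 200
beta_0 <- 2                             # hypothetical intercept
beta_1 <- 0.5                           # hypothetical slope
x      <- runif(n, min = 0, max = 10)   # independent variable
eps    <- rnorm(n, mean = 0, sd = 1)    # error term
y      <- beta_0 + beta_1 * x + eps     # dependent variable

# Fit the regression and compare the estimates with the values used above
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimated coefficients should land close to, but not exactly on, the values used to generate the data, because the error term adds noise to every observation.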

Simple Linear Regression Using iris

For the examples and plots that follow, we will use R's built-in iris data set.

This data set contains 5 variables describing iris flowers: sepal length, sepal width, petal length, petal width (all measured in centimeters), and species. (Note: the sepal of a flower is the outermost part that protects the buds, while the petals are the colorful bits that you pick off when considering if someone is in love with you or not, in case clarification was required!) Some example rows are shown below:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
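The rows above look like the output of `head()`. A short sketch of how one might inspect the data set before modeling, using only base R functions:

```r
# iris ships with base R (the datasets package), so nothing needs to be installed
head(iris)            # first six rows of the data set
dim(iris)             # number of rows and columns
str(iris)             # four numeric columns plus the Species factor
table(iris$Species)   # number of observations per species
```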

Fitting the Model

Before we visualize our linear regression, we first fit the model to our data:

model <- lm(Petal.Length ~ Sepal.Length, data = iris)

This tidbit of code sets sepal length as our independent variable and petal length as our dependent variable, and estimates the two unknown coefficients by least squares. Before looking at the fitted values, all we know is that

petal.length = \(\beta_0\) + \(\beta_1\) * sepal.length + \(\varepsilon\)
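For a single predictor, the least squares estimates have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the two means. The sketch below computes both by hand and checks them against `lm()`. The variable names `b0`, `b1`, and `hand` are introduced here for illustration only.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Closed-form least squares estimates for one predictor
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# Compare with the coefficients that lm() estimates
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
hand  <- c(b0, b1)
all.equal(hand, unname(coef(model)))
```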

To see the estimated coefficients, we can take a look at our model summary.

Model Summary

## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
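The report above is what `summary()` prints for a fitted `lm` object. As a rough sketch, the individual pieces can also be pulled out by name, which is handy when they are needed in later calculations. The name `fit_summary` is an assumption made for this example.

```r
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
fit_summary <- summary(model)

fit_summary                    # prints the full report
coef(fit_summary)              # coefficient table as a numeric matrix
fit_summary$r.squared          # multiple R-squared
fit_summary$sigma              # residual standard error
confint(model, level = 0.95)   # 95% confidence intervals for the coefficients
```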

Model Summary (cont.)

Based on the summary, we can see that:

  • The intercept (\(\beta_0\)) = -7.101. This indicates that if sepal length were 0, the predicted petal length would be negative. That value has no physical meaning: a sepal length of 0 lies far outside the observed data (4.3 to 7.9 cm), so the intercept mainly serves to position the line.

  • The slope (\(\beta_1\)) = 1.858. This indicates that for each 1 cm increase in sepal length, the predicted petal length increases by 1.858 cm on average.
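With both coefficients in hand, the fitted equation can be used to predict petal length. For a sepal length of 6 cm, for example, the prediction is -7.101 + 1.858 × 6, or about 4.05 cm. The sketch below does the same with `predict()`; the data frame `new_flowers` holds hypothetical sepal lengths chosen for illustration.

```r
model <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Hypothetical new flowers with sepal lengths of 5, 6, and 7 cm
new_flowers <- data.frame(Sepal.Length = c(5, 6, 7))
predicted   <- predict(model, newdata = new_flowers)
predicted

# The same prediction by hand for a 6 cm sepal
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["Sepal.Length"]] * 6
by_hand

# Each 1 cm step in sepal length changes the prediction by exactly the slope
diff(predicted)
```

All three sepal lengths fall inside the range of the observed data, which is where predictions from this model are most trustworthy.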

Visualizing Data

Now that we have a clear idea of what each component of our function represents, as well as the estimated coefficients, we can visualize our data.

Visualizing Data (cont.)

In our visualization on the previous slide, we can already see a clear positive linear trend between sepal length and petal length.

Now, we create a plot including a regression line, which shows the trend estimated by our model. The next slide shows the plot produced by the following R code:

library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "seagreen") +
  geom_smooth(formula = y ~ x, method = "lm", color = "cadetblue3") +
  labs(title = "Sepal Length vs Petal Length", x = "Sepal Length (cm)", y = "Petal Length (cm)")

Visualizing Regression

Regression and Residuals

Based on our plot, we can see that:

  • There indeed is a positive linear relationship between sepal length and petal length

  • The regression line, for the most part, follows our data points

If we desire, we could take it a step further and consider the residual plot:

  • Residuals are the differences between observed values and predicted values (observed minus predicted)

  • They measure how far off our model is from the actual data

  • Ideally, they should scatter randomly around zero, with no visible pattern

Visualizing Residuals
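A residual plot along these lines could be drawn as follows. This is a sketch under assumptions: the choice to plot residuals against fitted values, the data frame name `resid_data`, and the styling are all illustrative and may differ from the original figure.

```r
library(ggplot2)

model <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Collect fitted values and residuals (observed minus predicted)
resid_data <- data.frame(
  fitted   = fitted(model),
  residual = resid(model)
)

residual_plot <- ggplot(resid_data, aes(x = fitted, y = residual)) +
  geom_point(color = "seagreen") +
  geom_hline(yintercept = 0, linetype = "dashed") +   # reference line at zero
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Petal Length (cm)", y = "Residual (cm)")
residual_plot
```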

Conclusion

Our residual plot shows something of a pattern, indicating that a single straight line does not fully capture the relationship between sepal length and petal length.

  • This can also be seen in the regression plot, where the points form clusters rather than scattering evenly around the line

  • A likely cause for the variation is that there are 3 different species in the iris data set (setosa, versicolor, and virginica), each with different physical traits

In future analyses, we would be wise to consider each species individually. Still, for a simple linear regression model, this one did its job!
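One way to follow up on that idea is to fit a separate simple linear regression for each species and compare the coefficients. This is only a sketch of one possible approach; the names `by_species` and `species_models` are introduced here for illustration.

```r
# Split the data by species and fit one simple regression per group
by_species     <- split(iris, iris$Species)
species_models <- lapply(by_species, function(d) {
  lm(Petal.Length ~ Sepal.Length, data = d)
})

# One row per species, with the intercept and slope as columns
t(sapply(species_models, coef))
```

If the slopes or intercepts differ noticeably between species, that would support the idea that species explains part of the pattern in the residuals.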