Introduction

  • Simple Linear Regression (SLR) is a linear regression model that uses one independent variable (input) to predict a target value (dependent variable), assuming a straight-line relationship between the two.
  • This presentation uses the built-in R data set “cars” and the “penguins” data set as examples.

Data Set - cars

  • First, we’ll demonstrate SLR with the data set ‘cars’.
  • The data give the speed of cars (in mph) and the distances taken to stop (in ft).
  • Note that the data were recorded in the 1920s.
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
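
The preview above looks like the first six rows of the data set. A minimal sketch of how it could be produced, assuming base R's `head()` was used (the slides do not show the call):

```r
# 'cars' ships with base R (package 'datasets'), so no library() call is needed
preview <- head(cars)   # head() returns the first six rows by default
print(preview)

# Structure of the full data set: 50 observations of speed and dist
str(cars)
```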

Scatter Plot - cars

  • Here is the scatter plot for the data set ‘cars’.
  • The plot shows a direct relationship between car speed and braking distance right away.
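
A sketch of how a scatter plot like this could be drawn, assuming ggplot2 (the plotting package used elsewhere in these slides). The axis labels and the Pearson correlation are additions for illustration; the correlation puts a number on the relationship seen in the plot.

```r
library(ggplot2)

# Scatter plot of stopping distance against speed
p <- ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")
print(p)

# Pearson correlation; a value near +1 indicates a strong direct relationship
r <- cor(cars$speed, cars$dist)
print(r)
```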

Simple Linear Regression Equation

  • \(y\) is the dependent variable (output).
  • \(x\) is the independent variable (input).
  • \(m\) is the slope of the line.
  • \(b\) is the intercept (when \(x\) is zero).

\[ y = m x + b \]
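
To make the roles of \(m\) and \(b\) concrete, here is a sketch that computes both for ‘cars’. It assumes the line is chosen by ordinary least squares, which is the criterion `lm` uses; the variable names are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Least-squares slope: covariance of x and y divided by the variance of x
m <- cov(x, y) / var(x)
# The least-squares line passes through the point (mean(x), mean(y))
b <- mean(y) - m * mean(x)

cat("slope m =", round(m, 4), " intercept b =", round(b, 4), "\n")

# Predicted stopping distance at each observed speed
y_hat <- m * x + b
```

The negative intercept is a reminder that \(b\) is only the value of the line at \(x = 0\); it need not be a physically meaningful stopping distance.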

Simple Linear Regression Equation Interpreted

  • \(y\) is the dependent variable (the value being predicted).
  • \(\beta_0\) is the \(y\)-intercept (the value of \(y\) when \(x\) is zero).
  • \(\beta_1\) is the slope.
    • Positive value indicates a direct relationship.
    • Negative value indicates an inverse relationship.
  • \(x\) is the independent variable.
  • \(\epsilon\) is the error term (the part of \(y\) that the line does not explain).

\[ y = \beta_0 + \beta_1 x + \epsilon \]
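
As a sketch of how these terms map onto a fitted model, the code below fits ‘cars’ with base R's `lm()` and pulls out estimates of \(\beta_0\), \(\beta_1\), and \(\epsilon\). The object names (`fit`, `beta`, `eps_hat`) are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Named vector: "(Intercept)" estimates beta_0, "speed" estimates beta_1
beta <- coef(fit)
print(beta)

# Residuals estimate the error term: observed dist minus fitted dist
eps_hat <- residuals(fit)

# Each observation is recovered as fitted value plus residual
recovered <- fitted(fit) + eps_hat
print(all.equal(unname(recovered), cars$dist))
```

The estimated slope is positive, which matches the direct relationship between speed and stopping distance.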

SLR - cars

  • Using the equation above, we have drawn the Simple Linear Regression line for the data set ‘cars’.

SLR - cars Code

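  • ggplot code showing how the Simple Linear Regression line is drawn:
library(ggplot2)

# Scatter plot of stopping distance vs. speed with a fitted straight line
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # se = FALSE hides the confidence band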
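
With `method` set to `lm`, `geom_smooth()` fits the line with `lm()`. A minimal sketch of fitting the same model directly and using it to predict stopping distances; the speeds in `new_speeds` are hypothetical values chosen for illustration.

```r
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) at which to predict stopping distance
new_speeds <- data.frame(speed = c(10, 15, 20))
pred <- predict(fit, newdata = new_speeds)

print(cbind(new_speeds, predicted_dist = pred))
```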

Data Set - penguins

  • Next, we’ll go over the data set ‘penguins’ to see whether there’s a relationship between bill length and bill depth.
  • The data cover adult penguins of three species found on three islands in the Palmer Archipelago, Antarctica, and include their size (flipper length, body mass, bill dimensions) and sex.
## # A tibble: 6 × 7
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 1 more variable: sex <fct>

Assumptions of Linear Regression

  1. Linearity
  2. Independence of Errors
  3. Normality of Errors
  4. Equal Variance of Errors
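
A sketch of some common ways to check these assumptions on a fitted model, shown here for ‘cars’. The choice of checks (residual plot, Q-Q plot, Shapiro-Wilk test) is an assumption, not something taken from the slides. Independence of errors depends mostly on how the data were collected and is not checked here.

```r
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Linearity and equal variance: residuals vs. fitted values should show
# no curve and a roughly constant vertical spread around zero
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should lie close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test; a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw)
```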

Failed Prediction Example

  • Using the data set ‘penguins’, we’ll demonstrate what happens when the assumptions do not hold for a data set.
  • We’ll begin with a 3D scatter plot to see if there are any relationships between bill dimensions, flipper length, and species.
  • At a glance, it looks like there might be a relationship.
  • When we try to predict bill depth from bill length, the predicted values are not in the same range as the known values, as seen in the next slides.
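
A sketch of how the mismatch between predicted and known values could be checked. It assumes the data come from the palmerpenguins package (the column names match the preview shown earlier) and that a single line is fitted across all three species; the slides do not show the actual code.

```r
# Assumption: the 'palmerpenguins' package is installed
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  peng <- as.data.frame(palmerpenguins::penguins)

  # lm() drops rows with missing bill measurements by default
  fit <- lm(bill_depth_mm ~ bill_length_mm, data = peng)

  # Compare the spread of known bill depths with the spread of predictions
  observed_range  <- range(peng$bill_depth_mm, na.rm = TRUE)
  predicted_range <- range(fitted(fit))
  print(rbind(observed = observed_range, predicted = predicted_range))

  # Share of the variance in bill depth explained by bill length
  print(summary(fit)$r.squared)
}
```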

3D Scatter Plot - penguins

Failed Example of Predicted Regression Line