Scenario

When studying the penguins on the Palmer islands last week, we saw that quantitative variables “Bill Length” and “Flipper Length” had an interesting relationship in our scatterplot. Can we use Multiple Linear Regression to better understand this pattern?

Initial Data Analysis

As before, we start by doing some exploratory data analysis to determine more about this data set.

names(PData)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
dim(PData)
## [1] 333   8
head(PData)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           36.7          19.3               193        3450
## 5 Adelie  Torgersen           39.3          20.6               190        3650
## 6 Adelie  Torgersen           38.9          17.8               181        3625
## # ℹ 2 more variables: sex <fct>, year <int>

We can observe that there are four quantitative variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. There are three categorical variables of interest: island, species, and sex.

Recall, we saw that the two variables with the largest negative correlation were (Bill Depth vs Flipper Length), coming in with a correlation coefficient of -0.578.

PData %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  cor()
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956   0.5894511
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917  -0.4720157
## flipper_length_mm      0.6530956    -0.5777917         1.0000000   0.8729789
## body_mass_g            0.5894511    -0.4720157         0.8729789   1.0000000

However, when we compared the two variables in a scatterplot, we saw that the pattern was strange, with two “clumps” of data. We moved on, but agreed that we would come back to reexamine the “clumps” at a future time. Today is that day! Let’s re-examine those clumps!

ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm))+
  geom_point()

Determining the Cause of the Clumps

Given what we have seen in Multiple Regression, we can reasonably suspect that the clumps are due to one of the categorical variables. Perhaps the clumps are due to penguins from different species? or on different islands? or of different sexes? Let’s explore each possibility.

ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm, color=island))+
  geom_point()

ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm, color=sex))+
  geom_point()

ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm, color=species))+
  geom_point()

It seems like species is the culprit! Gentoo penguins have a different (flipper length) vs (bill depth) relationship then the other species of penguins. Our goal now is to perform a multiple regression analysis, introducing an indicator and an interaction variable for the “Gentoo” penguins, and to use this to create two lines of best fit: one for Gentoos, and one for non-Gentoos. Let’s get to work!

Performing a Multiple Regression Analysis, using Gentoo as an Indicator and Interaction.

  We must:
 
  Create a new indicator and interaction variable for the Gentoo penguin population! We should do this using the "mutate" command.
      PDataGentoo<- mutate(PData, GentooIND = ifelse(species=="Gentoo", 1, 0))
  Once we have done that, we need to create a multiple regression estimating Bill Depth as a function of: Flipper Length, Gentoo Indicator, Gentoo Interaction
  PDataGentoo<-mutate(PDataGentoo, GentooINT = GentooIND*flipper_length_mm)
mlm.bill_depth.flipper_length.Ind.Int <- lm(PDataGentoo$bill_depth_mm~PDataGentoo$flipper_length_mm+PDataGentoo$GentooIND+PDataGentoo$GentooINT)
mlm.bill_depth.flipper_length.Ind.Int
## 
## Call:
## lm(formula = PDataGentoo$bill_depth_mm ~ PDataGentoo$flipper_length_mm + 
##     PDataGentoo$GentooIND + PDataGentoo$GentooINT)
## 
## Coefficients:
##                   (Intercept)  PDataGentoo$flipper_length_mm  
##                       6.38683                        0.06244  
##         PDataGentoo$GentooIND          PDataGentoo$GentooINT  
##                     -14.50409                        0.04396
  Our coefficients are as follows:
Intercept = 6.38683
flipper_length_mm= 0.06244
Indicator= -14.50409
Interaction = 0.04396

Our full equation is as follows:
Depth = 6.38683 + 0.06244(Length)-14.50409 (Indicator) +     0.04396(Interaction)
Depth = 6.38683 + 0.06244(Length)-14.50409(Indicator) + 0.04396(Indicator*Length)

  We then convert this multiple regression into two linear equations, one for Gentoo and one for non-Gentoo, and getting a slope and an intercept for each.

For the non-gentoos, our Indicator = 0, so our equation becomes as follows:

  Depth = 6.38683 + 0.06244(Length)-14.50409(0) + 0.04396(0*Length)
  Depth = 6.38683 + 0.06244(Length)

  For the gentoos, our Indicator = 1, so our equation becomes as follows:
  Depth = 6.38683 + 0.06244(Length)-14.50409(1) + 0.04396(1*Length)
  Depth = (6.38683-14.50409)+ (0.06244+0.04396)*Length
  Depth = -8.11726+0.1064(Length)

  Finally, we plot the original graph, along with both linear equations.
  Our moment of truth! Let's plot these lines and see
 
 ggplot(PDataGentoo,aes(x=flipper_length_mm, y=bill_depth_mm, color=species))+
  geom_point()+
  geom_abline(intercept=6.38683, slope=0.06244, color="red")+
  geom_abline(intercept=-8.11726, slope=0.1064, color="blue")