When studying the penguins on the Palmer islands last week, we saw that quantitative variables “Bill Length” and “Flipper Length” had an interesting relationship in our scatterplot. Can we use Multiple Linear Regression to better understand this pattern?
As before, we start by doing some exploratory data analysis to determine more about this data set.
names(PData)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
dim(PData)
## [1] 333 8
head(PData)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## # ℹ 2 more variables: sex <fct>, year <int>
We can observe that there are four quantitative variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. There are three categorical variables of interest: island, species, and sex.
Recall, we saw that the two variables with the largest negative correlation were (Bill Depth vs Flipper Length), coming in with a correlation coefficient of -0.578.
PData %>%
select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
cor()
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2286256 0.6530956 0.5894511
## bill_depth_mm -0.2286256 1.0000000 -0.5777917 -0.4720157
## flipper_length_mm 0.6530956 -0.5777917 1.0000000 0.8729789
## body_mass_g 0.5894511 -0.4720157 0.8729789 1.0000000
However, when we compared the two variables in a scatterplot, we saw that the pattern was strange, with two “clumps” of data. We moved on, but agreed that we would come back to reexamine the “clumps” at a future time. Today is that day! Let’s re-examine those clumps!
ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm))+
geom_point()
Given what we have seen in Multiple Regression, we can reasonably suspect that the clumps are due to one of the categorical variables. Perhaps the clumps are due to penguins from different species? or on different islands? or of different sexes? Let’s explore each possibility.
ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm, color=island))+
geom_point()
ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm, color=sex))+
geom_point()
ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm, color=species))+
geom_point()
It seems like species is the culprit! Gentoo penguins have a different (flipper length) vs (bill depth) relationship then the other species of penguins. Our goal now is to perform a multiple regression analysis, introducing an indicator and an interaction variable for the “Gentoo” penguins, and to use this to create two lines of best fit: one for Gentoos, and one for non-Gentoos. Let’s get to work!
We must:
Create a new indicator and interaction variable for the Gentoo penguin population! We should do this using the "mutate" command.
PDataGentoo<- mutate(PData, GentooIND = ifelse(species=="Gentoo", 1, 0))
Once we have done that, we need to create a multiple regression estimating Bill Depth as a function of: Flipper Length, Gentoo Indicator, Gentoo Interaction
PDataGentoo<-mutate(PDataGentoo, GentooINT = GentooIND*flipper_length_mm)
mlm.bill_depth.flipper_length.Ind.Int <- lm(PDataGentoo$bill_depth_mm~PDataGentoo$flipper_length_mm+PDataGentoo$GentooIND+PDataGentoo$GentooINT)
mlm.bill_depth.flipper_length.Ind.Int
##
## Call:
## lm(formula = PDataGentoo$bill_depth_mm ~ PDataGentoo$flipper_length_mm +
## PDataGentoo$GentooIND + PDataGentoo$GentooINT)
##
## Coefficients:
## (Intercept) PDataGentoo$flipper_length_mm
## 6.38683 0.06244
## PDataGentoo$GentooIND PDataGentoo$GentooINT
## -14.50409 0.04396
Our coefficients are as follows:
Intercept = 6.38683
flipper_length_mm= 0.06244
Indicator= -14.50409
Interaction = 0.04396
Our full equation is as follows:
Depth = 6.38683 + 0.06244(Length)-14.50409 (Indicator) + 0.04396(Interaction)
Depth = 6.38683 + 0.06244(Length)-14.50409(Indicator) + 0.04396(Indicator*Length)
We then convert this multiple regression into two linear equations, one for Gentoo and one for non-Gentoo, and getting a slope and an intercept for each.
For the non-gentoos, our Indicator = 0, so our equation becomes as follows:
Depth = 6.38683 + 0.06244(Length)-14.50409(0) + 0.04396(0*Length)
Depth = 6.38683 + 0.06244(Length)
For the gentoos, our Indicator = 1, so our equation becomes as follows:
Depth = 6.38683 + 0.06244(Length)-14.50409(1) + 0.04396(1*Length)
Depth = (6.38683-14.50409)+ (0.06244+0.04396)*Length
Depth = -8.11726+0.1064(Length)
Finally, we plot the original graph, along with both linear equations.
Our moment of truth! Let's plot these lines and see
ggplot(PDataGentoo,aes(x=flipper_length_mm, y=bill_depth_mm, color=species))+
geom_point()+
geom_abline(intercept=6.38683, slope=0.06244, color="red")+
geom_abline(intercept=-8.11726, slope=0.1064, color="blue")