In this project we practice building multiple regression models that include indicator and interaction variables. The project has two parts. In Part 1 we work through a complete example together using the Palmer Penguins dataset. In Part 2 you carry out your own analysis from start to finish.
In an earlier script we noticed that a scatterplot of flipper length vs. bill depth showed two distinct “clumps” of points. We identified species — and specifically Gentoo penguins — as the likely cause. Here we build a multiple regression model that accounts for this.
Let’s re-examine the scatterplot that showed the two clumps, coloring by species so we can see what is going on.
ggplot(PData, aes(x = flipper_length_mm, y = bill_depth_mm)) +
geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(PData, aes(x = flipper_length_mm, y = bill_depth_mm, color = species)) +
geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The Gentoo penguins are clearly behaving differently from the Adelie and Chinstrap penguins! The two groups have different slopes AND different intercepts. This tells us we need both an indicator variable and an interaction variable in our model.
We use mutate() to add two new columns to the data. The indicator variable (which we name GentooIND) equals 1 for Gentoo penguins and 0 for all others. The interaction variable is the product of the indicator and the predictor (in this case: GentooIND times flipper length).
PenguinDataGentoo <- mutate(PData, GentooIND = ifelse(species == "Gentoo", 1, 0))
PenguinDataGentoo <- mutate(PenguinDataGentoo, GentooINT = GentooIND * flipper_length_mm)
We predict bill depth from flipper length, the Gentoo indicator, and the interaction variable.
mlm.BDepth.FLength.Ind.Int <- lm(PenguinDataGentoo$bill_depth_mm ~
PenguinDataGentoo$flipper_length_mm +
PenguinDataGentoo$GentooIND +
PenguinDataGentoo$GentooINT)
mlm.BDepth.FLength.Ind.Int
##
## Call:
## lm(formula = PenguinDataGentoo$bill_depth_mm ~ PenguinDataGentoo$flipper_length_mm +
## PenguinDataGentoo$GentooIND + PenguinDataGentoo$GentooINT)
##
## Coefficients:
## (Intercept) PenguinDataGentoo$flipper_length_mm
## 6.59425 0.06140
## PenguinDataGentoo$GentooIND PenguinDataGentoo$GentooINT
## -14.83111 0.04551
Our coefficients are:
So our full equation is: Bill Depth = 6.39 + 0.06(Flipper Length) − 14.50(Indicator) + 0.044(Interaction)
For non-Gentoo penguins (Indicator = 0):
Bill Depth = 6.39 + 0.06(Flipper Length) − 14.50(0) + 0.044(0 × Flipper Length)
Bill Depth = 6.39 + 0.06(Flipper Length)
For Gentoo penguins (Indicator = 1):
Bill Depth = 6.39 + 0.06(Flipper Length) − 14.50(1) + 0.044(1 × Flipper Length)
Bill Depth = (6.39 − 14.50) + (0.06 + 0.044)(Flipper Length)
Bill Depth = −8.11 + 0.104(Flipper Length)
ggplot(PData, aes(x = flipper_length_mm, y = bill_depth_mm, color = species)) +
geom_point() +
geom_abline(intercept = 6.39, slope = 0.06, color = "red") +
geom_abline(intercept = -8.11, slope = 0.104, color = "blue")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The red line describes non-Gentoo penguins and the blue line describes Gentoo penguins. The two lines fit their respective clusters well, and they are clearly not parallel — confirming that the interaction term was necessary. —
Choose one of the following two options. For whichever option you choose, carry out the full analysis described in the sections below.
State which option you chose:
Use the full PData dataset. Create an indicator variable
for Adelie penguins (AdelieIND = 1 if
Adelie, 0 otherwise). Predict bill_length_mm from
bill_depth_mm, your indicator, and the interaction variable
AdelieINT = AdelieIND * bill_depth_mm.
PData$AdelieIND <- ifelse(PData$species == "Adelie", 1, 0)
PData$AdelieINT <- PData$AdelieIND * PData$bill_depth_mm
Make a scatterplot of your two quantitative variables (x = predictor, y = response), coloring the points by your grouping variable. Do the two groups appear to have different slopes, different intercepts, or both?
ggplot(PData, aes(x = bill_depth_mm, y = bill_length_mm, color = as.factor(AdelieIND))) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) + # Adds the regression lines for each group
labs(title = "Bill Length vs. Bill Depth by Species Group",
x = "Bill Depth (mm)",
y = "Bill Length (mm)",
color = "Is Adelie?") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The graph is a boring graph. The two groups appear to have similar slopes with each other. The only difference is their y-intercept which is different.
Use mutate() and ifelse() to create your
indicator variable and your interaction variable.
PData <- PData %>%
mutate(AdelieIND = ifelse(species == "Adelie", 1, 0),
AdelieINT = AdelieIND * bill_depth_mm
)
Fit your model using lm(). Report the four coefficients
(intercept, predictor slope, indicator, interaction).
penguin_model <- lm(bill_length_mm ~ bill_depth_mm + AdelieIND + AdelieINT, PData)
summary(penguin_model)
##
## Call:
## lm(formula = bill_length_mm ~ bill_depth_mm + AdelieIND + AdelieINT,
## data = PData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4073 -1.7905 0.0191 1.6900 10.9582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.42413 1.61222 21.352 < 2e-16 ***
## bill_depth_mm 0.83634 0.09877 8.467 7.78e-16 ***
## AdelieIND -11.35605 3.64933 -3.112 0.00202 **
## AdelieINT 0.02069 0.20362 0.102 0.91912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.653 on 338 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7638
## F-statistic: 368.6 on 3 and 338 DF, p-value: < 2.2e-16
Write out the full regression equation:
y-hat = 54.424 + 0.836(bill_depth_mm) − 11.356(AdelieIND) + 0.021(AdelieINT)
Use the four coefficients to work out the prediction equation for each group, the same way we did in Section 1E. Show your algebra.
Group 1 (reference category, indicator = 0):
y-hat = 54.424 + 0.836(bill_depth_mm)
Group 2 (indicator = 1):
y-hat = 54.424 + 0.836(bill_depth_mm) − 11.356(AdelieIND) + 0.021(AdelieINT) y-hat = (54.424 - 11.356) + (0.836 + 0.021) —
Use geom_abline() to draw both lines on top of your
color-coded scatterplot. Use the group equations from 2D for the slope
and intercept of each line.
ggplot(PData, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 54.42413, slope = 0.836, color = "red", size = 1) +
geom_abline(intercept = (54.424 - 11.356), slope = (0.836 + 0.021), color = "blue", size = 1) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Do the lines fit their respective groups well? Are they parallel or do they cross? Does this match what you expected from your scatterplot in 2A?
No the lines do not fit their respective groups at all. They are not parallel and do not cross. This does not match what I expected from the scatterplot in 2A.