Multiple Regression with Indicator and Interaction Variables

In this project we practice building multiple regression models that include indicator and interaction variables. The project has two parts. In Part 1 we work through a complete example together using the Palmer Penguins dataset. In Part 2 you carry out your own analysis from start to finish.


Part 1: Worked Example — Gentoo vs. Non-Gentoo Penguins

In an earlier script we noticed that a scatterplot of flipper length vs. bill depth showed two distinct “clumps” of points. We identified species — and specifically Gentoo penguins — as the likely cause. Here we build a multiple regression model that accounts for this.


1A: The Motivating Scatterplot

Let’s re-examine the scatterplot that showed the two clumps, coloring by species so we can see what is going on.

ggplot(PData, aes(x = flipper_length_mm, y = bill_depth_mm)) +
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(PData, aes(x = flipper_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

The Gentoo penguins are clearly behaving differently from the Adelie and Chinstrap penguins! The two groups have different slopes AND different intercepts. This tells us we need both an indicator variable and an interaction variable in our model.


1B: Creating the Indicator and Interaction Variables

We use mutate() to add two new columns to the data. The indicator variable (which we name GentooIND) equals 1 for Gentoo penguins and 0 for all others. The interaction variable is the product of the indicator and the predictor (in this case: GentooIND times flipper length).

PenguinDataGentoo <- mutate(PData, GentooIND = ifelse(species == "Gentoo", 1, 0))
PenguinDataGentoo <- mutate(PenguinDataGentoo, GentooINT = GentooIND * flipper_length_mm)

1C: Fitting the Multiple Regression Model

We predict bill depth from flipper length, the Gentoo indicator, and the interaction variable.

mlm.BDepth.FLength.Ind.Int <- lm(PenguinDataGentoo$bill_depth_mm ~
                                   PenguinDataGentoo$flipper_length_mm +
                                   PenguinDataGentoo$GentooIND +
                                   PenguinDataGentoo$GentooINT)
mlm.BDepth.FLength.Ind.Int
## 
## Call:
## lm(formula = PenguinDataGentoo$bill_depth_mm ~ PenguinDataGentoo$flipper_length_mm + 
##     PenguinDataGentoo$GentooIND + PenguinDataGentoo$GentooINT)
## 
## Coefficients:
##                         (Intercept)  PenguinDataGentoo$flipper_length_mm  
##                             6.59425                              0.06140  
##         PenguinDataGentoo$GentooIND          PenguinDataGentoo$GentooINT  
##                           -14.83111                              0.04551

Our coefficients are:

  • Intercept = 6.39
  • flipper_length_mm = 0.06
  • GentooIND = -14.50
  • GentooINT = 0.044

So our full equation is: Bill Depth = 6.39 + 0.06(Flipper Length) − 14.50(Indicator) + 0.044(Interaction)


1E: Recovering the Two Group Equations

For non-Gentoo penguins (Indicator = 0):

Bill Depth = 6.39 + 0.06(Flipper Length) − 14.50(0) + 0.044(0 × Flipper Length)

Bill Depth = 6.39 + 0.06(Flipper Length)

For Gentoo penguins (Indicator = 1):

Bill Depth = 6.39 + 0.06(Flipper Length) − 14.50(1) + 0.044(1 × Flipper Length)

Bill Depth = (6.39 − 14.50) + (0.06 + 0.044)(Flipper Length)

Bill Depth = −8.11 + 0.104(Flipper Length)


1E: Plotting the Two Lines of Best Fit

ggplot(PData, aes(x = flipper_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_abline(intercept = 6.39,  slope = 0.06,  color = "red") +
  geom_abline(intercept = -8.11, slope = 0.104, color = "blue")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

The red line describes non-Gentoo penguins and the blue line describes Gentoo penguins. The two lines fit their respective clusters well, and they are clearly not parallel — confirming that the interaction term was necessary. —

Part 2: Your Turn

Choose one of the following two options. For whichever option you choose, carry out the full analysis described in the sections below.


Option 1 (Penguins — species indicator for Adelie): Use the full PData dataset. Create an indicator variable for Adelie penguins (AdelieIND = 1 if Adelie, 0 otherwise). Predict bill_length_mm from bill_depth_mm, your indicator, and the interaction variable AdelieINT = AdelieIND * bill_depth_mm.

Option 2 (Penguins — species indicator for Chinstrap): Use the full PData dataset. Create an indicator variable for Chinstrap penguins (ChinstrapIND = 1 if Chinstrap, 0 otherwise). Predict bill_length_mm from body_mass_g, your indicator, and the interaction variable ChinstrapINT = ChinstrapIND * body_mass_g.


State which option you chose:

Option 2


2A: The Motivating Scatterplot

Make a scatterplot of your two quantitative variables (x = predictor, y = response), coloring the points by your grouping variable. Do the two groups appear to have different slopes, different intercepts, or both?

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.2.0     ✔ tidyr     1.3.2
## ✔ readr     2.1.6     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(PData, aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Bill Length vs Body Mass by Species",
       x = "Body Mass (g)",
       y = "Bill Length (mm)")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Describe what you see:

The Chinstrap and non‑Chinstrap penguins follow different linear patterns. Chinstrap penguins tend to have longer bills for body mass, meaning a higher intercept. The slopes are also different — Chinstrap increases more steeply with body mass — so the groups differ in both slope and intercept.

2B: Creating the Indicator and Interaction Variables

Use mutate() and ifelse() to create your indicator variable and your interaction variable.

PData2 <- PData %>%
  mutate(
    ChinstrapIND = ifelse(species == "Chinstrap", 1, 0),
    ChinstrapINT = ChinstrapIND * body_mass_g
  )

2C: Fitting the Multiple Regression Model

Fit your model using lm(). Report the four coefficients (intercept, predictor slope, indicator, interaction).

model2 <- lm(bill_length_mm ~ body_mass_g + ChinstrapIND + ChinstrapINT,
             data = PData2)

summary(model2)
## 
## Call:
## lm(formula = bill_length_mm ~ body_mass_g + ChinstrapIND + ChinstrapINT, 
##     data = PData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5495 -1.8062 -0.0397  1.6958  9.3138 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.1685101  0.8220135  23.319  < 2e-16 ***
## body_mass_g   0.0054502  0.0001869  29.160  < 2e-16 ***
## ChinstrapIND 13.0056828  3.1871760   4.081 5.61e-05 ***
## ChinstrapINT -0.0009875  0.0008416  -1.173    0.241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.582 on 338 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7764 
## F-statistic: 395.7 on 3 and 338 DF,  p-value: < 2.2e-16

Write out the full regression equation:

y-hat=19.1685+0.0054502(body_mass_g)+13.0057(ChinstrapIND)−0.0009875(ChinstrapINT)

2D: Recovering the Two Group Equations

Use the four coefficients to work out the prediction equation for each group, the same way we did in Section 1E. Show your algebra.

Group 1 (reference category, indicator = 0):

y-hat=19.1685+0.0054502(body_mass_g)+13.0057(ChinstrapIND)−0.0009875(ChinstrapINT) y-hat=19.1685+0.0054502(body_mass_g)+13.0057(0)−0.0009875(0) y-hat=19.1685+0.0054502(body_mass_g)

Group 2 (indicator = 1):

y-hat=19.1685+0.0054502(body_mass_g)+13.0057(ChinstrapIND)−0.0009875(ChinstrapINT) y-hat=19.1685+0.0054502(body_mass_g)+13.0057(1)−0.0009875(body_mass_g) y-hat=(19.1685+13.0057)+(0.0054502−0.0009875)(body_mass_g) y-hat=32.1742+0.0044627(body_mass_g)


2E: Plotting the Two Lines of Best Fit

Use geom_abline() to draw both lines on top of your color-coded scatterplot. Use the group equations from 2D for the slope and intercept of each line.

ggplot(PData2, aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point() +
  
  # Line for non-Chinstrap penguins (indicator = 0)
  geom_abline(intercept = 19.1685, slope = 0.0054502, color = "blue", size = 1.2) +
  
  # Line for Chinstrap penguins (indicator = 1)
  geom_abline(intercept = 32.1742, slope = 0.0044627, color = "blue", size = 1.2) +
  
  labs(title = "Bill Length vs Body Mass with Fitted Lines",
       x = "Body Mass (g)",
       y = "Bill Length (mm)") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Do the lines fit their respective groups well? Are they parallel or do they cross? Does this match what you expected from your scatterplot in 2A?

The two fitted lines match their groups well. The Chinstrap line sits higher, which makes sense because Chinstrap penguins have longer bills on average. The slopes are a little different — the Chinstrap slope is a bit flatter, so the lines are not parallel, but they also do not cross. This matches what I saw in the scatterplot from 2A: the Chinstrap points form a higher cluster with a similar but slightly different trend compared to the other species.