Scenario: Linear Regression with the Palmer Penguins Data

In this project we use the Palmer Penguins dataset to practice fitting and interpreting simple linear regression models. The project has two parts. In Part 1, we work through a complete example together — making a scatterplot, computing the correlation coefficient, fitting a regression line, interpreting the slope and intercept, making a prediction, and computing a residual. In Part 2, you will carry out your own analysis from start to finish.


Part 1: Worked Example

Does flipper length predict body mass in Gentoo penguins?

1A: Filtering our data

We filter to Gentoo penguins and drop any rows with missing values in our two variables of interest.

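library(dplyr)           # filter() and %>%
library(ggplot2)         # ggplot() for the scatterplots
library(palmerpenguins)  # the penguins data set

gentoo <- penguins %>%
  filter(species == "Gentoo",
         !is.na(flipper_length_mm),
         !is.na(body_mass_g))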

1B: Scatterplot

We begin by making a scatterplot to look at the association before computing anything.

ggplot(gentoo, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Flipper Length vs. Body Mass (Gentoo Penguins)",
    x     = "Flipper Length (mm)",
    y     = "Body Mass (g)"
  )

There is a moderate-to-strong, positive, linear association between flipper length and body mass for Gentoo penguins. Penguins with longer flippers tend to have greater body mass. There are no obvious outliers or major departures from linearity.


1C: Correlation Coefficient

cor(gentoo$flipper_length_mm, gentoo$body_mass_g, use = "complete.obs")
## [1] 0.7026665

We see that the correlation coefficient is 0.703. This positive value confirms the positive direction we saw in the scatterplot. The magnitude (0.703) indicates a moderately strong linear association — penguins with longer flippers tend to be heavier, and this tendency is fairly consistent across the sample.
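Two properties of r are worth checking directly, because they matter when choosing explanatory and response variables: r is the same whichever variable is listed first, and it does not change when a variable is converted to other units. The sketch below uses a small made-up data set; the vectors flipper_mm and mass_g are illustrative values, not rows from penguins.

```r
# Made-up illustrative values, not taken from the penguins data
flipper_mm <- c(208, 211, 215, 218, 222, 226, 230)
mass_g     <- c(4300, 4650, 4700, 5100, 5050, 5500, 5700)

r_xy <- cor(flipper_mm, mass_g)
r_yx <- cor(mass_g, flipper_mm)          # swap the two variables
r_kg <- cor(flipper_mm, mass_g / 1000)   # convert grams to kilograms

# all.equal() compares numbers up to floating-point tolerance
isTRUE(all.equal(r_xy, r_yx))
isTRUE(all.equal(r_xy, r_kg))
```

Both comparisons return TRUE. The slope of a regression line, by contrast, does depend on the units and on which variable is the response.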


1D: Fitting the Regression Line

We use lm() to fit the line of best fit, with body mass as the response variable and flipper length as the explanatory variable.

model <- lm(body_mass_g ~ flipper_length_mm, data = gentoo)
coef(model)
##       (Intercept) flipper_length_mm 
##        -6787.2806           54.6225

The regression equation is: y-hat = 54.62x - 6787.28, where x is flipper length in mm and y-hat is predicted body mass in grams.

Interpretation of the slope: For each additional 1 mm of flipper length, the model predicts body mass to increase by approximately 54.62 grams, on average.

Interpretation of the intercept: The intercept of -6787.28 grams is the predicted body mass when flipper length is 0 mm. This is physically impossible, so the intercept does not have a meaningful real-world interpretation here — it simply positions the line correctly within the range of the data.
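For simple linear regression, the slope and intercept reported by lm() can be recovered from summary statistics: slope = r * sd(y) / sd(x), and intercept = mean(y) - slope * mean(x). The second formula is why the intercept only "positions" the line: it is whatever value makes the line pass through the point of means. A minimal sketch on made-up values (x, y, and the *_by_hand names are illustrative, not from the penguins data):

```r
# Made-up illustrative values, not taken from the penguins data
x <- c(208, 211, 215, 218, 222, 226, 230)          # flipper length (mm)
y <- c(4300, 4650, 4700, 5100, 5050, 5500, 5700)   # body mass (g)

# Slope and intercept from summary statistics
slope_by_hand     <- cor(x, y) * sd(y) / sd(x)
intercept_by_hand <- mean(y) - slope_by_hand * mean(x)

# The same quantities from lm(); coef() lists the intercept first
fit <- lm(y ~ x)
coef(fit)
c(intercept_by_hand, slope_by_hand)   # should match coef(fit)
```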


1E: Scatterplot with Regression Line

ggplot(gentoo, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue", linewidth = 1) +
  labs(
    title    = "Flipper Length vs. Body Mass (Gentoo Penguins)",
    x        = "Flipper Length (mm)",
    y        = "Body Mass (g)"
  )
## `geom_smooth()` using formula = 'y ~ x'


1F: Making a Prediction and Calculating a Residual

What body mass does the model predict for a Gentoo penguin with a flipper length of 210 mm?

54.62*210 - 6787.28
## [1] 4682.92

The model predicts a body mass of approximately 4683 grams for a Gentoo penguin with a 210 mm flipper.

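To compute a residual, compare this prediction with an observed penguin at the same flipper length: residual = actual body mass - predicted body mass. Here the residual is approximately -283 grams. This is negative, meaning this penguin’s actual body mass is about 283 grams below what the model predicts for a penguin of its flipper length. The point sits below the regression line.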
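The prediction and residual can be laid out step by step. The sketch below uses the unrounded coefficients printed by coef(model) in 1D. The observed body mass of 4400 g is an assumed value, chosen only because it is consistent with a residual of about -283 g; the variable names are illustrative.

```r
# Coefficients as printed by coef(model) in 1D
intercept <- -6787.2806
slope     <- 54.6225

# Predicted body mass (g) for a flipper length of 210 mm
predicted <- intercept + slope * 210

# Assumed observed body mass (g), chosen to be consistent with the
# residual of about -283 g; it is not a value stated in the text
observed <- 4400

# Residual = actual - predicted; negative means the point lies below the line
residual <- observed - predicted
round(c(predicted = predicted, residual = residual), 1)
```

With the unrounded coefficients the prediction is about 4683.4 g, slightly different from the 4682.92 g obtained with coefficients rounded to two decimals.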


Part 2: Your Turn

Now it is your turn to carry out a complete regression analysis. Choose one of the following options.


Option 1: Bill depth (mm) predicting body mass (g) for Gentoo penguins.

Option 2: Bill length (mm) predicting body mass (g) for Chinstrap penguins.

Option 3: Bill length (mm) predicting bill depth (mm) for Chinstrap penguins.


State which option you chose:

Option 3: Bill length (mm) predicting bill depth (mm) for Chinstrap penguins.

2A: Filter and Describe Your Data

Filter to the appropriate species and drop rows with missing values in your two variables.

chinstrap <- penguins %>%
  filter(species == "Chinstrap",
         !is.na(bill_length_mm),
         !is.na(bill_depth_mm))

2B: Scatterplot

Make a scatterplot with the explanatory variable on the x-axis and the response variable on the y-axis. Add appropriate axis labels and a title.

ggplot(chinstrap, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Bill Length vs. Bill Depth (Chinstrap Penguins)",
    x     = "Bill Length (mm)",
    y     = "Bill Depth (mm)"
  )

Describe the association (direction, strength, linearity, any notable outliers):

There is a moderate-to-strong, positive, linear association between bill length and bill depth for Chinstrap penguins. Penguins with longer bills tend to have greater bill depth. There is one possible outlier, but no major departure from linearity.

2C: Correlation Coefficient

Compute your correlation coefficient using appropriate code. Report the value and interpret it in the context of your chosen variables.

cor(chinstrap$bill_length_mm, chinstrap$bill_depth_mm, use = "complete.obs")
## [1] 0.6535362

Interpretation of r:

We see that the correlation coefficient is 0.6535. This positive value confirms the positive direction we saw in the scatterplot. The magnitude (0.6535) indicates a moderately strong linear association: penguins with longer bills tend to have greater bill depth as well, and this tendency is fairly consistent across the sample.

2D: Fitting the Regression Line

Use lm() to fit the regression line. Extract the slope and intercept using coef() and write out the regression equation in the form y-hat = mx + b.

model <- lm(bill_length_mm ~ bill_depth_mm, data = chinstrap)
coef(model)
##   (Intercept) bill_depth_mm 
##     13.427908      1.922084

Regression equation:

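y-hat = 1.922x + 13.428, where x is bill depth in mm and y-hat is predicted bill length in mm. As fitted, the formula bill_length_mm ~ bill_depth_mm makes bill length the response and bill depth the explanatory variable, which reverses the roles stated in the chosen option and plotted in 2B and 2E.

Interpretation of the slope: For each additional 1 mm of bill depth, the model predicts bill length to increase by approximately 1.922 mm, on average.

Interpretation of the intercept: The intercept of 13.428 mm is the predicted bill length when bill depth is 0 mm. This is physically impossible, so the intercept does not have a meaningful real-world interpretation here; it simply positions the line correctly within the range of the data.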
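The chosen option and the scatterplot in 2B treat bill length as the explanatory variable, while the lm() call in 2D puts bill_length_mm on the left of the formula. A fit in the direction of the chosen option could be sketched as follows. This assumes the dplyr and palmerpenguins packages are installed; model_depth and model_length are names introduced here for illustration, and the output is omitted.

```r
library(dplyr)
library(palmerpenguins)

chinstrap <- penguins %>%
  filter(species == "Chinstrap",
         !is.na(bill_length_mm),
         !is.na(bill_depth_mm))

# Response on the left of ~, explanatory variable on the right
model_depth <- lm(bill_depth_mm ~ bill_length_mm, data = chinstrap)
coef(model_depth)

# Swapping the variables changes the slope; the two slopes multiply to r^2
model_length <- lm(bill_length_mm ~ bill_depth_mm, data = chinstrap)
r <- cor(chinstrap$bill_length_mm, chinstrap$bill_depth_mm)
all.equal(unname(coef(model_depth)[2] * coef(model_length)[2]), r^2)
```

Because the two slopes multiply to r^2, the reported r = 0.6535 and reversed slope of 1.922 imply a slope near 0.6535^2 / 1.922 ≈ 0.222 mm of bill depth per 1 mm of bill length for this fit.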


2E: Scatterplot with Regression Line

Add the regression line to your scatterplot.

ggplot(chinstrap, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue", linewidth = 1) +
  labs(
    title    = "Bill Length vs. Bill Depth (Chinstrap Penguins)",
    x        = "Bill Length (mm)",
    y        = "Bill Depth (mm)"
  )
## `geom_smooth()` using formula = 'y ~ x'


2F: Making a Prediction and Calculating a Residual

Choose a point (x, y) from your scatterplot. Use your linear model to predict the y-hat value that corresponds to this x-value, then compute the residual. Report and interpret the results.

What bill length does the model predict for a Chinstrap penguin with a bill depth of 20 mm?

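# Predicted bill length (mm) at a bill depth of 20 mm
1.922084*20 + 13.427908
## [1] 51.86959
# Residual = actual bill length (50.3 mm) - predicted bill length
50.3-51.86959
## [1] -1.56959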

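Interpretation:

The model predicts a bill length of approximately 51.87 mm for a Chinstrap penguin with a bill depth of 20 mm. The observed penguin at this bill depth has a bill length of 50.3 mm, so the residual (actual minus predicted) is approximately -1.57 mm. This is negative, meaning this penguin’s actual bill length is about 1.57 mm below what the model predicts for a penguin of its bill depth. In a plot with bill depth on the x-axis and bill length on the y-axis, the point sits below the regression line.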
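The same prediction can be obtained from the fitted model object instead of retyping rounded coefficients. A minimal sketch, assuming the dplyr and palmerpenguins packages are installed and reusing the formula from 2D; the names predicted, observed, and residual are introduced here for illustration.

```r
library(dplyr)
library(palmerpenguins)

chinstrap <- penguins %>%
  filter(species == "Chinstrap",
         !is.na(bill_length_mm),
         !is.na(bill_depth_mm))

# Same formula as in 2D: bill length (response) on bill depth (explanatory)
model <- lm(bill_length_mm ~ bill_depth_mm, data = chinstrap)

# predict() needs a data frame whose column name matches the explanatory variable
predicted <- predict(model, newdata = data.frame(bill_depth_mm = 20))

# Residual for an observed bill length of 50.3 mm at this bill depth
observed <- 50.3
residual <- observed - unname(predicted)
c(predicted = unname(predicted), residual = residual)
```

The result should agree with the hand calculation in 2F, about 51.87 mm for the prediction and -1.57 mm for the residual.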