In this project we use the Palmer Penguins dataset to practice fitting and interpreting simple linear regression models. The project has two parts. In Part 1, we work through a complete example together — making a scatterplot, computing the correlation coefficient, fitting a regression line, interpreting the slope and intercept, making a prediction, and computing a residual. In Part 2, you will carry out your own analysis from start to finish.
We filter to Gentoo penguins and drop any rows with missing values in our two variables of interest.
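# Packages assumed to be loaded in a setup chunk that is not shown here.
library(palmerpenguins)  # provides the penguins data set
library(tidyverse)       # provides %>%, filter(), and ggplot()
gentoo <- penguins %>%
  filter(species == "Gentoo",
         !is.na(flipper_length_mm),
         !is.na(body_mass_g))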
We begin by making a scatterplot to look at the association before computing anything.
ggplot(gentoo, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Flipper Length vs. Body Mass (Gentoo Penguins)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
There is a moderate-to-strong, positive, linear association between flipper length and body mass for Gentoo penguins. Penguins with longer flippers tend to have greater body mass. There are no obvious outliers or major departures from linearity.
cor(gentoo$flipper_length_mm, gentoo$body_mass_g, use = "complete.obs")
## [1] 0.7026665
We see that the correlation coefficient is 0.703. This positive value confirms the positive direction we saw in the scatterplot. The magnitude (0.703) indicates a moderately strong linear association — penguins with longer flippers tend to be heavier, and this tendency is fairly consistent across the sample.
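As a rough check on what cor() reports, the sketch below computes r by hand from z-scores. It uses only the ten Gentoo rows printed in Part 2 of this document as a stand-in sample, so its value will not equal 0.703. The names flipper_mm, mass_g, and r_by_hand are illustrative and are not objects from the project.

```r
# Ten Gentoo rows copied from the tibble printed in Part 2 (a stand-in
# sample, not the full filtered data set).
flipper_mm <- c(211, 230, 210, 218, 215, 210, 211, 219, 209, 215)
mass_g     <- c(4500, 5700, 4450, 5700, 5400, 4550, 4800, 5200, 4400, 5150)

# Standardize each variable: subtract the mean, divide by the sample SD.
z_x <- (flipper_mm - mean(flipper_mm)) / sd(flipper_mm)
z_y <- (mass_g - mean(mass_g)) / sd(mass_g)

# r is the sum of the products of the z-scores, divided by n - 1.
r_by_hand <- sum(z_x * z_y) / (length(flipper_mm) - 1)

# The hand computation should agree with cor(), and r does not depend on
# which variable is treated as x and which as y.
all.equal(r_by_hand, cor(flipper_mm, mass_g))
all.equal(cor(flipper_mm, mass_g), cor(mass_g, flipper_mm))
```

Because r is built from z-scores, it has no units: measuring flipper length in cm instead of mm would leave it unchanged.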
We use lm() to fit the line of best fit, with body mass
as the response variable and flipper length as the explanatory
variable.
model <- lm(body_mass_g ~ flipper_length_mm, data = gentoo)
coef(model)
##       (Intercept) flipper_length_mm
##        -6787.2806           54.6225
The regression equation is: y-hat = 54.62x - 6787.28, where x is flipper length in mm and y-hat is predicted body mass in grams.
Interpretation of the slope: For each additional 1 mm of flipper length, the model predicts body mass to increase by approximately 54.62 grams, on average.
Interpretation of the intercept: The intercept of -6787.28 grams is the predicted body mass when flipper length is 0 mm. This is physically impossible, so the intercept does not have a meaningful real-world interpretation here — it simply positions the line correctly within the range of the data.
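The least-squares slope and intercept can also be recovered from summary statistics: the slope equals r times (SD of y divided by SD of x), and the intercept equals the mean of y minus the slope times the mean of x. The following is a minimal sketch of that relationship on the ten Gentoo rows printed in Part 2, not on the full data set, so the numbers differ from the coefficients above. The variable names are illustrative.

```r
# Ten Gentoo rows copied from the tibble printed in Part 2 (stand-in sample).
flipper_mm <- c(211, 230, 210, 218, 215, 210, 211, 219, 209, 215)
mass_g     <- c(4500, 5700, 4450, 5700, 5400, 4550, 4800, 5200, 4400, 5150)

# Slope: correlation rescaled by the ratio of standard deviations (g per mm).
slope <- cor(flipper_mm, mass_g) * sd(mass_g) / sd(flipper_mm)

# Intercept: chosen so the line passes through (mean of x, mean of y).
intercept <- mean(mass_g) - slope * mean(flipper_mm)

# lm() should return the same two numbers for this sample.
fit <- lm(mass_g ~ flipper_mm)
all.equal(unname(coef(fit)), c(intercept, slope))
```

The second formula is also why the intercept can land far from any realistic value: it is determined by where the line must sit near the center of the data, not by what happens at x = 0.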
ggplot(gentoo, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue", linewidth = 1) +
  labs(
    title = "Flipper Length vs. Body Mass (Gentoo Penguins)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
## `geom_smooth()` using formula = 'y ~ x'
What body mass does the model predict for a Gentoo penguin with a flipper length of 210 mm?
54.62*210 - 6787.28
## [1] 4682.92
The model predicts a body mass of approximately 4683 grams for a Gentoo penguin with a 210 mm flipper.
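The hand calculation uses coefficients rounded to two decimals. As a hedged alternative, the sketch below repeats the prediction with the four-decimal coefficients printed by coef(model); predict_mass is a hypothetical helper introduced here for illustration, not part of the project.

```r
# Coefficients as printed by coef(model) earlier in Part 1.
b0 <- -6787.2806  # intercept, in grams
b1 <- 54.6225     # slope, in grams per mm of flipper length

# Hypothetical helper: predicted body mass (g) for a flipper length (mm).
predict_mass <- function(flipper_mm) b0 + b1 * flipper_mm

predict_mass(210)
## [1] 4683.444

# With the fitted model object in memory, the equivalent call would be
# predict(model, newdata = data.frame(flipper_length_mm = 210))
```

The result differs from 4682.92 by about half a gram, which comes entirely from rounding the slope before multiplying by 210. Both values round to 4683 grams.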
One Gentoo penguin in our dataset has a flipper length of 210 mm and an actual body mass of 4,400 grams. What is its residual?
4400 - 4683
## [1] -283
The residual is approximately -283 grams. This is negative, meaning this penguin’s actual body mass is about 283 grams below what the model predicts for a penguin of its flipper length. It sits below the regression line.
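The same actual-minus-predicted rule applies to every penguin at once. The sketch below is an illustration under stated assumptions: it applies the Part 1 coefficients to the ten Gentoo rows printed in Part 2 and labels each point as above or below the line. The vector names and the position column are made up for this example.

```r
# Coefficients as printed by coef(model) in Part 1.
b0 <- -6787.2806  # intercept, in grams
b1 <- 54.6225     # slope, in grams per mm

# Ten Gentoo rows copied from the tibble printed in Part 2 (stand-in sample).
flipper_mm <- c(211, 230, 210, 218, 215, 210, 211, 219, 209, 215)
mass_g     <- c(4500, 5700, 4450, 5700, 5400, 4550, 4800, 5200, 4400, 5150)

# Residual = actual - predicted, computed for all rows in one step.
predicted_g <- b0 + b1 * flipper_mm
residual_g  <- mass_g - predicted_g

# A positive residual means the point lies above the regression line.
position <- ifelse(residual_g > 0, "above line", "below line")

data.frame(flipper_mm, mass_g, predicted_g, residual_g, position)
```

With the fitted model available, residuals(model) returns the same quantity for every row of the data used in the fit.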
Now it is your turn to carry out a complete regression analysis. Choose one of the following options.
Option 1: Bill depth (mm) predicting body mass (g) for Gentoo penguins.
Option 2: Bill length (mm) predicting body mass (g) for Chinstrap penguins.
Option 3: Bill length (mm) predicting bill depth (mm) for Chinstrap penguins.
Chosen option: Option 1 (bill depth predicting body mass for Gentoo penguins).
Filter to the appropriate species and drop rows with missing values in your two variables.
gentoo2 <- penguins %>%
  filter(species == "Gentoo",
         !is.na(bill_depth_mm),
         !is.na(body_mass_g))
gentoo2
## # A tibble: 123 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # ℹ 113 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Make a scatterplot with the explanatory variable on the x-axis and the response variable on the y-axis. Add appropriate axis labels and a title.
ggplot(gentoo2, aes(x = bill_depth_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Bill Depth vs. Body Mass (Gentoo Penguins)",
    x = "Bill Depth (mm)",
    y = "Body Mass (g)"
  )
Describe the association (direction, strength, linearity, any notable outliers):
There is a moderately strong, positive, roughly linear association between bill depth and body mass for Gentoo penguins. Penguins with deeper bills tend to be heavier. There are no major outliers.
Compute your correlation coefficient using appropriate code. Report the value and interpret it in the context of your chosen variables.
cor(gentoo2$bill_depth_mm, gentoo2$body_mass_g, use = "complete.obs")
## [1] 0.719085
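Interpretation of r:
The correlation coefficient is 0.719. The positive sign means that Gentoo penguins with deeper bills tend to have greater body mass, and the magnitude indicates a moderately strong linear association between bill depth and body mass.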
Use lm() to fit the regression line. Extract the slope
and intercept using coef() and write out the regression
equation in the form y-hat = mx + b
model2 <- lm(body_mass_g ~ bill_depth_mm, data = gentoo2)
coef(model2)
##   (Intercept) bill_depth_mm
##     -458.9852      369.4406
Regression equation:
y-hat = 369.44x - 458.99, where x is bill depth in mm and y-hat is predicted body mass in grams.
Interpret the slope in context:
For each additional 1 mm of bill depth, the model predicts body mass to increase by about 369.44 grams, on average.
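One way to see what the slope means is to compare predictions for two bill depths that are exactly 1 mm apart: the difference equals the slope, whichever starting depth is used. The sketch below assumes the coefficients printed by coef(model2); predict_mass is a hypothetical helper name.

```r
# Coefficients as printed by coef(model2).
b0 <- -458.9852  # intercept, in grams
b1 <- 369.4406   # slope, in grams per mm of bill depth

# Hypothetical helper: predicted body mass (g) for a bill depth (mm).
predict_mass <- function(bill_depth_mm) b0 + b1 * bill_depth_mm

# Predictions one millimetre apart differ by the slope.
predict_mass(15) - predict_mass(14)
predict_mass(16) - predict_mass(15)
```

Both differences come out to about 369.44 grams, which is the "on average" change per millimetre that the slope describes.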
Interpret the intercept in context:
The intercept of -458.99 grams is the predicted body mass when bill depth is 0 mm. Since a bill depth of 0 mm is not realistic for penguins, and a negative body mass is impossible, the intercept does not have a meaningful real-world interpretation. It simply positions the regression line correctly within the observed data range.
Add the regression line to your scatterplot.
ggplot(gentoo2, aes(x = bill_depth_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue", linewidth = 1) +
  labs(
    title = "Bill Depth vs. Body Mass (Gentoo Penguins)",
    x = "Bill Depth (mm)",
    y = "Body Mass (g)"
  )
## `geom_smooth()` using formula = 'y ~ x'
Choose a reasonable value of x within the range of your data. Use your linear model to predict the corresponding y-hat value. Report and interpret the result.
369.4406*15 - 458.9852
## [1] 5082.624
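The x value I chose: 15 mm (bill depth)
Predicted y: 5082.624 g. The model predicts a body mass of about 5083 grams for a Gentoo penguin with a bill depth of 15 mm.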
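Before trusting a prediction, it helps to confirm that the chosen x is inside the observed range, since the line is only supported by data there. The sketch below is a rough check using the ten bill depths printed in the tibble above, not all 123 rows, so the true range may be wider. With the full data in memory, range(gentoo2$bill_depth_mm) would be the better input. The names x_new and in_range are illustrative.

```r
# Bill depths (mm) of the ten rows printed from gentoo2 (stand-in sample).
bill_depth_mm <- c(13.2, 16.3, 14.1, 15.2, 14.5, 13.5, 14.6, 15.3, 13.4, 15.4)

# Coefficients as printed by coef(model2).
b0 <- -458.9852  # intercept, in grams
b1 <- 369.4406   # slope, in grams per mm

x_new <- 15  # chosen bill depth, in mm

# TRUE when the chosen value lies between the smallest and largest observed x.
in_range <- x_new >= min(bill_depth_mm) && x_new <= max(bill_depth_mm)
in_range
## [1] TRUE

# Predicted body mass (g) at the chosen bill depth.
predicted_g <- b0 + b1 * x_new
predicted_g
## [1] 5082.624
```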
Find the actual y value for the first penguin in your filtered dataset. Compute its residual by hand (actual minus predicted). Interpret the sign of the residual — is this penguin above or below the regression line?
first_actual <- gentoo2$body_mass_g[1]
first_predicted <- 369.4406*gentoo2$bill_depth_mm[1] - 458.9852
first_actual - first_predicted
## [1] 82.36928
Residual:
82.37
Interpretation:
The residual is positive (about 82 grams), meaning this penguin’s actual body mass is about 82 grams above what the model predicts for a penguin of its bill depth. It sits above the regression line.
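The residual can be re-derived directly from the first row of the printed tibble (bill depth 13.2 mm, body mass 4500 g) and the coefficients from coef(model2). The sketch below does that with literal values so it runs without the data set loaded. The variable names are illustrative.

```r
# Coefficients as printed by coef(model2).
b0 <- -458.9852  # intercept, in grams
b1 <- 369.4406   # slope, in grams per mm

# First row of gentoo2 as printed above.
first_depth_mm <- 13.2
first_actual_g <- 4500

# Residual = actual - predicted.
first_predicted_g <- b0 + b1 * first_depth_mm
first_residual_g  <- first_actual_g - first_predicted_g
first_residual_g
## [1] 82.36928

# Positive residual: the point lies above the regression line.
if (first_residual_g > 0) "above the line" else "below the line"
## [1] "above the line"
```

With the fitted model available, residuals(model2)[1] gives the same quantity from the unrounded coefficients, so it may differ slightly in the later decimals.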