Introduction

The Palmer Archipelago in Antarctica is home to three distinct species of penguins: Adelie, Chinstrap, and Gentoo. In ecological studies, body mass is a critical indicator of an animal’s overall health and reproductive success. However, capturing and weighing penguins can be stressful for the birds.

The purpose of this research is to develop a reliable statistical model that predicts body mass (g) from non-invasive physical measurements, specifically flipper length (mm). If the model predicts body mass accurately, scientists would be able to monitor penguins’ health without invasive weighing.

The “penguins” dataset is used for this purpose with the following variables:

species (the biological type: Adelie, Chinstrap, or Gentoo)
island (the specific location where the penguin was found)
bill_len (the horizontal length of the beak from face to tip)
bill_dep (the vertical thickness or height of the beak)
flipper_len (the length of the wing in millimeters)
body_mass (the weight of the bird in grams)
sex (the biological gender: male or female)
year (the year the study took place: 2007, 2008, or 2009)

The dataset has 8 variables, but only three are used in this research: flipper length and body mass, plus species, which is added to the model at a later stage.

Missing Values

We begin by loading the necessary libraries and the penguins dataset.

library("tidyverse")   # data wrangling (dplyr, tidyr) and plotting (ggplot2)
library("ggfortify")   # autoplot() method for lm diagnostic plots
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
## $ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
## $ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
## $ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
## $ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
## $ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
## $ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

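The glimpse output shows some missing values (NA) in the dataset. To prevent errors in our analysis, we use colSums(is.na()) to count the missing values in each column and then use drop_na to remove every row that has a missing value in any column. This removes 11 rows (344 to 333), most of them because sex is missing, even though sex is not used in the model.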

MISS <- colSums(is.na(penguins))
print(MISS)
##     species      island    bill_len    bill_dep flipper_len   body_mass 
##           0           0           2           2           2           2 
##         sex        year 
##          11           0
Clean_Bird <- penguins %>%
  drop_na()

glimpse(Clean_Bird)
## Rows: 333
## Columns: 8
## $ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
## $ bill_len    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6, 34.6…
## $ bill_dep    <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2, 21.1…
## $ flipper_len <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 185, 195…
## $ body_mass   <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800, 4400…
## $ sex         <fct> male, female, female, female, male, female, male, female, …
## $ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

The cleaned dataset, Clean_Bird, has 333 complete rows and no missing values.
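Because drop_na with no arguments also drops rows where only sex is missing, it may be worth checking how many rows would survive if only the modeling columns were required to be complete. The following is a minimal base R sketch of that comparison, not part of the original analysis. It assumes R 4.5 or later, where `penguins` ships with the built-in datasets package, and the names `model_cols`, `complete_all`, and `complete_model` are hypothetical.

```r
# Assumption: R >= 4.5, where `penguins` is available in the datasets package.

# Columns the regression uses later in the report (illustrative choice).
model_cols <- c("flipper_len", "body_mass", "species")

# Rows complete in every column: the same rule as drop_na() with no arguments.
complete_all <- penguins[complete.cases(penguins), ]

# Rows complete in the modeling columns only.
complete_model <- penguins[complete.cases(penguins[, model_cols]), ]

# Compare how many rows each rule keeps.
c(all_columns = nrow(complete_all), model_columns = nrow(complete_model))
```

With tidyr, the second rule could be written as drop_na(penguins, flipper_len, body_mass, species). The rest of this report keeps the stricter 333-row Clean_Bird.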

Visualization

Now that we have a clean dataset, we can visualize the relationship between the two variables. We use ggplot to draw a scatterplot of body mass against flipper length for all species together, with a single fitted regression line.

Clean_Bird %>%
  ggplot(aes(x = flipper_len, y = body_mass)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Flipper Length vs. Body Mass",
       subtitle = "A strong positive correlation is visible",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)") +
  theme_minimal()

The scatterplot shows several features that support a linear approach. First, there is a clear positive trend: as flipper length increases along the x-axis, body mass on the y-axis rises, which is consistent with our hypothesis. Second, the points show no obvious curvature and cluster around the red regression line, which suggests that a linear model is a reasonable first choice. Finally, the gray shaded region around the line is the 95% confidence interval for the fitted mean. The band is narrow, which indicates that the average body mass at a given flipper length is estimated precisely. It does not describe how far individual birds fall from the line, and outliers are checked separately with the diagnostic plots.
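The visual impression of a positive trend can be put on a numeric footing with a correlation coefficient. The sketch below computes the Pearson correlation for all birds and then within each species, since a pooled correlation can be inflated by differences between groups. It is an illustrative addition: it assumes R 4.5 or later for the built-in `penguins` data, rebuilds Clean_Bird in base R so that it runs on its own, and the names `overall_r` and `by_species_r` are hypothetical.

```r
# Assumption: R >= 4.5, where `penguins` is available in the datasets package.
# Rebuild the cleaned data (same rows as penguins %>% drop_na()).
Clean_Bird <- penguins[complete.cases(penguins), ]

# Pearson correlation between flipper length and body mass, all birds pooled.
overall_r <- cor(Clean_Bird$flipper_len, Clean_Bird$body_mass)

# The same correlation computed separately within each species.
by_species_r <- sapply(
  split(Clean_Bird, Clean_Bird$species),
  function(d) cor(d$flipper_len, d$body_mass)
)

round(c(overall = overall_r, by_species_r), 2)
```

If the within-species values turn out noticeably lower than the pooled value, part of the overall trend reflects differences between species rather than differences between individual birds.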

Linear Regression

Having seen a strong linear relationship between flipper length and body mass in the exploratory analysis, we now move to statistical modeling. While the scatterplot suggests a correlation, it does not quantify it. To do so, we fit a simple linear regression model. This gives us the equation of the line we observed, turning our visual intuition into a predictive tool.

Model_Bird <- lm(body_mass ~ flipper_len, data = Clean_Bird)
autoplot(Model_Bird)

To validate the model, we check the diagnostic plots. In the normal Q-Q plot the points follow the straight line closely, which suggests that the residuals are approximately normally distributed. The residuals vs. leverage plot shows no high-leverage observations with large residuals that would unduly influence the fit. However, the residuals vs. fitted plot reveals a critical flaw: the points form distinct clusters rather than a random scatter. This suggests that the model is too simple and ignores differences between biological groups. To address this, we update the model to include species.

Model_BirdE <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)
autoplot(Model_BirdE)

In the first model, the residuals formed distinct clusters and followed a curved, “wavy” trend line, indicating that the model was systematically biased because it ignored species. In the updated model, those clusters have merged into a more random scatter, and the trend line lies close to zero. This change indicates that adding species removed most of the bias, giving a model that is better specified for prediction.
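The visual comparison of the two sets of diagnostic plots can be complemented by a formal comparison of the nested models. The sketch below is an illustrative addition rather than part of the original analysis: it refits both models and uses anova() for an F-test and AIC() for an information-criterion comparison. It assumes R 4.5 or later for the built-in `penguins` data, and `comparison` is a hypothetical name.

```r
# Assumption: R >= 4.5, where `penguins` is available in the datasets package.
# Rebuild the cleaned data and both models so the snippet runs on its own.
Clean_Bird  <- penguins[complete.cases(penguins), ]
Model_Bird  <- lm(body_mass ~ flipper_len, data = Clean_Bird)
Model_BirdE <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)

# F-test for nested models: does adding species reduce the residual
# sum of squares by more than chance alone would explain?
comparison <- anova(Model_Bird, Model_BirdE)
print(comparison)

# AIC for both models; the lower value indicates the preferred model.
AIC(Model_Bird, Model_BirdE)
```

A small p-value in the second row of the anova table would agree with the diagnostic plots that species carries information the simpler model misses.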

Summary

Having checked the diagnostic plots, we can now quantify the relationships. We examine the model summary to see how each coefficient relates to penguin body mass.

summary(Model_BirdE)
## 
## Call:
## lm(formula = body_mass ~ flipper_len + species, data = Clean_Bird)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -898.8 -252.0  -24.8  229.8 1191.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4013.18     586.25  -6.846 3.74e-11 ***
## flipper_len         40.61       3.08  13.186  < 2e-16 ***
## speciesChinstrap  -205.38      57.57  -3.568 0.000414 ***
## speciesGentoo      284.52      95.43   2.981 0.003083 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 373.3 on 329 degrees of freedom
## Multiple R-squared:  0.787,  Adjusted R-squared:  0.7851 
## F-statistic: 405.3 on 3 and 329 DF,  p-value: < 2.2e-16

The model summary defines the linear relationship between flipper length, species, and body mass. The intercept of -4013.18 is the predicted mass of an Adelie penguin with a flipper length of 0 mm; it has no biological meaning and serves only as the mathematical baseline of the equation. The effect of flipper length is positive and significant: among birds of the same species, each additional millimeter of flipper length is associated with about 40.61 g more body mass. Species also matters. At the same flipper length, Chinstrap penguins are predicted to weigh about 205.38 g less and Gentoo penguins about 284.52 g more than Adelie penguins, the reference group (the first factor level, which has no coefficient of its own). The residual standard error is 373.3 g, which is the typical size of a prediction error for an individual bird. The adjusted R-squared is approximately 0.79, meaning that about 79% of the variation in body mass is explained by the model. Together with the small p-values for every coefficient, this supports flipper length and species as useful predictors of body mass.
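To make the coefficients concrete, the fitted equation can be applied to a few example birds. The sketch below does this twice: once by hand with the rounded coefficients from the summary, and once with predict(), which also returns a 95% prediction interval for an individual bird. The three birds in `new_birds` are hypothetical values chosen for illustration, `by_hand` and `predicted` are hypothetical names, and the code assumes R 4.5 or later for the built-in `penguins` data.

```r
# Assumption: R >= 4.5, where `penguins` is available in the datasets package.
Clean_Bird  <- penguins[complete.cases(penguins), ]
Model_BirdE <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)

# Hypothetical birds, invented for illustration only.
new_birds <- data.frame(
  flipper_len = c(190, 190, 220),
  species = factor(c("Adelie", "Chinstrap", "Gentoo"),
                   levels = levels(Clean_Bird$species))
)

# Hand calculation with the rounded coefficients from the summary:
# mass = -4013.18 + 40.61 * flipper_len + species offset (Adelie offset is 0).
by_hand <- -4013.18 + 40.61 * new_birds$flipper_len + c(0, -205.38, 284.52)
# by_hand works out to 3702.72, 3497.34 and 5205.54 g.

# The same predictions from the fitted model, with 95% prediction intervals.
predicted <- predict(Model_BirdE, newdata = new_birds, interval = "prediction")

cbind(new_birds, by_hand, round(predicted))
```

The `lwr` and `upr` columns show how wide the plausible range is for a single bird, which is the relevant uncertainty when the model is used in place of weighing.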

Conclusion

Our predictive model yields largely positive results, suggesting that the linear framework is a valid starting point for analysis. The diagnostic plots indicate that the core assumptions of linearity and normality are approximately met. However, the model is not yet ideal: a closer look at the residuals suggests that the variance is not uniform across groups. Specifically, the spread of the residuals appears to differ by species, implying that some species have more variable body masses than others. For now, this level of precision is sufficient to support our main hypothesis, and future iterations could explore more flexible models to account for these differences. Despite these limitations, the model explains about 79% of the variation in body mass, so field ecologists could estimate the body mass of a penguin from its flipper length and species, with a typical error of about 373 g, and reduce the need for invasive weighing procedures.
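The observation that the residual spread differs by species can be checked numerically. The sketch below is an illustrative addition: it computes the standard deviation of the residuals within each species and the ratio of the largest to the smallest. It assumes R 4.5 or later for the built-in `penguins` data, and `resid_sd` is a hypothetical name.

```r
# Assumption: R >= 4.5, where `penguins` is available in the datasets package.
Clean_Bird  <- penguins[complete.cases(penguins), ]
Model_BirdE <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)

# Standard deviation of the model residuals within each species.
resid_sd <- tapply(residuals(Model_BirdE), Clean_Bird$species, sd)
round(resid_sd)

# Ratio of the largest to the smallest spread; a value close to 1 is
# consistent with constant variance across species.
max(resid_sd) / min(resid_sd)
```

A ratio well above 1 would support the concern raised above and would be a reason to consider models that allow the variance to differ between species.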