The Palmer Archipelago in Antarctica is home to three distinct species of penguins: Adelie, Chinstrap, and Gentoo. In ecological studies, body mass is a critical indicator of an animal’s overall health and reproductive success. However, capturing and weighing penguins can be stressful for the birds.
The purpose of this research is to develop a reliable statistical model to predict body mass (g) from a non-invasive physical measurement, specifically flipper length (mm). If the model predicts this relationship accurately, scientists would be able to monitor penguins’ health without invasive weighing.
The “penguins” dataset is used for this purpose with the following variables:
species (the biological species: Adelie, Chinstrap, or Gentoo)
island (the specific location where the penguin was found)
bill_len (the horizontal length of the beak from face to tip, in millimeters)
bill_dep (the vertical thickness or height of the beak, in millimeters)
flipper_len (the length of the flipper in millimeters)
body_mass (the weight of the bird in grams)
sex (the sex of the bird: male or female)
year (the year the study took place: 2007, 2008, or 2009)
The dataset has 8 variables; however, only a few of them are used in this research: flipper length and body mass, with species added later to refine the model.
We begin by loading the necessary libraries and the penguins dataset.
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
## $ bill_len <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
## $ bill_dep <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
## $ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
## $ body_mass <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
## $ sex <fct> male, female, female, NA, female, male, female, male, NA, …
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
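The chunk that produced this overview is not shown in the rendered output. A minimal sketch of what it might look like, assuming R 4.5 or later (where `penguins` ships in the built-in datasets package with these column names) and that `glimpse` comes from dplyr:

```r
# Assumption: R >= 4.5, where `penguins` is part of the built-in datasets package.
library(dplyr)  # provides glimpse() and the %>% pipe

# Compact overview: one line per column with its type and first few values.
glimpse(penguins)
```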
The glimpse command revealed some missing values
(NA) in the dataset. To prevent errors in our analysis, we
use colSums to identify exactly which columns contain these gaps and
then use drop_na to remove the incomplete rows.
## species island bill_len bill_dep flipper_len body_mass
## 0 0 2 2 2 2
## sex year
## 11 0
## Rows: 333
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
## $ bill_len <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6, 34.6…
## $ bill_dep <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2, 21.1…
## $ flipper_len <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 185, 195…
## $ body_mass <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800, 4400…
## $ sex <fct> male, female, female, female, male, female, male, female, …
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Now there are no issues with the missing values.
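The cleaning step described above can be sketched as follows. This is a reconstruction under assumptions rather than the original chunk: it assumes the built-in `penguins` data (R 4.5 or later) and tidyr for `drop_na`; the name `Clean_Bird` is taken from the plotting code in this report.

```r
library(tidyr)  # provides drop_na()

# Count missing values per column (assumes the built-in `penguins` data, R >= 4.5).
colSums(is.na(penguins))

# Keep only the rows that have no missing value in any column.
Clean_Bird <- drop_na(penguins)
nrow(Clean_Bird)
```

Called without column names, `drop_na()` removes every row that has a missing value anywhere, so birds lacking only `sex` are dropped too. Passing columns, for example `drop_na(penguins, flipper_len, body_mass)`, would keep those rows.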
Now that we have a clean dataset, we can visualize the data distribution. We will use ggplot to generate a scatterplot of body mass against flipper length, overlaid with a fitted linear trend line.
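# Scatterplot of body mass against flipper length with a fitted linear trend
Clean_Bird %>%
  ggplot(aes(x = flipper_len, y = body_mass)) +
  geom_point(alpha = 0.5) +                    # semi-transparent points reduce overplotting
  geom_smooth(method = "lm", color = "red") +  # least-squares line with its 95% confidence band
  labs(title = "Flipper Length vs. Body Mass",
       subtitle = "A strong positive correlation is visible",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)") +
  theme_minimal()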
As we examine the scatterplot above, we can identify several features
that support our approach. First, notice the clear positive trend: as
we move to the right along the X-axis (increasing flipper length), the
body mass on the Y-axis rises, which is consistent with our hypothesis.
Furthermore, the data points show no obvious curvature; they follow the
red regression line, which suggests that a linear model is a reasonable
starting point for this data. Finally, observe the gray shaded region
surrounding that red line, which represents the 95% confidence interval
for the fitted mean. Because this band is very narrow, it indicates that
the line itself is estimated precisely. Note that the band describes
uncertainty in the average body mass at each flipper length, not the
spread of individual penguins, so it cannot by itself rule out outliers;
those are checked in the diagnostic plots.
Having visually confirmed the strong linear relationship between flipper length and body mass in our Exploratory Data Analysis, we now move to Statistical Modeling. While the scatterplot suggests a correlation, it does not quantify it. To do so, we will construct a Simple Linear Regression model. This method allows us to calculate the precise mathematical equation of the line we observed, transforming our visual intuition into a predictive tool.
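A minimal sketch of this step, assuming the built-in `penguins` data (R 4.5 or later) cleaned with `drop_na` as before. The object name `simple_model` is hypothetical; the report does not show the name it used.

```r
library(tidyr)

Clean_Bird <- drop_na(penguins)  # assumes built-in `penguins` (R >= 4.5)

# Quantify the strength of the linear association (Pearson correlation).
r <- cor(Clean_Bird$flipper_len, Clean_Bird$body_mass)
r

# Simple linear regression of body mass on flipper length.
# `simple_model` is a hypothetical name chosen for this sketch.
simple_model <- lm(body_mass ~ flipper_len, data = Clean_Bird)

# Intercept and slope of the fitted line: body_mass = intercept + slope * flipper_len
coef(simple_model)
```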
To validate our model, we check the diagnostic plots. The normal Q-Q
plot looks good, with points hugging the straight line, which supports
the assumption of normally distributed residuals. Additionally, the
residuals vs. leverage plot shows no high-influence outliers skewing
the results. However, the residuals vs. fitted plot reveals a critical
flaw: the points form distinct clusters rather than a random scatter.
This suggests our model is too simple and is ignoring biological group
differences. To fix this, we update our model to include species.
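The diagnostic plots discussed here can be produced with base R's `plot` method for `lm` objects. The sketch below assumes the same cleaned data and the hypothetical `simple_model` name; the per-species residual means are an extra check added for illustration, not part of the original analysis.

```r
library(tidyr)

Clean_Bird <- drop_na(penguins)  # assumes built-in `penguins` (R >= 4.5)
simple_model <- lm(body_mass ~ flipper_len, data = Clean_Bird)

# Default 2x2 panel: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage.
old_par <- par(mfrow = c(2, 2))
plot(simple_model)
par(old_par)

# Mean residual per species. Values far from zero indicate that the
# flipper-only model systematically over- or under-predicts a species.
species_resid_means <- tapply(resid(simple_model), Clean_Bird$species, mean)
species_resid_means
```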
In the first model, the residuals formed distinct clusters and followed
a curved, “wavy” trend line, indicating that the model was biased and
ignoring biological groups. In the final model, however, those clusters
have merged into a random scatter, and the trend line has flattened out
along zero. This transition suggests that adding species removed the
bias, resulting in a model that is better specified and more suitable
for prediction.
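The comparison above can be reproduced by drawing the residuals vs. fitted plot for both models side by side. This is a sketch under the same assumptions as before; `simple_model` and `final_model` are hypothetical names, while the second formula matches the call shown in the model summary.

```r
library(tidyr)

Clean_Bird <- drop_na(penguins)  # assumes built-in `penguins` (R >= 4.5)

# Hypothetical object names for the two models being compared.
simple_model <- lm(body_mass ~ flipper_len, data = Clean_Bird)
final_model  <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)

# `which = 1` selects only the residuals vs. fitted plot.
old_par <- par(mfrow = c(1, 2))
plot(simple_model, which = 1, main = "Flipper length only")
plot(final_model,  which = 1, main = "Flipper length + species")
par(old_par)

# Residual standard error of each model, in grams.
c(simple = sigma(simple_model), final = sigma(final_model))
```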
Having checked that the residuals of the updated model show no obvious remaining pattern, we can now quantify the specific relationships. To do this, we examine the model summary to see the estimated effect of each coefficient on penguin body mass.
##
## Call:
## lm(formula = body_mass ~ flipper_len + species, data = Clean_Bird)
##
## Residuals:
## Min 1Q Median 3Q Max
## -898.8 -252.0 -24.8 229.8 1191.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4013.18 586.25 -6.846 3.74e-11 ***
## flipper_len 40.61 3.08 13.186 < 2e-16 ***
## speciesChinstrap -205.38 57.57 -3.568 0.000414 ***
## speciesGentoo 284.52 95.43 2.981 0.003083 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 373.3 on 329 degrees of freedom
## Multiple R-squared: 0.787, Adjusted R-squared: 0.7851
## F-statistic: 405.3 on 3 and 329 DF, p-value: < 2.2e-16
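For reference, a summary like the one above can be obtained as sketched below, assuming the built-in `penguins` data (R 4.5 or later) and a hypothetical object name `final_model`. The individual quantities can also be pulled out of the summary object instead of being read off the printout.

```r
library(tidyr)

Clean_Bird <- drop_na(penguins)  # assumes built-in `penguins` (R >= 4.5)

# Same formula as in the printed call; `final_model` is a hypothetical name.
final_model <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)
model_summary <- summary(final_model)
model_summary

# Programmatic access to the pieces discussed in the text.
coef(model_summary)          # estimates, standard errors, t values, p-values
model_summary$sigma          # residual standard error (g)
model_summary$adj.r.squared  # adjusted R-squared
```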
Based on the model summary, we can fully define the fitted linear relationship between flipper length, species, and body mass. The intercept of -4013.18 is the mathematical baseline of the equation: the predicted mass of an Adelie penguin with a flipper length of 0 mm, which has no biological meaning on its own. The effect of flipper length is positive and significant: among penguins of the same species, each additional millimeter of flipper length is associated with approximately 40.61 g more body mass. Species also acts as a strong grouping factor. At the same flipper length, Chinstrap penguins are predicted to weigh significantly less (-205.38 g) and Gentoo penguins significantly more (+284.52 g) than the reference group, Adelie, pointing to distinct biological builds. The Residual Standard Error is 373.3 g, which is the typical size of a prediction error. Despite this variance, the Adjusted R-squared is approximately 0.79, indicating that about 79% of the variation in body mass is explained by the model. Together with the small p-values for every coefficient, this suggests that flipper length and species are robust predictors of penguin body mass.
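To make the fitted equation concrete, consider an Adelie penguin with a 200 mm flipper: using the rounded coefficients, the predicted mass is -4013.18 + 40.61 × 200 = 4108.82 g. The sketch below does the same with `predict`, under the same data assumptions as before. The three example birds in `new_birds` are hypothetical inputs chosen for illustration, not observations from the dataset.

```r
library(tidyr)

Clean_Bird <- drop_na(penguins)  # assumes built-in `penguins` (R >= 4.5)
final_model <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)

# Hypothetical example birds: one per species, all with 200 mm flippers.
new_birds <- data.frame(
  flipper_len = c(200, 200, 200),
  species = factor(c("Adelie", "Chinstrap", "Gentoo"),
                   levels = levels(Clean_Bird$species))
)

# Predicted body mass in grams for each example bird.
predicted_mass <- predict(final_model, newdata = new_birds)
predicted_mass

# The Adelie prediction by hand: intercept + slope * flipper length.
b <- coef(final_model)
manual_adelie <- b[["(Intercept)"]] + b[["flipper_len"]] * 200
manual_adelie
```

Because all three birds share the same flipper length, the differences between their predictions are exactly the species coefficients.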
Our predictive model yields largely positive results, suggesting that the current linear framework is a valid starting point for analysis. The diagnostic plots indicate that the core assumptions of linearity and normality are generally met, so the predictions rest on a reasonable statistical footing. However, while the model is acceptable for this stage of research, it is not yet ideal: a closer inspection of the residuals suggests that the variance is not perfectly uniform across all groups. Specifically, the spread of the data points appears to differ by species, implying that some species may inherently have more variable body masses than others. For now, this level of precision is sufficient to support our main hypothesis, but future iterations of this model could be improved by exploring more complex techniques that account for this unequal variance. Despite these limitations, the model's predictive power implies that field ecologists could estimate a penguin's body mass from its flipper length and species, with a typical error of roughly 370 g, which could reduce the need for invasive weighing procedures.
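One simple way to look at the unequal spread mentioned above is to compare the residual standard deviation within each species. The sketch below is an illustrative check under the same assumptions as the earlier examples (built-in `penguins` data, hypothetical `final_model` name); it is not part of the original analysis.

```r
library(tidyr)

Clean_Bird <- drop_na(penguins)  # assumes built-in `penguins` (R >= 4.5)
final_model <- lm(body_mass ~ flipper_len + species, data = Clean_Bird)

# Standard deviation of the residuals within each species (g).
# Clearly different values point to non-constant variance across groups.
resid_sd_by_species <- tapply(resid(final_model), Clean_Bird$species, sd)
resid_sd_by_species

# Visual version of the same comparison.
boxplot(resid(final_model) ~ Clean_Bird$species,
        xlab = "Species", ylab = "Residual (g)")
```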