Palmer Penguins Analysis

🐧 Introduction

This presentation walks through a full data analysis of the Palmer Penguins dataset using R and Quarto. It covers:

  • Data dictionary 📖
  • Exploratory Data Analysis (EDA) 📊
  • Hypothesis testing 🔍
  • Linear and logistic regression modeling 📈
  • Model evaluation and testing 🎯

📖 Data Dictionary

The palmerpenguins dataset includes:

Variable            Description
species             Penguin species (Adelie, Chinstrap, Gentoo)
island              Island name (Biscoe, Dream, Torgersen)
bill_length_mm      Bill length (mm)
bill_depth_mm       Bill depth (mm)
flipper_length_mm   Flipper length (mm)
body_mass_g         Body mass (g)
sex                 Male or Female
year                Study year (2007–2009)

📦 Load Packages & Data

library(tidyverse)
library(palmerpenguins)
library(janitor)
library(tidymodels)

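# tidymodels attaches modeldata, which also provides a `penguins` dataset;
# name the palmerpenguins version explicitly so the data dictionary above applies.
penguins <- palmerpenguins::penguins %>% clean_names() %>% drop_na()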
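Since drop_na() silently removes every row with a missing value, it can help to see how much data that costs. The sketch below counts missing values per column and compares row counts before and after; the names raw, n_before, and n_after are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Untouched copy of the data, taken directly from palmerpenguins
raw <- palmerpenguins::penguins

# Number of missing values in each column
raw %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Row counts before and after dropping incomplete rows
n_before <- nrow(raw)
n_after  <- nrow(drop_na(raw))
c(before = n_before, after = n_after, dropped = n_before - n_after)
```

The complete-case count matches the 266 training rows plus 67 test rows used in the models later in the deck.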

📊 EDA - Distribution of Body Mass

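penguins %>% 
  ggplot(aes(x = body_mass_g, fill = species)) +
  # position = "identity" overlays the species instead of stacking them,
  # so the alpha transparency shows where the distributions overlap
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  labs(title = "Distribution of Body Mass by Species",
       x = "Body mass (g)", y = "Count")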

📊 EDA - Flipper vs Body Mass

penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Flipper Length vs Body Mass")

🔍 Hypothesis Test - Body Mass by Sex

t.test(body_mass_g ~ sex, data = penguins)

    Welch Two Sample t-test

data:  body_mass_g by sex
t = -8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -840.5783 -526.2453
sample estimates:
mean in group female   mean in group male 
            3862.273             4545.685 

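A Welch two-sample t-test compares mean body mass between female and male penguins. Males are heavier by about 683 g on average (4,546 g vs. 3,862 g; 95% CI for the difference: 526 to 841 g), and the difference is statistically significant (p < 0.001).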
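A p-value alone does not say how large the difference is. As a complement, the sketch below computes the raw difference in means and Cohen's d from the same complete-case data. Cohen's d is a standardized effect size and is an addition here, not part of the original analysis; the names penguins_clean, by_sex, pooled_sd, and cohens_d are introduced for illustration.

```r
library(dplyr)
library(tidyr)

penguins_clean <- palmerpenguins::penguins %>% drop_na()

# Group size, mean, and standard deviation of body mass for each sex
# (rows come back in factor-level order: female, then male)
by_sex <- penguins_clean %>%
  group_by(sex) %>%
  summarise(n = n(), mean = mean(body_mass_g), sd = sd(body_mass_g))

# Raw difference in means: male minus female, in grams
mean_diff <- diff(by_sex$mean)

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(sum((by_sex$n - 1) * by_sex$sd^2) / (sum(by_sex$n) - 2))

# Cohen's d: the mean difference expressed in pooled-SD units
cohens_d <- mean_diff / pooled_sd

c(mean_diff = mean_diff, cohens_d = cohens_d)
```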

📈 Linear Regression - Predict Body Mass

set.seed(123)
split <- initial_split(penguins, prop = 0.8)
train <- training(split)
test <- testing(split)

lm_model <- linear_reg() %>% 
  set_engine("lm") %>% 
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm + species, data = train)

summary(extract_fit_engine(lm_model))

Call:
stats::lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm + 
    species, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-797.77 -228.95  -41.52  204.95 1051.60 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3716.670    579.993  -6.408 6.84e-10 ***
flipper_length_mm    25.972      3.458   7.511 9.32e-13 ***
bill_length_mm       63.726      8.040   7.926 6.60e-14 ***
speciesChinstrap   -760.859     92.314  -8.242 8.29e-15 ***
speciesGentoo       125.330     99.393   1.261    0.208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 339.4 on 261 degrees of freedom
Multiple R-squared:  0.8275,    Adjusted R-squared:  0.8249 
F-statistic: 313.1 on 4 and 261 DF,  p-value: < 2.2e-16
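To make the coefficient table concrete, the sketch below plugs a hypothetical penguin into the fitted equation using the rounded estimates printed above. The measurements (a 190 mm flipper and a 40 mm bill) are illustrative assumptions, not rows from the data. Adelie is the reference species, so its dummy terms are zero.

```r
# Rounded coefficients copied from the summary output above
coefs <- c(
  intercept         = -3716.670,
  flipper_length_mm =    25.972,
  bill_length_mm    =    63.726,
  speciesChinstrap  =  -760.859,
  speciesGentoo     =   125.330
)

# Hypothetical penguin (illustrative values, not taken from the dataset)
flipper <- 190  # mm
bill    <- 40   # mm

# Adelie is the reference level, so both species dummies are 0
pred_adelie <- coefs[["intercept"]] +
  coefs[["flipper_length_mm"]] * flipper +
  coefs[["bill_length_mm"]] * bill

# A Chinstrap with the same measurements shifts by the Chinstrap coefficient
pred_chinstrap <- pred_adelie + coefs[["speciesChinstrap"]]

round(c(adelie = pred_adelie, chinstrap = pred_chinstrap), 1)
```

With these inputs the equation gives roughly 3,767 g for an Adelie and about 761 g less for a Chinstrap with identical measurements.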

🎯 Evaluate Linear Model

preds <- predict(lm_model, new_data = test) %>% 
  bind_cols(test)

# Store the result under a new name so it does not mask yardstick's metrics()
lm_metrics <- preds %>% 
  metrics(truth = body_mass_g, estimate = .pred)
lm_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     342.   
2 rsq     standard       0.811
3 mae     standard     279.   
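The three metrics reported by metrics() can be reproduced by hand. The sketch below uses a tiny made-up vector of true and predicted masses (assumed values, for illustration only) to show the arithmetic: RMSE is the square root of the mean squared error, MAE is the mean absolute error, and yardstick's rsq is the squared correlation between truth and prediction.

```r
# Toy values in grams (made up for illustration, not from the test set)
truth <- c(3500, 4000, 4500)
pred  <- c(3400, 4100, 4800)

err <- pred - truth

rmse <- sqrt(mean(err^2))   # squaring penalizes large errors more heavily
mae  <- mean(abs(err))      # average size of an error, in grams
rsq  <- cor(truth, pred)^2  # squared correlation between truth and prediction

c(rmse = rmse, mae = mae, rsq = rsq)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, which is consistent with the 342 g vs. 279 g reported above.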

🤖 Logistic Regression - Predict Sex

log_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  fit(sex ~ bill_length_mm + flipper_length_mm + body_mass_g, data = train)

tidy(log_model)
# A tibble: 4 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        6.46     2.95          2.19 0.0287       
2 bill_length_mm     0.134    0.0370        3.63 0.000280     
3 flipper_length_mm -0.117    0.0247       -4.74 0.00000210   
4 body_mass_g        0.00264  0.000438      6.04 0.00000000159
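Logistic regression coefficients are on the log-odds scale, which is hard to read directly. A short sketch, using the rounded estimates from the table above, converts them to odds ratios. With the glm engine the model describes the second factor level of sex (male), so ratios above 1 mean higher odds of a penguin being male. Rescaling body mass to a 100 g step is a choice made here for readability.

```r
# Rounded estimates copied from the tidy() output above (log-odds scale)
log_odds <- c(
  bill_length_mm    =  0.134,
  flipper_length_mm = -0.117,
  body_mass_g       =  0.00264
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(log_odds)

# A 1 g change is tiny, so also express body mass per 100 g
or_mass_100g <- exp(100 * log_odds[["body_mass_g"]])

round(odds_ratios, 3)
round(or_mass_100g, 3)
```

Holding the other predictors fixed, each extra 100 g of body mass multiplies the odds of male by roughly 1.3.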

🎯 Evaluate Logistic Model

log_preds <- predict(log_model, new_data = test, type = "prob") %>% 
  bind_cols(predict(log_model, test), test)

log_preds <- log_preds %>% 
  mutate(truth = test$sex)

conf_mat(log_preds, truth = truth, estimate = .pred_class)
          Truth
Prediction female male
    female     19    9
    male        9   30
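The confusion matrix can be reduced to a few summary rates. The sketch below rebuilds the matrix from the counts shown above and computes accuracy, sensitivity, and specificity by hand, treating female (the first factor level) as the event, which is yardstick's default.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = truth)
cm <- matrix(
  c(19,  9,   # predicted female: truth female, truth male
     9, 30),  # predicted male:   truth female, truth male
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("female", "male"),
                  Truth      = c("female", "male"))
)

# Share of all test penguins classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true females predicted as female
sensitivity <- cm["female", "female"] / sum(cm[, "female"])

# Share of true males predicted as male
specificity <- cm["male", "male"] / sum(cm[, "male"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

That is 49 correct out of 67 test penguins, or about 73% accuracy.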

📈 ROC Curve and AUC

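# yardstick treats the first factor level ("female") as the event by default.
# .pred_male is the probability of the second level, so set event_level = "second".
roc <- roc_curve(log_preds, truth = truth, .pred_male, event_level = "second")
autoplot(roc)
roc_auc(log_preds, truth = truth, .pred_male, event_level = "second")
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.856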
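AUC can be read as the probability that a randomly chosen event case receives a higher score than a randomly chosen non-event case, so the value depends on which class is treated as the event. The base-R sketch below uses made-up labels and scores (assumptions for illustration) and a hypothetical helper, auc_rank(), to show that scoring the wrong class gives exactly 1 minus the intended AUC. In yardstick, the event class is controlled by the event_level argument, which defaults to the first factor level.

```r
# Hypothetical helper: AUC as the share of (event, non-event) pairs
# in which the event case has the higher score; ties count as half
auc_rank <- function(score, is_event) {
  pos <- score[is_event]
  neg <- score[!is_event]
  wins <- outer(pos, neg, FUN = function(p, n) (p > n) + 0.5 * (p == n))
  mean(wins)
}

# Toy labels and predicted probabilities of "male" (made up for illustration)
truth  <- factor(c("female", "female", "male", "male", "male"))
p_male <- c(0.2, 0.6, 0.4, 0.8, 0.9)

# Same scores, two different choices of event class
auc_male_event   <- auc_rank(p_male, truth == "male")
auc_female_event <- auc_rank(p_male, truth == "female")

c(male_as_event = auc_male_event, female_as_event = auc_female_event)
```

An AUC well below 0.5 is therefore usually a sign that the probability column and the event level are mismatched, not that the model is worse than chance.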

✅ Conclusion

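  • EDA showed a strong positive relationship between flipper length and body mass.
  • A Welch t-test found that males are heavier than females by about 683 g on average (p < 0.001).
  • Linear regression predicted body mass fairly well on the test set (R² = 0.81, RMSE ≈ 342 g).
  • Logistic regression classified sex correctly for 49 of 67 test penguins (about 73% accuracy).
  • The ROC curve and AUC summarize how well the logistic model separates the sexes across all probability thresholds.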

💡 Try It Yourself!

  • Explore other predictors like bill depth or island.
  • Try multinomial logistic regression (multinom_reg() in parsnip) to classify the three species.
  • Use cross-validation or tune hyperparameters!
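As a starting point for the cross-validation suggestion, here is a minimal sketch that re-estimates the linear model's error with 5-fold cross-validation instead of a single train/test split. The fold count, the seed, and the names penguins_clean, folds, cv_results, and cv_metrics are assumptions chosen for illustration. The formula is the one used for the linear model earlier in the deck.

```r
library(tidymodels)

# Complete cases only, taken explicitly from palmerpenguins
penguins_clean <- palmerpenguins::penguins %>% drop_na()

# Split the data into 5 folds; each fold is held out once for assessment
set.seed(123)
folds <- vfold_cv(penguins_clean, v = 5)

# Fit the same linear model on each analysis set and score it on the held-out fold
cv_results <- fit_resamples(
  linear_reg() %>% set_engine("lm"),
  body_mass_g ~ flipper_length_mm + bill_length_mm + species,
  resamples = folds
)

# Average RMSE and R-squared across the folds, with standard errors
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```

Averaging over folds gives a steadier estimate than one 67-row test set, at the cost of fitting the model several times.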