Palmer Penguins Analysis

🐧 Introduction

This presentation walks through a complete analysis of the Palmer Penguins dataset using R and Quarto. It covers:

Data dictionary 📖
Exploratory Data Analysis (EDA) 📊
Hypothesis testing 🔍
Linear and logistic regression modeling 📈
Model evaluation and testing 🎯

📖 Data Dictionary

The palmerpenguins dataset includes:

Variable	Description
species	Penguin species (Adelie, Chinstrap, Gentoo)
island	Island name (Biscoe, Dream, Torgersen)
bill_length_mm	Bill length (mm)
bill_depth_mm	Bill depth (mm)
flipper_length_mm	Flipper length (mm)
body_mass_g	Body mass (g)
sex	Penguin sex (female, male)
year	Study year (2007–2009)

📦 Load Packages & Data

library(tidyverse)
library(palmerpenguins)
library(janitor)
library(tidymodels)

penguins <- penguins %>% clean_names() %>% drop_na()
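As a quick check on what drop_na() removes, the sketch below counts missing values per column and compares row counts before and after keeping complete rows. It uses base R's complete.cases() as a stand-in for drop_na(); the names raw, na_counts, n_before and n_after are illustrative and not part of the original analysis.

```r
library(palmerpenguins)

# Work on the untouched dataset shipped with the package
raw <- palmerpenguins::penguins

# Missing values per column
na_counts <- colSums(is.na(raw))
print(na_counts)

# Rows before and after keeping only complete cases
n_before <- nrow(raw)
n_after  <- sum(complete.cases(raw))
cat("Rows before:", n_before, "| rows after:", n_after, "\n")
```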

📊 EDA - Distribution of Body Mass

penguins %>% 
  ggplot(aes(x = body_mass_g, fill = species)) +
  geom_histogram(bins = 30, alpha = 0.7) +
  labs(title = "Distribution of Body Mass by Species")

📊 EDA - Flipper vs Body Mass

penguins %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Flipper Length vs Body Mass")
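To put a number on the trend in the scatter plot, one option is the Pearson correlation, overall and within each species. This is a minimal base-R sketch; peng, r_overall and r_by_species are illustrative names, and use = "complete.obs" simply skips rows with missing measurements.

```r
library(palmerpenguins)

peng <- palmerpenguins::penguins

# Overall correlation between flipper length and body mass
r_overall <- cor(peng$flipper_length_mm, peng$body_mass_g,
                 use = "complete.obs")

# Same correlation computed separately for each species
r_by_species <- sapply(split(peng, peng$species), function(d) {
  cor(d$flipper_length_mm, d$body_mass_g, use = "complete.obs")
})

print(r_overall)
print(r_by_species)
```

The within-species values are typically lower than the overall one, because part of the overall trend comes from differences between species.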

🔍 Hypothesis Test - Body Mass by Sex

t.test(body_mass_g ~ sex, data = penguins)


    Welch Two Sample t-test

data:  body_mass_g by sex
t = -8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -840.5783 -526.2453
sample estimates:
mean in group female   mean in group male 
            3862.273             4545.685

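We test whether average body mass differs between male and female penguins. The Welch t-test rejects the null hypothesis of equal means (p < 0.001): males are about 683 g heavier on average, with a 95% confidence interval for the difference of roughly 526 to 841 g.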
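For readers who want to see what t.test() computes, here is a sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom worked out by hand, then compared with the built-in result. The variable names (f, m, se_diff, t_manual, df_manual) are illustrative, and complete.cases() stands in for the drop_na() step used earlier.

```r
library(palmerpenguins)

peng <- palmerpenguins::penguins
peng <- peng[complete.cases(peng), ]

# Body mass split by sex
f <- peng$body_mass_g[peng$sex == "female"]
m <- peng$body_mass_g[peng$sex == "male"]

# Per-group variance of the mean
vf <- var(f) / length(f)
vm <- var(m) / length(m)

# Welch t statistic: difference in means over its standard error
se_diff  <- sqrt(vf + vm)
t_manual <- (mean(f) - mean(m)) / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (vf + vm)^2 /
  (vf^2 / (length(f) - 1) + vm^2 / (length(m) - 1))

# Built-in test for comparison
t_builtin <- t.test(body_mass_g ~ sex, data = peng)
cat("manual t:", t_manual, "| manual df:", df_manual, "\n")
```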

📈 Linear Regression - Predict Body Mass

set.seed(123)
split <- initial_split(penguins, prop = 0.8)
train <- training(split)
test <- testing(split)

lm_model <- linear_reg() %>% 
  set_engine("lm") %>% 
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm + species, data = train)

summary(lm_model$fit)


Call:
stats::lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm + 
    species, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-797.77 -228.95  -41.52  204.95 1051.60 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3716.670    579.993  -6.408 6.84e-10 ***
flipper_length_mm    25.972      3.458   7.511 9.32e-13 ***
bill_length_mm       63.726      8.040   7.926 6.60e-14 ***
speciesChinstrap   -760.859     92.314  -8.242 8.29e-15 ***
speciesGentoo       125.330     99.393   1.261    0.208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 339.4 on 261 degrees of freedom
Multiple R-squared:  0.8275,    Adjusted R-squared:  0.8249 
F-statistic: 313.1 on 4 and 261 DF,  p-value: < 2.2e-16
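The coefficient table can be read as a prediction formula. The sketch below plugs the estimates printed above into that formula for a hypothetical penguin (190 mm flipper, 40 mm bill); the helper predict_mass() and the example measurements are made up for illustration. Adelie is the reference level, so it gets no species adjustment.

```r
# Estimates copied from the model summary above
coefs <- c(
  intercept         = -3716.670,
  flipper_length_mm =    25.972,
  bill_length_mm    =    63.726,
  speciesChinstrap  =  -760.859,
  speciesGentoo     =   125.330
)

# Hypothetical helper: predicted body mass (g) from the fitted equation
predict_mass <- function(flipper_mm, bill_mm, species = "Adelie") {
  species_shift <- switch(
    species,
    Adelie    = 0,  # reference level
    Chinstrap = coefs[["speciesChinstrap"]],
    Gentoo    = coefs[["speciesGentoo"]]
  )
  coefs[["intercept"]] +
    coefs[["flipper_length_mm"]] * flipper_mm +
    coefs[["bill_length_mm"]] * bill_mm +
    species_shift
}

adelie_pred    <- predict_mass(190, 40, "Adelie")     # about 3767 g
chinstrap_pred <- predict_mass(190, 40, "Chinstrap")  # about 3006 g
cat(adelie_pred, chinstrap_pred, "\n")
```

With the same measurements, the Chinstrap prediction is about 761 g lower than the Adelie one, which is exactly the speciesChinstrap coefficient.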

🎯 Evaluate Linear Model

preds <- predict(lm_model, new_data = test) %>% 
  bind_cols(test)

lm_metrics <- preds %>% 
  metrics(truth = body_mass_g, estimate = .pred)
lm_metrics

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     342.   
2 rsq     standard       0.811
3 mae     standard     279.
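To make the three metrics concrete, here is a small base-R sketch that computes them by hand on made-up numbers. The vectors truth and pred are hypothetical, not taken from the test set. R-squared is computed as the squared correlation between truth and prediction, which is how yardstick's rsq is defined.

```r
# Hypothetical body masses (g) and predictions, for illustration only
truth <- c(3500, 4000, 4500, 5000)
pred  <- c(3600, 3900, 4700, 4900)

errors <- truth - pred

rmse_val <- sqrt(mean(errors^2))  # penalizes large errors more heavily
mae_val  <- mean(abs(errors))     # average absolute error, in grams
rsq_val  <- cor(truth, pred)^2    # squared correlation

cat("rmse:", rmse_val, "| mae:", mae_val, "| rsq:", rsq_val, "\n")
```

RMSE and MAE are in the units of the outcome, so the values in the table above are errors in grams.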

🤖 Logistic Regression - Predict Sex

log_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  fit(sex ~ bill_length_mm + flipper_length_mm + body_mass_g, data = train)

tidy(log_model)

# A tibble: 4 × 5
  term              estimate std.error statistic       p.value
  <chr>                <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)        6.46     2.95          2.19 0.0287       
2 bill_length_mm     0.134    0.0370        3.63 0.000280     
3 flipper_length_mm -0.117    0.0247       -4.74 0.00000210   
4 body_mass_g        0.00264  0.000438      6.04 0.00000000159
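The estimates above are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to read. This sketch copies the estimates from the table; it assumes the model describes the odds of the second factor level (male), which is how glm treats a two-level factor response. The 100 g step is an arbitrary choice for readability.

```r
# Estimates copied from the tidy() output above (log-odds scale)
estimates <- c(
  bill_length_mm    =  0.134,
  flipper_length_mm = -0.117,
  body_mass_g       =  0.00264
)

# Odds ratio per one-unit increase in each predictor
odds_ratios <- exp(estimates)
print(round(odds_ratios, 3))

# Body mass is in grams, so a 100 g step is easier to interpret
or_per_100g <- exp(estimates[["body_mass_g"]] * 100)
cat("Odds ratio per 100 g of body mass:", round(or_per_100g, 3), "\n")
```

Each odds ratio holds the other predictors fixed, so the value below 1 for flipper length describes penguins of the same body mass and bill length.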

🎯 Evaluate Logistic Model

log_preds <- predict(log_model, new_data = test, type = "prob") %>% 
  bind_cols(predict(log_model, test), test)

log_preds <- log_preds %>% 
  mutate(truth = test$sex)

conf_mat(log_preds, truth = truth, estimate = .pred_class)

          Truth
Prediction female male
    female     19    9
    male        9   30
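The confusion matrix can be summarized with a few ratios. The sketch below rebuilds the matrix from the counts shown above and computes accuracy, sensitivity and specificity by hand, treating female as the event (the first factor level). The object names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = truth)
cm <- matrix(
  c(19,  9,
     9, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("female", "male"),
                  Truth      = c("female", "male"))
)

# Share of all test penguins classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true females predicted as female
sensitivity <- cm["female", "female"] / sum(cm[, "female"])

# Share of true males predicted as male
specificity <- cm["male", "male"] / sum(cm[, "male"])

cat("accuracy:", round(accuracy, 3),
    "| sensitivity:", round(sensitivity, 3),
    "| specificity:", round(specificity, 3), "\n")
```

That works out to 49 of 67 correct overall, 19 of 28 females and 30 of 39 males.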

📈 ROC Curve and AUC

roc <- roc_curve(log_preds, truth = truth, .pred_male)
autoplot(roc)

roc_auc(log_preds, truth = truth, .pred_male)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
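1 roc_auc binary         0.144

An AUC below 0.5 here reflects the event orientation rather than a poor model. yardstick treats the first factor level (female) as the event by default, while .pred_male is the probability of the second level. Passing event_level = "second" to roc_curve() and roc_auc(), or supplying .pred_female instead, fixes the orientation; the corresponding AUC is 1 - 0.144 ≈ 0.856.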
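The effect of the event orientation on AUC can be seen with a small base-R sketch. It computes AUC from ranks (the Mann-Whitney formulation) on six made-up observations. The helper auc_rank() and the toy scores are hypothetical, and the rank formula is a stand-in for yardstick's implementation, not a copy of it.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula.
# is_event: logical vector marking the class treated as the event
# score:    predicted score, where higher should mean "more likely event"
auc_rank <- function(is_event, score) {
  r     <- rank(score)
  n_pos <- sum(is_event)
  n_neg <- sum(!is_event)
  (sum(r[is_event]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data, for illustration only
truth  <- factor(c("female", "female", "female", "male", "male", "male"))
p_male <- c(0.10, 0.40, 0.35, 0.80, 0.30, 0.90)

# Score matches the event: male is the event, scored by P(male)
auc_matched <- auc_rank(truth == "male", p_male)

# Score does not match the event: female is the event, scored by P(male)
auc_mismatched <- auc_rank(truth == "female", p_male)

cat("matched:", auc_matched, "| mismatched:", auc_mismatched, "\n")
```

The two values always sum to 1 when there are no tied scores, so a mismatched event level flips a good AUC into a seemingly poor one.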

✅ Conclusion

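EDA showed a strong positive relationship between flipper length and body mass.
The Welch t-test found that males are about 683 g heavier than females on average (p < 0.001).
Linear regression predicted body mass with a test RMSE of about 342 g and an R² of 0.81.
Logistic regression classified sex correctly for 49 of 67 test penguins (about 73%).
The ROC curve and AUC summarized how well the logistic model separates the sexes across classification thresholds.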

💡 Try It Yourself!

Explore other predictors like bill depth or island.
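Try multinomial regression (multinom_reg() in tidymodels) to classify species, since species has three classes and logistic_reg() handles only two.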
Use cross-validation or tune hyperparameters!
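As a starting point for the cross-validation suggestion, here is a minimal tidymodels sketch that refits the linear model from earlier on 5 folds and averages the metrics. The choice of 5 folds, the seed, and the names penguins_clean, folds, lm_wf and cv_metrics are assumptions made for illustration.

```r
library(tidymodels)
library(palmerpenguins)

# Keep complete rows only, as in the data preparation step
penguins_clean <- tidyr::drop_na(palmerpenguins::penguins)

# Split the data into 5 folds (an arbitrary but common choice)
set.seed(123)
folds <- vfold_cv(penguins_clean, v = 5)

# Bundle the model specification and the formula into a workflow
lm_wf <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(body_mass_g ~ flipper_length_mm + bill_length_mm + species)

# Fit on each fold's analysis set, evaluate on its assessment set
cv_results <- fit_resamples(lm_wf, resamples = folds)

# Average RMSE and R-squared across folds
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```

Comparing the cross-validated RMSE with the single test-set RMSE gives a sense of how much the estimate depends on one particular split.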