Palmer Penguins

The data was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. This data contain information about different species of penguins, culmen length, culmen depth, flipper length, body mass, the island they hail from, and the sex of the penguin.

Artwork by @allison_horst

Explore The Data

head(p)
## # A tibble: 6 × 17
##   studyName `Sample Number` Species          Region Island Stage `Individual ID`
##   <chr>               <dbl> <chr>            <chr>  <chr>  <chr> <chr>          
## 1 PAL0708                 1 Adelie Penguin … Anvers Torge… Adul… N1A1           
## 2 PAL0708                 2 Adelie Penguin … Anvers Torge… Adul… N1A2           
## 3 PAL0708                 3 Adelie Penguin … Anvers Torge… Adul… N2A1           
## 4 PAL0708                 4 Adelie Penguin … Anvers Torge… Adul… N2A2           
## 5 PAL0708                 5 Adelie Penguin … Anvers Torge… Adul… N3A1           
## 6 PAL0708                 6 Adelie Penguin … Anvers Torge… Adul… N3A2           
## # ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
## #   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## #   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## #   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
names(p)
##  [1] "studyName"           "Sample Number"       "Species"            
##  [4] "Region"              "Island"              "Stage"              
##  [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
## [10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
## [13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
## [16] "Delta 13 C (o/oo)"   "Comments"

Data Cleaning

p = p[,c(3,5,10:16)]  #Remove columns that are not needed.    
p = p[-4,] #Drop row 4, it is all NAs

names(p)[3] = "CLength"  #Rename columns to be easier to use.
names(p)[4] = "CDepth"
names(p)[5] = "FLength"
names(p)[6] = "BMass"
names(p)[8] = "D15"
names(p)[9] = "D13"

head(p)
## # A tibble: 6 × 9
##   Species                  Island CLength CDepth FLength BMass Sex     D15   D13
##   <chr>                    <chr>    <dbl>  <dbl>   <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Adelie Penguin (Pygosce… Torge…    39.1   18.7     181  3750 MALE  NA     NA  
## 2 Adelie Penguin (Pygosce… Torge…    39.5   17.4     186  3800 FEMA…  8.95 -24.7
## 3 Adelie Penguin (Pygosce… Torge…    40.3   18       195  3250 FEMA…  8.37 -25.3
## 4 Adelie Penguin (Pygosce… Torge…    36.7   19.3     193  3450 FEMA…  8.77 -25.3
## 5 Adelie Penguin (Pygosce… Torge…    39.3   20.6     190  3650 MALE   8.66 -25.3
## 6 Adelie Penguin (Pygosce… Torge…    38.9   17.8     181  3625 FEMA…  9.19 -25.2
#Create a missing value map
missmap(p, main = "Missingness Map of Penguins")

Impute Data

Notice there is some missing data. To keep all records, we will impute the data using the mice package. Also, we set Male = 0 and Female = 1 to use for imputing and to set up as our response variable.

head(p)
##                               Species    Island CLength CDepth FLength BMass
## 1 Adelie Penguin (Pygoscelis adeliae) Torgersen    39.1   18.7     181  3750
## 2 Adelie Penguin (Pygoscelis adeliae) Torgersen    39.5   17.4     186  3800
## 3 Adelie Penguin (Pygoscelis adeliae) Torgersen    40.3   18.0     195  3250
## 4 Adelie Penguin (Pygoscelis adeliae) Torgersen    36.7   19.3     193  3450
## 5 Adelie Penguin (Pygoscelis adeliae) Torgersen    39.3   20.6     190  3650
## 6 Adelie Penguin (Pygoscelis adeliae) Torgersen    38.9   17.8     181  3625
##   Sex     D15       D13
## 1   0 8.47173 -25.21315
## 2   1 8.94956 -24.69454
## 3   1 8.36821 -25.33302
## 4   1 8.76651 -25.32426
## 5   0 8.66496 -25.29805
## 6   1 9.18718 -25.21799

Data Visualization

Variables & Descriptions

# Create a data frame
DF <- data.frame (
  Variables = c("Species", "Culmen Length", "Culmen Depth", "Flipper Length", "Body Mass", "Island", "Sex"),
  Description = c("penguin species (Chinstrap, Adélie, or Gentoo)", "culmen length (mm)", "culmen depth (mm)", "flipper length (mm)", "body mass (g)", "island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)", "penguin sex (0 = Male, 1 = Female)")
)

formattable(DF, list(Variables = color_bar("#48cae4"), Description = color_bar("#e9c46a")), align = c("c", "l"))
Variables Description
Species penguin species (Chinstrap, Adélie, or Gentoo)
Culmen Length culmen length (mm)
Culmen Depth culmen depth (mm)
Flipper Length flipper length (mm)
Body Mass body mass (g)
Island island name (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica)
Sex penguin sex (0 = Male, 1 = Female)

Penguin Reference

Artwork by @allison_horst

Comparing Species and Location

q<-ggplot(p, aes(x = Island,  fill = Species)) +
  geom_bar(stat = "count", position = "dodge", col = "black") +
  labs(x = "Which Antartic Island", 
       y = "The Number of Penguins", 
       title = "What Type of Penguin and Where",
       fill = "Penguin Species",
       caption = "Voronyak 2023") + 
  theme(axis.text.x = element_text(angle = -90)) +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

q + coord_flip()

The bar chart shows which type of penguin was sampled and on which Antartic island. There seems to be only Adelie Penguins on the Torgersen Island, Both Chinstrap and Adelie Penguins in Dream Island, and both Gentoo and Adelie Penguins on Biscoe Island. Note: this seems to imply that only these types of penguins are on those island, but it may not be the case.

These islands are located on the Antarctic Peninsula.

Comparing Bill Length and Flipper Length (mm)

ggplot(p, aes(x = CLength, y = FLength))+  
  geom_point(aes(colour = factor(Species)), size = 3)+ 
  labs(colour = "Species", x = "Bill Length (mm)", y = "Flipper Length (mm)")

The scatter plot shows how the three species of penguins differ based on culmen length (bill length) and flipper length measured in mm.

Correlation

library("ggcorrplot")

# Compute a correlation matrix
corr <- round(cor(p[,c(3:6)]), 1)
# Visualize
ggcorrplot(corr, p.mat = cor_pmat(p[,c(3:6)]),
           hc.order = TRUE, type = "lower",
           color = c("#FC4E07", "white", "#00AFBB"),
           outline.col = "white", lab = TRUE)

The correlation plot shows how each of the main predictor(Flipper Length, Culmen Length, Culmen Depth, and Body Mass) are related in some manner. There is a strong positive correlation between the flipper length and the body mass as well as between the culmen length and the flipper length. The culmen depth seems to be strong negatively correlation with the flipper length. In other words, as the flipper length increases, the depth of the bill seems to decrease.

All but culmen length and culmen depth seem to be correlated strongly.

Species Body Mass

ggplot(p, aes(BMass)) +
  geom_histogram(aes(fill = Species, color = Species), bins = 20, 
                 position = "identity", alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() + 
  labs(x = "Body Mass (g)", y = "Count")

Shown in the layer histogram, we see that the Chinstrap and Adelie Penguin have about the same Body Mass (g), but there were less Chinstrap Penguins sampled. We can also see that the Gentoo penguin is quite a bit larger, in fact almost double the mass. We can see that the mass appears to be normally distributed.

Grouped Scatter Plot With Marginal Density Plots

ggscatterhist(
  p, x = "CLength", y = "CDepth",
  color = "Species", size = 3, alpha = 0.6,
  palette = c("#00AFBB", "#E7B800", "#FC4E07"),
  xlab = "Bill Length",
  ylab = "Bill Depth",
  margin.params = list(fill = "Species", color = "black", size = 0.2)) 

The largest penguin, the Chinstrap, has both the longest and deepest bill. You can see that the Adelie has just about at deep of a bill, but it is rather short, and similarly the Gentoo has a long bill, but a shallow bill.

Finding Significant Predictors for the Sex of a Penguin

Using Logistic Regression to find predictors.

glm <- glm(Sex ~., data = p[,c(2:9)], family = binomial)
summary(glm)
## 
## Call:
## glm(formula = Sex ~ ., family = binomial, data = p[, c(2:9)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6533  -0.1728   0.0085   0.2119   3.9455  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     118.336143  18.596728   6.363 1.97e-10 ***
## IslandDream       0.415319   0.764033   0.544  0.58673    
## IslandTorgersen  -0.081012   0.837234  -0.097  0.92292    
## CLength          -0.198438   0.076679  -2.588  0.00966 ** 
## CDepth           -2.165977   0.294483  -7.355 1.91e-13 ***
## FLength           0.100191   0.046287   2.165  0.03042 *  
## BMass            -0.007115   0.001132  -6.287 3.24e-10 ***
## D15              -2.746287   0.677328  -4.055 5.02e-05 ***
## D13               1.516261   0.465853   3.255  0.00113 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 475.43  on 342  degrees of freedom
## Residual deviance: 134.74  on 334  degrees of freedom
## AIC: 152.74
## 
## Number of Fisher Scoring iterations: 7
glm.step <- step(glm, direction = "both")
## Start:  AIC=152.74
## Sex ~ Island + CLength + CDepth + FLength + BMass + D15 + D13
## 
##           Df Deviance    AIC
## - Island   2   135.23 149.23
## <none>         134.74 152.74
## - FLength  1   140.00 156.00
## - CLength  1   142.55 158.55
## - D13      1   146.50 162.50
## - D15      1   155.36 171.36
## - BMass    1   234.39 250.39
## - CDepth   1   272.12 288.12
## 
## Step:  AIC=149.23
## Sex ~ CLength + CDepth + FLength + BMass + D15 + D13
## 
##           Df Deviance    AIC
## <none>         135.23 149.23
## - FLength  1   140.44 152.44
## + Island   2   134.74 152.74
## - CLength  1   142.58 154.58
## - D13      1   148.11 160.11
## - D15      1   155.41 167.41
## - BMass    1   241.41 253.41
## - CDepth   1   289.18 301.18
summary(glm.step)
## 
## Call:
## glm(formula = Sex ~ CLength + CDepth + FLength + BMass + D15 + 
##     D13, family = binomial, data = p[, c(2:9)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6480  -0.1734   0.0079   0.2080   3.9214  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 118.935186  18.646437   6.378 1.79e-10 ***
## CLength      -0.181377   0.071577  -2.534 0.011276 *  
## CDepth       -2.152909   0.289543  -7.436 1.04e-13 ***
## FLength       0.099330   0.046020   2.158 0.030897 *  
## BMass        -0.007222   0.001103  -6.546 5.91e-11 ***
## D15          -2.704466   0.672845  -4.019 5.83e-05 ***
## D13           1.562456   0.461732   3.384 0.000715 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 475.43  on 342  degrees of freedom
## Residual deviance: 135.23  on 336  degrees of freedom
## AIC: 149.23
## 
## Number of Fisher Scoring iterations: 7
glm.step
## 
## Call:  glm(formula = Sex ~ CLength + CDepth + FLength + BMass + D15 + 
##     D13, family = binomial, data = p[, c(2:9)])
## 
## Coefficients:
## (Intercept)      CLength       CDepth      FLength        BMass          D15  
##  118.935186    -0.181377    -2.152909     0.099330    -0.007222    -2.704466  
##         D13  
##    1.562456  
## 
## Degrees of Freedom: 342 Total (i.e. Null);  336 Residual
## Null Deviance:       475.4 
## Residual Deviance: 135.2     AIC: 149.2

We can see that Culmen Length, Culmen Depth, Flipper Length, Body Mass, D15, and D13 are statistically significant predictors using Logistic Regression.

Below are the results from using tree classification model.

tree <- rpart(Sex ~., data = p[,c(2:9)], control = rpart.control(maxdepth = 3), method = "class")
# Variable importance plot using a barchart
Heights = as.numeric(tree$variable.importance) # Heights of bars
Names = names(tree$variable.importance) # Names of bars
names(Heights) = Names # Convert Heights to a vector with each value named
library(RColorBrewer)
coul <- brewer.pal(5, "Set2") 
par(mar = c(5, 6, 3, 2))
barplot(sort(Heights),  # Sort bars to create a Pareto chart
        horiz = TRUE,
        main = "Importance of Predictors Using a Classification Tree",
        las = 2,
        col=coul
)

It appears that the predictors are very similar when using the classification tree or the logistic regression models.

#Partition 
set.seed(1)
n = nrow(p)
ind = createDataPartition(1:n,p = 0.6)[[1]]
train.df = p[ind, ]
valid.df = p[-ind, ]

glm <- glm(Sex ~ CLength + CDepth + FLength + BMass + D15 + D13, data = train.df, family = binomial)  ## Note Stable isotope values of carbon (delta13C) and nitrogen (delta15N)
glm.step <- step(glm, direction = "backward")
## Start:  AIC=103.69
## Sex ~ CLength + CDepth + FLength + BMass + D15 + D13
## 
##           Df Deviance    AIC
## - FLength  1   91.184 103.18
## <none>         89.693 103.69
## - CLength  1   92.417 104.42
## - D13      1   94.427 106.43
## - D15      1  102.664 114.66
## - BMass    1  140.446 152.45
## - CDepth   1  184.753 196.75
## 
## Step:  AIC=103.18
## Sex ~ CLength + CDepth + BMass + D15 + D13
## 
##           Df Deviance    AIC
## - CLength  1   92.695 102.69
## <none>         91.184 103.18
## - D13      1   94.449 104.45
## - D15      1  102.730 112.73
## - BMass    1  165.301 175.30
## - CDepth   1  194.380 204.38
## 
## Step:  AIC=102.69
## Sex ~ CDepth + BMass + D15 + D13
## 
##          Df Deviance    AIC
## - D13     1   94.679 102.68
## <none>        92.695 102.69
## - D15     1  107.356 115.36
## - CDepth  1  195.120 203.12
## - BMass   1  248.111 256.11
## 
## Step:  AIC=102.68
## Sex ~ CDepth + BMass + D15
## 
##          Df Deviance    AIC
## <none>        94.679 102.68
## - D15     1  107.487 113.49
## - CDepth  1  195.180 201.18
## - BMass   1  252.907 258.91
predict.prob <- predict(glm.step, valid.df, type = "response")

predict.label =  ifelse(predict.prob > 0.5, 1, 0)
observed.label = valid.df$Sex

confusionMatrix(factor(predict.label), factor(observed.label), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 57  2
##          1  8 69
##                                           
##                Accuracy : 0.9265          
##                  95% CI : (0.8689, 0.9642)
##     No Information Rate : 0.5221          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8521          
##                                           
##  Mcnemar's Test P-Value : 0.1138          
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.8769          
##          Pos Pred Value : 0.8961          
##          Neg Pred Value : 0.9661          
##              Prevalence : 0.5221          
##          Detection Rate : 0.5074          
##    Detection Prevalence : 0.5662          
##       Balanced Accuracy : 0.9244          
##                                           
##        'Positive' Class : 1               
## 
plot.lift(observed.label, predict.prob) 

forecast::accuracy(predict.prob, valid.df$Sex)
##                   ME      RMSE       MAE  MPE MAPE
## Test set -0.03501756 0.2513043 0.1276271 -Inf  Inf

Classification Tree for Penguins Sex

ct1 <- rpart(Sex ~., data = train.df,control = rpart.control(maxdepth = 3))

fancyRpartPlot(ct1, main = "Penguin Classifcation Tree", sub ="Voronyak 2023", palettes = c("Greens"), type = 1)

ct1.pred <- predict(ct1, valid.df)
ct1.pred = round(ct1.pred, digits = 0)
forecast::accuracy(ct1.pred, valid.df$Sex)
##                   ME      RMSE       MAE  MPE MAPE
## Test set -0.02941176 0.3429972 0.1176471 -Inf  Inf
ct2 <- rpart(Sex ~.,data = valid.df)
best.cp = ct2$cptable[which.min(ct2$cptable[, "xerror"]), "CP"]
pruned.tree = prune(ct2, cp = best.cp)

fancyRpartPlot(pruned.tree, main = "Pruned Penguin Classifcation Tree", sub ="Voronyak 2023", palettes = c("Blues"), type = 1)

ct2.pred <- predict(ct2, valid.df)
ct2.pred = round(ct2.pred, digits = 0)
forecast::accuracy(ct2.pred, valid.df$Sex)
##          ME      RMSE        MAE  MPE MAPE
## Test set  0 0.2970443 0.08823529 -Inf  Inf

As we can see the pruned tree has the better RMSE where it started the tree with Body Mass.

Neural Network

Using the top three predictors culmen depth, body mass, and Nitrogen Delta 15N, we will create a neural network to help predict sex.

# Normalize the training set to get a recipe
norm.values = caret::preProcess(train.df, method = "range")

# Normalize both training and validation data. 
train.norm <- predict(norm.values, train.df)
valid.norm <- predict(norm.values, valid.df)

set.seed(2)
nn <- neuralnet(formula = (Sex == "0") + (Sex == "1") ~ BMass + CDepth + D15, 
               data = train.norm, 
               hidden = c(2),           
               act.fct = "logistic",    
               linear.output = FALSE,
               rep = 10,       
               lifesign="none",  
)

plot(nn, rep = "best",col.entry = "#00AFBB", col.hidden = "#E7B800", col.out = "#FC4E07") 

pred = predict(nn, 
               newdata = valid.norm, 
               rep = which.min(nn$result.matrix[1,]))
head(pred) 
##            [,1]         [,2]
## 3  5.301436e-11 1.000000e+00
## 4  9.498602e-01 4.967008e-02
## 5  9.998096e-01 1.862962e-04
## 8  1.004312e-04 9.999011e-01
## 9  9.999924e-01 7.397279e-06
## 10 5.652588e-12 1.000000e+00
pred.train = predict(nn, newdata = train.norm) # based on normalized training data

pred.valid = predict(nn, newdata = valid.norm) # based on normalized validation data

Min = min(train.df$Sex) # Keep in mind that we normalized data 
# based on the minimums and maximums of the train data
Max = max(train.df$Sex)

pred.ori.valid = as.numeric(pred.valid)*(Max-Min) + Min

forecast::accuracy(pred.ori.valid, valid.df$Sex)
##                  ME      RMSE       MAE  MPE MAPE
## Test set 0.07223528 0.9296434 0.8879938 -Inf  Inf

Based off of the RMSE, we see that the neural net isn’t as good of a model as the classification trees nor the logistic models. Also, note that neural networks take a long time to process thus using either the trees or logistic models would be best.

Comparing Attributes of Male and Female Penguins

We will first compare the body mass of the Male and Female penguins and then compare the bills. This, hopefully, will show what the classification trees above show, that both body mass and bill depth have a large role in determining male or female. Thus, if you didn’t know how to identify the sex, you could just measure the mass and the bill depth and could be pretty certain of the sex.

Adelie = subset(p, Species == "Adelie Penguin (Pygoscelis adeliae)")    #create subsets of data based off of species
Gentoo = subset(p, Species == "Gentoo penguin (Pygoscelis papua)")
Chinstrap = subset(p, Species == "Chinstrap penguin (Pygoscelis antarctica)")

Adelie$Sex <-gsub("1","Female", Adelie$Sex)  #covert 0's and 1's back to Male and Female
Adelie$Sex <-gsub("0","Male", Adelie$Sex)
Adelie$Sex <- as.character(Adelie$Sex)

Gentoo$Sex <-gsub("1","Female", Gentoo$Sex)
Gentoo$Sex <-gsub("0","Male", Gentoo$Sex)
Gentoo$Sex <- as.character(Gentoo$Sex)

Chinstrap$Sex <-gsub("1","Female", Chinstrap$Sex)
Chinstrap$Sex <-gsub("0","Male", Chinstrap$Sex)
Chinstrap$Sex <- as.character(Chinstrap$Sex)


plot1 <- ggplot(Adelie, aes(BMass)) +
  geom_histogram(aes(fill = Sex, color = Sex), bins = 20, 
                 position = "identity", alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
 labs(x = "Body Mass", y = "Number of Adelie Penguins", title = "Body Mass of Adelie Penguins" ,subtitle = "Males vs. Females") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

plot2 <- ggplot(Gentoo, aes(BMass)) +
  geom_histogram(aes(fill = Sex, color = Sex), bins = 20, 
                 position = "identity", alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d()+
  labs(x = "Body Mass", y = "Number of Gentoo Penguins", title = "Body Mass of Gentoo Penguins" ,subtitle = "Males vs. Females") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

plot3 <- ggplot(Chinstrap, aes(BMass)) +
  geom_histogram(aes(fill = Sex, color = Sex), bins = 20, 
                 position = "identity", alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  labs(x = "Body Mass", y = "Number of Chinstrap Penguins", title = "Body Mass of Chinstrap Penguins" ,subtitle = "Males vs. Females") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

grid.arrange(plot1, plot2, plot3, ncol=2, nrow=2)

It is quite clear from the data that the Males tend to have a larger Body Mass than the Females for all three species of penguins. Thus, this would be a good predictor for determining the sex of the species.

plot4 <- ggplot(Adelie, aes(CDepth)) +
  geom_histogram(aes(fill = Sex, color = Sex), bins = 20, 
                 position = "identity", alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
 labs(x = "Bill Depth", y = "Number of Adelie Penguins", title = "Bill Depth of Adelie Penguins" ,subtitle = "Males vs. Females") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

plot5 <- ggplot(Gentoo, aes(CDepth)) +
  geom_histogram(aes(fill = Sex, color = Sex), bins = 20, 
                 position = "identity", alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d()+
  labs(x = "Bill Depth", y = "Number of Gentoo Penguins", title = "Bill Depth of Gentoo Penguins" ,subtitle = "Males vs. Females") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

plot6 <- ggplot(Chinstrap, aes(CDepth)) +
  geom_histogram(aes(fill = Sex, color = Sex), bins = 20, 
                 position = "identity", alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  labs(x = "Bill Depth", y = "Number of Chinstrap Penguins", title = "Bill Depth of Chinstrap Penguins" ,subtitle = "Males vs. Females") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

grid.arrange(plot4, plot5, plot6, ncol=2, nrow=2)

We can also see that the predictor culmen depth, i.e. bill depth, is also a good predictor. It is clear that the the male penguins of any of the three species have deeper bills.

Findings

After visualizing the data it is clear to see that the Gentoo penguin is the largest of the three and Adelie/Chinstrap are both relatively close in size, but they differ by bill sizes. The Adelie tends to have a deeper bill and the Chinstrap has a longer bill. We can also see that the Male penguins tend to be larger in size and have a deeper bill. These two predictors, bill depth and body mass, are the best predictors when finding the sex of the penguin. Therefor by just measuring bill depth and body mass you’d be able to determine the sex of the penguin with high accuracy.

Citation

Palmer Station Antarctica LTER and K. Gorman. 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f (Accessed 2023-04-14).