Data Exploration

penguins_df <-palmerpenguins::penguins

The penguins dataset is composed of 344 datapoints and has one response variable (species) and seven explanatory variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year).

Intuitively, we know that the year variable shouldn’t make a difference in the penguin species, so we will remove it from the dataset. We are now left with two categorical variables (island and sex) and four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm andbody_mass_g).

# remove year variable
penguins_df <- penguins_df %>%
  select(-year)

Dealing with Missing Values

summary(penguins_df)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex     
##  Min.   :172.0     Min.   :2700   female:165  
##  1st Qu.:190.0     1st Qu.:3550   male  :168  
##  Median :197.0     Median :4050   NA's  : 11  
##  Mean   :200.9     Mean   :4202               
##  3rd Qu.:213.0     3rd Qu.:4750               
##  Max.   :231.0     Max.   :6300               
##  NA's   :2         NA's   :2

From the summay statistics, we can see that 5 out of 6 of the explanatory variables have at least some null values. It doesn’t make sense to impute the null records of our categorical variable (sex), so we will remove them from our dataset. We will impute the missing data of the numerical values using the Multivariate imputation by chained equations (MICE) method. Multiple imputation involves creating multiple predictions for each missing value, helping to account for the uncertainty in the individual imputations.

# remove records with null sex values
penguins_df <- penguins_df %>% filter(!is.na(sex))

# impute null values using MICE method
penguins_df <- complete(mice(data = penguins_df,
                         method = "pmm", print = FALSE), 3)

The final dataset contains 333 records with 6 explanatory variables.

Variable Relationships

Let’s take a look at the relationships of the numeric variables against each other.

corr.d <- penguins_df %>% select_if(is.numeric) %>% cor()
corr.d[lower.tri(corr.d, diag = TRUE)] <- NA
corrplot(corr.d, type = "upper", diag = FALSE)

We can see that flipper_length_mm and body_mass_g are highly positively correlated, which means the larger the flipper length, the heavier the penguin. We should take care to include at most one of these variables in our models.

We can also take a look at how the variables relate to the species:

caret::featurePlot(x = penguins_df %>% select_if(is.numeric),
        y = penguins_df$species,
        plot = "pairs",
        auto.key = list(columns = 3))

We can see that the Gentoo species is more easily separable from the other two species. Gentoo penguins tend to have a smaller bill depth, larger flipper length, and larger body mass than the other species.

Creating a Binary Response Variable

The response variable is composed of 3 distinct categories - Adelie, Chinstrap, and Gentoo. Because our first task is to create a binary classifier, we will need to manipulate the data to develop two classes. Looking at the breakout of species per island, we can see that the Chinstrap species only lives on Dream island and the Gentoo species only lives on Biscoe island. The Adelie species lives on Biscoe, Dream, and Torgersen.

penguins_df %>% 
  group_by(species, island) %>% 
  summarise(n_records = n())
## # A tibble: 5 x 3
## # Groups:   species [3]
##   species   island    n_records
##   <fct>     <fct>         <int>
## 1 Adelie    Biscoe           44
## 2 Adelie    Dream            55
## 3 Adelie    Torgersen        47
## 4 Chinstrap Dream            68
## 5 Gentoo    Biscoe          119

This means that if we encounter a penguin on Torgersen island, we immediately know that it must be of the Adelie species. However, if we encounter a penguin on Biscoe island, it can be either of the Adelie or Gentoo species. Similarly, if we encounter a penguin on Dream island, we know it can either be of the Adelie or Chinstrap species. For this reason, we will compress our species response variable into two categories: Adelie (1) or Other (0). If it is other, we can deduce from the island variable whether the species is Chinstrap or Gentoo.

# create species_binary classification response variable
penguins_df <- penguins_df %>%
  mutate(species_binary = ifelse(species == 'Adelie','Adelie','Other'))

penguins_df$species_binary <- factor(penguins_df$species_binary, levels = c("Other", "Adelie"))

Train-Test Split

The two models will be trained on 70% of the dataset and validated on the remaining 30% of the set.

which_train <- sample(x = c(TRUE, FALSE), size = nrow(penguins_df), replace = TRUE, prob = c(0.7, 0.3))
train_set <- penguins_df[which_train, ]
test_set <- penguins_df[!which_train, ]

The training set contains 234 records and the test set contains 99 records.

Binary Logistic Regression

Backwards stepwise regression is performed and the result is a model with the following variables: island, bill_depth_mm, and flipper_length_mm.

log_model <- glm(species_binary ~ island + bill_depth_mm + flipper_length_mm, 
                 family = 'binomial', 
                 data = train_set)

summary(log_model)
## 
## Call:
## glm(formula = species_binary ~ island + bill_depth_mm + flipper_length_mm, 
##     family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2612  -0.2583  -0.1169   0.1789   2.3208  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         24.83366    6.76813   3.669 0.000243 ***
## islandDream         -2.97353    0.84756  -3.508 0.000451 ***
## islandTorgersen     15.55662 1097.71653   0.014 0.988693    
## bill_depth_mm        0.75447    0.18670   4.041 5.32e-05 ***
## flipper_length_mm   -0.18604    0.03149  -5.909 3.45e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 318.83  on 233  degrees of freedom
## Residual deviance: 116.38  on 229  degrees of freedom
## AIC: 126.38
## 
## Number of Fisher Scoring iterations: 17

Coefficient Interpretation

In general, we know that the larger the coefficient, the more of an impact it has on the positive classification (Adelie species). With this in mind, we can interpret positive coefficients as being more indicative of Adelie and negative coefficients as being more indicative of either Gentoo or Chinstrap.

Some interesting things to note:

  • island: If a penguin lives on Dream island it is less likely to be of the Adelie species. If it lives on Torgersen it is more likely to be of the Adelie species.
  • bill_depth_mm: A positive value means that the larger the bill depth, the more likely the penguin is of the Adelie species.
  • flipper_length_mm: A negative value means that the larger the flipper size, the less likely the penguin is of the Adelie species.

All of these coefficients align with our previous exploratory analysis!

Metrics

Now we can predict on our test set and evaluate the model.

log_preds <- predict(log_model, test_set, type = 'response')
class_prediction <- factor(ifelse(log_preds > 0.50, "Adelie", "Other"), 
                           levels = c("Other", "Adelie"))

log_auc <- auc(response = test_set$species_binary, predictor = log_preds)
log_cm <- confusionMatrix(data = class_prediction, reference =test_set$species_binary)
log_accuracy <- log_cm$overall['Accuracy']
log_tpr <- log_cm$byClass['Sensitivity']
log_fpr <- 1 - log_tpr
log_tnr <- log_cm$byClass['Specificity']
log_fnr <- 1 - log_tnr
  • AUC: 0.9633797
  • Accuracy: 0.8888889
  • TPR (Sensitivity): 0.9038462
  • FPR (1-TPR): 0.0961538
  • TNR (Specificity): 0.8723404
  • FNR (1-TPR): 0.1276596

Multinomial Logistic Regression

First, we will define “Adelie” as the reference level (or “baseline species”) for the dataset. This means that our trained model will result in coefficients of the features for the remaining two species in relation to Adelie. We will start with a baseline model that includes all features.

require(nnet)
train_set$species <- relevel(train_set$species, ref = "Adelie")
test_set$species <- relevel(test_set$species, ref = "Adelie")

multinom_model <- multinom(species ~ ., data = train_set %>% select(-species_binary))
## # weights:  27 (16 variable)
## initial  value 257.075276 
## iter  10 value 11.160107
## iter  20 value 2.050734
## iter  30 value 0.036582
## iter  40 value 0.000690
## iter  50 value 0.000433
## iter  60 value 0.000153
## final  value 0.000088 
## converged
summary(multinom_model)
## Call:
## multinom(formula = species ~ ., data = train_set %>% select(-species_binary))
## 
## Coefficients:
##           (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap  -208.60591    56.40348       -59.25874       13.49186     -20.20993
## Gentoo      -20.58418   -85.90795       -46.68750       14.94218     -18.69232
##           flipper_length_mm body_mass_g   sexmale
## Chinstrap       -0.02590278 -0.01495788 -15.89397
## Gentoo          -2.36043896  0.04887849 -23.91683
## 
## Std. Errors:
##            (Intercept)  islandDream islandTorgersen bill_length_mm
## Chinstrap 0.9736602091 9.736602e-01    1.804478e-12   1.178404e+02
## Gentoo    0.0000243027 4.063193e-06    2.063369e-13   1.040539e-03
##           bill_depth_mm flipper_length_mm body_mass_g      sexmale
## Chinstrap  3.210054e+01      18.703559483  1.22941044 6.109724e-04
## Gentoo     4.447128e-04       0.004762779  0.09902177 2.457824e-05
## 
## Residual Deviance: 0.0001760353 
## AIC: 32.00018

The super low AIC is indicative of a good model fit, so we will leave all of the variables in the model.

Coefficient Interpretation

Each row in the summary table corresponds to a model equation. The first row compares species = Chinstrap to the baseline, species = Adelie. Similarly, the second row compares species = Gentoo to the baseline, species = Adelie. The model equations result in a value that is represented as the log of odds, which is the log of the probability that the species is either Chinstrap or Gentoo divided by the probability that the species is Adelie:

\[ln(\frac{P(species = Other)}{P(species = Adelie)})\]

However, we often interpret the output of multinomial logistic regression models in terms of the relative risk (or odds). This is simply the ratio of the probability of choosing the non-baseline category over the probability of choosing the baseline category, or the exponentiated version of the log-odds ratio. We can exponentiate our coefficients to make better sense of them.

exp(coef(multinom_model))
##            (Intercept)  islandDream islandTorgersen bill_length_mm
## Chinstrap 2.532820e-91 3.131264e+24    1.837625e-26       723501.5
## Gentoo    1.149227e-09 4.905162e-38    5.295123e-21      3085375.5
##           bill_depth_mm flipper_length_mm body_mass_g      sexmale
## Chinstrap  1.670854e-09        0.97442982   0.9851534 1.251228e-07
## Gentoo     7.621327e-09        0.09437879   1.0500928 4.102537e-11

The coefficients represent the change in odds ratio with a 1 unit increase in the variable. In general, a value of 1 represents that there is no change in the odds. However, a value greater than 1 represents an increase and value less than 1 represents a decrease.

We can look at the Chinstrap model coefficients first:

  • Island: Chinstraps are a lot more likely to be on Dream island and a lot less likely to be on Torgersen island than Adelie. This makes sense – in our exploratory analysis, we noted that our dataset contains Chinstraps are only on Dream island and Adelie are the only species located on Torgersen.
  • Bill length: A 1 unit increase in bill length results in a very large odds ratio change. In other words, as the bill length increases, the species is more likely to be Chinstrap. (Chinstraps tend to have larger bill lengths.)
  • Bill depth: A 1 unit increase in bill depth results in a near-zero odds ratio. This means that as the bill depth increases, the species is more likely to be Adelie. (Chinstraps tend to have a smaller bill depth.)
  • Flipper length and body mass: A 1 unit increase in both of these variables results in an odds ratio close to 1. This means that Chinstrap and Adelie have similar flipper lengths and body masses.

Next, we’ll look at the Gentoo model coefficients:

  • Island: Gentoos are a lot less likely to be on Dream or Torgersen Islands than Adelie.
  • Bill length: A 1 unit increase in bill length results in a very large odds ratio change. In other words, as the bill length increases, the species is more likely to be Gentoo. (Gentoos tend to have larger bill lengths.)
  • Bill depth: A 1 unit increase in bill depth results in a near-zero odds ratio. This means that as the bill depth increases, the species is more likely to be Adelie. (Gentoos tend to have a smaller bill depth.)
  • Flipper length: A 1 unit increase in the flipper length results in a small odds ratio. This means that as the flipper length increases, the species is more likely to be Adelie. (Gentoos have a smaller flipper length.)
  • Body mass: A 1 unit increase in body mass results in an odds ratio close to 1. This means that Gentoo and Adelie have similar body masses.

Metrics

Finally, we can predict on our test dataset and calculate the accuracy of our model:

multinom_preds <- predict(multinom_model, test_set, type = "class")
multinom_cm <- table(test_set$species, multinom_preds)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(multinom_cm))/sum(multinom_cm))*100,2)
## [1] 97.98