penguins_df <-palmerpenguins::penguins
The penguins dataset is composed of 344 datapoints and has one response variable (species) and seven explanatory variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year).
Intuitively, we know that the year variable shouldn’t make a difference in the penguin species, so we will remove it from the dataset. We are now left with two categorical variables (island and sex) and four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm andbody_mass_g).
# remove year variable
penguins_df <- penguins_df %>%
select(-year)
summary(penguins_df)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex
## Min. :172.0 Min. :2700 female:165
## 1st Qu.:190.0 1st Qu.:3550 male :168
## Median :197.0 Median :4050 NA's : 11
## Mean :200.9 Mean :4202
## 3rd Qu.:213.0 3rd Qu.:4750
## Max. :231.0 Max. :6300
## NA's :2 NA's :2
From the summay statistics, we can see that 5 out of 6 of the explanatory variables have at least some null values. It doesn’t make sense to impute the null records of our categorical variable (sex), so we will remove them from our dataset. We will impute the missing data of the numerical values using the Multivariate imputation by chained equations (MICE) method. Multiple imputation involves creating multiple predictions for each missing value, helping to account for the uncertainty in the individual imputations.
# remove records with null sex values
penguins_df <- penguins_df %>% filter(!is.na(sex))
# impute null values using MICE method
penguins_df <- complete(mice(data = penguins_df,
method = "pmm", print = FALSE), 3)
The final dataset contains 333 records with 6 explanatory variables.
Let’s take a look at the relationships of the numeric variables against each other.
corr.d <- penguins_df %>% select_if(is.numeric) %>% cor()
corr.d[lower.tri(corr.d, diag = TRUE)] <- NA
corrplot(corr.d, type = "upper", diag = FALSE)
We can see that flipper_length_mm and body_mass_g are highly positively correlated, which means the larger the flipper length, the heavier the penguin. We should take care to include at most one of these variables in our models.
We can also take a look at how the variables relate to the species:
caret::featurePlot(x = penguins_df %>% select_if(is.numeric),
y = penguins_df$species,
plot = "pairs",
auto.key = list(columns = 3))
We can see that the Gentoo species is more easily separable from the other two species. Gentoo penguins tend to have a smaller bill depth, larger flipper length, and larger body mass than the other species.
The response variable is composed of 3 distinct categories - Adelie, Chinstrap, and Gentoo. Because our first task is to create a binary classifier, we will need to manipulate the data to develop two classes. Looking at the breakout of species per island, we can see that the Chinstrap species only lives on Dream island and the Gentoo species only lives on Biscoe island. The Adelie species lives on Biscoe, Dream, and Torgersen.
penguins_df %>%
group_by(species, island) %>%
summarise(n_records = n())
## # A tibble: 5 x 3
## # Groups: species [3]
## species island n_records
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 55
## 3 Adelie Torgersen 47
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 119
This means that if we encounter a penguin on Torgersen island, we immediately know that it must be of the Adelie species. However, if we encounter a penguin on Biscoe island, it can be either of the Adelie or Gentoo species. Similarly, if we encounter a penguin on Dream island, we know it can either be of the Adelie or Chinstrap species. For this reason, we will compress our species response variable into two categories: Adelie (1) or Other (0). If it is other, we can deduce from the island variable whether the species is Chinstrap or Gentoo.
# create species_binary classification response variable
penguins_df <- penguins_df %>%
mutate(species_binary = ifelse(species == 'Adelie','Adelie','Other'))
penguins_df$species_binary <- factor(penguins_df$species_binary, levels = c("Other", "Adelie"))
The two models will be trained on 70% of the dataset and validated on the remaining 30% of the set.
which_train <- sample(x = c(TRUE, FALSE), size = nrow(penguins_df), replace = TRUE, prob = c(0.7, 0.3))
train_set <- penguins_df[which_train, ]
test_set <- penguins_df[!which_train, ]
The training set contains 234 records and the test set contains 99 records.
Backwards stepwise regression is performed and the result is a model with the following variables: island, bill_depth_mm, and flipper_length_mm.
log_model <- glm(species_binary ~ island + bill_depth_mm + flipper_length_mm,
family = 'binomial',
data = train_set)
summary(log_model)
##
## Call:
## glm(formula = species_binary ~ island + bill_depth_mm + flipper_length_mm,
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2612 -0.2583 -0.1169 0.1789 2.3208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 24.83366 6.76813 3.669 0.000243 ***
## islandDream -2.97353 0.84756 -3.508 0.000451 ***
## islandTorgersen 15.55662 1097.71653 0.014 0.988693
## bill_depth_mm 0.75447 0.18670 4.041 5.32e-05 ***
## flipper_length_mm -0.18604 0.03149 -5.909 3.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 318.83 on 233 degrees of freedom
## Residual deviance: 116.38 on 229 degrees of freedom
## AIC: 126.38
##
## Number of Fisher Scoring iterations: 17
In general, we know that the larger the coefficient, the more of an impact it has on the positive classification (Adelie species). With this in mind, we can interpret positive coefficients as being more indicative of Adelie and negative coefficients as being more indicative of either Gentoo or Chinstrap.
Some interesting things to note:
island: If a penguin lives on Dream island it is less likely to be of the Adelie species. If it lives on Torgersen it is more likely to be of the Adelie species.bill_depth_mm: A positive value means that the larger the bill depth, the more likely the penguin is of the Adelie species.flipper_length_mm: A negative value means that the larger the flipper size, the less likely the penguin is of the Adelie species.All of these coefficients align with our previous exploratory analysis!
Now we can predict on our test set and evaluate the model.
log_preds <- predict(log_model, test_set, type = 'response')
class_prediction <- factor(ifelse(log_preds > 0.50, "Adelie", "Other"),
levels = c("Other", "Adelie"))
log_auc <- auc(response = test_set$species_binary, predictor = log_preds)
log_cm <- confusionMatrix(data = class_prediction, reference =test_set$species_binary)
log_accuracy <- log_cm$overall['Accuracy']
log_tpr <- log_cm$byClass['Sensitivity']
log_fpr <- 1 - log_tpr
log_tnr <- log_cm$byClass['Specificity']
log_fnr <- 1 - log_tnr
First, we will define “Adelie” as the reference level (or “baseline species”) for the dataset. This means that our trained model will result in coefficients of the features for the remaining two species in relation to Adelie. We will start with a baseline model that includes all features.
require(nnet)
train_set$species <- relevel(train_set$species, ref = "Adelie")
test_set$species <- relevel(test_set$species, ref = "Adelie")
multinom_model <- multinom(species ~ ., data = train_set %>% select(-species_binary))
## # weights: 27 (16 variable)
## initial value 257.075276
## iter 10 value 11.160107
## iter 20 value 2.050734
## iter 30 value 0.036582
## iter 40 value 0.000690
## iter 50 value 0.000433
## iter 60 value 0.000153
## final value 0.000088
## converged
summary(multinom_model)
## Call:
## multinom(formula = species ~ ., data = train_set %>% select(-species_binary))
##
## Coefficients:
## (Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm
## Chinstrap -208.60591 56.40348 -59.25874 13.49186 -20.20993
## Gentoo -20.58418 -85.90795 -46.68750 14.94218 -18.69232
## flipper_length_mm body_mass_g sexmale
## Chinstrap -0.02590278 -0.01495788 -15.89397
## Gentoo -2.36043896 0.04887849 -23.91683
##
## Std. Errors:
## (Intercept) islandDream islandTorgersen bill_length_mm
## Chinstrap 0.9736602091 9.736602e-01 1.804478e-12 1.178404e+02
## Gentoo 0.0000243027 4.063193e-06 2.063369e-13 1.040539e-03
## bill_depth_mm flipper_length_mm body_mass_g sexmale
## Chinstrap 3.210054e+01 18.703559483 1.22941044 6.109724e-04
## Gentoo 4.447128e-04 0.004762779 0.09902177 2.457824e-05
##
## Residual Deviance: 0.0001760353
## AIC: 32.00018
The super low AIC is indicative of a good model fit, so we will leave all of the variables in the model.
Each row in the summary table corresponds to a model equation. The first row compares species = Chinstrap to the baseline, species = Adelie. Similarly, the second row compares species = Gentoo to the baseline, species = Adelie. The model equations result in a value that is represented as the log of odds, which is the log of the probability that the species is either Chinstrap or Gentoo divided by the probability that the species is Adelie:
\[ln(\frac{P(species = Other)}{P(species = Adelie)})\]
However, we often interpret the output of multinomial logistic regression models in terms of the relative risk (or odds). This is simply the ratio of the probability of choosing the non-baseline category over the probability of choosing the baseline category, or the exponentiated version of the log-odds ratio. We can exponentiate our coefficients to make better sense of them.
exp(coef(multinom_model))
## (Intercept) islandDream islandTorgersen bill_length_mm
## Chinstrap 2.532820e-91 3.131264e+24 1.837625e-26 723501.5
## Gentoo 1.149227e-09 4.905162e-38 5.295123e-21 3085375.5
## bill_depth_mm flipper_length_mm body_mass_g sexmale
## Chinstrap 1.670854e-09 0.97442982 0.9851534 1.251228e-07
## Gentoo 7.621327e-09 0.09437879 1.0500928 4.102537e-11
The coefficients represent the change in odds ratio with a 1 unit increase in the variable. In general, a value of 1 represents that there is no change in the odds. However, a value greater than 1 represents an increase and value less than 1 represents a decrease.
We can look at the Chinstrap model coefficients first:
Next, we’ll look at the Gentoo model coefficients:
Finally, we can predict on our test dataset and calculate the accuracy of our model:
multinom_preds <- predict(multinom_model, test_set, type = "class")
multinom_cm <- table(test_set$species, multinom_preds)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(multinom_cm))/sum(multinom_cm))*100,2)
## [1] 97.98