The penguin dataset is composed of 344 observations with 8 variables, 5 of which are numeric and 3 which are qualitative. The dataset is mostly complete with just a few observations with missing values that will need to be handled.
| Name | data |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
The target variable of interest is the species of penguins, which are categorized into three groups: Adelie, Gentoo and Chinstrap penguins.
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
From this plot, we can make a few key observations:
These island observations are valuable information in differentiating penguin species.
However, the sex of the penguins does not offer much information as the proportion is about even across all species. We can also note a few missing observations labeled as NA.
When looking at body measurements we see that Adelie and Chinstrap penguins largely overlap except for bill_length. This suggests that we might be able to use bill_depth, body_mass and flipper_length to differentiate the Gentoo penguins from the other species. However, the Adelie penguin stands out from the other others in bill_length
We noted from the data summary above that 11 observations were missing for the sex variable. Given that sex provided no useful information for differentiation, we can safely drop it from our analysis. There is also no reason to believe that the year the observation was taken would have any impact on the morphology of the penguins. Therefore, we also drop year from our predictor variables. There are also two observations which are missing body measurements altogether, so these rows will be dropped altogether.
We start by reducing the number of classes to two in order to apply logistic regression with a binary outcome: Gentoo and non-Gentoo penguins.
The first fitted model using all of the independent variables of interests is not statistically significant. We proceed to remove predictors one at a time until we are left with only significant predictors. As we can see from the exploratory data above, a number of variables allow for perfect separation ofthe classes. This makes logistic regression unstable as shown by the p values close to 1.
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = species ~ ., family = binomial(link = "logit"),
## data = m1_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.047e-05 -2.100e-08 -2.100e-08 2.100e-08 4.329e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.482e+02 3.346e+05 0.000 1.000
## bill_length_mm 7.144e-03 5.058e+03 0.000 1.000
## bill_depth_mm -1.142e+01 9.097e+03 -0.001 0.999
## flipper_length_mm 1.239e+00 1.424e+03 0.001 0.999
## body_mass_g 1.805e-02 2.219e+01 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.3415e+02 on 332 degrees of freedom
## Residual deviance: 7.3116e-09 on 328 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
There are a number of ways to reduce the model. The coefficients below are still unstable. The order of elimnination of variables may seem arbitrary but the elimination order presented in the two subsequent models ultimately yields the the most significant predictors.
m1b <- glm(species ~ . - flipper_length_mm, data=m1_data ,family=binomial(link="logit"))
summary(m1b)##
## Call:
## glm(formula = species ~ . - flipper_length_mm, family = binomial(link = "logit"),
## data = m1_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.843e-05 -2.100e-08 -2.100e-08 2.100e-08 8.759e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.229e+01 5.579e+05 0.000 1.000
## bill_length_mm 3.122e+00 1.387e+04 0.000 1.000
## bill_depth_mm -2.033e+01 1.057e+04 -0.002 0.998
## body_mass_g 5.044e-02 3.841e+01 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.3415e+02 on 332 degrees of freedom
## Residual deviance: 1.8161e-08 on 329 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
The resulting model reduced to statistically significant predictors leaves only the bill_length_mm and bill_depth_mm variables. We can interpret the coefficients as follows:
m1c <- glm(species ~ . - flipper_length_mm - body_mass_g, data=m1_data ,family=binomial(link="logit"))
summary(m1c)##
## Call:
## glm(formula = species ~ . - flipper_length_mm - body_mass_g,
## family = binomial(link = "logit"), data = m1_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.83346 -0.01519 -0.00119 0.01196 3.03982
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 48.1175 14.8730 3.235 0.00122 **
## bill_length_mm 0.5561 0.1344 4.138 3.50e-05 ***
## bill_depth_mm -4.4750 1.0500 -4.262 2.03e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.154 on 332 degrees of freedom
## Residual deviance: 29.822 on 330 degrees of freedom
## AIC: 35.822
##
## Number of Fisher Scoring iterations: 10
## (Intercept) bill_length_mm bill_depth_mm
## 7.891412e+20 1.743836e+00 1.138983e-02
We can also interpret the marginal effects as follows:
## (Intercept) bill_length_mm bill_depth_mm
## 0.604837619 0.006990031 -0.056251274
From the classification report below we can pick out the following metrics:
glm.probs <- predict(m1c, type="response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
results <- tibble(target=m1_data$species, pred=glm.pred)
results <- results %>% mutate(pred.class = as.factor(pred), target.class = as.factor(target))
print(confusionMatrix(results$pred.class,results$target.class, positive = "1"))## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 211 3
## 1 3 116
##
## Accuracy : 0.982
## 95% CI : (0.9612, 0.9934)
## No Information Rate : 0.6426
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9608
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9748
## Specificity : 0.9860
## Pos Pred Value : 0.9748
## Neg Pred Value : 0.9860
## Prevalence : 0.3574
## Detection Rate : 0.3483
## Detection Prevalence : 0.3574
## Balanced Accuracy : 0.9804
##
## 'Positive' Class : 1
##
The multinomial logistic regression model uses the same starting data as the logistic regression above. We relevel the Adelie species of penguins as a reference level for multinomial regression. The function multinom_metrics is introduced to summarize the p-values of each model as well as their classification performance.
multinom_metrics <- function(model) {
z <- summary(model)$coefficients/summary(model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
print("P-values:")
print(p)
multi.pred <- predict(model, type="class", model2_data)
results <- tibble(target=model2_data$species, pred=multi.pred)
results <- results %>% mutate(pred.class = pred, target.class = target)
print(confusionMatrix(results$pred.class,results$target.class))
}The first multinomial model contains all the predictors that were used in the initial logistic model. The model output differs from the usual glm above so we rely on the p values calculated by the multinom_metrics function to evaluate the statistical significance of the predictors. We note from the confusion matrix below that the model results in perfect classification of penguins species with 100% accuracy. However, we also note in the output below that the body_mass_g predictor has a very high p-value for both Chinstrap and Gentoo penguins. This predictor is dropped in subsequent models.
## Call:
## multinom(formula = species ~ ., data = model2_data)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -34.60273 58.94543 -84.81399 -2.643720
## Gentoo -4.70502 43.75912 -91.60364 -1.639715
## body_mass_g
## Chinstrap -0.132491128
## Gentoo 0.007448619
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 4.4008386318 73.52088725 51.084116136 15.71457598
## Gentoo 0.0003319148 0.01602115 0.006439479 0.06591798
## body_mass_g
## Chinstrap 0.3479725
## Gentoo 1.4739115
##
## Residual Deviance: 0.0009955954
## AIC: 20.001
## [1] "P-values:"
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 3.774758e-15 0.4226972 0.09685792 0.8663995
## Gentoo 0.000000e+00 0.0000000 0.00000000 0.0000000
## body_mass_g
## Chinstrap 0.7033875
## Gentoo 0.9959678
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 146 0 0
## Chinstrap 0 68 0
## Gentoo 0 0 119
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.989, 1)
## No Information Rate : 0.4384
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.4384 0.2042 0.3574
## Detection Rate 0.4384 0.2042 0.3574
## Detection Prevalence 0.4384 0.2042 0.3574
## Balanced Accuracy 1.0000 1.0000 1.0000
By removing the body_mass_g predictor from the variables to include in the model, we end up with a model with only statististically significant predictors. The confusion matrix reveals that 3 penguins have been misclassified but the model accuracy remains very high at 99.1%. This is our preferred model.
## Call:
## multinom(formula = species ~ . - body_mass_g, data = model2_data)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -7.918499 2.7701294 -4.387811 -0.1732906
## Gentoo -3.551966 0.5140629 -32.264687 2.5082206
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 16.431601172 0.95068234 1.76771076 0.1055478
## Gentoo 0.001695331 0.07848069 0.02933815 0.3704786
##
## Residual Deviance: 15.27589
## AIC: 31.27589
## [1] "P-values:"
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 0.6298722 3.570210e-03 0.0130574 1.006270e-01
## Gentoo 0.0000000 5.746648e-11 0.0000000 1.285905e-11
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 145 2 0
## Chinstrap 1 66 0
## Gentoo 0 0 119
##
## Overall Statistics
##
## Accuracy : 0.991
## 95% CI : (0.9739, 0.9981)
## No Information Rate : 0.4384
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9859
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9932 0.9706 1.0000
## Specificity 0.9893 0.9962 1.0000
## Pos Pred Value 0.9864 0.9851 1.0000
## Neg Pred Value 0.9946 0.9925 1.0000
## Prevalence 0.4384 0.2042 0.3574
## Detection Rate 0.4354 0.1982 0.3574
## Detection Prevalence 0.4414 0.2012 0.3574
## Balanced Accuracy 0.9912 0.9834 1.0000
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap 0.0003639481 15.960700 1.242790e-02 0.8408932
## Gentoo 0.0286682265 1.672071 9.719064e-15 12.2830539
The model logit coefficients are transformed to relative risk ratios (odds) and can be interpretted as follows. In comparison to Adelie penguins and keeping all other variables constant:
For comparison, we also take a look at another model that uses the same predictors as the final logistic regression model above which is more parsimonious. We see the predictors remain significant but the accuracy has dropped slighly to 96.4%. Because of the reduction in accuracy on the training data, the previous model (Multinomial 2) is preferred.
## Call:
## multinom(formula = species ~ . - flipper_length_mm - body_mass_g,
## data = model2_data)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm
## Chinstrap -25.35966 2.230280 -3.979108
## Gentoo 23.81971 2.716294 -8.310034
##
## Std. Errors:
## (Intercept) bill_length_mm bill_depth_mm
## Chinstrap 13.77359 0.6990241 1.497086
## Gentoo 20.24687 0.7167481 1.823403
##
## Residual Deviance: 47.78413
## AIC: 59.78413
## [1] "P-values:"
## (Intercept) bill_length_mm bill_depth_mm
## Chinstrap 0.06559516 0.0014199603 7.862884e-03
## Gentoo 0.23940958 0.0001508011 5.178307e-06
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 143 3 0
## Chinstrap 3 61 2
## Gentoo 0 4 117
##
## Overall Statistics
##
## Accuracy : 0.964
## 95% CI : (0.9379, 0.9812)
## No Information Rate : 0.4384
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9435
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9795 0.8971 0.9832
## Specificity 0.9840 0.9811 0.9813
## Pos Pred Value 0.9795 0.9242 0.9669
## Neg Pred Value 0.9840 0.9738 0.9906
## Prevalence 0.4384 0.2042 0.3574
## Detection Rate 0.4294 0.1832 0.3514
## Detection Prevalence 0.4384 0.1982 0.3634
## Balanced Accuracy 0.9817 0.9391 0.9823