The penguins data has 344 rows and 8 columns. Read more about the data at the package website.
# Packages used throughout
library(dplyr)
library(tidyr)
library(ggplot2)
penguins <- palmerpenguins::penguins %>%
  dplyr::select(-year)
# Code for info is in appendix
info(penguins)
| column | type | min | max | zero_count | na_count | unique_values | levels |
|---|---|---|---|---|---|---|---|
| species | factor | NA | NA | NA | 0 | 3 | Adelie, Gentoo, Chinstrap |
| island | factor | NA | NA | NA | 0 | 3 | Biscoe, Dream, Torgersen |
| bill_length_mm | numeric | 32.1 | 59.6 | 0 | 2 | 164 | NA |
| bill_depth_mm | numeric | 13.1 | 21.5 | 0 | 2 | 80 | NA |
| flipper_length_mm | integer | 172.0 | 231.0 | 0 | 2 | 55 | NA |
| body_mass_g | integer | 2700.0 | 6300.0 | 0 | 2 | 94 | NA |
| sex | factor | NA | NA | NA | 11 | 2 | male, female |
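For reference, a minimal sketch of what an info()-style helper could look like; this is my reconstruction, and the actual appendix implementation may differ (for instance, in how it orders factor levels).

# Sketch of an info()-style summary; the appendix code may differ
info <- function(df) {
  rows <- lapply(names(df), function(nm) {
    x <- df[[nm]]
    num <- is.numeric(x)
    data.frame(
      column = nm,
      type = class(x)[1],
      min = if (num) min(x, na.rm = TRUE) else NA,
      max = if (num) max(x, na.rm = TRUE) else NA,
      zero_count = if (num) sum(x == 0, na.rm = TRUE) else NA,
      na_count = sum(is.na(x)),
      unique_values = length(unique(na.omit(x))),
      levels = if (is.factor(x)) paste(levels(x), collapse = ', ') else NA)
  })
  do.call(rbind, rows)
}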
The sex ratio is essentially 50/50 within each species (Gentoo is slightly male-heavy at 61 males to 58 females). The species representation in the sample, however, is not equal.
table(penguins[, c('species', 'sex')], useNA = "always")
## sex
## species female male <NA>
## Adelie 73 73 6
## Chinstrap 34 34 0
## Gentoo 58 61 5
## <NA> 0 0 0
table(penguins$species, useNA = "always") / nrow(penguins)
##
## Adelie Chinstrap Gentoo <NA>
## 0.4418605 0.1976744 0.3604651 0.0000000
Gentoo are all on Biscoe island, Chinstrap are all on Dream island, and Adelie can be found on all three. I am interested in whether the characteristics of the three populations of Adelie are different.
table(penguins[, c('species', 'island')], useNA = 'always')
## island
## species Biscoe Dream Torgersen <NA>
## Adelie 44 56 52 0
## Chinstrap 0 68 0 0
## Gentoo 124 0 0 0
## <NA> 0 0 0 0
adelie <- penguins %>% filter(species == 'Adelie')
table(adelie[, c('island', 'sex')], useNA = 'always')
## sex
## island female male <NA>
## Biscoe 22 22 0
## Dream 27 28 1
## Torgersen 24 23 5
## <NA> 0 0 0
adelie %>%
mutate(ID = 1:nrow(.)) %>%
drop_na() %>%
dplyr::select(-species, -sex) %>%
pivot_longer(
cols = -c('ID', 'island'),
names_to = 'field',
values_to = 'value') %>%
ggplot() +
aes(x = value, fill = island) +
geom_density(alpha = 0.75) +
facet_wrap(~field, scales = 'free')
There is something interesting with bill_depth_mm on Torgersen island, but nothing appears dramatic enough for me to want to include island.
Based on this sample of birds, if we know the island a bird comes from, there are at most two species it can belong to. In the case of Torgersen island, we know it has to be Adelie. However, I want to build models that do not rely on island.
library(GGally)
penguins %>%
drop_na() %>%
dplyr::select(
species,
bill_length_mm,
bill_depth_mm,
flipper_length_mm,
body_mass_g) %>%
GGally::ggpairs(
progress = FALSE,
mapping = aes(
fill = species, color = species),
columns = setdiff(names(.), 'species'))
If you look at the distributions of bill_depth_mm and body_mass_g and their scatter plot, something really interesting pops out. There is a perfect linear separation of Gentoo from the other two species. It’s as if you could use the ratio of bill depth to body mass to predict whether an individual is Gentoo. Then, use a conditional logistic model to split Adelie from Chinstrap.
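One quick way to sanity-check that claim (a sketch, output omitted): with complete separation, a logistic fit on those two features should throw the classic "fitted probabilities numerically 0 or 1 occurred" warning rather than converge cleanly.

# If Gentoo is linearly separable on bill depth and body mass,
# glm() should warn about fitted probabilities of 0 or 1
# (complete separation)
sep_check <- penguins %>%
  drop_na() %>%
  mutate(is_gentoo = species == 'Gentoo')
glm(is_gentoo ~ bill_depth_mm + body_mass_g,
    data = sep_check, family = binomial)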
Based on the work above, I will not include island or sex in the models that follow. All four continuous variables, however, appear helpful in discriminating between the three species.
Define training and testing sets for all parts. There are 11 records with NAs – I’m going to ignore them since there are so few.
set.seed(0)
# TRUE = training set; expected to be ~70% of observations
L <- sample(
c(TRUE, FALSE),
size = nrow(penguins),
replace = TRUE,
prob = c(0.7, 0.3))
ptrain <- penguins %>% filter(L) %>%
drop_na() %>% dplyr::select(-island, -sex)
ptest <- penguins %>% filter(!L) %>%
drop_na() %>% dplyr::select(-island, -sex)
dim(ptrain)
## [1] 237 5
dim(ptest)
## [1] 96 5
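As an aside, this coin-flip split only gives approximately 70/30. An exact split would sample row indices instead; a sketch (not what I used above):

# Exact 70/30 alternative: sample row indices without
# replacement instead of flipping a coin per row
idx <- sample(nrow(penguins), size = floor(0.7 * nrow(penguins)))
ptrain2 <- penguins[idx, ]
ptest2 <- penguins[-idx, ]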
Use the MASS package to run LDA on the training data.
lda1 <- MASS::lda(species ~ ., data = ptrain)
print(lda1)
## Call:
## lda(species ~ ., data = ptrain)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4261603 0.2067511 0.3670886
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie 38.69604 18.16139 190.0396 3664.109
## Chinstrap 48.97347 18.60000 196.2245 3801.020
## Gentoo 47.68391 15.04023 217.4023 5091.954
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.094069290 0.457127654
## bill_depth_mm -1.077153563 0.048832353
## flipper_length_mm 0.081419728 -0.032188603
## body_mass_g 0.001402786 -0.001432835
##
## Proportion of trace:
## LD1 LD2
## 0.8526 0.1474
p1 <- predict(lda1)
table(p1$class, ptrain$species)
##
## Adelie Chinstrap Gentoo
## Adelie 100 1 0
## Chinstrap 1 48 0
## Gentoo 0 0 87
p1n <- predict(lda1, newdata = ptest)
table(p1n$class, ptest$species)
##
## Adelie Chinstrap Gentoo
## Adelie 45 2 0
## Chinstrap 0 17 0
## Gentoo 0 0 32
This model misclassifies 2 observations in the training data and 2 in the test data. The question I have is whether a more parsimonious model could do just as well. One thing we noticed in HW #1 was that the logistic model didn't converge when all four measurements were in the model; LDA doesn't have this problem. But a subset of the features worked just fine in the logistic model, so maybe it will here too. There are not many possible combinations of features, so I can check all of them.
# The four variables I care about
cols <- c(
"bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")
# Build a "masking" matrix: each row of b is one TRUE/FALSE
# pattern over the four columns; row 1 keeps all four features,
# row 16 is the empty model (skipped below)
a <- c(TRUE, FALSE)
b <- matrix(c(
  rep(a, each = 8, length.out = 16),
  rep(a, each = 4, length.out = 16),
  rep(a, each = 2, length.out = 16),
  rep(a, each = 1, length.out = 16)),
  nrow = 16)
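# An equivalent, more compact construction (a sketch; the row
# order differs because expand.grid varies its first column fastest):
b_alt <- as.matrix(expand.grid(rep(list(a), 4)))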
# Temp vectors to store output
train_error <- c()
test_error <- c()
# Iterate through all but 16th combo
for (i in 1:15) {
r <- b[i, ]
cols_i <- cols[r]
mod_i <- MASS::lda(
species ~ .,
data = ptrain[, c(cols_i, 'species')])
train_error[i] <- sum(predict(mod_i)$class != ptrain$species)
test_error[i] <- sum(predict(mod_i, newdata = ptest)$class != ptest$species)
}
# Prepare an exhibit
mod_form <- apply(b[1:15, ], 1, function(r) {
paste(cols[r], collapse = ' + ')
})
data.frame(
`Model Formula` = mod_form,
`Training Error Rate` = train_error / nrow(ptrain),
`Test Error Rate` = test_error / nrow(ptest),
check.names = FALSE) %>%
arrange(`Test Error Rate`) %>%
knitr::kable(digits = c(0, 3, 3))
| Model Formula | Training Error Rate | Test Error Rate |
|---|---|---|
| bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g | 0.008 | 0.021 |
| bill_length_mm + bill_depth_mm + body_mass_g | 0.008 | 0.021 |
| bill_length_mm + bill_depth_mm + flipper_length_mm | 0.004 | 0.031 |
| bill_length_mm + flipper_length_mm + body_mass_g | 0.021 | 0.062 |
| bill_length_mm + bill_depth_mm | 0.021 | 0.073 |
| bill_length_mm + flipper_length_mm | 0.034 | 0.073 |
| bill_length_mm + body_mass_g | 0.034 | 0.083 |
| bill_depth_mm + flipper_length_mm + body_mass_g | 0.177 | 0.146 |
| bill_depth_mm + flipper_length_mm | 0.177 | 0.156 |
| flipper_length_mm + body_mass_g | 0.203 | 0.167 |
| flipper_length_mm | 0.198 | 0.198 |
| bill_depth_mm + body_mass_g | 0.219 | 0.240 |
| bill_length_mm | 0.224 | 0.271 |
| bill_depth_mm | 0.274 | 0.271 |
| body_mass_g | 0.266 | 0.292 |
It appears that the model without flipper_length_mm performs just as well as the saturated model on this test set (both at 0.021). This is probably because flipper_length_mm and body_mass_g are highly correlated (see the pairs plot above).
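The correlation is easy to check; in the full penguins data it is roughly 0.87 (a quick sketch, output omitted):

# flipper_length_mm and body_mass_g are strongly correlated
# (about 0.87 in the full data), so dropping one costs little
cor(ptrain$flipper_length_mm, ptrain$body_mass_g)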
The MASS package also has a function for QDA. I can't imagine that QDA will perform better than the three-feature LDA model above. See the pairs plot – there is perfect linear separation between Gentoo and the other two species using body mass and bill depth, and the addition of bill length separates Adelie from Chinstrap.
qda2 <- MASS::qda(
species ~ bill_length_mm + bill_depth_mm + body_mass_g, data = ptrain)
print(qda2)
## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + body_mass_g, data = ptrain)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4261603 0.2067511 0.3670886
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g
## Adelie 38.69604 18.16139 3664.109
## Chinstrap 48.97347 18.60000 3801.020
## Gentoo 47.68391 15.04023 5091.954
p2 <- predict(qda2)
table(p2$class, ptrain$species)
##
## Adelie Chinstrap Gentoo
## Adelie 100 1 0
## Chinstrap 1 48 0
## Gentoo 0 0 87
p2n <- predict(qda2, newdata = ptest)
table(p2n$class, ptest$species)
##
## Adelie Chinstrap Gentoo
## Adelie 45 2 0
## Chinstrap 0 17 0
## Gentoo 0 0 32
The performance is identical! The QDA confusion matrices match the LDA ones on both the training and test data.
# Iterate through all but 16th combo
for (i in 1:15) {
r <- b[i, ]
cols_i <- cols[r]
mod_i <- MASS::qda(
species ~ .,
data = ptrain[, c(cols_i, 'species')])
train_error[i] <- sum(predict(mod_i)$class != ptrain$species)
test_error[i] <- sum(predict(mod_i, newdata = ptest)$class != ptest$species)
}
data.frame(
`Model Formula` = mod_form,
`Training Error Rate` = train_error / nrow(ptrain),
`Test Error Rate` = test_error / nrow(ptest),
check.names = FALSE) %>%
arrange(`Test Error Rate`) %>%
knitr::kable(digits = c(0, 3, 3))
| Model Formula | Training Error Rate | Test Error Rate |
|---|---|---|
| bill_length_mm + bill_depth_mm + body_mass_g | 0.008 | 0.021 |
| bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g | 0.008 | 0.031 |
| bill_length_mm + bill_depth_mm + flipper_length_mm | 0.008 | 0.031 |
| bill_length_mm + flipper_length_mm + body_mass_g | 0.021 | 0.062 |
| bill_length_mm + bill_depth_mm | 0.025 | 0.073 |
| bill_length_mm + flipper_length_mm | 0.030 | 0.073 |
| bill_length_mm + body_mass_g | 0.034 | 0.083 |
| bill_depth_mm + flipper_length_mm | 0.181 | 0.167 |
| bill_depth_mm + flipper_length_mm + body_mass_g | 0.152 | 0.177 |
| bill_depth_mm + body_mass_g | 0.207 | 0.198 |
| flipper_length_mm + body_mass_g | 0.190 | 0.198 |
| flipper_length_mm | 0.194 | 0.198 |
| bill_depth_mm | 0.249 | 0.240 |
| bill_length_mm | 0.224 | 0.260 |
| body_mass_g | 0.266 | 0.292 |
Because QDA performs no better than LDA here, we should prefer the simpler LDA.
library(e1071)
nb3 <- e1071::naiveBayes(species ~ ., data = ptrain)
print(nb3)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Adelie Chinstrap Gentoo
## 0.4261603 0.2067511 0.3670886
##
## Conditional probabilities:
## bill_length_mm
## Y [,1] [,2]
## Adelie 38.69604 2.613424
## Chinstrap 48.97347 2.891278
## Gentoo 47.68391 2.760345
##
## bill_depth_mm
## Y [,1] [,2]
## Adelie 18.16139 1.2084676
## Chinstrap 18.60000 1.0866616
## Gentoo 15.04023 0.9140327
##
## flipper_length_mm
## Y [,1] [,2]
## Adelie 190.0396 6.526746
## Chinstrap 196.2245 7.237821
## Gentoo 217.4023 6.282978
##
## body_mass_g
## Y [,1] [,2]
## Adelie 3664.109 457.9767
## Chinstrap 3801.020 352.9252
## Gentoo 5091.954 491.9038
p3 <- predict(nb3, ptrain)
table(p3, ptrain$species)
##
## p3 Adelie Chinstrap Gentoo
## Adelie 98 3 0
## Chinstrap 3 46 0
## Gentoo 0 0 87
p3n <- predict(nb3, newdata = ptest)
table(p3n, ptest$species)
##
## p3n Adelie Chinstrap Gentoo
## Adelie 43 3 0
## Chinstrap 2 16 0
## Gentoo 0 0 32
This doesn't perform as well as the others. That is because naive Bayes assumes the features are independent within each class, which is inappropriate for this data – the measurements are strongly correlated.
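One way to see the violation (a sketch, output omitted): compute the within-species correlation matrices, which naive Bayes implicitly assumes are diagonal.

# Naive Bayes assumes features are uncorrelated within each
# class; the within-species correlation matrices show otherwise
lapply(split(ptrain[, cols], ptrain$species), cor)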
# Iterate through all but 16th combo
for (i in 1:15) {
r <- b[i, ]
cols_i <- cols[r]
mod_i <- naiveBayes(
species ~ .,
data = ptrain[, c(cols_i, 'species')])
train_error[i] <- sum(predict(mod_i, ptrain) != ptrain$species)
test_error[i] <- sum(predict(mod_i, ptest) != ptest$species)
}
data.frame(
`Model Formula` = mod_form,
`Training Error Rate` = train_error / nrow(ptrain),
`Test Error Rate` = test_error / nrow(ptest),
check.names = FALSE) %>%
arrange(`Test Error Rate`) %>%
knitr::kable(digits = c(0, 3, 3))
| Model Formula | Training Error Rate | Test Error Rate |
|---|---|---|
| bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g | 0.025 | 0.052 |
| bill_length_mm + bill_depth_mm + flipper_length_mm | 0.021 | 0.052 |
| bill_length_mm + bill_depth_mm + body_mass_g | 0.021 | 0.062 |
| bill_length_mm + flipper_length_mm + body_mass_g | 0.034 | 0.062 |
| bill_length_mm + flipper_length_mm | 0.034 | 0.073 |
| bill_length_mm + body_mass_g | 0.059 | 0.104 |
| bill_length_mm + bill_depth_mm | 0.042 | 0.115 |
| flipper_length_mm + body_mass_g | 0.194 | 0.177 |
| flipper_length_mm | 0.194 | 0.198 |
| bill_depth_mm + flipper_length_mm + body_mass_g | 0.173 | 0.219 |
| bill_depth_mm + flipper_length_mm | 0.177 | 0.219 |
| bill_depth_mm + body_mass_g | 0.194 | 0.229 |
| bill_depth_mm | 0.249 | 0.240 |
| bill_length_mm | 0.224 | 0.260 |
| body_mass_g | 0.266 | 0.292 |
In this case, the algorithm suggests leaving off body mass (the three-feature model without it matches the saturated model's test error of 0.052), but even the best naive Bayes model does worse than LDA's 0.021.
My comments about the three model types are included above. In summary, LDA works best because of the natural linear separation between the species. Naive Bayes is probably not appropriate because of the strong multicollinearity in the X's. Just like in the logistic models from HW #1, we don't want to include both body mass and flipper length because they are very highly correlated. In the case of logistic regression, including both caused the algorithm to fail to converge. These model types do not have that problem, but parsimony is still preferred.