The penguins data has 344 rows and 8 columns. Read more about the data at the package website.
# Packages used throughout
library(dplyr)
library(tidyr)
library(ggplot2)
penguins <- palmerpenguins::penguins %>%
  dplyr::select(-year)
# Code for info is in appendix
info(penguins)
| column | type | min | max | zero_count | na_count | unique_values | levels |
|---|---|---|---|---|---|---|---|
| species | factor | NA | NA | NA | 0 | 3 | Adelie, Gentoo, Chinstrap |
| island | factor | NA | NA | NA | 0 | 3 | Biscoe, Dream, Torgersen |
| bill_length_mm | numeric | 32.1 | 59.6 | 0 | 2 | 164 | NA |
| bill_depth_mm | numeric | 13.1 | 21.5 | 0 | 2 | 80 | NA |
| flipper_length_mm | integer | 172.0 | 231.0 | 0 | 2 | 55 | NA |
| body_mass_g | integer | 2700.0 | 6300.0 | 0 | 2 | 94 | NA |
| sex | factor | NA | NA | NA | 11 | 2 | male, female |
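For reference, a minimal sketch of what an info()-style helper could look like; this is my reconstruction, and the actual appendix implementation may differ (for instance, in how it orders factor levels).

# Sketch of an info()-style summary; the appendix code may differ
info <- function(df) {
  rows <- lapply(names(df), function(nm) {
    x <- df[[nm]]
    num <- is.numeric(x)
    data.frame(
      column = nm,
      type = class(x)[1],
      min = if (num) min(x, na.rm = TRUE) else NA,
      max = if (num) max(x, na.rm = TRUE) else NA,
      zero_count = if (num) sum(x == 0, na.rm = TRUE) else NA,
      na_count = sum(is.na(x)),
      unique_values = length(unique(na.omit(x))),
      levels = if (is.factor(x)) paste(levels(x), collapse = ', ') else NA)
  })
  do.call(rbind, rows)
}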
The sex ratio is essentially 50/50 within each species (Gentoo is slightly male-heavy at 61 males to 58 females). The species representation in the sample, however, is not equal.
table(penguins[, c('species', 'sex')], useNA = "always")
## sex
## species female male <NA>
## Adelie 73 73 6
## Chinstrap 34 34 0
## Gentoo 58 61 5
## <NA> 0 0 0
table(penguins$species, useNA = "always") / nrow(penguins)
##
## Adelie Chinstrap Gentoo <NA>
## 0.4418605 0.1976744 0.3604651 0.0000000
Gentoo are all on Biscoe island, Chinstrap are all on Dream island, and Adelie can be found on all three. I am interested in whether the characteristics of the three populations of Adelie are different.
table(penguins[, c('species', 'island')], useNA = 'always')
## island
## species Biscoe Dream Torgersen <NA>
## Adelie 44 56 52 0
## Chinstrap 0 68 0 0
## Gentoo 124 0 0 0
## <NA> 0 0 0 0
adelie <- penguins %>% filter(species == 'Adelie')
table(adelie[, c('island', 'sex')], useNA = 'always')
## sex
## island female male <NA>
## Biscoe 22 22 0
## Dream 27 28 1
## Torgersen 24 23 5
## <NA> 0 0 0
adelie %>%
mutate(ID = 1:nrow(.)) %>%
drop_na() %>%
dplyr::select(-species, -sex) %>%
pivot_longer(
cols = -c('ID', 'island'),
names_to = 'field',
values_to = 'value') %>%
ggplot() +
aes(x = value, fill = island) +
geom_density(alpha = 0.75) +
facet_wrap(~field, scales = 'free')
There is something interesting with bill_depth_mm on Torgersen island, but nothing appears dramatic enough for me to want to include island.
Based on this sample of birds, if we know the island a bird comes from, there are at most two species it can belong to. In the case of Torgersen island, we know it has to be Adelie. However, I want to build models that do not rely on island.
library(GGally)
penguins %>%
drop_na() %>%
dplyr::select(
species,
bill_length_mm,
bill_depth_mm,
flipper_length_mm,
body_mass_g) %>%
GGally::ggpairs(
progress = FALSE,
mapping = aes(
fill = species, color = species),
columns = setdiff(names(.), 'species'))
If you look at the distributions of bill_depth_mm and body_mass_g and their scatter plot, something really interesting pops out. There is a perfect linear separation of Gentoo from the other two species. It’s as if you could use the ratio of bill depth to body mass to predict whether an individual is Gentoo. Then, use a conditional logistic model to split Adelie from Chinstrap.
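One quick way to sanity-check that claim (a sketch, output omitted): with complete separation, a logistic fit on those two features should throw the classic "fitted probabilities numerically 0 or 1 occurred" warning rather than converge cleanly.

# If Gentoo is linearly separable on bill depth and body mass,
# glm() should warn about fitted probabilities of 0 or 1
# (complete separation)
sep_check <- penguins %>%
  drop_na() %>%
  mutate(is_gentoo = species == 'Gentoo')
glm(is_gentoo ~ bill_depth_mm + body_mass_g,
    data = sep_check, family = binomial)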
Based on the work above, I will not include island or sex in the models that follow. All four continuous variables, however, appear helpful in discriminating between the three species.
Define training and testing sets for all parts. There are 11 records with NAs – I’m going to ignore them since there are so few.
set.seed(0)
# TRUE = training set; expected to be ~70% of observations
L <- sample(
c(TRUE, FALSE),
size = nrow(penguins),
replace = TRUE,
prob = c(0.7, 0.3))
ptrain <- penguins %>% filter(L) %>%
drop_na() %>% dplyr::select(-island, -sex)
ptest <- penguins %>% filter(!L) %>%
drop_na() %>% dplyr::select(-island, -sex)
dim(ptrain)
## [1] 237 5
dim(ptest)
## [1] 96 5
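As an aside, this coin-flip split only gives approximately 70/30. An exact split would sample row indices instead; a sketch (not what I used above):

# Exact 70/30 alternative: sample row indices without
# replacement instead of flipping a coin per row
idx <- sample(nrow(penguins), size = floor(0.7 * nrow(penguins)))
ptrain2 <- penguins[idx, ]
ptest2 <- penguins[-idx, ]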
Use the MASS package to run LDA on the training data.
lda1 <- MASS::lda(species ~ ., data = ptrain)
print(lda1)
## Call:
## lda(species ~ ., data = ptrain)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4261603 0.2067511 0.3670886
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie 38.69604 18.16139 190.0396 3664.109
## Chinstrap 48.97347 18.60000 196.2245 3801.020
## Gentoo 47.68391 15.04023 217.4023 5091.954
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.094069290 0.457127654
## bill_depth_mm -1.077153563 0.048832353
## flipper_length_mm 0.081419728 -0.032188603
## body_mass_g 0.001402786 -0.001432835
##
## Proportion of trace:
## LD1 LD2
## 0.8526 0.1474
p1 <- predict(lda1)
table(p1$class, ptrain$species)
##
## Adelie Chinstrap Gentoo
## Adelie 100 1 0
## Chinstrap 1 48 0
## Gentoo 0 0 87
p1n <- predict(lda1, newdata = ptest)
table(p1n$class, ptest$species)
##
## Adelie Chinstrap Gentoo
## Adelie 45 2 0
## Chinstrap 0 17 0
## Gentoo 0 0 32
This model misclassifies 2 observations in the training data and 2 in the test data. The question I have is whether a more parsimonious model could do just as well. One thing we noticed in HW #1 was that the logistic model didn't converge when all four measurements were in the model; LDA doesn't have this problem. But a subset of the features worked just fine in the logistic model, so maybe it will here too. There are not many possible combinations of features, so I can check all of them.
# The four variables I care about
cols <- c(
"bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")
# Build a "masking" matrix: each row of b is one TRUE/FALSE
# pattern over the four columns; row 1 keeps all four features,
# row 16 is the empty model (skipped below)
a <- c(TRUE, FALSE)
b <- matrix(c(
  rep(a, each = 8, length.out = 16),
  rep(a, each = 4, length.out = 16),
  rep(a, each = 2, length.out = 16),
  rep(a, each = 1, length.out = 16)),
  nrow = 16)
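# An equivalent, more compact construction (a sketch; the row
# order differs because expand.grid varies its first column fastest):
b_alt <- as.matrix(expand.grid(rep(list(a), 4)))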
# Temp vectors to store output
train_error <- c()
test_error <- c()
# Iterate through all but 16th combo
for (i in 1:15) {
r <- b[i, ]
cols_i <- cols[r]
mod_i <- MASS::lda(
species ~ .,
data = ptrain[, c(cols_i, 'species')])
train_error[i] <- sum(predict(mod_i)$class != ptrain$species)
test_error[i] <- sum(predict(mod_i, newdata = ptest)$class != ptest$species)
}
# Prepare an exhibit
mod_form <- apply(b[1:15, ], 1, function(r) {
paste(cols[r], collapse = ' + ')
})
data.frame(
`Model Formula` = mod_form,
`Training Error Rate` = train_error / nrow(ptrain),
`Test Error Rate` = test_error / nrow(ptest),
check.names = FALSE) %>%
arrange(`Test Error Rate`) %>%
knitr::kable(digits = c(0, 3, 3))
| Model Formula | Training Error Rate | Test Error Rate |
|---|---|---|
| bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g | 0.008 | 0.021 |
| bill_length_mm + bill_depth_mm + body_mass_g | 0.008 | 0.021 |
| bill_length_mm + bill_depth_mm + flipper_length_mm | 0.004 | 0.031 |
| bill_length_mm + flipper_length_mm + body_mass_g | 0.021 | 0.062 |
| bill_length_mm + bill_depth_mm | 0.021 | 0.073 |
| bill_length_mm + flipper_length_mm | 0.034 | 0.073 |
| bill_length_mm + body_mass_g | 0.034 | 0.083 |
| bill_depth_mm + flipper_length_mm + body_mass_g | 0.177 | 0.146 |
| bill_depth_mm + flipper_length_mm | 0.177 | 0.156 |
| flipper_length_mm + body_mass_g | 0.203 | 0.167 |
| flipper_length_mm | 0.198 | 0.198 |
| bill_depth_mm + body_mass_g | 0.219 | 0.240 |
| bill_length_mm | 0.224 | 0.271 |
| bill_depth_mm | 0.274 | 0.271 |
| body_mass_g | 0.266 | 0.292 |
It appears that the model without flipper_length_mm performs just as well as the saturated model on this test set (both at 0.021). This is probably because flipper_length_mm and body_mass_g are highly correlated (see the pairs plot above).
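The correlation is easy to check; in the full penguins data it is roughly 0.87 (a quick sketch, output omitted):

# flipper_length_mm and body_mass_g are strongly correlated
# (about 0.87 in the full data), so dropping one costs little
cor(ptrain$flipper_length_mm, ptrain$body_mass_g)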
The MASS package also has a function for QDA. I can't imagine that QDA will perform better than the three-feature LDA model above. See the pairs plot – there is perfect linear separation between Gentoo and the other two species using body mass and bill depth, and the addition of bill length separates Adelie from Chinstrap.
qda2 <- MASS::qda(
species ~ bill_length_mm + bill_depth_mm + body_mass_g, data = ptrain)
print(qda2)
## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + body_mass_g, data = ptrain)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4261603 0.2067511 0.3670886
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g
## Adelie 38.69604 18.16139 3664.109
## Chinstrap 48.97347 18.60000 3801.020
## Gentoo 47.68391 15.04023 5091.954
p2 <- predict(qda2)
table(p2$class, ptrain$species)
##
## Adelie Chinstrap Gentoo
## Adelie 100 1 0
## Chinstrap 1 48 0
## Gentoo 0 0 87
p2n <- predict(qda2, newdata = ptest)
table(p2n$class, ptest$species)
##
## Adelie Chinstrap Gentoo
## Adelie 45 2 0
## Chinstrap 0 17 0
## Gentoo 0 0 32
The performance is identical! The QDA confusion matrices match the LDA ones on both the training and test data.
# Iterate through all but 16th combo
for (i in 1:15) {
r <- b[i, ]
cols_i <- cols[r]
mod_i <- MASS::qda(
species ~ .,
data = ptrain[, c(cols_i, 'species')])
train_error[i] <- sum(predict(mod_i)$class != ptrain$species)
test_error[i] <- sum(predict(mod_i, newdata = ptest)$class != ptest$species)
}
data.frame(
`Model Formula` = mod_form,
`Training Error Rate` = train_error / nrow(ptrain),
`Test Error Rate` = test_error / nrow(ptest),
check.names = FALSE) %>%
arrange(`Test Error Rate`) %>%
knitr::kable(digits = c(0, 3, 3))
| Model Formula | Training Error Rate | Test Error Rate |
|---|---|---|
| bill_length_mm + bill_depth_mm + body_mass_g | 0.008 | 0.021 |
| bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g | 0.008 | 0.031 |
| bill_length_mm + bill_depth_mm + flipper_length_mm | 0.008 | 0.031 |
| bill_length_mm + flipper_length_mm + body_mass_g | 0.021 | 0.062 |
| bill_length_mm + bill_depth_mm | 0.025 | 0.073 |
| bill_length_mm + flipper_length_mm | 0.030 | 0.073 |
| bill_length_mm + body_mass_g | 0.034 | 0.083 |
| bill_depth_mm + flipper_length_mm | 0.181 | 0.167 |
| bill_depth_mm + flipper_length_mm + body_mass_g | 0.152 | 0.177 |
| bill_depth_mm + body_mass_g | 0.207 | 0.198 |
| flipper_length_mm + body_mass_g | 0.190 | 0.198 |
| flipper_length_mm | 0.194 | 0.198 |
| bill_depth_mm | 0.249 | 0.240 |
| bill_length_mm | 0.224 | 0.260 |
| body_mass_g | 0.266 | 0.292 |
Because QDA performs no better than LDA here, we should prefer the simpler LDA.
library(e1071)
nb3 <- e1071::naiveBayes(species ~ ., data = ptrain)
print(nb3)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Adelie Chinstrap Gentoo
## 0.4261603 0.2067511 0.3670886
##
## Conditional probabilities:
## bill_length_mm
## Y [,1] [,2]
## Adelie 38.69604 2.613424
## Chinstrap 48.97347 2.891278
## Gentoo 47.68391 2.760345
##
## bill_depth_mm
## Y [,1] [,2]
## Adelie 18.16139 1.2084676
## Chinstrap 18.60000 1.0866616
## Gentoo 15.04023 0.9140327
##
## flipper_length_mm
## Y [,1] [,2]
## Adelie 190.0396 6.526746
## Chinstrap 196.2245 7.237821
## Gentoo 217.4023 6.282978
##
## body_mass_g
## Y [,1] [,2]
## Adelie 3664.109 457.9767
## Chinstrap 3801.020 352.9252
## Gentoo 5091.954 491.9038
p3 <- predict(nb3, ptrain)
table(p3, ptrain$species)
##
## p3 Adelie Chinstrap Gentoo
## Adelie 98 3 0
## Chinstrap 3 46 0
## Gentoo 0 0 87
p3n <- predict(nb3, newdata = ptest)
table(p3n, ptest$species)
##
## p3n Adelie Chinstrap Gentoo
## Adelie 43 3 0
## Chinstrap 2 16 0
## Gentoo 0 0 32
This doesn't perform as well as the others. That is because naive Bayes assumes the features are independent within each class, which is inappropriate for this data – the measurements are strongly correlated.
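One way to see the violation (a sketch, output omitted): compute the within-species correlation matrices, which naive Bayes implicitly assumes are diagonal.

# Naive Bayes assumes features are uncorrelated within each
# class; the within-species correlation matrices show otherwise
lapply(split(ptrain[, cols], ptrain$species), cor)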
# Iterate through all but 16th combo
for (i in 1:15) {
r <- b[i, ]
cols_i <- cols[r]
mod_i <- naiveBayes(
species ~ .,
data = ptrain[, c(cols_i, 'species')])
train_error[i] <- sum(predict(mod_i, ptrain) != ptrain$species)
test_error[i] <- sum(predict(mod_i, ptest) != ptest$species)
}
data.frame(
`Model Formula` = mod_form,
`Training Error Rate` = train_error / nrow(ptrain),
`Test Error Rate` = test_error / nrow(ptest),
check.names = FALSE) %>%
arrange(`Test Error Rate`) %>%
knitr::kable(digits = c(0, 3, 3))
| Model Formula | Training Error Rate | Test Error Rate |
|---|---|---|
| bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g | 0.025 | 0.052 |
| bill_length_mm + bill_depth_mm + flipper_length_mm | 0.021 | 0.052 |
| bill_length_mm + bill_depth_mm + body_mass_g | 0.021 | 0.062 |
| bill_length_mm + flipper_length_mm + body_mass_g | 0.034 | 0.062 |
| bill_length_mm + flipper_length_mm | 0.034 | 0.073 |
| bill_length_mm + body_mass_g | 0.059 | 0.104 |
| bill_length_mm + bill_depth_mm | 0.042 | 0.115 |
| flipper_length_mm + body_mass_g | 0.194 | 0.177 |
| flipper_length_mm | 0.194 | 0.198 |
| bill_depth_mm + flipper_length_mm + body_mass_g | 0.173 | 0.219 |
| bill_depth_mm + flipper_length_mm | 0.177 | 0.219 |
| bill_depth_mm + body_mass_g | 0.194 | 0.229 |
| bill_depth_mm | 0.249 | 0.240 |
| bill_length_mm | 0.224 | 0.260 |
| body_mass_g | 0.266 | 0.292 |
In this case, the algorithm suggests leaving off body mass (the three-feature model without it matches the saturated model's test error of 0.052), but even the best naive Bayes model does worse than LDA's 0.021.
My comments about the three model types are included above. In summary, LDA works best because of the natural linear separation between the species. Naive Bayes is probably not appropriate because of the strong multicollinearity in the X's. Just like in the logistic models from HW #1, we don't want to include both body mass and flipper length because they are very highly correlated. In the case of logistic regression, including both caused the algorithm to fail to converge. These model types do not have that problem, but parsimony is still preferred.