load packages, data

This was the second assignment from the class data 622. We would use the “penguin” data set and apply three distinct classification algorithms to predict for “species”.

if(!require(pacman)){install.packages("pacman"); require(pacman)}
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 3.6.2
packages <- c('nnet', 'glue', 'broom', 'MASS', 'caret', 'InformationValue', 'Hmisc', 'kableExtra', 'corrplot', 'tidyverse', 'ROCR', 'palmerpenguins', 'mice', 'nortest', 'ellipse')
pacman::p_load(char = packages)

df = penguins

df %>% head() %>% kableExtra::kable()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

data exploration : metadata

We looked at the data type, number of missing and unique values for each variable. We would explore the data and also impute the data set before fitting the models.

str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

metadata = data.frame(colName = names(df)) %>%
    dplyr::mutate(dtype = sapply(df, class),
                  missing = colSums(is.na(df)),
                  unique = sapply(df, function(x) length(unique(x))))

metadata %>% kableExtra::kable()
colName dtype missing unique
species factor 0 3
island factor 0 3
bill_length_mm numeric 2 165
bill_depth_mm numeric 2 81
flipper_length_mm integer 2 56
body_mass_g integer 2 95
sex factor 11 3
year integer 0 3

data exploration : physiological measures

We have 3 distinct species. Let’s focus on the four main physiological measures, i.e. “bill_length_mm”, “bill_depth_mm”, “flipper_length_mm”, and “body_mass_g”, and see how they differ.

bodyVars = metadata %>% dplyr::filter(unique >3) %>% .$colName %>% as.character

dfGather <- df %>%
    dplyr::select(all_of(c("species", bodyVars))) %>%
    dplyr::mutate(species = as.character(species)) %>%
    tidyr::gather(key, value, -species)

ggplot(dfGather, aes(value, color = species)) +
    geom_density() +
    theme(legend.position = "top") +
    geom_vline(data = aggregate(value ~ species + key, dfGather, median), 
               aes(xintercept = value,
                   color = species),
               linetype = "dashed") +
    facet_wrap(~ key, nrow = 5, scales = "free")
## Warning: Removed 8 rows containing non-finite values (stat_density).

data exploration : feature plot

While the above ggplot displayed the distribution of each continuous variable for the three species (and we could see that Adelie and Chinstrap are highly similar to each other), below feature plot further demonstrated how these four physiological measures related to each other, and how these measures could classify the species, e.g. Adelie and Chinstrap were almost identical in most of these measures (as the circles overlapped with each other).

dfTemp = df[complete.cases(df), ] %>% dplyr::select(species, bill_length_mm:body_mass_g)

caret::featurePlot(x = dfTemp %>% dplyr::select(-species),
                   y = dfTemp$species,
                   plot = "ellipse",
                   # add a key at the top
                   auto.key = list(columns = 3))

data cleanup : impute missing values, minus the variable “year”

Next, we would have to impute the data set. We took out a useless variable “year” before modeling. Moreover, we would conduct a correlation matrix to see the correlation among the four physiological measures.

# impute missing values
init = mice(df, maxit = 0) 
meth = init$method
predM = init$predictorMatrix
meth[c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")] = "norm" 
meth[c("sex")] = "logreg" 

set.seed(1234)
imputed <- mice(df, method = meth, predictorMatrix = predM, m = 5)
## 
##  iter imp variable
##   1   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
df2 <- complete(imputed) %>% dplyr::select(-year)

# double check and make sure no more missing values
colSums(is.na(df2))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 0                 0 
## flipper_length_mm       body_mass_g               sex 
##                 0                 0                 0

# df2 - ready to build model
df2 %>% head()
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen       39.10000      18.70000           181.000    3750.000
## 2  Adelie Torgersen       39.50000      17.40000           186.000    3800.000
## 3  Adelie Torgersen       40.30000      18.00000           195.000    3250.000
## 4  Adelie Torgersen       44.08715      20.98134           191.335    4194.923
## 5  Adelie Torgersen       36.70000      19.30000           193.000    3450.000
## 6  Adelie Torgersen       39.30000      20.60000           190.000    3650.000
##      sex
## 1   male
## 2 female
## 3 female
## 4   male
## 5 female
## 6   male

# let's take a look of the bodyVars after completing the whole data set
df2 %>%
        dplyr::select(bodyVars) %>%
        cor() %>%
        corrplot(method = "number", type = "upper", order = "hclust")
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(bodyVars)` instead of `bodyVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

data prep : scale data, Anderson Darling test for normality, split data

We would scale the data, and then conduct the Anderson Darling test for normality. We would split the data and set up a train control before building models. The four physiological measures had from mild to high correlation to each other. The “body_mass_g” was probably the only variable that had an underlying normal distribution for the three species, but the other three measures were close enough to normality (judging from above ggplot). We would do a 70/30 split, i.e. 70% of our data for training purpose, while the remaining 30% would be used for testing. In addition, we applied a 10-fold cross-validation for our model training exercise.

# scale data
dfScaled <- df2 %>% dplyr::mutate_at(vars(bodyVars), scale)

# Anderson Darling test for normality
ad.test.list = lapply(1:length(bodyVars), function(i){
    aggregate(eval(sym(bodyVars[i])) ~ species, data = dfScaled, FUN = function(x) ad.test(x)$p.value) %>%
        as.data.frame() %>%
        dplyr::mutate(var = bodyVars[i]) %>%
        dplyr::select(var, species, ad.test.p.value = V1)
})

names(ad.test.list) = bodyVars

ad.test.list %>% dplyr::bind_rows() %>%
    dplyr::mutate(is_normal = dplyr::case_when(ad.test.p.value > .05 ~ 1, TRUE ~ 0))
##                  var   species ad.test.p.value is_normal
## 1     bill_length_mm    Adelie     0.705413971         1
## 2     bill_length_mm Chinstrap     0.038582632         0
## 3     bill_length_mm    Gentoo     0.185708400         1
## 4      bill_depth_mm    Adelie     0.074340606         1
## 5      bill_depth_mm Chinstrap     0.271415846         1
## 6      bill_depth_mm    Gentoo     0.034553821         0
## 7  flipper_length_mm    Adelie     0.367196463         1
## 8  flipper_length_mm Chinstrap     0.791306198         1
## 9  flipper_length_mm    Gentoo     0.004220275         0
## 10       body_mass_g    Adelie     0.111789540         1
## 11       body_mass_g Chinstrap     0.366284805         1
## 12       body_mass_g    Gentoo     0.138980393         1

# split data
set.seed(1234)
index <- caret::createDataPartition(dfScaled$species, p = .7, list = FALSE)
trainSet <- dfScaled[index, ]
testSet <- dfScaled[-index, ]

# trainSetControl
trainSet.control <- caret::trainControl(method = "cv", number = 10, savePredictions = "final", classProbs = TRUE)

Linear Discriminant Analysis

We achieved 100% accuracy!

fit_lda = caret::train(species~., data = trainSet, method = "lda", 
                       trControl = trainSet.control, metric = "Accuracy")

fit_lda$pred %>% head()
##   parameter   pred    obs    Adelie    Chinstrap       Gentoo rowIndex Resample
## 1      none Adelie Adelie 0.9999991 9.114867e-07 2.988432e-22        7   Fold01
## 2      none Adelie Adelie 0.9968483 3.151695e-03 2.959611e-19        9   Fold01
## 3      none Adelie Adelie 0.9999959 4.117413e-06 9.578573e-20       12   Fold01
## 4      none Adelie Adelie 0.9998156 1.843733e-04 3.058554e-17       28   Fold01
## 5      none Adelie Adelie 1.0000000 4.759773e-10 4.677373e-22       37   Fold01
## 6      none Adelie Adelie 0.9999765 2.351812e-05 3.223636e-14       52   Fold01

pred_lda = predict(fit_lda, testSet)

caret::confusionMatrix(testSet$species, pred_lda)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        45         0      0
##   Chinstrap      0        20      0
##   Gentoo         0         0     37
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9645, 1)
##     No Information Rate : 0.4412     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           1.0000        1.0000
## Specificity                 1.0000           1.0000        1.0000
## Pos Pred Value              1.0000           1.0000        1.0000
## Neg Pred Value              1.0000           1.0000        1.0000
## Prevalence                  0.4412           0.1961        0.3627
## Detection Rate              0.4412           0.1961        0.3627
## Detection Prevalence        0.4412           0.1961        0.3627
## Balanced Accuracy           1.0000           1.0000        1.0000

Quadratic Discriminant Analysis

We removed the “island” variable to resolve the rank deficiency problem and achieved 99% accuracy!

fit_qda = caret::train(species~., data = trainSet %>% dplyr::select(-island), method = "qda", 
                       trControl = trainSet.control, metric = "Accuracy")

fit_qda$pred %>% head()
##   parameter   pred    obs    Adelie    Chinstrap       Gentoo rowIndex Resample
## 1      none Adelie Adelie 1.0000000 2.323055e-14 2.225360e-65       10   Fold01
## 2      none Adelie Adelie 0.9999928 7.158867e-06 7.131863e-37       12   Fold01
## 3      none Adelie Adelie 1.0000000 1.134423e-12 1.861348e-54       16   Fold01
## 4      none Adelie Adelie 0.9815971 1.840293e-02 3.813363e-34       20   Fold01
## 5      none Adelie Adelie 1.0000000 2.810009e-15 1.635939e-50       22   Fold01
## 6      none Adelie Adelie 1.0000000 1.471517e-10 2.310543e-33       63   Fold01

pred_qda = predict(fit_qda, testSet)

caret::confusionMatrix(testSet$species, pred_qda)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        45         0      0
##   Chinstrap      1        19      0
##   Gentoo         0         0     37
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9902          
##                  95% CI : (0.9466, 0.9998)
##     No Information Rate : 0.451           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9845          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9783           1.0000        1.0000
## Specificity                 1.0000           0.9880        1.0000
## Pos Pred Value              1.0000           0.9500        1.0000
## Neg Pred Value              0.9825           1.0000        1.0000
## Prevalence                  0.4510           0.1863        0.3627
## Detection Rate              0.4412           0.1863        0.3627
## Detection Prevalence        0.4412           0.1961        0.3627
## Balanced Accuracy           0.9891           0.9940        1.0000

Naïve Bayes

We achieved 96% accuracy!

fit_nb = caret::train(species~., data = trainSet, method = "nb", 
                      trControl = trainSet.control, metric = "Accuracy")

fit_nb$pred %>% head
##   fL usekernel adjust      pred       obs      Adelie    Chinstrap       Gentoo
## 1  0      TRUE      1    Adelie    Adelie 0.998995823 1.004110e-03 6.724318e-08
## 2  0      TRUE      1    Adelie    Adelie 0.985051595 1.494827e-02 1.321935e-07
## 3  0      TRUE      1    Adelie    Adelie 0.974670355 2.532964e-02 8.196930e-12
## 4  0      TRUE      1 Chinstrap Chinstrap 0.002440201 9.975579e-01 1.934395e-06
## 5  0      TRUE      1    Adelie    Adelie 0.999995559 4.440587e-06 6.252162e-14
## 6  0      TRUE      1    Adelie    Adelie 0.975424705 2.457529e-02 1.157407e-10
##   rowIndex Resample
## 1       46   Fold03
## 2       36   Fold01
## 3       25   Fold02
## 4      226   Fold01
## 5       49   Fold01
## 6       28   Fold06

pred_nb = predict(fit_nb, testSet)

caret::confusionMatrix(testSet$species, pred_nb)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        44         1      0
##   Chinstrap      3        17      0
##   Gentoo         0         0     37
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9608          
##                  95% CI : (0.9026, 0.9892)
##     No Information Rate : 0.4608          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9378          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9362           0.9444        1.0000
## Specificity                 0.9818           0.9643        1.0000
## Pos Pred Value              0.9778           0.8500        1.0000
## Neg Pred Value              0.9474           0.9878        1.0000
## Prevalence                  0.4608           0.1765        0.3627
## Detection Rate              0.4314           0.1667        0.3627
## Detection Prevalence        0.4412           0.1961        0.3627
## Balanced Accuracy           0.9590           0.9544        1.0000

Conclusion

Overall, the three models performed really well. The LDA achieved 100% accuracy, whereas the QDA had 99% and NB had 96%.

The LDA tended to perform better than logistic regression when the classes were well separated. If the predictors had an approximately normal distribution for each class (such as the “body_mass_g” but the other three were also very close to normal distribution), than LDA tended to be more stable and performed better than logistic regression. Besides, LDA was more frequently used when there were more than 2 response classes. Overall, the LDA assumed that the data drew from a multivariate Gaussian distribution, with a class specific mean vector and common covariance matrix. It produced linear separation boundaries. On the other hand, QDA would be recommended if the training set was large or if the assumption of a common covariance matrix was not realistic. Unlike LDA, QDA considered each class had its own variance or covariance matrix rather than to have a common one. In this QDA model, a variable “island” had to be removed for solving a rank deficiency problem. The QDA led to a quadratic decision surface and still performed really well without the “island” variable. Lastly, the NB assumed conditional independence and therefore it would not perform as well when the continuous variables were highly correlated. It assumed Gaussian or normal distribution for the continuous variables. Although the NB model performed the worst among the three algorithms presented here, it still performed really well with 96% accuracy. Overall, the misclassification found in QDA and NB always happened between Adelie and Chinstrap. Gentoo always achieved 100% sensitivity and 100% specificity in all three models. That was not hard to explain after we looked at the above ggplot and feature plot. Gentoo could always be clearly separated from the other two based on the measures.

In reality, we always dealed with messy data and the response classes could not be so well separated or classified. The penguin data set was a very small data set which allowed us to explore various algorithms with all the variables. Since some of the physiological measures highly overlapped with some decision boundaries (which made it indistinguishable between Adelie and Chinstrap), we should avoid picking them. For example, we should just pick “body_mass_g” and “bill_length_mm” and ignore the other two. Nevertheless, including all variables still returned highly accurate results from these three classifiers.