This was the second assignment from the class data 622. We would use the “penguin” data set and apply three distinct classification algorithms to predict for “species”.
if(!require(pacman)){install.packages("pacman"); require(pacman)}
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 3.6.2
packages <- c('nnet', 'glue', 'broom', 'MASS', 'caret', 'InformationValue', 'Hmisc', 'kableExtra', 'corrplot', 'tidyverse', 'ROCR', 'palmerpenguins', 'mice', 'nortest', 'ellipse')
pacman::p_load(char = packages)
df = penguins
df %>% head() %>% kableExtra::kable()
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
We looked at the data type, number of missing and unique values for each variable. We would explore the data and also impute the data set before fitting the models.
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 344 obs. of 8 variables:
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
metadata = data.frame(colName = names(df)) %>%
dplyr::mutate(dtype = sapply(df, class),
missing = colSums(is.na(df)),
unique = sapply(df, function(x) length(unique(x))))
metadata %>% kableExtra::kable()
| colName | dtype | missing | unique |
|---|---|---|---|
| species | factor | 0 | 3 |
| island | factor | 0 | 3 |
| bill_length_mm | numeric | 2 | 165 |
| bill_depth_mm | numeric | 2 | 81 |
| flipper_length_mm | integer | 2 | 56 |
| body_mass_g | integer | 2 | 95 |
| sex | factor | 11 | 3 |
| year | integer | 0 | 3 |
We have 3 distinct species. Let’s focus on the four main physiological measures, i.e. “bill_length_mm”, “bill_depth_mm”, “flipper_length_mm”, and “body_mass_g”, and see how they differ.
bodyVars = metadata %>% dplyr::filter(unique >3) %>% .$colName %>% as.character
dfGather <- df %>%
dplyr::select(all_of(c("species", bodyVars))) %>%
dplyr::mutate(species = as.character(species)) %>%
tidyr::gather(key, value, -species)
ggplot(dfGather, aes(value, color = species)) +
geom_density() +
theme(legend.position = "top") +
geom_vline(data = aggregate(value ~ species + key, dfGather, median),
aes(xintercept = value,
color = species),
linetype = "dashed") +
facet_wrap(~ key, nrow = 5, scales = "free")
## Warning: Removed 8 rows containing non-finite values (stat_density).
While the above ggplot displayed the distribution of each continuous variable for the three species (and we could see that Adelie and Chinstrap are highly similar to each other), below feature plot further demonstrated how these four physiological measures related to each other, and how these measures could classify the species, e.g. Adelie and Chinstrap were almost identical in most of these measures (as the circles overlapped with each other).
dfTemp = df[complete.cases(df), ] %>% dplyr::select(species, bill_length_mm:body_mass_g)
caret::featurePlot(x = dfTemp %>% dplyr::select(-species),
y = dfTemp$species,
plot = "ellipse",
# add a key at the top
auto.key = list(columns = 3))
Next, we would have to impute the data set. We took out a useless variable “year” before modeling. Moreover, we would conduct a correlation matrix to see the correlation among the four physiological measures.
# impute missing values
init = mice(df, maxit = 0)
meth = init$method
predM = init$predictorMatrix
meth[c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")] = "norm"
meth[c("sex")] = "logreg"
set.seed(1234)
imputed <- mice(df, method = meth, predictorMatrix = predM, m = 5)
##
## iter imp variable
## 1 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
df2 <- complete(imputed) %>% dplyr::select(-year)
# double check and make sure no more missing values
colSums(is.na(df2))
## species island bill_length_mm bill_depth_mm
## 0 0 0 0
## flipper_length_mm body_mass_g sex
## 0 0 0
# df2 - ready to build model
df2 %>% head()
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.10000 18.70000 181.000 3750.000
## 2 Adelie Torgersen 39.50000 17.40000 186.000 3800.000
## 3 Adelie Torgersen 40.30000 18.00000 195.000 3250.000
## 4 Adelie Torgersen 44.08715 20.98134 191.335 4194.923
## 5 Adelie Torgersen 36.70000 19.30000 193.000 3450.000
## 6 Adelie Torgersen 39.30000 20.60000 190.000 3650.000
## sex
## 1 male
## 2 female
## 3 female
## 4 male
## 5 female
## 6 male
# let's take a look of the bodyVars after completing the whole data set
df2 %>%
dplyr::select(bodyVars) %>%
cor() %>%
corrplot(method = "number", type = "upper", order = "hclust")
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(bodyVars)` instead of `bodyVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
We would scale the data, and then conduct the Anderson Darling test for normality. We would split the data and set up a train control before building models. The four physiological measures had from mild to high correlation to each other. The “body_mass_g” was probably the only variable that had an underlying normal distribution for the three species, but the other three measures were close enough to normality (judging from above ggplot). We would do a 70/30 split, i.e. 70% of our data for training purpose, while the remaining 30% would be used for testing. In addition, we applied a 10-fold cross-validation for our model training exercise.
# scale data
dfScaled <- df2 %>% dplyr::mutate_at(vars(bodyVars), scale)
# Anderson Darling test for normality
ad.test.list = lapply(1:length(bodyVars), function(i){
aggregate(eval(sym(bodyVars[i])) ~ species, data = dfScaled, FUN = function(x) ad.test(x)$p.value) %>%
as.data.frame() %>%
dplyr::mutate(var = bodyVars[i]) %>%
dplyr::select(var, species, ad.test.p.value = V1)
})
names(ad.test.list) = bodyVars
ad.test.list %>% dplyr::bind_rows() %>%
dplyr::mutate(is_normal = dplyr::case_when(ad.test.p.value > .05 ~ 1, TRUE ~ 0))
## var species ad.test.p.value is_normal
## 1 bill_length_mm Adelie 0.705413971 1
## 2 bill_length_mm Chinstrap 0.038582632 0
## 3 bill_length_mm Gentoo 0.185708400 1
## 4 bill_depth_mm Adelie 0.074340606 1
## 5 bill_depth_mm Chinstrap 0.271415846 1
## 6 bill_depth_mm Gentoo 0.034553821 0
## 7 flipper_length_mm Adelie 0.367196463 1
## 8 flipper_length_mm Chinstrap 0.791306198 1
## 9 flipper_length_mm Gentoo 0.004220275 0
## 10 body_mass_g Adelie 0.111789540 1
## 11 body_mass_g Chinstrap 0.366284805 1
## 12 body_mass_g Gentoo 0.138980393 1
# split data
set.seed(1234)
index <- caret::createDataPartition(dfScaled$species, p = .7, list = FALSE)
trainSet <- dfScaled[index, ]
testSet <- dfScaled[-index, ]
# trainSetControl
trainSet.control <- caret::trainControl(method = "cv", number = 10, savePredictions = "final", classProbs = TRUE)
We achieved 100% accuracy!
fit_lda = caret::train(species~., data = trainSet, method = "lda",
trControl = trainSet.control, metric = "Accuracy")
fit_lda$pred %>% head()
## parameter pred obs Adelie Chinstrap Gentoo rowIndex Resample
## 1 none Adelie Adelie 0.9999991 9.114867e-07 2.988432e-22 7 Fold01
## 2 none Adelie Adelie 0.9968483 3.151695e-03 2.959611e-19 9 Fold01
## 3 none Adelie Adelie 0.9999959 4.117413e-06 9.578573e-20 12 Fold01
## 4 none Adelie Adelie 0.9998156 1.843733e-04 3.058554e-17 28 Fold01
## 5 none Adelie Adelie 1.0000000 4.759773e-10 4.677373e-22 37 Fold01
## 6 none Adelie Adelie 0.9999765 2.351812e-05 3.223636e-14 52 Fold01
pred_lda = predict(fit_lda, testSet)
caret::confusionMatrix(testSet$species, pred_lda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 45 0 0
## Chinstrap 0 20 0
## Gentoo 0 0 37
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9645, 1)
## No Information Rate : 0.4412
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.4412 0.1961 0.3627
## Detection Rate 0.4412 0.1961 0.3627
## Detection Prevalence 0.4412 0.1961 0.3627
## Balanced Accuracy 1.0000 1.0000 1.0000
We removed the “island” variable to resolve the rank deficiency problem and achieved 99% accuracy!
fit_qda = caret::train(species~., data = trainSet %>% dplyr::select(-island), method = "qda",
trControl = trainSet.control, metric = "Accuracy")
fit_qda$pred %>% head()
## parameter pred obs Adelie Chinstrap Gentoo rowIndex Resample
## 1 none Adelie Adelie 1.0000000 2.323055e-14 2.225360e-65 10 Fold01
## 2 none Adelie Adelie 0.9999928 7.158867e-06 7.131863e-37 12 Fold01
## 3 none Adelie Adelie 1.0000000 1.134423e-12 1.861348e-54 16 Fold01
## 4 none Adelie Adelie 0.9815971 1.840293e-02 3.813363e-34 20 Fold01
## 5 none Adelie Adelie 1.0000000 2.810009e-15 1.635939e-50 22 Fold01
## 6 none Adelie Adelie 1.0000000 1.471517e-10 2.310543e-33 63 Fold01
pred_qda = predict(fit_qda, testSet)
caret::confusionMatrix(testSet$species, pred_qda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 45 0 0
## Chinstrap 1 19 0
## Gentoo 0 0 37
##
## Overall Statistics
##
## Accuracy : 0.9902
## 95% CI : (0.9466, 0.9998)
## No Information Rate : 0.451
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9845
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9783 1.0000 1.0000
## Specificity 1.0000 0.9880 1.0000
## Pos Pred Value 1.0000 0.9500 1.0000
## Neg Pred Value 0.9825 1.0000 1.0000
## Prevalence 0.4510 0.1863 0.3627
## Detection Rate 0.4412 0.1863 0.3627
## Detection Prevalence 0.4412 0.1961 0.3627
## Balanced Accuracy 0.9891 0.9940 1.0000
We achieved 96% accuracy!
fit_nb = caret::train(species~., data = trainSet, method = "nb",
trControl = trainSet.control, metric = "Accuracy")
fit_nb$pred %>% head
## fL usekernel adjust pred obs Adelie Chinstrap Gentoo
## 1 0 TRUE 1 Adelie Adelie 0.998995823 1.004110e-03 6.724318e-08
## 2 0 TRUE 1 Adelie Adelie 0.985051595 1.494827e-02 1.321935e-07
## 3 0 TRUE 1 Adelie Adelie 0.974670355 2.532964e-02 8.196930e-12
## 4 0 TRUE 1 Chinstrap Chinstrap 0.002440201 9.975579e-01 1.934395e-06
## 5 0 TRUE 1 Adelie Adelie 0.999995559 4.440587e-06 6.252162e-14
## 6 0 TRUE 1 Adelie Adelie 0.975424705 2.457529e-02 1.157407e-10
## rowIndex Resample
## 1 46 Fold03
## 2 36 Fold01
## 3 25 Fold02
## 4 226 Fold01
## 5 49 Fold01
## 6 28 Fold06
pred_nb = predict(fit_nb, testSet)
caret::confusionMatrix(testSet$species, pred_nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 44 1 0
## Chinstrap 3 17 0
## Gentoo 0 0 37
##
## Overall Statistics
##
## Accuracy : 0.9608
## 95% CI : (0.9026, 0.9892)
## No Information Rate : 0.4608
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9378
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9362 0.9444 1.0000
## Specificity 0.9818 0.9643 1.0000
## Pos Pred Value 0.9778 0.8500 1.0000
## Neg Pred Value 0.9474 0.9878 1.0000
## Prevalence 0.4608 0.1765 0.3627
## Detection Rate 0.4314 0.1667 0.3627
## Detection Prevalence 0.4412 0.1961 0.3627
## Balanced Accuracy 0.9590 0.9544 1.0000
Overall, the three models performed really well. The LDA achieved 100% accuracy, whereas the QDA had 99% and NB had 96%.
The LDA tended to perform better than logistic regression when the classes were well separated. If the predictors had an approximately normal distribution for each class (such as the “body_mass_g” but the other three were also very close to normal distribution), than LDA tended to be more stable and performed better than logistic regression. Besides, LDA was more frequently used when there were more than 2 response classes. Overall, the LDA assumed that the data drew from a multivariate Gaussian distribution, with a class specific mean vector and common covariance matrix. It produced linear separation boundaries. On the other hand, QDA would be recommended if the training set was large or if the assumption of a common covariance matrix was not realistic. Unlike LDA, QDA considered each class had its own variance or covariance matrix rather than to have a common one. In this QDA model, a variable “island” had to be removed for solving a rank deficiency problem. The QDA led to a quadratic decision surface and still performed really well without the “island” variable. Lastly, the NB assumed conditional independence and therefore it would not perform as well when the continuous variables were highly correlated. It assumed Gaussian or normal distribution for the continuous variables. Although the NB model performed the worst among the three algorithms presented here, it still performed really well with 96% accuracy. Overall, the misclassification found in QDA and NB always happened between Adelie and Chinstrap. Gentoo always achieved 100% sensitivity and 100% specificity in all three models. That was not hard to explain after we looked at the above ggplot and feature plot. Gentoo could always be clearly separated from the other two based on the measures.
In reality, we always dealed with messy data and the response classes could not be so well separated or classified. The penguin data set was a very small data set which allowed us to explore various algorithms with all the variables. Since some of the physiological measures highly overlapped with some decision boundaries (which made it indistinguishable between Adelie and Chinstrap), we should avoid picking them. For example, we should just pick “body_mass_g” and “bill_length_mm” and ignore the other two. Nevertheless, including all variables still returned highly accurate results from these three classifiers.