Source Code: https://github.com/djlofland/DATA622_MachineLearning/tree/master/Homework2
Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html
We will be working with the Penguin dataset again as we did for Homework #1. Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.
Using the target variable, Species, please conduct:
You want to evaluate all the ‘features’ or independent variables and see what should be in your model. Please comment on your choices.
Just a suggestion: You might want to consider exploring featurePlot in the caret package. Basically, you look at each of the features/independent variables and see how they differ by species. Simply eye-balling this might give you an idea about which would be strong ‘classifiers’ (aka predictors).
Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
Look at the fit statistics/accuracy rates.
# Load dataframe
df <- penguins %>%
as_tibble()
# drop the year column
df <- df %>%
dplyr::select(-year)
# show summary info
skim(df)
| Name | df |
| Number of rows | 344 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
Our data set has the following independent features:
- island (Biscoe, Dream, or Torgersen)
- bill_length_mm and bill_depth_mm
- flipper_length_mm
- body_mass_g
- sex
- year … I assume this is the year the penguin was measured

Our dependent variable is species, a categorical variable with 3 distinct values: Adelie, Chinstrap, and Gentoo.
Per instructions, we will drop the year variable. In theory, if we expected penguin characteristics to be changing over time (weather, access to food, ecological problems, etc), then year would be a potentially useful feature. If we assume there are no dramatic time dependent shifts taking place, then year probably doesn’t provide any additional resolution toward prediction.
We know from the summary above that we have a few NAs. However, it is useful to know how those NAs cluster before making any decisions on how to handle them.
# show missing data
gg_miss_upset(df)
We have some NAs that need handling. Since 2 records are missing most of their data, I’ll drop those entirely. This leaves 9 records missing the sex feature. Since sex might be meaningful, we can either KNN impute or drop those rows. Since we only have a few such rows, I am going to drop them for convenience. If there had been a larger number, we would have to consider whether to drop the feature instead. If we chose to impute, the best strategy would be KNN imputing based on the other row features (sketched below); however, while imputing is a “go to” strategy, it does carry the risk of biasing our data by reducing variance. All impute techniques run this risk - when we fill in values with the mean, median, or an imputed value, we are artificially weighting our data. This can work in our favor, but can also work against us. As such, the decision to impute should be explored in more depth given the specific problem.
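As a sketch of the imputation route we did not take: caret’s knnImpute handles numeric columns only, so imputing a factor like sex would need something along the lines of the VIM package (an assumption for illustration; VIM is not used elsewhere in this project).

# Hypothetical alternative (not applied): KNN-impute the missing `sex`
# values from the k nearest rows instead of dropping them. VIM::kNN can
# impute factor columns; imp_var = FALSE suppresses the indicator column.
df_knn <- VIM::kNN(df, variable = "sex", k = 5, imp_var = FALSE)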
# remove rows with NAs
df <- na.omit(df)
Our island feature and species outcome are categorical - for machine learning, we really need everything coded as numeric. I will do a simple dummy encoding for island, which will generate 3 new binary columns, one for each island. Note, dummy coding can be done with or without an intercept; I’ll skip the intercept so we get 3 columns, not 2. For the species target, we prefer a single column. A typical strategy is to substitute a numeric code for each character value. Since we have 3 outcomes, we can map these to 1, 2 and 3.
# convert species to a factor
species <- as.factor(df$species)
# dummy encode our categorical `island`
dummies <- dummyVars(species ~ ., drop2nd=TRUE, data = df)
df <- data.frame(predict(dummies, newdata = df))
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'species' is not a factor
df$species <- species
skim(df)
| Name | df |
| Number of rows | 333 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| island.Biscoe | 0 | 1 | 0.49 | 0.50 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
| island.Dream | 0 | 1 | 0.37 | 0.48 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▅ |
| island.Torgersen | 0 | 1 | 0.14 | 0.35 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▁ |
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
| sex.female | 0 | 1 | 0.50 | 0.50 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
| sex.male | 0 | 1 | 0.50 | 0.50 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
rm(dummies)
rm(species)
Between-feature correlations are a huge problem for most regressions (there are other techniques for which multicollinearity is not an issue). The problem with multicollinearity is that our algorithms cannot tease out which variables are contributing to the outcome when those variables are correlated. When we see correlated features, we need to remove some or reach for a technique like PCA, which transforms the features into a new dimensional space with complete independence. The downside of PCA is that we lose all explainability of the features in our model. Typically we do a forward or reverse step-wise approach to select the minimal set of features that provides the best model performance.
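For reference, a minimal PCA sketch using caret’s preProcess (not applied in the models below; the column list is just the four numeric measurements):

# Sketch: rotate the numeric features into uncorrelated principal
# components; thresh = 0.95 keeps enough components to explain 95% of
# the variance. Interpretability of individual features is lost.
num_feats <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
pca_pp <- preProcess(df[, num_feats], method = c("center", "scale", "pca"), thresh = 0.95)
head(predict(pca_pp, df[, num_feats]))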
# Calculate and plot the Multicollinearity
correlation = cor(df[,-10], use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu"))
rm(correlation)
# Show feature correlations/target by decreasing correlation
df$speciesNumeric <- as.numeric(df$species)
stack(sort(cor(df[,1:9], df[, 11])[,], decreasing=TRUE))
##      values               ind
## 1 0.85073678 flipper_length_mm
## 2 0.75043358 body_mass_g
## 3 0.73054767 bill_length_mm
## 4 0.59652777 island.Biscoe
## 5 0.01096361 sex.male
## 6 -0.01096361 sex.female
## 7 -0.31536361 island.Dream
## 8 -0.41931810 island.Torgersen
## 9 -0.74034609 bill_depth_mm
We see some strong correlations between the features - this multicollinearity will be an issue for any standard linear regression approach. If we want the best model possible, we might need to consider an approach like PCA to reframe our features into a set that is independent. We also notice in this plot that bill_length_mm, body_mass_g, and flipper_length_mm are more strongly correlated with each other. We also see these features positively correlated with both Biscoe Island and being male. Next, we see that flipper_length_mm, body_mass_g, bill_length_mm, and island.Biscoe have the highest correlations with the target species.
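As a quick numeric screen of what the plot shows, caret’s findCorrelation flags features whose pairwise correlation exceeds a cutoff (the 0.75 below is an arbitrary illustrative choice, and this screen was not used for the feature selection that follows):

# Sketch: list features findCorrelation would suggest dropping at a
# pairwise-correlation cutoff of 0.75.
findCorrelation(cor(df[, 1:9]), cutoff = 0.75, names = TRUE)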
For regressions, we need to ensure there is a direct relationship between changes in each feature and our target outcome. Ideally, there should be a linear relationship. We can employ transformations (BoxCox, etc.) to help with variables that don’t show a linear relationship.
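As a sketch of that option (not applied here, since the models below performed well without it), caret’s preProcess can estimate a BoxCox transform for each strictly positive numeric feature:

# Sketch: estimate a BoxCox power transform per numeric feature; predict()
# then applies the estimated lambdas.
bc_feats <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
bc_pp <- preProcess(df[, bc_feats], method = "BoxCox")
head(predict(bc_pp, df[, bc_feats]))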
Now let’s explore correlations between features and with the target. Note I separate males and females into separate charts.
males <- df %>%
dplyr::filter(sex.male == 1) %>%
dplyr::select(-c(sex.male, sex.female))
featurePlot(x=males[,1:7], y=males[,8],
plot="pairs",
main = 'Male Penguin FeaturePlot',
auto.key=list(columns=3))
females <- df %>%
dplyr::filter(sex.female == 1) %>%
dplyr::select(-c(sex.male, sex.female))
featurePlot(x=females[,1:7], y=females[,8],
plot="pairs",
main = 'Female Penguin FeaturePlot',
auto.key=list(columns=3))
rm(males)
rm(females)
These paired feature plots illustrate how the different species form distinct clusters (different colors). Gentoo, for example, has a larger body and flipper length, but its bill is smaller in depth. By looking at pairs of features at a time, we can more easily see the feature separation. The island features (dummy variables) only take the values 0 or 1, but with those columns we can see which islands penguins are found on. Biscoe has Adelie and Gentoo, whereas Dream has Adelie and Chinstrap. On Torgersen, we only see Adelie.
We were not given a separate holdout group - so to better measure model performance against unseen samples, I’ll set aside 20% of our initial data as a holdout Test group. While this may slightly lower our training performance (fewer samples means less information captured by the model), the trade-off is that we can identify under- and over-fitting and better measure model performance.
# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df$species, p = .8, list = FALSE, times = 1)
pengTrain <- df[ trainIndex,]
pengTest <- df[-trainIndex,]
# Cross validation setup for caret modeling
ctrl <- trainControl(method = 'cv',
number = 10,
classProbs = T,
savePredictions = TRUE)
We train our model on the training data only, assess performance, then separately predict on the holdout test set and use that for the final model performance.
I only included a subset of features based on the EDA above: `island.Biscoe`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
lda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
data = pengTrain,
method="lda",
trControl = ctrl
)
print('Training Performance')
## [1] "Training Performance"
summary(lda_model)
## Length Class Mode
## prior 3 -none- numeric
## counts 3 -none- numeric
## means 15 -none- numeric
## scaling 10 -none- numeric
## lev 3 -none- character
## svd 2 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## xNames 5 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 3 -none- character
## param 0 -none- list
varImp(lda_model)
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## Adelie Chinstrap Gentoo
## island.Biscoe 0.00 100.00 100.00
## flipper_length_mm 99.75 91.61 99.75
## bill_length_mm 92.73 92.73 90.85
## bill_depth_mm 88.92 92.48 92.48
## body_mass_g 85.61 89.32 89.32
# --- Evaluate our model
train_pred <- as.factor(predict(lda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 116 2 0
## Chinstrap 1 53 0
## Gentoo 0 0 96
##
## Overall Statistics
##
## Accuracy : 0.9888
## 95% CI : (0.9676, 0.9977)
## No Information Rate : 0.4366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9825
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9915 0.9636 1.0000
## Specificity 0.9868 0.9953 1.0000
## Pos Pred Value 0.9831 0.9815 1.0000
## Neg Pred Value 0.9933 0.9907 1.0000
## Prevalence 0.4366 0.2052 0.3582
## Detection Rate 0.4328 0.1978 0.3582
## Detection Prevalence 0.4403 0.2015 0.3582
## Balanced Accuracy 0.9891 0.9795 1.0000
Overall accuracy was 0.9888, which on the surface is quite good, and the p-value shows the model is significantly better than the no-information rate. That said, I’ve included a confusion matrix and per-class Sensitivity and Specificity so we can better understand misclassifications. R uses a one-versus-rest approach to calculate these: Sensitivity is TP/(TP+FN) and Specificity is TN/(TN+FP). In the variable importance table, we can see which features were most useful for predicting each class. Generally the order of importance was the same across species, with island.Biscoe and flipper_length_mm being the most important features for classification. Notice that the bill_length_mm and bill_depth_mm importances were swapped for Gentoo relative to the other 2 species.
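To make the one-versus-rest arithmetic concrete, here is a worked check for the Chinstrap class against the training confusion matrix above, where TP = 53, FN = 2, FP = 1, and TN = 212:

# Worked one-vs-rest check for Chinstrap, matching the reported 0.9636
# Sensitivity and 0.9953 Specificity.
tp <- 53; fn <- 2; fp <- 1; tn <- 212
tp / (tp + fn)  # Sensitivity
tn / (tn + fp)  # Specificity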
print('Holdout Test Performance')
## [1] "Holdout Test Performance"
test_pred <- as.factor(predict(lda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(lda_values <- defaultSummary(modelValues))
## Accuracy Kappa
## 0.9846154 0.9759437
The performance against the holdout group is important to ensure our model generalizes (no overfitting) and can handle new data points not seen during training. Again, accuracy is quite good, and slightly lower (as we would expect with new data) at 0.9846. Again, I included a confusion matrix to understand how the model errs per class, along with the Sensitivity and Specificity per class.
Note, I had problems running straight qda() within caret and train(). So, I went ahead and used stepQDA, which applies stepwise feature selection and evaluation using QDA; a direct MASS::qda sketch is shown below for comparison.
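For reference, plain QDA can be fit directly with MASS::qda, bypassing caret; this sketch uses the same formula and data but is not the route used for the results that follow:

# Sketch: QDA via MASS. predict() returns a list whose $class element
# holds the predicted species; comparing to the truth gives accuracy.
qda_direct <- MASS::qda(species ~ island.Biscoe + bill_length_mm + bill_depth_mm +
                          flipper_length_mm + body_mass_g, data = pengTrain)
mean(predict(qda_direct, newdata = pengTest)$class == pengTest$species)  # holdout accuracy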
qda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
data = pengTrain,
method="stepQDA",
trControl = ctrl,
metric = "Accuracy"
)
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82617; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.94617; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.169
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81333; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.9585; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.00 0.00 0.16
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82167; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.9585; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.153
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81733; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.94583; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.154
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81717; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95417; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.153
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.818; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95033; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.153
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.8175; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95883; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.153
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.818; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95033; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.152
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 241 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82583; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.147
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 240 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.825; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.95417; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.147
## `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 268 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81752; in: "flipper_length_mm"; variables (1): flipper_length_mm
## correctness rate: 0.94786; in: "bill_length_mm"; variables (2): flipper_length_mm, bill_length_mm
##
## hr.elapsed min.elapsed sec.elapsed
## 0.000 0.000 0.151
summary(qda_model)
## Length Class Mode
## call 6 -none- call
## method 1 -none- character
## start.variables 0 -none- NULL
## process 4 data.frame list
## model 2 data.frame list
## result.pm 2 -none- numeric
## runtime 3 -none- numeric
## performance.measure 1 -none- character
## formula 3 formula call
## fit 8 qda list
## xNames 5 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 3 -none- character
## param 0 -none- list
varImp(qda_model)
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## Adelie Chinstrap Gentoo
## island.Biscoe 0.00 100.00 100.00
## flipper_length_mm 99.75 91.61 99.75
## bill_length_mm 92.73 92.73 90.85
## bill_depth_mm 88.92 92.48 92.48
## body_mass_g 85.61 89.32 89.32
# --- Evaluate our model
train_pred <- as.factor(predict(qda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 112 5 0
## Chinstrap 3 48 1
## Gentoo 2 2 95
##
## Overall Statistics
##
## Accuracy : 0.9515
## 95% CI : (0.9185, 0.9739)
## No Information Rate : 0.4366
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9239
##
## Mcnemar's Test P-Value : 0.418
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9573 0.8727 0.9896
## Specificity 0.9669 0.9812 0.9767
## Pos Pred Value 0.9573 0.9231 0.9596
## Neg Pred Value 0.9669 0.9676 0.9941
## Prevalence 0.4366 0.2052 0.3582
## Detection Rate 0.4179 0.1791 0.3545
## Detection Prevalence 0.4366 0.1940 0.3694
## Balanced Accuracy 0.9621 0.9270 0.9832
As with LDA above, I provide accuracy, a confusion matrix, per-class performance (Sensitivity and Specificity), and variable importance. Notice that our accuracy=0.9515 was lower than we saw with LDA. Unlike LDA, which correctly identified every Gentoo, QDA missed 5 times - four times by incorrectly predicting Gentoo (FP) and once where it failed to predict a Gentoo (FN). This led to lower Specificity and Sensitivity.
test_pred <- as.factor(predict(qda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 0 12 1
## Gentoo 1 0 22
##
## Overall Statistics
##
## Accuracy : 0.9538
## 95% CI : (0.871, 0.9904)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9274
##
## Mcnemar's Test P-Value : 0.3916
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 0.9565
## Specificity 0.9722 0.9808 0.9762
## Pos Pred Value 0.9655 0.9231 0.9565
## Neg Pred Value 0.9722 0.9808 0.9762
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3385
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 0.9664
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(qda_values <- defaultSummary(modelValues))
## Accuracy Kappa
## 0.9538462 0.9274013
Accuracy against the holdout group was similar to the training data at accuracy=0.9538. Again, in the confusion matrix we can see where the mistakes were made. There were fewer mistakes, but the holdout group only had 20% of the total samples. Please refer to the confusion matrix and per-class Sensitivity and Specificity metrics provided.
Before starting, keep in mind that a fundamental assumption for Naive Bayes (and why it’s “naive”) is independence of features. Multicollinearity is a problem and leads to poorer predictive power with this technique. We already know there are inherent feature correlations, so I would anticipate NB performing worse than the other modeling techniques; the sketch below shows where the independence assumption enters.
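To see where independence enters mechanically, here is a hypothetical hand-rolled Gaussian NB score for one test penguin using two features; each feature contributes its own normal density, multiplied together with the class prior (an illustration only, not how the nb method is implemented internally):

# Sketch: the posterior for each class is proportional to
# prior * product of per-feature Gaussian densities. Correlated features
# effectively get double-counted, which is the risk discussed above.
feats <- c("bill_length_mm", "flipper_length_mm")
x_new <- as.numeric(pengTest[1, feats])
scores <- sapply(levels(pengTrain$species), function(cl) {
  sub   <- pengTrain[pengTrain$species == cl, feats]
  prior <- mean(pengTrain$species == cl)
  prior * prod(dnorm(x_new, colMeans(sub), apply(sub, 2, sd)))
})
scores / sum(scores)  # normalized posterior probabilities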
nb_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
data = pengTrain,
method="nb",
trControl = ctrl
)
summary(nb_model)
## Length Class Mode
## apriori 3 table numeric
## tables 5 -none- list
## levels 3 -none- character
## call 7 -none- call
## x 5 data.frame list
## usekernel 1 -none- logical
## varnames 5 -none- character
## xNames 5 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 3 -none- character
## param 1 -none- list
varImp(nb_model)
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## Adelie Chinstrap Gentoo
## island.Biscoe 0.00 100.00 100.00
## flipper_length_mm 99.75 91.61 99.75
## bill_length_mm 92.73 92.73 90.85
## bill_depth_mm 88.92 92.48 92.48
## body_mass_g 85.61 89.32 89.32
# --- Evaluate our model
train_pred <- as.factor(predict(nb_model, newdata = pengTrain))
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 11
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 82
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 84
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 141
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 148
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 194
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 207
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 226
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 235
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 237
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 242
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 243
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 244
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 251
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 264
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 116 5 0
## Chinstrap 1 50 0
## Gentoo 0 0 96
##
## Overall Statistics
##
## Accuracy : 0.9776
## 95% CI : (0.9519, 0.9917)
## No Information Rate : 0.4366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9648
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9915 0.9091 1.0000
## Specificity 0.9669 0.9953 1.0000
## Pos Pred Value 0.9587 0.9804 1.0000
## Neg Pred Value 0.9932 0.9770 1.0000
## Prevalence 0.4366 0.2052 0.3582
## Detection Rate 0.4328 0.1866 0.3582
## Detection Prevalence 0.4515 0.1903 0.3582
## Balanced Accuracy 0.9792 0.9522 1.0000
Naive Bayes had an accuracy=0.9776, which isn’t too bad. The provided confusion matrix and per-class Sensitivity and Specificity show that it correctly identified every Gentoo, but missed a few Adelie and Chinstrap predictions.
test_pred <- as.factor(predict(nb_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 3 0
## Chinstrap 1 10 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9385
## 95% CI : (0.8499, 0.983)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.902
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.7692 1.0000
## Specificity 0.9167 0.9808 1.0000
## Pos Pred Value 0.9032 0.9091 1.0000
## Neg Pred Value 0.9706 0.9444 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1538 0.3538
## Detection Prevalence 0.4769 0.1692 0.3538
## Balanced Accuracy 0.9411 0.8750 1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(nb_values <- defaultSummary(modelValues))
## Accuracy Kappa
## 0.9384615 0.9020347
Against the holdout data, accuracy was lower (0.9385), suggesting that the model didn’t generalize as well to new data and may have overfit during training. The model performed well at identifying Gentoo, but notice Chinstrap Sensitivity was quite low (0.7692): it only predicted 10/13 of the Chinstrap penguins. The model had trouble differentiating Adelie and Chinstrap, and our corresponding Adelie Specificity was lower (33/36 = 0.9167) since it incorrectly predicted 3 Chinstraps as Adelie. Here we see the usefulness of Sensitivity and Specificity.
(models_summary <- rbind(lda_values, qda_values, nb_values))
## Accuracy Kappa
## lda_values 0.9846154 0.9759437
## qda_values 0.9538462 0.9274013
## nb_values 0.9384615 0.9020347
This summary compares just the holdout accuracy of the 3 modeling approaches. All three models performed extremely well, with LDA doing best and Naive Bayes worst. I would note that the classes had fairly good separation, which probably led to the good LDA performance. Naive Bayes makes an assumption of independence of features, which might explain its poorer performance; we didn’t do anything to handle multicollinearity. These models performed sufficiently well that I did not go back and do further feature cleanup. If model performance had been lower, I would have considered normalizing my features (center, scale, and potentially BoxCox), but that just wasn’t necessary. As a side note, given how well these models performed, it might be worth learning these modeling techniques with a different data set where we do see more differences and can relate those to how the models work and why we’d choose one over another.
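As a follow-up sketch, caret’s resamples() could compare the three models across their cross-validation folds rather than on a single 20% holdout split (this assumes all three models share a comparable resampling scheme, as they do here with 10-fold CV):

# Sketch: pool the per-fold CV accuracies from each caret model and
# summarize them side by side.
cv_results <- resamples(list(LDA = lda_model, QDA = qda_model, NB = nb_model))
summary(cv_results)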