Source Code: https://github.com/djlofland/DATA622_MachineLearning/tree/master/Homework2

Instructions

Let’s use the Penguin dataset for our assignment. To learn more about the dataset, please visit: https://allisonhorst.github.io/palmerpenguins/articles/intro.html

We will be working with the Penguin dataset again as we did for Homework #1. Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.

Using the target variable, Species, please conduct:

  1. Linear Discriminant Analysis (30 points):
     a. You want to evaluate all the ‘features’ or independent variables and see what should be in your model. Please comment on your choices.
     b. Just a suggestion: you might want to consider exploring featurePlot in the caret package. Basically, you look at each of the features/independent variables and see how they differ by species. Simply eyeballing this might give you an idea about which would be strong ‘classifiers’ (aka predictors).
     c. Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
     d. Look at the fit statistics/accuracy rates.
  2. Quadratic Discriminant Analysis (30 points): same steps as above to consider.
  3. Naïve Bayes (30 points): same steps as above to consider.
  4. Comment on the model fits/strengths/weaknesses/accuracy for all three models that you worked with. (10 points)

Load Data & EDA

# Load dataframe
df <- penguins %>% 
  as_tibble()

# drop the year column
df <- df %>% 
  dplyr::select(-year)

# show summary info
skim(df)
Data summary
Name df
Number of rows 344
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂

Plan of Attack

Our data set has the following independent features:

  • Geography - island (Biscoe, Dream or Torgersen)
  • Bill Dimensions - bill_length_mm and bill_depth_mm
  • Flipper Dimensions - flipper_length_mm
  • Overall Weight - body_mass_g
  • Gender - sex
  • Year - year … I assume this is the year the penguin was measured

Our dependent variable is species, a categorical variable with 3 distinct values: Adelie, Chinstrap, and Gentoo.

Per instructions, we will drop the year variable. In theory, if we expected penguin characteristics to be changing over time (weather, access to food, ecological problems, etc.), then year would be a potentially useful feature. If we assume there are no dramatic time-dependent shifts taking place, then year probably doesn’t provide any additional resolution toward prediction.

Handle NAs

We know from the summary above that we have a few NAs. However, it is useful to know how those NAs cluster before making any decisions on how to handle them.

# show missing data
gg_miss_upset(df)

We have some NAs that need handling. Since 2 records are missing most of their data, I’ll drop those entirely. This leaves 9 records missing the sex feature. Since sex might be meaningful, we can either KNN impute or drop those rows. Since we only have a few rows, I am going to drop them for convenience. If there had been a larger number, we would have to consider whether to drop the feature instead. If we chose to impute, the best strategy would be KNN imputing based on the other features in each row; however, while imputing is a “go to” strategy, it does carry the risk of biasing our data by reducing variance. All imputation techniques run this risk - when we fill in values with the mean, the median, or an imputed estimate, we are artificially weighting our data. This can work in our favor, but can also work against us. As such, the decision to impute should be explored in more depth for the specific problem at hand.

# remove rows with NAs
df <- na.omit(df)
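
If we had instead chosen to impute, a KNN-based fill of the missing sex values might look like the sketch below (not run here; it assumes the VIM package, a common choice for KNN imputation of factors, is installed):

# Sketch (not run): KNN-impute the missing `sex` values instead of dropping
# the rows. VIM::kNN fills each NA from its k nearest neighbors based on the
# remaining features. Assumes the VIM package is available.
library(VIM)

df_imputed <- penguins %>% 
  as_tibble() %>% 
  dplyr::select(-year) %>% 
  VIM::kNN(variable = "sex", k = 5, imp_var = FALSE)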

Dummy Code Categorical

Our island feature and species outcome are categorical - for machine learning, we really need everything coded as numeric. I will do a simple dummy encoding for island, which will generate 3 new binary columns, one for each island. Note that dummy coding can be done with or without dropping a reference level; I’ll keep all levels, so we get 3 columns, not 2. For the species target, we prefer a single column, so I’ll keep it as a factor (later, for correlation checks, its 3 levels can be mapped to 1, 2 and 3).

# convert species to a factor
species <- as.factor(df$species)

# dummy encode our categorical features (`island` and `sex`); the default
# fullRank = FALSE keeps every level: 3 island columns and 2 sex columns
dummies <- dummyVars(species ~ ., data = df)
df <- data.frame(predict(dummies, newdata = df))
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'species' is not a factor
df$species <- species

skim(df)
Data summary
Name df
Number of rows 333
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1 FALSE 3 Ade: 146, Gen: 119, Chi: 68

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
island.Biscoe 0 1 0.49 0.50 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▇
island.Dream 0 1 0.37 0.48 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▅
island.Torgersen 0 1 0.14 0.35 0.0 0.0 0.0 0.0 1.0 ▇▁▁▁▁
bill_length_mm 0 1 43.99 5.47 32.1 39.5 44.5 48.6 59.6 ▃▇▇▆▁
bill_depth_mm 0 1 17.16 1.97 13.1 15.6 17.3 18.7 21.5 ▅▆▇▇▂
flipper_length_mm 0 1 200.97 14.02 172.0 190.0 197.0 213.0 231.0 ▂▇▃▅▃
body_mass_g 0 1 4207.06 805.22 2700.0 3550.0 4050.0 4775.0 6300.0 ▃▇▅▃▂
sex.female 0 1 0.50 0.50 0.0 0.0 0.0 1.0 1.0 ▇▁▁▁▇
sex.male 0 1 0.50 0.50 0.0 0.0 1.0 1.0 1.0 ▇▁▁▁▇
rm(dummies)
rm(species)

Multicollinearity

Between-feature correlations are a huge problem for most regressions (there are other techniques for which multicollinearity is not an issue). The problem with multicollinearity is that our algorithms cannot tease out which variables are contributing to the outcome when they are correlated. When we see correlated features, we either remove some of them or reach for a technique like PCA, which transforms the features into a new dimensional space where the components are mutually independent. The trade-off with PCA is that we lose the explainability of the features in our model. Typically we use a forward or backward stepwise approach to select the minimal set of features that provides the best model performance.
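
As a sketch of that PCA option (not run here), caret’s preProcess can rotate the numeric features into uncorrelated components:

# Sketch (not run): rotate the numeric features into uncorrelated principal
# components retaining 95% of the variance. `thresh` is caret's cumulative
# variance cutoff for how many components to keep.
pca_prep <- preProcess(df[, c('bill_length_mm', 'bill_depth_mm',
                              'flipper_length_mm', 'body_mass_g')],
                       method = c('center', 'scale', 'pca'),
                       thresh = 0.95)
df_pca <- predict(pca_prep, df)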

# Calculate and plot the Multicollinearity
correlation = cor(df[,-10], use = 'pairwise.complete.obs')

corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
         col=brewer.pal(n=8, name="RdYlBu"))

rm(correlation)
# Show feature correlations/target by decreasing correlation
df$speciesNumeric <- as.numeric(df$species)
stack(sort(cor(df[,1:9], df[, 11])[,], decreasing=TRUE))
##        values               ind
## 1  0.85073678 flipper_length_mm
## 2  0.75043358       body_mass_g
## 3  0.73054767    bill_length_mm
## 4  0.59652777     island.Biscoe
## 5  0.01096361          sex.male
## 6 -0.01096361        sex.female
## 7 -0.31536361      island.Dream
## 8 -0.41931810  island.Torgersen
## 9 -0.74034609     bill_depth_mm

We see some strong correlations between the features - this multicollinearity will be an issue with any standard linear regression approach. If we want the best model possible, we might need to consider an approach like PCA to reframe our features as an independent set. We also notice in this plot that bill_length_mm, body_mass_g, and flipper_length_mm are more strongly correlated. We also see these features positively correlated with both Biscoe Island and being male. Next, we see that flipper_length_mm, body_mass_g, bill_length_mm, and island.Biscoe have the highest correlations with the target species.

Feature-Target Correlation

For regressions, we need to ensure there is a direct relationship between changes in each feature and our target outcome. Ideally, there should be a linear relationship. We can employ transformations (Box-Cox, etc.) to help linearize variables that don’t show a linear relationship.
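
As a quick sketch of what that could look like (not run here), caret’s BoxCoxTrans estimates a Box-Cox lambda for a single feature and applies it:

# Sketch (not run): estimate a Box-Cox lambda for one feature and apply it.
# Box-Cox requires strictly positive values, which holds for body_mass_g.
bct <- BoxCoxTrans(df$body_mass_g)
body_mass_bc <- predict(bct, df$body_mass_g)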

Now let’s explore the correlations between features and with the target. Note that I plot males and females in separate charts.

males <- df %>% 
  dplyr::filter(sex.male == 1) %>% 
  dplyr::select(-c(sex.male, sex.female))

featurePlot(x=males[,1:7], y=males[,8], 
            plot="pairs", 
            main = 'Male Penguin FeaturePlot',
            auto.key=list(columns=3))

females <- df %>% 
  dplyr::filter(sex.female == 1) %>% 
  dplyr::select(-c(sex.male, sex.female))

featurePlot(x=females[,1:7], y=females[,8], 
            plot="pairs", 
            main = 'Female Penguin FeaturePlot',
            auto.key=list(columns=3))

rm(males)
rm(females)

These paired feature plots illustrate how the different species form distinct clusters (different colors). Gentoo, for example, has a larger body and flipper length, but its bill is smaller in depth. By using pairs of features at a time, we can more easily see the feature separation. The island features (dummy variables) only take values of 0 or 1, but with those columns we can see which islands penguins are found on. Biscoe has Adelie and Gentoo, whereas Dream has Adelie and Chinstrap. On Torgersen, we only see Adelie.

Training-Test Split

We were not given a separate holdout group - so to better measure model performance against unseen samples, I’ll set aside 20% of our initial data as a holdout Test group. While this may slightly lower our training performance (fewer samples means less information captured by the model), the trade-off is that we can identify under- and over-fitting and better measure model performance.

# --- Setup Training and Test datasets
set.seed(3456)
trainIndex <- createDataPartition(df$species, p = .8, list = FALSE, times = 1)

pengTrain <- df[ trainIndex,]
pengTest  <- df[-trainIndex,]

# Cross validation setup for caret modeling
ctrl <- trainControl(method = 'cv', 
                     number = 10,
                     classProbs = T,
                     savePredictions = TRUE)

1. Linear Discriminant Analysis

We train our model on the training data only, assess performance, then separately predict on the holdout test set and use that for final model performance.
I only included a subset of features based on the EDA above: island.Biscoe, bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
lda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, 
               data = pengTrain, 
               method="lda",
               trControl = ctrl
)

Training Performance

print('Training Performance')
## [1] "Training Performance"
summary(lda_model)
##             Length Class      Mode     
## prior        3     -none-     numeric  
## counts       3     -none-     numeric  
## means       15     -none-     numeric  
## scaling     10     -none-     numeric  
## lev          3     -none-     character
## svd          2     -none-     numeric  
## N            1     -none-     numeric  
## call         3     -none-     call     
## xNames       5     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    3     -none-     character
## param        0     -none-     list
varImp(lda_model)
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                   Adelie Chinstrap Gentoo
## island.Biscoe       0.00    100.00 100.00
## flipper_length_mm  99.75     91.61  99.75
## bill_length_mm     92.73     92.73  90.85
## bill_depth_mm      88.92     92.48  92.48
## body_mass_g        85.61     89.32  89.32
# --- Evaluate our model
train_pred <- as.factor(predict(lda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       116         2      0
##   Chinstrap      1        53      0
##   Gentoo         0         0     96
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9888          
##                  95% CI : (0.9676, 0.9977)
##     No Information Rate : 0.4366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9825          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9915           0.9636        1.0000
## Specificity                 0.9868           0.9953        1.0000
## Pos Pred Value              0.9831           0.9815        1.0000
## Neg Pred Value              0.9933           0.9907        1.0000
## Prevalence                  0.4366           0.2052        0.3582
## Detection Rate              0.4328           0.1978        0.3582
## Detection Prevalence        0.4403           0.2015        0.3582
## Balanced Accuracy           0.9891           0.9795        1.0000

Overall accuracy was 0.9888, which on the surface is quite good, and the model p-value shows the accuracy was significantly better than the no-information rate. That said, I’ve included a confusion matrix and per-class Sensitivity and Specificity so we can better understand misclassifications. R uses a one-against-many approach to calculate these: Sensitivity is TP/(TP+FN) and Specificity is TN/(TN+FP). In the variable importance table, we can see which features were most useful for predicting each class. Generally the order of importance was the same across species, with island.Biscoe and flipper_length_mm being the most important features for classification. Notice that bill_length_mm and bill_depth_mm importance were switched in Gentoo relative to the other 2 species.
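
To make the one-against-many calculation concrete, the Adelie numbers above can be recomputed by hand from the confusion matrix:

# One-vs-rest Sensitivity/Specificity for Adelie from the counts above
cm <- table(Prediction = train_pred, Reference = pengTrain$species)
TP <- cm['Adelie', 'Adelie']            # Adelie correctly called Adelie
FN <- sum(cm[, 'Adelie']) - TP          # Adelie called something else
FP <- sum(cm['Adelie', ]) - TP          # non-Adelie called Adelie
TN <- sum(cm) - TP - FN - FP            # everything else
TP / (TP + FN)                          # Sensitivity: 116/117 = 0.9915
TN / (TN + FP)                          # Specificity: 149/151 = 0.9868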

Holdout Test Performance

print('Holdout Test Performance')
## [1] "Holdout Test Performance"
test_pred <- as.factor(predict(lda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         0      0
##   Chinstrap      1        13      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9759          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           1.0000        1.0000
## Specificity                 1.0000           0.9808        1.0000
## Pos Pred Value              1.0000           0.9286        1.0000
## Neg Pred Value              0.9730           1.0000        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.2000        0.3538
## Detection Prevalence        0.4308           0.2154        0.3538
## Balanced Accuracy           0.9828           0.9904        1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(lda_values <- defaultSummary(modelValues))
##  Accuracy     Kappa 
## 0.9846154 0.9759437

The performance against the holdout group is important for ensuring our model generalizes (no overfitting) and can handle new data points not seen during training. Again, accuracy is quite good and, as we would expect with new data, slightly lower at 0.9846. Again, I included a confusion matrix to understand where the model errs per class, along with per-class Sensitivity and Specificity.

2. Quadratic Discriminant Analysis

Note: I had problems running plain qda() within caret’s train(), so I used stepQDA instead, which applies stepwise feature selection and evaluation using QDA.
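
For reference, a direct fit outside caret might look like the sketch below. A likely culprit for the qda() problems is island.Biscoe: the island breakdown above showed Gentoo only on Biscoe and Chinstrap only on Dream, so this dummy is constant within those two groups, making their group covariance matrices singular for QDA. The sketch drops it:

# Sketch (not run): a direct MASS::qda fit, dropping island.Biscoe (constant
# within the Chinstrap and Gentoo groups).
# Note: loading MASS masks dplyr::select, hence the explicit dplyr:: calls above.
library(MASS)
qda_direct <- qda(species ~ bill_length_mm + bill_depth_mm +
                    flipper_length_mm + body_mass_g, data = pengTrain)
qda_pred <- predict(qda_direct, newdata = pengTest)$class
mean(qda_pred == pengTest$species)   # raw holdout accuracy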

qda_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, 
                   data = pengTrain, 
                   method="stepQDA",
                   trControl = ctrl,
                   metric = "Accuracy"
)
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 242 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.82617;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.94617;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.169
## 
## (equivalent stepwise output for the remaining CV folds omitted; every fold
## selected flipper_length_mm and then bill_length_mm)
## 
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 268 observations of 5 variables in 3 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.81752;  in: "flipper_length_mm";  variables (1): flipper_length_mm 
## correctness rate: 0.94786;  in: "bill_length_mm";  variables (2): flipper_length_mm, bill_length_mm 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##       0.000       0.000       0.151

Training Performance

summary(qda_model)
##                     Length Class      Mode     
## call                6      -none-     call     
## method              1      -none-     character
## start.variables     0      -none-     NULL     
## process             4      data.frame list     
## model               2      data.frame list     
## result.pm           2      -none-     numeric  
## runtime             3      -none-     numeric  
## performance.measure 1      -none-     character
## formula             3      formula    call     
## fit                 8      qda        list     
## xNames              5      -none-     character
## problemType         1      -none-     character
## tuneValue           2      data.frame list     
## obsLevels           3      -none-     character
## param               0      -none-     list
varImp(qda_model)
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                   Adelie Chinstrap Gentoo
## island.Biscoe       0.00    100.00 100.00
## flipper_length_mm  99.75     91.61  99.75
## bill_length_mm     92.73     92.73  90.85
## bill_depth_mm      88.92     92.48  92.48
## body_mass_g        85.61     89.32  89.32
# --- Evaluate our model
train_pred <- as.factor(predict(qda_model, newdata = pengTrain))
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       112         5      0
##   Chinstrap      3        48      1
##   Gentoo         2         2     95
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9515          
##                  95% CI : (0.9185, 0.9739)
##     No Information Rate : 0.4366          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9239          
##                                           
##  Mcnemar's Test P-Value : 0.418           
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9573           0.8727        0.9896
## Specificity                 0.9669           0.9812        0.9767
## Pos Pred Value              0.9573           0.9231        0.9596
## Neg Pred Value              0.9669           0.9676        0.9941
## Prevalence                  0.4366           0.2052        0.3582
## Detection Rate              0.4179           0.1791        0.3545
## Detection Prevalence        0.4366           0.1940        0.3694
## Balanced Accuracy           0.9621           0.9270        0.9832

As with LDA above, I provide accuracy, a confusion matrix, per-class performance (Sensitivity and Specificity), and variable importance. Notice that our accuracy = 0.9515 was lower than we saw with LDA. Unlike LDA, which correctly identified every Gentoo, QDA missed 5 times - four times by incorrectly predicting Gentoo (FP) and once where it failed to predict a Gentoo (FN). This led to lower Specificity and Sensitivity.
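
Those counts can be read straight off the confusion matrix:

# Quick check of the Gentoo errors counted above
cm <- table(Prediction = train_pred, Reference = pengTrain$species)
sum(cm['Gentoo', ]) - cm['Gentoo', 'Gentoo']   # FP: non-Gentoo predicted as Gentoo (4)
sum(cm[, 'Gentoo']) - cm['Gentoo', 'Gentoo']   # FN: Gentoo predicted as something else (1)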

Holdout Test Performance

test_pred <- as.factor(predict(qda_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         1      0
##   Chinstrap      0        12      1
##   Gentoo         1         0     22
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9538         
##                  95% CI : (0.871, 0.9904)
##     No Information Rate : 0.4462         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9274         
##                                          
##  Mcnemar's Test P-Value : 0.3916         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.9231        0.9565
## Specificity                 0.9722           0.9808        0.9762
## Pos Pred Value              0.9655           0.9231        0.9565
## Neg Pred Value              0.9722           0.9808        0.9762
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1846        0.3385
## Detection Prevalence        0.4462           0.2000        0.3538
## Balanced Accuracy           0.9689           0.9519        0.9664
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(qda_values <- defaultSummary(modelValues))
##  Accuracy     Kappa 
## 0.9538462 0.9274013

Accuracy on the holdout group (0.9538) was similar to the training data. Again, in the confusion matrix we can see where the mistakes were made. There were fewer mistakes, but the holdout group only had 20% of the total samples. Please refer to the confusion matrix and per-class Sensitivity and Specificity metrics provided.

3. Naive Bayes

Before starting, keep in mind that a fundamental assumption of Naive Bayes (and why it’s “naive”) is independence of the features. Multicollinearity is a problem and leads to poorer predictive power with this technique. We already know there are inherent correlations among our features, so I would anticipate NB performing worse than the other modeling techniques.
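
To make the independence assumption concrete, here is a sketch of the “naive” posterior for one penguin using two features and assuming Gaussian class-conditionals - a hand-rolled illustration of the factorization, not the exact computation caret performs:

# Sketch: posterior proportional to prior * product of per-feature densities;
# each feature enters independently, which is the "naive" assumption.
x <- pengTrain[1, c('bill_length_mm', 'flipper_length_mm')]
scores <- sapply(levels(pengTrain$species), function(cl) {
  sub <- pengTrain[pengTrain$species == cl, ]
  prior <- nrow(sub) / nrow(pengTrain)
  prior *
    dnorm(x$bill_length_mm,    mean(sub$bill_length_mm),    sd(sub$bill_length_mm)) *
    dnorm(x$flipper_length_mm, mean(sub$flipper_length_mm), sd(sub$flipper_length_mm))
})
scores / sum(scores)   # class posteriors under the independence assumption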

nb_model <- train(species ~ island.Biscoe + bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g, 
                  data = pengTrain, 
                  method="nb",
                  trControl = ctrl
)

Training Performance

summary(nb_model)
##             Length Class      Mode     
## apriori     3      table      numeric  
## tables      5      -none-     list     
## levels      3      -none-     character
## call        7      -none-     call     
## x           5      data.frame list     
## usekernel   1      -none-     logical  
## varnames    5      -none-     character
## xNames      5      -none-     character
## problemType 1      -none-     character
## tuneValue   3      data.frame list     
## obsLevels   3      -none-     character
## param       1      -none-     list
varImp(nb_model)
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                   Adelie Chinstrap Gentoo
## island.Biscoe       0.00    100.00 100.00
## flipper_length_mm  99.75     91.61  99.75
## bill_length_mm     92.73     92.73  90.85
## bill_depth_mm      88.92     92.48  92.48
## body_mass_g        85.61     89.32  89.32
# --- Evaluate our model
train_pred <- as.factor(predict(nb_model, newdata = pengTrain))
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 11
## (identical warnings repeated for observations 82, 84, 141, 148, 194, 207,
## 226, 235, 237, 242, 243, 244, 251 and 264)
confusionMatrix(train_pred, pengTrain$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       116         5      0
##   Chinstrap      1        50      0
##   Gentoo         0         0     96
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9776          
##                  95% CI : (0.9519, 0.9917)
##     No Information Rate : 0.4366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9648          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9915           0.9091        1.0000
## Specificity                 0.9669           0.9953        1.0000
## Pos Pred Value              0.9587           0.9804        1.0000
## Neg Pred Value              0.9932           0.9770        1.0000
## Prevalence                  0.4366           0.2052        0.3582
## Detection Rate              0.4328           0.1866        0.3582
## Detection Prevalence        0.4515           0.1903        0.3582
## Balanced Accuracy           0.9792           0.9522        1.0000

Naive Bayes had an accuracy = 0.9776, which isn’t too bad. The confusion matrix and per-class Sensitivity and Specificity show that it correctly identified every Gentoo but missed a few Adelie and Chinstrap predictions.

Holdout Test Performance

test_pred <- as.factor(predict(nb_model, newdata = pengTest))
confusionMatrix(test_pred, pengTest$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         3      0
##   Chinstrap      1        10      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9385         
##                  95% CI : (0.8499, 0.983)
##     No Information Rate : 0.4462         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.902          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.7692        1.0000
## Specificity                 0.9167           0.9808        1.0000
## Pos Pred Value              0.9032           0.9091        1.0000
## Neg Pred Value              0.9706           0.9444        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1538        0.3538
## Detection Prevalence        0.4769           0.1692        0.3538
## Balanced Accuracy           0.9411           0.8750        1.0000
# save final performance for later comparison
modelValues <- data.frame(obs = pengTest$species, pred=test_pred)
colnames(modelValues) = c('obs', 'pred')
(nb_values <- defaultSummary(modelValues))
##  Accuracy     Kappa 
## 0.9384615 0.9020347

Against the holdout data, accuracy was lower (0.9385), suggesting that the model didn’t generalize as well to new data and may have overfit during training. The model performed well at identifying Gentoo, but notice that Chinstrap Sensitivity was quite low (0.7692): it only predicted 10/13 of the Chinstrap penguins. The model had trouble differentiating Adelie and Chinstrap, and the corresponding Adelie Specificity was lower (33/36 = 0.9167) since it incorrectly predicted 3 Chinstrap as Adelie. Here we see the usefulness of per-class Sensitivity and Specificity.

4. Compare models

(models_summary <- rbind(lda_values, qda_values, nb_values))
##             Accuracy     Kappa
## lda_values 0.9846154 0.9759437
## qda_values 0.9538462 0.9274013
## nb_values  0.9384615 0.9020347

This summary looks at just the holdout accuracy of the 3 different modeling approaches. All three models performed extremely well, with LDA doing best and Naive Bayes worst. I would note that the classes had fairly good separation, which probably led to the strong LDA performance. Naive Bayes assumes independence of features, which might explain its poorer performance - we didn’t do anything to handle multicollinearity. These models performed sufficiently well that I did not go back and do further feature cleanup. If model performance had been lower, I would have considered normalizing my features (center, scale, and potentially Box-Cox), but that just wasn’t necessary. As a side note, given how well these models performed, it might be worth learning these modeling techniques on a different data set where we see more differences between models and can relate those to how the models work and why we’d choose one over another.
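
As a possible extension (not run here), caret’s resamples() could compare the cross-validated accuracy distributions directly, assuming all three models were fit with the same trainControl (ctrl):

# Sketch (not run): compare cross-validated resampling results across models
resamps <- resamples(list(LDA = lda_model, QDA = qda_model, NB = nb_model))
summary(resamps)
dotplot(resamps, metric = 'Accuracy')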