require(psych)
require(caret)
require(e1071)
require(DataExplorer)
require(palmerpenguins)
require(MASS)
require(tidyverse)
require(GGally)a. Linear Discriminant Analysis
You want to evaluate all the ‘features’ or dependent variables and see what should be in your model. Please comment on your choices.
Just a suggestion: You might want to consider exploring featurePlot on the caret package. Basically, you look at each of the features/dependent variables and see how they are different based on species. Simply eye-balling this might give you an idea about which would be strong ‘classifiers’ (aka predictors).
Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
Look at the fit statistics/ accuracy rates.
Data Exploration
The penguins data has 344 rows of data and 8 columns. I consist of three different species the Adelie, the Chinstrap, and the Gentoo. The Adelie is spread out three different islands while the Chinstrap and the Gentoo reside in only one island.
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Remove Columns and missing Data
What features to choose?
After doing the featurePlot and other visuals we see that we should keep are bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g. The features that should be removed are island, sex and year, since each of these values were not that far apart that would make a dramatic change. When looking at the correlation of the values though you can see that bill_depth_mm and flipper_length_mm are negatively correlated with species so they are the weaker features of the data.But we will see how that affects the accuracy.
Data Preparation
Split The Data
- Here the data gets split into a train and test samples with a 70/30 ratio
set.seed(123)
penguins_samples <- penguins_tf$species %>%
createDataPartition(p = 0.70, list = FALSE)
train_penguins <- penguins_tf[penguins_samples, ]## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Normalize the data
- Categorical variables are automatically ignored, although they have been removed already
Linear Discriminant Analysis
Fit the data
## Call:
## lda(species ~ ., data = train_penguins_tf)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4398340 0.1991701 0.3609959
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie -0.9405683 0.6069344 -0.7688184 -0.6266052
## Chinstrap 0.8682414 0.6779309 -0.3885983 -0.5727280
## Gentoo 0.6669500 -1.1135141 1.1511203 1.0794379
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm -0.4034190 2.40623278
## bill_depth_mm 2.1555205 -0.03089148
## flipper_length_mm -1.2610839 -0.34819316
## body_mass_g -0.9958888 -1.34633311
##
## Proportion of trace:
## LD1 LD2
## 0.8662 0.1338
Accuracy and Prediction
Check for accuracy
# Make predictions
predictions <- model1 %>% predict(test_penguins_tf)
# Model accuracy
acc1<-mean(predictions$class==test_penguins_tf$species)
acc1## [1] 1
- Correctly classified 100% of observations
## [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## Levels: Adelie Chinstrap Gentoo
## Adelie Chinstrap Gentoo
## 1 0.9999663 3.368300e-05 1.292934e-20
## 2 0.9997173 2.826783e-04 1.759782e-13
## 3 0.9871920 1.280802e-02 8.774580e-15
## 4 0.9999923 7.651464e-06 1.241291e-25
## 5 0.9999792 2.077339e-05 3.758390e-16
## 6 0.9999999 5.462569e-08 2.713852e-22
## 7 0.9495205 5.047955e-02 9.552348e-23
## 8 0.9998939 1.061455e-04 6.289380e-26
## 9 0.9998679 1.321081e-04 2.631037e-20
## 10 0.9996247 3.752792e-04 4.740123e-19
## LD1 LD2
## 1 4.393039 -0.94505287
## 2 2.454653 -0.95076831
## 3 2.925150 0.09820153
lda.data <- cbind(train_penguins_tf, predict(model1)$x)
ggplot(lda.data, aes(LD1, LD2)) +
geom_point(aes(color = species))Looking at this you see that the Gentoo species is in the negative side of the LD1.
b. Quadratic Discriminant Analysis
Quadratic Discriminant Analysis does not work with all of the features, one gets a warning when features sex and isaland are included
Fit the model
## Call:
## qda(species ~ ., data = train_penguins_tf)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4398340 0.1991701 0.3609959
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie -0.9405683 0.6069344 -0.7688184 -0.6266052
## Chinstrap 0.8682414 0.6779309 -0.3885983 -0.5727280
## Gentoo 0.6669500 -1.1135141 1.1511203 1.0794379
Predictions and Accuracy
# Make predictions
predictions <- model2 %>% predict(test_penguins_tf)
# Model accuracy
acc2<- mean(predictions$class == test_penguins_tf$species)
acc2## [1] 1
## [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## Levels: Adelie Chinstrap Gentoo
## Adelie Chinstrap Gentoo
## 1 0.9999914 8.603087e-06 3.624649e-39
## 2 0.9975292 2.470775e-03 2.676516e-24
## 3 0.9969078 3.092243e-03 5.156444e-26
## 4 1.0000000 1.048917e-09 1.345814e-50
## 5 0.9997662 2.337704e-04 6.844792e-29
## 6 1.0000000 2.785740e-09 4.515863e-39
## 7 0.9989343 1.065658e-03 1.870528e-48
## 8 0.9999965 3.461617e-06 5.186200e-50
## 9 0.9999778 2.221201e-05 4.407984e-40
## 10 0.9998821 1.178983e-04 6.687368e-37
c. Naïve Bayes
Fit the model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Adelie Chinstrap Gentoo
## 0.4398340 0.1991701 0.3609959
##
## Conditional probabilities:
## bill_length_mm
## Y [,1] [,2]
## Adelie -0.9405683 0.4931934
## Chinstrap 0.8682414 0.6409665
## Gentoo 0.6669500 0.5550147
##
## bill_depth_mm
## Y [,1] [,2]
## Adelie 0.6069344 0.5808984
## Chinstrap 0.6779309 0.5867838
## Gentoo -1.1135141 0.4750237
##
## flipper_length_mm
## Y [,1] [,2]
## Adelie -0.7688184 0.4860308
## Chinstrap -0.3885983 0.5041611
## Gentoo 1.1511203 0.4583213
##
## body_mass_g
## Y [,1] [,2]
## Adelie -0.6266052 0.5867777
## Chinstrap -0.5727280 0.5349244
## Gentoo 1.0794379 0.6067798
Predictions and Accuracy
predicted.classes<- model3 %>% predict(test_penguins_tf)
# Model accuracy
acc3<- mean(predicted.classes == test_penguins_tf$species)
acc3## [1] 0.990099
- Comment on the models fits/strength/weakness/accuracy for all these three models that you worked with.
title <- c( "LDA Accuracy", "QDA Accuracy", "Naive Bayes Accuracy")
accuracy<-c(acc1, acc2, acc3)
accuracytbl<-data.frame (title,accuracy)Lets look at the accuray of the three different models.
## title accuracy
## 1 LDA Accuracy 1.000000
## 2 QDA Accuracy 1.000000
## 3 Naive Bayes Accuracy 0.990099
Looking at the accuracy LDA and QDA both of the accuracy are 100 also when it comes to the means they are the same. The naive bayes has a high accuracy rate but it is not 100% in comparison to the other two. Although LDA and QDA did get the same accuracy LDA has the inclusion of linear discriminant which gives you another look at the data. For this I would choose the LDA classifier.
When looking at the actual values of the fit for all models Adelie has a negative values on bill_length, flipper_length and body_mass so in comparison to the other species Adelie in those areas are smaller in comparison to the other two species. Looking at Chinstrap flipper_length and body_mass is also negative so the Gentoo is bigger in those areas. Gentoo only has a negative in bill_depth, so they seem to be a bigger species. Looking at that you can see why those features were negatively correlated but they are useful because you can easily compare the species by those features.
References