require(psych)
require(caret)
require(e1071)
require(DataExplorer)
require(palmerpenguins)
require(MASS)
require(tidyverse)
require(GGally)

a. Linear Discriminant Analysis

  1. You want to evaluate all the ‘features’ (independent variables) and see what should be in your model. Please comment on your choices.

  2. Just a suggestion: you might want to consider exploring featurePlot in the caret package. Basically, you look at each of the features/independent variables and see how they differ by species. Simply eyeballing this might give you an idea of which would be strong ‘classifiers’ (aka predictors).

  3. Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.

  4. Look at the fit statistics/accuracy rates.

Data Exploration

The penguins data has 344 rows and 8 columns. It consists of three different species: the Adelie, the Chinstrap, and the Gentoo. The Adelie is spread across three different islands, while the Chinstrap and the Gentoo each reside on only one island.

str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Remove Columns and Missing Data

penguins_tf <- penguins[-c(2, 7, 8)]  # drop island, sex, and year
penguins_tf <- na.omit(penguins_tf)   # drop rows with missing values
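
For readability, the same cleanup can also be written by column name instead of position; a small sketch using dplyr and tidyr (both loaded via tidyverse):

# Equivalent name-based cleanup; less fragile if the column order changes
penguins_tf <- penguins %>%
  select(-island, -sex, -year) %>%
  drop_na()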

What features to choose?

After examining the featurePlot and the other visuals, the features we should keep are bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. The features that should be removed are island, sex, and year, since the species do not differ enough on them to make a dramatic change. Looking at the correlations, though, bill_depth_mm and flipper_length_mm are negatively correlated with each other, so they may be the weaker features of the data. But we will see how that affects the accuracy.

GGPairs

ggpairs(penguins_tf, aes(fill = species))

pairs.panels(penguins_tf)
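
Beyond the plots, a quick numeric check of the pairwise correlations among the four measurements:

# Correlation matrix of the numeric predictors
round(cor(penguins_tf[, 2:5]), 2)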

featurePlot

featurePlot(x = penguins_tf[, 2:5],  # numeric predictors only
            y = penguins_tf$species,
            plot = "pairs",
            auto.key = list(columns = 3))
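
Density plots are another quick way to eyeball per-feature class separation; a sketch using the same featurePlot() interface:

# Per-feature density curves, colored by species
featurePlot(x = penguins_tf[, 2:5],
            y = penguins_tf$species,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            auto.key = list(columns = 3))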

Data Preparation

Split The Data

  • Here the data gets split into training and test samples with a 70/30 ratio
set.seed(123)
# createDataPartition() returns a one-column matrix; convert it to a
# plain vector so tibble row-indexing does not warn
penguins_samples <- penguins_tf$species %>%
  createDataPartition(p = 0.70, list = FALSE) %>%
  as.vector()
train_penguins <- penguins_tf[penguins_samples, ]
test_penguins <- penguins_tf[-penguins_samples, ]
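
Since createDataPartition() stratifies on the outcome, the class proportions should come out nearly identical in both sets; a quick sanity check:

# Class proportions in the training and test sets
prop.table(table(train_penguins$species))
prop.table(table(test_penguins$species))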

Normalize the data

  • Categorical variables are automatically ignored by preProcess(), although they have already been removed here
# Estimate preprocessing parameters
preproc.param <- train_penguins %>% 
  preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
train_penguins_tf <- preproc.param %>% predict(train_penguins)
test_penguins_tf <- preproc.param %>% predict(test_penguins)
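
As a sanity check, the centered and scaled training predictors should now have mean ~0 and sd ~1 (the test set will be close but not exact, since it was transformed with the training parameters):

# Verify the scaling on the training predictors
sapply(train_penguins_tf[, 2:5], mean)
sapply(train_penguins_tf[, 2:5], sd)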

Linear Discriminant Analysis

Fit the data

# Fit the model
model1 <- lda(species~., data = train_penguins_tf)
model1
## Call:
## lda(species ~ ., data = train_penguins_tf)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4398340 0.1991701 0.3609959 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie        -0.9405683     0.6069344        -0.7688184  -0.6266052
## Chinstrap      0.8682414     0.6779309        -0.3885983  -0.5727280
## Gentoo         0.6669500    -1.1135141         1.1511203   1.0794379
## 
## Coefficients of linear discriminants:
##                          LD1         LD2
## bill_length_mm    -0.4034190  2.40623278
## bill_depth_mm      2.1555205 -0.03089148
## flipper_length_mm -1.2610839 -0.34819316
## body_mass_g       -0.9958888 -1.34633311
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8662 0.1338

Accuracy and Prediction

Check for accuracy

# Make predictions
predictions <- model1 %>% predict(test_penguins_tf)
# Model accuracy
acc1<-mean(predictions$class==test_penguins_tf$species)
acc1
## [1] 1
  • Correctly classified 100% of observations
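A confusion matrix gives a fuller picture than the raw accuracy; a quick sketch using caret's confusionMatrix():

# Cross-tabulate predicted vs. actual classes on the test set
confusionMatrix(predictions$class, test_penguins_tf$species)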
# Predicted classes
head(predictions$class, 10)
##  [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## Levels: Adelie Chinstrap Gentoo
# Predicted probabilities of class membership.
head(predictions$posterior,10) 
##       Adelie    Chinstrap       Gentoo
## 1  0.9999663 3.368300e-05 1.292934e-20
## 2  0.9997173 2.826783e-04 1.759782e-13
## 3  0.9871920 1.280802e-02 8.774580e-15
## 4  0.9999923 7.651464e-06 1.241291e-25
## 5  0.9999792 2.077339e-05 3.758390e-16
## 6  0.9999999 5.462569e-08 2.713852e-22
## 7  0.9495205 5.047955e-02 9.552348e-23
## 8  0.9998939 1.061455e-04 6.289380e-26
## 9  0.9998679 1.321081e-04 2.631037e-20
## 10 0.9996247 3.752792e-04 4.740123e-19
# Linear discriminant
head(predictions$x, 3) 
##        LD1         LD2
## 1 4.393039 -0.94505287
## 2 2.454653 -0.95076831
## 3 2.925150  0.09820153
lda.data <- cbind(train_penguins_tf, predict(model1)$x)
ggplot(lda.data, aes(LD1, LD2)) +
  geom_point(aes(color = species))

Looking at this plot, you can see that the Gentoo species sits on the negative side of LD1, clearly separated from the other two species. LD1 alone accounts for about 87% of the between-class separation (the proportion of trace above).

b. Quadratic Discriminant Analysis

Quadratic Discriminant Analysis does not work with all of the features: one gets a warning when the sex and island features are included.

Fit the model

# Fit the model
model2 <- qda(species~., data = train_penguins_tf)
model2
## Call:
## qda(species ~ ., data = train_penguins_tf)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4398340 0.1991701 0.3609959 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie        -0.9405683     0.6069344        -0.7688184  -0.6266052
## Chinstrap      0.8682414     0.6779309        -0.3885983  -0.5727280
## Gentoo         0.6669500    -1.1135141         1.1511203   1.0794379

Predictions and Accuracy

# Make predictions
predictions <- model2 %>% predict(test_penguins_tf)
# Model accuracy
acc2<- mean(predictions$class == test_penguins_tf$species)
acc2
## [1] 1
# Predicted classes
head(predictions$class, 10)
##  [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## Levels: Adelie Chinstrap Gentoo
# Predicted probabilities of class membership.
head(predictions$posterior,10) 
##       Adelie    Chinstrap       Gentoo
## 1  0.9999914 8.603087e-06 3.624649e-39
## 2  0.9975292 2.470775e-03 2.676516e-24
## 3  0.9969078 3.092243e-03 5.156444e-26
## 4  1.0000000 1.048917e-09 1.345814e-50
## 5  0.9997662 2.337704e-04 6.844792e-29
## 6  1.0000000 2.785740e-09 4.515863e-39
## 7  0.9989343 1.065658e-03 1.870528e-48
## 8  0.9999965 3.461617e-06 5.186200e-50
## 9  0.9999778 2.221201e-05 4.407984e-40
## 10 0.9998821 1.178983e-04 6.687368e-37
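
To visualize the curved class boundaries that QDA fits (versus LDA's straight lines), one option is partimat() from the klaR package; a sketch, assuming klaR is installed (it is not loaded above):

# Pairwise decision-boundary plots under the QDA rule (requires klaR)
library(klaR)
partimat(species ~ ., data = train_penguins_tf, method = "qda")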

c. Naïve Bayes

Fit the model

model3 <- naiveBayes(species ~., data = train_penguins_tf)
model3
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##    Adelie Chinstrap    Gentoo 
## 0.4398340 0.1991701 0.3609959 
## 
## Conditional probabilities:
##            bill_length_mm
## Y                 [,1]      [,2]
##   Adelie    -0.9405683 0.4931934
##   Chinstrap  0.8682414 0.6409665
##   Gentoo     0.6669500 0.5550147
## 
##            bill_depth_mm
## Y                 [,1]      [,2]
##   Adelie     0.6069344 0.5808984
##   Chinstrap  0.6779309 0.5867838
##   Gentoo    -1.1135141 0.4750237
## 
##            flipper_length_mm
## Y                 [,1]      [,2]
##   Adelie    -0.7688184 0.4860308
##   Chinstrap -0.3885983 0.5041611
##   Gentoo     1.1511203 0.4583213
## 
##            body_mass_g
## Y                 [,1]      [,2]
##   Adelie    -0.6266052 0.5867777
##   Chinstrap -0.5727280 0.5349244
##   Gentoo     1.0794379 0.6067798
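
In the conditional-probability tables above, [,1] is the class-conditional mean and [,2] the standard deviation: for numeric predictors, naiveBayes() fits one Gaussian per feature and class. The fitted parameters can be pulled out of the model object directly, for example:

# Gaussian parameters (mean, sd) of bill_length_mm for each species
model3$tables$bill_length_mm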

Predictions and Accuracy

predicted.classes<- model3 %>% predict(test_penguins_tf)
# Model accuracy
acc3<- mean(predicted.classes == test_penguins_tf$species)
acc3
## [1] 0.990099
plot(predicted.classes)
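
The bar plot only shows how the predictions are distributed across classes; cross-tabulating them against the true labels shows where the single miss occurs:

# Predicted vs. actual classes for Naive Bayes
table(Predicted = predicted.classes, Actual = test_penguins_tf$species)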

  1. Comment on the model fits/strengths/weaknesses/accuracy for all three models that you worked with.
title <- c("LDA Accuracy", "QDA Accuracy", "Naive Bayes Accuracy")
accuracy <- c(acc1, acc2, acc3)

accuracytbl <- data.frame(title, accuracy)

Let's look at the accuracy of the three different models.

accuracytbl
##                  title accuracy
## 1         LDA Accuracy 1.000000
## 2         QDA Accuracy 1.000000
## 3 Naive Bayes Accuracy 0.990099

Looking at the accuracy, LDA and QDA both reach 100%, and the group means they report are identical. Naive Bayes also has a high accuracy rate, but it falls just short of 100% in comparison to the other two. Although LDA and QDA tie on accuracy, LDA also produces the linear discriminants (LD1, LD2), which give another view of the data, so I would choose the LDA classifier.
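
One caveat: these accuracies come from a single test set of about 100 rows, so ties at 100% are not surprising. k-fold cross-validation gives a sturdier comparison; a sketch using caret's train(), whose method = "lda" wraps MASS::lda:

# 10-fold cross-validated accuracy for LDA on the full cleaned data
set.seed(123)
cv_lda <- train(species ~ ., data = penguins_tf,
                method = "lda",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 10))
cv_lda$results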

When looking at the fitted group means for all three models, Adelie has negative values on bill_length, flipper_length, and body_mass, so compared to the other two species Adelie is smaller in those areas. Chinstrap is also negative on flipper_length and body_mass, so Gentoo is bigger in those areas. Gentoo is negative only on bill_depth, so it appears to be the largest species overall. This also shows why bill_depth runs opposite to the other measurements (the negative correlation noted earlier), yet these features are still useful, because they separate the species cleanly.

References

  1. http://www.sthda.com/english/articles/36-classification-methods-essentials/146-discriminant-analysis-essentials-in-r/

  2. https://www.r-bloggers.com/2018/01/understanding-naive-bayes-classifier-using-r/