Homework 2 - LDA / QDA / Naive Bayes Analysis

We will be working with the Penguin dataset again, as we did for Homework #1. Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”. We will be conducting Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Naive Bayes on the penguin dataset while evaluating all of the features, or predictor variables.

# Libraries
library(palmerpenguins)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(cowplot)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(regclass)
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:caret':
## 
##     predictors
## The following object is masked from 'package:tidyr':
## 
##     fill
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
## 
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
## 
##     qq
library(e1071)
library(klaR)

Data Exploration

First, I will take a closer look at the dataset using the glimpse and summary functions. I am reusing some of my Homework 1 methods for exploration and preparation.

#Load Data
data("penguins")
#Closer look at the data
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A...
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge...
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34....
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18....
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ...
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347...
## $ sex               <fct> male, female, female, NA, female, male, female, m...
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2...
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Species and their relationship with the numerical variables

penguins %>%
  group_by(species) %>%
  summarize(across(where(is.numeric), mean, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 6
##   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##   <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1 Adelie              38.8          18.3              190.       3701. 2008.
## 2 Chinstrap           48.8          18.4              196.       3733. 2008.
## 3 Gentoo              47.5          15.0              217.       5076. 2008.

The bar plots below visualize the count of each penguin species in the dataset. The second bar plot shows the species distribution within each sex.

penguins %>%
  count(species) %>%
  ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  labs(title = 'Penguins Species & Count')

penguins %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  facet_wrap(~sex) +
  theme_minimal() +
  labs(title = 'Penguins Species ~ Gender')

As you can see from the above visualizations and the summaries, Adelie (152) has the highest count, followed by Gentoo (124) and Chinstrap (68). The distribution between the sexes is nearly equal for all three species.

Explore Variables

library(explore)
## Warning: package 'explore' was built under R version 3.6.3
penguins %>% 
  explore_all(target = species)

Relationship with categorical features

In this section, my purpose is to analyze the relationship between species and the other categorical features, sex and island.

plot_island<-ggplot(penguins, aes(x = island, y = species, color = species)) +
  geom_jitter(size = 3)
plot_sex<-ggplot(penguins, aes(x = sex, y = species, color = species)) +
  geom_jitter(size = 3)
plot_grid(plot_island, plot_sex, labels = "AUTO")

From the above island graph we can see that Adelie penguins are found on all three islands, while Chinstrap penguins are found only on Dream island and Gentoo only on Biscoe island.

Relationship with numerical features

# Scatterplot 1: penguin flipper length versus body mass
plot1<-ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 2) +
  scale_color_manual(values = c("darkorange","darkorchid","cyan"))
# Scatterplot 2: penguin bill length versus bill depth
plot2<-ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 2)  +
  scale_color_manual(values = c("darkorange","darkorchid","cyan"))
plot3<-ggplot(penguins, aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point(aes(color = sex)) +
  scale_color_manual(values = c("darkorange","cyan4"),
                     na.translate = FALSE) +
  facet_wrap(~species)

plot_grid(plot1, plot2, plot3, labels = "AUTO")

Correlation

#correlation matrix
penguins %>%
  select_if(is.numeric) %>%
  drop_na() %>%
  cor()
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm        1.00000000   -0.23505287         0.6561813  0.59510982
## bill_depth_mm        -0.23505287    1.00000000        -0.5838512 -0.47191562
## flipper_length_mm     0.65618134   -0.58385122         1.0000000  0.87120177
## body_mass_g           0.59510982   -0.47191562         0.8712018  1.00000000
## year                  0.05454458   -0.06035364         0.1696751  0.04220939
##                          year
## bill_length_mm     0.05454458
## bill_depth_mm     -0.06035364
## flipper_length_mm  0.16967511
## body_mass_g        0.04220939
## year               1.00000000
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:explore':
## 
##     rescale01
ggpairs(penguins, columns = 3:6, title = "Correlogram of Variables", 
        ggplot2::aes(color = island),
        progress = FALSE, 
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1))) 

Penguin flipper length and body mass show a positive association for each of the three species. We can also notice from the visualization that Adelie and Chinstrap overlap for all the variables except bill length. We can assume that the bill length variable helps us distinguish the three species from each other.

Data Preparation

The first step of my data preparation is to deal with missing data. Next, I will use featurePlot to verify which variables to use. Then I will partition the dataset into training and test sets using createDataPartition, allocating 80% of the dataset to training and 20% to testing.

# omit missing data
penguins <- na.omit(penguins)
#check for missing data
colSums(is.na(penguins))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 0                 0 
## flipper_length_mm       body_mass_g               sex              year 
##                 0                 0                 0                 0

Eliminate the variables island and year

penguins = penguins %>% 
  dplyr::select( -one_of("island", "year") )
head(penguins)
## # A tibble: 6 x 6
##   species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##   <fct>            <dbl>         <dbl>             <int>       <int> <fct> 
## 1 Adelie            39.1          18.7               181        3750 male  
## 2 Adelie            39.5          17.4               186        3800 female
## 3 Adelie            40.3          18                 195        3250 female
## 4 Adelie            36.7          19.3               193        3450 female
## 5 Adelie            39.3          20.6               190        3650 male  
## 6 Adelie            38.9          17.8               181        3625 female

Feature Selection

featurePlot(x = penguins[,c("bill_length_mm",
                            "bill_depth_mm",
                            "flipper_length_mm",
                            "body_mass_g")], 
            y = penguins$species, 
            plot = "box",
            ## Add a key at the top
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            layout = c(2,2),
            auto.key = list(columns = 3))

We can see from the feature plot that, for each variable, the medians and quartiles of at least one species separate cleanly from the others. flipper_length_mm, body_mass_g and bill_depth_mm distinguish Gentoo from the Adelie and Chinstrap species, while bill_length_mm separates Adelie from the Chinstrap and Gentoo species.

Split between Train and Test datasets

#Split data 80% training and 20% test sets
set.seed(123)

partition <- penguins$species %>%
    createDataPartition(p = 0.8, list = FALSE)
train <- penguins[partition, ]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test <- penguins[-partition, ]

Linear Discriminant Analysis - LDA

I will carry out the LDA using the lda() function from the R MASS package. First I will center and scale the numerical data with preProcess (categorical variables are left untouched). In my LDA model I will ignore the sex variable and use only the numerical features to predict the species.

#Estimating processing parameters 
preproc.param <- train %>%
  preProcess(method = c("center", "scale"))

# Transform the data using the estimated parameters
train.transformed <- preproc.param %>% predict(train)
test.transformed <- preproc.param %>% predict(test)
# Fit the model
lda_model <- lda(species~ bill_length_mm + bill_depth_mm + body_mass_g + flipper_length_mm , data = train.transformed)


lda_model
## Call:
## lda(species ~ bill_length_mm + bill_depth_mm + body_mass_g + 
##     flipper_length_mm, data = train.transformed)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##           bill_length_mm bill_depth_mm body_mass_g flipper_length_mm
## Adelie        -0.9555507     0.6009140  -0.6092842        -0.7707877
## Chinstrap      0.9088301     0.6564651  -0.5901947        -0.3716006
## Gentoo         0.6438935    -1.1084638   1.0806975         1.1522937
## 
## Coefficients of linear discriminants:
##                          LD1          LD2
## bill_length_mm     0.5415983 -2.408094172
## bill_depth_mm     -2.1000729  0.004472972
## body_mass_g        1.0280418  1.274291326
## flipper_length_mm  1.0940111  0.404513143
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8462 0.1538

LDA determines the group means and computes, for each individual, the probability of belonging to each group.

  1. Prior probabilities of groups shows the proportion of training observations in each group. For example, about 44% of the training observations are Adelie penguins.
  2. Group means shows the mean of each variable within each species.
  3. Coefficients of linear discriminants shows the linear combination of predictor variables used to form the LDA decision rule; the fitted coefficients appear in the equations below (see the short extraction sketch after them).

\(LD1 = 0.54 \times \text{bill\_length\_mm} - 2.10 \times \text{bill\_depth\_mm} + 1.03 \times \text{body\_mass\_g} + 1.09 \times \text{flipper\_length\_mm}\)
\(LD2 = -2.41 \times \text{bill\_length\_mm} + 0.004 \times \text{bill\_depth\_mm} + 1.27 \times \text{body\_mass\_g} + 0.40 \times \text{flipper\_length\_mm}\)
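
The same quantities can be pulled straight out of the fitted model object, which is a quick way to verify the coefficients above (a minimal sketch; the printed output is not reproduced here):

# Components of the fitted LDA model
lda_model$prior     # prior probabilities of each species
lda_model$means     # group means of the scaled predictors
lda_model$scaling   # coefficients of the linear discriminants LD1 and LD2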

# Make predictions
pred_lda_comp <- lda_model %>%
  predict(test.transformed)
# Confusion Matrix for LDA 
con_lda_model <- confusionMatrix(pred_lda_comp$class, test.transformed$species)
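
The confusion matrix for LDA is stored in con_lda_model but not printed above; a minimal way to inspect it (output not shown here) is:

# Inspect LDA performance on the held-out test set
con_lda_model$table                   # predicted vs. actual species
con_lda_model$overall["Accuracy"]     # overall test-set accuracy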

LDA Plot

lda.data <- cbind(train.transformed, predict(lda_model)$x)
ggplot(lda.data, aes(LD1, LD2)) +
  geom_point(aes(color = species))

Quadratic Discriminant Analysis - QDA

QDA can be computed using the R function qda(), also from the MASS package. I will use the same numerical features to predict species.

qda_model <- qda(species~ bill_length_mm + bill_depth_mm + body_mass_g + flipper_length_mm , data = train.transformed)
qda_model
## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + body_mass_g + 
##     flipper_length_mm, data = train.transformed)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##           bill_length_mm bill_depth_mm body_mass_g flipper_length_mm
## Adelie        -0.9555507     0.6009140  -0.6092842        -0.7707877
## Chinstrap      0.9088301     0.6564651  -0.5901947        -0.3716006
## Gentoo         0.6438935    -1.1084638   1.0806975         1.1522937
# Make predictions
pred_qda_comp <- qda_model %>%
  predict(test.transformed)
# Confusion Matrix for QDA
con_qda_model <- confusionMatrix(pred_qda_comp$class, test.transformed$species)
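
As with LDA, the QDA confusion matrix is stored but not printed. A small sketch for putting the two test-set accuracies side by side (output not reproduced here):

# Compare LDA and QDA accuracy on the held-out test set
c(LDA = con_lda_model$overall["Accuracy"],
  QDA = con_qda_model$overall["Accuracy"])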

Naive Bayes

For Naive Bayes I will use the untransformed training data, since scaling is not required here; unlike LDA and QDA, this model will also make use of the categorical variable sex.

head(train)
## # A tibble: 6 x 6
##   species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##   <fct>            <dbl>         <dbl>             <int>       <int> <fct> 
## 1 Adelie            40.3          18                 195        3250 female
## 2 Adelie            36.7          19.3               193        3450 female
## 3 Adelie            38.9          17.8               181        3625 female
## 4 Adelie            39.2          19.6               195        4675 male  
## 5 Adelie            41.1          17.6               182        3200 female
## 6 Adelie            38.6          21.2               191        3800 male
naive_model <- naiveBayes(species~., data = train)
naive_model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Conditional probabilities:
##            bill_length_mm
## Y               [,1]     [,2]
##   Adelie    38.87265 2.595972
##   Chinstrap 48.94727 2.904266
##   Gentoo    47.51562 3.164052
## 
##            bill_depth_mm
## Y               [,1]     [,2]
##   Adelie    18.37607 1.194430
##   Chinstrap 18.48727 1.150210
##   Gentoo    14.95417 1.005555
## 
##            flipper_length_mm
## Y               [,1]     [,2]
##   Adelie    190.3504 6.823282
##   Chinstrap 195.8909 6.857472
##   Gentoo    217.0417 6.558107
## 
##            body_mass_g
## Y               [,1]     [,2]
##   Adelie    3730.769 471.5664
##   Chinstrap 3745.909 404.9967
##   Gentoo    5071.094 496.2800
## 
##            sex
## Y              female      male
##   Adelie    0.4786325 0.5213675
##   Chinstrap 0.4545455 0.5454545
##   Gentoo    0.4895833 0.5104167
# Predict on the same (untransformed) training data the model was fit on
pred_naive <- predict(naive_model, train)
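
The Naive Bayes model is never scored against the held-out test set above; below is a minimal sketch of that evaluation, mirroring what was done for LDA and QDA (the names pred_naive_test and con_naive_model are my own, and the output is not reproduced here):

# Evaluate Naive Bayes on the held-out test set
pred_naive_test <- predict(naive_model, test)
con_naive_model <- confusionMatrix(pred_naive_test, test$species)
con_naive_model$table                  # predicted vs. actual species
con_naive_model$overall["Accuracy"]    # overall test-set accuracy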

Data Analysis

lda_penguins <- train(data=train.transformed, species ~ ., method="lda",
                      trControl=trainControl(method="cv", number=10))

lda_penguins$results
##   parameter Accuracy Kappa AccuracySD KappaSD
## 1      none        1     1          0       0
qda_penguins <- train(data=train.transformed, species ~ ., method="qda",
                      trControl=trainControl(method="cv", number=10))

qda_penguins$results
##   parameter  Accuracy     Kappa AccuracySD    KappaSD
## 1      none 0.9962963 0.9940397 0.01171214 0.01884801
nb_penguins <- train(data=train.transformed, species ~ ., method="nb",
                      trControl=trainControl(method="cv", number=10))
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 17
nb_penguins$results
##   usekernel fL adjust  Accuracy     Kappa AccuracySD    KappaSD
## 1     FALSE  0      1 0.9703398 0.9537508 0.03892642 0.06044277
## 2      TRUE  0      1 0.9631970 0.9422348 0.04606048 0.07211191
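
Since all three caret models were trained with 10-fold cross-validation, their resampling results can be gathered for a side-by-side comparison with resamples() (a sketch; ideally the three models would share fold indices via the index argument of trainControl, and the summary output is not reproduced here):

# Collect and summarize cross-validated accuracy for the three models
model_comp <- resamples(list(LDA = lda_penguins, QDA = qda_penguins, NB = nb_penguins))
summary(model_comp)$statistics$Accuracy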

All three models performed very well in this scenario. LDA achieved 100% cross-validated accuracy, QDA 99.6%, and Naive Bayes about 96-97%. What is special about Naive Bayes is that it was able to handle both categorical and continuous variables. With an accuracy of 99.6% and a Kappa of 99.4%, I feel that using QDA for this analysis is reasonable.