Data Manipulation
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
##     lift
library(e1071)
library(caTools)
1. Split the data into training and testing sets.
We split the penguin dataset into a training set (80%) and a test set (20%). The data contains some NA values, which have to be omitted first.
# Load the data
penguins <- palmerpenguins::penguins
penguins <- as.data.frame(penguins)
# Omit all rows with NA values
penguins <- na.omit(penguins)
# Recode the island and sex factor levels as numbers
# (the variables stay factors; only the level labels change)
levels(penguins$island) <- c(0, 1, 2)  # Biscoe = 0, Dream = 1, Torgersen = 2
levels(penguins$sex) <- c(0, 1)        # female = 0, male = 1
# Split the data into training (80%) and test set (20%)
set.seed(120)  # Setting seed for reproducibility
training.samples <- penguins$species %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- penguins[training.samples, ]
test.data <- penguins[-training.samples, ]
2. Normalize the data. Categorical variables are automatically ignored.
LDA assumes that the predictors are normally distributed, with class-specific means and a common variance/covariance matrix across classes. We therefore center and scale the predictors before fitting the model.
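The train.transformed and test.transformed objects used below are not shown being created in the original output; here is a minimal sketch of this normalization step, assuming caret's preProcess() with centering and scaling (preProcess leaves factor columns untouched, which is why categorical variables are ignored automatically):
# Estimate centering/scaling parameters on the training set only,
# then apply the same transformation to both sets
preproc.param <- train.data %>% preProcess(method = c("center", "scale"))
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)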
Linear discriminant analysis - LDA
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
##     select
# Fit the model
model <- lda(species ~ ., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$species)
## [1] 1
Compute LDA:
# Refit the model, dropping year from the predictors
model <- lda(species ~ . - year, data = train.transformed)
model
## Call:
## lda(species ~ . - year, data = train.transformed)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## island1 island2 bill_length_mm bill_depth_mm flipper_length_mm
## Adelie 0.3931624 0.3247863 -0.9555862 0.6041178 -0.7633521
## Chinstrap 1.0000000 0.0000000 0.8349499 0.6116721 -0.4203050
## Gentoo 0.0000000 0.0000000 0.6862640 -1.0867057 1.1711351
## body_mass_g sex1
## Adelie -0.6210872 0.4529915
## Chinstrap -0.6096684 0.4363636
## Gentoo 1.1062392 0.5520833
##
## Coefficients of linear discriminants:
## LD1 LD2
## island1 -1.0710715 -1.77820348
## island2 -1.0600222 -0.21538615
## bill_length_mm 0.5311923 -2.11473056
## bill_depth_mm -1.6661173 -0.17620747
## flipper_length_mm 1.2482407 -0.01937768
## body_mass_g 1.1955640 0.67876715
## sex1 -0.5513973 1.03207058
##
## Proportion of trace:
## LD1 LD2
## 0.8381 0.1619
- Prior probabilities of groups: the proportion of training observations in each group. For example, 43.7% of the training observations are in the Adelie group, 20.5% in the Chinstrap group, and 35.8% in the Gentoo group.
- Group means: the group centroid, i.e., the mean of each variable within each group.
- Coefficients of linear discriminants: the linear combination of the predictor variables used to form the LDA decision rule. For example, LD1 = -1.07 x island1 - 1.06 x island2 + 0.53 x bill_length_mm - 1.67 x bill_depth_mm + 1.25 x flipper_length_mm + 1.20 x body_mass_g - 0.55 x sex1. Similarly, LD2 = -1.78 x island1 - 0.22 x island2 - 2.11 x bill_length_mm - 0.18 x bill_depth_mm - 0.02 x flipper_length_mm + 0.68 x body_mass_g + 1.03 x sex1. A sketch of how these scores are computed follows below.
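To make the "linear combination" concrete, here is a small sketch reproducing the discriminant scores by hand. It assumes the default priors (the training class proportions), in which case predict() effectively centers the model matrix at its overall mean before applying the coefficients:
# Build the numeric model matrix (the factors expand to island1, island2, sex1)
X <- model.matrix(species ~ . - year, data = train.transformed)[, -1]
# Center each column at its mean, then apply the LD coefficients
scores <- scale(X, center = TRUE, scale = FALSE) %*% model$scaling
head(scores)  # should match head(predict(model)$x)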
This plot shows the training observations on the two linear discriminants of the model.
lda.data <- cbind(train.transformed, predict(model)$x)
ggplot(lda.data, aes(LD1, LD2)) +
  geom_point(aes(color = species))
Model accuracy:
You can compute the model accuracy as follows:
predictions <- model %>% predict(test.transformed)
mean(predictions$class == test.transformed$species)
## [1] 1
Our model correctly classified 100% of the test observations, which is very good.
Quadratic discriminant analysis - QDA
QDA is more flexible than LDA because the covariance matrix is allowed to differ between classes. LDA tends to perform a bit better than QDA when the training set is small, while QDA is often recommended when the training set is very large, so that the variance of the classifier is not a concern.
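As an illustration of what "class-specific covariance" means, here is a small sketch computing each species' covariance matrix of the numeric predictors (variable names assume the transformed training set from above):
# Class-specific covariance matrices of the numeric predictors;
# QDA estimates one of these per class, while LDA pools them into one
num_vars <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")
lapply(split(train.transformed[num_vars], train.transformed$species), cov)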
# Fit the model; island is removed because it caused a rank deficiency
# error for the Chinstrap group (see the check at the end of this section).
# Note: bill_length_mm appears twice in the original call (likely a typo for
# bill_depth_mm); R drops the duplicate term, so the fitted model uses
# bill_length_mm, flipper_length_mm, body_mass_g, and sex
model <- qda(species ~ bill_length_mm + bill_length_mm + flipper_length_mm + body_mass_g + sex, data = train.transformed)
model
## Call:
## qda(species ~ bill_length_mm + bill_length_mm + flipper_length_mm +
## body_mass_g + sex, data = train.transformed)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm flipper_length_mm body_mass_g sex1
## Adelie -0.9555862 -0.7633521 -0.6210872 0.4529915
## Chinstrap 0.8349499 -0.4203050 -0.6096684 0.4363636
## Gentoo 0.6862640 1.1711351 1.1062392 0.5520833
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$species)
## [1] 1
As with LDA, 43.7% of the training observations are in the Adelie group, 20.5% in the Chinstrap group, and 35.8% in the Gentoo group. This model's accuracy is also 100%.
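A quick check of why island had to be dropped for QDA, consistent with the group means above (where Chinstrap has island1 = 1 and island2 = 0):
# Every Chinstrap observation is on Dream island (level 1), so the island
# dummies are constant within that class and its covariance matrix is singular
table(penguins$species, penguins$island)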
Naive Bayes Classifier
# Fit a Naive Bayes model using the same train.data and test.data from above
classifier_cl <- naiveBayes(species ~ . - year, data = train.data)
classifier_cl
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Conditional probabilities:
## island
## Y 0 1 2
## Adelie 0.3247863 0.3760684 0.2991453
## Chinstrap 0.0000000 1.0000000 0.0000000
## Gentoo 1.0000000 0.0000000 0.0000000
##
## bill_length_mm
## Y [,1] [,2]
## Adelie 38.83932 2.476196
## Chinstrap 49.18727 3.426479
## Gentoo 47.58854 3.195717
##
## bill_depth_mm
## Y [,1] [,2]
## Adelie 18.33504 1.1555948
## Chinstrap 18.46545 1.1686045
## Gentoo 14.98646 0.9857532
##
## flipper_length_mm
## Y [,1] [,2]
## Adelie 190.3846 6.508411
## Chinstrap 196.3273 6.947034
## Gentoo 217.1875 6.737659
##
## body_mass_g
## Y [,1] [,2]
## Adelie 3706.838 464.4776
## Chinstrap 3781.364 372.0320
## Gentoo 5088.542 502.8722
##
## sex
## Y 0 1
## Adelie 0.4786325 0.5213675
## Chinstrap 0.4363636 0.5636364
## Gentoo 0.4791667 0.5208333
# Predict on the test data
y_pred <- predict(classifier_cl, newdata = test.data)
# Confusion matrix
cm <- table(test.data$species, y_pred)
cm
## y_pred
## Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
# Detailed statistics via caret
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4615
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9757
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9667 1.0000 1.0000
## Specificity 1.0000 0.9811 1.0000
## Pos Pred Value 1.0000 0.9231 1.0000
## Neg Pred Value 0.9722 1.0000 1.0000
## Prevalence 0.4615 0.1846 0.3538
## Detection Rate 0.4462 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9833 0.9906 1.0000
The model estimates a conditional probability distribution for each feature separately: a probability table for the factor variables and a mean and standard deviation (assuming a Gaussian) for the numeric ones. It also reports the a-priori probabilities, which reflect the class distribution of the training data.
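To illustrate, here is a sketch of how such a model scores one test observation: the unnormalized posterior for each class is the prior times the product of the per-feature conditional densities (a Gaussian density for numeric features, a table lookup for factors). The indexing assumes e1071's stored format (rows = class; for numeric features, column 1 = mean, column 2 = sd):
x <- test.data[1, ]
prior <- classifier_cl$apriori / sum(classifier_cl$apriori)
tabs <- classifier_cl$tables
score <- function(cl) {
  prior[cl] *
    tabs$island[cl, as.character(x$island)] *  # factor: probability table lookup
    tabs$sex[cl, as.character(x$sex)] *
    dnorm(x$bill_length_mm,    tabs$bill_length_mm[cl, 1],    tabs$bill_length_mm[cl, 2]) *
    dnorm(x$bill_depth_mm,     tabs$bill_depth_mm[cl, 1],     tabs$bill_depth_mm[cl, 2]) *
    dnorm(x$flipper_length_mm, tabs$flipper_length_mm[cl, 1], tabs$flipper_length_mm[cl, 2]) *
    dnorm(x$body_mass_g,       tabs$body_mass_g[cl, 1],       tabs$body_mass_g[cl, 2])
}
sapply(c("Adelie", "Chinstrap", "Gentoo"), score)  # the largest score wins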
Reading the confusion matrix (rows are the true species, columns the predictions): all 29 Adelie and all 23 Gentoo in the test set are correctly classified, while 12 of the 13 Chinstrap are correctly classified and 1 Chinstrap is misclassified as Adelie.
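The overall accuracy follows directly from the matrix:
# Accuracy = correct predictions / total test observations
sum(diag(cm)) / sum(cm)  # (29 + 12 + 23) / 65 = 0.9846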
The model achieved 98.5% accuracy (95% CI: 0.9172 to 0.9996), with a p-value below 2.2e-16 for the accuracy exceeding the no-information rate. The per-class sensitivity, specificity, and balanced accuracy are all high, indicating a good model.
Conclusion
Comparing the accuracy of the three models, QDA and LDA perform best, correctly classifying 100% of the test observations, versus 98.5% for the Naive Bayes classifier. However, for QDA I had to remove the island variable because it caused a rank-deficiency error for the Chinstrap group, so the QDA model does not use as much of the data as LDA did. For this dataset, I therefore believe LDA is the ideal choice: it achieved 100% accuracy while being the most inclusive of the data among the three models. I also ran the model without the island and sex variables; the statistics remained the same but the accuracy was a bit lower.
Reference
https://www.geeksforgeeks.org/naive-bayes-classifier-in-r-programming/