Data Manipulation
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
##     lift
library(e1071)
library(caTools)
1. Split the data into training and testing sets.
We split the penguin dataset into a training set (80%) and a test set (20%). The data contains some NA values, which have to be omitted first.
# Load the data
penguins <- palmerpenguins::penguins
penguins <- as.data.frame(penguins)
# Omit all rows with NA values
penguins <- na.omit(penguins)
# Recode the island and sex factor levels as numbers
# (the variables stay factors; only the level labels change)
levels(penguins$island) <- c(0, 1, 2)  # Biscoe = 0, Dream = 1, Torgersen = 2
levels(penguins$sex) <- c(0, 1)        # female = 0, male = 1
# Split the data into training (80%) and test set (20%)
set.seed(120)  # Setting seed for reproducibility
training.samples <- penguins$species %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- penguins[training.samples, ]
test.data <- penguins[-training.samples, ]
2. Normalize the data. Categorical variables are automatically ignored.
LDA assumes that the predictors are normally distributed, with class-specific means and a common variance/covariance matrix across classes. We therefore center and scale the predictors before fitting the model.
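The train.transformed and test.transformed objects used below are not shown being created in the original output; here is a minimal sketch of this normalization step, assuming caret's preProcess() with centering and scaling (preProcess leaves factor columns untouched, which is why categorical variables are ignored automatically):
# Estimate centering/scaling parameters on the training set only,
# then apply the same transformation to both sets
preproc.param <- train.data %>% preProcess(method = c("center", "scale"))
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)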
Linear discriminant analysis - LDA
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
##     select
# Fit the model
model <- lda(species ~ ., data = train.transformed)
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$species)
## [1] 1
Compute LDA:
# Refit the model, dropping year from the predictors
model <- lda(species ~ . - year, data = train.transformed)
model
## Call:
## lda(species ~ . - year, data = train.transformed)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## island1 island2 bill_length_mm bill_depth_mm flipper_length_mm
## Adelie 0.3931624 0.3247863 -0.9555862 0.6041178 -0.7633521
## Chinstrap 1.0000000 0.0000000 0.8349499 0.6116721 -0.4203050
## Gentoo 0.0000000 0.0000000 0.6862640 -1.0867057 1.1711351
## body_mass_g sex1
## Adelie -0.6210872 0.4529915
## Chinstrap -0.6096684 0.4363636
## Gentoo 1.1062392 0.5520833
##
## Coefficients of linear discriminants:
## LD1 LD2
## island1 -1.0710715 -1.77820348
## island2 -1.0600222 -0.21538615
## bill_length_mm 0.5311923 -2.11473056
## bill_depth_mm -1.6661173 -0.17620747
## flipper_length_mm 1.2482407 -0.01937768
## body_mass_g 1.1955640 0.67876715
## sex1 -0.5513973 1.03207058
##
## Proportion of trace:
## LD1 LD2
## 0.8381 0.1619
- Prior probabilities of groups: the proportion of training observations in each group. For example, 43.7% of the training observations are in the Adelie group, 20.5% in the Chinstrap group, and 35.8% in the Gentoo group.
- Group means: the group centroid, i.e., the mean of each variable within each group.
- Coefficients of linear discriminants: the linear combination of the predictor variables used to form the LDA decision rule. For example, LD1 = -1.07 x island1 - 1.06 x island2 + 0.53 x bill_length_mm - 1.67 x bill_depth_mm + 1.25 x flipper_length_mm + 1.20 x body_mass_g - 0.55 x sex1. Similarly, LD2 = -1.78 x island1 - 0.22 x island2 - 2.11 x bill_length_mm - 0.18 x bill_depth_mm - 0.02 x flipper_length_mm + 0.68 x body_mass_g + 1.03 x sex1. A sketch of how these scores are computed follows below.
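To make the "linear combination" concrete, here is a small sketch reproducing the discriminant scores by hand. It assumes the default priors (the training class proportions), in which case predict() effectively centers the model matrix at its overall mean before applying the coefficients:
# Build the numeric model matrix (the factors expand to island1, island2, sex1)
X <- model.matrix(species ~ . - year, data = train.transformed)[, -1]
# Center each column at its mean, then apply the LD coefficients
scores <- scale(X, center = TRUE, scale = FALSE) %*% model$scaling
head(scores)  # should match head(predict(model)$x)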
This plot shows the training observations on the two linear discriminants of the model.
lda.data <- cbind(train.transformed, predict(model)$x)
ggplot(lda.data, aes(LD1, LD2)) +
  geom_point(aes(color = species))
Model accuracy:
You can compute the model accuracy as follows:
predictions <- model %>% predict(test.transformed)
mean(predictions$class == test.transformed$species)
## [1] 1
Our model correctly classified 100% of the test observations, which is very good.
Quadratic discriminant analysis - QDA
QDA is more flexible than LDA because the covariance matrix is allowed to differ between classes. LDA tends to perform a bit better than QDA when the training set is small, while QDA is often recommended when the training set is very large, so that the variance of the classifier is not a concern.
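As an illustration of what "class-specific covariance" means, here is a small sketch computing each species' covariance matrix of the numeric predictors (variable names assume the transformed training set from above):
# Class-specific covariance matrices of the numeric predictors;
# QDA estimates one of these per class, while LDA pools them into one
num_vars <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")
lapply(split(train.transformed[num_vars], train.transformed$species), cov)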
# Fit the model; island is removed because it caused a rank deficiency
# error for the Chinstrap group (see the check at the end of this section).
# Note: bill_length_mm appears twice in the original call (likely a typo for
# bill_depth_mm); R drops the duplicate term, so the fitted model uses
# bill_length_mm, flipper_length_mm, body_mass_g, and sex
model <- qda(species ~ bill_length_mm + bill_length_mm + flipper_length_mm + body_mass_g + sex, data = train.transformed)
model
## Call:
## qda(species ~ bill_length_mm + bill_length_mm + flipper_length_mm +
## body_mass_g + sex, data = train.transformed)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm flipper_length_mm body_mass_g sex1
## Adelie -0.9555862 -0.7633521 -0.6210872 0.4529915
## Chinstrap 0.8349499 -0.4203050 -0.6096684 0.4363636
## Gentoo 0.6862640 1.1711351 1.1062392 0.5520833
# Make predictions
predictions <- model %>% predict(test.transformed)
# Model accuracy
mean(predictions$class == test.transformed$species)
## [1] 1
As with LDA, 43.7% of the training observations are in the Adelie group, 20.5% in the Chinstrap group, and 35.8% in the Gentoo group. This model's accuracy is also 100%.
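A quick check of why island had to be dropped for QDA, consistent with the group means above (where Chinstrap has island1 = 1 and island2 = 0):
# Every Chinstrap observation is on Dream island (level 1), so the island
# dummies are constant within that class and its covariance matrix is singular
table(penguins$species, penguins$island)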
Naive Bayes Classifier
# Fit a Naive Bayes model using the same train.data and test.data from above
classifier_cl <- naiveBayes(species ~ . - year, data = train.data)
classifier_cl
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Conditional probabilities:
## island
## Y 0 1 2
## Adelie 0.3247863 0.3760684 0.2991453
## Chinstrap 0.0000000 1.0000000 0.0000000
## Gentoo 1.0000000 0.0000000 0.0000000
##
## bill_length_mm
## Y [,1] [,2]
## Adelie 38.83932 2.476196
## Chinstrap 49.18727 3.426479
## Gentoo 47.58854 3.195717
##
## bill_depth_mm
## Y [,1] [,2]
## Adelie 18.33504 1.1555948
## Chinstrap 18.46545 1.1686045
## Gentoo 14.98646 0.9857532
##
## flipper_length_mm
## Y [,1] [,2]
## Adelie 190.3846 6.508411
## Chinstrap 196.3273 6.947034
## Gentoo 217.1875 6.737659
##
## body_mass_g
## Y [,1] [,2]
## Adelie 3706.838 464.4776
## Chinstrap 3781.364 372.0320
## Gentoo 5088.542 502.8722
##
## sex
## Y 0 1
## Adelie 0.4786325 0.5213675
## Chinstrap 0.4363636 0.5636364
## Gentoo 0.4791667 0.5208333
# Predict on the test data
y_pred <- predict(classifier_cl, newdata = test.data)
# Confusion matrix
cm <- table(test.data$species, y_pred)
cm
## y_pred
## Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
# Detailed statistics via caret
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4615
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9757
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9667 1.0000 1.0000
## Specificity 1.0000 0.9811 1.0000
## Pos Pred Value 1.0000 0.9231 1.0000
## Neg Pred Value 0.9722 1.0000 1.0000
## Prevalence 0.4615 0.1846 0.3538
## Detection Rate 0.4462 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9833 0.9906 1.0000
The model estimates a conditional probability distribution for each feature separately: a probability table for the factor variables and a mean and standard deviation (assuming a Gaussian) for the numeric ones. It also reports the a-priori probabilities, which reflect the class distribution of the training data.
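To illustrate, here is a sketch of how such a model scores one test observation: the unnormalized posterior for each class is the prior times the product of the per-feature conditional densities (a Gaussian density for numeric features, a table lookup for factors). The indexing assumes e1071's stored format (rows = class; for numeric features, column 1 = mean, column 2 = sd):
x <- test.data[1, ]
prior <- classifier_cl$apriori / sum(classifier_cl$apriori)
tabs <- classifier_cl$tables
score <- function(cl) {
  prior[cl] *
    tabs$island[cl, as.character(x$island)] *  # factor: probability table lookup
    tabs$sex[cl, as.character(x$sex)] *
    dnorm(x$bill_length_mm,    tabs$bill_length_mm[cl, 1],    tabs$bill_length_mm[cl, 2]) *
    dnorm(x$bill_depth_mm,     tabs$bill_depth_mm[cl, 1],     tabs$bill_depth_mm[cl, 2]) *
    dnorm(x$flipper_length_mm, tabs$flipper_length_mm[cl, 1], tabs$flipper_length_mm[cl, 2]) *
    dnorm(x$body_mass_g,       tabs$body_mass_g[cl, 1],       tabs$body_mass_g[cl, 2])
}
sapply(c("Adelie", "Chinstrap", "Gentoo"), score)  # the largest score wins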
Reading the confusion matrix (rows are the true species, columns the predictions): all 29 Adelie and all 23 Gentoo in the test set are correctly classified, while 12 of the 13 Chinstrap are correctly classified and 1 Chinstrap is misclassified as Adelie.
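The overall accuracy follows directly from the matrix:
# Accuracy = correct predictions / total test observations
sum(diag(cm)) / sum(cm)  # (29 + 12 + 23) / 65 = 0.9846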
The model achieved 98.5% accuracy (95% CI: 0.9172 to 0.9996), with a p-value below 2.2e-16 for the accuracy exceeding the no-information rate. The per-class sensitivity, specificity, and balanced accuracy are all high, indicating a good model.
Conclusion
Comparing the accuracy of the three models, QDA and LDA perform best, correctly classifying 100% of the test observations, versus 98.5% for the Naive Bayes classifier. However, for QDA I had to remove the island variable because it caused a rank-deficiency error for the Chinstrap group, so the QDA model does not use as much of the data as LDA did. For this dataset, I therefore believe LDA is the ideal choice: it achieved 100% accuracy while being the most inclusive of the data among the three models. I also ran the model without the island and sex variables; the statistics remained the same but the accuracy was a bit lower.
Reference
https://www.geeksforgeeks.org/naive-bayes-classifier-in-r-programming/