The penguin dataset will be used for this task. It can be found here: https://allisonhorst.github.io/palmerpenguins/index.html

We will be working with the Penguin dataset again.

Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.

Using the target variable, Species, please conduct:

a. Linear Discriminant Analysis
   - Evaluate all the features (independent variables) and see what should be in your model. Please comment on your choices.
   - Just a suggestion: you might want to explore featurePlot in the caret package. Look at each of the features and see how they differ based on species; simply eye-balling this might give you an idea about which would be strong classifiers (aka predictors).
   - Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
   - Look at the fit statistics / accuracy rates.
b. Quadratic Discriminant Analysis
   - Same steps as above to consider.
c. Naïve Bayes
   - Same steps as above to consider.
d. Comment on the model fits/strengths/weaknesses/accuracy for all three models that you worked with.

Data Exploration

Before we begin analysis, we will investigate our dataset.

library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.0.5
library(kableExtra)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(visdat)
library(explore)
## Warning: package 'explore' was built under R version 4.0.5
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:explore':
## 
##     rescale01
library(recipes)
## Warning: package 'recipes' was built under R version 4.0.5
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
## 
##     step
library(caret)
## Loading required package: lattice
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(AppliedPredictiveModeling)
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity
library(naivebayes)
## naivebayes 0.9.7 loaded
setwd('C:\\Users\\user\\Documents\\Data622 HH Khan S2021 ML BigData C fm Trive')
getwd()
## [1] "C:/Users/user/Documents/Data622 HH Khan S2021 ML BigData C fm Trive"
penguins<-read.csv(file ='penguins.csv')

Let us first check the number of columns and then list their names.

length(names(penguins))
## [1] 8
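
For completeness, we can also print the column names directly:

# List the column names of the penguins data frame
names(penguins)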

The penguin dataset has 8 columns and 344 rows. A summary of each column is shown below:

summary(penguins) %>%kable() %>%  kable_paper("hover", full_width = F) %>% 
  scroll_box()
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Length:344 | Length:344 | Min. :32.10 | Min. :13.10 | Min. :172.0 | Min. :2700 | Length:344 | Min. :2007 |
| Class :character | Class :character | 1st Qu.:39.23 | 1st Qu.:15.60 | 1st Qu.:190.0 | 1st Qu.:3550 | Class :character | 1st Qu.:2007 |
| Mode :character | Mode :character | Median :44.45 | Median :17.30 | Median :197.0 | Median :4050 | Mode :character | Median :2008 |
| | | Mean :43.92 | Mean :17.15 | Mean :200.9 | Mean :4202 | | Mean :2008 |
| | | 3rd Qu.:48.50 | 3rd Qu.:18.70 | 3rd Qu.:213.0 | 3rd Qu.:4750 | | 3rd Qu.:2009 |
| | | Max. :59.60 | Max. :21.50 | Max. :231.0 | Max. :6300 | | Max. :2009 |
| | | NA's :2 | NA's :2 | NA's :2 | NA's :2 | | |

We can see that three of the variables (species, island, and sex) are categorical, and the remaining five are numerical.

Let us see the first few rows of the dataset:

head(penguins)

Let us visualize the structure of the input data graphically.

visdat::vis_dat(penguins)

There are 333 complete cases in the dataset and 19 missing values in total. Let us explore each column separately.
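
As a quick numeric cross-check of the vis_dat() output, we can count the complete cases and missing values directly:

sum(complete.cases(penguins))   # number of rows with no missing values
sum(is.na(penguins))            # total number of NA cells
colSums(is.na(penguins))        # NAs per column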

penguins %>% 
  explore(species)

The species column does not contain any NA values, so every row is labelled. Adelie, Chinstrap, and Gentoo penguins make up the population, with Adelie accounting for the largest share of the dataset.

penguins %>% 
  explore(island)

The penguins live on three islands: Torgersen, Dream, and Biscoe. Most of the penguins reside on Biscoe Island.

penguins %>% 
  explore(bill_length_mm)

The bill length column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(bill_depth_mm)

The bill depth column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(flipper_length_mm)

The flipper length column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(body_mass_g)

The body mass column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(sex)

The sex column has 11 NAs, which make up 3.2% of the observations. Slightly more than half of the recorded penguins are male.

penguins %>% 
  explore(year)

The dataset contains observations from 2007, 2008, and 2009, with the most data gathered in 2009. Since the year does not seem relevant for classifying species, we will not include it in our models.

Let us look at the relationship between each feature and the target variable, species.

penguins %>% 
  explore(island, target = species)

penguins %>% 
  explore(bill_length_mm, target = species)

penguins %>% 
  explore(bill_depth_mm, target = species)

penguins %>% 
  explore(flipper_length_mm, target = species)

penguins %>% 
  explore(body_mass_g, target = species)

penguins %>% 
  explore(sex, target = species)

penguins %>% 
  explore(year, target = species)

The data already shows some clear trends. Flipper length (flipper_length_mm) separates Gentoo from the other two species, while bill length (bill_length_mm) distinguishes Adelie from Chinstrap. We can also see that Chinstrap and Gentoo are found on different islands. Let us now take a closer look at these variables:

penguins %>% 
  explore(bill_depth_mm, flipper_length_mm, target = species)

Based on flipper length and bill depth, we can clearly separate the Gentoo penguins.

penguins %>% 
  explore(flipper_length_mm, bill_length_mm, target = species)

The plot demonstrates a decent, if not perfect, separation of the three species!

Data Cleansing

We exclude the rows that are missing most of their values, since they do not contain enough information about the observation to draw conclusions from. This is accomplished by setting a threshold of 50%: if at least half of the columns are NA for an observation, the row is dropped from the dataset. The table below shows the two observations that were omitted, one Adelie and one Gentoo. They were dropped because no information other than island and year was available for them.

flag <-apply(penguins, MARGIN= 1, function(y) sum(length(which(is.na(y)))))<4
data<-penguins[flag,]
penguins[!flag,] %>%kable() %>%  kable_paper("hover", full_width = F) %>% 
  scroll_box()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 4 | Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| 272 | Gentoo | Biscoe | NA | NA | NA | NA | NA | 2009 |

The distribution of penguin species is shown in the table below. After removing those rows, we have 151 Adelie penguins, 68 Chinstrap penguins, and 123 Gentoo penguins.

kable(transform(as.data.frame(table(data$species)), percentage_column = Freq / nrow(data) * 100),col.names = c("Species","Occurrence","Percentage")) %>%  kable_paper("hover", full_width = F) %>% 
  scroll_box()

| Species | Occurrence | Percentage |
|---|---|---|
| Adelie | 151 | 44.15205 |
| Chinstrap | 68 | 19.88304 |
| Gentoo | 123 | 35.96491 |

From previewing the numerical variables, we can see that the attributes roughly follow a Gaussian distribution within each species. Flipper length has a bimodal distribution overall, most likely because it mixes distinct penguin species. The distribution of the dataset with the three penguin groups is shown below. There appear to be substantial differences between the classes, especially when comparing bill length against any of the other numeric variables.
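
As a quick visual check of these claims (ggplot2 is already loaded as a dependency of GGally), we can sketch the per-species density of flipper length, which is bimodal overall but roughly Gaussian within each species:

# Density of flipper length, coloured by species
ggplot(data, aes(x = flipper_length_mm, fill = species)) +
  geom_density(alpha = 0.4) +
  labs(title = "Flipper length density by species")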

Furthermore, we can build ellipse plots of the pairwise distributions of every numerical attribute using the featurePlot function from the caret package. These show very little overlap between the species whenever bill length is paired with another variable, so bill length should be a strong descriptive attribute in the model and should lead to very accurate predictions. Other pairwise comparisons show considerable overlap, especially between the Adelie and Chinstrap species.

Let us draw the scatterplot matrix of the columns species, bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.

d = data[,c(1,3,4,5,6)]
ggpairs(d,progress = F, aes(color = species, alpha = 0.2), lower=list(combo=wrap("facethist",  
binwidth=0.5)))

Now let us draw the feature plots with caret::featurePlot.

# Feature plots adapted from https://rpubs.com/dhairavc/735867 (Dhairav Chhatbar)
# Note: the ellipse plot requires the 'ellipse' package to be installed.
d1 <- data %>% 
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# Ellipse plot: pairwise scatterplots with a confidence ellipse per species
transparentTheme(set = TRUE, pchSize = 1, trans = 0.4)
caret::featurePlot(x = d1,
                   y = as.factor(data$species),
                   plot = "ellipse",
                   auto.key = list(columns = 3))

# Density plot: per-species density of each numeric predictor
transparentTheme(trans = .9)
caret::featurePlot(x = d1,
                   y = as.factor(data$species),
                   plot = "density",
                   scales = list(x = list(relation = "free"),
                                 y = list(relation = "free")),
                   adjust = 1.5,
                   pch = "|",
                   layout = c(4, 1),
                   auto.key = list(columns = 4))

From the plots we can see that Gentoo penguins differ measurably from Adelie and Chinstrap in flipper length, body mass, and bill depth, while Adelie differs measurably from the other two species in bill length. We can therefore anticipate that Gentoo will be isolated very reliably.

Preprocessing and Splitting Training and Testing Data

Now let us impute the remaining missing values and remove the year and island columns.

# install.packages('recipes')
library(recipes)
res <- 
  recipe(species ~ ., data = data) %>% 
  recipes::step_impute_median(all_numeric()) %>%
  recipes::step_impute_mode(all_nominal(), -all_outcomes()) %>%
  step_rm(island) %>% 
  step_rm(year) %>% 
  prep()

Next, we divide the dataset into two parts using caret's createDataPartition() function, with 80% of the observations for training and 20% for testing.

library(caret)
set.seed(2021)
idx <- createDataPartition(data$species, p = .8, list = FALSE, times = 1)
train <- data[idx,]
test <- data[-idx,]

test<-bake(res, new_data = test)

train<-bake(res, new_data = train)
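
As a quick sanity check (not part of the original workflow): createDataPartition() stratifies on the outcome, so the species proportions in the two splits should be very close.

round(prop.table(table(train$species)), 3)   # class proportions in the training set
round(prop.table(table(test$species)), 3)    # class proportions in the test set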

Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a statistical technique for identifying linear combinations of features that characterize or separate two or more classes of objects or events. It is similar to PCA (principal component analysis) in that it seeks linear combinations of variables that best explain the data, but unlike PCA it explicitly models the differences between the classes.

When the classes are well separated, LDA is a stable method that does not suffer from the instability logistic regression can show in that setting. LDA is also popular when there are more than two response classes, as here, because it provides low-dimensional representations of the data.

(model1 <- lda(species~. ,data = train))
## Call:
## lda(species ~ ., data = train)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
##      0.44      0.20      0.36 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.99091      18.40165          190.2479    3725.826 0.5371901
## Chinstrap       48.90364      18.42182          195.9455    3746.364 0.4909091
## Gentoo          47.53030      15.04141          217.2222    5083.081 0.5454545
## 
## Coefficients of linear discriminants:
##                            LD1          LD2
## bill_length_mm     0.096716664 -0.423419176
## bill_depth_mm     -1.035629740 -0.135038882
## flipper_length_mm  0.072128261  0.013457632
## body_mass_g        0.001681983  0.001420329
## sexmale           -0.276444360  0.732927894
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8657 0.1343
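
To make the idea of "linear combinations of features" concrete, the LD scores plotted below can be reproduced by hand from the coefficients printed above. This is only an illustrative sketch and assumes the usual MASS convention of centering at the prior-weighted grand mean:

# Reconstruct the LD1/LD2 scores manually: center the predictors at the
# prior-weighted grand mean and project onto the discriminant coefficients.
X   <- model.matrix(species ~ ., data = train)[, -1]   # numeric predictor matrix (incl. sex dummy)
ctr <- colSums(model1$prior * model1$means)            # prior-weighted grand mean
manual_scores <- scale(X, center = ctr, scale = FALSE) %*% model1$scaling
head(manual_scores)   # should match head(predict(model1, train)$x)
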
trainingPredictions <- predict(model1, train)

testingPredictions <- predict(model1, test)

trainingPredictions <- cbind(train, trainingPredictions) %>% mutate(group = "Train")

testingPredictions <- cbind(test, testingPredictions) %>% mutate(group = "Test")

ldaResult <- rbind(trainingPredictions, testingPredictions)

ggplot(ldaResult) + geom_point(aes(x.LD1, x.LD2, colour = species))+ labs(x = "LD1", y="LD2") + facet_grid(.~group)

To check these predictions, we compare them against the actual species. The confusion matrix below shows that in the training group we only misclassify two Chinstrap penguins as Adelie.

rTrain <- ldaResult %>% filter(group == "Train") %>% conf_mat(truth = species,estimate = class)
rTest <- ldaResult %>% filter(group == "Test") %>% conf_mat(truth = species,estimate = class)
autoplot(rTrain, type = "heatmap")  +
  labs(title = "LDA Train Group Confusion Matrix")

All observations in the testing group are classified correctly.

autoplot(rTest, type = "heatmap")  +
  labs(title = "LDA Test Group Confusion Matrix")

The kappa statistic adjusts the raw accuracy for the agreement expected by chance, giving a more honest estimate of how well the model performs.
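
To see where the kappa value comes from, here is a minimal by-hand version of the calculation from the LDA training confusion matrix (yardstick::kap computes this for us; the sketch is only for intuition):

tab <- rTrain$table                                     # LDA training confusion matrix
p_o <- sum(diag(tab)) / sum(tab)                        # observed accuracy
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2    # accuracy expected by chance
(p_o - p_e) / (1 - p_e)                                 # Cohen's kappa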

ldaResult %>% group_by(group) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% kable_paper("hover", full_width = F) %>% 
  scroll_box()
| group | .metric | .estimator | .estimate |
|---|---|---|---|
| Test | accuracy | multiclass | 1.0000000 |
| Train | accuracy | multiclass | 0.9927273 |
| Test | kap | multiclass | 1.0000000 |
| Train | kap | multiclass | 0.9885479 |

Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) is a statistical classifier that separates observations from two or more classes using a quadratic decision surface. Unlike LDA, which assumes a common covariance matrix across classes, QDA estimates a separate covariance matrix for each class.
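
A quick, purely illustrative way to see why the class-specific covariance assumption can matter is to compare the covariance matrices of the numeric predictors within each species (this check is not part of the original analysis):

# Class-specific covariance matrices; noticeable differences between classes
# are what can make QDA preferable to LDA, which pools them into one matrix.
num_vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
lapply(split(train[, num_vars], train$species), cov)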

(model2 <- qda(species~. ,data = train))
## Call:
## qda(species ~ ., data = train)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
##      0.44      0.20      0.36 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.99091      18.40165          190.2479    3725.826 0.5371901
## Chinstrap       48.90364      18.42182          195.9455    3746.364 0.4909091
## Gentoo          47.53030      15.04141          217.2222    5083.081 0.5454545

Since QDA does not produce linear discriminant scores, we cannot visualize its separation on LD1/LD2 axes as we did for LDA. QDA's performance matches LDA's, with the same small number of Chinstrap penguins misclassified as Adelie in the training set. The accuracy and kappa metrics shown below are identical to those of LDA.

m2trainingPredictions <- predict(model2, train)
m2testingPredictions <- predict(model2, test)
m2trainingPredictions <- cbind(train, m2trainingPredictions) %>% mutate(group = "Train")
m2testingPredictions <- cbind(test, m2testingPredictions) %>% mutate(group = "Test")
qdaResult <- rbind(m2trainingPredictions, m2testingPredictions)
rTrain <- qdaResult %>% filter(group == "Train") %>% conf_mat(truth = species,estimate = class)
rTest <- qdaResult %>% filter(group == "Test") %>% conf_mat(truth = species,estimate = class)
autoplot(rTrain, type = "heatmap")  +
  labs(title = "QDA Train Group Confusion Matrix")

autoplot(rTest, type = "heatmap")  +
  labs(title = "QDA Test Group Confusion Matrix")

qdaResult %>% group_by(group) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% kable_paper("hover", full_width = F) %>% 
  scroll_box()
| group | .metric | .estimator | .estimate |
|---|---|---|---|
| Test | accuracy | multiclass | 1.0000000 |
| Train | accuracy | multiclass | 0.9927273 |
| Test | kap | multiclass | 1.0000000 |
| Train | kap | multiclass | 0.9885479 |

Naive Bayes

Naive Bayes classifiers are simple probabilistic classifiers that apply Bayes' theorem with strong (naive) independence assumptions between the features. They are highly scalable, requiring a number of parameters that grows only linearly with the number of attributes.

(model3<-naivebayes::naive_bayes(species~. ,data = train, type = "class"))
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = species ~ ., data = train, type = "class")
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##    Adelie Chinstrap    Gentoo 
##      0.44      0.20      0.36 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##               
## bill_length_mm    Adelie Chinstrap    Gentoo
##           mean 38.990909 48.903636 47.530303
##           sd    2.564956  3.471949  3.128815
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_depth_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##              
## bill_depth_mm     Adelie  Chinstrap     Gentoo
##          mean 18.4016529 18.4218182 15.0414141
##          sd    1.2148377  1.0877586  0.9664801
## 
## --------------------------------------------------------------------------------- 
##  ::: flipper_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##                  
## flipper_length_mm     Adelie  Chinstrap     Gentoo
##              mean 190.247934 195.945455 217.222222
##              sd     6.717243   7.263348   6.592444
## 
## --------------------------------------------------------------------------------- 
##  ::: body_mass_g (Gaussian) 
## --------------------------------------------------------------------------------- 
##            
## body_mass_g    Adelie Chinstrap    Gentoo
##        mean 3725.8264 3746.3636 5083.0808
##        sd    437.7967  381.1052  489.8176
## 
## --------------------------------------------------------------------------------- 
##  ::: sex (Bernoulli) 
## --------------------------------------------------------------------------------- 
##         
## sex         Adelie Chinstrap    Gentoo
##   female 0.4628099 0.5090909 0.4545455
##   male   0.5371901 0.4909091 0.5454545
## 
## ---------------------------------------------------------------------------------

The plots of the fitted class-conditional distributions show, for example, that given a bill length below about 42 mm, the penguin is very likely an Adelie.

plot(model3)
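
To make this concrete, we can score a single hypothetical penguin and inspect the class posteriors. The measurement values below are invented purely for illustration:

# A made-up observation with a short bill; the posterior should favour Adelie.
new_obs <- data.frame(
  bill_length_mm    = 40,
  bill_depth_mm     = 18,
  flipper_length_mm = 190,
  body_mass_g       = 3700,
  sex               = factor("female", levels = c("female", "male"))
)
predict(model3, new_obs, type = "prob")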

m3trainingPredion <-  as.factor(predict(model3, train))
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
confusionMatrix(m3trainingPredion, train$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       117         4      0
##   Chinstrap      4        51      0
##   Gentoo         0         0     99
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9709          
##                  95% CI : (0.9435, 0.9874)
##     No Information Rate : 0.44            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9543          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9669           0.9273          1.00
## Specificity                 0.9740           0.9818          1.00
## Pos Pred Value              0.9669           0.9273          1.00
## Neg Pred Value              0.9740           0.9818          1.00
## Prevalence                  0.4400           0.2000          0.36
## Detection Rate              0.4255           0.1855          0.36
## Detection Prevalence        0.4400           0.2000          0.36
## Balanced Accuracy           0.9705           0.9545          1.00

Conclusion

This summary compares the three modelling approaches primarily on their accuracy on the holdout data. All three models performed very well, with LDA and QDA faring the best and Naive Bayes slightly worse. The classes were quite well separated, which likely contributed to the strong LDA and QDA results. Naive Bayes relies on the assumption that the features are independent, which is clearly violated here (for example, body mass and flipper length are strongly correlated) and can explain its slightly lower accuracy.