The penguin dataset will be used for this task. It can be found here: https://allisonhorst.github.io/palmerpenguins/index.html

We will be working with the Penguin dataset again.

Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.

Using the target variable, Species, please conduct:

a. Linear Discriminant Analysis
   - Evaluate all the features (independent variables) and see what should be in your model. Please comment on your choices.
   - Just a suggestion: you might want to explore featurePlot in the caret package. Look at each of the features and see how they differ based on species; simply eye-balling this might give you an idea about which would be strong classifiers (aka predictors).
   - Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
   - Look at the fit statistics / accuracy rates.
b. Quadratic Discriminant Analysis
   - Same steps as above to consider.
c. Naïve Bayes
   - Same steps as above to consider.
d. Comment on the model fits/strengths/weaknesses/accuracy for all three models that you worked with.

Data Exploration

Before we begin analysis, we will investigate our dataset.

library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.0.5
library(kableExtra)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(visdat)
library(explore)
## Warning: package 'explore' was built under R version 4.0.5
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:explore':
## 
##     rescale01
library(recipes)
## Warning: package 'recipes' was built under R version 4.0.5
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
## 
##     step
library(caret)
## Loading required package: lattice
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(AppliedPredictiveModeling)
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity
library(naivebayes)
## naivebayes 0.9.7 loaded
setwd('C:\\Users\\user\\Documents\\Data622 HH Khan S2021 ML BigData C fm Trive')
getwd()
## [1] "C:/Users/user/Documents/Data622 HH Khan S2021 ML BigData C fm Trive"
penguins<-read.csv(file ='penguins.csv')

Let us first check the number of columns and then list their names.

length(names(penguins))
## [1] 8
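
For completeness, we can also print the column names directly:

# List the column names of the penguins data frame
names(penguins)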

The penguin dataset has 8 columns and 344 rows. A summary of each column is shown below:

summary(penguins) %>%kable() %>%  kable_paper("hover", full_width = F) %>% 
  scroll_box()
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Length:344 | Length:344 | Min. :32.10 | Min. :13.10 | Min. :172.0 | Min. :2700 | Length:344 | Min. :2007 |
| Class :character | Class :character | 1st Qu.:39.23 | 1st Qu.:15.60 | 1st Qu.:190.0 | 1st Qu.:3550 | Class :character | 1st Qu.:2007 |
| Mode :character | Mode :character | Median :44.45 | Median :17.30 | Median :197.0 | Median :4050 | Mode :character | Median :2008 |
| | | Mean :43.92 | Mean :17.15 | Mean :200.9 | Mean :4202 | | Mean :2008 |
| | | 3rd Qu.:48.50 | 3rd Qu.:18.70 | 3rd Qu.:213.0 | 3rd Qu.:4750 | | 3rd Qu.:2009 |
| | | Max. :59.60 | Max. :21.50 | Max. :231.0 | Max. :6300 | | Max. :2009 |
| | | NA's :2 | NA's :2 | NA's :2 | NA's :2 | | |

We can see that three of the variables (species, island, and sex) are categorical, and the remaining five are numerical.

Let us see the first few rows of the dataset:

head(penguins)

Let us visualize the structure of the input data graphically.

visdat::vis_dat(penguins)

There are 333 complete cases in the dataset and 19 missing values in total. Let us explore each column separately.
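
As a quick numeric cross-check of the vis_dat() output, we can count the complete cases and missing values directly:

sum(complete.cases(penguins))   # number of rows with no missing values
sum(is.na(penguins))            # total number of NA cells
colSums(is.na(penguins))        # NAs per column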

penguins %>% 
  explore(species)

The species column does not contain any NA values, so every row is labelled. Adelie, Chinstrap, and Gentoo penguins make up the population, with Adelie accounting for the largest share of the dataset.

penguins %>% 
  explore(island)

The penguins live on three islands: Torgersen, Dream, and Biscoe. Most of the penguins reside on Biscoe Island.

penguins %>% 
  explore(bill_length_mm)

The bill length column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(bill_depth_mm)

The bill depth column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(flipper_length_mm)

The flipper length column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(body_mass_g)

The body mass column has 2 NAs, which make up 0.6% of the observations.

penguins %>% 
  explore(sex)

The sex column has 11 NAs, which make up 3.2% of the observations. Slightly more than half of the recorded penguins are male.

penguins %>% 
  explore(year)

The dataset contains observations from 2007, 2008, and 2009, with the most data gathered in 2009. Since the year does not seem relevant for classifying species, we will not include it in our models.

Let us look at the relationship between each feature and the target variable, species.

penguins %>% 
  explore(island, target = species)

penguins %>% 
  explore(bill_length_mm, target = species)

penguins %>% 
  explore(bill_depth_mm, target = species)

penguins %>% 
  explore(flipper_length_mm, target = species)

penguins %>% 
  explore(body_mass_g, target = species)

penguins %>% 
  explore(sex, target = species)

penguins %>% 
  explore(year, target = species)

The data already shows some clear trends. Flipper length (flipper_length_mm) separates Gentoo from the other two species, while bill length (bill_length_mm) distinguishes Adelie from Chinstrap. We can also see that Chinstrap and Gentoo are found on different islands. Let us now take a closer look at these variables:

penguins %>% 
  explore(bill_depth_mm, flipper_length_mm, target = species)

Based on flipper length and bill depth, we can clearly separate the Gentoo penguins.

penguins %>% 
  explore(flipper_length_mm, bill_length_mm, target = species)

The plot demonstrates a decent, if not perfect, separation of the three species!

Data Cleansing

We exclude the rows that are missing most of their values, since they do not contain enough information about the observation to draw conclusions from. This is accomplished by setting a threshold of 50%: if at least half of the columns are NA for an observation, the row is dropped from the dataset. The table below shows the two observations that were omitted, one Adelie and one Gentoo. They were dropped because no information other than island and year was available for them.

flag <-apply(penguins, MARGIN= 1, function(y) sum(length(which(is.na(y)))))<4
data<-penguins[flag,]
penguins[!flag,] %>%kable() %>%  kable_paper("hover", full_width = F) %>% 
  scroll_box()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 4 | Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| 272 | Gentoo | Biscoe | NA | NA | NA | NA | NA | 2009 |

The distribution of penguin species is shown in the table below. After removing those rows, we have 151 Adelie penguins, 68 Chinstrap penguins, and 123 Gentoo penguins.

kable(transform(as.data.frame(table(data$species)), percentage_column = Freq / nrow(data) * 100),col.names = c("Species","Occurrence","Percentage")) %>%  kable_paper("hover", full_width = F) %>% 
  scroll_box()

| Species | Occurrence | Percentage |
|---|---|---|
| Adelie | 151 | 44.15205 |
| Chinstrap | 68 | 19.88304 |
| Gentoo | 123 | 35.96491 |

From previewing the numerical variables, we can see that the attributes roughly follow a Gaussian distribution within each species. Flipper length has a bimodal distribution overall, most likely because it mixes distinct penguin species. The distribution of the dataset with the three penguin groups is shown below. There appear to be substantial differences between the classes, especially when comparing bill length against any of the other numeric variables.
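
As a quick visual check of these claims (ggplot2 is already loaded as a dependency of GGally), we can sketch the per-species density of flipper length, which is bimodal overall but roughly Gaussian within each species:

# Density of flipper length, coloured by species
ggplot(data, aes(x = flipper_length_mm, fill = species)) +
  geom_density(alpha = 0.4) +
  labs(title = "Flipper length density by species")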

Furthermore, we can build ellipse plots of the pairwise distributions of every numerical attribute using the featurePlot function from the caret package. These show very little overlap between the species whenever bill length is paired with another variable, so bill length should be a strong descriptive attribute in the model and should lead to very accurate predictions. Other pairwise comparisons show considerable overlap, especially between the Adelie and Chinstrap species.

Let us draw the scatterplot matrix of the columns species, bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.

d = data[,c(1,3,4,5,6)]
ggpairs(d,progress = F, aes(color = species, alpha = 0.2), lower=list(combo=wrap("facethist",  
binwidth=0.5)))

Now let us draw the feature plots with caret::featurePlot.

# Feature plots adapted from https://rpubs.com/dhairavc/735867 (Dhairav Chhatbar)
# Note: the ellipse plot requires the 'ellipse' package to be installed.
d1 <- data %>% 
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# Ellipse plot: pairwise scatterplots with a confidence ellipse per species
transparentTheme(set = TRUE, pchSize = 1, trans = 0.4)
caret::featurePlot(x = d1,
                   y = as.factor(data$species),
                   plot = "ellipse",
                   auto.key = list(columns = 3))

# Density plot: per-species density of each numeric predictor
transparentTheme(trans = .9)
caret::featurePlot(x = d1,
                   y = as.factor(data$species),
                   plot = "density",
                   scales = list(x = list(relation = "free"),
                                 y = list(relation = "free")),
                   adjust = 1.5,
                   pch = "|",
                   layout = c(4, 1),
                   auto.key = list(columns = 4))

From the plots we can see that Gentoo penguins differ measurably from Adelie and Chinstrap in flipper length, body mass, and bill depth, while Adelie differs measurably from the other two species in bill length. We can therefore anticipate that Gentoo will be isolated very reliably.

Preprocessing and Splitting Training and Testing Data

Now let us impute the remaining missing values and remove the year and island columns.

# install.packages('recipes')
library(recipes)
res <- 
  recipe(species ~ ., data = data) %>% 
  recipes::step_impute_median(all_numeric()) %>%
  recipes::step_impute_mode(all_nominal(), -all_outcomes()) %>%
  step_rm(island) %>% 
  step_rm(year) %>% 
  prep()

Next, we divide the dataset into two parts using caret's createDataPartition() function, with 80% of the observations for training and 20% for testing.

library(caret)
set.seed(2021)
idx <- createDataPartition(data$species, p = .8, list = FALSE, times = 1)
train <- data[idx,]
test <- data[-idx,]

test<-bake(res, new_data = test)

train<-bake(res, new_data = train)
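
As a quick sanity check (not part of the original workflow): createDataPartition() stratifies on the outcome, so the species proportions in the two splits should be very close.

round(prop.table(table(train$species)), 3)   # class proportions in the training set
round(prop.table(table(test$species)), 3)    # class proportions in the test set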

Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a statistical technique for identifying linear combinations of features that characterize or separate two or more classes of objects or events. It is similar to PCA (principal component analysis) in that it seeks linear combinations of variables that best explain the data, but unlike PCA it explicitly models the differences between the classes.

When the classes are well separated, LDA is a stable method that does not suffer from the instability logistic regression can show in that setting. LDA is also popular when there are more than two response classes, as here, because it provides low-dimensional representations of the data.

(model1 <- lda(species~. ,data = train))
## Call:
## lda(species ~ ., data = train)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
##      0.44      0.20      0.36 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.99091      18.40165          190.2479    3725.826 0.5371901
## Chinstrap       48.90364      18.42182          195.9455    3746.364 0.4909091
## Gentoo          47.53030      15.04141          217.2222    5083.081 0.5454545
## 
## Coefficients of linear discriminants:
##                            LD1          LD2
## bill_length_mm     0.096716664 -0.423419176
## bill_depth_mm     -1.035629740 -0.135038882
## flipper_length_mm  0.072128261  0.013457632
## body_mass_g        0.001681983  0.001420329
## sexmale           -0.276444360  0.732927894
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8657 0.1343
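
To make the idea of "linear combinations of features" concrete, the LD scores plotted below can be reproduced by hand from the coefficients printed above. This is only an illustrative sketch and assumes the usual MASS convention of centering at the prior-weighted grand mean:

# Reconstruct the LD1/LD2 scores manually: center the predictors at the
# prior-weighted grand mean and project onto the discriminant coefficients.
X   <- model.matrix(species ~ ., data = train)[, -1]   # numeric predictor matrix (incl. sex dummy)
ctr <- colSums(model1$prior * model1$means)            # prior-weighted grand mean
manual_scores <- scale(X, center = ctr, scale = FALSE) %*% model1$scaling
head(manual_scores)   # should match head(predict(model1, train)$x)
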
trainingPredictions <- predict(model1, train)

testingPredictions <- predict(model1, test)

trainingPredictions <- cbind(train, trainingPredictions) %>% mutate(group = "Train")

testingPredictions <- cbind(test, testingPredictions) %>% mutate(group = "Test")

ldaResult <- rbind(trainingPredictions, testingPredictions)

ggplot(ldaResult) + geom_point(aes(x.LD1, x.LD2, colour = species))+ labs(x = "LD1", y="LD2") + facet_grid(.~group)

To check these predictions, we compare them against the actual species. The confusion matrix below shows that in the training group we only misclassify two Chinstrap penguins as Adelie.

rTrain <- ldaResult %>% filter(group == "Train") %>% conf_mat(truth = species,estimate = class)
rTest <- ldaResult %>% filter(group == "Test") %>% conf_mat(truth = species,estimate = class)
autoplot(rTrain, type = "heatmap")  +
  labs(title = "LDA Train Group Confusion Matrix")

All observations in the testing group are classified correctly.

autoplot(rTest, type = "heatmap")  +
  labs(title = "LDA Test Group Confusion Matrix")

The kappa statistic adjusts the raw accuracy for the agreement expected by chance, giving a more honest estimate of how well the model performs.
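
To see where the kappa value comes from, here is a minimal by-hand version of the calculation from the LDA training confusion matrix (yardstick::kap computes this for us; the sketch is only for intuition):

tab <- rTrain$table                                     # LDA training confusion matrix
p_o <- sum(diag(tab)) / sum(tab)                        # observed accuracy
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2    # accuracy expected by chance
(p_o - p_e) / (1 - p_e)                                 # Cohen's kappa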

ldaResult %>% group_by(group) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% kable_paper("hover", full_width = F) %>% 
  scroll_box()
| group | .metric | .estimator | .estimate |
|---|---|---|---|
| Test | accuracy | multiclass | 1.0000000 |
| Train | accuracy | multiclass | 0.9927273 |
| Test | kap | multiclass | 1.0000000 |
| Train | kap | multiclass | 0.9885479 |

Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) is a statistical classifier that separates observations from two or more classes using a quadratic decision surface. Unlike LDA, which assumes a common covariance matrix across classes, QDA estimates a separate covariance matrix for each class.
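
A quick, purely illustrative way to see why the class-specific covariance assumption can matter is to compare the covariance matrices of the numeric predictors within each species (this check is not part of the original analysis):

# Class-specific covariance matrices; noticeable differences between classes
# are what can make QDA preferable to LDA, which pools them into one matrix.
num_vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
lapply(split(train[, num_vars], train$species), cov)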

(model2 <- qda(species~. ,data = train))
## Call:
## qda(species ~ ., data = train)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
##      0.44      0.20      0.36 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.99091      18.40165          190.2479    3725.826 0.5371901
## Chinstrap       48.90364      18.42182          195.9455    3746.364 0.4909091
## Gentoo          47.53030      15.04141          217.2222    5083.081 0.5454545

Since QDA does not produce linear discriminant scores, we cannot visualize its separation on LD1/LD2 axes as we did for LDA. QDA's performance matches LDA's, with the same small number of Chinstrap penguins misclassified as Adelie in the training set. The accuracy and kappa metrics shown below are identical to those of LDA.

m2trainingPredictions <- predict(model2, train)
m2testingPredictions <- predict(model2, test)
m2trainingPredictions <- cbind(train, m2trainingPredictions) %>% mutate(group = "Train")
m2testingPredictions <- cbind(test, m2testingPredictions) %>% mutate(group = "Test")
qdaResult <- rbind(m2trainingPredictions, m2testingPredictions)
rTrain <- qdaResult %>% filter(group == "Train") %>% conf_mat(truth = species,estimate = class)
rTest <- qdaResult %>% filter(group == "Test") %>% conf_mat(truth = species,estimate = class)
autoplot(rTrain, type = "heatmap")  +
  labs(title = "QDA Train Group Confusion Matrix")

autoplot(rTest, type = "heatmap")  +
  labs(title = "QDA Test Group Confusion Matrix")

qdaResult %>% group_by(group) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% kable_paper("hover", full_width = F) %>% 
  scroll_box()
| group | .metric | .estimator | .estimate |
|---|---|---|---|
| Test | accuracy | multiclass | 1.0000000 |
| Train | accuracy | multiclass | 0.9927273 |
| Test | kap | multiclass | 1.0000000 |
| Train | kap | multiclass | 0.9885479 |

Naive Bayes

Naive Bayes classifiers are simple probabilistic classifiers that apply Bayes' theorem with strong (naive) independence assumptions between the features. They are highly scalable, requiring a number of parameters that grows only linearly with the number of attributes.

(model3<-naivebayes::naive_bayes(species~. ,data = train, type = "class"))
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = species ~ ., data = train, type = "class")
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##    Adelie Chinstrap    Gentoo 
##      0.44      0.20      0.36 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##               
## bill_length_mm    Adelie Chinstrap    Gentoo
##           mean 38.990909 48.903636 47.530303
##           sd    2.564956  3.471949  3.128815
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_depth_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##              
## bill_depth_mm     Adelie  Chinstrap     Gentoo
##          mean 18.4016529 18.4218182 15.0414141
##          sd    1.2148377  1.0877586  0.9664801
## 
## --------------------------------------------------------------------------------- 
##  ::: flipper_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##                  
## flipper_length_mm     Adelie  Chinstrap     Gentoo
##              mean 190.247934 195.945455 217.222222
##              sd     6.717243   7.263348   6.592444
## 
## --------------------------------------------------------------------------------- 
##  ::: body_mass_g (Gaussian) 
## --------------------------------------------------------------------------------- 
##            
## body_mass_g    Adelie Chinstrap    Gentoo
##        mean 3725.8264 3746.3636 5083.0808
##        sd    437.7967  381.1052  489.8176
## 
## --------------------------------------------------------------------------------- 
##  ::: sex (Bernoulli) 
## --------------------------------------------------------------------------------- 
##         
## sex         Adelie Chinstrap    Gentoo
##   female 0.4628099 0.5090909 0.4545455
##   male   0.5371901 0.4909091 0.5454545
## 
## ---------------------------------------------------------------------------------

The plots of the fitted class-conditional distributions show, for example, that given a bill length below about 42 mm, the penguin is very likely an Adelie.

plot(model3)
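
To make this concrete, we can score a single hypothetical penguin and inspect the class posteriors. The measurement values below are invented purely for illustration:

# A made-up observation with a short bill; the posterior should favour Adelie.
new_obs <- data.frame(
  bill_length_mm    = 40,
  bill_depth_mm     = 18,
  flipper_length_mm = 190,
  body_mass_g       = 3700,
  sex               = factor("female", levels = c("female", "male"))
)
predict(model3, new_obs, type = "prob")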

m3trainingPredion <-  as.factor(predict(model3, train))
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
confusionMatrix(m3trainingPredion, train$species)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie       117         4      0
##   Chinstrap      4        51      0
##   Gentoo         0         0     99
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9709          
##                  95% CI : (0.9435, 0.9874)
##     No Information Rate : 0.44            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9543          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9669           0.9273          1.00
## Specificity                 0.9740           0.9818          1.00
## Pos Pred Value              0.9669           0.9273          1.00
## Neg Pred Value              0.9740           0.9818          1.00
## Prevalence                  0.4400           0.2000          0.36
## Detection Rate              0.4255           0.1855          0.36
## Detection Prevalence        0.4400           0.2000          0.36
## Balanced Accuracy           0.9705           0.9545          1.00

Conclusion

This summary compares the three modelling approaches primarily on their accuracy on the holdout data. All three models performed very well, with LDA and QDA faring the best and Naive Bayes slightly worse. The classes were quite well separated, which likely contributed to the strong LDA and QDA results. Naive Bayes relies on the assumption that the features are independent, which is clearly violated here (for example, body mass and flipper length are strongly correlated) and can explain its slightly lower accuracy.