We will be working with the Palmer Penguins dataset again for this task. It can be found here: https://allisonhorst.github.io/palmerpenguins/index.html
Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.
Using the target variable, species, please conduct:

a. Linear Discriminant Analysis:
   1. Evaluate all the 'features' (predictor variables) and decide what should be in your model. Please comment on your choices.
   2. Just a suggestion: consider exploring featurePlot in the caret package. Look at each of the features and see how they differ by species; simply eyeballing this might give you an idea of which would be strong 'classifiers' (i.e., predictors).
   3. Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
   4. Look at the fit statistics / accuracy rates.
b. Quadratic Discriminant Analysis: same steps as above.
c. Naïve Bayes: same steps as above.
d. Comment on the fit, strengths, weaknesses, and accuracy of all three models.
Before we begin analysis, we will investigate our dataset.
library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.0.5
library(kableExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(visdat)
library(explore)
## Warning: package 'explore' was built under R version 4.0.5
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:explore':
##
## rescale01
library(recipes)
## Warning: package 'recipes' was built under R version 4.0.5
##
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
##
## step
library(caret)
## Loading required package: lattice
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(AppliedPredictiveModeling)
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
library(naivebayes)
## naivebayes 0.9.7 loaded
setwd('C:\\Users\\user\\Documents\\Data622 HH Khan S2021 ML BigData C fm Trive')
getwd()
## [1] "C:/Users/user/Documents/Data622 HH Khan S2021 ML BigData C fm Trive"
penguins <- read.csv(file = 'penguins.csv')
Let us first check the number of columns in the dataset:
length(names(penguins))
## [1] 8
The penguin dataset has 8 columns and 344 rows. Here is a summary:
summary(penguins) %>%kable() %>% kable_paper("hover", full_width = F) %>%
scroll_box()
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Length:344 | Length:344 | Min. :32.10 | Min. :13.10 | Min. :172.0 | Min. :2700 | Length:344 | Min. :2007 |
| Class :character | Class :character | 1st Qu.:39.23 | 1st Qu.:15.60 | 1st Qu.:190.0 | 1st Qu.:3550 | Class :character | 1st Qu.:2007 |
| Mode :character | Mode :character | Median :44.45 | Median :17.30 | Median :197.0 | Median :4050 | Mode :character | Median :2008 |
|  |  | Mean :43.92 | Mean :17.15 | Mean :200.9 | Mean :4202 |  | Mean :2008 |
|  |  | 3rd Qu.:48.50 | 3rd Qu.:18.70 | 3rd Qu.:213.0 | 3rd Qu.:4750 |  | 3rd Qu.:2009 |
|  |  | Max. :59.60 | Max. :21.50 | Max. :231.0 | Max. :6300 |  | Max. :2009 |
|  |  | NA's :2 | NA's :2 | NA's :2 | NA's :2 |  |  |
We can clearly see that three of the variables (species, island, sex) are categorical, and the remaining five are numerical.
Let us see the first few rows of the dataset:
head(penguins)
Let us visualize the input data graphically.
visdat::vis_dat(penguins)
There are 333 complete cases in the dataset, with 19 missing (NA) values in total.
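We can verify these counts directly with base R before moving on:

sum(complete.cases(penguins))  # expected: 333 complete rows
sum(is.na(penguins))           # expected: 19 missing values in total

Let us now explore each column separately.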
penguins %>%
explore(species)
The species column does not contain any NA values, so all rows are filled. Adelie, Chinstrap, and Gentoo penguins make up the population, with Adelie accounting for the bulk of the dataset.
penguins %>%
explore(island)
There are three islands where the penguins live, namely Torgersen, Dream, and Biscoe. Most of the penguins reside on Biscoe Island.
penguins %>%
explore(bill_length_mm)
The bill length column has 2 NAs, which make up 0.6% of the whole dataset.
penguins %>%
explore(bill_depth_mm)
The bill depth column has 2 NAs, which make up 0.6% of the whole dataset.
penguins %>%
explore(flipper_length_mm)
The flipper length column has 2 NAs, which make up 0.6% of the whole dataset.
penguins %>%
explore(body_mass_g)
The body mass column has 2 NAs, which make up 0.6% of the whole dataset.
penguins %>%
explore(sex)
The sex column has 11 NAs, which make up 3.2% of the whole dataset. Males slightly outnumber females.
penguins %>%
explore(year)
The dataset contains data from the years 2007, 2008, and 2009, with most observations gathered in 2009. As the year information does not seem important for our analysis, we will not include it.
Let us see the relationship between each feature and the target.
penguins %>%
explore(island, target = species)
penguins %>%
explore(bill_length_mm, target = species)
penguins %>%
explore(bill_depth_mm, target = species)
penguins %>%
explore(flipper_length_mm, target = species)
penguins %>%
explore(body_mass_g, target = species)
penguins %>%
explore(sex, target = species)
penguins %>%
explore(year, target = species)
The data already shows some promising trends. The variable flipper_length_mm distinguishes the Gentoo species from the Chinstrap species, while bill_length_mm distinguishes the Adelie species from the Chinstrap species. We also see that Chinstrap and Gentoo live on different islands. Let us now take a closer look at these variables:
penguins %>%
explore(bill_depth_mm, flipper_length_mm, target = species)
Based on the variables flipper length and bill depth, we can clearly separate the Gentoo penguins.
penguins %>%
explore(flipper_length_mm, bill_length_mm, target = species)
The plot demonstrates a decent, if not perfect, separation of the three species!
We will exclude the rows with too many NAs, as they do not contain enough information about the observation to support any conclusions. We set a threshold of 50%: if at least half of the measurements for an observation are NA, the row is dropped from our dataset. The table below shows the two omitted observations, one Adelie and one Gentoo. They were dropped because nothing was recorded for them other than the island and year.
flag <- apply(penguins, MARGIN = 1, function(y) sum(is.na(y))) < 4
data <- penguins[flag, ]
penguins[!flag,] %>%kable() %>% kable_paper("hover", full_width = F) %>%
scroll_box()
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 4 | Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| 272 | Gentoo | Biscoe | NA | NA | NA | NA | NA | 2009 |
The distribution of penguin species is shown in the table below. After this filtering, we have 151 Adelie, 68 Chinstrap, and 123 Gentoo penguins.
kable(transform(as.data.frame(table(data$species)), percentage_column = Freq / nrow(data) * 100), col.names = c("Species", "Occurrence", "Percentage")) %>% kable_paper("hover", full_width = F) %>%
scroll_box()
| Species | Occurrence | Percentage |
|---|---|---|
| Adelie | 151 | 44.15205 |
| Chinstrap | 68 | 19.88304 |
| Gentoo | 123 | 35.96491 |
From previewing the numerical variables, we can see that the attributes roughly follow a Gaussian distribution. Flipper length has a bimodal distribution, most likely because it mixes distinct penguin species. The distributions for the three penguin groups are shown below. There appears to be a clear difference between classes, especially when comparing bill length to any of the other numeric variables.
Furthermore, we can build ellipse-based grouping plots from the per-species distribution of every numerical attribute using the featurePlot function from the caret package. These show very little overlap between the species whenever bill length is compared to any other variable; bill length should therefore be a strong descriptive attribute in the model and drive very accurate predictions. Other pairwise comparisons show considerable overlap, especially between the Adelie and Chinstrap species.
Let us draw the scatterplot matrix of the columns species, bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.
d <- data[, c(1, 3, 4, 5, 6)]
ggpairs(d, progress = FALSE, aes(color = species, alpha = 0.2),
        lower = list(combo = wrap("facethist", binwidth = 0.5)))
Now let us draw the feature plots with caret::featurePlot:
d1 <- data %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# Ellipse plot: pairwise feature comparisons with a per-species ellipse
# (plot = "ellipse" requires the ellipse package)
transparentTheme(set = TRUE, pchSize = 1, trans = 0.4)
caret::featurePlot(x = d1,
                   y = as.factor(data$species),
                   plot = "ellipse",
                   auto.key = list(columns = 3))

# Density plot: the distribution of each feature by species
# (adapted from Dhairav Chhatbar, https://rpubs.com/dhairavc/735867)
transparentTheme(trans = .9)
caret::featurePlot(x = d1,
                   y = as.factor(data$species),
                   plot = "density",
                   scales = list(x = list(relation = "free"),
                                 y = list(relation = "free")),
                   adjust = 1.5,
                   pch = "|",
                   layout = c(4, 1),
                   auto.key = list(columns = 4))
From the plots we can easily determine that Gentoo penguins show a measurable difference from Adelie and Chinstrap in flipper length, body mass, and bill depth, while Adelie shows a measurable difference in bill length from the other species. We can therefore anticipate that the Gentoo species will be separated successfully.
Now let us impute the missing values in the dataset and remove the year and island columns.
# install.packages('recipes')
library(recipes)
res <-
recipe(species ~ ., data = data) %>%
recipes::step_impute_median(all_numeric()) %>%
recipes::step_impute_mode(all_nominal(), -all_outcomes())%>%
step_rm(island) %>%
step_rm(year) %>%
prep()
# Note: step_impute_median()/step_impute_mode() require a recent version of
# recipes; older versions used step_medianimpute()/step_modeimpute() instead.
Next, we split the dataset into training and test sets using caret's createDataPartition() function, with 80% of the data used for training and 20% for testing.
library(caret)
set.seed(2021)
idx <- createDataPartition(data$species, p = .8, list = FALSE, times = 1)
train <- data[idx,]
test <- data[-idx,]
test<-bake(res, new_data = test)
train<-bake(res, new_data = train)
Linear discriminant analysis (LDA) is a statistical technique for finding linear combinations of features that characterize or separate two or more classes of objects or events. It is similar in spirit to principal component analysis (PCA) in that it seeks linear combinations of variables that best explain the data.
When the classes are well separated, LDA is a stable method that does not suffer from the volatility logistic regression can exhibit. Because we have more than two response classes, LDA is also attractive because it provides low-dimensional representations of the data.
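For reference, LDA models each class as Gaussian with a shared covariance matrix $\Sigma$ and assigns an observation $x$ to the class $k$ that maximizes the linear discriminant function

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

where $\mu_k$ is the mean vector and $\pi_k$ the prior probability of class $k$. We now fit the LDA model on the training set: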
(model1 <- lda(species~. ,data = train))
## Call:
## lda(species ~ ., data = train)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.44 0.20 0.36
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
## Adelie 38.99091 18.40165 190.2479 3725.826 0.5371901
## Chinstrap 48.90364 18.42182 195.9455 3746.364 0.4909091
## Gentoo 47.53030 15.04141 217.2222 5083.081 0.5454545
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.096716664 -0.423419176
## bill_depth_mm -1.035629740 -0.135038882
## flipper_length_mm 0.072128261 0.013457632
## body_mass_g 0.001681983 0.001420329
## sexmale -0.276444360 0.732927894
##
## Proportion of trace:
## LD1 LD2
## 0.8657 0.1343
trainingPredictions <- predict(model1, train)
testingPredictions <- predict(model1, test)
trainingPredictions <- cbind(train, trainingPredictions) %>% mutate(group = "Train")
testingPredictions <- cbind(test, testingPredictions) %>% mutate(group = "Test")
ldaResult <- rbind(trainingPredictions, testingPredictions)
ggplot(ldaResult) + geom_point(aes(x.LD1, x.LD2, colour = species))+ labs(x = "LD1", y="LD2") + facet_grid(.~group)
To see whether these predictions are right, we compare them against the actual species labels. The confusion matrix below shows that we misclassify only two Chinstrap penguins as Adelie penguins in the training group.
rTrain <- ldaResult %>% filter(group == "Train") %>% conf_mat(truth = species,estimate = class)
rTest <- ldaResult %>% filter(group == "Test") %>% conf_mat(truth = species,estimate = class)
autoplot(rTrain, type = "heatmap") +
labs(title = "LDA Train Group Confusion Matrix")
All observations in the test group were classified correctly.
autoplot(rTest, type = "heatmap") +
labs(title = "LDA Test Group Confusion Matrix")
The kappa statistic corrects the raw accuracy for agreement expected by chance, giving a truer estimate of the model's performance.
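For reference, Cohen's kappa compares the observed agreement $p_o$ (the raw accuracy) with the agreement $p_e$ expected by chance given the class frequencies:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

A value of 1 indicates perfect agreement; 0 indicates agreement no better than chance.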
ldaResult %>% group_by(group) %>%
yardstick::metrics(truth = species, estimate = class) %>%
kbl() %>% kable_paper("hover", full_width = F) %>%
scroll_box()
| group | .metric | .estimator | .estimate |
|---|---|---|---|
| Test | accuracy | multiclass | 1.0000000 |
| Train | accuracy | multiclass | 0.9927273 |
| Test | kap | multiclass | 1.0000000 |
| Train | kap | multiclass | 0.9885479 |
A quadratic classifier separates observations from two or more classes of objects or events using a quadratic decision surface. Quadratic discriminant analysis (QDA) is like LDA, except that each class is allowed its own covariance matrix.
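Concretely, QDA replaces LDA's shared covariance $\Sigma$ with a per-class $\Sigma_k$, which makes the discriminant function quadratic in $x$:

$$\delta_k(x) = -\frac{1}{2}\log\lvert\Sigma_k\rvert - \frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$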
(model2 <- qda(species~. ,data = train))
## Call:
## qda(species ~ ., data = train)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.44 0.20 0.36
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
## Adelie 38.99091 18.40165 190.2479 3725.826 0.5371901
## Chinstrap 48.90364 18.42182 195.9455 3746.364 0.4909091
## Gentoo 47.53030 15.04141 217.2222 5083.081 0.5454545
Since QDA does not produce the low-dimensional discriminant projections that LDA does, we cannot visualize its separation on an LD1/LD2 axis. The QDA model's performance is similar to the LDA model's, with the same small number of Chinstrap penguins misclassified as Adelie. The kappa metrics, shown below, are likewise similar to LDA's.
m2trainingPredictions <- predict(model2, train)
m2testingPredictions <- predict(model2, test)
m2trainingPredictions <- cbind(train, m2trainingPredictions) %>% mutate(group = "Train")
m2testingPredictions <- cbind(test, m2testingPredictions) %>% mutate(group = "Test")
qdaResult <- rbind(m2trainingPredictions, m2testingPredictions)
rTrain <- qdaResult %>% filter(group == "Train") %>% conf_mat(truth = species,estimate = class)
rTest <- qdaResult %>% filter(group == "Test") %>% conf_mat(truth = species,estimate = class)
autoplot(rTrain, type = "heatmap") +
labs(title = "QDA Train Group Confusion Matrix")
autoplot(rTest, type = "heatmap") +
labs(title = "QDA Test Group Confusion Matrix")
qdaResult %>% group_by(group) %>%
yardstick::metrics(truth = species, estimate = class) %>%
kbl() %>% kable_paper("hover", full_width = F) %>%
scroll_box()
| group | .metric | .estimator | .estimate |
|---|---|---|---|
| Test | accuracy | multiclass | 1.0000000 |
| Train | accuracy | multiclass | 0.9927273 |
| Test | kap | multiclass | 1.0000000 |
| Train | kap | multiclass | 0.9885479 |
Naive Bayes classifiers are a family of simple "probabilistic classifiers" that apply Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters that grows only linearly with the number of attributes.
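Concretely, under the independence assumption the posterior probability of class $C_k$ given features $x_1, \ldots, x_p$ factorizes as

$$P(C_k \mid x_1, \ldots, x_p) \propto \pi_k \prod_{i=1}^{p} P(x_i \mid C_k)$$

and the predicted class is the one with the largest posterior. As the model summary below shows, each $P(x_i \mid C_k)$ is modeled as a Gaussian for our numeric features and as a Bernoulli for sex.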
(model3<-naivebayes::naive_bayes(species~. ,data = train, type = "class"))
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = species ~ ., data = train, type = "class")
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## Adelie Chinstrap Gentoo
## 0.44 0.20 0.36
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: bill_length_mm (Gaussian)
## ---------------------------------------------------------------------------------
##
## bill_length_mm Adelie Chinstrap Gentoo
## mean 38.990909 48.903636 47.530303
## sd 2.564956 3.471949 3.128815
##
## ---------------------------------------------------------------------------------
## ::: bill_depth_mm (Gaussian)
## ---------------------------------------------------------------------------------
##
## bill_depth_mm Adelie Chinstrap Gentoo
## mean 18.4016529 18.4218182 15.0414141
## sd 1.2148377 1.0877586 0.9664801
##
## ---------------------------------------------------------------------------------
## ::: flipper_length_mm (Gaussian)
## ---------------------------------------------------------------------------------
##
## flipper_length_mm Adelie Chinstrap Gentoo
## mean 190.247934 195.945455 217.222222
## sd 6.717243 7.263348 6.592444
##
## ---------------------------------------------------------------------------------
## ::: body_mass_g (Gaussian)
## ---------------------------------------------------------------------------------
##
## body_mass_g Adelie Chinstrap Gentoo
## mean 3725.8264 3746.3636 5083.0808
## sd 437.7967 381.1052 489.8176
##
## ---------------------------------------------------------------------------------
## ::: sex (Bernoulli)
## ---------------------------------------------------------------------------------
##
## sex Adelie Chinstrap Gentoo
## female 0.4628099 0.5090909 0.4545455
## male 0.5371901 0.4909091 0.5454545
##
## ---------------------------------------------------------------------------------
Given a bill length of less than 42 mm, the likelihood of the penguin being an Adelie is quite high.
plot(model3)
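As a quick illustration, we can ask the model for posterior probabilities on a hypothetical penguin with a short bill; the measurement values below are made up purely for illustration:

# Hypothetical penguin (illustrative values only) with a 41 mm bill
new_penguin <- data.frame(bill_length_mm = 41, bill_depth_mm = 18.4,
                          flipper_length_mm = 190, body_mass_g = 3700,
                          sex = factor("female", levels = c("female", "male")))
predict(model3, new_penguin, type = "prob")

We would expect the Adelie column of the returned posterior matrix to dominate.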
m3trainingPredictions <- as.factor(predict(model3, train))
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
confusionMatrix(m3trainingPredictions, train$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 117 4 0
## Chinstrap 4 51 0
## Gentoo 0 0 99
##
## Overall Statistics
##
## Accuracy : 0.9709
## 95% CI : (0.9435, 0.9874)
## No Information Rate : 0.44
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9543
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9669 0.9273 1.00
## Specificity 0.9740 0.9818 1.00
## Pos Pred Value 0.9669 0.9273 1.00
## Neg Pred Value 0.9740 0.9818 1.00
## Prevalence 0.4400 0.2000 0.36
## Detection Rate 0.4255 0.1855 0.36
## Detection Prevalence 0.4400 0.2000 0.36
## Balanced Accuracy 0.9705 0.9545 1.00
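The confusion matrix above covers only the training data. For completeness, a minimal sketch of the same evaluation on the held-out test set, mirroring what we did for LDA and QDA:

# Evaluate the Naive Bayes model on the held-out test set
m3testingPredictions <- as.factor(predict(model3, test))
confusionMatrix(m3testingPredictions, test$species)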
This summary focuses on the accuracy on holdout data across the three separate modelling approaches. All three models performed admirably, with LDA and QDA faring best and Naive Bayes faring worst. I would point out that the classes were fairly well separated, which likely contributed to the strong LDA and QDA results. Naive Bayes imposes the assumption of feature independence, which may explain its somewhat lower accuracy.