We will be working with the Penguin dataset again as we did for Homework #1. Please use “Species” as your target variable. For this assignment, you may want to drop/ignore the variable “year”.We will be conducting Linear Discriminant Analysis, Quadratic Discriminant Analysis and Naive Bayes for the penguin dataset while evaluating all the features or defendant variables.
# Libraries
library(palmerpenguins)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(cowplot)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(regclass)
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:caret':
##
## predictors
## The following object is masked from 'package:tidyr':
##
## fill
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
##
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
##
## qq
library(e1071)
library(klaR)
First I will take a close look at the dataset using glimpse and summary function. I am using some of my Homework 1 methods for exploration and preparation.
#Load Data
data("penguins")
#Closer look at the data
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A...
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge...
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34....
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18....
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ...
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347...
## $ sex <fct> male, female, female, NA, female, male, female, m...
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2...
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
species and its relationship with numerical variables
penguins %>%
group_by(species) %>%
summarize(across(where(is.numeric), mean, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 38.8 18.3 190. 3701. 2008.
## 2 Chinstrap 48.8 18.4 196. 3733. 2008.
## 3 Gentoo 47.5 15.0 217. 5076. 2008.
Below bar plots are for the visualization of the count of each species penguins in the dataset. From the Second bar plot we can visualize the species distribution of each sex.
penguins %>%
count(species) %>%
ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
geom_label(aes(x = species, y = n, label = n)) +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
theme_minimal() +
labs(title = 'Penguins Species & Count')
penguins %>%
drop_na() %>%
count(sex, species) %>%
ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
geom_label(aes(x = species, y = n, label = n)) +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
facet_wrap(~sex) +
theme_minimal() +
labs(title = 'Penguins Species ~ Gender')
As you can see from the above visualization and the summaries, We see from the three categories Adelie (153) has the highest count followed by Gentoo (119) and Chinstrap (68).The distribution between gender is almost equal among all three species.
Explore Variables
library(explore)
## Warning: package 'explore' was built under R version 3.6.3
penguins %>%
explore_all(target = species)
Relationship with categorical features
In this section, my purpose is to analyze the relationship between species with other categorical features such as sex and the island.
plot_island<-ggplot(penguins, aes(x = island, y = species, color = species)) +
geom_jitter(size = 3)
plot_sex<-ggplot(penguins, aes(x = sex, y = species, color = species)) +
geom_jitter(size = 3)
plot_grid(plot_island, plot_sex, labels = "AUTO")
From the above island graph we can see that the Adelie penguins can be found in all three islands while Chintstrap in only Dream island and Gentoo only in Biscoe island.
Relationship with numerical features
# Scatterplot 1: penguin flipper length versus body mass
plot1<-ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species,
shape = species),
size = 2) +
scale_color_manual(values = c("darkorange","darkorchid","cyan"))
# Scatterplot 2: penguin bill length versus bill depth
plot2<-ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(aes(color = species,
shape = species),
size = 2) +
scale_color_manual(values = c("darkorange","darkorchid","cyan"))
plot3<-ggplot(penguins, aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point(aes(color = sex)) +
scale_color_manual(values = c("darkorange","cyan4"),
na.translate = FALSE) +
facet_wrap(~species)
plot_grid(plot1, plot2, plot3, labels = "AUTO")
Correlation
#correlation matrix
penguins %>%
select_if(is.numeric) %>%
drop_na() %>%
cor()
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.00000000 -0.23505287 0.6561813 0.59510982
## bill_depth_mm -0.23505287 1.00000000 -0.5838512 -0.47191562
## flipper_length_mm 0.65618134 -0.58385122 1.0000000 0.87120177
## body_mass_g 0.59510982 -0.47191562 0.8712018 1.00000000
## year 0.05454458 -0.06035364 0.1696751 0.04220939
## year
## bill_length_mm 0.05454458
## bill_depth_mm -0.06035364
## flipper_length_mm 0.16967511
## body_mass_g 0.04220939
## year 1.00000000
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:explore':
##
## rescale01
ggpairs(penguins, columns = 3:6, title = "Correlogram of Variables",
ggplot2::aes(color = island),
progress = FALSE,
lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1)))
Penguin flipper length and body mass shows a positive association for each 3 species. We can also notice from the visualization that Adelie and Chinstrap overlaps for all the variables except for bill length. We can assume that bill length variable is helping us to distinguish the three species from each other.
First step of my data preparation is to deal with missing data. Next we can use the featureplot method to verify the variables to use. Then I will partition the dataset into train and test set using createDataPartition method. I will allocate 80% of the dataset to training data and the 20% to test data.
#omit missig data
penguins <- na.omit(penguins)
#check for missing data
colSums(is.na(penguins))
## species island bill_length_mm bill_depth_mm
## 0 0 0 0
## flipper_length_mm body_mass_g sex year
## 0 0 0 0
Eliminate the variable year and island
penguins = penguins %>%
dplyr::select( -one_of("island", "year") )
head(penguins)
## # A tibble: 6 x 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie 39.1 18.7 181 3750 male
## 2 Adelie 39.5 17.4 186 3800 female
## 3 Adelie 40.3 18 195 3250 female
## 4 Adelie 36.7 19.3 193 3450 female
## 5 Adelie 39.3 20.6 190 3650 male
## 6 Adelie 38.9 17.8 181 3625 female
Feature Selection
featurePlot(x = penguins[,c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")],
y = penguins$species,
plot = "box",
## Add a key at the top
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
adjust = 1.5,
pch = "|",
layout = c(2,2),
auto.key = list(columns = 3))
We can see from the featureplot that the medians and quartiles do not overlap each other for any variables. flipper_length_mm, *body_mass_g** and bill_depth_mm distingushes Gentoo from Adelie and Chinstrap species while bill_length_mm separates Adelie from Chinstrap and Gentoo species.
Split between Train and Test dataset
#Split data 80% training and 20% test sets
set.seed(123)
partition <- penguins$species %>%
createDataPartition(p = 0.8, list = FALSE)
train <- penguins[partition, ]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test <- penguins[-partition, ]
I will carry out this LDA using the lda() function from the R MASS package. First I will normalize the data. Categorical variables will be automatically ignored. In my lda model I will be ignoring sexmail variable and will only be using the numerical features to predict the species.
#Estimating processing parameters
preproc.param <- train %>%
preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
train.transformed <- preproc.param %>% predict(train)
test.transformed <- preproc.param %>% predict(test)
# Fit the model
lda_model <- lda(species~ bill_length_mm + bill_depth_mm + body_mass_g + flipper_length_mm , data = train.transformed)
lda_model
## Call:
## lda(species ~ bill_length_mm + bill_depth_mm + body_mass_g +
## flipper_length_mm, data = train.transformed)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g flipper_length_mm
## Adelie -0.9555507 0.6009140 -0.6092842 -0.7707877
## Chinstrap 0.9088301 0.6564651 -0.5901947 -0.3716006
## Gentoo 0.6438935 -1.1084638 1.0806975 1.1522937
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.5415983 -2.408094172
## bill_depth_mm -2.1000729 0.004472972
## body_mass_g 1.0280418 1.274291326
## flipper_length_mm 1.0940111 0.404513143
##
## Proportion of trace:
## LD1 LD2
## 0.8462 0.1538
LDA determines group means and computer the probability of belonging to a different group for each individual.
\(LD1 = 0.45*bill_length_mm - 2.06*bill_depth_mm + 1.05*body_mass_g + 1.11*flipper_length_mm\)
\(LD2 = 2.32*bill_length_mm - 0.04*bill_depth_mm - 1.25*body_mass_g - 0.38*flipper_length_mm\)
# Make predictions
pred_lda_comp <- lda_model %>%
predict(test.transformed)
# Confusion Matrix for LDA
con_lda_model <- confusionMatrix(pred_lda_comp$class, test.transformed$species)
LDA Plot
lda.data <- cbind(train.transformed, predict(lda_model)$x)
ggplot(lda.data, aes(LD1, LD2)) +
geom_point(aes(color = species))
QDA can be computed using R function qda(). Will use the same numerical features of species.
qda_model <- qda(species~ bill_length_mm + bill_depth_mm + body_mass_g + flipper_length_mm , data = train.transformed)
qda_model
## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + body_mass_g +
## flipper_length_mm, data = train.transformed)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g flipper_length_mm
## Adelie -0.9555507 0.6009140 -0.6092842 -0.7707877
## Chinstrap 0.9088301 0.6564651 -0.5901947 -0.3716006
## Gentoo 0.6438935 -1.1084638 1.0806975 1.1522937
# Make predictions
pred_qda_comp <- qda_model %>%
predict(test.transformed)
# Confusion Matrix for QDA
con_qda_model <- confusionMatrix(pred_qda_comp$class, test.transformed$species)
Will be using the same transformed data that we used for LDA and QDA.
head(train)
## # A tibble: 6 x 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie 40.3 18 195 3250 female
## 2 Adelie 36.7 19.3 193 3450 female
## 3 Adelie 38.9 17.8 181 3625 female
## 4 Adelie 39.2 19.6 195 4675 male
## 5 Adelie 41.1 17.6 182 3200 female
## 6 Adelie 38.6 21.2 191 3800 male
naive_model <- naiveBayes(species~., data = train)
naive_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Conditional probabilities:
## bill_length_mm
## Y [,1] [,2]
## Adelie 38.87265 2.595972
## Chinstrap 48.94727 2.904266
## Gentoo 47.51562 3.164052
##
## bill_depth_mm
## Y [,1] [,2]
## Adelie 18.37607 1.194430
## Chinstrap 18.48727 1.150210
## Gentoo 14.95417 1.005555
##
## flipper_length_mm
## Y [,1] [,2]
## Adelie 190.3504 6.823282
## Chinstrap 195.8909 6.857472
## Gentoo 217.0417 6.558107
##
## body_mass_g
## Y [,1] [,2]
## Adelie 3730.769 471.5664
## Chinstrap 3745.909 404.9967
## Gentoo 5071.094 496.2800
##
## sex
## Y female male
## Adelie 0.4786325 0.5213675
## Chinstrap 0.4545455 0.5454545
## Gentoo 0.4895833 0.5104167
pred_naive <- predict(naive_model, train.transformed)
lda_penguins <- train(data=train.transformed, species ~ ., method="lda",
trControl=trainControl(method="cv", number=10))
lda_penguins$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 1 1 0 0
qda_penguins <- train(data=train.transformed, species ~ ., method="qda",
trControl=trainControl(method="cv", number=10))
qda_penguins$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.9962963 0.9940397 0.01171214 0.01884801
nb_penguins <- train(data=train.transformed, species ~ ., method="nb",
trControl=trainControl(method="cv", number=10))
## Warning in FUN(X[[i]], ...): Numerical 0 probability for all classes with
## observation 17
nb_penguins$results
## usekernel fL adjust Accuracy Kappa AccuracySD KappaSD
## 1 FALSE 0 1 0.9703398 0.9537508 0.03892642 0.06044277
## 2 TRUE 0 1 0.9631970 0.9422348 0.04606048 0.07211191
All three models worked really well in this scenario. LDA and QDA has a 99% accuracy rate while Naive Bayes had a 96%.What special about Naive Bayes is that it was able to hand both categorical and continuous variables. With the highest accuracy rate of 97% and a Kappa rate of 99%, I feel like using QDA on for this analysis is reasonable.