library(skimr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(palmerpenguins)
require(foreign)
require(nnet)
require(ggplot2)
require(reshape2)
library(broom)
library(gmodels)
library(MASS)
library(psych)
library(caret)
library(devtools)
library(ggord)
library(AppliedPredictiveModeling)
library(klaR)
library(e1071)
From the bar charts we see that Chinstrap and Gentoo species are only found in their own island and the Adelie species can be found on all 3 islands
skim(penguins)
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
penguins2 <- penguins %>% dplyr::select(-year)
ggplot(penguins2, aes(x = island, fill = species)) +
geom_bar(alpha = 0.8) +
scale_fill_manual(values = c("darkorange","purple","cyan4"),guide = FALSE) +
facet_wrap(~species, ncol = 1) + coord_flip()
The distribution of Species specific numeric predictors such as bill length, bill depth, flipper length, and body mass look fairly normally distributed
transparentTheme(trans = .9)
featurePlot(x = penguins %>% dplyr::select(bill_length_mm, bill_depth_mm,flipper_length_mm, body_mass_g),
y = penguins$species,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(4, 1),
auto.key = list(columns = 4))
Looking at the correlation plot we see some obserations that cna be made:
pairs.panels(penguins2, gap=0, bg=c("darkorange","purple","cyan4")[penguins$species], pch = 21)
Create training and test datasets with a 70/30 ratio
set.seed(90)
s <- createDataPartition(penguins2$species, p = .70, list = FALSE, times = 1)
penguins_train <- penguins2[s,]
penguins_test <- penguins2[-s,]
Setting up a linear discriminant model, the sex predictor is removed due to its low correlation with species and the body_mass_g predictor is removed due to its high correlation with flipper_length_mm
The 1st linear discriminant is able to explain 86% of the variation and the 2nd discriminant explaining 13% of the variation on its respective axis
The discriminant histogram for the 1st discriminant shows that the the Gentoo species can be very well separated but there is some lap with the the other two.
penguins.lda <- lda(species ~ bill_length_mm + flipper_length_mm + bill_depth_mm + island, data = penguins_train)
penguins.lda
## Call:
## lda(species ~ bill_length_mm + flipper_length_mm + bill_depth_mm +
## island, data = penguins_train)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4416667 0.2000000 0.3583333
##
## Group means:
## bill_length_mm flipper_length_mm bill_depth_mm islandDream
## Adelie 38.79434 189.2925 18.38774 0.4056604
## Chinstrap 48.48125 195.5417 18.32708 1.0000000
## Gentoo 47.43953 217.5349 15.04419 0.0000000
## islandTorgersen
## Adelie 0.3113208
## Chinstrap 0.0000000
## Gentoo 0.0000000
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.1276161 -0.36238646
## flipper_length_mm 0.1284807 0.06404114
## bill_depth_mm -0.8610339 0.15209734
## islandDream -1.4795362 -2.04890042
## islandTorgersen -1.1533786 -0.61727912
##
## Proportion of trace:
## LD1 LD2
## 0.8639 0.1361
penguins.lda_pred <- predict(penguins.lda, penguins_train)
ldahist(penguins.lda_pred$x[,1], penguins_train$species)
ggord(penguins.lda, penguins_train$species[-penguins.lda$na.action])
Looking at the predictions we see that this model has a 100% accuracy rate as from the test set all species were correctly predicted
penguins.lda_pred2 <- predict(penguins.lda, newdata = penguins_test)
ldahist(penguins.lda_pred2$x[,1], penguins_test$species)
lda_cfm <- table(Actual=penguins_test$species, Predicted = penguins.lda_pred2$class)
lda_cfm
## Predicted
## Actual Adelie Chinstrap Gentoo
## Adelie 45 0 0
## Chinstrap 0 20 0
## Gentoo 0 0 37
sum(diag(lda_cfm))/sum(lda_cfm)
## [1] 1
The quadratic discriminant model while gives the prior probabilities does not give the discriminant function as easily as the linear discriminant model. From the partition plot you can see the various decision boundries for each of the predictors
penguins.qda <- qda(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex, data = penguins_train)
penguins.qda
## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
## sex, data = penguins_train)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4377682 0.2060086 0.3562232
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm sexmale
## Adelie 38.83137 18.37843 189.4412 0.5392157
## Chinstrap 48.48125 18.32708 195.5417 0.5000000
## Gentoo 47.52530 15.05301 217.6024 0.5421687
penguins.qda_pred <- predict(penguins.qda, penguins_train)
partimat(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + sex, data = penguins_train, method = "qda")
The quadratic model misclassified one species with an accuracy rate of 99%
penguins.qda_pred2 <- predict(penguins.qda, newdata = penguins_test)
qda_cfm <- table(Actual=penguins_test$species, Predicted = penguins.qda_pred2$class)
qda_cfm
## Predicted
## Actual Adelie Chinstrap Gentoo
## Adelie 44 0 0
## Chinstrap 1 19 0
## Gentoo 0 0 36
sum(diag(qda_cfm))/sum(qda_cfm)
## [1] 0.99
The Naive Bayes model includes the island variable so that to use it as a prior probability rate for a given species. The model itself misclassifies 5 species with an accuracy of 95%
penguins.nb <- naiveBayes(species ~ bill_length_mm + flipper_length_mm + bill_depth_mm + island, data = penguins_train)
penguins.nb_pred <- predict(penguins.nb, penguins_test, type="class")
nb_cfm <- table(Actual=penguins_test$species, Predicted = penguins.nb_pred)
nb_cfm
## Predicted
## Actual Adelie Chinstrap Gentoo
## Adelie 44 1 0
## Chinstrap 2 18 0
## Gentoo 0 0 37
sum(diag(nb_cfm))/sum(nb_cfm)
## [1] 0.9705882