DATA 622 - Assignment 2

library(skimr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(palmerpenguins)
require(foreign)
require(nnet)
require(ggplot2)
require(reshape2)
library(broom)
library(gmodels)
library(MASS)
library(psych)
library(caret)
library(devtools)
library(ggord)
library(AppliedPredictiveModeling)
library(klaR)
library(e1071)

Linear Discriminant Analysis

You want to evaluate all the ‘features’ or dependent variables and see what should be in your model. Please comment on your choices.
Just a suggestion: You might want to consider exploring featurePlot on the caret package. Basically, you look at each of the features/dependent variables and see how they are different based on species. Simply eye-balling this might give you an idea about which would be strong ‘classifiers’ (aka predictors)
Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model
Look at the fit statistics/ accuracy rates.

Exploratory Analysis

From the bar charts we see that Chinstrap and Gentoo species are only found in their own island and the Adelie species can be found on all 3 islands

skim(penguins)

Data summary
Name	penguins
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	▇▁▇▁▇

penguins2 <- penguins %>% dplyr::select(-year)
ggplot(penguins2, aes(x = island, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),guide = FALSE) +
  facet_wrap(~species, ncol = 1) + coord_flip()

The distribution of Species specific numeric predictors such as bill length, bill depth, flipper length, and body mass look fairly normally distributed

transparentTheme(trans = .9)
featurePlot(x = penguins %>% dplyr::select(bill_length_mm, bill_depth_mm,flipper_length_mm, body_mass_g), 
            y = penguins$species,
            plot = "density", 
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")), 
            adjust = 1.5, 
            pch = "|", 
            layout = c(4, 1), 
            auto.key = list(columns = 4))

Looking at the correlation plot we see some obserations that cna be made:

The bill_length_mm, flipper_length_mm, and body_mass_g are is largely positive correlated with the species
the body_mass_g has a large positive correlation with the flipper_length_mm and possibly one can be remvoed from the models to mitigate co-linearity issues

pairs.panels(penguins2, gap=0, bg=c("darkorange","purple","cyan4")[penguins$species], pch = 21)

Split Dataset

Create training and test datasets with a 70/30 ratio

set.seed(90)

s <- createDataPartition(penguins2$species, p = .70, list = FALSE, times = 1)
penguins_train <- penguins2[s,]
penguins_test <- penguins2[-s,]

LDA Model

Setting up a linear discriminant model, the sex predictor is removed due to its low correlation with species and the body_mass_g predictor is removed due to its high correlation with flipper_length_mm

The 1st linear discriminant is able to explain 86% of the variation and the 2nd discriminant explaining 13% of the variation on its respective axis

The discriminant histogram for the 1st discriminant shows that the the Gentoo species can be very well separated but there is some lap with the the other two.

penguins.lda <- lda(species ~ bill_length_mm + flipper_length_mm + bill_depth_mm + island, data = penguins_train)
penguins.lda

## Call:
## lda(species ~ bill_length_mm + flipper_length_mm + bill_depth_mm + 
##     island, data = penguins_train)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4416667 0.2000000 0.3583333 
## 
## Group means:
##           bill_length_mm flipper_length_mm bill_depth_mm islandDream
## Adelie          38.79434          189.2925      18.38774   0.4056604
## Chinstrap       48.48125          195.5417      18.32708   1.0000000
## Gentoo          47.43953          217.5349      15.04419   0.0000000
##           islandTorgersen
## Adelie          0.3113208
## Chinstrap       0.0000000
## Gentoo          0.0000000
## 
## Coefficients of linear discriminants:
##                          LD1         LD2
## bill_length_mm     0.1276161 -0.36238646
## flipper_length_mm  0.1284807  0.06404114
## bill_depth_mm     -0.8610339  0.15209734
## islandDream       -1.4795362 -2.04890042
## islandTorgersen   -1.1533786 -0.61727912
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8639 0.1361

penguins.lda_pred <- predict(penguins.lda, penguins_train)

ldahist(penguins.lda_pred$x[,1], penguins_train$species)

ggord(penguins.lda, penguins_train$species[-penguins.lda$na.action])

Looking at the predictions we see that this model has a 100% accuracy rate as from the test set all species were correctly predicted

penguins.lda_pred2 <- predict(penguins.lda, newdata = penguins_test)
ldahist(penguins.lda_pred2$x[,1], penguins_test$species)

lda_cfm <- table(Actual=penguins_test$species, Predicted = penguins.lda_pred2$class)
lda_cfm

##            Predicted
## Actual      Adelie Chinstrap Gentoo
##   Adelie        45         0      0
##   Chinstrap      0        20      0
##   Gentoo         0         0     37

sum(diag(lda_cfm))/sum(lda_cfm)

## [1] 1

Quadratic Discriminant Analysis

The quadratic discriminant model while gives the prior probabilities does not give the discriminant function as easily as the linear discriminant model. From the partition plot you can see the various decision boundries for each of the predictors

penguins.qda <- qda(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm  + sex, data = penguins_train)


penguins.qda

## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     sex, data = penguins_train)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4377682 0.2060086 0.3562232 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm   sexmale
## Adelie          38.83137      18.37843          189.4412 0.5392157
## Chinstrap       48.48125      18.32708          195.5417 0.5000000
## Gentoo          47.52530      15.05301          217.6024 0.5421687

penguins.qda_pred <- predict(penguins.qda, penguins_train)

partimat(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm  + sex, data = penguins_train, method = "qda")

The quadratic model misclassified one species with an accuracy rate of 99%

penguins.qda_pred2 <- predict(penguins.qda, newdata = penguins_test)

qda_cfm <- table(Actual=penguins_test$species, Predicted = penguins.qda_pred2$class)
qda_cfm

##            Predicted
## Actual      Adelie Chinstrap Gentoo
##   Adelie        44         0      0
##   Chinstrap      1        19      0
##   Gentoo         0         0     36

sum(diag(qda_cfm))/sum(qda_cfm)

## [1] 0.99

Naive Bayes

The Naive Bayes model includes the island variable so that to use it as a prior probability rate for a given species. The model itself misclassifies 5 species with an accuracy of 95%

penguins.nb <- naiveBayes(species ~ bill_length_mm + flipper_length_mm  + bill_depth_mm + island, data = penguins_train)

penguins.nb_pred <- predict(penguins.nb, penguins_test, type="class")
nb_cfm <- table(Actual=penguins_test$species, Predicted = penguins.nb_pred)
nb_cfm

##            Predicted
## Actual      Adelie Chinstrap Gentoo
##   Adelie        44         1      0
##   Chinstrap      2        18      0
##   Gentoo         0         0     37

sum(diag(nb_cfm))/sum(nb_cfm)

## [1] 0.9705882