Assignment 2

Data – 622 Homework 2

The dataset that will be used for this assignment is the penguin dataset. this can here: <https://allisonhorst.github.io/palmerpenguins/articles/intro.html>

Data Exploration

Before conducting any modeling, we will explore our dataset. Below describes the penguin dataset where we have 344 observations with 8 features used to describe these observations. Three of these variables are nominal while the other five are numeric.

summary(penguins) %>%kable() %>%  kable_styling(
      full_width = F, position="center") %>% 
  scroll_box()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10 Min. :172.0 Min. :2700 female:165 Min. :2007
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30 Median :197.0 Median :4050 NA’s : 11 Median :2008
NA NA Mean :43.92 Mean :17.15 Mean :200.9 Mean :4202 NA Mean :2008
NA NA 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750 NA 3rd Qu.:2009
NA NA Max. :59.60 Max. :21.50 Max. :231.0 Max. :6300 NA Max. :2009
NA NA NA’s :2 NA’s :2 NA’s :2 NA’s :2 NA NA

We can also see in this dataset for the categorical values that our feature that we are looking to predict on are comprised of Adelie, Chinstrap, and Gentoo penguins, where Adelie comprises of majority of the dataset. All features experience NA and there are a few observations where majority of the values are NA’s. These are removed from our analyses as they do not provide enough information about the observation to make an inference. This is done by setting a threshold of 0.5, meaning if 50% of the feaures measure NA for any observation, that row is removed from our dataset. The table below shows which observations were removed, 1 for an Adelie species and one for a Gentoo species. they were removed because no information besides island and year were available for these observations.

limitmissing <- .5*ncol(penguins)

retain<-apply(penguins, MARGIN= 1, function(y) sum(length(which(is.na(y)))))<limitmissing

Dataset<-penguins[retain,]

penguins[!retain,] %>% kable() %>%  kable_styling(
      full_width = F, position="center") %>% 
  scroll_box()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen NA NA NA NA NA 2007
Gentoo Biscoe NA NA NA NA NA 2009

For this assignment, let us use ‘species’ as our outcome or the dependent variable.

Linear Discriminant analysis

Linear discriminant analysis (LDA) is a method used in statistics to find a linaer combinations of features that characterize two or more classes of objects or event. Its a mehtod that is closely related to principal component analysis in that they look for linear combinations of variables that best explain the data.

Previously, we ran binary and multi-nomial logistic regression analyses on this dataset, however, these tools can struggle to converge when datasets produce perfect/near perfect predictions. LDA is robust to use when the classes are well-separated and will not suffer from the same instability as logistic regression. LDA is also popular when we have more than two response classes as it provides low-dimensional views of the data.

The table below provides the class distribution of the penguin species. It shows that we have 152 Adelie penguins, 68 Chinstrap penguins and 124 Gentoo penguins.

table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

As shown in the table above, the penguins dataset has approximately 3 classes, Adelie, Chinstrap, and Gentoo. The percentage associated with each class is shown below.

Dataset<-penguins[retain,]

Dataset<-
  Dataset %>% 
  filter(is.na(species)==F) %>% 
  dplyr::select(-year)
prop.table(table(Dataset$species)) %>% as.data.frame() %>% 
  mutate(Freq= percent(Freq)) %>% 
  rename( "Class" = Var1, "Proportion" = "Freq")
##       Class Proportion
## 1    Adelie      44.2%
## 2 Chinstrap      19.9%
## 3    Gentoo      36.0%

We can look at a distribution of our numeric variables below and we can see that each of the features follow close to a gaussian distribution. flipper length follows a bimodal distribution and is likely a distribution of 2 different penguin species. Below shows the distribution of our Dataset with the three penguin classes included. We can see that there is quite the distinction in classes especially when it comes to Bill length vs any of the other numeric variables.

additionally, using the featurePlot function from the caret package, we can develop ellipse base clustering plots centered aronud the means of these distributions for each numeric feature. This clearly shows how there is very little overlap between between the species when bill length vs any other variable is compared. This may be a potential catalyst for very accurate predictions and is expected to be a strong descriptive variable in the model. Other pairwise comparisons tend to experience a lot more overlap, particularly between the Adelie and the Chinstrap species.

GGally::ggpairs(Dataset[,c(1,3,4,5,6)], progress = F, aes(color = species))+ theme_bw()

{
transparentTheme(set = T, trans = .5)
  caret::featurePlot(x = Dataset[,c(3,4,5,6)], y = Dataset$species,
                   plot = "ellipse",
                   auto.key = list(columns = 3)
                   )
}

the boxplot distributions of the numeric variables show a distinction in several cases. For flipper length and body mass and bill depth, gentoo species seems to have a statistically significant difference from the Adelie and chinstrap species. for bill length, Adelie seems to have a statistically signicifant difference from the other species. we should expect good separation of the gentoo species.

caret::featurePlot(x = Dataset[,c(3,4,5,6)], y = Dataset$species,
                   plot = "box",
                   auto.key = list(columns = 3),
                   scales = list(y = list(relation="free")
                   ))

Model Preparation

We will be comparing three models in this assignment: Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and Naive Bayes (NB). to keep the comparibility between models, the same transformations and preparations will be made to the dataset, and accuracy on training and testing sets will be produced for comparison between models.

Train/Test Split

The first step was to split the data. This was done useing the initial_split function in the tidymodels package and the sampling was stratified by species in order to obtain a similar proportion mix from the original. the train/test ratio was specified as 0.75 meaning 75% of the data was used to train the model while 25% of the data was used to test.

set.seed(1)
Datasplit<- initial_split(Dataset, prop = .75, strata = species)

Transformations

using the recipes package, various transformations were made to the dataset prior to the analysis based on initial exploratory data analysis as well as personal judgment.

several imputation methods for all numeric features were testing included median, knn, mean. because little change occurred across the imputations, the option that was most computationally efficient was selected which was median. For categorical variables, similarly several imputation methods were tested however the one that was most computationally efficient was selected which was mode.

Additionally, all predictors in the dataset were scanned for non-descriptive features by a variance metric. This would allow all zero-variance features or low-variance features to be eliminated from analysis due to lack of model explainability.

Finally, step_corr was used in order to remove variables that experience high multi-colinearity. Because these models develop linear combinations of features in order to explain the target variable, co-linear features are removed in order to preserve model integrety and decrease complexity.

We can see based on the recipe output below that only the island variable was flagged and thus removed from the dataset.

(
lda_Recipe<-
  recipe(species ~ ., data = Dataset) %>% 
  step_medianimpute(all_numeric()) %>% 
  step_modeimpute(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_corr(all_numeric(),threshold = 0.9) %>% 
  step_rm(island) %>% 
  prep()
)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          6
## 
## Training data contained 342 data points and 9 incomplete rows. 
## 
## Operations:
## 
## Median Imputation for bill_length_mm, ... [trained]
## Mode Imputation for island, sex [trained]
## Zero variance filter removed no terms [trained]
## Sparse, unbalanced variable filter removed no terms [trained]
## Correlation filter removed no terms [trained]
## Variables removed island [trained]
Datatrain<-bake(lda_Recipe,training(Datasplit))

Datatest<-bake(lda_Recipe,testing(Datasplit))

A. Linear Discriminant Analysis:

Model 1 (LDA) below shows the prior probabilities of groups in the dataset as well as the group means associated with each group. the coefficients of linear discreminants are also described.

(m1<-lda(species~. ,data = Datatrain))
## Call:
## lda(species ~ ., data = Datatrain)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.73684      18.31754          190.0614    3718.640 0.5438596
## Chinstrap       48.42745      18.42353          195.5294    3701.961 0.4901961
## Gentoo          47.29462      14.94086          217.0753    5051.344 0.5268817
## 
## Coefficients of linear discriminants:
##                            LD1          LD2
## bill_length_mm     0.094636652 -0.447384887
## bill_depth_mm     -0.991760015 -0.132173902
## flipper_length_mm  0.102386898  0.031059264
## body_mass_g        0.001241383  0.001250533
## sexmale           -0.287041283  0.869301075
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8617 0.1383

Generating predictions on the training and testing data are shown below. An x-y plot of the two LDA coefficients used to describe the data and where each penguin lie are presented in the figure below. We can see based on this figure that we obtain very good separation on both the training and testing datasets.

m1.training.predictions<- m1 %>% predict(Datatrain)
m1.testing.predictions<- m1 %>% predict(Datatest)


lda.data<-
  rbind(
    cbind(Datatrain, m1.training.predictions) %>% mutate(set = "Training Set"),
    cbind(Datatest, m1.testing.predictions) %>% mutate(set = "Testing Set")
  ) %>% 
  mutate(model = "LDA")



ggplot(lda.data) + geom_point(aes(x.LD1, x.LD2, colour = species), size = 2.5)+
  defaulttheme+
  facet_wrap(.~set)

To assess whether this separation is accurate, we can compare our predictions to the true values. A confusion matrix is presented below and this shows that we only misclassify one Chinstrap penguin as an Adelie penguin in the training set, and have classified perfectly on the testing datasets.

m1.p.train<-
lda.data %>% 
  filter(set == "Training Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "LDA Training Set Confusion Matrix")

m1.p.test<-
lda.data %>% 
  filter(set == "Testing Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "LDA Testing Set Confusion Matrix")


gridExtra::grid.arrange(m1.p.train,m1.p.test,ncol=2)

Two metrics are presented below for this classification problem. “Accuracy” and “KAP”. Accuracy measures the raw accuracy given the precision and specificity of the model in how many true positives, and true negatives there are in comparison to the total amount of predictions.

KAP stands for kappa and provides a true estimate of the accuracy when compared to a control which is used to assess the true accuracy when there are class imbalances in the dataset. and we can see that we have near perfect predictions on the training set and perfect predictions on the testing set

lda.data %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2()
set .metric .estimator .estimate
Testing Set accuracy multiclass 1.0000000
Training Set accuracy multiclass 0.9961240
Testing Set kap multiclass 1.0000000
Training Set kap multiclass 0.9938942

B. Quadratic Discriminant Analysis

Quadratic discriminant analysis is another form of discriminant analysis that alters the gassian densities. This allows the features to be predicted by more than 2 DA descriptors or some curvature separation function. The same transformations are used in order to keep models comparable.

the prior probabilities and the means of each group are presented in the model call from the qda function below.

(m2<-qda(species~. ,data = Datatrain))
## Call:
## qda(species ~ ., data = Datatrain)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.73684      18.31754          190.0614    3718.640 0.5438596
## Chinstrap       48.42745      18.42353          195.5294    3701.961 0.4901961
## Gentoo          47.29462      14.94086          217.0753    5051.344 0.5268817

because qda can be multi-dimensional (greater than 2 dimension) separation of this function is not visible on an x/y plane. However, based on the confusion matric below, the accuracy of the QDA function performs identical to the LDA function missclassifying one Chinstrap penguin as Adelie. Thus the same accuracy and KAP metrics are observed. as shown below.

m2.training.predictions<- m2 %>% predict(Datatrain)
m2.testing.predictions<- m2 %>% predict(Datatest)


qda.data<-
  rbind(
    cbind(Datatrain, m2.training.predictions) %>% mutate(set = "Training Set"),
    cbind(Datatest, m2.testing.predictions) %>% mutate(set = "Testing Set")
  ) %>% 
  mutate(model = "QDA")
  

m2.p.train<-
qda.data %>% 
  filter(set == "Training Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "QDA Training Set Confusion Matrix")

m2.p.test<-
qda.data %>% 
  filter(set == "Testing Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "QDA Testing Set Confusion Matrix")


gridExtra::grid.arrange(m2.p.train,m2.p.test,ncol=2)

qda.data %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2()
set .metric .estimator .estimate
Testing Set accuracy multiclass 1.0000000
Training Set accuracy multiclass 0.9961240
Testing Set kap multiclass 1.0000000
Training Set kap multiclass 0.9938942

C. Naive Bayes

Finally, Naive Bayes classifier is a family of simple probabalistic classifiers that applies Bayes theorem but assumes independence between variables - hence Naive. they are among the simplest of models in the bayesian world of models but can acheive very high accuracy. Because of the assumptions behind naive bayes, they are models that are typically described as having high bias, but low variance. The library naivebayes is used to conduct this analysis.

Again, the same transformations are used to keep models comparable. A summary of the model output is shown below with the prior probabilities, assumed distribution for features, the mean of each predictor by penguin, and the standard deviation of each predictor by penguin.

library(naivebayes)

(m3<-naivebayes::naive_bayes(species~. ,data = Datatrain, type = "class"))
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = species ~ ., data = Datatrain, 
##     type = "class")
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##               
## bill_length_mm    Adelie Chinstrap    Gentoo
##           mean 38.736842 48.427451 47.294624
##           sd    2.573702  3.267481  3.206121
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_depth_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##              
## bill_depth_mm    Adelie Chinstrap    Gentoo
##          mean 18.317544 18.423529 14.940860
##          sd    1.251777  1.049112  1.031592
## 
## --------------------------------------------------------------------------------- 
##  ::: flipper_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##                  
## flipper_length_mm     Adelie  Chinstrap     Gentoo
##              mean 190.061404 195.529412 217.075269
##              sd     6.117269   6.917667   6.557831
## 
## --------------------------------------------------------------------------------- 
##  ::: body_mass_g (Gaussian) 
## --------------------------------------------------------------------------------- 
##            
## body_mass_g    Adelie Chinstrap    Gentoo
##        mean 3718.6404 3701.9608 5051.3441
##        sd    463.5236  345.4650  518.3896
## 
## --------------------------------------------------------------------------------- 
##  ::: sex (Bernoulli) 
## --------------------------------------------------------------------------------- 
##         
## sex         Adelie Chinstrap    Gentoo
##   female 0.4561404 0.5098039 0.4731183
##   male   0.5438596 0.4901961 0.5268817
## 
## ---------------------------------------------------------------------------------
summary(m3)
## 
## ================================== Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = species ~ ., data = Datatrain,      type = "class") 
## - Laplace: 0 
## - Classes: 3 
## - Samples: 258 
## - Features: 5 
## - Conditional distributions: 
##     - Bernoulli: 1
##     - Gaussian: 4
## - Prior probabilities: 
##     - Adelie: 0.4419
##     - Chinstrap: 0.1977
##     - Gentoo: 0.3605
## 
## ---------------------------------------------------------------------------------

The function below plots the results of the naivebayes model depicting the densities of each group by predictor. these density functions show the probabilities of a particular classification given some value of a predictor. Under bill length for example, we can observe that the probability of the penguin being of the Adelie species given the bill length is below 42 is very high. the peaks of the plots tend to follow the representation of a species in the dataset.

plot(m3)

The foncution matric for the naivebayes classifier shows that this model does not perform as well as the discriminant models - both LDA and QDA. The results show that we misclassify 5 Chinstrap penguins as Adelie, and 1 Adelie penguin as chinstrap in the training set. It also shows we misclassify 3 Adelie penguins as Chinstrap penguins in the testing dataset.

m3.training.predictions<- data.frame(class = m3 %>% predict(Datatrain))
m3.testing.predictions<- data.frame(class = m3 %>% predict(Datatest))


nb.data<-
  rbind(
    cbind(Datatrain, m3.training.predictions) %>% mutate(set = "Training Set"),
    cbind(Datatest, m3.testing.predictions) %>% mutate(set = "Testing Set")
  ) %>% 
  mutate(model = "Naive Bayes")
  

m3.p.train<-
nb.data %>% 
  filter(set == "Training Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "Naive Bayes Training Set Confusion Matrix")

m3.p.test<-
nb.data %>% 
  filter(set == "Testing Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "Naive Bayes Testing Set Confusion Matrix")


gridExtra::grid.arrange(m3.p.train,m3.p.test,ncol=2)

Our accuracy and KAP for these models are very high, however they do not perform better than the LDA or QDA models.

nb.data %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2()
set .metric .estimator .estimate
Testing Set accuracy multiclass 0.9642857
Training Set accuracy multiclass 0.9767442
Testing Set kap multiclass 0.9447126
Training Set kap multiclass 0.9632007

D. Summary

Comment on the models fit/strength/weakness/accuracy for all these three models that you worked with

Finally, each model is presented below in summary table presenting the results from the testing and training datasets for their accuracy and their KAP metric. The LDA and QDA model outperforms the naive bayes model in both training and datasets and when adjusted for class distributions. This makes the naive bayes model relatively less suitable for predictions. The LDA model is more simplistic than the QDA model. because of the LDA model’s relatively lower complexity as well as its ability to perform just as well as the QDA model, this would be the model of choice for this evaluation.

bind_rows()
## # A tibble: 0 x 0
summaryfunction<-function(accuracyset =NULL,MODEL){
accuracyset %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  mutate(model = MODEL)

}

final_summary<-
bind_rows(summaryfunction(nb.data,"Naive Bayes"),
         summaryfunction(lda.data,"LDA"),
         summaryfunction(qda.data,"QDA")
)

final_summary %>%
  filter(.metric =="accuracy") %>% 
  arrange(-.estimate) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2() %>% 
  add_header_above(c("test"= nrow(.)))
set .metric .estimator .estimate model
Testing Set accuracy multiclass 1.0000000 LDA
Testing Set accuracy multiclass 1.0000000 QDA
Training Set accuracy multiclass 0.9961240 LDA
Training Set accuracy multiclass 0.9961240 QDA
Training Set accuracy multiclass 0.9767442 Naive Bayes
Testing Set accuracy multiclass 0.9642857 Naive Bayes
final_summary %>%
  filter(.metric =="kap") %>% 
  arrange(-.estimate)%>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2() %>% 
  add_header_above(c("test"= nrow(.)))
set .metric .estimator .estimate model
Testing Set kap multiclass 1.0000000 LDA
Testing Set kap multiclass 1.0000000 QDA
Training Set kap multiclass 0.9938942 LDA
Training Set kap multiclass 0.9938942 QDA
Training Set kap multiclass 0.9632007 Naive Bayes
Testing Set kap multiclass 0.9447126 Naive Bayes