Assignment 2

Data – 622 Homework 2

The dataset that will be used for this assignment is the penguin dataset. this can here: <https://allisonhorst.github.io/palmerpenguins/articles/intro.html>

Data Exploration

Before conducting any modeling, we will explore our dataset. Below describes the penguin dataset where we have 344 observations with 8 features used to describe these observations. Three of these variables are nominal while the other five are numeric.

summary(penguins) %>%kable() %>%  kable_styling(
      full_width = F, position="center") %>% 
  scroll_box()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie :152	Biscoe :168	Min. :32.10	Min. :13.10	Min. :172.0	Min. :2700	female:165	Min. :2007
Chinstrap: 68	Dream :124	1st Qu.:39.23	1st Qu.:15.60	1st Qu.:190.0	1st Qu.:3550	male :168	1st Qu.:2007
Gentoo :124	Torgersen: 52	Median :44.45	Median :17.30	Median :197.0	Median :4050	NA’s : 11	Median :2008
NA	NA	Mean :43.92	Mean :17.15	Mean :200.9	Mean :4202	NA	Mean :2008
NA	NA	3rd Qu.:48.50	3rd Qu.:18.70	3rd Qu.:213.0	3rd Qu.:4750	NA	3rd Qu.:2009
NA	NA	Max. :59.60	Max. :21.50	Max. :231.0	Max. :6300	NA	Max. :2009
NA	NA	NA’s :2	NA’s :2	NA’s :2	NA’s :2	NA	NA

We can also see in this dataset for the categorical values that our feature that we are looking to predict on are comprised of Adelie, Chinstrap, and Gentoo penguins, where Adelie comprises of majority of the dataset. All features experience NA and there are a few observations where majority of the values are NA’s. These are removed from our analyses as they do not provide enough information about the observation to make an inference. This is done by setting a threshold of 0.5, meaning if 50% of the feaures measure NA for any observation, that row is removed from our dataset. The table below shows which observations were removed, 1 for an Adelie species and one for a Gentoo species. they were removed because no information besides island and year were available for these observations.

limitmissing <- .5*ncol(penguins)

retain<-apply(penguins, MARGIN= 1, function(y) sum(length(which(is.na(y)))))<limitmissing

Dataset<-penguins[retain,]

penguins[!retain,] %>% kable() %>%  kable_styling(
      full_width = F, position="center") %>% 
  scroll_box()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Gentoo	Biscoe	NA	NA	NA	NA	NA	2009

For this assignment, let us use ‘species’ as our outcome or the dependent variable.

Linear Discriminant analysis

Linear discriminant analysis (LDA) is a method used in statistics to find a linaer combinations of features that characterize two or more classes of objects or event. Its a mehtod that is closely related to principal component analysis in that they look for linear combinations of variables that best explain the data.

Previously, we ran binary and multi-nomial logistic regression analyses on this dataset, however, these tools can struggle to converge when datasets produce perfect/near perfect predictions. LDA is robust to use when the classes are well-separated and will not suffer from the same instability as logistic regression. LDA is also popular when we have more than two response classes as it provides low-dimensional views of the data.

The table below provides the class distribution of the penguin species. It shows that we have 152 Adelie penguins, 68 Chinstrap penguins and 124 Gentoo penguins.

table(penguins$species)

## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

As shown in the table above, the penguins dataset has approximately 3 classes, Adelie, Chinstrap, and Gentoo. The percentage associated with each class is shown below.

Dataset<-penguins[retain,]

Dataset<-
  Dataset %>% 
  filter(is.na(species)==F) %>% 
  dplyr::select(-year)

prop.table(table(Dataset$species)) %>% as.data.frame() %>% 
  mutate(Freq= percent(Freq)) %>% 
  rename( "Class" = Var1, "Proportion" = "Freq")

##       Class Proportion
## 1    Adelie      44.2%
## 2 Chinstrap      19.9%
## 3    Gentoo      36.0%

We can look at a distribution of our numeric variables below and we can see that each of the features follow close to a gaussian distribution. flipper length follows a bimodal distribution and is likely a distribution of 2 different penguin species. Below shows the distribution of our Dataset with the three penguin classes included. We can see that there is quite the distinction in classes especially when it comes to Bill length vs any of the other numeric variables.

additionally, using the featurePlot function from the caret package, we can develop ellipse base clustering plots centered aronud the means of these distributions for each numeric feature. This clearly shows how there is very little overlap between between the species when bill length vs any other variable is compared. This may be a potential catalyst for very accurate predictions and is expected to be a strong descriptive variable in the model. Other pairwise comparisons tend to experience a lot more overlap, particularly between the Adelie and the Chinstrap species.

GGally::ggpairs(Dataset[,c(1,3,4,5,6)], progress = F, aes(color = species))+ theme_bw()

{
transparentTheme(set = T, trans = .5)
  caret::featurePlot(x = Dataset[,c(3,4,5,6)], y = Dataset$species,
                   plot = "ellipse",
                   auto.key = list(columns = 3)
                   )
}

the boxplot distributions of the numeric variables show a distinction in several cases. For flipper length and body mass and bill depth, gentoo species seems to have a statistically significant difference from the Adelie and chinstrap species. for bill length, Adelie seems to have a statistically signicifant difference from the other species. we should expect good separation of the gentoo species.

caret::featurePlot(x = Dataset[,c(3,4,5,6)], y = Dataset$species,
                   plot = "box",
                   auto.key = list(columns = 3),
                   scales = list(y = list(relation="free")
                   ))

Model Preparation

We will be comparing three models in this assignment: Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and Naive Bayes (NB). to keep the comparibility between models, the same transformations and preparations will be made to the dataset, and accuracy on training and testing sets will be produced for comparison between models.

Train/Test Split

The first step was to split the data. This was done useing the initial_split function in the tidymodels package and the sampling was stratified by species in order to obtain a similar proportion mix from the original. the train/test ratio was specified as 0.75 meaning 75% of the data was used to train the model while 25% of the data was used to test.

set.seed(1)
Datasplit<- initial_split(Dataset, prop = .75, strata = species)

Transformations

using the recipes package, various transformations were made to the dataset prior to the analysis based on initial exploratory data analysis as well as personal judgment.

several imputation methods for all numeric features were testing included median, knn, mean. because little change occurred across the imputations, the option that was most computationally efficient was selected which was median. For categorical variables, similarly several imputation methods were tested however the one that was most computationally efficient was selected which was mode.

Additionally, all predictors in the dataset were scanned for non-descriptive features by a variance metric. This would allow all zero-variance features or low-variance features to be eliminated from analysis due to lack of model explainability.

Finally, step_corr was used in order to remove variables that experience high multi-colinearity. Because these models develop linear combinations of features in order to explain the target variable, co-linear features are removed in order to preserve model integrety and decrease complexity.

We can see based on the recipe output below that only the island variable was flagged and thus removed from the dataset.

(
lda_Recipe<-
  recipe(species ~ ., data = Dataset) %>% 
  step_medianimpute(all_numeric()) %>% 
  step_modeimpute(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_corr(all_numeric(),threshold = 0.9) %>% 
  step_rm(island) %>% 
  prep()
)

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          6
## 
## Training data contained 342 data points and 9 incomplete rows. 
## 
## Operations:
## 
## Median Imputation for bill_length_mm, ... [trained]
## Mode Imputation for island, sex [trained]
## Zero variance filter removed no terms [trained]
## Sparse, unbalanced variable filter removed no terms [trained]
## Correlation filter removed no terms [trained]
## Variables removed island [trained]

Datatrain<-bake(lda_Recipe,training(Datasplit))

Datatest<-bake(lda_Recipe,testing(Datasplit))

A. Linear Discriminant Analysis:

Model 1 (LDA) below shows the prior probabilities of groups in the dataset as well as the group means associated with each group. the coefficients of linear discreminants are also described.

(m1<-lda(species~. ,data = Datatrain))

## Call:
## lda(species ~ ., data = Datatrain)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.73684      18.31754          190.0614    3718.640 0.5438596
## Chinstrap       48.42745      18.42353          195.5294    3701.961 0.4901961
## Gentoo          47.29462      14.94086          217.0753    5051.344 0.5268817
## 
## Coefficients of linear discriminants:
##                            LD1          LD2
## bill_length_mm     0.094636652 -0.447384887
## bill_depth_mm     -0.991760015 -0.132173902
## flipper_length_mm  0.102386898  0.031059264
## body_mass_g        0.001241383  0.001250533
## sexmale           -0.287041283  0.869301075
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8617 0.1383

Generating predictions on the training and testing data are shown below. An x-y plot of the two LDA coefficients used to describe the data and where each penguin lie are presented in the figure below. We can see based on this figure that we obtain very good separation on both the training and testing datasets.

m1.training.predictions<- m1 %>% predict(Datatrain)
m1.testing.predictions<- m1 %>% predict(Datatest)


lda.data<-
  rbind(
    cbind(Datatrain, m1.training.predictions) %>% mutate(set = "Training Set"),
    cbind(Datatest, m1.testing.predictions) %>% mutate(set = "Testing Set")
  ) %>% 
  mutate(model = "LDA")



ggplot(lda.data) + geom_point(aes(x.LD1, x.LD2, colour = species), size = 2.5)+
  defaulttheme+
  facet_wrap(.~set)

To assess whether this separation is accurate, we can compare our predictions to the true values. A confusion matrix is presented below and this shows that we only misclassify one Chinstrap penguin as an Adelie penguin in the training set, and have classified perfectly on the testing datasets.

m1.p.train<-
lda.data %>% 
  filter(set == "Training Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "LDA Training Set Confusion Matrix")

m1.p.test<-
lda.data %>% 
  filter(set == "Testing Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "LDA Testing Set Confusion Matrix")


gridExtra::grid.arrange(m1.p.train,m1.p.test,ncol=2)

Two metrics are presented below for this classification problem. “Accuracy” and “KAP”. Accuracy measures the raw accuracy given the precision and specificity of the model in how many true positives, and true negatives there are in comparison to the total amount of predictions.

KAP stands for kappa and provides a true estimate of the accuracy when compared to a control which is used to assess the true accuracy when there are class imbalances in the dataset. and we can see that we have near perfect predictions on the training set and perfect predictions on the testing set

lda.data %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2()

set	.metric	.estimator	.estimate
Testing Set	accuracy	multiclass	1.0000000
Training Set	accuracy	multiclass	0.9961240
Testing Set	kap	multiclass	1.0000000
Training Set	kap	multiclass	0.9938942

B. Quadratic Discriminant Analysis

Quadratic discriminant analysis is another form of discriminant analysis that alters the gassian densities. This allows the features to be predicted by more than 2 DA descriptors or some curvature separation function. The same transformations are used in order to keep models comparable.

the prior probabilities and the means of each group are presented in the model call from the qda function below.

(m2<-qda(species~. ,data = Datatrain))

## Call:
## qda(species ~ ., data = Datatrain)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie          38.73684      18.31754          190.0614    3718.640 0.5438596
## Chinstrap       48.42745      18.42353          195.5294    3701.961 0.4901961
## Gentoo          47.29462      14.94086          217.0753    5051.344 0.5268817

because qda can be multi-dimensional (greater than 2 dimension) separation of this function is not visible on an x/y plane. However, based on the confusion matric below, the accuracy of the QDA function performs identical to the LDA function missclassifying one Chinstrap penguin as Adelie. Thus the same accuracy and KAP metrics are observed. as shown below.

m2.training.predictions<- m2 %>% predict(Datatrain)
m2.testing.predictions<- m2 %>% predict(Datatest)


qda.data<-
  rbind(
    cbind(Datatrain, m2.training.predictions) %>% mutate(set = "Training Set"),
    cbind(Datatest, m2.testing.predictions) %>% mutate(set = "Testing Set")
  ) %>% 
  mutate(model = "QDA")
  

m2.p.train<-
qda.data %>% 
  filter(set == "Training Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "QDA Training Set Confusion Matrix")

m2.p.test<-
qda.data %>% 
  filter(set == "Testing Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "QDA Testing Set Confusion Matrix")


gridExtra::grid.arrange(m2.p.train,m2.p.test,ncol=2)

qda.data %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2()

set	.metric	.estimator	.estimate
Testing Set	accuracy	multiclass	1.0000000
Training Set	accuracy	multiclass	0.9961240
Testing Set	kap	multiclass	1.0000000
Training Set	kap	multiclass	0.9938942

C. Naive Bayes

Finally, Naive Bayes classifier is a family of simple probabalistic classifiers that applies Bayes theorem but assumes independence between variables - hence Naive. they are among the simplest of models in the bayesian world of models but can acheive very high accuracy. Because of the assumptions behind naive bayes, they are models that are typically described as having high bias, but low variance. The library naivebayes is used to conduct this analysis.

Again, the same transformations are used to keep models comparable. A summary of the model output is shown below with the prior probabilities, assumed distribution for features, the mean of each predictor by penguin, and the standard deviation of each predictor by penguin.

library(naivebayes)

(m3<-naivebayes::naive_bayes(species~. ,data = Datatrain, type = "class"))

## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = species ~ ., data = Datatrain, 
##     type = "class")
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##               
## bill_length_mm    Adelie Chinstrap    Gentoo
##           mean 38.736842 48.427451 47.294624
##           sd    2.573702  3.267481  3.206121
## 
## --------------------------------------------------------------------------------- 
##  ::: bill_depth_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##              
## bill_depth_mm    Adelie Chinstrap    Gentoo
##          mean 18.317544 18.423529 14.940860
##          sd    1.251777  1.049112  1.031592
## 
## --------------------------------------------------------------------------------- 
##  ::: flipper_length_mm (Gaussian) 
## --------------------------------------------------------------------------------- 
##                  
## flipper_length_mm     Adelie  Chinstrap     Gentoo
##              mean 190.061404 195.529412 217.075269
##              sd     6.117269   6.917667   6.557831
## 
## --------------------------------------------------------------------------------- 
##  ::: body_mass_g (Gaussian) 
## --------------------------------------------------------------------------------- 
##            
## body_mass_g    Adelie Chinstrap    Gentoo
##        mean 3718.6404 3701.9608 5051.3441
##        sd    463.5236  345.4650  518.3896
## 
## --------------------------------------------------------------------------------- 
##  ::: sex (Bernoulli) 
## --------------------------------------------------------------------------------- 
##         
## sex         Adelie Chinstrap    Gentoo
##   female 0.4561404 0.5098039 0.4731183
##   male   0.5438596 0.4901961 0.5268817
## 
## ---------------------------------------------------------------------------------

summary(m3)

## 
## ================================== Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = species ~ ., data = Datatrain,      type = "class") 
## - Laplace: 0 
## - Classes: 3 
## - Samples: 258 
## - Features: 5 
## - Conditional distributions: 
##     - Bernoulli: 1
##     - Gaussian: 4
## - Prior probabilities: 
##     - Adelie: 0.4419
##     - Chinstrap: 0.1977
##     - Gentoo: 0.3605
## 
## ---------------------------------------------------------------------------------

The function below plots the results of the naivebayes model depicting the densities of each group by predictor. these density functions show the probabilities of a particular classification given some value of a predictor. Under bill length for example, we can observe that the probability of the penguin being of the Adelie species given the bill length is below 42 is very high. the peaks of the plots tend to follow the representation of a species in the dataset.

plot(m3)

The foncution matric for the naivebayes classifier shows that this model does not perform as well as the discriminant models - both LDA and QDA. The results show that we misclassify 5 Chinstrap penguins as Adelie, and 1 Adelie penguin as chinstrap in the training set. It also shows we misclassify 3 Adelie penguins as Chinstrap penguins in the testing dataset.

m3.training.predictions<- data.frame(class = m3 %>% predict(Datatrain))
m3.testing.predictions<- data.frame(class = m3 %>% predict(Datatest))


nb.data<-
  rbind(
    cbind(Datatrain, m3.training.predictions) %>% mutate(set = "Training Set"),
    cbind(Datatest, m3.testing.predictions) %>% mutate(set = "Testing Set")
  ) %>% 
  mutate(model = "Naive Bayes")
  

m3.p.train<-
nb.data %>% 
  filter(set == "Training Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "Naive Bayes Training Set Confusion Matrix")

m3.p.test<-
nb.data %>% 
  filter(set == "Testing Set") %>% 
  conf_mat(truth = species,estimate = class) %>% autoplot(type = "heatmap")+
  defaulttheme+
  #scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")+
  labs(title = "Naive Bayes Testing Set Confusion Matrix")


gridExtra::grid.arrange(m3.p.train,m3.p.test,ncol=2)

Our accuracy and KAP for these models are very high, however they do not perform better than the LDA or QDA models.

nb.data %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2()

set	.metric	.estimator	.estimate
Testing Set	accuracy	multiclass	0.9642857
Training Set	accuracy	multiclass	0.9767442
Testing Set	kap	multiclass	0.9447126
Training Set	kap	multiclass	0.9632007

D. Summary

Comment on the models fit/strength/weakness/accuracy for all these three models that you worked with

Finally, each model is presented below in summary table presenting the results from the testing and training datasets for their accuracy and their KAP metric. The LDA and QDA model outperforms the naive bayes model in both training and datasets and when adjusted for class distributions. This makes the naive bayes model relatively less suitable for predictions. The LDA model is more simplistic than the QDA model. because of the LDA model’s relatively lower complexity as well as its ability to perform just as well as the QDA model, this would be the model of choice for this evaluation.

bind_rows()

## # A tibble: 0 x 0

summaryfunction<-function(accuracyset =NULL,MODEL){
accuracyset %>% group_by(set) %>% 
  yardstick::metrics(truth = species, estimate = class) %>% 
  mutate(model = MODEL)

}

final_summary<-
bind_rows(summaryfunction(nb.data,"Naive Bayes"),
         summaryfunction(lda.data,"LDA"),
         summaryfunction(qda.data,"QDA")
)

final_summary %>%
  filter(.metric =="accuracy") %>% 
  arrange(-.estimate) %>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2() %>% 
  add_header_above(c("test"= nrow(.)))

set	.metric	.estimator	.estimate	model
Testing Set	accuracy	multiclass	1.0000000	LDA
Testing Set	accuracy	multiclass	1.0000000	QDA
Training Set	accuracy	multiclass	0.9961240	LDA
Training Set	accuracy	multiclass	0.9961240	QDA
Training Set	accuracy	multiclass	0.9767442	Naive Bayes
Testing Set	accuracy	multiclass	0.9642857	Naive Bayes

final_summary %>%
  filter(.metric =="kap") %>% 
  arrange(-.estimate)%>% 
  kbl() %>% 
  kable_styling(
      full_width = F, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2() %>% 
  add_header_above(c("test"= nrow(.)))

set	.metric	.estimator	.estimate	model
Testing Set	kap	multiclass	1.0000000	LDA
Testing Set	kap	multiclass	1.0000000	QDA
Training Set	kap	multiclass	0.9938942	LDA
Training Set	kap	multiclass	0.9938942	QDA
Training Set	kap	multiclass	0.9632007	Naive Bayes
Testing Set	kap	multiclass	0.9447126	Naive Bayes