Data Exploration

penguins_df <- palmerpenguins::penguins %>% 
  dplyr::select(-year)

str(penguins_df)

## tibble [344 x 7] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...

The penguins dataset is composed of 344 datapoints and has one response variable (species) and seven explanatory variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year).

Intuitively, we know that the year variable shouldn’t make a difference in the penguin species, so we will remove it from the dataset. We are now left with two categorical variables (island and sex) and four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm andbody_mass_g).

Dealing with Missing Values

summary(penguins_df)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex     
##  Min.   :172.0     Min.   :2700   female:165  
##  1st Qu.:190.0     1st Qu.:3550   male  :168  
##  Median :197.0     Median :4050   NA's  : 11  
##  Mean   :200.9     Mean   :4202               
##  3rd Qu.:213.0     3rd Qu.:4750               
##  Max.   :231.0     Max.   :6300               
##  NA's   :2         NA's   :2

From the summary statistics, we can see that 5 out of 6 of the explanatory variables have at least some null values. It doesn’t make sense to impute the null records of our categorical variable (sex), so we will remove them from our dataset. We will impute the missing data of the numerical values using the Multivariate imputation by chained equations (MICE) method. Multiple imputation involves creating multiple predictions for each missing value, helping to account for the uncertainty in the individual imputations.

# remove records with null sex values
penguins_df <- penguins_df %>% filter(!is.na(sex))

# impute null values using MICE method
penguins_df <- complete(mice(data = penguins_df,
                         method = "pmm", print = FALSE), 3)

Subsetting the Data

One of the assumptions for LDA, QDA, and Naive Bayes is that that each predictor variable is normally distributed. This implies non-categorical variables, so we will eliminate them from the dataset.

penguins_df <- penguins_df %>%
  dplyr::select(-island, -sex)

The final dataset contains 333 records with 4 explanatory variables.

Visualizing the Data

In order for LDA, QDA and Naive Bayes methods to work optimally, the data must be approximately normally distributed for each variable in each class. Let’s take a look:

For the most part, the data appear to be normally distributed for each class, so we are set to move forward.

Variable Relationships

Covariances

In LDA, we assume that the covariance matrices are the same between classes. We can use Box’s M test to test if the covariance matrices are equal. The null hypothesis is that the covariances are equal across all groups.

boxm <- heplots::boxM(penguins_df %>% select_if(is.numeric), penguins_df$species)
boxm

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  penguins_df %>% select_if(is.numeric)
## Chi-Sq (approx.) = 74.731, df = 20, p-value = 3.02e-08

Since the p-value is less than 0.05, we can reject the null hypothesis and we know that the covariance matrices are different for at least one group. This means that an LDA model will likely not perform as well as a QDA model.

We can visualize the covariances of the variables against the species to identify which ones are the most different:

heplots::covEllipses(penguins_df %>% dplyr::select_if(is.numeric), 
                     penguins_df$species, 
                     fill = TRUE, 
                     pooled = FALSE, 
                     col = c("blue", "red", "orange"), 
                     variables = c(1:ncol(penguins_df %>% dplyr::select_if(is.numeric))), 
                     fill.alpha = 0.05)

It is evident by the differing shapes of the ellipses that the variance is not equal for some of our variables (ex: bill_depth).

Correlations

In LDA and QDA, there are no assumptions about the relationships between the variables. However, in Naive Bayes, we assume that all features are independent of one another, ie, no correlations between variables in a class. Let’s take a look at the relationships of the numeric variables against each other for each species.

First, we’ll take a look at the Adelie species:

##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm                NA     0.3858132         0.3322740   0.5442764
## bill_depth_mm                 NA            NA         0.3108974   0.5801560
## flipper_length_mm             NA            NA                NA   0.4648539
## body_mass_g                   NA            NA                NA          NA

Next, we’ll take a look at the Chinstrap species:

##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm                NA     0.6535362         0.4716073   0.5136383
## bill_depth_mm                 NA            NA         0.5801429   0.6044983
## flipper_length_mm             NA            NA                NA   0.6415594
## body_mass_g                   NA            NA                NA          NA

Finally, we’ll take a look at the Gentoo species:

##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm                NA     0.6540233         0.6642052   0.6667302
## bill_depth_mm                 NA            NA         0.7106422   0.7229672
## flipper_length_mm             NA            NA                NA   0.7113053
## body_mass_g                   NA            NA                NA          NA

We can see that there is a stronger relationship between many of the variables for the Gentoo species. We know off the bat if we use all variables in the model that the Naive Bayes method might not perform as well as the other methods.

We can also visualize these relationships:

caret::featurePlot(x = penguins_df %>% select_if(is.numeric),
        y = penguins_df$species,
        plot = "pairs",
        auto.key = list(columns = 3)) # add key for species at the top

We can see that the Gentoo species is more easily separable from the other two species. Gentoo penguins tend to have a smaller bill depth, larger flipper length, and larger body mass than the other species.

Normalization of data

Since discriminant analysis can be affected by the scale and unit of the predictor variables, we will normalize the continuous predictors for the analysis. We will use the scale function, which will standardize the data so that we have a mean of 0 and a standard deviation of 1.

final_df <- penguins_df %>% 
  mutate_at(c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
            ~(scale(.) %>% as.vector))

Linear Discriminant Analysis

First, we’ll train an LDA model with 10-fold cross validation. LDA is more stable than Logistic Regression, and is more often used in instances where there are more than 2 outcome classes. It assumes that the predictor variables are (1) normally distributed and (2) that the classes have identical covariance matrices. We know from our EDA that the data is approximately normally distributed by class, but that the covariance matrices are not the same.

lda_model <- train(data=final_df, 
                   species ~ ., 
                   method="lda",
                   trControl=trainControl(method="cv", number=10))

We will also create an LDA model that eliminates the bill_depth variable, as this was the feature that appeared to have the most unequal variance across species.

lda_model2 <- train(data=final_df, 
                   species ~ bill_length_mm + flipper_length_mm + body_mass_g, 
                   method="lda",
                   trControl=trainControl(method="cv", number=10))

Quadratic Discriminant Analysis

Quadratic Discriminant Analysis is similar to LDA, but does not assume that the classes have identical covariance matrices. Our conditions are met for QDA, so we will keep all variables in the model.

qda_model <- train(data=final_df, 
                   species ~ ., 
                   method="qda",
                   trControl=trainControl(method="cv", number=10))

Naïve Bayes

A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. We know from our EDA that some of the variables for the Gentoo species had a high correlation, so we anticipate that the Naive Bayes model won’t perform as well as the other two.

nb_model <- train(data=final_df, 
                   species ~ ., 
                   method="nb",
                   trControl=trainControl(method="cv", number=10))

Comparison of Models

Finally, we can compare the accuracy and kappa of our 3 models.

Accuracy: is a measure of how many records were correctly classified by the model
Kappa: is similar to accuracy, but also “takes into account the possibility of the agreement occurring by chance”

In both cases, the higher the value, the better the classification.

nms <- c('LDA - All Variables', 
         'LDA - No Bill Depth', 
         'QDA', 
         'Naive Bayes')
acc <- c(lda_model$results$Accuracy, 
         lda_model2$results$Accuracy, 
         qda_model$results$Accuracy, 
         nb_model$results$Accuracy[1])
kap <- c(lda_model$results$Kappa, 
         lda_model2$results$Kappa, 
         qda_model$results$Kappa, 
         nb_model$results$Kappa[1])


data.frame(nms, acc, kap)

##                   nms       acc       kap
## 1 LDA - All Variables 0.9851103 0.9765707
## 2 LDA - No Bill Depth 0.9729947 0.9574153
## 3                 QDA 0.9879679 0.9811025
## 4         Naive Bayes 0.9700423 0.9530511

The results are pretty similar high the board. However, the QDA model wins out at \(98.8%\) accuracy and \(98.1%\) kappa. The final ranking aligns with that we had expected from our preliminary analysis of the data: QDA, LDA, and finally Naive Bayes.

Summary and Conclusions

In summary:

The QDA model ultimately performed the best out of all models.
The LDA model performed second-best because we failed to meet the assumption that the covariance matrices are equal.
The Naive Bayes model performed the worst (although still very well) because we failed to meet the assumption that the variables were independent.

Data 622: Introduction to Machine Learning

Spring 2020 - Homework #2

Amber Ferger

3/19/2021