R Lab: Classifying Penguins

SETUP

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(caTools)
library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(nnet)
library(rpart)
library(rpart.plot)

library(palmerpenguins)

penguins %>%
  drop_na() -> clean.data

set.seed(284)  # pick your own seed number (I used my abc123 number.)
sample_split <- sample.split(Y = clean.data$species, SplitRatio = 0.5)
train_set <- subset(x = clean.data, sample_split == TRUE)
test_set <- subset(x = clean.data, sample_split == FALSE)
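
As a quick sanity check on the split, the sketch below rebuilds the same objects and tabulates species counts in each half. It assumes that sample.split() stratifies on the labels passed as Y, which is how caTools describes it; the name split_counts is introduced here for illustration only.

```r
library(tidyr)           # drop_na()
library(caTools)         # sample.split()
library(palmerpenguins)  # penguins

clean.data <- drop_na(penguins)

set.seed(284)
sample_split <- sample.split(Y = clean.data$species, SplitRatio = 0.5)
train_set <- subset(clean.data, sample_split == TRUE)
test_set  <- subset(clean.data, sample_split == FALSE)

# Species counts per half (split_counts is a hypothetical helper name)
split_counts <- rbind(train = table(train_set$species),
                      test  = table(test_set$species))
split_counts

# Class proportions should be close to identical in the two halves
round(prop.table(split_counts, margin = 1), 3)
```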

Analysis

Exercise 1: Fit and assess a logistic regression model using the multinom(), predict(), and confusionMatrix() functions. How accurate is this model?

# Multinomial logistic regression of species on the four numeric measurements
logreg.model <- multinom(factor(species) ~ bill_length_mm + bill_depth_mm +
                           flipper_length_mm + body_mass_g,
                         data = train_set, maxit = 50)
summary(logreg.model)

## # weights:  18 (10 variable)
## initial  value 183.468252 
## iter  10 value 12.367957
## iter  20 value 2.244638
## iter  30 value 0.545504
## iter  40 value 0.008076
## iter  50 value 0.000437
## final  value 0.000437 
## stopped after 50 iterations

## Call:
## multinom(formula = factor(species) ~ bill_length_mm + bill_depth_mm + 
##     flipper_length_mm + body_mass_g, data = train_set, maxit = 50)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap   -3.424724      54.200630     -7.862102         -9.445153
## Gentoo      -2.062269       8.234405    -23.872150         -1.546746
##           body_mass_g
## Chinstrap -0.11831060
## Gentoo     0.08214113
## 
## Std. Errors:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap  0.20111311     1.92589567   40.19717111         9.9985935
## Gentoo     0.00139627     0.02550289    0.03699088         0.1024792
##           body_mass_g
## Chinstrap   0.5459752
## Gentoo      0.1025563
## 
## Residual Deviance: 0.0008749982 
## AIC: 20.00087

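# Predict the species of each penguin in the held-out test set
predicted_class <- predict(logreg.model, test_set)
# Note: caret's signature is confusionMatrix(data = <predictions>, reference = <truth>).
# The arguments are passed the other way round here, so in the table below the
# "Prediction" rows are the true species and the "Reference" columns are the predictions.
# Overall accuracy and kappa are unaffected by the order.
confusionMatrix(test_set$species, predicted_class)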

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        71         1      1
##   Chinstrap      4        30      0
##   Gentoo         0         0     59
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9639         
##                  95% CI : (0.923, 0.9866)
##     No Information Rate : 0.4518         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.943          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9467           0.9677        0.9833
## Specificity                 0.9780           0.9704        1.0000
## Pos Pred Value              0.9726           0.8824        1.0000
## Neg Pred Value              0.9570           0.9924        0.9907
## Prevalence                  0.4518           0.1867        0.3614
## Detection Rate              0.4277           0.1807        0.3554
## Detection Prevalence        0.4398           0.2048        0.3554
## Balanced Accuracy           0.9623           0.9691        0.9917

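The penguins data set has 344 observations, of which 333 remain after drop_na(); half of these (166 penguins) form the test set. This model uses the four numeric predictors (bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g); the two factor variables (sex and island) and the variable year are not used. On the test set the model classifies 96.39% of penguins correctly (95% CI 0.923 to 0.9866), far above the no-information rate of 45.18%, so it is very accurate. The fit stopped at the maxit = 50 limit with a residual deviance near zero, which suggests the training data are almost perfectly separated, so the individual coefficients and standard errors should be read with caution.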
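
Because the confusionMatrix() call above passes the true labels as its first argument, the rows of the printed table hold the true species. A minimal base-R sketch, with the counts copied from the output above (the names cm_logreg, accuracy, and recall are illustrative), recomputes the overall accuracy and the share of each true species classified correctly:

```r
species <- c("Adelie", "Chinstrap", "Gentoo")

# Counts copied from the confusion matrix printed above.
# Rows: first argument of confusionMatrix() (here test_set$species, the truth)
# Columns: second argument (here predicted_class, the model's predictions)
cm_logreg <- matrix(c(71,  1,  1,
                       4, 30,  0,
                       0,  0, 59),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(truth = species, predicted = species))

# Overall accuracy: correct classifications (the diagonal) over all test rows
accuracy <- sum(diag(cm_logreg)) / sum(cm_logreg)
round(accuracy, 4)   # 0.9639

# Share of each true species that the model labelled correctly
recall <- diag(cm_logreg) / rowSums(cm_logreg)
round(recall, 4)     # Adelie 0.9726, Chinstrap 0.8824, Gentoo 1.0000
```

These per-species values match the "Pos Pred Value" row of the caret output rather than the "Sensitivity" row, which is what one would expect when the two arguments are swapped. The conventional call would be confusionMatrix(data = predicted_class, reference = test_set$species); the overall accuracy is the same either way.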

Exercise 2: Fit and assess a linear discriminant analysis model using the lda(), predict(), and confusionMatrix() functions. How accurate is this model?
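
The lda() fit for this exercise does not appear in the write-up, so the following is only a sketch of how it could look, assuming the same four predictors and the same train/test split as in Exercise 1. The names lda.model, lda_pred, and lda_cm are illustrative, and no output is shown because the result is not reported in the original.

```r
library(tidyr)
library(caTools)
library(caret)
library(palmerpenguins)

clean.data <- drop_na(penguins)
set.seed(284)
sample_split <- sample.split(Y = clean.data$species, SplitRatio = 0.5)
train_set <- subset(clean.data, sample_split == TRUE)
test_set  <- subset(clean.data, sample_split == FALSE)

# Fit LDA; MASS::lda() is called with its namespace so MASS need not be attached
lda.model <- MASS::lda(species ~ bill_length_mm + bill_depth_mm +
                         flipper_length_mm + body_mass_g,
                       data = train_set)

# predict() on an lda fit returns a list; $class holds the predicted labels
lda_pred <- predict(lda.model, newdata = test_set)$class

# data = predictions, reference = true labels
lda_cm <- confusionMatrix(data = lda_pred, reference = test_set$species)
lda_cm$overall["Accuracy"]
```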

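# Classification tree (CART) on the same four predictors; this is the model
# that Exercises 3 and 4 refer to
CART.model <- rpart(species ~ bill_length_mm + bill_depth_mm +
                      flipper_length_mm + body_mass_g,
                    data = train_set, method = "class")
# Draw the fitted tree with its split rules
rpart.plot(CART.model)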

The accuracy of this model is still quite high.

Exercise 3: Which variables were (automatically) selected for this model?

The tree plotted above splits only on bill_length_mm and flipper_length_mm. The variables bill_depth_mm and body_mass_g did not make the cut.
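
One way to confirm the selection without reading the plot is to ask the fitted rpart object which variables it actually split on. The sketch below refits the same tree so that it can run on its own; used_vars is an illustrative name.

```r
library(tidyr)
library(caTools)
library(rpart)
library(palmerpenguins)

clean.data <- drop_na(penguins)
set.seed(284)
sample_split <- sample.split(Y = clean.data$species, SplitRatio = 0.5)
train_set <- subset(clean.data, sample_split == TRUE)

CART.model <- rpart(species ~ bill_length_mm + bill_depth_mm +
                      flipper_length_mm + body_mass_g,
                    data = train_set, method = "class")

# Each row of $frame is a node; $var names the split variable,
# and "<leaf>" marks terminal nodes
used_vars <- setdiff(unique(as.character(CART.model$frame$var)), "<leaf>")
used_vars

# printcp() also lists the "Variables actually used in tree construction"
printcp(CART.model)
```

Variables that never appear in used_vars can still show up in CART.model$variable.importance, because rpart credits surrogate splits as well as primary ones.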

Exercise 4: How accurate is this model?

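# Predicted class labels from the tree for the held-out test set
preds <- predict(CART.model, newdata = test_set, type = "class")
# As in Exercise 1, the true labels are passed first, so the "Prediction" rows
# below are the true species and the "Reference" columns are the tree's predictions
confusionMatrix(test_set$species, preds)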

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        72         1      0
##   Chinstrap      5        26      3
##   Gentoo         0         0     59
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9458          
##                  95% CI : (0.8996, 0.9749)
##     No Information Rate : 0.4639          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9139          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9351           0.9630        0.9516
## Specificity                 0.9888           0.9424        1.0000
## Pos Pred Value              0.9863           0.7647        1.0000
## Neg Pred Value              0.9462           0.9924        0.9720
## Prevalence                  0.4639           0.1627        0.3735
## Detection Rate              0.4337           0.1566        0.3554
## Detection Prevalence        0.4398           0.2048        0.3554
## Balanced Accuracy           0.9619           0.9527        0.9758

The classification tree is also highly accurate on the test set: its accuracy is 0.9458 (157 of 166 penguins classified correctly), in line with the other models. All nine misclassifications involve the Chinstrap class.

Exercise 5: How do the models compare? (Which is more accurate?). Can you suggest a reason why they differ?

Accuracy is high for all of the models above, but the logistic regression is highest at 0.9639, compared with 0.9458 for the classification tree. One possible reason is that linear discriminant analysis requires more assumptions about the underlying data, namely that the predictors are normally distributed within each species with a shared covariance matrix. Logistic regression is the more flexible approach when these assumptions are violated.
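
To put the gap between the two reported accuracies in perspective, this base-R sketch recomputes them from the correct-classification counts in the confusion matrices above (160 and 157 of 166 test penguins) and attaches exact binomial confidence intervals via binom.test(). The variable names are illustrative.

```r
n_test <- 166  # total test-set penguins (sum of either confusion matrix)

# Diagonal totals of the two confusion matrices reported above
correct <- c(logistic = 71 + 30 + 59,   # 160
             cart     = 72 + 26 + 59)   # 157

accuracy <- correct / n_test
round(accuracy, 4)   # logistic 0.9639, cart 0.9458

# Exact (Clopper-Pearson) 95% interval for each accuracy
ci <- sapply(correct, function(k) binom.test(k, n_test)$conf.int)
rownames(ci) <- c("lower", "upper")
round(ci, 4)

# The models differ by this many correctly classified test penguins
unname(correct["logistic"] - correct["cart"])   # 3
```

A three-penguin difference on a single 50/50 split is small, so the ranking could plausibly change with a different seed.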

Nicole Hayek

2022-12-04
