Introduction

Did you know that there are several types of penguins? Although they all look pretty similar, there are actually 17 species of penguins. Unlike polar bears (which inhabit the Arctic), all penguin species except one live in the southern hemisphere. Three species are of interest to our project: the Adelie, Chinstrap, and Gentoo penguins. They are the three species observed from the Palmer Long-Term Ecological Research Station on Anvers Island, Antarctica.

In this project, we will be using the Palmer Penguins dataset provided by the Palmer LTER. The dataset contains physical observations of penguins. We'll use these physical characteristics to build classification models that predict a penguin's species, using three different approaches: Naive-Bayes, decision tree, and random forest.

*From left to right: Chinstrap, Adelie, and Gentoo penguins*

Data preparation

Before we start, let's load the required libraries

library(palmerpenguins) # The dataset
library(tidyverse) # Data wrangling and visualization
library(GGally) # To make correlation matrix
library(caret) # Near zero variance
library(class) # kNN to fill in missing values
library(e1071) # Naive-Bayes
library(partykit) # Decision tree
library(randomForest) # Random forest
library(kableExtra) # Pretty printing tables
data <- penguins
glimpse(data)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

The dataset contains 344 rows with 8 columns:

  • species: penguin species (Adelie, Gentoo, Chinstrap)
  • island: island where the penguin was observed (Biscoe, Dream, Torgersen)
  • bill_length_mm: bill length (in mm)
  • bill_depth_mm: bill depth (in mm)
  • flipper_length_mm: flipper length (in mm)
  • body_mass_g: body mass (in g)
  • sex: sex (male or female)
  • year: year of observation

Data Preprocessing

Before going deeper, let's check for missing values

colSums(is.na(data))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex              year 
##                 2                 2                11                 0

It seems that we have missing values in the bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and sex columns. Let's take a look at these records.

data %>% 
  filter_all(any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling()
species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie  Torgersen             NA            NA                NA          NA  NA  2007
Adelie  Torgersen           34.1          18.1               193        3475  NA  2007
Adelie  Torgersen           42.0          20.2               190        4250  NA  2007
Adelie  Torgersen           37.8          17.1               186        3300  NA  2007
Adelie  Torgersen           37.8          17.3               180        3700  NA  2007
Adelie  Dream               37.5          18.9               179        2975  NA  2007
Gentoo  Biscoe              44.5          14.3               216        4100  NA  2007
Gentoo  Biscoe              46.2          14.4               214        4650  NA  2008
Gentoo  Biscoe              47.3          13.8               216        4725  NA  2009
Gentoo  Biscoe              44.5          15.7               217        4875  NA  2009
Gentoo  Biscoe                NA            NA                NA          NA  NA  2009

It seems there are 2 records that do not have any measurements at all. We can drop these records, as they contain no usable data.

# Removing records with body_mass_g NA 
data <- data %>% 
  filter(!is.na(body_mass_g))

The other 9 records are only missing the sex value. We can try to impute them. Since the missing values pertain to sex, we could simply assume a 50-50 split. However, there might be a correlation between body mass or other physical characteristics and sex. We'll keep these records and try to impute them later in the data exploration & wrangling section.

Now let's check for duplicate rows

data[data %>% duplicated(), ]
## # A tibble: 0 x 8
## # … with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>

There are no duplicated rows.

Data exploration & wrangling

Adjusting data types

The data types for our variables are:

  • species: categorical (fct)
  • island: categorical (fct)
  • bill_length_mm: numeric (dbl)
  • bill_depth_mm: numeric (dbl)
  • flipper_length_mm: numeric (int)
  • body_mass_g: numeric (int)
  • sex: categorical (fct)
  • year: numeric (int)

Although year is stored as a number, in this project we are not treating it like one. In the context of this data, year is a categorical variable that records when the observation was taken. Therefore, let's change this variable into a categorical one.

data$year <- factor(data$year)

As a remark, it is notable that body_mass_g and flipper_length_mm are reported as integer values. Body mass and flipper length are continuous variables, so logically the values would contain decimals. By reporting them as integers (probably rounded), the resolving power is reduced. Sadly, we cannot do anything about this.

Zero variance

Let's check if there are variables with near zero variance

data %>% nearZeroVar()
## integer(0)

There are no variables with near zero variance.

Data distribution

Let's start by getting to know the distribution of our numeric data

boxplot(data %>% select_if(is.numeric))

summary(data %>% select_if(is.numeric))
##  bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g  
##  Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700  
##  1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550  
##  Median :44.45   Median :17.30   Median :197.0     Median :4050  
##  Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202  
##  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750  
##  Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300

Compared to the other numeric variables, body_mass_g is reported in grams, which makes its values orders of magnitude larger than the rest. We can remedy this in several ways. One of them is changing the units of the values: from mm to cm and from grams to kg.

data.clean <- data %>% 
  mutate(
    bill_length_cm = bill_length_mm / 10,
    bill_depth_cm = bill_depth_mm / 10,
    flipper_length_cm = flipper_length_mm / 10,
    body_mass_kg = body_mass_g / 1000,
  ) %>% 
  select(-c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))

data.clean %>% 
  head() %>% 
  kable() %>% 
  kable_styling()
species island    sex    year bill_length_cm bill_depth_cm flipper_length_cm body_mass_kg
Adelie  Torgersen male   2007           3.91          1.87              18.1        3.750
Adelie  Torgersen female 2007           3.95          1.74              18.6        3.800
Adelie  Torgersen female 2007           4.03          1.80              19.5        3.250
Adelie  Torgersen female 2007           3.67          1.93              19.3        3.450
Adelie  Torgersen male   2007           3.93          2.06              19.0        3.650
Adelie  Torgersen female 2007           3.89          1.78              18.1        3.625

Let's see how the ranges of the values now compare to each other

boxplot(data.clean %>% select_if(is.numeric))

After adjusting the variables (feature engineering), the values of all variables now lie between 0 and 20, which is much more balanced than before.
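
Another common remedy, had we preferred to keep the original units, is z-score standardization (mean 0, standard deviation 1). A minimal sketch using dplyr's across() with base R's scale(); the data.scaled name is introduced here just for illustration:

# Standardize every numeric column to mean 0, sd 1 (an alternative to changing units)
data.scaled <- data %>% 
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))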

Correlation

Let's see the correlation between our variables

ggcorr(data.clean, label = T)

There are very strong correlations between the physical characteristics. This is expected, as an animal's physical dimensions tend to scale together (the study of such relationships is called morphometrics). This suggests that we might not need all the variables. Let's try to exclude flipper_length_cm from the features.

data.clean %>% 
  select(-flipper_length_cm) %>% 
  ggcorr(label = T)

This gives a much better correlation matrix. Before justifying the removal of flipper_length_cm, let's see whether it provides any predictive power beyond body_mass_kg. To do this, we'll plot flipper_length_cm against body_mass_kg, colored by species.

data.clean %>% 
  ggplot(aes(x = flipper_length_cm, y = body_mass_kg , color = species)) +
  geom_point() +
  geom_smooth(se = F) + 
  labs(title = 'Flipper length (cm) vs. Body mass (kg)', x = 'Flipper length (cm)', y = 'Body mass (kg)')

We can see that the relation between flipper_length_cm and body_mass_kg can be satisfactorily described linearly. Therefore, we can remove flipper_length_cm from our dataset.
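
As a quick quantitative check, we can compute the Pearson correlation directly (use = "complete.obs" guards against any rows still missing values):

# Pearson correlation between flipper length and body mass
cor(data.clean$flipper_length_cm, data.clean$body_mass_kg, use = "complete.obs")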

data.clean <- data.clean %>% 
  select(-flipper_length_cm)

The effect of island

island is an interesting variable. Let's see if being on a particular island has any effect on the penguins.

data.clean %>% 
  select(-c(sex, year)) %>% 
  pivot_longer(-c(species, island), names_to = 'params') %>% 
  ggplot(aes(x = island, y = value, color = species)) +
  geom_jitter() +
  facet_wrap(~params)

We see that some species are only found on some islands: Gentoo on Biscoe and Chinstrap on Dream, while Adelie are found on all three islands (Biscoe, Dream, and Torgersen). From this, we can expect island to be a very strong predictor. However, this predictor might be misleading: island is not an intrinsic property of the objects we are trying to predict, the penguins. If our model learns to rely on this variable, it will fail when a population of Chinstrap or Adelie migrates to another island. Therefore, we'll drop this variable from our dataset.
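
We can confirm this pattern with a quick contingency table before dropping the variable:

# Count observations of each species per island
table(data.clean$species, data.clean$island)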

# Remove island
data.clean <- data.clean %>% select(-island)

Imputing sex

Let's tackle the missing values in the sex column

data.clean %>% 
  filter_all(any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling()
species sex year bill_length_cm bill_depth_cm body_mass_kg
Adelie  NA  2007           3.41          1.81        3.475
Adelie  NA  2007           4.20          2.02        4.250
Adelie  NA  2007           3.78          1.71        3.300
Adelie  NA  2007           3.78          1.73        3.700
Adelie  NA  2007           3.75          1.89        2.975
Gentoo  NA  2007           4.45          1.43        4.100
Gentoo  NA  2008           4.62          1.44        4.650
Gentoo  NA  2009           4.73          1.38        4.725
Gentoo  NA  2009           4.45          1.57        4.875

Each species is represented in the missing values. In the context of imputing sex, our working assumption is that there is a correlation between physical characteristics and sex. Before that, we have to control for year, as weight may fluctuate depending on the conditions in a particular year. To check, we'll see how weight fluctuates by year for each species.

data.clean %>% 
  group_by(year, species) %>%
  summarize(mean_weight = mean(body_mass_kg)) %>%
  arrange(year) %>% 
  ggplot(aes(x = as.numeric(year))) +
  geom_line(aes(y = mean_weight, color = species)) + 
  geom_jitter(data = data.clean, aes(x = as.numeric(year), y = body_mass_kg, color = species)) +
  labs(title = 'Mean weight of penguins', x = 'Year', y = 'Mean weight (kg)')

The plot shows that the average weight of each species is stable across years. We can safely say that there is no meaningful relation between year and the penguins' body mass. However, the weight does differ between species.

We'll impute the sex data using k-nearest neighbor, taking species into account. We separate the observations without sex, and later combine them back into the dataset.

# Get the observations with missing sex
sex_na <- data.clean %>% 
  filter(is.na(sex)) %>% 
  select(-sex)

# Get population with complete data
pop <- data.clean %>% 
  filter(!is.na(sex))

Before that, we have to decide on a k value.

# Searching for a k value: a common rule of thumb is sqrt(n)
k <- round(sqrt(nrow(pop)))

The k value is 18. Because it's even, we'll add 1 to make it odd, which avoids ties when the neighbors vote.

k <- k + 1

Now let's use kNN to predict the sex

# Splitting sex from data, selecting relevant variables
sex <- pop %>% select(species, sex)
pop <- pop %>% select(species, bill_length_cm, bill_depth_cm, body_mass_kg)
sex_na <- sex_na %>% select(species, bill_length_cm, bill_depth_cm, body_mass_kg)

# Using kNN for Gentoo
sex_gentoo <- knn(train = pop %>% filter(species == 'Gentoo') %>% select(-species) %>% as.matrix(), 
    test = sex_na %>% filter(species == 'Gentoo') %>% select(-species) %>% as.matrix(), 
    cl = sex %>% filter(species == 'Gentoo') %>% pull(sex),
    k = k
    )

# Using kNN for Adelie
sex_adelie <- knn(train = pop %>% filter(species == 'Adelie') %>% select(-species) %>% as.matrix(), 
    test = sex_na %>% filter(species == 'Adelie') %>% select(-species) %>% as.matrix(), 
    cl = sex %>% filter(species == 'Adelie') %>% pull(sex),
    k = k
    )

# Combine the predictions; the order (Adelie, then Gentoo) matches the row order of sex_na
sex.combined <- data.frame(sex = fct_c(sex_adelie, sex_gentoo))
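
As a quick sanity check, let's look at how the imputed sexes are distributed:

table(sex.combined$sex)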

Now that we have the predicted sex, let's merge the records back into the dataset

sex.combined <- data.clean %>% 
  filter(is.na(sex)) %>%
  select(-sex) %>% 
  add_column(sex.combined)

data.clean <- data.clean %>% 
  filter(!is.na(sex)) %>%
  add_row(sex.combined)

And verify there are no NAs left

colSums(is.na(data.clean))
##        species            sex           year bill_length_cm  bill_depth_cm 
##              0              0              0              0              0 
##   body_mass_kg 
##              0

Year

The year variable is irrelevant for our learning task because it is not a physical attribute of our target object (the penguin). Therefore, we'll drop year from our dataset.

data.clean <- data.clean %>% select(-year)

Summary

From our exploration, we concluded that:

  • island is a very strong predictor (potentially a perfect / quasi-perfect separator), but one that could mislead the model
  • year is irrelevant in our dataset

And we have:

  • transformed bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g into a scale that is more numerically balanced
  • dropped flipper_length_cm from the dataset
  • removed island and year from the dataset
  • imputed sex using kNN according to species

Now let's dive into modeling

Data splitting

Before diving into the models, let's split our data into train and test sets.

# Randomly sample 80% of the row indices for training
idx <- sample(nrow(data.clean), nrow(data.clean) * 0.8)

data.train <- data.clean[idx, ]
data.test <- data.clean[-idx, ]
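
Note that sample() draws a simple random split, and since no seed is set, the split will differ between runs. If reproducibility or class-balanced splits matter, caret's createDataPartition() samples within each class; a sketch (the seed value and the .strat names are arbitrary, for illustration):

# Stratified 80/20 split that preserves species proportions
set.seed(42)  # arbitrary seed, for reproducibility
idx.strat <- createDataPartition(data.clean$species, p = 0.8, list = FALSE)
data.train.strat <- data.clean[idx.strat, ]
data.test.strat <- data.clean[-idx.strat, ]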

Naive-Bayes

Let's try training a Naive-Bayes classifier on our data

model.naive <- naiveBayes(x = data.train %>% select(-species) , 
                          y = data.train$species, 
                          laplace = 1)

result.naive <- predict(model.naive, data.test %>% select(-species))

# Make confusion matrix
cm.naive <- confusionMatrix(result.naive, data.test$species)
cm.naive
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         1      0
##   Chinstrap      0        14      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9855          
##                  95% CI : (0.9219, 0.9996)
##     No Information Rate : 0.4638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9771          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9333        1.0000
## Specificity                 0.9730           1.0000        1.0000
## Pos Pred Value              0.9697           1.0000        1.0000
## Neg Pred Value              1.0000           0.9818        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2029        0.3188
## Detection Prevalence        0.4783           0.2029        0.3188
## Balanced Accuracy           0.9865           0.9667        1.0000

We have a very nice Naive-Bayes model with an accuracy of 98.55%.
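
A nice property of Naive-Bayes is that the fitted model is easy to inspect: for numeric predictors, e1071's naiveBayes stores the per-class conditional distribution in the tables element. For example:

# Class-conditional mean (column 1) and standard deviation (column 2) of body mass
model.naive$tables$body_mass_kg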

Decision Tree

A decision tree is a very interpretable model: it produces a tree of decision rules that shows how each data point is classified. The tree created by the model can even be read as a field guide on how to identify penguins.

model.dt <- ctree(species ~ . , data =  data.train)

plot(model.dt, type = "simple")

Model Evaluation

Let's evaluate the model

result.dt <- predict(model.dt, data.test)

cm.dt <- confusionMatrix(result.dt, data.test$species)
cm.dt
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         1      0
##   Chinstrap      0        14      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9855          
##                  95% CI : (0.9219, 0.9996)
##     No Information Rate : 0.4638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9771          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9333        1.0000
## Specificity                 0.9730           1.0000        1.0000
## Pos Pred Value              0.9697           1.0000        1.0000
## Neg Pred Value              1.0000           0.9818        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2029        0.3188
## Detection Prevalence        0.4783           0.2029        0.3188
## Balanced Accuracy           0.9865           0.9667        1.0000

The decision tree that we built has an accuracy of 98.55%, the same as the Naive Bayes model.

A decision tree can pick up very small details of the data and build rules that fit individual observations. However, such a tree will overfit the training data. We can improve the model by controlling the branching of the tree. There are 3 parameters to control:

  • alpha: the significance level a split test must pass before a node is allowed to split (default: 0.05)
  • minsplit: the minimum number of observations in a node before it is allowed to split (default: 20)
  • minbucket: the minimum number of observations that must remain in a terminal node (default: 7)

The process of trimming a decision tree back is called pruning, a fitting term.

Let's try to prune a model with some control terms.

model.dt.ctrl <- ctree(species ~ . , 
                  data =  data.train,
                  control = ctree_control(alpha=0.001,
                                            minsplit=0,
                                            minbucket=0))
plot(model.dt.ctrl, type = 'simple')

result.dt.ctrl <- predict(model.dt.ctrl, data.test)

cm.dt.ctrl <- confusionMatrix(result.dt.ctrl, data.test$species)
cm.dt.ctrl
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         1      0
##   Chinstrap      0        14      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9855          
##                  95% CI : (0.9219, 0.9996)
##     No Information Rate : 0.4638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9771          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9333        1.0000
## Specificity                 0.9730           1.0000        1.0000
## Pos Pred Value              0.9697           1.0000        1.0000
## Neg Pred Value              1.0000           0.9818        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2029        0.3188
## Detection Prevalence        0.4783           0.2029        0.3188
## Balanced Accuracy           0.9865           0.9667        1.0000

By setting minbucket and minsplit to 0, we are allowing the tree to learn every detail of the training data. Fortunately, this dataset is not very complicated, so the tree does not grow too big, and we still end up with an accuracy of 98.55%.
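
To see how much the relaxed controls changed the tree, partykit can report tree size directly: width() returns the number of terminal nodes and depth() the depth of the tree.

# Compare tree sizes under default vs. relaxed controls
width(model.dt)       # terminal nodes, default controls
width(model.dt.ctrl)  # terminal nodes, relaxed controls
depth(model.dt.ctrl)  # depth of the relaxed tree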

Random Forest

Random forest is an ensemble model: it trains many decision trees, each on a random subset of the data and features, and aggregates their votes to get a better prediction. The trade-off is that, unlike a single decision tree, a random forest model cannot be easily interpreted.

rf.ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)
model.rf <- train(species ~ ., data = data.train, method = "rf", trControl = rf.ctrl)
result.rf <- predict(model.rf, data.test)

cm.rf <- confusionMatrix(result.rf, data.test$species)
cm.rf
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         0      0
##   Chinstrap      0        15      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9479, 1)
##     No Information Rate : 0.4638     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           1.0000        1.0000
## Specificity                 1.0000           1.0000        1.0000
## Pos Pred Value              1.0000           1.0000        1.0000
## Neg Pred Value              1.0000           1.0000        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2174        0.3188
## Detection Prevalence        0.4638           0.2174        0.3188
## Balanced Accuracy           1.0000           1.0000        1.0000

The random forest model produces an accuracy of 100%. Random forest is indeed a strong algorithm. However, because our dataset is small, it might be more complexity than the dataset needs, and a single decision tree or a Naive Bayes model is enough.
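
Even though the forest itself is hard to read, we can still ask which features it relied on. caret exposes this through varImp() on the trained model:

# Variable importance as measured by the random forest
varImp(model.rf)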

Conclusion

We have carried out an investigation of the penguins around the Palmer LTER. Our models can differentiate the penguins based on their physical characteristics: sex, bill length, bill depth, and body mass. Aside from that, we have also provided an "identification manual" in the form of our decision tree model, which can actually be used as a guide in the field.