Introduction

Did you know that there are several types of penguins? Although they all look pretty similar, there are actually 17 species of penguins. Unlike polar bears (which inhabit the Arctic), all penguin species except one live in the southern hemisphere. Three species are of interest to our project: the Adelie, Chinstrap, and Gentoo penguins. They are the three species observed from the Palmer Long-Term Ecological Research Station on Anvers Island, Antarctica.

In this project, we will be using the Palmer Penguins dataset provided by the Palmer LTER. The dataset contains physical observations of penguins. We'll use these physical characteristics to build classification models that predict a penguin's species, using three different approaches: Naive-Bayes, decision tree, and random forest.

*From left to right: Chinstrap, Adelie, and Gentoo penguins*

Data preparation

Before we start, let's load the required libraries

library(palmerpenguins) # The dataset
library(tidyverse) # Data wrangling and visualization
library(GGally) # To make correlation matrix
library(caret) # Near zero variance
library(class) # kNN to fill in missing values
library(e1071) # Naive-Bayes
library(partykit) # Decision tree
library(randomForest) # Random forest
library(kableExtra) # Pretty printing tables
data <- penguins
glimpse(data)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

The dataset contains 344 rows with 8 columns:

  • species: penguin species (Adelie, Gentoo, Chinstrap)
  • island: island where the penguin was observed (Biscoe, Dream, Torgersen)
  • bill_length_mm: bill length (in mm)
  • bill_depth_mm: bill depth (in mm)
  • flipper_length_mm: flipper length (in mm)
  • body_mass_g: body mass (in g)
  • sex: sex (male or female)
  • year: year of observation

Data Preprocessing

Before going deeper, let's check for missing values

colSums(is.na(data))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex              year 
##                 2                 2                11                 0

It seems that we have missing values in the bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and sex columns. Let's take a look at these records.

data %>% 
  filter_all(any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling()
species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie  Torgersen             NA            NA                NA          NA  NA  2007
Adelie  Torgersen           34.1          18.1               193        3475  NA  2007
Adelie  Torgersen           42.0          20.2               190        4250  NA  2007
Adelie  Torgersen           37.8          17.1               186        3300  NA  2007
Adelie  Torgersen           37.8          17.3               180        3700  NA  2007
Adelie  Dream               37.5          18.9               179        2975  NA  2007
Gentoo  Biscoe              44.5          14.3               216        4100  NA  2007
Gentoo  Biscoe              46.2          14.4               214        4650  NA  2008
Gentoo  Biscoe              47.3          13.8               216        4725  NA  2009
Gentoo  Biscoe              44.5          15.7               217        4875  NA  2009
Gentoo  Biscoe                NA            NA                NA          NA  NA  2009

It seems there are 2 records that do not have any measurements at all. We can drop these records, as they contain no usable data.

# Removing records with body_mass_g NA 
data <- data %>% 
  filter(!is.na(body_mass_g))

The other 9 records are only missing the sex value. We can try to impute them. Since the missing values pertain to sex, we could simply assume a 50-50 split. However, there might be a correlation between body mass or other physical characteristics and sex. We'll keep these records and try to impute them later in the data exploration & wrangling section.

Now let's check for duplicate rows

data[data %>% duplicated(), ]
## # A tibble: 0 x 8
## # … with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>

There are no duplicated rows.

Data exploration & wrangling

Adjusting data types

The data types for our variables are:

  • species: categorical (fct)
  • island: categorical (fct)
  • bill_length_mm: numeric (dbl)
  • bill_depth_mm: numeric (dbl)
  • flipper_length_mm: numeric (int)
  • body_mass_g: numeric (int)
  • sex: categorical (fct)
  • year: numeric (int)

Although year is stored as a number, in this project we are not treating it like one. In the context of this data, year is a categorical variable that records when the observation was taken. Therefore, let's change this variable into a categorical one.

data$year <- factor(data$year)

As a remark, it is notable that body_mass_g and flipper_length_mm are reported as integer values. Body mass and flipper length are continuous variables, so logically the values would contain decimals. By reporting them as integers (probably rounded), the resolving power is reduced. Sadly, we cannot do anything about this.

Zero variance

Let's check if there are variables with near zero variance

data %>% nearZeroVar()
## integer(0)

There are no variables with near zero variance.

Data distribution

Let's start by getting to know the distribution of our numeric data

boxplot(data %>% select_if(is.numeric))

summary(data %>% select_if(is.numeric))
##  bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g  
##  Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700  
##  1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550  
##  Median :44.45   Median :17.30   Median :197.0     Median :4050  
##  Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202  
##  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750  
##  Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300

Compared to the other numeric variables, body_mass_g is reported in grams, which makes its values orders of magnitude larger than the rest. We can remedy this in several ways. One of them is changing the units of the values: from mm to cm and from grams to kg.

data.clean <- data %>% 
  mutate(
    bill_length_cm = bill_length_mm / 10,
    bill_depth_cm = bill_depth_mm / 10,
    flipper_length_cm = flipper_length_mm / 10,
    body_mass_kg = body_mass_g / 1000,
  ) %>% 
  select(-c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))

data.clean %>% 
  head() %>% 
  kable() %>% 
  kable_styling()
species island    sex    year bill_length_cm bill_depth_cm flipper_length_cm body_mass_kg
Adelie  Torgersen male   2007           3.91          1.87              18.1        3.750
Adelie  Torgersen female 2007           3.95          1.74              18.6        3.800
Adelie  Torgersen female 2007           4.03          1.80              19.5        3.250
Adelie  Torgersen female 2007           3.67          1.93              19.3        3.450
Adelie  Torgersen male   2007           3.93          2.06              19.0        3.650
Adelie  Torgersen female 2007           3.89          1.78              18.1        3.625

Let's see how the ranges of the values now compare to each other

boxplot(data.clean %>% select_if(is.numeric))

After adjusting the variables (feature engineering), the values of all variables now lie between 0 and 20, which is much more balanced than before.
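
Another common remedy, had we preferred to keep the original units, is z-score standardization (mean 0, standard deviation 1). A minimal sketch using dplyr's across() with base R's scale(); the data.scaled name is introduced here just for illustration:

# Standardize every numeric column to mean 0, sd 1 (an alternative to changing units)
data.scaled <- data %>% 
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))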

Correlation

Let's see the correlation between our variables

ggcorr(data.clean, label = T)

There are very strong correlations between the physical characteristics. This is expected, as an animal's physical dimensions tend to scale together (the study of such relationships is called morphometrics). This suggests that we might not need all the variables. Let's try to exclude flipper_length_cm from the features.

data.clean %>% 
  select(-flipper_length_cm) %>% 
  ggcorr(label = T)

This gives a much better correlation matrix. Before justifying the removal of flipper_length_cm, let's see whether it provides any predictive power beyond body_mass_kg. To do this, we'll plot flipper_length_cm against body_mass_kg, colored by species.

data.clean %>% 
  ggplot(aes(x = flipper_length_cm, y = body_mass_kg , color = species)) +
  geom_point() +
  geom_smooth(se = F) + 
  labs(title = 'Flipper length (cm) vs. Body mass (kg)', x = 'Flipper length (cm)', y = 'Body mass (kg)')

We can see that the relation between flipper_length_cm and body_mass_kg can be satisfactorily described linearly. Therefore, we can remove flipper_length_cm from our dataset.
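
As a quick quantitative check, we can compute the Pearson correlation directly (use = "complete.obs" guards against any rows still missing values):

# Pearson correlation between flipper length and body mass
cor(data.clean$flipper_length_cm, data.clean$body_mass_kg, use = "complete.obs")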

data.clean <- data.clean %>% 
  select(-flipper_length_cm)

The effect of island

island is an interesting variable. Let's see if being on a particular island has any effect on the penguins.

data.clean %>% 
  select(-c(sex, year)) %>% 
  pivot_longer(-c(species, island), names_to = 'params') %>% 
  ggplot(aes(x = island, y = value, color = species)) +
  geom_jitter() +
  facet_wrap(~params)

We see that some species are only found on some islands: Gentoo on Biscoe and Chinstrap on Dream, while Adelie are found on all three islands (Biscoe, Dream, and Torgersen). From this, we can expect island to be a very strong predictor. However, this predictor might be misleading: island is not an intrinsic property of the objects we are trying to predict, the penguins. If our model learns to rely on this variable, it will fail when a population of Chinstrap or Adelie migrates to another island. Therefore, we'll drop this variable from our dataset.
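
We can confirm this pattern with a quick contingency table before dropping the variable:

# Count observations of each species per island
table(data.clean$species, data.clean$island)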

# Remove island
data.clean <- data.clean %>% select(-island)

Imputing sex

Let's tackle the missing values in the sex column

data.clean %>% 
  filter_all(any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling()
species sex year bill_length_cm bill_depth_cm body_mass_kg
Adelie  NA  2007           3.41          1.81        3.475
Adelie  NA  2007           4.20          2.02        4.250
Adelie  NA  2007           3.78          1.71        3.300
Adelie  NA  2007           3.78          1.73        3.700
Adelie  NA  2007           3.75          1.89        2.975
Gentoo  NA  2007           4.45          1.43        4.100
Gentoo  NA  2008           4.62          1.44        4.650
Gentoo  NA  2009           4.73          1.38        4.725
Gentoo  NA  2009           4.45          1.57        4.875

Each species is represented in the missing values. In the context of imputing sex, our working assumption is that there is a correlation between physical characteristics and sex. Before that, we have to control for year, as weight may fluctuate depending on the conditions in a particular year. To check, we'll see how weight fluctuates by year for each species.

data.clean %>% 
  group_by(year, species) %>%
  summarize(mean_weight = mean(body_mass_kg)) %>%
  arrange(year) %>% 
  ggplot(aes(x = as.numeric(year))) +
  geom_line(aes(y = mean_weight, color = species)) + 
  geom_jitter(data = data.clean, aes(x = as.numeric(year), y = body_mass_kg, color = species)) +
  labs(title = 'Mean weight of penguins', x = 'Year', y = 'Mean weight (kg)')

The plot shows that the average weight of each species is stable across years. We can safely say that there is no meaningful relation between year and the penguins' body mass. However, the weight does differ between species.

We'll impute the sex data using k-nearest neighbor, taking species into account. We separate the observations without sex, and later combine them back into the dataset.

# Get the observations with missing sex
sex_na <- data.clean %>% 
  filter(is.na(sex)) %>% 
  select(-sex)

# Get population with complete data
pop <- data.clean %>% 
  filter(!is.na(sex))

Before that, we have to decide on a k value.

# Searching for a k value: a common rule of thumb is sqrt(n)
k <- round(sqrt(nrow(pop)))

The k value is 18. Because it's even, we'll add 1 to make it odd, which avoids ties when the neighbors vote.

k <- k + 1

Now let's use kNN to predict the sex

# Splitting sex from data, selecting relevant variables
sex <- pop %>% select(species, sex)
pop <- pop %>% select(species, bill_length_cm, bill_depth_cm, body_mass_kg)
sex_na <- sex_na %>% select(species, bill_length_cm, bill_depth_cm, body_mass_kg)

# Using kNN for Gentoo
sex_gentoo <- knn(train = pop %>% filter(species == 'Gentoo') %>% select(-species) %>% as.matrix(), 
    test = sex_na %>% filter(species == 'Gentoo') %>% select(-species) %>% as.matrix(), 
    cl = sex %>% filter(species == 'Gentoo') %>% pull(sex),
    k = k
    )

# Using kNN for Adelie
sex_adelie <- knn(train = pop %>% filter(species == 'Adelie') %>% select(-species) %>% as.matrix(), 
    test = sex_na %>% filter(species == 'Adelie') %>% select(-species) %>% as.matrix(), 
    cl = sex %>% filter(species == 'Adelie') %>% pull(sex),
    k = k
    )

# Combine the predictions; the order (Adelie, then Gentoo) matches the row order of sex_na
sex.combined <- data.frame(sex = fct_c(sex_adelie, sex_gentoo))
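
As a quick sanity check, let's look at how the imputed sexes are distributed:

table(sex.combined$sex)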

Now that we have the predicted sex, let's merge the records back into the dataset

sex.combined <- data.clean %>% 
  filter(is.na(sex)) %>%
  select(-sex) %>% 
  add_column(sex.combined)

data.clean <- data.clean %>% 
  filter(!is.na(sex)) %>%
  add_row(sex.combined)

And verify there are no NAs left

colSums(is.na(data.clean))
##        species            sex           year bill_length_cm  bill_depth_cm 
##              0              0              0              0              0 
##   body_mass_kg 
##              0

Year

The year variable is irrelevant for our learning task because it is not a physical attribute of our target object (the penguin). Therefore, we'll drop year from our dataset.

data.clean <- data.clean %>% select(-year)

Summary

From our exploration, we concluded that:

  • island is a very strong predictor (potentially a perfect / quasi-perfect separator), but one that could mislead the model
  • year is irrelevant in our dataset

And we have:

  • transformed bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g into a scale that is more numerically balanced
  • dropped flipper_length_cm from the dataset
  • removed island and year from the dataset
  • imputed sex using kNN according to species

Now let's dive into modeling

Data splitting

Before diving into the models, let's split our data into train and test sets.

# Randomly sample 80% of the row indices for training
idx <- sample(nrow(data.clean), nrow(data.clean) * 0.8)

data.train <- data.clean[idx, ]
data.test <- data.clean[-idx, ]
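
Note that sample() draws a simple random split, and since no seed is set, the split will differ between runs. If reproducibility or class-balanced splits matter, caret's createDataPartition() samples within each class; a sketch (the seed value and the .strat names are arbitrary, for illustration):

# Stratified 80/20 split that preserves species proportions
set.seed(42)  # arbitrary seed, for reproducibility
idx.strat <- createDataPartition(data.clean$species, p = 0.8, list = FALSE)
data.train.strat <- data.clean[idx.strat, ]
data.test.strat <- data.clean[-idx.strat, ]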

Naive-Bayes

Let's try training a Naive-Bayes classifier on our data

model.naive <- naiveBayes(x = data.train %>% select(-species) , 
                          y = data.train$species, 
                          laplace = 1)

result.naive <- predict(model.naive, data.test %>% select(-species))

# Make confusion matrix
cm.naive <- confusionMatrix(result.naive, data.test$species)
cm.naive
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         1      0
##   Chinstrap      0        14      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9855          
##                  95% CI : (0.9219, 0.9996)
##     No Information Rate : 0.4638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9771          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9333        1.0000
## Specificity                 0.9730           1.0000        1.0000
## Pos Pred Value              0.9697           1.0000        1.0000
## Neg Pred Value              1.0000           0.9818        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2029        0.3188
## Detection Prevalence        0.4783           0.2029        0.3188
## Balanced Accuracy           0.9865           0.9667        1.0000

We have a very nice Naive-Bayes model with an accuracy of 98.55%.
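
A nice property of Naive-Bayes is that the fitted model is easy to inspect: for numeric predictors, e1071's naiveBayes stores the per-class conditional distribution in the tables element. For example:

# Class-conditional mean (column 1) and standard deviation (column 2) of body mass
model.naive$tables$body_mass_kg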

Decision Tree

A decision tree is a very interpretable model: it produces a tree of decision rules that shows how each data point is classified. The tree created by the model can even be read as a field guide on how to identify penguins.

model.dt <- ctree(species ~ . , data =  data.train)

plot(model.dt, type = "simple")

Model Evaluation

Let's evaluate the model

result.dt <- predict(model.dt, data.test)

cm.dt <- confusionMatrix(result.dt, data.test$species)
cm.dt
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         1      0
##   Chinstrap      0        14      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9855          
##                  95% CI : (0.9219, 0.9996)
##     No Information Rate : 0.4638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9771          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9333        1.0000
## Specificity                 0.9730           1.0000        1.0000
## Pos Pred Value              0.9697           1.0000        1.0000
## Neg Pred Value              1.0000           0.9818        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2029        0.3188
## Detection Prevalence        0.4783           0.2029        0.3188
## Balanced Accuracy           0.9865           0.9667        1.0000

The decision tree that we built has an accuracy of 98.55%, the same as the Naive Bayes model.

A decision tree can pick up very small details of the data and build rules that fit individual observations. However, such a tree will overfit the training data. We can improve the model by controlling the branching of the tree. There are 3 parameters to control:

  • alpha: the significance level a split test must pass before a node is allowed to split (default: 0.05)
  • minsplit: the minimum number of observations in a node before it is allowed to split (default: 20)
  • minbucket: the minimum number of observations that must remain in a terminal node (default: 7)

The process of trimming a decision tree back is called pruning, a fitting term.

Let's try to prune a model with some control terms.

model.dt.ctrl <- ctree(species ~ . , 
                  data =  data.train,
                  control = ctree_control(alpha=0.001,
                                            minsplit=0,
                                            minbucket=0))
plot(model.dt.ctrl, type = 'simple')

result.dt.ctrl <- predict(model.dt.ctrl, data.test)

cm.dt.ctrl <- confusionMatrix(result.dt.ctrl, data.test$species)
cm.dt.ctrl
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         1      0
##   Chinstrap      0        14      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9855          
##                  95% CI : (0.9219, 0.9996)
##     No Information Rate : 0.4638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9771          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9333        1.0000
## Specificity                 0.9730           1.0000        1.0000
## Pos Pred Value              0.9697           1.0000        1.0000
## Neg Pred Value              1.0000           0.9818        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2029        0.3188
## Detection Prevalence        0.4783           0.2029        0.3188
## Balanced Accuracy           0.9865           0.9667        1.0000

By setting minbucket and minsplit to 0, we are allowing the tree to learn every detail of the training data. Fortunately, this dataset is not very complicated, so the tree does not grow too big, and we still end up with an accuracy of 98.55%.
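
To see how much the relaxed controls changed the tree, partykit can report tree size directly: width() returns the number of terminal nodes and depth() the depth of the tree.

# Compare tree sizes under default vs. relaxed controls
width(model.dt)       # terminal nodes, default controls
width(model.dt.ctrl)  # terminal nodes, relaxed controls
depth(model.dt.ctrl)  # depth of the relaxed tree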

Random Forest

Random forest is an ensemble model: it trains many decision trees, each on a random subset of the data and features, and aggregates their votes to get a better prediction. The trade-off is that, unlike a single decision tree, a random forest model cannot be easily interpreted.

rf.ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)
model.rf <- train(species ~ ., data = data.train, method = "rf", trControl = rf.ctrl)
result.rf <- predict(model.rf, data.test)

cm.rf <- confusionMatrix(result.rf, data.test$species)
cm.rf
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        32         0      0
##   Chinstrap      0        15      0
##   Gentoo         0         0     22
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9479, 1)
##     No Information Rate : 0.4638     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           1.0000        1.0000
## Specificity                 1.0000           1.0000        1.0000
## Pos Pred Value              1.0000           1.0000        1.0000
## Neg Pred Value              1.0000           1.0000        1.0000
## Prevalence                  0.4638           0.2174        0.3188
## Detection Rate              0.4638           0.2174        0.3188
## Detection Prevalence        0.4638           0.2174        0.3188
## Balanced Accuracy           1.0000           1.0000        1.0000

The random forest model produces an accuracy of 100%. Random forest is indeed a strong algorithm. However, because our dataset is small, it might be more complexity than the dataset needs, and a single decision tree or a Naive Bayes model is enough.
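
Even though the forest itself is hard to read, we can still ask which features it relied on. caret exposes this through varImp() on the trained model:

# Variable importance as measured by the random forest
varImp(model.rf)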

Conclusion

We have carried out an investigation of the penguins around the Palmer LTER. Our models can differentiate the penguins based on their physical characteristics: sex, bill length, bill depth, and body mass. Aside from that, we have also provided an "identification manual" in the form of our decision tree model, which can actually be used as a guide in the field.