Did you know that there are several types of penguins? Although they all look pretty similar, there are actually 17 species of penguins. Unlike polar bears, which live around the North Pole, all penguin species except one live in the southern hemisphere. Three species are of interest to our project: the Adelie, Chinstrap, and Gentoo penguins. They are the three species observed from the Palmer Long-Term Ecological Research (LTER) Station on Anvers Island, Antarctica.
In this project, we will be using the Palmer Penguins dataset provided by the Palmer LTER. The dataset contains physical observations of penguins. We'll use these physical characteristics to build classification models that can predict each penguin's species, using three different approaches: Naive Bayes, decision tree, and random forest.
*From left to right: Chinstrap, Adelie, and Gentoo penguins*
Before we start, let's load the required libraries.
library(palmerpenguins) # The dataset
library(tidyverse) # Data wrangling and visualization
library(GGally) # To make correlation matrix
library(caret) # Near zero variance
library(class) # kNN to fill in missing values
library(e1071) # Naive-Bayes
library(partykit) # Decision tree
library(randomForest) # Random forest
library(kableExtra) # Pretty printing tables

data <- penguins
glimpse(data)

## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex <fct> male, female, female, NA, female, male, female, mal…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
The dataset contains 344 rows and 8 columns: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
Before going deeper, let's check for missing values
colSums(is.na(data))

## species island bill_length_mm bill_depth_mm
## 0 0 2 2
## flipper_length_mm body_mass_g sex year
## 2 2 11 0
It seems that we have missing values for bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex. Let's take a look at these records.
data %>%
filter_all(any_vars(is.na(.))) %>%
kable() %>%
kable_styling()

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | NA | 2007 |
| Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | NA | 2007 |
| Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | NA | 2007 |
| Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | NA | 2007 |
| Adelie | Dream | 37.5 | 18.9 | 179 | 2975 | NA | 2007 |
| Gentoo | Biscoe | 44.5 | 14.3 | 216 | 4100 | NA | 2007 |
| Gentoo | Biscoe | 46.2 | 14.4 | 214 | 4650 | NA | 2008 |
| Gentoo | Biscoe | 47.3 | 13.8 | 216 | 4725 | NA | 2009 |
| Gentoo | Biscoe | 44.5 | 15.7 | 217 | 4875 | NA | 2009 |
| Gentoo | Biscoe | NA | NA | NA | NA | NA | 2009 |
It seems there are 2 records with no measurements at all. We can drop these records, as they contain no usable data.
# Removing records with body_mass_g NA
data <- data %>%
  filter(!is.na(body_mass_g))

The other 9 records are missing only the sex value. We can try to impute them. Since they pertain to sex, it would be tempting to simply assume a 50-50 split. However, there might be a correlation between body mass or other physical characteristics and sex. We'll keep these records and impute them later, in the data exploration & wrangling section.
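As a quick sanity check (not part of the original pipeline), we can look at how balanced sex actually is among the complete records:

# Sex counts per species among complete records
data %>%
  filter(!is.na(sex)) %>%
  count(species, sex)

If the counts per species are close to 50-50, a naive split would not be far off, but imputation from physical measurements should still do better.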
Now let's check for duplicate rows
data[data %>% duplicated(), ]

## # A tibble: 0 x 8
## # … with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## # bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## # year <int>
There are no duplicated rows.
The data types of our variables, as shown in the glimpse output above, are: species, island, and sex are factors; bill_length_mm and bill_depth_mm are doubles; and flipper_length_mm, body_mass_g, and year are integers.
Although year is stored as a number, in this project we will not treat it as one. In the context of this data, year is a categorical variable that records when the observation was taken. Therefore, let's convert it to a factor.
data$year <- factor(data$year)

As a remark, it is notable that body_mass_g and flipper_length_mm are reported as integer values. Body mass and flipper length are continuous variables, so the true values would contain decimals. By reporting them as (probably rounded) integers, some resolving power is lost. Sadly, we cannot do anything about this.
Let's check if there are variables with near zero variance
data %>% nearZeroVar()

## integer(0)
There are no variables with near-zero variance.
Let's start by getting to know the distribution of our numeric data
boxplot(data %>% select_if(is.numeric))

summary(data %>% select_if(is.numeric))

## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Min. :32.10 Min. :13.10 Min. :172.0 Min. :2700
## 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550
## Median :44.45 Median :17.30 Median :197.0 Median :4050
## Mean :43.92 Mean :17.15 Mean :200.9 Mean :4202
## 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750
## Max. :59.60 Max. :21.50 Max. :231.0 Max. :6300
Compared to the other numeric variables, body_mass_g is reported in grams, which puts it on a much larger scale than the rest. We can remedy this in several ways. One of them is to change the units: from mm to cm and from grams to kg.
data.clean <- data %>%
mutate(
bill_length_cm = bill_length_mm / 10,
bill_depth_cm = bill_depth_mm / 10,
flipper_length_cm = flipper_length_mm / 10,
body_mass_kg = body_mass_g / 1000,
) %>%
select(-c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))
data.clean %>%
head() %>%
kable() %>%
kable_styling()

| species | island | sex | year | bill_length_cm | bill_depth_cm | flipper_length_cm | body_mass_kg |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | male | 2007 | 3.91 | 1.87 | 18.1 | 3.750 |
| Adelie | Torgersen | female | 2007 | 3.95 | 1.74 | 18.6 | 3.800 |
| Adelie | Torgersen | female | 2007 | 4.03 | 1.80 | 19.5 | 3.250 |
| Adelie | Torgersen | female | 2007 | 3.67 | 1.93 | 19.3 | 3.450 |
| Adelie | Torgersen | male | 2007 | 3.93 | 2.06 | 19.0 | 3.650 |
| Adelie | Torgersen | female | 2007 | 3.89 | 1.78 | 18.1 | 3.625 |
Let's see how the range of the values now compare to each other
boxplot(data.clean %>% select_if(is.numeric))

After adjusting the variables (feature engineering), the values of all variables now lie roughly between 0 and 20, which is much more balanced than before.
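As a numeric cross-check (optional), we can re-run summary() on the rescaled columns; all quartiles should now be within a similar order of magnitude:

# Summary of the rescaled numeric variables
summary(data.clean %>% select_if(is.numeric))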
Let's see the correlation between our variables.
ggcorr(data.clean, label = T)

There are very strong correlations between the physical characteristics. This is expected: the body dimensions of an organism tend to scale together, which is exactly what morphometrics studies. It also suggests that we might not need all the variables. Let's try excluding flipper_length_cm from the features.
data.clean %>%
  select(-flipper_length_cm) %>%
  ggcorr(label = T)

This gives a much better-looking correlation matrix. Before we justify removing flipper_length_cm, let's check whether it provides any predictive power of its own. To do this, we'll plot flipper_length_cm against body_mass_kg, colored by species.
data.clean %>%
  ggplot(aes(x = flipper_length_cm, y = body_mass_kg, color = species)) +
  geom_point() +
  geom_smooth(se = F) +
  labs(title = 'Flipper length (cm) vs. Body mass (kg)', x = 'Flipper length (cm)', y = 'Body mass (kg)')

We can see that, within each species, the relation between flipper_length_cm and body_mass_kg is satisfactorily described by a line, so flipper_length_cm carries little information that body_mass_kg does not already provide. Therefore, we can remove flipper_length_cm from our dataset.
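To put a number on this (a small sketch, not in the original analysis), we can compute the within-species correlation before dropping the column:

# Within-species correlation between flipper length and body mass
data.clean %>%
  group_by(species) %>%
  summarize(r = cor(flipper_length_cm, body_mass_kg))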
data.clean <- data.clean %>%
  select(-flipper_length_cm)

island is an interesting variable. Let's see if the island a penguin lives on has any relationship with its characteristics.
data.clean %>%
  select(-c(sex, year)) %>%
  pivot_longer(-c(species, island), names_to = 'params') %>%
  ggplot(aes(x = island, y = value, color = species)) +
  geom_jitter() +
  facet_wrap(~params)

We see that some species occur only on certain islands: Gentoo only on Biscoe and Chinstrap only on Dream, while Adelie are found on all three islands (Biscoe, Dream, and Torgersen). From this, we can expect island to be a very strong predictor. However, this predictor might be misleading: island is not an intrinsic property of the object we are trying to predict, the penguin. If our model learns to rely on this variable, it will fail when a population of Chinstrap or Adelie migrates to another island. Therefore, we'll drop this variable from our dataset.
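A simple cross-tabulation (a quick sketch) makes this quasi-perfect separation explicit; the zero cells are what would let a model "cheat":

# Species counts per island
table(data.clean$species, data.clean$island)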
# Remove island
data.clean <- data.clean %>% select(-island)

Let's tackle the missing values in the sex column.
data.clean %>%
  filter_all(any_vars(is.na(.))) %>%
  kable() %>%
  kable_styling()

| species | sex | year | bill_length_cm | bill_depth_cm | body_mass_kg |
|---|---|---|---|---|---|
| Adelie | NA | 2007 | 3.41 | 1.81 | 3.475 |
| Adelie | NA | 2007 | 4.20 | 2.02 | 4.250 |
| Adelie | NA | 2007 | 3.78 | 1.71 | 3.300 |
| Adelie | NA | 2007 | 3.78 | 1.73 | 3.700 |
| Adelie | NA | 2007 | 3.75 | 1.89 | 2.975 |
| Gentoo | NA | 2007 | 4.45 | 1.43 | 4.100 |
| Gentoo | NA | 2008 | 4.62 | 1.44 | 4.650 |
| Gentoo | NA | 2009 | 4.73 | 1.38 | 4.725 |
| Gentoo | NA | 2009 | 4.45 | 1.57 | 4.875 |
Two of the three species (Adelie and Gentoo) are represented among the missing values. In the context of imputing sex, our working assumption is that there is a correlation between physical characteristics and sex. Before using body mass, though, we should control for year, as weight may fluctuate depending on the conditions in a particular year. To check this, we'll see how weight varies by year for each species.
data.clean %>%
  group_by(year, species) %>%
  summarize(mean_weight = mean(body_mass_kg)) %>%
  arrange(year) %>%
  ggplot(aes(x = as.numeric(year))) +
  geom_line(aes(y = mean_weight, color = species)) +
  geom_jitter(data = data.clean, aes(x = as.numeric(year), y = body_mass_kg, color = species)) +
  labs(title = 'Mean weight of penguins', x = 'Year', y = 'Mean weight (kg)')

The plot shows that the average weight of each species is stable across years, so there is no apparent relationship between year and body mass. The weights of the different species, however, are clearly different.
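If we want a formal check behind this visual impression, a rough sketch (assuming the usual ANOVA conditions hold) would be:

# Does year explain body mass once species is accounted for?
summary(aov(body_mass_kg ~ species + year, data = data.clean))

A large p-value on the year term would support treating weight as stable across years.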
We'll impute the sex data using k-nearest neighbor, taking species into account. We separate the observations without sex, and later combine them back into the dataset.
# Get observations with missing sex
sex_na <- data.clean %>%
filter(is.na(sex)) %>%
select(-sex)
# Get population with complete data
pop <- data.clean %>%
filter(!is.na(sex))

First, we have to decide on a value of k.
# Searching for a k value: a common heuristic is the square root of n
k <- round(sqrt(nrow(pop)))

The resulting k value is 18. Because it's even, we'll add 1 to make it odd, which avoids ties in the majority vote.

k <- k + 1
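One caveat worth flagging: class::knn uses Euclidean distance on the raw feature values, so variables on larger scales dominate the vote. Our earlier unit conversion keeps the scales roughly comparable, but a more careful sketch (not applied below, to keep the original pipeline) would standardize the predictors first:

# Optional: z-score the numeric predictors so each contributes equally
# to the Euclidean distance used by class::knn
pop_scaled <- pop %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))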
Now let's use kNN to predict the sex.

# Splitting sex from data, selecting relevant variables
sex <- pop %>% select(species, sex)
pop <- pop %>% select(species, bill_length_cm, bill_depth_cm, body_mass_kg)
sex_na <- sex_na %>% select(species, bill_length_cm, bill_depth_cm, body_mass_kg)

# Using kNN for Gentoo
sex_gentoo <- knn(train = pop %>% filter(species == 'Gentoo') %>% select(-species) %>% as.matrix(),
                  test = sex_na %>% filter(species == 'Gentoo') %>% select(-species) %>% as.matrix(),
                  cl = sex %>% filter(species == 'Gentoo') %>% pull(sex),
                  k = k) # without this, class::knn silently defaults to k = 1

# Using kNN for Adelie
sex_adelie <- knn(train = pop %>% filter(species == 'Adelie') %>% select(-species) %>% as.matrix(),
                  test = sex_na %>% filter(species == 'Adelie') %>% select(-species) %>% as.matrix(),
                  cl = sex %>% filter(species == 'Adelie') %>% pull(sex),
                  k = k)

# Combine the predictions (Adelie rows come first in the NA subset)
sex.combined <- data.frame(sex = fct_c(sex_adelie, sex_gentoo))

Now that we have the predicted sex, let's merge it back into the dataset.
sex.combined <- data.clean %>%
  filter(is.na(sex)) %>%
  select(-sex) %>%
  add_column(sex.combined)

data.clean <- data.clean %>%
  filter(!is.na(sex)) %>%
  add_row(sex.combined)

And verify that no NAs remain:
colSums(is.na(data.clean))

## species sex year bill_length_cm bill_depth_cm
## 0 0 0 0 0
## body_mass_kg
## 0
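As a plausibility check on the imputation (a quick sketch), we can confirm the sex ratio per species still looks sensible, i.e. close to the near-50-50 split we saw earlier:

# Sex counts per species after imputation
data.clean %>% count(species, sex)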
The year variable is irrelevant to our learning task because it is not a physical attribute of our target object, the penguin. Therefore, we'll drop year from our dataset.
data.clean <- data.clean %>% select(-year)

From our exploration, we concluded that:

- island is a very strong predictor (potentially a perfect or quasi-perfect separator)
- year is irrelevant in our dataset

And we have:

- Converted bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g to a scale that is more numerically balanced
- Removed flipper_length_cm from the dataset
- Removed island and year from the dataset
- Imputed sex using kNN within each species

Now let's dive into modeling.
Before diving into the models, let's split our data into training and test sets.
idx <- sample(nrow(data.clean), nrow(data.clean) * 0.8)
data.train <- data.clean[idx, ]
data.test <- data.clean[-idx, ]
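Note that this simple random split is neither seeded nor stratified, so the exact numbers below will vary from run to run. An alternative sketch using caret::createDataPartition (caret is already loaded; the variable names here are illustrative) would make the split reproducible and preserve species proportions:

# Reproducible, stratified 80/20 split (alternative to the split above)
set.seed(123) # arbitrary seed, for reproducibility only
idx_strat <- createDataPartition(data.clean$species, p = 0.8, list = FALSE)
train_strat <- data.clean[idx_strat, ]
test_strat <- data.clean[-idx_strat, ]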
Let's try using the Naive Bayes classifier on our training data.

model.naive <- naiveBayes(x = data.train %>% select(-species),
y = data.train$species,
laplace = 1)
result.naive <- predict(model.naive, data.test %>% select(-species))
# Make confusion matrix
cm.naive <- confusionMatrix(result.naive, data.test$species)
cm.naive

## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 32 1 0
## Chinstrap 0 14 0
## Gentoo 0 0 22
##
## Overall Statistics
##
## Accuracy : 0.9855
## 95% CI : (0.9219, 0.9996)
## No Information Rate : 0.4638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9771
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 0.9333 1.0000
## Specificity 0.9730 1.0000 1.0000
## Pos Pred Value 0.9697 1.0000 1.0000
## Neg Pred Value 1.0000 0.9818 1.0000
## Prevalence 0.4638 0.2174 0.3188
## Detection Rate 0.4638 0.2029 0.3188
## Detection Prevalence 0.4783 0.2029 0.3188
## Balanced Accuracy 0.9865 0.9667 1.0000
We have a very good Naive Bayes model, with an accuracy of 98.55%.
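To see what the model actually learned, we can peek at the class-conditional distributions stored inside the e1071 model object; for a numeric predictor, each row holds the per-class mean and standard deviation (shown here for body mass as an example):

# Class-conditional mean and sd of body_mass_kg learned by Naive Bayes
model.naive$tables$body_mass_kg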
A decision tree is a very interpretable model: it produces a tree of decision rules showing how each data point is classified. The tree built by the model can even serve as a field guide for identifying penguins.
model.dt <- ctree(species ~ . , data = data.train)
plot(model.dt, type = "simple")

Let's evaluate the model.
result.dt <- predict(model.dt, data.test)
cm.dt <- confusionMatrix(result.dt, data.test$species)
cm.dt

## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 32 1 0
## Chinstrap 0 14 0
## Gentoo 0 0 22
##
## Overall Statistics
##
## Accuracy : 0.9855
## 95% CI : (0.9219, 0.9996)
## No Information Rate : 0.4638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9771
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 0.9333 1.0000
## Specificity 0.9730 1.0000 1.0000
## Pos Pred Value 0.9697 1.0000 1.0000
## Neg Pred Value 1.0000 0.9818 1.0000
## Prevalence 0.4638 0.2174 0.3188
## Detection Rate 0.4638 0.2029 0.3188
## Detection Prevalence 0.4783 0.2029 0.3188
## Balanced Accuracy 0.9865 0.9667 1.0000
The decision tree that we built has an accuracy of 98.55%, the same as the Naive Bayes model.
A decision tree can pick up on very small details of the data and build rules that fit individual observations; such a tree will overfit. We can improve the model by controlling the branching of the tree. There are 3 parameters to control:

- alpha : the significance level a node's split must reach before the node is allowed to split (default: 0.05)
- minsplit : the minimum number of observations in a node before it is considered for splitting (default: 20)
- minbucket : the minimum number of observations that must remain in a terminal node (default: 7)

The process of trimming back a decision tree is called pruning, a fitting term.
Let's try to prune a model with some control terms.
model.dt.ctrl <- ctree(species ~ . ,
                       data = data.train,
                       control = ctree_control(alpha = 0.001,
                                               minsplit = 0,
                                               minbucket = 0))
plot(model.dt.ctrl, type = 'simple')

result.dt.ctrl <- predict(model.dt.ctrl, data.test)
cm.dt.ctrl <- confusionMatrix(result.dt.ctrl, data.test$species)
cm.dt.ctrl

## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 32 1 0
## Chinstrap 0 14 0
## Gentoo 0 0 22
##
## Overall Statistics
##
## Accuracy : 0.9855
## 95% CI : (0.9219, 0.9996)
## No Information Rate : 0.4638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9771
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 0.9333 1.0000
## Specificity 0.9730 1.0000 1.0000
## Pos Pred Value 0.9697 1.0000 1.0000
## Neg Pred Value 1.0000 0.9818 1.0000
## Prevalence 0.4638 0.2174 0.3188
## Detection Rate 0.4638 0.2029 0.3188
## Detection Prevalence 0.4783 0.2029 0.3188
## Balanced Accuracy 0.9865 0.9667 1.0000
By setting minbucket and minsplit to 0, we allow the tree to learn every detail of the training data. Fortunately, this dataset is not very complicated, so the tree does not grow too big, and we still end up with an accuracy of 98.55%. Unfortunately, unlike a decision tree, a random forest model cannot be interpreted directly.
Random forest is an ensemble model: it trains many decision trees on random subsets of the data and features, then aggregates their votes to get better predictions.
rf.ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model.rf <- train(species ~ ., data = data.train, method = "rf", trControl = rf.ctrl)

result.rf <- predict(model.rf, data.test)
cm.rf <- confusionMatrix(result.rf, data.test$species)
cm.rf

## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 32 0 0
## Chinstrap 0 15 0
## Gentoo 0 0 22
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9479, 1)
## No Information Rate : 0.4638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.4638 0.2174 0.3188
## Detection Rate 0.4638 0.2174 0.3188
## Detection Prevalence 0.4638 0.2174 0.3188
## Balanced Accuracy 1.0000 1.0000 1.0000
The random forest model achieves an accuracy of 100%. Random forest is indeed a strong algorithm. However, because our dataset is small and simple, it may be more complexity than the problem needs; a single decision tree or a Naive Bayes model is enough.
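Although the forest itself is hard to read, we can still get a partial view of what drives its predictions through variable importance (a small sketch using caret's varImp on the trained model):

# Relative importance of each predictor in the random forest
varImp(model.rf)
plot(varImp(model.rf))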
We have carried out an investigation of the penguins around the Palmer LTER. Our models can differentiate the penguins based on their physical characteristics: sex, bill length, bill depth, and body mass. Aside from that, we have also produced an "identification manual" from our decision tree model that could actually be used as a guide in the field.