This project is meant to analyze three different species of penguin: Chinstrap, Gentoo, and Adéle. 8 variables of data have been collected on 344 different penguins. These penguins were living in the Palmer Archipelago, a group of islands off of Antarctica.The data was collected by Dr. Kristen Gorman while working with the Palmer Station Long Term Ecological Research Program.
I plan to use this data to demonstrate how unsupervised learning can uncover similarities that may have otherwise been overlooked. I also hope to build classification models that can be used to identify one of the three species of penguins using only a handfull of measurements.
Unsupervised learning should show us which penguin is most similar to another, and will explain why. I will use k-means clustering to separate the individual penguins into a predetermined number of groups.
Classification models will likely give us more predictive power than the unsupervised model, but its results may be slightly less clear than the unuspervised methods.
options(warn = -1)
library(palmerpenguins)
suppressMessages(library(gplots))
suppressMessages(library(janitor))
suppressMessages(library(tidyverse))
suppressMessages(library(caret))
library(rpart)
library(rpart.plot)
My data is found in the ‘palmerpenguins’ package of R. The ‘penguins’ data is immediately accessible once the ‘palmerpenguins’ library is loaded. While working on this project I will be using the ‘gplots’ library to build enhanced visualizations such as our heat map. ‘janitor’ is a library that I will utilize when building certain tables. The ‘tidyverse’ package gives us access to multiple different packages without having to install them all separately. ‘tidyverse’ will be used primarily for creating graphics, and manipulating data. The library ‘caret’ will be used to train our complex classification models. ‘rpart’ is a package that allows us to build classification trees. ‘rpart.plot’ is the package that will enable me to visualize the classification trees built by ‘rpart’. Classification trees should be a great way to find which characteristics are attributed to each species of penguin.
dim(penguins)
## [1] 344 8
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
table(is.na(penguins))
##
## FALSE TRUE
## 2733 19
We have 19 total NA values. The 4 variables missing two values leads me to believe there may be some overlap, but further analysis will allow us to hone in on these. If I remove the NA values from the data and only lose 11 values then that means 2 penguins never had their bill length, bill depth, flipper length, body mass, or sex observed. The other 9 missing values for sex would be fully complete with the exception of sex, but since it is such a small portion of our data I feel that we can justify the removal of these records. If I remove all NA’s and the data drops by 19 points then that means 19 penguins were fully observed with the exception of a single record. We could avoid this by imputing the median values for all numerics and splitting the missing sex records 6 to male and 5 to female. This would retain every record without disrupting our fully recorded observations very much.
# Remove all NA values
penguins.2 <- na.omit(penguins)
# Verify how many records we lost from the original 344
nrow(penguins.2)
## [1] 333
# Create dummy variables for all of our factor variables
penguins.2$male_sex <- ifelse(penguins.2$sex == 'male',1,0)
penguins.2$female_sex <- ifelse(penguins.2$sex == 'female',1,0)
penguins.2$Adelie_Species <- ifelse(penguins.2$species == 'Adelie',1,0)
penguins.2$Chinstrap_Species <- ifelse(penguins.2$species == 'Chinstrap',1,0)
penguins.2$Gentoo_Species <- ifelse(penguins.2$species == 'Gentoo',1,0)
penguins.2$Biscoe_Island <- ifelse(penguins.2$species == 'Biscoe',1,0)
penguins.2$Dream_Island <- ifelse(penguins.2$species == 'Dream',1,0)
penguins.2$Torgersen_Island <- ifelse(penguins.2$species == 'Torgersen',1,0)
# Use a correlation matrix to show relationships between each variable
penguins.numeric <- penguins.2[,-c(1,2,7)]
colfunc<-colorRampPalette(c("red","white","green"))
heatmap.2(cor(penguins.numeric),Rowv=FALSE,Colv=FALSE,dendrogram="none",col=colfunc(15),cellnote=round(cor(penguins.numeric),2),notecol="black",key=FALSE,trace='none',margins=c(7,5))
# Build a table to show the ratio of male to female for each species of penguin
penguins.2 %>%
tabyl(sex, species) %>%
adorn_totals(where = c("row", "col")) # add row, column totals
## sex Adelie Chinstrap Gentoo Total
## female 73 34 58 165
## male 73 34 61 168
## Total 146 68 119 333
# visualize the populations of each species on the three islands
ggplot(penguins.2, aes(x = island, fill = species)) +
geom_bar(alpha = 1) +
scale_fill_manual(values = c("grey","orange","black"),
guide = FALSE) +
theme_minimal() +
facet_wrap(~species, ncol = 1) +
coord_flip()
I have removed all NA values to find that they did overlap. I am left with data on 333 individual penguins. I then made a binary variables out of the penguin sexes, species and islands. Doing this allows me to use them as numerics, which helps a ton in predictive modeling and visualization. Turning these factor variables into numerics I should be able to utilize the information in a much more dynamic way.
I also built out a correlation matrix. This matrix displays all of the numeric variables with a gradient color scheme to show which variables are closely related, and which are the most different from the others.
The first thing I need to do before building any sort of predictive model will be partitioning the data. This means splitting the data into 2 different sets. One data set will be used to train my predictive models, and the other set will be used to test the models once they have been built. I set a seed in order to make these results replicable. Without a seed the results will be different every time the code is ran. I used 70% of the dataset to train my model, and the remaining 30% will make up my testing data.
set.seed(11883427)
train.penguins.rows <- sample (rownames(penguins.2),nrow(penguins.2)*.7)
train.penguins.data <- penguins.2[train.penguins.rows,]
valid.penguins.rows <- setdiff(rownames(penguins.2),train.penguins.rows)
valid.penguins.data <- penguins.2[valid.penguins.rows,]
# This is our first classification tree
penguins.rt <- rpart(train.penguins.data$`island` ~ ., data = train.penguins.data, cp = 0, minsplit = 12)
printcp(penguins.rt)
##
## Classification tree:
## rpart(formula = train.penguins.data$island ~ ., data = train.penguins.data,
## cp = 0, minsplit = 12)
##
## Variables actually used in tree construction:
## [1] bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
## [5] sex species
##
## Root node error: 123/233 = 0.5279
##
## n= 233
##
## CP nsplit rel error xerror xstd
## 1 0.4634146 0 1.00000 1.00000 0.061953
## 2 0.0203252 1 0.53659 0.53659 0.055917
## 3 0.0162602 9 0.33333 0.63415 0.058564
## 4 0.0081301 10 0.31707 0.61789 0.058180
## 5 0.0040650 12 0.30081 0.61789 0.058180
## 6 0.0000000 14 0.29268 0.61789 0.058180
penguins.rt$cptable
## CP nsplit rel error xerror xstd
## 1 0.463414634 0 1.0000000 1.0000000 0.06195350
## 2 0.020325203 1 0.5365854 0.5365854 0.05591742
## 3 0.016260163 9 0.3333333 0.6341463 0.05856391
## 4 0.008130081 10 0.3170732 0.6178862 0.05817998
## 5 0.004065041 12 0.3008130 0.6178862 0.05817998
## 6 0.000000000 14 0.2926829 0.6178862 0.05817998
# Plot the tree with prp function
prp(penguins.rt, digits = 4, type = 1, extra = 1, varlen = -10,
box.col = ifelse(penguins.rt$frame$var == "<leaf>", 'gray', 'white'))