Data Wrangling - MidTerm

About the data

This project is meant to analyze three different species of penguin: Chinstrap, Gentoo, and Adéle. 8 variables of data have been collected on 344 different penguins. These penguins were living in the Palmer Archipelago, a group of islands off of Antarctica.The data was collected by Dr. Kristen Gorman while working with the Palmer Station Long Term Ecological Research Program.

About my approach

I plan to use this data to demonstrate how unsupervised learning can uncover similarities that may have otherwise been overlooked. I also hope to build classification models that can be used to identify one of the three species of penguins using only a handfull of measurements.

Unsupervised learning should show us which penguin is most similar to another, and will explain why. I will use k-means clustering to separate the individual penguins into a predetermined number of groups.

Classification models will likely give us more predictive power than the unsupervised model, but its results may be slightly less clear than the unuspervised methods.

Suppress warnings and messages then load all required packages

options(warn = -1)
library(palmerpenguins)
suppressMessages(library(gplots))
suppressMessages(library(janitor))
suppressMessages(library(tidyverse))
suppressMessages(library(caret))
library(rpart)
library(rpart.plot)

My data is found in the ‘palmerpenguins’ package of R. The ‘penguins’ data is immediately accessible once the ‘palmerpenguins’ library is loaded. While working on this project I will be using the ‘gplots’ library to build enhanced visualizations such as our heat map. ‘janitor’ is a library that I will utilize when building certain tables. The ‘tidyverse’ package gives us access to multiple different packages without having to install them all separately. ‘tidyverse’ will be used primarily for creating graphics, and manipulating data. The library ‘caret’ will be used to train our complex classification models. ‘rpart’ is a package that allows us to build classification trees. ‘rpart.plot’ is the package that will enable me to visualize the classification trees built by ‘rpart’. Classification trees should be a great way to find which characteristics are attributed to each species of penguin.

Inspect the initial data

dim(penguins)

## [1] 344   8

summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

table(is.na(penguins))

## 
## FALSE  TRUE 
##  2733    19

We have 19 total NA values. The 4 variables missing two values leads me to believe there may be some overlap, but further analysis will allow us to hone in on these. If I remove the NA values from the data and only lose 11 values then that means 2 penguins never had their bill length, bill depth, flipper length, body mass, or sex observed. The other 9 missing values for sex would be fully complete with the exception of sex, but since it is such a small portion of our data I feel that we can justify the removal of these records. If I remove all NA’s and the data drops by 19 points then that means 19 penguins were fully observed with the exception of a single record. We could avoid this by imputing the median values for all numerics and splitting the missing sex records 6 to male and 5 to female. This would retain every record without disrupting our fully recorded observations very much.

Pre-processing for analysis and post-processing inspection

# Remove all NA values
penguins.2 <- na.omit(penguins)

# Verify how many records we lost from the original 344
nrow(penguins.2)

## [1] 333

# Create dummy variables for all of our factor variables
penguins.2$male_sex <- ifelse(penguins.2$sex == 'male',1,0)
penguins.2$female_sex <- ifelse(penguins.2$sex == 'female',1,0)
penguins.2$Adelie_Species <- ifelse(penguins.2$species == 'Adelie',1,0)
penguins.2$Chinstrap_Species <- ifelse(penguins.2$species == 'Chinstrap',1,0)
penguins.2$Gentoo_Species <- ifelse(penguins.2$species == 'Gentoo',1,0)
penguins.2$Biscoe_Island <- ifelse(penguins.2$species == 'Biscoe',1,0)
penguins.2$Dream_Island <- ifelse(penguins.2$species == 'Dream',1,0)
penguins.2$Torgersen_Island <- ifelse(penguins.2$species == 'Torgersen',1,0)

# Use a correlation matrix to show relationships between each variable
penguins.numeric <- penguins.2[,-c(1,2,7)]
colfunc<-colorRampPalette(c("red","white","green"))
heatmap.2(cor(penguins.numeric),Rowv=FALSE,Colv=FALSE,dendrogram="none",col=colfunc(15),cellnote=round(cor(penguins.numeric),2),notecol="black",key=FALSE,trace='none',margins=c(7,5))

# Build a table to show the ratio of male to female for each species of penguin
penguins.2 %>%
  tabyl(sex, species) %>%
  adorn_totals(where = c("row", "col")) # add row, column totals

##     sex Adelie Chinstrap Gentoo Total
##  female     73        34     58   165
##    male     73        34     61   168
##   Total    146        68    119   333

# visualize the populations of each species on the three islands
ggplot(penguins.2, aes(x = island, fill = species)) +
  geom_bar(alpha = 1) +
  scale_fill_manual(values = c("grey","orange","black"),
                    guide = FALSE) +
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()

I have removed all NA values to find that they did overlap. I am left with data on 333 individual penguins. I then made a binary variables out of the penguin sexes, species and islands. Doing this allows me to use them as numerics, which helps a ton in predictive modeling and visualization. Turning these factor variables into numerics I should be able to utilize the information in a much more dynamic way.

I also built out a correlation matrix. This matrix displays all of the numeric variables with a gradient color scheme to show which variables are closely related, and which are the most different from the others.

Begin work on unsupervised learning

The first thing I need to do before building any sort of predictive model will be partitioning the data. This means splitting the data into 2 different sets. One data set will be used to train my predictive models, and the other set will be used to test the models once they have been built. I set a seed in order to make these results replicable. Without a seed the results will be different every time the code is ran. I used 70% of the dataset to train my model, and the remaining 30% will make up my testing data.

set.seed(11883427)
train.penguins.rows <- sample (rownames(penguins.2),nrow(penguins.2)*.7)
train.penguins.data <- penguins.2[train.penguins.rows,] 
valid.penguins.rows <- setdiff(rownames(penguins.2),train.penguins.rows)
valid.penguins.data <- penguins.2[valid.penguins.rows,]

First Full Models

# This is our first classification tree
penguins.rt <- rpart(train.penguins.data$`island` ~ ., data = train.penguins.data, cp = 0, minsplit = 12)
printcp(penguins.rt)

## 
## Classification tree:
## rpart(formula = train.penguins.data$island ~ ., data = train.penguins.data, 
##     cp = 0, minsplit = 12)
## 
## Variables actually used in tree construction:
## [1] bill_depth_mm     bill_length_mm    body_mass_g       flipper_length_mm
## [5] sex               species          
## 
## Root node error: 123/233 = 0.5279
## 
## n= 233 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.4634146      0   1.00000 1.00000 0.061953
## 2 0.0203252      1   0.53659 0.53659 0.055917
## 3 0.0162602      9   0.33333 0.63415 0.058564
## 4 0.0081301     10   0.31707 0.61789 0.058180
## 5 0.0040650     12   0.30081 0.61789 0.058180
## 6 0.0000000     14   0.29268 0.61789 0.058180

penguins.rt$cptable

##            CP nsplit rel error    xerror       xstd
## 1 0.463414634      0 1.0000000 1.0000000 0.06195350
## 2 0.020325203      1 0.5365854 0.5365854 0.05591742
## 3 0.016260163      9 0.3333333 0.6341463 0.05856391
## 4 0.008130081     10 0.3170732 0.6178862 0.05817998
## 5 0.004065041     12 0.3008130 0.6178862 0.05817998
## 6 0.000000000     14 0.2926829 0.6178862 0.05817998

# Plot the tree with prp function
prp(penguins.rt, digits = 4, type = 1, extra = 1, varlen = -10,
    box.col = ifelse(penguins.rt$frame$var == "<leaf>", 'gray', 'white'))