This project is meant to analyze three different species of penguin: Chinstrap, Gentoo, and Adéle. 8 variables of data have been collected on 344 different penguins. These penguins were living in the Palmer Archipelago, a group of islands off of Antarctica.The data was collected by Dr. Kristen Gorman while working with the Palmer Station Long Term Ecological Research Program.
I plan to use this data to demonstrate how unsupervised learning can uncover similarities that may have otherwise been overlooked. I also hope to build classification models that can be used to identify the sex of a penguins using only a handfull of measurements.I plan to use a regression tree, as well as a linear regression model. Since the dataset is relatively small I will use bootstrap re-sampling to generate more observations. Building a larger dataset should provide a stronger prediction, and testing our dataset will generate more reliable outcomes.
Unsupervised learning should show us the major differences between male and female penguins. I will use k-means clustering to separate the individual penguins into a predetermined number of groups.
Classification models will likely give us more predictive power than the unsupervised model, but its results may be slightly less clear than the unuspervised methods.
options(warn = -1)
library(palmerpenguins)
suppressMessages(library(gplots))
suppressMessages(library(janitor))
suppressMessages(library(tidyverse))
suppressMessages(library(caret))
library(rpart)
library(rpart.plot)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
My data is found in the ‘palmerpenguins’ package of R. The ‘penguins’ data is immediately accessible once the ‘palmerpenguins’ library is loaded. While working on this project I will be using the ‘gplots’ library to build enhanced visualizations such as our heat map. ‘janitor’ is a library that I will utilize when building certain tables. The ‘tidyverse’ package gives us access to multiple different packages without having to install them all separately. ‘tidyverse’ will be used primarily for creating graphics, and manipulating data. The library ‘caret’ will be used to train our complex classification models. ‘rpart’ is a package that allows us to build classification trees. ‘rpart.plot’ is the package that will enable me to visualize the classification trees built by ‘rpart’. Classification trees should be a great way to find which characteristics are attributed to each species of penguin.
dim(penguins)
## [1] 344 8
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
table(is.na(penguins))
##
## FALSE TRUE
## 2733 19
We have 19 total NA values. The 4 variables missing two values leads me to believe there may be some overlap, but further analysis will allow us to hone in on these. If I remove the NA values from the data and only lose 11 values then that means 2 penguins never had their bill length, bill depth, flipper length, body mass, or sex observed. The other 9 missing values for sex would be fully complete with the exception of sex, but since it is such a small portion of our data I feel that we can justify the removal of these records. If I remove all NA’s and the data drops by 19 points then that means 19 penguins were fully observed with the exception of a single record. We could avoid this by imputing the median values for all numerics and splitting the missing sex records 6 to male and 5 to female. This would retain every record without disrupting our fully recorded observations very much.
# Remove all NA values
penguins.2 <- na.omit(penguins)
# Verify how many records we lost from the original 344
nrow(penguins.2)
## [1] 333
# Create dummy variables for all of our factor variables
penguins.2$sex_dummy <- ifelse(penguins.2$sex == 'male',1,0)
penguins.2$Adelie_Species <- ifelse(penguins.2$species == 'Adelie',1,0)
penguins.2$Chinstrap_Species <- ifelse(penguins.2$species == 'Chinstrap',1,0)
penguins.2$Biscoe_Island <- ifelse(penguins.2$island == 'Biscoe',1,0)
penguins.2$Dream_Island <- ifelse(penguins.2$island == 'Dream',1,0)
# Use a correlation matrix to show relationships between each variable
penguins.numeric <- penguins.2[,-c(1,2,7)]
colfunc<-colorRampPalette(c("red","white","green"))
heatmap.2(cor(penguins.numeric),Rowv=FALSE,Colv=FALSE,dendrogram="none",col=colfunc(15),cellnote=round(cor(penguins.numeric),2),notecol="black",key=FALSE,trace='none',margins=c(9,7))
# Build a table to show the ratio of male to female for each species of penguin
penguins.2 %>%
tabyl(sex, species) %>%
adorn_totals(where = c("row", "col")) # add row, column totals
## sex Adelie Chinstrap Gentoo Total
## female 73 34 58 165
## male 73 34 61 168
## Total 146 68 119 333
# visualize the populations of each species on the three islands
ggplot(penguins.2, aes(x = island, fill = species)) +
geom_bar(alpha = 1) +
scale_fill_manual(values = c("grey","orange","black"),
guide = FALSE) +
theme_minimal() +
facet_wrap(~species, ncol = 1) +
coord_flip()
I have removed all NA values to find that they did overlap. I am left with data on 333 individual penguins. I then made a binary variables out of the penguin sexes, species and islands. Doing this allows me to use them as numerics, which helps a ton in predictive modeling and visualization. Turning these factor variables into numerics I should be able to utilize the information in a much more dynamic way.
I also built out a correlation matrix. This matrix displays all of the numeric variables with a gradient color scheme to show which variables are closely related, and which are the most different from the others. It looks like our response variable ‘sex_dummy’ is most closely related to the dimensions of the penguins. It makes sense that we wouldn’t get much information from species with regards to sex, since all three species are split evenly between males and females. The islands wouldn’t tell us much about the sex of our penguins because there aren’t more males on one island than another, but this wasn’t known previously.
We can see from the bar plot that Adele penguins are the only penguin with a presence on all 3 islands. Chinstrap penguins are observed living only on Dream Island, and all of the Gentoo penguins live on Biscoe Island.
The first thing I will need to do is partition my data. This means splitting the data into 2 different sets. One data set will be used to train my predictive models, and the other set will be used to test the models once they have been built. I set a seed in order to make these results replicable. Without a seed the results will be different every time the code is ran. I used 70% of the dataset to train my model, and the remaining 30% will make up my testing data.
set.seed(11883427)
train.penguins.rows <- sample (rownames(penguins.numeric),nrow(penguins.numeric)*.7)
train.penguins.data <- penguins.numeric[train.penguins.rows,]
valid.penguins.rows <- setdiff(rownames(penguins.numeric),train.penguins.rows)
valid.penguins.data <- penguins.numeric[valid.penguins.rows,]
# Classification tree
penguins.rt <- rpart(train.penguins.data$`sex_dummy` ~ ., data = train.penguins.data, cp = 0, minsplit = 12)
printcp(penguins.rt)
##
## Regression tree:
## rpart(formula = train.penguins.data$sex_dummy ~ ., data = train.penguins.data,
## cp = 0, minsplit = 12)
##
## Variables actually used in tree construction:
## [1] bill_depth_mm bill_length_mm Biscoe_Island body_mass_g
## [5] Chinstrap_Species year
##
## Root node error: 58.223/233 = 0.24988
##
## n= 233
##
## CP nsplit rel error xerror xstd
## 1 0.2635516 0 1.00000 1.01213 0.0042999
## 2 0.0746527 2 0.47290 0.48532 0.0668237
## 3 0.0742715 3 0.39824 0.51007 0.0699388
## 4 0.0504828 4 0.32397 0.40272 0.0626034
## 5 0.0331479 5 0.27349 0.36353 0.0635328
## 6 0.0232163 6 0.24034 0.40232 0.0700662
## 7 0.0178455 7 0.21713 0.38095 0.0693697
## 8 0.0118906 9 0.18143 0.36347 0.0681762
## 9 0.0048403 10 0.16954 0.39616 0.0725157
## 10 0.0034819 12 0.15986 0.39592 0.0713885
## 11 0.0020260 13 0.15638 0.40036 0.0714221
## 12 0.0019988 15 0.15233 0.39632 0.0697660
## 13 0.0000000 17 0.14833 0.39450 0.0693405
penguins.rt$cptable
## CP nsplit rel error xerror xstd
## 1 0.263551573 0 1.0000000 1.0121268 0.00429988
## 2 0.074652702 2 0.4728969 0.4853239 0.06682374
## 3 0.074271529 3 0.3982442 0.5100668 0.06993878
## 4 0.050482840 4 0.3239726 0.4027218 0.06260345
## 5 0.033147875 5 0.2734898 0.3635251 0.06353277
## 6 0.023216256 6 0.2403419 0.4023211 0.07006619
## 7 0.017845454 7 0.2171257 0.3809496 0.06936975
## 8 0.011890586 9 0.1814347 0.3634719 0.06817619
## 9 0.004840309 10 0.1695442 0.3961572 0.07251572
## 10 0.003481865 12 0.1598635 0.3959236 0.07138846
## 11 0.002025959 13 0.1563817 0.4003615 0.07142208
## 12 0.001998849 15 0.1523298 0.3963213 0.06976597
## 13 0.000000000 17 0.1483321 0.3944998 0.06934049
# Plot the tree with prp function
prp(penguins.rt, digits = 4, type = 1, extra = 1, varlen = -10,
box.col = ifelse(penguins.rt$frame$var == "<leaf>", 'gray', 'white'))
# Linear Regression Model
penguins_glmodel_full <- glm(train.penguins.data$'sex_dummy'~., data = train.penguins.data)
summary(penguins_glmodel_full)
##
## Call:
## glm(formula = train.penguins.data$sex_dummy ~ ., data = train.penguins.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.318e-16 -2.036e-16 0.000e+00 2.220e-16 6.661e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.803e-14 4.630e-14 -1.901e+00 0.0586 .
## bill_length_mm 6.203e-18 8.448e-18 7.340e-01 0.4635
## bill_depth_mm 1.619e-16 2.122e-17 7.630e+00 6.85e-13 ***
## flipper_length_mm -2.684e-18 3.626e-18 -7.400e-01 0.4599
## body_mass_g 3.113e-19 5.916e-20 5.262e+00 3.36e-07 ***
## year 4.219e-17 2.318e-17 1.820e+00 0.0701 .
## sex_dummy 1.000e+00 5.620e-17 1.779e+16 < 2e-16 ***
## Adelie_Species -1.892e-16 1.605e-16 -1.179e+00 0.2395
## Chinstrap_Species -1.526e-16 1.552e-16 -9.840e-01 0.3264
## Biscoe_Island -1.718e-17 6.481e-17 -2.650e-01 0.7912
## Dream_Island -1.797e-17 6.162e-17 -2.920e-01 0.7708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.764219e-32)
##
## Null deviance: 5.8223e+01 on 232 degrees of freedom
## Residual deviance: 1.5017e-29 on 222 degrees of freedom
## AIC: -16049
##
## Number of Fisher Scoring iterations: 1
roc(train.penguins.data$sex_dummy, penguins_glmodel_full$fitted.values, plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = train.penguins.data$sex_dummy, predictor = penguins_glmodel_full$fitted.values, plot = TRUE)
##
## Data: penguins_glmodel_full$fitted.values in 114 controls (train.penguins.data$sex_dummy 0) < 119 cases (train.penguins.data$sex_dummy 1).
## Area under the curve: 1
It seems like we can drop year, Adelie_Species, Chinstrap_Species, Biscoe_Island and Dream_Island. Doing so will have minimal impact on our model, but will make it much more parsimonious. A more efficient model is a better model. We may also be able to drop bill_length and flipper_length, but I’m curious to see the results we get after dropping these first 5 variables before dropping these.
Our regression tree came out nicely.
penguins.3 <- penguins.numeric[,-c(5,7:10)]
set.seed(11883427)
train.penguins.rows.2 <- sample (rownames(penguins.3),nrow(penguins.3)*.7)
train.penguins.data.2 <- penguins.3[train.penguins.rows.2,]
valid.penguins.rows.2 <- setdiff(rownames(penguins.3),train.penguins.rows.2)
valid.penguins.data.2 <- penguins.3[valid.penguins.rows.2,]
# Classification tree
penguins.rt.2 <- rpart(train.penguins.data.2$`sex_dummy` ~ ., data = train.penguins.data.2, cp = 0, minsplit = 12)
printcp(penguins.rt.2)
##
## Regression tree:
## rpart(formula = train.penguins.data.2$sex_dummy ~ ., data = train.penguins.data.2,
## cp = 0, minsplit = 12)
##
## Variables actually used in tree construction:
## [1] bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
##
## Root node error: 58.223/233 = 0.24988
##
## n= 233
##
## CP nsplit rel error xerror xstd
## 1 0.2635516 0 1.00000 1.01213 0.0042999
## 2 0.0746527 2 0.47290 0.48532 0.0668237
## 3 0.0742715 3 0.39824 0.51007 0.0699388
## 4 0.0504828 4 0.32397 0.40272 0.0626034
## 5 0.0331479 5 0.27349 0.36353 0.0635328
## 6 0.0232163 6 0.24034 0.40232 0.0700662
## 7 0.0178455 7 0.21713 0.38095 0.0693697
## 8 0.0061655 9 0.18143 0.35829 0.0677268
## 9 0.0048403 10 0.17527 0.38500 0.0703059
## 10 0.0034819 12 0.16559 0.38767 0.0703719
## 11 0.0020260 13 0.16211 0.39281 0.0705689
## 12 0.0019988 15 0.15805 0.39368 0.0695206
## 13 0.0000000 17 0.15406 0.39119 0.0690728
penguins.rt.2$cptable
## CP nsplit rel error xerror xstd
## 1 0.263551573 0 1.0000000 1.0121268 0.00429988
## 2 0.074652702 2 0.4728969 0.4853239 0.06682374
## 3 0.074271529 3 0.3982442 0.5100668 0.06993878
## 4 0.050482840 4 0.3239726 0.4027218 0.06260345
## 5 0.033147875 5 0.2734898 0.3635251 0.06353277
## 6 0.023216256 6 0.2403419 0.4023211 0.07006619
## 7 0.017845454 7 0.2171257 0.3809496 0.06936975
## 8 0.006165489 9 0.1814347 0.3582946 0.06772684
## 9 0.004840309 10 0.1752693 0.3850035 0.07030594
## 10 0.003481865 12 0.1655886 0.3876681 0.07037193
## 11 0.002025959 13 0.1621068 0.3928143 0.07056891
## 12 0.001998849 15 0.1580549 0.3936764 0.06952059
## 13 0.000000000 17 0.1540572 0.3911920 0.06907283
# Plot the tree with prp function
prp(penguins.rt.2, digits = 4, type = 1, extra = 1, varlen = -10,
box.col = ifelse(penguins.rt.2$frame$var == "<leaf>", 'gray', 'white'))
# Linear Regression Model
penguins_glmodel_trim <- glm(train.penguins.data.2$'sex_dummy'~., data = train.penguins.data.2)
summary(penguins_glmodel_trim)
##
## Call:
## glm(formula = train.penguins.data.2$sex_dummy ~ ., data = train.penguins.data.2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0 0 NA NA
## bill_length_mm 0 0 NA NA
## bill_depth_mm 0 0 NA NA
## flipper_length_mm 0 0 NA NA
## body_mass_g 0 0 NA NA
## sex_dummy 1 0 Inf <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0)
##
## Null deviance: 58.223 on 232 degrees of freedom
## Residual deviance: 0.000 on 227 degrees of freedom
## AIC: -Inf
##
## Number of Fisher Scoring iterations: 1