Data Wrangling

About the data

Penguins!

This project is meant to analyze three different species of penguin: Chinstrap, Gentoo, and Adéle. 8 variables of data have been collected on 344 different penguins. These penguins were living in the Palmer Archipelago, a group of islands off of Antarctica.The data was collected by Dr. Kristen Gorman while working with the Palmer Station Long Term Ecological Research Program. It shows the size of all three species found on the Archipelago. The data was collected between 2007 and 2009, and imported directly from the Environmental Data Initiative Portal, with no rights reserved. So anyone can practice their R-programming skills with this data set whenever they want!

About my approach

I plan to use this data to demonstrate how unsupervised learning can uncover similarities that may have otherwise been overlooked. I also hope to build classification models that can be used to identify the sex of a penguins using only a handfull of measurements.I plan to use a regression tree, as well as a linear regression model. Since the dataset is relatively small I will use bootstrap re-sampling to generate more observations. Building a larger dataset should provide a stronger prediction, and testing our dataset will generate more reliable outcomes.

Unsupervised learning should show us the major differences between male and female penguins. I will use k-means clustering to separate the individual penguins into a predetermined number of groups.

Classification models will likely give us more predictive power than the unsupervised model, but its results may be slightly less clear than the unuspervised methods.

Suppress warnings and messages then load all required packages

options(warn = -1)
library(palmerpenguins)
suppressMessages(library(gplots))
suppressMessages(library(janitor))
suppressMessages(library(tidyverse))
suppressMessages(library(caret))
suppressMessages(library(factoextra))
suppressMessages(library(gridExtra))
suppressMessages(library(fpc))
suppressMessages(library(rpart))
suppressMessages(library(rpart.plot))
suppressMessages(library(pROC))

My data is found in the ‘palmerpenguins’ package of R. The ‘penguins’ data is immediately accessible once the ‘palmerpenguins’ library is loaded. While working on this project I will be using the ‘gplots’ library to build enhanced visualizations such as our heat map. ‘janitor’ is a library that I will utilize when building certain tables. The ‘tidyverse’ package gives us access to multiple different packages without having to install them all separately. ‘tidyverse’ will be used primarily for creating graphics, and manipulating data. The library ‘caret’ will be used to train our complex classification models. ‘rpart’ is a package that allows us to build classification trees. ‘rpart.plot’ is the package that will enable me to visualize the classification trees built by ‘rpart’. Classification trees should be a great way to find which characteristics are attributed to each species of penguin.

Inspect the initial data

dim(penguins)

## [1] 344   8

summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

table(is.na(penguins))

## 
## FALSE  TRUE 
##  2733    19

We have 19 total NA values. The 4 variables missing two values leads me to believe there may be some overlap, but further analysis will allow us to hone in on these. If I remove the NA values from the data and only lose 11 values then that means 2 penguins never had their bill length, bill depth, flipper length, body mass, or sex observed. The other 9 missing values for sex would be fully complete with the exception of sex, but since it is such a small portion of our data I feel that we can justify the removal of these records. If I remove all NA’s and the data drops by 19 points then that means 19 penguins were fully observed with the exception of a single record. We could avoid this by imputing the median values for all numerics and splitting the missing sex records 6 to male and 5 to female. This would retain every record without disrupting our fully recorded observations very much.

Pre-processing for analysis and post-processing inspection

# Remove all NA values
penguins.2 <- na.omit(penguins)

# Verify how many records we lost from the original 344
nrow(penguins.2)

## [1] 333

# Create dummy variables for all of our factor variables
penguins.2$sex_dummy <- ifelse(penguins.2$sex == 'male',1,0)
penguins.2$Adelie_Species <- ifelse(penguins.2$species == 'Adelie',1,0)
penguins.2$Chinstrap_Species <- ifelse(penguins.2$species == 'Chinstrap',1,0)
penguins.2$Biscoe_Island <- ifelse(penguins.2$island == 'Biscoe',1,0)
penguins.2$Dream_Island <- ifelse(penguins.2$island == 'Dream',1,0)

# Use a correlation matrix to show relationships between each variable
penguins.numeric <- penguins.2[,-c(1,2,7)]
colfunc<-colorRampPalette(c("red","white","green"))
heatmap.2(cor(penguins.numeric),Rowv=FALSE,Colv=FALSE,dendrogram="none",col=colfunc(15),cellnote=round(cor(penguins.numeric),2),notecol="black",key=FALSE,trace='none',margins=c(9,7))

# Build a table to show the ratio of male to female for each species of penguin
penguins.2 %>%
  tabyl(sex, species) %>%
  adorn_totals(where = c("row", "col")) # add row, column totals

##     sex Adelie Chinstrap Gentoo Total
##  female     73        34     58   165
##    male     73        34     61   168
##   Total    146        68    119   333

# visualize the populations of each species on the three islands
ggplot(penguins.2, aes(x = island, fill = species)) +
  geom_bar(alpha = 1) +
  scale_fill_manual(values = c("grey","orange","black"),
                    guide = FALSE) +
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()

I have removed all NA values to find that they did overlap. I am left with data on 333 individual penguins. I then made a binary variables out of the penguin sexes, species and islands. Doing this allows me to use them as numerics, which helps a ton in predictive modeling and visualization. Turning these factor variables into numerics I should be able to utilize the information in a much more dynamic way.

I also built out a correlation matrix. This matrix displays all of the numeric variables with a gradient color scheme to show which variables are closely related, and which are the most different from the others. It looks like our response variable ‘sex_dummy’ is most closely related to the dimensions of the penguins. It makes sense that we wouldn’t get much information from species with regards to sex, since all three species are split evenly between males and females. The islands wouldn’t tell us much about the sex of our penguins because there aren’t more males on one island than another, but this wasn’t known previously.

We can see from the bar plot that Adele penguins are the only penguin with a presence on all 3 islands. Chinstrap penguins are observed living only on Dream Island, and all of the Gentoo penguins live on Biscoe Island.

Begin work on unsupervised learning

The first thing I will need to do is partition my data. This means splitting the data into 2 different sets. One data set will be used to train my predictive models, and the other set will be used to test the models once they have been built. I set a seed in order to make these results replicable. Without a seed the results will be different every time the code is ran. I used 70% of the dataset to train my model, and the remaining 30% will make up my testing data.

set.seed(11883427)
train.penguins.rows <- sample (rownames(penguins.numeric),nrow(penguins.numeric)*.7)
train.penguins.data <- penguins.numeric[train.penguins.rows,] 
valid.penguins.rows <- setdiff(rownames(penguins.numeric),train.penguins.rows)
valid.penguins.data <- penguins.numeric[valid.penguins.rows,]

First Full Models

fit <- kmeans(penguins.numeric, centers = 2, nstart = 35)
table(fit$cluster)

## 
##   1   2 
## 130 203

fviz_cluster(fit, data = penguins.numeric)

k3 <- kmeans(penguins.numeric, centers = 3, nstart = 25)
k4 <- kmeans(penguins.numeric, centers = 4, nstart = 25)
k5 <- kmeans(penguins.numeric, centers = 5, nstart = 25)

p1 <- fviz_cluster(fit, geom = "point", data = penguins.numeric) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = penguins.numeric) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = penguins.numeric) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = penguins.numeric) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)

plotcluster(penguins.numeric, fit$cluster)

# Classification tree
penguins.rt <- rpart(train.penguins.data$`sex_dummy` ~ ., data = train.penguins.data, cp = 0, minsplit = 12)
printcp(penguins.rt)

## 
## Regression tree:
## rpart(formula = train.penguins.data$sex_dummy ~ ., data = train.penguins.data, 
##     cp = 0, minsplit = 12)
## 
## Variables actually used in tree construction:
## [1] bill_depth_mm     bill_length_mm    Biscoe_Island     body_mass_g      
## [5] Chinstrap_Species year             
## 
## Root node error: 58.223/233 = 0.24988
## 
## n= 233 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.2635516      0   1.00000 1.00504 0.0035342
## 2  0.0746527      2   0.47290 0.50831 0.0682158
## 3  0.0742715      3   0.39824 0.52177 0.0706629
## 4  0.0504828      4   0.32397 0.48659 0.0701309
## 5  0.0331479      5   0.27349 0.45596 0.0713328
## 6  0.0232163      6   0.24034 0.40478 0.0676628
## 7  0.0178455      7   0.21713 0.38595 0.0669183
## 8  0.0118906      9   0.18143 0.37200 0.0670720
## 9  0.0048403     10   0.16954 0.36631 0.0663960
## 10 0.0034819     12   0.15986 0.38966 0.0672858
## 11 0.0020260     13   0.15638 0.39705 0.0681883
## 12 0.0019988     15   0.15233 0.40384 0.0677251
## 13 0.0000000     17   0.14833 0.40421 0.0677162

penguins.rt$cptable

##             CP nsplit rel error    xerror        xstd
## 1  0.263551573      0 1.0000000 1.0050393 0.003534227
## 2  0.074652702      2 0.4728969 0.5083115 0.068215848
## 3  0.074271529      3 0.3982442 0.5217689 0.070662931
## 4  0.050482840      4 0.3239726 0.4865851 0.070130900
## 5  0.033147875      5 0.2734898 0.4559580 0.071332793
## 6  0.023216256      6 0.2403419 0.4047821 0.067662818
## 7  0.017845454      7 0.2171257 0.3859470 0.066918289
## 8  0.011890586      9 0.1814347 0.3720003 0.067071968
## 9  0.004840309     10 0.1695442 0.3663120 0.066396004
## 10 0.003481865     12 0.1598635 0.3896614 0.067285808
## 11 0.002025959     13 0.1563817 0.3970469 0.068188254
## 12 0.001998849     15 0.1523298 0.4038389 0.067725115
## 13 0.000000000     17 0.1483321 0.4042056 0.067716171

# Plot the tree with prp function
prp(penguins.rt, digits = 4, type = 1, extra = 1, varlen = -10,
    box.col = ifelse(penguins.rt$frame$var == "<leaf>", 'gray', 'white'))

# Linear Regression Model
penguins_glmodel_full <- glm(train.penguins.data$'sex_dummy'~., data = train.penguins.data)
summary(penguins_glmodel_full)

## 
## Call:
## glm(formula = train.penguins.data$sex_dummy ~ ., data = train.penguins.data)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -8.318e-16  -2.036e-16   0.000e+00   2.220e-16   6.661e-16  
## 
## Coefficients:
##                     Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)       -8.803e-14  4.630e-14 -1.901e+00   0.0586 .  
## bill_length_mm     6.203e-18  8.448e-18  7.340e-01   0.4635    
## bill_depth_mm      1.619e-16  2.122e-17  7.630e+00 6.85e-13 ***
## flipper_length_mm -2.684e-18  3.626e-18 -7.400e-01   0.4599    
## body_mass_g        3.113e-19  5.916e-20  5.262e+00 3.36e-07 ***
## year               4.219e-17  2.318e-17  1.820e+00   0.0701 .  
## sex_dummy          1.000e+00  5.620e-17  1.779e+16  < 2e-16 ***
## Adelie_Species    -1.892e-16  1.605e-16 -1.179e+00   0.2395    
## Chinstrap_Species -1.526e-16  1.552e-16 -9.840e-01   0.3264    
## Biscoe_Island     -1.718e-17  6.481e-17 -2.650e-01   0.7912    
## Dream_Island      -1.797e-17  6.162e-17 -2.920e-01   0.7708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 6.764219e-32)
## 
##     Null deviance: 5.8223e+01  on 232  degrees of freedom
## Residual deviance: 1.5017e-29  on 222  degrees of freedom
## AIC: -16049
## 
## Number of Fisher Scoring iterations: 1

roc(train.penguins.data$sex_dummy, penguins_glmodel_full$fitted.values, plot = TRUE)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## 
## Call:
## roc.default(response = train.penguins.data$sex_dummy, predictor = penguins_glmodel_full$fitted.values,     plot = TRUE)
## 
## Data: penguins_glmodel_full$fitted.values in 114 controls (train.penguins.data$sex_dummy 0) < 119 cases (train.penguins.data$sex_dummy 1).
## Area under the curve: 1

Examine Results and Adjust

It seems like we can drop year, Adelie_Species, Chinstrap_Species, Biscoe_Island and Dream_Island. Doing so will have minimal impact on our model, but will make it much more parsimonious. A more efficient model is a better model. We may also be able to drop bill_length and flipper_length, but I’m curious to see the results we get after dropping these first 5 variables before dropping these.

penguins.3 <- penguins.numeric[,-c(5,7:10)]
set.seed(11883427)
train.penguins.rows.2 <- sample (rownames(penguins.3),nrow(penguins.3)*.7)
train.penguins.data.2 <- penguins.3[train.penguins.rows.2,] 
valid.penguins.rows.2 <- setdiff(rownames(penguins.3),train.penguins.rows.2)
valid.penguins.data.2 <- penguins.3[valid.penguins.rows.2,]

# Classification tree
penguins.rt.2 <- rpart(train.penguins.data.2$`sex_dummy` ~ ., data = train.penguins.data.2, cp = 0, minsplit = 12)
printcp(penguins.rt.2)

## 
## Regression tree:
## rpart(formula = train.penguins.data.2$sex_dummy ~ ., data = train.penguins.data.2, 
##     cp = 0, minsplit = 12)
## 
## Variables actually used in tree construction:
## [1] bill_depth_mm     bill_length_mm    body_mass_g       flipper_length_mm
## 
## Root node error: 58.223/233 = 0.24988
## 
## n= 233 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.2635516      0   1.00000 1.01213 0.0042999
## 2  0.0746527      2   0.47290 0.48532 0.0668237
## 3  0.0742715      3   0.39824 0.51007 0.0699388
## 4  0.0504828      4   0.32397 0.40272 0.0626034
## 5  0.0331479      5   0.27349 0.36353 0.0635328
## 6  0.0232163      6   0.24034 0.40232 0.0700662
## 7  0.0178455      7   0.21713 0.38095 0.0693697
## 8  0.0061655      9   0.18143 0.35829 0.0677268
## 9  0.0048403     10   0.17527 0.38500 0.0703059
## 10 0.0034819     12   0.16559 0.38767 0.0703719
## 11 0.0020260     13   0.16211 0.39281 0.0705689
## 12 0.0019988     15   0.15805 0.39368 0.0695206
## 13 0.0000000     17   0.15406 0.39119 0.0690728

penguins.rt.2$cptable

##             CP nsplit rel error    xerror       xstd
## 1  0.263551573      0 1.0000000 1.0121268 0.00429988
## 2  0.074652702      2 0.4728969 0.4853239 0.06682374
## 3  0.074271529      3 0.3982442 0.5100668 0.06993878
## 4  0.050482840      4 0.3239726 0.4027218 0.06260345
## 5  0.033147875      5 0.2734898 0.3635251 0.06353277
## 6  0.023216256      6 0.2403419 0.4023211 0.07006619
## 7  0.017845454      7 0.2171257 0.3809496 0.06936975
## 8  0.006165489      9 0.1814347 0.3582946 0.06772684
## 9  0.004840309     10 0.1752693 0.3850035 0.07030594
## 10 0.003481865     12 0.1655886 0.3876681 0.07037193
## 11 0.002025959     13 0.1621068 0.3928143 0.07056891
## 12 0.001998849     15 0.1580549 0.3936764 0.06952059
## 13 0.000000000     17 0.1540572 0.3911920 0.06907283

# Plot the tree with prp function
prp(penguins.rt.2, digits = 4, type = 1, extra = 1, varlen = -10,
    box.col = ifelse(penguins.rt.2$frame$var == "<leaf>", 'gray', 'white'))

# Linear Regression Model
penguins_glmodel_trim <- glm(train.penguins.data.2$'sex_dummy'~., data = train.penguins.data.2)
summary(penguins_glmodel_trim)

## 
## Call:
## glm(formula = train.penguins.data.2$sex_dummy ~ ., data = train.penguins.data.2)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##      0       0       0       0       0  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0          0      NA       NA    
## bill_length_mm           0          0      NA       NA    
## bill_depth_mm            0          0      NA       NA    
## flipper_length_mm        0          0      NA       NA    
## body_mass_g              0          0      NA       NA    
## sex_dummy                1          0     Inf   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0)
## 
##     Null deviance: 58.223  on 232  degrees of freedom
## Residual deviance:  0.000  on 227  degrees of freedom
## AIC: -Inf
## 
## Number of Fisher Scoring iterations: 1

Data Wrangling - Final

John Gleason

7/15/2021