myGithub.

This is an analysis of the palmerpenguins 0.1.1 dataSet.

That I have found so interesting on the Internet.

Takes some time and work but is illustrative.

The emperor penguin (Aptenodytes forsteri).

The Largest flightless seabird is not in this dataSet.

# install.packages("remotes") 
# <https://allisonhorst.github.io/palmerpenguins/> 
# remotes::install_github("allisonhorst/palmerpenguins") 
remotes::install_github("allisonhorst/palmerpenguins", quiet = TRUE) 
#library(shiny) 
library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr) 
library(palmerpenguins) 
library(tidyverse) 
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
library(bib2df)

Basic data analysis with palmerpenguins 0.1.1 dataSet

The Names of variables in palmepenguins DataSet

Here there are the eight (8) variables of palmepenguins DataSet

names(penguins) 
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

And which of the 8 are categorical and which quantitative variables?

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

A top view of the data

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

The first 6 variables in palmepenguins 0.1.1 DataSet

Here there are first 6 rows of palmepenguins DataSet

head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

The last 6 variables in palmepenguins 0.1.1 DataSet

Here there are last 6 rows of palmepenguins DataSet

tail(penguins)
## # A tibble: 6 × 8
##   species   island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
##   <fct>     <fct>           <dbl>         <dbl>        <int>   <int> <fct> <int>
## 1 Chinstrap Dream            45.7          17            195    3650 fema…  2009
## 2 Chinstrap Dream            55.8          19.8          207    4000 male   2009
## 3 Chinstrap Dream            43.5          18.1          202    3400 fema…  2009
## 4 Chinstrap Dream            49.6          18.2          193    3775 male   2009
## 5 Chinstrap Dream            50.8          19            210    4100 male   2009
## 6 Chinstrap Dream            50.2          18.7          198    3775 fema…  2009
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

And now the structure of palmepenguins DataSet

str(penguins) 
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Missing values

It’s a lot missing values NA (19) in this [344 × 8] dataSet, dislike

Iris dataSet by example.

penguins %>%
  #group_by(species) %>%
   select(everything()) %>% 
  summarise_all(list(~sum(is.na(.)))) %>%
  pivot_longer(cols = 1:8, names_to = 'columns', values_to = 'NA_count') %>%
  arrange(desc(NA_count)) %>%
  ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'darkorange') +
  geom_label(aes(label = NA_count)) +
  theme_minimal() +
  labs(title = 'Penguins - NA Count')

The count of individuals by specie

penguins %>% count(species)
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

The Graphic of Penguin mass vs. flipper length

mass_flipper <- ggplot(data = penguins, 
                       aes(x = flipper_length_mm,
                           y = body_mass_g)) +
  geom_point(aes(color = species, 
                 shape = species),
             size = 3,
             alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.2, 0.7),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")

mass_flipper
## Warning: Removed 2 rows containing missing values (geom_point).

The Graphic of Flipper length vs. bill length

flipper_bill <- ggplot(data = penguins,
                         aes(x = flipper_length_mm,
                             y = bill_length_mm)) +
  geom_point(aes(color = species, 
                 shape = species),
             size = 3,
             alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Flipper and bill length",
       subtitle = "Dimensions for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Flipper length (mm)",
       y = "Bill length (mm)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.85, 0.15),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")

flipper_bill
## Warning: Removed 2 rows containing missing values (geom_point).

The Graph of Bill length vs. depth

bill_len_dep <- ggplot(data = penguins,
                         aes(x = bill_length_mm,
                             y = bill_depth_mm,
                             group = species)) +
  geom_point(aes(color = species, 
                 shape = species),
             size = 3,
             alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, aes(color = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin bill dimensions",
       subtitle = "Bill length and depth for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Bill depth (mm)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.85, 0.15),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")

bill_len_dep
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

A Note at the end.

Even the first approach to palmerpenguins 0.1.1 dataSet, is a lot of code! Of course, I skipped a lot of things in this post, There are many plots, correlations and data cleaning we can do for this dataSet. However, I hope this trial gives inspiration for further data analysis.

Next time it will be go to a R Shiny App!!!!

Thanks.

x <- citation("palmerpenguins")
toBibtex(x) 
## @Manual{,
##   title = {palmerpenguins: Palmer Archipelago (Antarctica) penguin data},
##   author = {Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman},
##   year = {2020},
##   note = {R package version 0.1.0},
##   doi = {10.5281/zenodo.3960218},
##   url = {https://allisonhorst.github.io/palmerpenguins/},
## }

Ernesto Ibáñez Twitter @eeibanez