Introduction

This document gives you an opportunity to practise the skills developed in class. This session focuses on tidyr, a package for reshaping data into a tidy format, in which each variable has its own column and each observation has its own row.
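
As a minimal sketch of what "tidy" means here, the example below reshapes a small made-up table in which the year variable is spread across column names. The `untidy` table and its counts are hypothetical and are not taken from the penguins data; `pivot_longer()` is one of several tidyr functions that could be used for this.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: the "year" variable is hidden in the column names
untidy <- tribble(
  ~species,  ~`2007`, ~`2008`,
  "Adelie",       10,      12,
  "Gentoo",        8,       9
)

# Move the year columns into a single "year" column, one row per observation
tidy <- untidy %>%
  pivot_longer(
    cols = c(`2007`, `2008`),
    names_to = "year",
    values_to = "n"
  )

tidy
```

After reshaping, each row holds one species-year observation and each of `species`, `year`, and `n` is its own column.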

For this assignment, we will use the palmerpenguins package and its data sets. Install the package and load it into your current R session; once it is loaded, the penguins data frame is available directly. Running data(package = 'palmerpenguins') lists the data sets that the package provides.

library(palmerpenguins)
library(tidyverse)
library(explore)
data(package = 'palmerpenguins')

Exploration

penguins %>% describe()
## # A tibble: 8 × 8
##   variable          type     na na_pct unique    min   mean    max
##   <chr>             <chr> <int>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
## 1 species           fct       0    0        3   NA     NA     NA  
## 2 island            fct       0    0        3   NA     NA     NA  
## 3 bill_length_mm    dbl       2    0.6    165   32.1   43.9   59.6
## 4 bill_depth_mm     dbl       2    0.6     81   13.1   17.2   21.5
## 5 flipper_length_mm int       2    0.6     56  172    201.   231  
## 6 body_mass_g       int       2    0.6     95 2700   4202.  6300  
## 7 sex               fct      11    3.2      3   NA     NA     NA  
## 8 year              int       0    0        3 2007   2008.  2009

penguins %>% explore_all()

penguins %>% explore_all(target = sex)

penguins %>%
  select(species, body_mass_g, ends_with("_mm")) %>% 
  GGally::ggpairs(aes(color = species)) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"))

Question 1

  1. What is the maximum bill length of each species of penguin?
  2. How many male and female penguins were there on each island from each species in 2007? Show which islands had zero male or female penguins.
  3. Get the average value for all numeric variables (excluding year) per species.
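
# 1. Maximum bill length of each species of penguin
maxspecies <- penguins %>%
  group_by(species) %>%
  summarise(max_bill_length_mm = max(bill_length_mm, na.rm = TRUE))

# 2. Number of penguins per species, island, and sex in 2007;
#    .drop = FALSE keeps factor combinations with a count of zero
species_2007 <- penguins %>%
  filter(year == 2007) %>%
  group_by(species) %>%
  count(island, sex, .drop = FALSE)

# Or, passing species to count() directly instead of using group_by():
penguins %>%
  filter(year == 2007) %>%
  count(species, island, sex, .drop = FALSE)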
## # A tibble: 21 × 4
##    species   island    sex        n
##    <fct>     <fct>     <fct>  <int>
##  1 Adelie    Biscoe    female     5
##  2 Adelie    Biscoe    male       5
##  3 Adelie    Dream     female     9
##  4 Adelie    Dream     male      10
##  5 Adelie    Dream     <NA>       1
##  6 Adelie    Torgersen female     8
##  7 Adelie    Torgersen male       7
##  8 Adelie    Torgersen <NA>       5
##  9 Chinstrap Biscoe    female     0
## 10 Chinstrap Biscoe    male       0
## # … with 11 more rows
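
The second part of the question also asks which islands had zero male or female penguins. One possible way to isolate those rows, sketched here under the assumption that penguins with a missing sex should be left out, is to filter the counts for zeros. The name `zero_counts` is introduced only for this illustration.

```r
library(palmerpenguins)
library(dplyr)

# Count every species/island/sex combination in 2007, keeping empty ones,
# then retain only the combinations with no penguins at all
zero_counts <- penguins %>%
  filter(year == 2007, !is.na(sex)) %>%
  count(species, island, sex, .drop = FALSE) %>%
  filter(n == 0)

zero_counts
```

Dropping the missing values before counting keeps the `<NA>` level out of the result, so every remaining row refers to either female or male penguins.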

# 3. Mean of the four numeric variables (excluding year) per species,
#    using an anonymous function so that na.rm = TRUE can be passed to mean()
avg_numeric <- penguins %>%
  group_by(species) %>%
  summarise(across(c(body_mass_g, ends_with("_mm")), \(x) mean(x, na.rm = TRUE)))
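
As an alternative sketch for the third part, the numeric columns can be selected by type rather than by name, which avoids listing them by hand. This assumes that `year` is the only numeric column to exclude; the name `avg_numeric_alt` is introduced only for this illustration.

```r
library(palmerpenguins)
library(dplyr)

# Select every numeric column except year, then average within each species
avg_numeric_alt <- penguins %>%
  group_by(species) %>%
  summarise(across(where(is.numeric) & !year, \(x) mean(x, na.rm = TRUE)))

avg_numeric_alt
```

Selecting by type keeps working if further numeric measurements are added to the data, whereas selecting by name makes the chosen columns explicit.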

Question 2

The code chunk below merges the values of several columns of the penguins data into a single column and saves the result as penguins_united. Run it and make sure you understand how it transforms the data. Then use the separate() function to split the merged column of penguins_united back into the original columns, and store the result in an object called penguins_separated. Remember to replace the string "NA" with a true missing value, NA.

penguins_united <- penguins %>%
  mutate(across(bill_length_mm:body_mass_g, as.character)) %>%
  unite(
    col = "merged",
    bill_length_mm:body_mass_g
  )

penguins_united
## # A tibble: 344 × 5
##    species island    merged             sex     year
##    <fct>   <fct>     <chr>              <fct>  <int>
##  1 Adelie  Torgersen 39.1_18.7_181_3750 male    2007
##  2 Adelie  Torgersen 39.5_17.4_186_3800 female  2007
##  3 Adelie  Torgersen 40.3_18_195_3250   female  2007
##  4 Adelie  Torgersen NA_NA_NA_NA        <NA>    2007
##  5 Adelie  Torgersen 36.7_19.3_193_3450 female  2007
##  6 Adelie  Torgersen 39.3_20.6_190_3650 male    2007
##  7 Adelie  Torgersen 38.9_17.8_181_3625 female  2007
##  8 Adelie  Torgersen 39.2_19.6_195_4675 male    2007
##  9 Adelie  Torgersen 34.1_18.1_193_3475 <NA>    2007
## 10 Adelie  Torgersen 42_20.2_190_4250   <NA>    2007
## # … with 334 more rows
# Start your answer below

penguins_separated <- penguins_united %>%
  separate(
    merged,
    into = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
    sep = "_",
    remove = TRUE
  ) %>%
  # Replace the literal string "NA" with a true missing value
  mutate(across(where(is.character), \(x) na_if(x, "NA")))

penguins_separated
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
##    <fct>   <fct>     <chr>          <chr>         <chr>      <chr>   <fct> <int>
##  1 Adelie  Torgersen 39.1           18.7          181        3750    male   2007
##  2 Adelie  Torgersen 39.5           17.4          186        3800    fema…  2007
##  3 Adelie  Torgersen 40.3           18            195        3250    fema…  2007
##  4 Adelie  Torgersen <NA>           <NA>          <NA>       <NA>    <NA>   2007
##  5 Adelie  Torgersen 36.7           19.3          193        3450    fema…  2007
##  6 Adelie  Torgersen 39.3           20.6          190        3650    male   2007
##  7 Adelie  Torgersen 38.9           17.8          181        3625    fema…  2007
##  8 Adelie  Torgersen 39.2           19.6          195        4675    male   2007
##  9 Adelie  Torgersen 34.1           18.1          193        3475    <NA>   2007
## 10 Adelie  Torgersen 42             20.2          190        4250    <NA>   2007
## # … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
## #   ²​body_mass_g
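
In the output above, the separated columns are still character vectors, whereas they were numeric in the original data. One possible way to restore the types as well, sketched here, is the `convert = TRUE` argument of `separate()`, which runs `type.convert()` on the new columns and also turns the string "NA" into a missing value. The name `penguins_typed` is introduced only for this illustration.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

# Rebuild the merged column exactly as in the question
penguins_united <- penguins %>%
  mutate(across(bill_length_mm:body_mass_g, as.character)) %>%
  unite(col = "merged", bill_length_mm:body_mass_g)

# Split the column again and let separate() guess the column types
penguins_typed <- penguins_united %>%
  separate(
    merged,
    into = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
    sep = "_",
    convert = TRUE
  )

penguins_typed
```

An explicit `mutate(across(..., as.numeric))` after `separate()` would be another option when more control over the resulting types is wanted.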