A Talk about the Tidyverse Syntax

Pablo Adames
September 14, 2021

Tidyness

Many guiding principles…

… that will make you a more effective:

  • data scientist
  • data analysis

Some Tidy principles

  • Separate work flow from data product

  • Start a script on a dedicated folder/sub-folder within a project

  • Use a fresh session

  • Avoid .RData

Tidy Data

  • One variables per column
  • One observations per row
  • One value per cell

The data science workflow

  • One type of observation per table.

Many good references

The data science work flow

The data science workflow

The repetitive tasks a data scientist does to generate insights and predictions from data.

We will cover


  • Importing data from Excel Workbooks

  • Creating new variables

  • Transforming data

  • Merging data sets

  • Visualizing data

We will also…


  • Comparing Base R with Tidyverse Syntax

  • Using a Kaggle data set

Download the data from Kaggle

Use the link below to download all the excel files the data folder.

library(pacman)
pacman::p_load(tidyverse, readxl, here) 

athletes <- read_excel(here("data", "Athletes.xlsx"))
coaches  <- read_excel(here("data", "Coaches.xlsx"))
genders  <- read_excel(path = here("data", "EntriesGender.xlsx"),
                       col_types = c("text", rep("numeric",3)))
teams    <- read_excel(here("data", "Teams.xlsx"))
medals   <- read_excel(here("data", "Medals.xlsx"),
                       col_types = c("numeric", "text", rep("numeric", 5)))

The tibble

Let's create a very silly and wide tibble.

big_tibble <- genders %>% 
  left_join(athletes, by=c("Discipline")) %>% 
  left_join(teams, by=c("Discipline", "NOC"))
bigger_tibble <- bind_cols(big_tibble, big_tibble)
bigger_tibble
# A tibble: 18,905 × 16
   Discipline...1 Female...2 Male...3 Total...4 Name.x...5   NOC...6  Name.y...7
   <chr>               <dbl>    <dbl>     <dbl> <chr>        <chr>    <chr>     
 1 3x3 Basketball         32       32        64 BAYASGALAN … Mongolia Mongolia  
 2 3x3 Basketball         32       32        64 BEKKERING R… Netherl… Netherlan…
 3 3x3 Basketball         32       32        64 BOGAERTS Ra… Belgium  Belgium   
 4 3x3 Basketball         32       32        64 BROWN Ira    Japan    Japan     
 5 3x3 Basketball         32       32        64 BROWN Ira    Japan    Japan     
 6 3x3 Basketball         32       32        64 CAVARS Agnis Latvia   Latvia    
 7 3x3 Basketball         32       32        64 CELIS Nick   Belgium  Belgium   
 8 3x3 Basketball         32       32        64 CONSOLINI C… Italy    Italy     
 9 3x3 Basketball         32       32        64 CUIC Claudia Romania  Romania   
10 3x3 Basketball         32       32        64 D ALIE Rael… Italy    Italy     
# … with 18,895 more rows, and 9 more variables: Event...8 <chr>,
#   Discipline...9 <chr>, Female...10 <dbl>, Male...11 <dbl>, Total...12 <dbl>,
#   Name.x...13 <chr>, NOC...14 <chr>, Name.y...15 <chr>, Event...16 <chr>

The tibble selling points

What makes a tibble different from a data.frame

  1. The second header row shows data types
  2. Row numbers instead of row names
  3. If more columns than can be fit across, they are mentioned at the bottom
  4. Strings do not default to factors
  5. At most 10 rows shown by default

R Base Syntax Vs Tidyverse Syntax

Create a new variable

With base R

medals1 <- 
  transform(medals,
            avg_medal_count = mean(Total))

medals1$avg_medal_count[1]
[1] 11.6129
  • Assignment operator
  • Function call
  • Implicit lazy data frame element evaluation

With Tidyverse

library(tidyverse)

medals2 <- 
  medals %>%
  mutate(avg_medal_count = mean(Total))

medals2$avg_medal_count[1]
[1] 11.6129
  • Assignment operator
  • Pipe binary operator
  • Function call
  • Implicit lazy data frame element evaluation

Base R not so messy eh!


Hello

Function calls I can read…


However, what is it with the …

Introducing the ...

Pipe Operator

BaseR = TidyR Case
f(x) = x %>% f Single argument
f(x, y) = x %>% f(y) First argument
f(y, x) = x %>% f(y, .) Non-first argument
f(x, y, arg = z) = z %>% f(x, y, arg = .) Named argument


From the tidyverse style guide:

Use %>% to emphasize a sequence of actions, rather than the object that the actions are being performed on.

Tidiness (a reality check)

Piping:

teams %>% 
  head(n=10) %>% 
  tail(n=1) %>% 
  select(NOC, Discipline) %>%
  unlist() %>% 
  unname()
[1] "Netherlands"    "3x3 Basketball"


It reads from left to right, top to bottom. Consult the the tidyverse style guide.

Using nested function calls:

 unname(
   unlist(
     select(
       tail( 
         head( teams, n = 10), 
         n=1), 
       NOC, Discipline)))
[1] "Netherlands"    "3x3 Basketball"


Harder to read because you have to find the inner most layer first, then move up.

Create a variable

# Base R syntax
gender_bR <- transform(genders,
          gender_bal = ifelse(Male - Female < 0, 
                               "FemaleDominated", 
                               ifelse(Male-Female>0, 
                                      "MaleDominated", 
                                      "Balanced")))
library(tidyverse)
# Tidyverse syntax
genders_tv <- genders %>%
  mutate(gender_bal = case_when(
    Male - Female < 0 ~ "FemaleDominated",
    Male - Female == 0 ~ "Balanced",
    Male - Female > 0 ~ "MaleDominated"
  ))

Identical results either way:

# A tibble: 5 × 5
  Discipline          Female  Male Total gender_bal     
  <chr>                <dbl> <dbl> <dbl> <chr>          
1 3x3 Basketball          32    32    64 Balanced       
2 Archery                 64    64   128 Balanced       
3 Artistic Gymnastics     98    98   196 Balanced       
4 Artistic Swimming      105     0   105 FemaleDominated
5 Athletics              969  1072  2041 MaleDominated  

Transforming a data set (1)

Report the number of teams in each of the categories created in the previous slide.

genders_tv %>% 
  count(gender_bal, sort = TRUE) %>%
  knitr::kable("pipe")
gender_bal n
Balanced 23
MaleDominated 19
FemaleDominated 4

Transforming a data set (2)

Get the top five delegations by number of coaches present in the Tokyo 2021 Summer Olympics.

coaches %>% 
  count(NOC, name = "Number of Coaches", sort = TRUE) %>% 
  head(n=5) %>%
  rename('National Olympic Committee' = NOC)  %>%
  knitr::kable("pipe")
National Olympic Committee Number of Coaches
Japan 35
Spain 28
United States of America 28
Australia 22
Canada 16

Merging two data sets

coaches %>% select(!Event) %>% 
  inner_join(athletes, 
             by = c("NOC", "Discipline"), 
             suffix = c("_coach", "_athlete")) %>%
  select ("Name_athlete", "Name_coach", 
          "NOC", "Discipline") %>% 
  arrange(Discipline, NOC) %>%
  head(20)
# A tibble: 20 × 4
   Name_athlete         Name_coach       NOC       Discipline       
   <chr>                <chr>            <chr>     <chr>            
 1 BUCKLE Carolyn       MONTICO Loredana Australia Artistic Swimming
 2 BURKHILL Hannah      MONTICO Loredana Australia Artistic Swimming
 3 GAZZARD Kiera        MONTICO Loredana Australia Artistic Swimming
 4 HO Alessandra        MONTICO Loredana Australia Artistic Swimming
 5 KINASH Kirsten       MONTICO Loredana Australia Artistic Swimming
 6 PRESSER Rachel       MONTICO Loredana Australia Artistic Swimming
 7 THOMPSON Amie        MONTICO Loredana Australia Artistic Swimming
 8 BUCKLE Carolyn       MONTICO Loredana Australia Artistic Swimming
 9 BURKHILL Hannah      MONTICO Loredana Australia Artistic Swimming
10 GAZZARD Kiera        MONTICO Loredana Australia Artistic Swimming
11 HO Alessandra        MONTICO Loredana Australia Artistic Swimming
12 KINASH Kirsten       MONTICO Loredana Australia Artistic Swimming
13 PRESSER Rachel       MONTICO Loredana Australia Artistic Swimming
14 THOMPSON Amie        MONTICO Loredana Australia Artistic Swimming
15 ALEXANDRI Anna-Maria BRUN Aristide    Austria   Artistic Swimming
16 ALEXANDRI Eirini     BRUN Aristide    Austria   Artistic Swimming
17 ALEXANDRI Anna-Maria MLADENOVA Albena Austria   Artistic Swimming
18 ALEXANDRI Eirini     MLADENOVA Albena Austria   Artistic Swimming
19 ALEXANDRI Anna-Maria SHAHNAZI Shahbaz Austria   Artistic Swimming
20 ALEXANDRI Eirini     SHAHNAZI Shahbaz Austria   Artistic Swimming

References