Pablo Adames
September 14, 2021
Many guiding principles…
… that will make you a more effective:
Separate work flow from data product
Start a script on a dedicated folder/sub-folder within a project
Use a fresh session
Avoid .RData
The repetitive tasks a data scientist does to generate insights and predictions from data.
We will cover
Importing data from Excel Workbooks
Creating new variables
Transforming data
Merging data sets
Visualizing data
We will also…
Comparing Base R with Tidyverse Syntax
Using a Kaggle data set
Use the link below to download all the excel files the data folder.
library(pacman)
pacman::p_load(tidyverse, readxl, here)
athletes <- read_excel(here("data", "Athletes.xlsx"))
coaches <- read_excel(here("data", "Coaches.xlsx"))
genders <- read_excel(path = here("data", "EntriesGender.xlsx"),
col_types = c("text", rep("numeric",3)))
teams <- read_excel(here("data", "Teams.xlsx"))
medals <- read_excel(here("data", "Medals.xlsx"),
col_types = c("numeric", "text", rep("numeric", 5)))
Let's create a very silly and wide tibble.
big_tibble <- genders %>%
left_join(athletes, by=c("Discipline")) %>%
left_join(teams, by=c("Discipline", "NOC"))
bigger_tibble <- bind_cols(big_tibble, big_tibble)
bigger_tibble
# A tibble: 18,905 × 16
Discipline...1 Female...2 Male...3 Total...4 Name.x...5 NOC...6 Name.y...7
<chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 3x3 Basketball 32 32 64 BAYASGALAN … Mongolia Mongolia
2 3x3 Basketball 32 32 64 BEKKERING R… Netherl… Netherlan…
3 3x3 Basketball 32 32 64 BOGAERTS Ra… Belgium Belgium
4 3x3 Basketball 32 32 64 BROWN Ira Japan Japan
5 3x3 Basketball 32 32 64 BROWN Ira Japan Japan
6 3x3 Basketball 32 32 64 CAVARS Agnis Latvia Latvia
7 3x3 Basketball 32 32 64 CELIS Nick Belgium Belgium
8 3x3 Basketball 32 32 64 CONSOLINI C… Italy Italy
9 3x3 Basketball 32 32 64 CUIC Claudia Romania Romania
10 3x3 Basketball 32 32 64 D ALIE Rael… Italy Italy
# … with 18,895 more rows, and 9 more variables: Event...8 <chr>,
# Discipline...9 <chr>, Female...10 <dbl>, Male...11 <dbl>, Total...12 <dbl>,
# Name.x...13 <chr>, NOC...14 <chr>, Name.y...15 <chr>, Event...16 <chr>
Create a new variable
With base R
medals1 <-
transform(medals,
avg_medal_count = mean(Total))
medals1$avg_medal_count[1]
[1] 11.6129
- Assignment operator
- Function call
- Implicit lazy data frame element evaluation
With Tidyverse
library(tidyverse)
medals2 <-
medals %>%
mutate(avg_medal_count = mean(Total))
medals2$avg_medal_count[1]
[1] 11.6129
- Assignment operator
- Pipe binary operator
- Function call
- Implicit lazy data frame element evaluation
Hello
BaseR | = | TidyR | Case |
---|---|---|---|
f(x) | = | x %>% f | Single argument |
f(x, y) | = | x %>% f(y) | First argument |
f(y, x) | = | x %>% f(y, .) | Non-first argument |
f(x, y, arg = z) | = | z %>% f(x, y, arg = .) | Named argument |
From the tidyverse style guide:
Use %>%
to emphasize a sequence of actions, rather than the object that the actions are being performed on.
Piping:
teams %>%
head(n=10) %>%
tail(n=1) %>%
select(NOC, Discipline) %>%
unlist() %>%
unname()
[1] "Netherlands" "3x3 Basketball"
It reads from left to right, top to bottom.
Consult the the tidyverse style guide.
Using nested function calls:
unname(
unlist(
select(
tail(
head( teams, n = 10),
n=1),
NOC, Discipline)))
[1] "Netherlands" "3x3 Basketball"
Harder to read because you have to find the inner most layer first, then move up.
# Base R syntax
gender_bR <- transform(genders,
gender_bal = ifelse(Male - Female < 0,
"FemaleDominated",
ifelse(Male-Female>0,
"MaleDominated",
"Balanced")))
library(tidyverse)
# Tidyverse syntax
genders_tv <- genders %>%
mutate(gender_bal = case_when(
Male - Female < 0 ~ "FemaleDominated",
Male - Female == 0 ~ "Balanced",
Male - Female > 0 ~ "MaleDominated"
))
Identical results either way:
# A tibble: 5 × 5
Discipline Female Male Total gender_bal
<chr> <dbl> <dbl> <dbl> <chr>
1 3x3 Basketball 32 32 64 Balanced
2 Archery 64 64 128 Balanced
3 Artistic Gymnastics 98 98 196 Balanced
4 Artistic Swimming 105 0 105 FemaleDominated
5 Athletics 969 1072 2041 MaleDominated
Report the number of teams in each of the categories created in the previous slide.
genders_tv %>%
count(gender_bal, sort = TRUE) %>%
knitr::kable("pipe")
gender_bal | n |
---|---|
Balanced | 23 |
MaleDominated | 19 |
FemaleDominated | 4 |
Get the top five delegations by number of coaches present in the Tokyo 2021 Summer Olympics.
coaches %>%
count(NOC, name = "Number of Coaches", sort = TRUE) %>%
head(n=5) %>%
rename('National Olympic Committee' = NOC) %>%
knitr::kable("pipe")
National Olympic Committee | Number of Coaches |
---|---|
Japan | 35 |
Spain | 28 |
United States of America | 28 |
Australia | 22 |
Canada | 16 |
coaches %>% select(!Event) %>%
inner_join(athletes,
by = c("NOC", "Discipline"),
suffix = c("_coach", "_athlete")) %>%
select ("Name_athlete", "Name_coach",
"NOC", "Discipline") %>%
arrange(Discipline, NOC) %>%
head(20)
# A tibble: 20 × 4
Name_athlete Name_coach NOC Discipline
<chr> <chr> <chr> <chr>
1 BUCKLE Carolyn MONTICO Loredana Australia Artistic Swimming
2 BURKHILL Hannah MONTICO Loredana Australia Artistic Swimming
3 GAZZARD Kiera MONTICO Loredana Australia Artistic Swimming
4 HO Alessandra MONTICO Loredana Australia Artistic Swimming
5 KINASH Kirsten MONTICO Loredana Australia Artistic Swimming
6 PRESSER Rachel MONTICO Loredana Australia Artistic Swimming
7 THOMPSON Amie MONTICO Loredana Australia Artistic Swimming
8 BUCKLE Carolyn MONTICO Loredana Australia Artistic Swimming
9 BURKHILL Hannah MONTICO Loredana Australia Artistic Swimming
10 GAZZARD Kiera MONTICO Loredana Australia Artistic Swimming
11 HO Alessandra MONTICO Loredana Australia Artistic Swimming
12 KINASH Kirsten MONTICO Loredana Australia Artistic Swimming
13 PRESSER Rachel MONTICO Loredana Australia Artistic Swimming
14 THOMPSON Amie MONTICO Loredana Australia Artistic Swimming
15 ALEXANDRI Anna-Maria BRUN Aristide Austria Artistic Swimming
16 ALEXANDRI Eirini BRUN Aristide Austria Artistic Swimming
17 ALEXANDRI Anna-Maria MLADENOVA Albena Austria Artistic Swimming
18 ALEXANDRI Eirini MLADENOVA Albena Austria Artistic Swimming
19 ALEXANDRI Anna-Maria SHAHNAZI Shahbaz Austria Artistic Swimming
20 ALEXANDRI Eirini SHAHNAZI Shahbaz Austria Artistic Swimming