Goals of data wrangling

  • Data come in various formats (long, wide) and data type (xlsx, ods, json, csv, sav)
  • No format is suitable for every goal
  • Fluency in data wrangling gives you a lot of power.
  • Make data format suitable to use: e.g. for statistical models (correlations, linear regression), functions, data viz, summary table
  • Calculate new variables, filter or combine data
  • Reveal information
  • Summarise information
  • (also creating counterbalanced, randomised stimulus lists)

Collection of R packages for data science that share:

  • common data philosophies
  • grammar
  • data structures
  • best practice
  • designed to work together
# Installs 19 packages
install.packages("tidyverse")
# Loads 6 packages
library(tidyverse)

Tidy data

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value have its own cell.

Why?

  • Placing variables in columns takes advantage of R’s vectorised nature (faster processing, more compact code).
  • Consistent data structure allows easier learning because tidyverse functions have similar underpinning principles.

tidyverse: verbs

  • Functions that do specific things to our data.

  • Must know: read_csv, write_csv, glimpse, select, filter, mutate, group_by / ungroup, summarise, pivot_wider / _longer

  • Also important: count, pull, slice, across, recode, unique, n, where, everything, ~ and ., map, starts_with, ends_with, contains, separate, unite, transmute, _join, bind_rows / _cols

  • There are more but these are the most important ones.

tbls (tibbles)

  • tidyverse is operating with tibbles
  • Type of data structure
  • Easier to read in console
# Imports data as data frame
data_as_frame <- read.csv("path_to_data/data.csv")
# Imports data as tibble
data_as_tibble <- read_csv("path_to_data/data.csv")
  • .csv: comma separated file
  • readr package: e.g. read_csv, read_delim, read_tsv

For other data formats:

  • haven package: e.g. read_dta, read_sav, read_sas
  • readxl package: e.g. read_excel, read_xls, read_xlsx
# Summarise data structure in base R
str(data_as_frame)
# Summarise data structure in tidyverse
glimpse(data_as_tibble)

Open exercise script 1

tidyverse functions

Functions follow the principle

function_name(data_name, argument)

where argument specifies which variable / condition etc. the function has to operate on.

# Picking out variables
select(data, variable) 
# Subsetting data
filter(data, variable > 100) 
# Change / add variables
mutate(data, sqr = variable^2)
# Aggregate data
summarise(data, avg = mean(variable)) 

Selecting variables

Extracts variables you’re interested in.

blomkvist <- read_csv("data/blomkvist.csv")
glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ sex        <chr> "male", "female", "female", "…
$ age        <dbl> 84, 37, 62, 85, 73, 65, 30, 4…
$ medicine   <dbl> 8, 1, 0, 4, 5, 0, 0, 0, 11, 0…
$ meds_cat   <chr> "a lot", "little", "none", "f…
$ smoker     <chr> "former", "no", "yes", "forme…
$ rt_hand_d  <dbl> 702, 471, 639, 708, 607, 542,…
$ rt_hand_nd <dbl> 780, 497, 638, 639, 652, 499,…
$ rt_foot_d  <dbl> 1009, 738, 878, 902, 923, 687…
$ rt_foot_nd <dbl> 963, 692, 786, 1374, 805, 600…

Selecting variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, id, sex, age)
# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: select range

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, id:age)
# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: index

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, 1, 2, 3)
# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: index range

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, 1:3)
# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: rename

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, id, sex, rt = rt_hand_d)
# A tibble: 267 × 3
     id sex       rt
  <dbl> <chr>  <dbl>
1     1 male    702.
2     2 female  471.
3     3 female  639.
4     4 female  708 
5     5 male    607.
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, id, starts_with("rt_"))
# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, ends_with("d"))

???

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, id, ends_with("d"))
# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, id, contains("hand"))

???

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, id, contains("hand"))
# A tibble: 267 × 3
     id rt_hand_d rt_hand_nd
  <dbl>     <dbl>      <dbl>
1     1      702.       780.
2     2      471.       497 
3     3      639.       638 
4     4      708        639.
5     5      607.       652 
# ℹ 262 more rows

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, -contains("hand"))
# A tibble: 267 × 8
     id sex      age medicine meds_cat smoker rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl>    <dbl> <chr>    <chr>      <dbl>      <dbl>
1     1 male      84        8 a lot    former     1009        963.
2     2 female    37        1 little   no          738.       692.
3     3 female    62        0 none     yes         878        786 
4     4 female    85        4 few      former      902.      1374.
5     5 male      73        5 a lot    former      923        805 
# ℹ 262 more rows

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, -ends_with("_d"))

???

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, -ends_with("_d"))
# A tibble: 267 × 8
     id sex      age medicine meds_cat smoker rt_hand_nd rt_foot_nd
  <dbl> <chr>  <dbl>    <dbl> <chr>    <chr>       <dbl>      <dbl>
1     1 male      84        8 a lot    former       780.       963.
2     2 female    37        1 little   no           497        692.
3     3 female    62        0 none     yes          638        786 
4     4 female    85        4 few      former       639.      1374.
5     5 male      73        5 a lot    former       652        805 
# ℹ 262 more rows

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, -sex:-smoker)

???

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, -sex:-smoker)
# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, where(is.character))
# A tibble: 267 × 3
  sex    meds_cat smoker
  <chr>  <chr>    <chr> 
1 male   a lot    former
2 female little   no    
3 female none     yes   
4 female few      former
5 male   a lot    former
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, where(is.numeric))

???

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…
select(blomkvist, where(is.numeric))
# A tibble: 267 × 7
     id   age medicine rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <dbl>    <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1    84        8      702.       780.     1009        963.
2     2    37        1      471.       497       738.       692.
3     3    62        0      639.       638       878        786 
4     4    85        4      708        639.      902.      1374.
5     5    73        5      607.       652       923        805 
# ℹ 262 more rows

Continue with exercise 2

Filtering data

Select variables of interest

blomkvist_rt <- select(blomkvist, id, smoker, age, rt = rt_hand_d)

Subsetting data by selecting rows that meet one condition or more.

filter(data, condition)

Continuous variables

filter(blomkvist_rt, rt >= 708)
# A tibble: 64 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     4 former    85  708 
2     9 former    83  737.
3    12 no        88  889 
4    13 yes       62  884.
5    14 former    88  832.
# ℹ 59 more rows
filter(blomkvist_rt, rt > 708)
# A tibble: 62 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     9 former    83  737.
2    12 no        88  889 
3    13 yes       62  884.
4    14 former    88  832.
5    17 former    80  930 
# ℹ 57 more rows

Continuous variables

filter(blomkvist_rt, rt > 708, rt < 900)
# A tibble: 38 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     9 former    83  737.
2    12 no        88  889 
3    13 yes       62  884.
4    14 former    88  832.
5    39 no        88  727 
# ℹ 33 more rows

Continuous variables

filter(blomkvist_rt, rt > 708 | rt < 900)
# A tibble: 266 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 261 more rows
# which is a logical tautology really
filter(blomkvist_rt, rt < 708 | rt > 900)
# A tibble: 226 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     5 former    73  607.
5     6 no        65  542.
# ℹ 221 more rows

Continuous variables

filter(blomkvist_rt, rt > mean(rt, na.rm = TRUE))
# A tibble: 100 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     4 former    85  708 
3     9 former    83  737.
4    12 no        88  889 
5    13 yes       62  884.
# ℹ 95 more rows
mean(blomkvist_rt$rt, na.rm = TRUE)
[1] 639

NB. What’s na.rm = TRUE?

# Data with missing values
y <- c(100, 1150, 200, 43, NA, 15)
mean(y)
[1] NA
mean(y, na.rm = TRUE)
[1] 302
sd(y, na.rm = TRUE)
[1] 480

Categorical variables

filter(blomkvist_rt, smoker == "yes")
# A tibble: 26 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     3 yes       62  639.
2    10 yes       58  550.
3    13 yes       62  884.
4    24 yes       57  612.
5    28 yes       59  586.
# ℹ 21 more rows
unique(blomkvist_rt$smoker)
[1] "former" "no"     "yes"    NA      

Categorical variables

filter(blomkvist_rt, smoker != "yes")
# A tibble: 240 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     4 former    85  708 
4     5 former    73  607.
5     6 no        65  542.
# ℹ 235 more rows
unique(blomkvist_rt$smoker)
[1] "former" "no"     "yes"    NA      

Categorical variables

filter(blomkvist_rt, smoker %in% c("yes", "former"))
# A tibble: 103 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     3 yes       62  639.
3     4 former    85  708 
4     5 former    73  607.
5     8 former    49  509.
# ℹ 98 more rows
filter(blomkvist_rt, !(smoker %in% c("yes", "former")))
# A tibble: 164 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     2 no        37  471.
2     6 no        65  542.
3     7 no        30  571.
4    11 no        25  548 
5    12 no        88  889 
# ℹ 159 more rows

Missing data

unique(blomkvist_rt$smoker)
[1] "former" "no"     "yes"    NA      
filter(blomkvist_rt, smoker == "NA") # ooops!!!
# A tibble: 0 × 4
# ℹ 4 variables: id <dbl>, smoker <chr>, age <dbl>, rt <dbl>
filter(blomkvist_rt, smoker == NA) # double ooops!!!
# A tibble: 0 × 4
# ℹ 4 variables: id <dbl>, smoker <chr>, age <dbl>, rt <dbl>

Missing data

filter(blomkvist_rt, is.na(smoker))
# A tibble: 1 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1   149 <NA>      34  479.
filter(blomkvist_rt, !is.na(smoker))
# A tibble: 266 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 261 more rows

Continue with exercise 3

Mutating data

Adding new variables to the data or changing existing ones.

mutate(blomkvist_rt, rt_2 = rt * rt)
# A tibble: 267 × 5
     id smoker   age    rt    rt_2
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702. 492336.
2     2 no        37  471. 221527.
3     3 yes       62  639. 407895.
4     4 former    85  708  501264 
5     5 former    73  607. 368854.
# ℹ 262 more rows
mutate(blomkvist_rt, rt_2 = rt^2)
# A tibble: 267 × 5
     id smoker   age    rt    rt_2
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702. 492336.
2     2 no        37  471. 221527.
3     3 yes       62  639. 407895.
4     4 former    85  708  501264 
5     5 former    73  607. 368854.
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, log_rt = log(rt))
# A tibble: 267 × 5
     id smoker   age    rt log_rt
  <dbl> <chr>  <dbl> <dbl>  <dbl>
1     1 former    84  702.   6.55
2     2 no        37  471.   6.15
3     3 yes       62  639.   6.46
4     4 former    85  708    6.56
5     5 former    73  607.   6.41
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, is_slow = rt > 700)
# A tibble: 267 × 5
     id smoker   age    rt is_slow
  <dbl> <chr>  <dbl> <dbl> <lgl>  
1     1 former    84  702. TRUE   
2     2 no        37  471. FALSE  
3     3 yes       62  639. FALSE  
4     4 former    85  708  TRUE   
5     5 former    73  607. FALSE  
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, mean_rt = mean(rt, na.rm = TRUE))
# A tibble: 267 × 5
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    639.
2     2 no        37  471.    639.
3     3 yes       62  639.    639.
4     4 former    85  708     639.
5     5 former    73  607.    639.
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, mean_rt = mean(rt, na.rm = TRUE),
                     is_slow = rt > mean_rt)
# A tibble: 267 × 6
     id smoker   age    rt mean_rt is_slow
  <dbl> <chr>  <dbl> <dbl>   <dbl> <lgl>  
1     1 former    84  702.    639. TRUE   
2     2 no        37  471.    639. FALSE  
3     3 yes       62  639.    639. FALSE  
4     4 former    85  708     639. TRUE   
5     5 former    73  607.    639. FALSE  
# ℹ 262 more rows
mutate(blomkvist_rt, # or both in one go
       is_slow = rt > mean(rt, na.rm = TRUE))

Mutating data: recode()

mutate(blomkvist_rt, smoker_recoded = recode(smoker, former = "former smoker",
                                                     yes = "smoker",
                                                     no = "non-smoker"))
# A tibble: 267 × 5
     id smoker   age    rt smoker_recoded
  <dbl> <chr>  <dbl> <dbl> <chr>         
1     1 former    84  702. former smoker 
2     2 no        37  471. non-smoker    
3     3 yes       62  639. smoker        
4     4 former    85  708  former smoker 
5     5 former    73  607. former smoker 
# ℹ 262 more rows

Mutating data

Continue with exercise 4

Grouping data with group_by()

Perform an action (function) for each level the grouping variable individually.

blomkvist_grouped <- group_by(blomkvist_rt, smoker) # Group by smoker
blomkvist_grouped 
# A tibble: 267 × 4
# Groups:   smoker [4]
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 262 more rows
# How many groups and what are they?

Summarise (grouped) data

… using descriptive tools.

summarise(blomkvist_rt, # not grouped
          mean_rt = mean(rt, na.rm = TRUE),
          N = n())
# A tibble: 1 × 2
  mean_rt     N
    <dbl> <int>
1    639.   267
summarise(blomkvist_grouped, # grouped data
          mean_rt = mean(rt, na.rm = TRUE),
          N = n())
# A tibble: 4 × 3
  smoker mean_rt     N
  <chr>    <dbl> <int>
1 former    653.    77
2 no        635.   163
3 yes       633.    26
4 <NA>      479.     1

Summarise data

Summarising data using descriptive tools.

summarise(blomkvist_rt, # not grouped
          mean_rt = mean(rt, na.rm = TRUE),
          sd_rt = sd(rt, na.rm = TRUE),
          min_rt = min(rt, na.rm = TRUE),
          max_rt = max(rt, na.rm = TRUE),
          N = n())
# A tibble: 1 × 5
  mean_rt sd_rt min_rt max_rt     N
    <dbl> <dbl>  <dbl>  <dbl> <int>
1    639.  190.   379.  1552.   267
summarise(blomkvist_grouped, # grouped data
          mean_rt = mean(rt, na.rm = TRUE),
          sd_rt = sd(rt, na.rm = TRUE),
          min_rt = min(rt, na.rm = TRUE),
          max_rt = max(rt, na.rm = TRUE),
          N = n())
# A tibble: 4 × 6
  smoker mean_rt sd_rt min_rt max_rt     N
  <chr>    <dbl> <dbl>  <dbl>  <dbl> <int>
1 former    653.  152.   411   1111.    77
2 no        635.  203.   379.  1552.   163
3 yes       633.  217.   427.  1527     26
4 <NA>      479.   NA    479.   479.     1

Mutate grouped data

mutate(blomkvist_rt, # not grouped
       mean_rt = mean(rt, na.rm = TRUE))
# A tibble: 267 × 5
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    639.
2     2 no        37  471.    639.
3     3 yes       62  639.    639.
4     4 former    85  708     639.
5     5 former    73  607.    639.
# ℹ 262 more rows
mutate(blomkvist_grouped, # grouped data
       mean_rt = mean(rt, na.rm = TRUE))
# A tibble: 267 × 5
# Groups:   smoker [4]
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    653.
2     2 no        37  471.    635.
3     3 yes       62  639.    633.
4     4 former    85  708     653.
5     5 former    73  607.    653.
# ℹ 262 more rows
# What's the grouping variable?
# What's the difference?

Never forget to ungroup your data

otherwise you will keep performing by-group operations.

blomkvist_grouped # :(
# A tibble: 267 × 4
# Groups:   smoker [4]
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 262 more rows
ungroup(blomkvist_grouped) # :)
# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 262 more rows
# Spot the difference!

Grouping data

  • Continue with exercise 5

Pivoting data: tidy data

Called, a long data format.

Pivoting data: wide format

blomkvist_wide <- select(blomkvist, id, starts_with("rt_"))
blomkvist_wide
# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

RT is distributed across 4 columns: useful for certain analyses but isn’t tidy.

Pivoting data to long format

has never been easier

blomkvist_wide
# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows
pivot_longer(blomkvist_wide, cols = starts_with("rt_"))
# A tibble: 1,068 × 3
     id name       value
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows

Pivoting data to long format

pivot_longer(blomkvist_wide, cols = -id) # shorthand
# A tibble: 1,068 × 3
     id name       value
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows
pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = "variable", 
             values_to = "rt")
# A tibble: 1,068 × 3
     id variable      rt
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows

Pivoting data to long format

pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = "variable",
             values_to = "rt")
# A tibble: 1,068 × 3
     id variable      rt
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows
pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = c(".value", 
                          "response_by", 
                          "dominant"), 
             names_sep = "_")
# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows

Pivoting data back to wide format

bk_long <- pivot_longer(blomkvist_wide, 
                        cols = -id, 
                        names_to = c(".value", "response_by", "dominant"), 
                        names_sep = "_")
bk_long
# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows

Pivoting data back to wide format

bk_long
# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows
pivot_wider(bk_long, 
            names_from = response_by, 
            values_from = rt)
# A tibble: 534 × 4
     id dominant  hand  foot
  <dbl> <chr>    <dbl> <dbl>
1     1 d         702. 1009 
2     1 nd        780.  963.
3     2 d         471.  738.
4     2 nd        497   692.
5     3 d         639.  878 
# ℹ 529 more rows

Pivoting data back to wide format

bk_long
# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows
pivot_wider(bk_long, 
            names_from = c(response_by, dominant), 
            values_from = rt)
# A tibble: 267 × 5
     id hand_d hand_nd foot_d foot_nd
  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1     1   702.    780.  1009     963.
2     2   471.    497    738.    692.
3     3   639.    638    878     786 
4     4   708     639.   902.   1374.
5     5   607.    652    923     805 
# ℹ 262 more rows

Pivoting data back to wide format

bk_long
# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows
pivot_wider(bk_long, 
            names_from = c(response_by, dominant), 
            values_from = rt, 
            names_prefix = "rt_")
# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Pivoting data

Continue with exercise 6

Congratulations!

Congratulations!

… you have worked through a large range of data visualisation and wrangling tools that will give you more flexibility to get the most out of your data.

Advanced data wrangling (homework)

The pipe: %>%

  • %>% moves or “pipes” the result forward into the next function
  • f(x) is the same as x %>% f()
  • Short-cut: Ctrl + Shift + M
select(data, myvar1, myvar2)
# or
data %>% select(myvar1, myvar2)

*assumes first argument is data

The pipe: %>%

# Instead of 
data_1 <- first_step(data)
data_2 <- second_step(data_2)
data_3 <- third_step(data_3)
data_4 <- fourth_step(data_4)

# or
fourth_step(
    third_step(
        second_step(
            first_step(data)
            )
          )
        )
# Just do
data %>% 
  first_step() %>%
  second_step() %>%
  third_step() %>%
  fourth_step()

The pipe: %>%

# A tibble: 2 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1   181     1552.      1072.     1068       1196.
2   305     1527       1416.     1713.      2471.
bk <- read_csv("data/blomkvist.csv")
bk_rts <- select(bk, id, starts_with("rt_"))
bk_rts_flt <- filter(bk_rts, rt_hand_d > 1500)
bk_rts_flt
# A tibble: 2 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1   181     1552.      1072.     1068       1196.
2   305     1527       1416.     1713.      2471.
read_csv("data/blomkvist.csv") %>% 
  select(id, starts_with("rt_")) %>% 
  filter(rt_hand_d > 1500)

The pipe: %>%

Continue with exercise 7

Combine data: overview

# Combine two datasets side-by-side
bind_cols(data_1, data_2)
# Stacking two data sets
bind_rows(data_1, data_2)
# Keep all rows of data_1 and add data_2
left_join(data_1, data_2)
# Keep all rows of data_2 and add data_1
right_join(data_1, data_2)
# Include all rows of both data sets
full_join(data_1, data_2)
# Include data that is present in both data sets
inner_join(data_1, data_2)

Combine data: bind_cols()

bk_id_age <- select(blomkvist, id, age)
bk_id_age
# A tibble: 267 × 2
     id   age
  <dbl> <dbl>
1     1    84
2     2    37
3     3    62
4     4    85
5     5    73
# ℹ 262 more rows
bk_med_smoke <- select(blomkvist, medicine, smoker)
bk_med_smoke
# A tibble: 267 × 2
  medicine smoker
     <dbl> <chr> 
1        8 former
2        1 no    
3        0 yes   
4        4 former
5        5 former
# ℹ 262 more rows

Combine data: bind_cols()

# Combine two datasets side-by-side
bind_cols(bk_id_age, bk_med_smoke)
# A tibble: 267 × 4
     id   age medicine smoker
  <dbl> <dbl>    <dbl> <chr> 
1     1    84        8 former
2     2    37        1 no    
3     3    62        0 yes   
4     4    85        4 former
5     5    73        5 former
# ℹ 262 more rows

Combine data: bind_rows()

bk_former_smokers <- filter(blomkvist, smoker == "former")
bk_smokers <- filter(blomkvist, smoker == "yes")

Combine data: bind_rows()

# Stacking two data sets
bk_smoking <- bind_rows(bk_former_smokers, bk_smokers)
slice_head(bk_smoking, n = 1)
# A tibble: 1 × 10
     id sex     age medicine meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d
  <dbl> <chr> <dbl>    <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>
1     1 male     84        8 a lot    former      702.       780.      1009
# ℹ 1 more variable: rt_foot_nd <dbl>
slice_tail(bk_smoking, n = 1)
# A tibble: 1 × 10
     id sex     age medicine meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d
  <dbl> <chr> <dbl>    <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>
1   342 male     60        3 few      yes          708        686       909
# ℹ 1 more variable: rt_foot_nd <dbl>

This also works when columns are not in the same order as long as the names and types match.

Create two example data sets

bk_smoker <- select(blomkvist, smoker, age)
bk_smoker_o40 <- filter(bk_smoker, age > 40)
bk_sex <- select(blomkvist, sex, age)
bk_sex_u50 <- filter(bk_sex, age < 50)

Create two example data sets

bk_smoker_o40
# A tibble: 195 × 2
  smoker   age
  <chr>  <dbl>
1 former    84
2 yes       62
3 former    85
4 former    73
5 no        65
# ℹ 190 more rows
bk_sex_u50
# A tibble: 99 × 2
  sex      age
  <chr>  <dbl>
1 female    37
2 female    30
3 female    49
4 female    25
5 male      27
# ℹ 94 more rows

Combine data: left_join()

# Keep all rows of bk_sex_u50
bk_joined <- left_join(bk_sex_u50, 
                       bk_smoker_o40, by = "age")
bk_joined
# A tibble: 173 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    37 <NA>  
2 female    30 <NA>  
3 female    49 former
4 female    49 former
5 female    49 no    
# ℹ 168 more rows
range(bk_joined$age)
[1] 20 49

Combine data: right_join()

# Keep all rows of bk_smoker_o40
bk_joined <- right_join(bk_sex_u50, 
                        bk_smoker_o40, by = "age")
bk_joined
# A tibble: 269 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    49 former
2 female    49 former
3 female    49 no    
4 female    49 no    
5 female    49 former
# ℹ 264 more rows
range(bk_joined$age)
[1] 41 96

Combine data: full_join()

# Include all rows of both data sets
bk_joined <- full_join(bk_sex_u50, 
                       bk_smoker_o40, by = "age")
bk_joined
# A tibble: 341 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    37 <NA>  
2 female    30 <NA>  
3 female    49 former
4 female    49 former
5 female    49 no    
# ℹ 336 more rows
range(bk_joined$age)
[1] 20 96

Combine data: inner_join()

# Include only info that is present in both data sets
bk_joined <- inner_join(bk_sex_u50, 
                        bk_smoker_o40, by = "age")
bk_joined
# A tibble: 101 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    49 former
2 female    49 former
3 female    49 no    
4 female    49 no    
5 female    49 former
# ℹ 96 more rows
range(bk_joined$age)
[1] 41 49

Continue with exercise 8

Mutate data with across()

Create a new column.

mutate(blomkvist_rt, log_rt = log(rt))
# A tibble: 267 × 5
     id smoker   age    rt log_rt
  <dbl> <chr>  <dbl> <dbl>  <dbl>
1     1 former    84  702.   6.55
2     2 no        37  471.   6.15
3     3 yes       62  639.   6.46
4     4 former    85  708    6.56
5     5 former    73  607.   6.41
# ℹ 262 more rows

Replace with transformed variable.

mutate(blomkvist_rt, across(rt, log))
# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  6.55
2     2 no        37  6.15
3     3 yes       62  6.46
4     4 former    85  6.56
5     5 former    73  6.41
# ℹ 262 more rows

First argument of across will be used as first argument of function (log) which is supplied as second argument to across.

Mutate data with across()

# Instead of
mutate(blomkvist, rt_hand_d = log(rt_hand_d), 
                  rt_hand_nd = log(rt_hand_nd))
# Do
mutate(blomkvist, across(c(rt_hand_d, rt_hand_nd), log))
# A tibble: 264 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1     1 male      84 a lot    former      6.55       6.66     1009        963.
2     2 female    37 little   no          6.15       6.21      738.       692.
3     3 female    62 none     yes         6.46       6.46      878        786 
4     4 female    85 few      former      6.56       6.46      902.      1374.
5     5 male      73 a lot    former      6.41       6.48      923        805 
# ℹ 259 more rows

Mutate data with across()

mutate(blomkvist, across(where(is.numeric), log))
# A tibble: 264 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1 0     male    4.43 a lot    former      6.55       6.66      6.92       6.87
2 0.693 female  3.61 little   no          6.15       6.21      6.60       6.54
3 1.10  female  4.13 none     yes         6.46       6.46      6.78       6.67
4 1.39  female  4.44 few      former      6.56       6.46      6.80       7.23
5 1.61  male    4.29 a lot    former      6.41       6.48      6.83       6.69
# ℹ 259 more rows
# Not ideal here: see `id` column

Mutate data with across()

mutate(blomkvist, across(starts_with("rt_"), log))
Rows: 264
Columns: 9
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, …
$ sex        <chr> "male", "female", "female", "female", "male", "male", "fema…
$ age        <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27, 60,…
$ meds_cat   <chr> "a lot", "little", "none", "few", "a lot", "none", "none", …
$ smoker     <chr> "former", "no", "yes", "former", "former", "no", "no", "for…
$ rt_hand_d  <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, 6.8,…
$ rt_hand_nd <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, 6.6,…
$ rt_foot_d  <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, 6.7,…
$ rt_foot_nd <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, 6.8,…

Mutate data with across()

mutate(blomkvist, across(starts_with("rt_"), log, .names = "log_{.col}"))
Rows: 264
Columns: 13
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, …
$ sex            <chr> "male", "female", "female", "female", "male", "male", "…
$ age            <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27,…
$ meds_cat       <chr> "a lot", "little", "none", "few", "a lot", "none", "non…
$ smoker         <chr> "former", "no", "yes", "former", "former", "no", "no", …
$ rt_hand_d      <dbl> 702, 471, 639, 708, 607, 542, 571, 509, 737, 550, 548, …
$ rt_hand_nd     <dbl> 780, 497, 638, 639, 652, 499, 527, 547, 865, 569, 507, …
$ rt_foot_d      <dbl> 1009, 738, 878, 902, 923, 687, 778, 743, 750, 629, 653,…
$ rt_foot_nd     <dbl> 963, 692, 786, 1374, 805, 600, 750, 797, 797, 800, 718,…
$ log_rt_hand_d  <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, …
$ log_rt_hand_nd <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, …
$ log_rt_foot_d  <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, …
$ log_rt_foot_nd <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, …

Mutate data with across()

mutate(blomkvist, across(starts_with("rt_"), list(lg = log, sqr = ~.^2), .names = "{.fn}_{.col}"))
Rows: 264
Columns: 17
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, …
$ sex            <chr> "male", "female", "female", "female", "male", "male", "…
$ age            <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27,…
$ meds_cat       <chr> "a lot", "little", "none", "few", "a lot", "none", "non…
$ smoker         <chr> "former", "no", "yes", "former", "former", "no", "no", …
$ rt_hand_d      <dbl> 702, 471, 639, 708, 607, 542, 571, 509, 737, 550, 548, …
$ rt_hand_nd     <dbl> 780, 497, 638, 639, 652, 499, 527, 547, 865, 569, 507, …
$ rt_foot_d      <dbl> 1009, 738, 878, 902, 923, 687, 778, 743, 750, 629, 653,…
$ rt_foot_nd     <dbl> 963, 692, 786, 1374, 805, 600, 750, 797, 797, 800, 718,…
$ lg_rt_hand_d   <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, …
$ sqr_rt_hand_d  <dbl> 492336, 221527, 407895, 501264, 368854, 293403, 325660,…
$ lg_rt_hand_nd  <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, …
$ sqr_rt_hand_nd <dbl> 608920, 247009, 407044, 407895, 425104, 248668, 278080,…
$ lg_rt_foot_d   <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, …
$ sqr_rt_foot_d  <dbl> 1018081, 544152, 770884, 814205, 851929, 471511, 605803…
$ lg_rt_foot_nd  <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, …
$ sqr_rt_foot_nd <dbl> 926727, 479325, 617796, 1886960, 648025, 359600, 562000…

What’s “~.”?

mutate(blomkvist_rt, across(rt, round, 0))
# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84   702
2     2 no        37   471
3     3 yes       62   639
4     4 former    85   708
5     5 former    73   607
# ℹ 262 more rows
mutate(blomkvist_rt, across(rt, ~round(., 0)))
# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84   702
2     2 no        37   471
3     3 yes       62   639
4     4 former    85   708
5     5 former    73   607
# ℹ 262 more rows

What’s “~.”?

  • ~”: we want to make the position of the argument in function explicit.
  • .”: the location of the argument.
  • Always possible but necessary for operator functions (>, ==, ^, +) and when argument is not in first position of supplied function.
mutate(blomkvist_rt, 
       across(rt, ~round(. , 0)), # optional
       across(age, ~.^2),   # operator
       across(rt, ~paste("RT is", .))) # position
# A tibble: 267 × 4
     id smoker   age rt       
  <dbl> <chr>  <dbl> <chr>    
1     1 former  7056 RT is 702
2     2 no      1369 RT is 471
3     3 yes     3844 RT is 639
4     4 former  7225 RT is 708
5     5 former  5329 RT is 607
# ℹ 262 more rows

Filter data with across()

filter(blomkvist, rt_hand_d > 1000, rt_hand_nd > 1000, rt_foot_d > 1000, rt_foot_nd > 100)
# A tibble: 9 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1    70 female    85 a lot    no         1102.      1352.     1458       1335.
2    96 male      77 a lot    no         1241.      1651.     1217.      1266.
3   127 female    96 a lot    no         1179.      1090      1546       4820 
4   152 female    68 none     no         1030.      1095      1137       1075.
5   171 female    92 a lot    no         1025.      1026.     1379.      1310.
# ℹ 4 more rows

Filter data with across()

filter(blomkvist, across(starts_with("rt_"), ~ . > 1000 ))
# A tibble: 9 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1    70 female    85 a lot    no         1102.      1352.     1458       1335.
2    96 male      77 a lot    no         1241.      1651.     1217.      1266.
3   127 female    96 a lot    no         1179.      1090      1546       4820 
4   152 female    68 none     no         1030.      1095      1137       1075.
5   171 female    92 a lot    no         1025.      1026.     1379.      1310.
# ℹ 4 more rows

Summarise variables with across()

summarise(blomkvist, rt_hand_d_mean = mean(rt_hand_d),
                     rt_hand_nd_mean = mean(rt_hand_nd))
# A tibble: 1 × 2
  rt_hand_d_mean rt_hand_nd_mean
           <dbl>           <dbl>
1           639.            637.

Summarise variables with across()

summarise(blomkvist, rt_hand_d_mean = mean(rt_hand_d),
                     rt_hand_nd_mean = mean(rt_hand_nd))
# A tibble: 1 × 2
  rt_hand_d_mean rt_hand_nd_mean
           <dbl>           <dbl>
1           639.            637.
summarise(blomkvist, across(c(rt_hand_d, rt_hand_nd), mean))
# A tibble: 1 × 2
  rt_hand_d rt_hand_nd
      <dbl>      <dbl>
1      639.       637.

Summarise variables with across()

summarise(blomkvist, across(starts_with("rt_"), mean))
# A tibble: 1 × 4
  rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
      <dbl>      <dbl>     <dbl>      <dbl>
1      639.       637.      847.       867.

Summarise variables with across()

summarise(blomkvist, across(starts_with("rt_"), mean))
# A tibble: 1 × 4
  rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
      <dbl>      <dbl>     <dbl>      <dbl>
1      639.       637.      847.       867.
summarise(blomkvist, across(starts_with("rt_"), list(mean = mean, sd = sd)))
# A tibble: 1 × 8
  rt_hand_d_mean rt_hand_d_sd rt_hand_nd_mean rt_hand_nd_sd rt_foot_d_mean
           <dbl>        <dbl>           <dbl>         <dbl>          <dbl>
1           639.         191.            637.          179.           847.
# ℹ 3 more variables: rt_foot_d_sd <dbl>, rt_foot_nd_mean <dbl>,
#   rt_foot_nd_sd <dbl>

Mutate data with c_across()

Apply a function across different columns.

Example: get the sum for of all rts for each participant (row).

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)
# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with c_across()

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)
# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with c_across()

# Using sum function doesn't give the desired result
mutate(blomkvist, sum_rt = sum(c(rt_hand_d, rt_hand_nd, rt_foot_d, rt_foot_nd), na.rm = TRUE))
# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd  sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1 male      84      702.       780.     1009        963. 789356.
2     2 female    37      471.       497       738.       692. 789356.
3     3 female    62      639.       638       878        786  789356.
4     4 female    85      708        639.      902.      1374. 789356.
5     5 male      73      607.       652       923        805  789356.
# ℹ 259 more rows

Mutate data with c_across()

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)
# Still a lot of typing :(
blomkvist_rowwise <- rowwise(blomkvist) # each row / ppt is a group
mutate(blomkvist_rowwise, sum_rt = sum(c_across(c(rt_hand_d, rt_hand_nd, rt_foot_d, rt_foot_nd))))
# A tibble: 264 × 8
# Rowwise: 
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with c_across()

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)
# That's better :)
blomkvist_rowwise <- rowwise(blomkvist) # each row / ppt is a group
mutate(blomkvist_rowwise, sum_rt = sum(c_across(starts_with("rt_"))))
# A tibble: 264 × 8
# Rowwise: 
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with c_across()

# Without rowwise grouping returns the overall sum:
mutate(blomkvist, sum_rt = sum(c_across(starts_with("rt_"))))
# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd  sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1 male      84      702.       780.     1009        963. 789356.
2     2 female    37      471.       497       738.       692. 789356.
3     3 female    62      639.       638       878        786  789356.
4     4 female    85      708        639.      902.      1374. 789356.
5     5 male      73      607.       652       923        805  789356.
# ℹ 259 more rows
# which is not what we want.

Mutating data: case_when()

case_when(condition ~ do) # similar to ifelse()
mutate(blomkvist_rt, age_cat = case_when(age > 70 ~ "high",
                                         age > 40 ~ "middle",
                                         is.na(smoker) ~ "dunno",
                                         TRUE ~ "on the young side"))
# A tibble: 267 × 5
     id smoker   age    rt age_cat          
  <dbl> <chr>  <dbl> <dbl> <chr>            
1     1 former    84  702. high             
2     2 no        37  471. on the young side
3     3 yes       62  639. middle           
4     4 former    85  708  high             
5     5 former    73  607. high             
# ℹ 262 more rows

Using across() to mutate and summarise variables

Open exercise 9 script.

References

Andrews, Mark. 2021. Doing data science in R: An Introduction for Social Scientists. London, UK: SAGE Publications Ltd.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.