Data wrangling in the tidyverse

Goals of data wrangling

Data come in various formats (long, wide) and data type (xlsx, ods, json, csv, sav)
No format is suitable for every goal
Fluency in data wrangling gives you a lot of power.
Make data format suitable to use: e.g. for statistical models (correlations, linear regression), functions, data viz, summary table
Calculate new variables, filter or combine data
Reveal information
Summarise information
(also creating counterbalanced, randomised stimulus lists)

Collection of R packages for data science that share:

common data philosophies
grammar
data structures
best practice
designed to work together

# Installs 19 packages
install.packages("tidyverse")

# Loads 6 packages
library(tidyverse)

Tidy data

Each variable must have its own column.
Each observation must have its own row.
Each value have its own cell.

Why?

Placing variables in columns takes advantage of R’s vectorised nature (faster processing, more compact code).
Consistent data structure allows easier learning because tidyverse functions have similar underpinning principles.

`tidyverse`: verbs

Functions that do specific things to our data.
Must know: read_csv, write_csv, glimpse, select, filter, mutate, group_by / ungroup, summarise, pivot_wider / _longer
Also important: count, pull, slice, across, recode, unique, n, where, everything, ~ and ., map, starts_with, ends_with, contains, separate, unite, transmute, _join, bind_rows / _cols
There are more but these are the most important ones.

tbls (tibbles)

tidyverse is operating with tibbles
Type of data structure
Easier to read in console

# Imports data as data frame
data_as_frame <- read.csv("path_to_data/data.csv")

# Imports data as tibble
data_as_tibble <- read_csv("path_to_data/data.csv")

.csv: comma separated file
readr package: e.g. read_csv, read_delim, read_tsv

For other data formats:

haven package: e.g. read_dta, read_sav, read_sas
readxl package: e.g. read_excel, read_xls, read_xlsx

# Summarise data structure in base R
str(data_as_frame)

# Summarise data structure in tidyverse
glimpse(data_as_tibble)

Open exercise script 1

`tidyverse` functions

Functions follow the principle

function_name(data_name, argument)

where argument specifies which variable / condition etc. the function has to operate on.

# Picking out variables
select(data, variable) 
# Subsetting data
filter(data, variable > 100) 
# Change / add variables
mutate(data, sqr = variable^2)
# Aggregate data
summarise(data, avg = mean(variable))

Selecting variables

Extracts variables you’re interested in.

blomkvist <- read_csv("data/blomkvist.csv")

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ sex        <chr> "male", "female", "female", "…
$ age        <dbl> 84, 37, 62, 85, 73, 65, 30, 4…
$ medicine   <dbl> 8, 1, 0, 4, 5, 0, 0, 0, 11, 0…
$ meds_cat   <chr> "a lot", "little", "none", "f…
$ smoker     <chr> "former", "no", "yes", "forme…
$ rt_hand_d  <dbl> 702, 471, 639, 708, 607, 542,…
$ rt_hand_nd <dbl> 780, 497, 638, 639, 652, 499,…
$ rt_foot_d  <dbl> 1009, 738, 878, 902, 923, 687…
$ rt_foot_nd <dbl> 963, 692, 786, 1374, 805, 600…

Selecting variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, id, sex, age)

# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: select range

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, id:age)

# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: index

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, 1, 2, 3)

# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: index range

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, 1:3)

# A tibble: 267 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# ℹ 262 more rows

Selecting variables: rename

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, id, sex, rt = rt_hand_d)

# A tibble: 267 × 3
     id sex       rt
  <dbl> <chr>  <dbl>
1     1 male    702.
2     2 female  471.
3     3 female  639.
4     4 female  708 
5     5 male    607.
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, id, starts_with("rt_"))

# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, ends_with("d"))

???

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, id, ends_with("d"))

# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, id, contains("hand"))

???

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, id, contains("hand"))

# A tibble: 267 × 3
     id rt_hand_d rt_hand_nd
  <dbl>     <dbl>      <dbl>
1     1      702.       780.
2     2      471.       497 
3     3      639.       638 
4     4      708        639.
5     5      607.       652 
# ℹ 262 more rows

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, -contains("hand"))

# A tibble: 267 × 8
     id sex      age medicine meds_cat smoker rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl>    <dbl> <chr>    <chr>      <dbl>      <dbl>
1     1 male      84        8 a lot    former     1009        963.
2     2 female    37        1 little   no          738.       692.
3     3 female    62        0 none     yes         878        786 
4     4 female    85        4 few      former      902.      1374.
5     5 male      73        5 a lot    former      923        805 
# ℹ 262 more rows

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, -ends_with("_d"))

???

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, -ends_with("_d"))

# A tibble: 267 × 8
     id sex      age medicine meds_cat smoker rt_hand_nd rt_foot_nd
  <dbl> <chr>  <dbl>    <dbl> <chr>    <chr>       <dbl>      <dbl>
1     1 male      84        8 a lot    former       780.       963.
2     2 female    37        1 little   no           497        692.
3     3 female    62        0 none     yes          638        786 
4     4 female    85        4 few      former       639.      1374.
5     5 male      73        5 a lot    former       652        805 
# ℹ 262 more rows

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, -sex:-smoker)

???

(Un-)selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, -sex:-smoker)

# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, where(is.character))

# A tibble: 267 × 3
  sex    meds_cat smoker
  <chr>  <chr>    <chr> 
1 male   a lot    former
2 female little   no    
3 female none     yes   
4 female few      former
5 male   a lot    former
# ℹ 262 more rows

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, where(is.numeric))

???

Selecting multiple variables

Extracts variables you’re interested in.

glimpse(blomkvist)

Rows: 267
Columns: 10
$ id         <dbl> 1, 2, 3, …
$ sex        <chr> "male", "…
$ age        <dbl> 84, 37, 6…
$ medicine   <dbl> 8, 1, 0, …
$ meds_cat   <chr> "a lot", …
$ smoker     <chr> "former",…
$ rt_hand_d  <dbl> 702, 471,…
$ rt_hand_nd <dbl> 780, 497,…
$ rt_foot_d  <dbl> 1009, 738…
$ rt_foot_nd <dbl> 963, 692,…

select(blomkvist, where(is.numeric))

# A tibble: 267 × 7
     id   age medicine rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <dbl>    <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1    84        8      702.       780.     1009        963.
2     2    37        1      471.       497       738.       692.
3     3    62        0      639.       638       878        786 
4     4    85        4      708        639.      902.      1374.
5     5    73        5      607.       652       923        805 
# ℹ 262 more rows

Continue with exercise 2

Filtering data

Select variables of interest

blomkvist_rt <- select(blomkvist, id, smoker, age, rt = rt_hand_d)

Subsetting data by selecting rows that meet one condition or more.

filter(data, condition)

Continuous variables

filter(blomkvist_rt, rt >= 708)

# A tibble: 64 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     4 former    85  708 
2     9 former    83  737.
3    12 no        88  889 
4    13 yes       62  884.
5    14 former    88  832.
# ℹ 59 more rows

filter(blomkvist_rt, rt > 708)

# A tibble: 62 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     9 former    83  737.
2    12 no        88  889 
3    13 yes       62  884.
4    14 former    88  832.
5    17 former    80  930 
# ℹ 57 more rows

Continuous variables

filter(blomkvist_rt, rt > 708, rt < 900)

# A tibble: 38 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     9 former    83  737.
2    12 no        88  889 
3    13 yes       62  884.
4    14 former    88  832.
5    39 no        88  727 
# ℹ 33 more rows

Continuous variables

filter(blomkvist_rt, rt > 708 | rt < 900)

# A tibble: 266 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 261 more rows

# which is a logical tautology really

filter(blomkvist_rt, rt < 708 | rt > 900)

# A tibble: 226 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     5 former    73  607.
5     6 no        65  542.
# ℹ 221 more rows

Continuous variables

filter(blomkvist_rt, rt > mean(rt, na.rm = TRUE))

# A tibble: 100 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     4 former    85  708 
3     9 former    83  737.
4    12 no        88  889 
5    13 yes       62  884.
# ℹ 95 more rows

mean(blomkvist_rt$rt, na.rm = TRUE)

[1] 639

NB. What’s `na.rm = TRUE`?

# Data with missing values
y <- c(100, 1150, 200, 43, NA, 15)

mean(y)

[1] NA

mean(y, na.rm = TRUE)

[1] 302

sd(y, na.rm = TRUE)

[1] 480

Categorical variables

filter(blomkvist_rt, smoker == "yes")

# A tibble: 26 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     3 yes       62  639.
2    10 yes       58  550.
3    13 yes       62  884.
4    24 yes       57  612.
5    28 yes       59  586.
# ℹ 21 more rows

unique(blomkvist_rt$smoker)

[1] "former" "no"     "yes"    NA

Categorical variables

filter(blomkvist_rt, smoker != "yes")

# A tibble: 240 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     4 former    85  708 
4     5 former    73  607.
5     6 no        65  542.
# ℹ 235 more rows

unique(blomkvist_rt$smoker)

[1] "former" "no"     "yes"    NA

Categorical variables

filter(blomkvist_rt, smoker %in% c("yes", "former"))

# A tibble: 103 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     3 yes       62  639.
3     4 former    85  708 
4     5 former    73  607.
5     8 former    49  509.
# ℹ 98 more rows

filter(blomkvist_rt, !(smoker %in% c("yes", "former")))

# A tibble: 164 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     2 no        37  471.
2     6 no        65  542.
3     7 no        30  571.
4    11 no        25  548 
5    12 no        88  889 
# ℹ 159 more rows

Missing data

unique(blomkvist_rt$smoker)

[1] "former" "no"     "yes"    NA

filter(blomkvist_rt, smoker == "NA") # ooops!!!

# A tibble: 0 × 4
# ℹ 4 variables: id <dbl>, smoker <chr>, age <dbl>, rt <dbl>

filter(blomkvist_rt, smoker == NA) # double ooops!!!

# A tibble: 0 × 4
# ℹ 4 variables: id <dbl>, smoker <chr>, age <dbl>, rt <dbl>

Missing data

filter(blomkvist_rt, is.na(smoker))

# A tibble: 1 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1   149 <NA>      34  479.

filter(blomkvist_rt, !is.na(smoker))

# A tibble: 266 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 261 more rows

Continue with exercise 3

Mutating data

Adding new variables to the data or changing existing ones.

mutate(blomkvist_rt, rt_2 = rt * rt)

# A tibble: 267 × 5
     id smoker   age    rt    rt_2
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702. 492336.
2     2 no        37  471. 221527.
3     3 yes       62  639. 407895.
4     4 former    85  708  501264 
5     5 former    73  607. 368854.
# ℹ 262 more rows

mutate(blomkvist_rt, rt_2 = rt^2)

# A tibble: 267 × 5
     id smoker   age    rt    rt_2
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702. 492336.
2     2 no        37  471. 221527.
3     3 yes       62  639. 407895.
4     4 former    85  708  501264 
5     5 former    73  607. 368854.
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, log_rt = log(rt))

# A tibble: 267 × 5
     id smoker   age    rt log_rt
  <dbl> <chr>  <dbl> <dbl>  <dbl>
1     1 former    84  702.   6.55
2     2 no        37  471.   6.15
3     3 yes       62  639.   6.46
4     4 former    85  708    6.56
5     5 former    73  607.   6.41
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, is_slow = rt > 700)

# A tibble: 267 × 5
     id smoker   age    rt is_slow
  <dbl> <chr>  <dbl> <dbl> <lgl>  
1     1 former    84  702. TRUE   
2     2 no        37  471. FALSE  
3     3 yes       62  639. FALSE  
4     4 former    85  708  TRUE   
5     5 former    73  607. FALSE  
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, mean_rt = mean(rt, na.rm = TRUE))

# A tibble: 267 × 5
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    639.
2     2 no        37  471.    639.
3     3 yes       62  639.    639.
4     4 former    85  708     639.
5     5 former    73  607.    639.
# ℹ 262 more rows

Mutating data

mutate(blomkvist_rt, mean_rt = mean(rt, na.rm = TRUE),
                     is_slow = rt > mean_rt)

# A tibble: 267 × 6
     id smoker   age    rt mean_rt is_slow
  <dbl> <chr>  <dbl> <dbl>   <dbl> <lgl>  
1     1 former    84  702.    639. TRUE   
2     2 no        37  471.    639. FALSE  
3     3 yes       62  639.    639. FALSE  
4     4 former    85  708     639. TRUE   
5     5 former    73  607.    639. FALSE  
# ℹ 262 more rows

mutate(blomkvist_rt, # or both in one go
       is_slow = rt > mean(rt, na.rm = TRUE))

Mutating data: `recode()`

mutate(blomkvist_rt, smoker_recoded = recode(smoker, former = "former smoker",
                                                     yes = "smoker",
                                                     no = "non-smoker"))

# A tibble: 267 × 5
     id smoker   age    rt smoker_recoded
  <dbl> <chr>  <dbl> <dbl> <chr>         
1     1 former    84  702. former smoker 
2     2 no        37  471. non-smoker    
3     3 yes       62  639. smoker        
4     4 former    85  708  former smoker 
5     5 former    73  607. former smoker 
# ℹ 262 more rows

Mutating data

Continue with exercise 4

Grouping data with `group_by()`

Perform an action (function) for each level the grouping variable individually.

blomkvist_grouped <- group_by(blomkvist_rt, smoker) # Group by smoker
blomkvist_grouped

# A tibble: 267 × 4
# Groups:   smoker [4]
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 262 more rows

# How many groups and what are they?

Summarise (grouped) data

… using descriptive tools.

summarise(blomkvist_rt, # not grouped
          mean_rt = mean(rt, na.rm = TRUE),
          N = n())

# A tibble: 1 × 2
  mean_rt     N
    <dbl> <int>
1    639.   267

summarise(blomkvist_grouped, # grouped data
          mean_rt = mean(rt, na.rm = TRUE),
          N = n())

# A tibble: 4 × 3
  smoker mean_rt     N
  <chr>    <dbl> <int>
1 former    653.    77
2 no        635.   163
3 yes       633.    26
4 <NA>      479.     1

Summarise data

Summarising data using descriptive tools.

summarise(blomkvist_rt, # not grouped
          mean_rt = mean(rt, na.rm = TRUE),
          sd_rt = sd(rt, na.rm = TRUE),
          min_rt = min(rt, na.rm = TRUE),
          max_rt = max(rt, na.rm = TRUE),
          N = n())

# A tibble: 1 × 5
  mean_rt sd_rt min_rt max_rt     N
    <dbl> <dbl>  <dbl>  <dbl> <int>
1    639.  190.   379.  1552.   267

summarise(blomkvist_grouped, # grouped data
          mean_rt = mean(rt, na.rm = TRUE),
          sd_rt = sd(rt, na.rm = TRUE),
          min_rt = min(rt, na.rm = TRUE),
          max_rt = max(rt, na.rm = TRUE),
          N = n())

# A tibble: 4 × 6
  smoker mean_rt sd_rt min_rt max_rt     N
  <chr>    <dbl> <dbl>  <dbl>  <dbl> <int>
1 former    653.  152.   411   1111.    77
2 no        635.  203.   379.  1552.   163
3 yes       633.  217.   427.  1527     26
4 <NA>      479.   NA    479.   479.     1

Mutate grouped data

mutate(blomkvist_rt, # not grouped
       mean_rt = mean(rt, na.rm = TRUE))

# A tibble: 267 × 5
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    639.
2     2 no        37  471.    639.
3     3 yes       62  639.    639.
4     4 former    85  708     639.
5     5 former    73  607.    639.
# ℹ 262 more rows

mutate(blomkvist_grouped, # grouped data
       mean_rt = mean(rt, na.rm = TRUE))

# A tibble: 267 × 5
# Groups:   smoker [4]
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    653.
2     2 no        37  471.    635.
3     3 yes       62  639.    633.
4     4 former    85  708     653.
5     5 former    73  607.    653.
# ℹ 262 more rows

# What's the grouping variable?
# What's the difference?

Never forget to ungroup your data

otherwise you will keep performing by-group operations.

blomkvist_grouped # :(

# A tibble: 267 × 4
# Groups:   smoker [4]
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 262 more rows

ungroup(blomkvist_grouped) # :)

# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# ℹ 262 more rows

# Spot the difference!

Grouping data

Continue with exercise 5

Pivoting data: tidy data

Called, a long data format.

Pivoting data: wide format

blomkvist_wide <- select(blomkvist, id, starts_with("rt_"))
blomkvist_wide

# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

RT is distributed across 4 columns: useful for certain analyses but isn’t tidy.

Pivoting data to long format

has never been easier

blomkvist_wide

# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

pivot_longer(blomkvist_wide, cols = starts_with("rt_"))

# A tibble: 1,068 × 3
     id name       value
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows

Pivoting data to long format

pivot_longer(blomkvist_wide, cols = -id) # shorthand

# A tibble: 1,068 × 3
     id name       value
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows

pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = "variable", 
             values_to = "rt")

# A tibble: 1,068 × 3
     id variable      rt
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows

Pivoting data to long format

pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = "variable",
             values_to = "rt")

# A tibble: 1,068 × 3
     id variable      rt
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# ℹ 1,063 more rows

pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = c(".value", 
                          "response_by", 
                          "dominant"), 
             names_sep = "_")

# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows

Pivoting data back to wide format

bk_long <- pivot_longer(blomkvist_wide, 
                        cols = -id, 
                        names_to = c(".value", "response_by", "dominant"), 
                        names_sep = "_")
bk_long

# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows

Pivoting data back to wide format

bk_long

# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows

pivot_wider(bk_long, 
            names_from = response_by, 
            values_from = rt)

# A tibble: 534 × 4
     id dominant  hand  foot
  <dbl> <chr>    <dbl> <dbl>
1     1 d         702. 1009 
2     1 nd        780.  963.
3     2 d         471.  738.
4     2 nd        497   692.
5     3 d         639.  878 
# ℹ 529 more rows

Pivoting data back to wide format

bk_long

# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows

pivot_wider(bk_long, 
            names_from = c(response_by, dominant), 
            values_from = rt)

# A tibble: 267 × 5
     id hand_d hand_nd foot_d foot_nd
  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1     1   702.    780.  1009     963.
2     2   471.    497    738.    692.
3     3   639.    638    878     786 
4     4   708     639.   902.   1374.
5     5   607.    652    923     805 
# ℹ 262 more rows

Pivoting data back to wide format

bk_long

# A tibble: 1,068 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# ℹ 1,063 more rows

pivot_wider(bk_long, 
            names_from = c(response_by, dominant), 
            values_from = rt, 
            names_prefix = "rt_")

# A tibble: 267 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# ℹ 262 more rows

Pivoting data

Continue with exercise 6

Congratulations!

Congratulations!

… you have worked through a large range of data visualisation and wrangling tools that will give you more flexibility to get the most out of your data.

Use the slides to remind yourself how things work, repeat exercises, look at my advanced materials.
Keep using tidyverse and don’t fall back into old habits (e.g. spreadsheets).
Explore new functions for data processing and visualisation.
R for Data Science by Wickham and Grolemund (2016)
Doing Data Science in R, chapter 3 by Andrews (2021)
Consider using R-Markdown; e.g. some work in progress by myself

Advanced data wrangling (homework)

The pipe: `%>%`

%>% moves or “pipes” the result forward into the next function
f(x) is the same as x %>% f()
Short-cut: Ctrl + Shift + M

select(data, myvar1, myvar2)
# or
data %>% select(myvar1, myvar2)

*assumes first argument is data

The pipe: `%>%`

# Instead of 
data_1 <- first_step(data)
data_2 <- second_step(data_2)
data_3 <- third_step(data_3)
data_4 <- fourth_step(data_4)

# or
fourth_step(
    third_step(
        second_step(
            first_step(data)
            )
          )
        )

# Just do
data %>% 
  first_step() %>%
  second_step() %>%
  third_step() %>%
  fourth_step()

The pipe: `%>%`

# A tibble: 2 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1   181     1552.      1072.     1068       1196.
2   305     1527       1416.     1713.      2471.

bk <- read_csv("data/blomkvist.csv")
bk_rts <- select(bk, id, starts_with("rt_"))
bk_rts_flt <- filter(bk_rts, rt_hand_d > 1500)
bk_rts_flt

# A tibble: 2 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1   181     1552.      1072.     1068       1196.
2   305     1527       1416.     1713.      2471.

read_csv("data/blomkvist.csv") %>% 
  select(id, starts_with("rt_")) %>% 
  filter(rt_hand_d > 1500)

The pipe: `%>%`

Continue with exercise 7

Combine data: overview

# Combine two datasets side-by-side
bind_cols(data_1, data_2)
# Stacking two data sets
bind_rows(data_1, data_2)
# Keep all rows of data_1 and add data_2
left_join(data_1, data_2)
# Keep all rows of data_2 and add data_1
right_join(data_1, data_2)
# Include all rows of both data sets
full_join(data_1, data_2)
# Include data that is present in both data sets
inner_join(data_1, data_2)

Combine data: `bind_cols()`

bk_id_age <- select(blomkvist, id, age)
bk_id_age

# A tibble: 267 × 2
     id   age
  <dbl> <dbl>
1     1    84
2     2    37
3     3    62
4     4    85
5     5    73
# ℹ 262 more rows

bk_med_smoke <- select(blomkvist, medicine, smoker)
bk_med_smoke

# A tibble: 267 × 2
  medicine smoker
     <dbl> <chr> 
1        8 former
2        1 no    
3        0 yes   
4        4 former
5        5 former
# ℹ 262 more rows

Combine data: `bind_cols()`

# Combine two datasets side-by-side
bind_cols(bk_id_age, bk_med_smoke)

# A tibble: 267 × 4
     id   age medicine smoker
  <dbl> <dbl>    <dbl> <chr> 
1     1    84        8 former
2     2    37        1 no    
3     3    62        0 yes   
4     4    85        4 former
5     5    73        5 former
# ℹ 262 more rows

Combine data: `bind_rows()`

bk_former_smokers <- filter(blomkvist, smoker == "former")
bk_smokers <- filter(blomkvist, smoker == "yes")

Combine data: `bind_rows()`

# Stacking two data sets
bk_smoking <- bind_rows(bk_former_smokers, bk_smokers)

slice_head(bk_smoking, n = 1)

# A tibble: 1 × 10
     id sex     age medicine meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d
  <dbl> <chr> <dbl>    <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>
1     1 male     84        8 a lot    former      702.       780.      1009
# ℹ 1 more variable: rt_foot_nd <dbl>

slice_tail(bk_smoking, n = 1)

# A tibble: 1 × 10
     id sex     age medicine meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d
  <dbl> <chr> <dbl>    <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>
1   342 male     60        3 few      yes          708        686       909
# ℹ 1 more variable: rt_foot_nd <dbl>

This also works when columns are not in the same order as long as the names and types match.

Create two example data sets

bk_smoker <- select(blomkvist, smoker, age)
bk_smoker_o40 <- filter(bk_smoker, age > 40)

bk_sex <- select(blomkvist, sex, age)
bk_sex_u50 <- filter(bk_sex, age < 50)

Create two example data sets

bk_smoker_o40

# A tibble: 195 × 2
  smoker   age
  <chr>  <dbl>
1 former    84
2 yes       62
3 former    85
4 former    73
5 no        65
# ℹ 190 more rows

bk_sex_u50

# A tibble: 99 × 2
  sex      age
  <chr>  <dbl>
1 female    37
2 female    30
3 female    49
4 female    25
5 male      27
# ℹ 94 more rows

Combine data: `left_join()`

# Keep all rows of bk_sex_u50
bk_joined <- left_join(bk_sex_u50, 
                       bk_smoker_o40, by = "age")
bk_joined

# A tibble: 173 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    37 <NA>  
2 female    30 <NA>  
3 female    49 former
4 female    49 former
5 female    49 no    
# ℹ 168 more rows

range(bk_joined$age)

[1] 20 49

Combine data: `right_join()`

# Keep all rows of bk_smoker_o40
bk_joined <- right_join(bk_sex_u50, 
                        bk_smoker_o40, by = "age")
bk_joined

# A tibble: 269 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    49 former
2 female    49 former
3 female    49 no    
4 female    49 no    
5 female    49 former
# ℹ 264 more rows

range(bk_joined$age)

[1] 41 96

Combine data: `full_join()`

# Include all rows of both data sets
bk_joined <- full_join(bk_sex_u50, 
                       bk_smoker_o40, by = "age")
bk_joined

# A tibble: 341 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    37 <NA>  
2 female    30 <NA>  
3 female    49 former
4 female    49 former
5 female    49 no    
# ℹ 336 more rows

range(bk_joined$age)

[1] 20 96

Combine data: `inner_join()`

# Include only info that is present in both data sets
bk_joined <- inner_join(bk_sex_u50, 
                        bk_smoker_o40, by = "age")
bk_joined

# A tibble: 101 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    49 former
2 female    49 former
3 female    49 no    
4 female    49 no    
5 female    49 former
# ℹ 96 more rows

range(bk_joined$age)

[1] 41 49

Continue with exercise 8

Mutate data with `across()`

Create a new column.

mutate(blomkvist_rt, log_rt = log(rt))

# A tibble: 267 × 5
     id smoker   age    rt log_rt
  <dbl> <chr>  <dbl> <dbl>  <dbl>
1     1 former    84  702.   6.55
2     2 no        37  471.   6.15
3     3 yes       62  639.   6.46
4     4 former    85  708    6.56
5     5 former    73  607.   6.41
# ℹ 262 more rows

Replace with transformed variable.

mutate(blomkvist_rt, across(rt, log))

# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  6.55
2     2 no        37  6.15
3     3 yes       62  6.46
4     4 former    85  6.56
5     5 former    73  6.41
# ℹ 262 more rows

First argument of across will be used as first argument of function (log) which is supplied as second argument to across.

Mutate data with `across()`

# Instead of
mutate(blomkvist, rt_hand_d = log(rt_hand_d), 
                  rt_hand_nd = log(rt_hand_nd))

# Do
mutate(blomkvist, across(c(rt_hand_d, rt_hand_nd), log))

# A tibble: 264 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1     1 male      84 a lot    former      6.55       6.66     1009        963.
2     2 female    37 little   no          6.15       6.21      738.       692.
3     3 female    62 none     yes         6.46       6.46      878        786 
4     4 female    85 few      former      6.56       6.46      902.      1374.
5     5 male      73 a lot    former      6.41       6.48      923        805 
# ℹ 259 more rows

Mutate data with `across()`

mutate(blomkvist, across(where(is.numeric), log))

# A tibble: 264 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1 0     male    4.43 a lot    former      6.55       6.66      6.92       6.87
2 0.693 female  3.61 little   no          6.15       6.21      6.60       6.54
3 1.10  female  4.13 none     yes         6.46       6.46      6.78       6.67
4 1.39  female  4.44 few      former      6.56       6.46      6.80       7.23
5 1.61  male    4.29 a lot    former      6.41       6.48      6.83       6.69
# ℹ 259 more rows

# Not ideal here: see `id` column

Mutate data with `across()`

mutate(blomkvist, across(starts_with("rt_"), log))

Rows: 264
Columns: 9
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, …
$ sex        <chr> "male", "female", "female", "female", "male", "male", "fema…
$ age        <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27, 60,…
$ meds_cat   <chr> "a lot", "little", "none", "few", "a lot", "none", "none", …
$ smoker     <chr> "former", "no", "yes", "former", "former", "no", "no", "for…
$ rt_hand_d  <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, 6.8,…
$ rt_hand_nd <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, 6.6,…
$ rt_foot_d  <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, 6.7,…
$ rt_foot_nd <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, 6.8,…

Mutate data with `across()`

mutate(blomkvist, across(starts_with("rt_"), log, .names = "log_{.col}"))

Rows: 264
Columns: 13
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, …
$ sex            <chr> "male", "female", "female", "female", "male", "male", "…
$ age            <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27,…
$ meds_cat       <chr> "a lot", "little", "none", "few", "a lot", "none", "non…
$ smoker         <chr> "former", "no", "yes", "former", "former", "no", "no", …
$ rt_hand_d      <dbl> 702, 471, 639, 708, 607, 542, 571, 509, 737, 550, 548, …
$ rt_hand_nd     <dbl> 780, 497, 638, 639, 652, 499, 527, 547, 865, 569, 507, …
$ rt_foot_d      <dbl> 1009, 738, 878, 902, 923, 687, 778, 743, 750, 629, 653,…
$ rt_foot_nd     <dbl> 963, 692, 786, 1374, 805, 600, 750, 797, 797, 800, 718,…
$ log_rt_hand_d  <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, …
$ log_rt_hand_nd <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, …
$ log_rt_foot_d  <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, …
$ log_rt_foot_nd <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, …

Mutate data with `across()`

mutate(blomkvist, across(starts_with("rt_"), list(lg = log, sqr = ~.^2), .names = "{.fn}_{.col}"))

Rows: 264
Columns: 17
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, …
$ sex            <chr> "male", "female", "female", "female", "male", "male", "…
$ age            <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27,…
$ meds_cat       <chr> "a lot", "little", "none", "few", "a lot", "none", "non…
$ smoker         <chr> "former", "no", "yes", "former", "former", "no", "no", …
$ rt_hand_d      <dbl> 702, 471, 639, 708, 607, 542, 571, 509, 737, 550, 548, …
$ rt_hand_nd     <dbl> 780, 497, 638, 639, 652, 499, 527, 547, 865, 569, 507, …
$ rt_foot_d      <dbl> 1009, 738, 878, 902, 923, 687, 778, 743, 750, 629, 653,…
$ rt_foot_nd     <dbl> 963, 692, 786, 1374, 805, 600, 750, 797, 797, 800, 718,…
$ lg_rt_hand_d   <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, …
$ sqr_rt_hand_d  <dbl> 492336, 221527, 407895, 501264, 368854, 293403, 325660,…
$ lg_rt_hand_nd  <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, …
$ sqr_rt_hand_nd <dbl> 608920, 247009, 407044, 407895, 425104, 248668, 278080,…
$ lg_rt_foot_d   <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, …
$ sqr_rt_foot_d  <dbl> 1018081, 544152, 770884, 814205, 851929, 471511, 605803…
$ lg_rt_foot_nd  <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, …
$ sqr_rt_foot_nd <dbl> 926727, 479325, 617796, 1886960, 648025, 359600, 562000…

What’s “`~.`”?

mutate(blomkvist_rt, across(rt, round, 0))

# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84   702
2     2 no        37   471
3     3 yes       62   639
4     4 former    85   708
5     5 former    73   607
# ℹ 262 more rows

mutate(blomkvist_rt, across(rt, ~round(., 0)))

# A tibble: 267 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84   702
2     2 no        37   471
3     3 yes       62   639
4     4 former    85   708
5     5 former    73   607
# ℹ 262 more rows

What’s “`~.`”?

“~”: we want to make the position of the argument in function explicit.
“.”: the location of the argument.
Always possible but necessary for operator functions (>, ==, ^, +) and when argument is not in first position of supplied function.

mutate(blomkvist_rt, 
       across(rt, ~round(. , 0)), # optional
       across(age, ~.^2),   # operator
       across(rt, ~paste("RT is", .))) # position

# A tibble: 267 × 4
     id smoker   age rt       
  <dbl> <chr>  <dbl> <chr>    
1     1 former  7056 RT is 702
2     2 no      1369 RT is 471
3     3 yes     3844 RT is 639
4     4 former  7225 RT is 708
5     5 former  5329 RT is 607
# ℹ 262 more rows

Filter data with `across()`

filter(blomkvist, rt_hand_d > 1000, rt_hand_nd > 1000, rt_foot_d > 1000, rt_foot_nd > 100)

# A tibble: 9 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1    70 female    85 a lot    no         1102.      1352.     1458       1335.
2    96 male      77 a lot    no         1241.      1651.     1217.      1266.
3   127 female    96 a lot    no         1179.      1090      1546       4820 
4   152 female    68 none     no         1030.      1095      1137       1075.
5   171 female    92 a lot    no         1025.      1026.     1379.      1310.
# ℹ 4 more rows

Filter data with `across()`

filter(blomkvist, across(starts_with("rt_"), ~ . > 1000 ))

# A tibble: 9 × 9
     id sex      age meds_cat smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl> <chr>    <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1    70 female    85 a lot    no         1102.      1352.     1458       1335.
2    96 male      77 a lot    no         1241.      1651.     1217.      1266.
3   127 female    96 a lot    no         1179.      1090      1546       4820 
4   152 female    68 none     no         1030.      1095      1137       1075.
5   171 female    92 a lot    no         1025.      1026.     1379.      1310.
# ℹ 4 more rows

Summarise variables with `across()`

summarise(blomkvist, rt_hand_d_mean = mean(rt_hand_d),
                     rt_hand_nd_mean = mean(rt_hand_nd))

# A tibble: 1 × 2
  rt_hand_d_mean rt_hand_nd_mean
           <dbl>           <dbl>
1           639.            637.

Summarise variables with `across()`

summarise(blomkvist, rt_hand_d_mean = mean(rt_hand_d),
                     rt_hand_nd_mean = mean(rt_hand_nd))

# A tibble: 1 × 2
  rt_hand_d_mean rt_hand_nd_mean
           <dbl>           <dbl>
1           639.            637.

summarise(blomkvist, across(c(rt_hand_d, rt_hand_nd), mean))

# A tibble: 1 × 2
  rt_hand_d rt_hand_nd
      <dbl>      <dbl>
1      639.       637.

Summarise variables with `across()`

summarise(blomkvist, across(starts_with("rt_"), mean))

# A tibble: 1 × 4
  rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
      <dbl>      <dbl>     <dbl>      <dbl>
1      639.       637.      847.       867.

Summarise variables with `across()`

summarise(blomkvist, across(starts_with("rt_"), mean))

# A tibble: 1 × 4
  rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
      <dbl>      <dbl>     <dbl>      <dbl>
1      639.       637.      847.       867.

summarise(blomkvist, across(starts_with("rt_"), list(mean = mean, sd = sd)))

# A tibble: 1 × 8
  rt_hand_d_mean rt_hand_d_sd rt_hand_nd_mean rt_hand_nd_sd rt_foot_d_mean
           <dbl>        <dbl>           <dbl>         <dbl>          <dbl>
1           639.         191.            637.          179.           847.
# ℹ 3 more variables: rt_foot_d_sd <dbl>, rt_foot_nd_mean <dbl>,
#   rt_foot_nd_sd <dbl>

Mutate data with `c_across()`

Apply a function across different columns.

Example: get the sum for of all rts for each participant (row).

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)

# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with `c_across()`

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)

# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with `c_across()`

# Using sum function doesn't give the desired result
mutate(blomkvist, sum_rt = sum(c(rt_hand_d, rt_hand_nd, rt_foot_d, rt_foot_nd), na.rm = TRUE))

# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd  sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1 male      84      702.       780.     1009        963. 789356.
2     2 female    37      471.       497       738.       692. 789356.
3     3 female    62      639.       638       878        786  789356.
4     4 female    85      708        639.      902.      1374. 789356.
5     5 male      73      607.       652       923        805  789356.
# ℹ 259 more rows

Mutate data with `c_across()`

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)

# Still a lot of typing :(
blomkvist_rowwise <- rowwise(blomkvist) # each row / ppt is a group
mutate(blomkvist_rowwise, sum_rt = sum(c_across(c(rt_hand_d, rt_hand_nd, rt_foot_d, rt_foot_nd))))

# A tibble: 264 × 8
# Rowwise: 
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with `c_across()`

# Long way
mutate(blomkvist, sum_rt = rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)

# That's better :)
blomkvist_rowwise <- rowwise(blomkvist) # each row / ppt is a group
mutate(blomkvist_rowwise, sum_rt = sum(c_across(starts_with("rt_"))))

# A tibble: 264 × 8
# Rowwise: 
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>  <dbl>
1     1 male      84      702.       780.     1009        963.  3454.
2     2 female    37      471.       497       738.       692.  2398.
3     3 female    62      639.       638       878        786   2941.
4     4 female    85      708        639.      902.      1374.  3623.
5     5 male      73      607.       652       923        805   2987.
# ℹ 259 more rows

Mutate data with `c_across()`

# Without rowwise grouping returns the overall sum:
mutate(blomkvist, sum_rt = sum(c_across(starts_with("rt_"))))

# A tibble: 264 × 8
     id sex      age rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd  sum_rt
  <dbl> <chr>  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1 male      84      702.       780.     1009        963. 789356.
2     2 female    37      471.       497       738.       692. 789356.
3     3 female    62      639.       638       878        786  789356.
4     4 female    85      708        639.      902.      1374. 789356.
5     5 male      73      607.       652       923        805  789356.
# ℹ 259 more rows

# which is not what we want.

Mutating data: `case_when()`

case_when(condition ~ do) # similar to ifelse()

mutate(blomkvist_rt, age_cat = case_when(age > 70 ~ "high",
                                         age > 40 ~ "middle",
                                         is.na(smoker) ~ "dunno",
                                         TRUE ~ "on the young side"))

# A tibble: 267 × 5
     id smoker   age    rt age_cat          
  <dbl> <chr>  <dbl> <dbl> <chr>            
1     1 former    84  702. high             
2     2 no        37  471. on the young side
3     3 yes       62  639. middle           
4     4 former    85  708  high             
5     5 former    73  607. high             
# ℹ 262 more rows

Using `across()` to mutate and summarise variables

Open exercise 9 script.

References

Andrews, Mark. 2021. Doing data science in R: An Introduction for Social Scientists. London, UK: SAGE Publications Ltd.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.