About me: Jens Roeser

  • early career research fellow @ psychology department (Nottingham Trent University)
  • theory: psycholinguistics; language production / comprehension / acquisition
  • methods: Bayesian modelling (talk to me about mixture models, Roeser et al. 2021) in Stan; keystroke logging; eyetracking
  • teaching: statistics – R of course (psyntur, Andrews and Roeser 2021); cognitive psychology; language acquisition
  • twitter: https://twitter.com/jens_roeser

Outline for today

  • Data wrangling with tidyverse (50%)
  • Data viz with ggplot2 (40%)
  • R Markdown (10%)
  • Lots of hands-on exercises

Why should I care?

Why use R (or code in general) to handle data?

** What do you think? **

Why should I care?

Why use R (or code in general) to handle data?

  • A commonly cited estimate: 70% to 80% of data analysis is data wrangling
  • Open source: R is and always will be free of charge
  • Reduce human error
  • Reduce manual work
  • Reproducibility: publish your code and look at code of other researchers
  • Flexibility: different ways of looking at data
  • Quickly growing number of available add-ons (packages) for data analysis
  • Speed: faster than manually transforming data in spreadsheets
  • Large data sets: often too big to process in spreadsheets at all
  • Large community of friendly peer support

Rules!

  • Never change your data manually; document everything in code.
    • Retrospective amendments made easy
    • Documentation / reproducibility
  • Organized working environment
    • .Rproj with one directory per project with sub-directories for scripts, data, plots, etc.
    • Short scripts: less code with one clear purpose is always better (test is: does the name of your script suggest a specific or general purpose?)
  • Comment your code (# Ceci n'est pas un comment!)
  • If possible, use tidyverse instead of base R.

Download repository

  • Download: https://github.com/jensroes/hallam-r-workshop
  • Click on: Code > Download ZIP > unzip directory on your machine.
  • Open project by double clicking on hallam-r-workshop.Rproj
  • wrangling/exercises/: exercises associated with each topic
  • data/: scripts read data from here
  • wrangling/slides.Rmd: these slides in R markdown format (.html format provided as well)

Goals of data wrangling

  • Data come in various formats (long, wide) and file types (xlsx, ods, json, csv, sav)
  • No format is suitable for every goal
  • Fluency in data wrangling gives you a lot of power.
  • Make data format suitable to use: e.g. for statistical models (correlations, linear regression), functions, data viz, summary table
  • Calculate new variables, filter or combine data
  • Reveal information
  • Summarise information
  • (also creating counterbalanced, randomised stimulus lists)

tidyverse

Collection of R packages for data science that share:

  • common data philosophies
  • grammar
  • data structures
  • best practice
  • designed to work together

tidyverse

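# Installs the tidyverse packages (the exact number depends on the version)
install.packages("tidyverse")
# Loads the core packages (e.g. ggplot2, dplyr, tidyr, readr, purrr, tibble)
library(tidyverse)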

tidyverse

Tidy data

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Why?

  • Placing variables in columns takes advantage of R’s vectorised nature (faster processing, more compact code).
  • Consistent data structure allows easier learning of related tools because they have similar underpinning principles (they expect similar input structures).
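
A minimal sketch of what "tidy" means in practice, assuming a made-up two-participant data set (the tibble `wide` and its values are hypothetical): pivot_longer() turns one column per measurement into one row per observation.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one row per participant, one column per measurement
wide <- tibble(
  id      = c(1, 2),
  rt_hand = c(702, 471),
  rt_foot = c(1009, 738)
)

# Tidy (long) version: each variable (id, limb, rt) has its own column,
# each observation (one reaction time) has its own row
long <- pivot_longer(wide,
                     cols = c(rt_hand, rt_foot),
                     names_to = "limb",
                     names_prefix = "rt_",   # strip "rt_" from the new labels
                     values_to = "rt")
long
```

The long version has four rows (two participants times two limbs), so functions that expect one variable per column can work on rt directly.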

tidyverse: verbs

  • Functions that do specific things to our data.

  • Must know: read_csv, write_csv, glimpse, select, filter, mutate, group_by / ungroup, summarise, pivot_wider / _longer, _join, bind_rows / _cols

  • Also important: count, pull, slice, across, recode, unique, n, where, everything, ~ and ., map, starts_with, ends_with, contains, separate, unite, transmute

  • There are more but these are the most important ones.

Example data set: Blomkvist et al. (2017)

  • Age-related changes in cognitive performance through adolescence and adulthood in a real-world task.

Real-world task: StarCraft 2

  • Real-time strategy video game
  • Nintendo Wii Balance Board

Example data set: Blomkvist et al. (2017)

blomkvist <- read_csv("../data/blomkvist.csv")
glimpse(blomkvist)
Rows: 354
Columns: 11
$ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
$ sex         <chr> "male", "female", "female", "female", …
$ age         <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58…
$ medicine    <dbl> 8, 1, 0, 4, 5, 0, 0, 0, 11, 0, 0, 4, 3…
$ smoker      <chr> "former", "no", "yes", "former", "form…
$ pal_work    <dbl> NA, 2, NA, NA, NA, 1, 3, 1, NA, 4, 2, …
$ pal_leisure <dbl> 1, 2, 2, 2, 3, 3, 2, 2, 1, 3, 3, 2, 1,…
$ rt_hand_d   <dbl> 702, 471, 639, 708, 607, 542, 571, 509…
$ rt_hand_nd  <dbl> 780, 497, 638, 639, 652, 499, 527, 547…
$ rt_foot_d   <dbl> 1009, 738, 878, 902, 923, 687, 778, 74…
$ rt_foot_nd  <dbl> 963, 692, 786, 1374, 805, 600, 750, 79…
  • Average reaction time (rt) of dominant (_d) or non-dominant (_nd) hand or foot in msecs
  • medicine: number of drugs used daily
  • pal: physical activity level: 1 (least) to 4 (most active)

tbls (tibble)

  • The tidyverse operates on tibbles (tbl)
  • Type of data structure
  • Easier to read in console
# Imports data as data frame
data_as_frame <- read.csv("path_to_data/data.csv")
# Imports data as tibble
data_as_tibble <- read_csv("path_to_data/data.csv")
  • .csv: comma-separated values file
  • readr package: e.g. read_csv, read_delim, read_tsv

For other data formats:

  • haven package: e.g. read_dta, read_sav, read_sas
  • readxl package: e.g. read_excel, read_xls, read_xlsx
# Summarise data structure in base R
str(data_as_frame)
# Summarise data structure in tidyverse
glimpse(data_as_tibble)
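
To try the two import functions without the workshop data, here is a small sketch that writes a hypothetical three-line CSV to a temporary file and reads it back both ways (the file contents are made up; show_col_types assumes a recent readr version).

```r
library(readr)

# Write a tiny hypothetical CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,rt", "1,male,702", "2,female,471"), path)

# Base R: returns a plain data frame
data_as_frame <- read.csv(path)

# readr: returns a tibble (show_col_types = FALSE silences the column message)
data_as_tibble <- read_csv(path, show_col_types = FALSE)

class(data_as_frame)   # "data.frame"
class(data_as_tibble)  # includes "tbl_df" as well as "data.frame"
```

A tibble is still a data frame, so base R functions continue to work on it.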

Open exercise script 1

tidyverse functions

Functions follow the principle

function_name(data_name, argument)

where argument specifies which variable / condition etc. the function has to operate on.

# Picking out variables
select(data, variable1) 
# Subsetting data
filter(data, variable > 100) 
# Change / add variables
mutate(data, variable_sqr = variable^2)
# Aggregate data
summarise(data, mean_var = mean(variable)) 
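
A short sketch of the function_name(data_name, argument) pattern on a toy tibble (`d` and its values are hypothetical). It also illustrates a point worth knowing early: the verbs return a new tibble and leave the input unchanged, so results need to be assigned to keep them.

```r
library(dplyr)

# Hypothetical toy data
d <- tibble(id = 1:3, rt = c(702, 471, 639))

# mutate() returns a new tibble with the extra column
d_sqr <- mutate(d, rt_sqr = rt^2)

names(d)      # still "id" "rt": the input is unchanged
names(d_sqr)  # "id" "rt" "rt_sqr"

# The stored result can be handed to the next verb
slow <- filter(d_sqr, rt > 600)
slow_mean <- summarise(slow, mean_rt = mean(rt))
slow_mean
```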

Selecting variables

Extracts variables you’re interested in.

glimpse(blomkvist)
Rows: 354
Columns: 11
$ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ sex         <chr> "male", "female", "female", "female", "male", "male", "fem…
$ age         <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 88, 27…
$ medicine    <dbl> 8, 1, 0, 4, 5, 0, 0, 0, 11, 0, 0, 4, 3, 8, 1, 3, 4, 1, 1, …
$ smoker      <chr> "former", "no", "yes", "former", "former", "no", "no", "fo…
$ pal_work    <dbl> NA, 2, NA, NA, NA, 1, 3, 1, NA, 4, 2, NA, 3, NA, 2, 3, NA,…
$ pal_leisure <dbl> 1, 2, 2, 2, 3, 3, 2, 2, 1, 3, 3, 2, 1, 1, 3, 3, 1, 3, 1, 2…
$ rt_hand_d   <dbl> 702, 471, 639, 708, 607, 542, 571, 509, 737, 550, 548, 889…
$ rt_hand_nd  <dbl> 780, 497, 638, 639, 652, 499, 527, 547, 865, 569, 507, 766…
$ rt_foot_d   <dbl> 1009, 738, 878, 902, 923, 687, 778, 743, 750, 629, 653, 79…
$ rt_foot_nd  <dbl> 963, 692, 786, 1374, 805, 600, 750, 797, 797, 800, 718, 86…

select(blomkvist, id, sex, age)
# A tibble: 354 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# … with 349 more rows

Selecting variables: select range

select(blomkvist, id:age)
# A tibble: 354 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# … with 349 more rows

Selecting variables: index

select(blomkvist, 1, 2, 3)
# A tibble: 354 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# … with 349 more rows

Selecting variables: index range

select(blomkvist, 1:3)
# A tibble: 354 × 3
     id sex      age
  <dbl> <chr>  <dbl>
1     1 male      84
2     2 female    37
3     3 female    62
4     4 female    85
5     5 male      73
# … with 349 more rows

Selecting variables: rename

select(blomkvist, id, sex, rt = rt_hand_d)
# A tibble: 354 × 3
     id sex       rt
  <dbl> <chr>  <dbl>
1     1 male    702.
2     2 female  471.
3     3 female  639.
4     4 female  708 
5     5 male    607.
# … with 349 more rows

Selecting multiple variables

select(blomkvist, id, starts_with("rt_"))
# A tibble: 354 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# … with 349 more rows

Selecting multiple variables

select(blomkvist, id, ends_with("d"))

???

Selecting multiple variables

select(blomkvist, id, ends_with("d"))
# A tibble: 354 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# … with 349 more rows

Selecting multiple variables

select(blomkvist, id, contains("hand"))

???

Selecting multiple variables

select(blomkvist, id, contains("hand"))
# A tibble: 354 × 3
     id rt_hand_d rt_hand_nd
  <dbl>     <dbl>      <dbl>
1     1      702.       780.
2     2      471.       497 
3     3      639.       638 
4     4      708        639.
5     5      607.       652 
# … with 349 more rows

(Un-)selecting multiple variables

select(blomkvist, -contains("hand"))
# A tibble: 354 × 9
     id sex      age medicine smoker pal_work pal_leisure rt_foot_d rt_foot_nd
  <dbl> <chr>  <dbl>    <dbl> <chr>     <dbl>       <dbl>     <dbl>      <dbl>
1     1 male      84        8 former       NA           1     1009        963.
2     2 female    37        1 no            2           2      738.       692.
3     3 female    62        0 yes          NA           2      878        786 
4     4 female    85        4 former       NA           2      902.      1374.
5     5 male      73        5 former       NA           3      923        805 
# … with 349 more rows

(Un-)selecting multiple variables

select(blomkvist, -ends_with("_d"))

???

(Un-)selecting multiple variables

select(blomkvist, -ends_with("_d"))
# A tibble: 354 × 9
     id sex      age medicine smoker pal_work pal_leisure rt_hand_nd rt_foot_nd
  <dbl> <chr>  <dbl>    <dbl> <chr>     <dbl>       <dbl>      <dbl>      <dbl>
1     1 male      84        8 former       NA           1       780.       963.
2     2 female    37        1 no            2           2       497        692.
3     3 female    62        0 yes          NA           2       638        786 
4     4 female    85        4 former       NA           2       639.      1374.
5     5 male      73        5 former       NA           3       652        805 
# … with 349 more rows

(Un-)selecting multiple variables

select(blomkvist, -sex:-smoker)

???

(Un-)selecting multiple variables

select(blomkvist, -sex:-smoker)
# A tibble: 354 × 7
     id pal_work pal_leisure rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>    <dbl>       <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1       NA           1      702.       780.     1009        963.
2     2        2           2      471.       497       738.       692.
3     3       NA           2      639.       638       878        786 
4     4       NA           2      708        639.      902.      1374.
5     5       NA           3      607.       652       923        805 
# … with 349 more rows

Selecting multiple variables

select(blomkvist, where(is.character))
# A tibble: 354 × 2
  sex    smoker
  <chr>  <chr> 
1 male   former
2 female no    
3 female yes   
4 female former
5 male   former
# … with 349 more rows

Selecting multiple variables

select(blomkvist, where(is.numeric))

???

Selecting multiple variables

select(blomkvist, where(is.numeric))
# A tibble: 354 × 9
     id   age medicine pal_work pal_leisure rt_hand_d rt_hand_nd rt_foot_d
  <dbl> <dbl>    <dbl>    <dbl>       <dbl>     <dbl>      <dbl>     <dbl>
1     1    84        8       NA           1      702.       780.     1009 
2     2    37        1        2           2      471.       497       738.
3     3    62        0       NA           2      639.       638       878 
4     4    85        4       NA           2      708        639.      902.
5     5    73        5       NA           3      607.       652       923 
# … with 349 more rows, and 1 more variable: rt_foot_nd <dbl>
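
The different ways of selecting can be checked against each other. A minimal sketch on a hypothetical two-row tibble (`toy`, with rounded values): selecting by name, by name range and by index range give the same result, and selection helpers can be combined with & and negated with !.

```r
library(dplyr)

# Hypothetical toy data with the same kind of column names as the example data
toy <- tibble(id = 1:2,
              sex = c("male", "female"),
              age = c(84, 37),
              rt_hand_d = c(702, 471),
              rt_hand_nd = c(780, 497))

by_name  <- select(toy, id, sex, age)
by_range <- select(toy, id:age)
by_index <- select(toy, 1:3)

identical(by_name, by_range)  # TRUE
identical(by_name, by_index)  # TRUE

# Combine helpers: columns starting with "rt_" but not ending in "_nd"
dominant <- select(toy, starts_with("rt_") & !ends_with("_nd"))
names(dominant)               # "rt_hand_d"
```

Selecting by name is usually the safer choice: index positions silently point at different columns when the column order changes.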

Continue with exercise 2

Filtering data

Select variables of interest

blomkvist_rt <- select(blomkvist, id, smoker, age, rt = rt_hand_d)

Subsetting data by selecting rows that meet one condition or more.

filter(data, condition)

Continuous variables

filter(blomkvist_rt, rt >= 708)
# A tibble: 81 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     4 former    85  708 
2     9 former    83  737.
3    12 no        88  889 
4    13 yes       62  884.
5    14 former    88  832.
# … with 76 more rows
filter(blomkvist_rt, rt > 708)
# A tibble: 79 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     9 former    83  737.
2    12 no        88  889 
3    13 yes       62  884.
4    14 former    88  832.
5    17 former    80  930 
# … with 74 more rows

Continuous variables

filter(blomkvist_rt, rt > 708, rt < 900)
# A tibble: 49 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     9 former    83  737.
2    12 no        88  889 
3    13 yes       62  884.
4    14 former    88  832.
5    39 no        88  727 
# … with 44 more rows

Continuous variables

filter(blomkvist_rt, rt > 708 | rt < 900)
# A tibble: 353 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# … with 348 more rows
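# which is a logical tautology really: every non-missing rt is > 708 or < 900
# (the one row with a missing rt is dropped, hence 353 rows)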
filter(blomkvist_rt, rt < 708 | rt > 900)
# A tibble: 302 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     5 former    73  607.
5     6 no        65  542.
# … with 297 more rows
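
A small sketch of how conditions combine in filter(), using a hypothetical five-row tibble (`toy`): a comma between conditions means "and" (same as &), | means "or", and rows where the condition evaluates to NA are dropped. between() is a dplyr shortcut that includes both end points.

```r
library(dplyr)

# Hypothetical toy data with one missing reaction time
toy <- tibble(id = 1:5, rt = c(702, 471, 639, 708, NA))

# A comma and & are equivalent: both conditions must hold
with_comma <- filter(toy, rt > 600, rt < 708)
with_and   <- filter(toy, rt > 600 & rt < 708)
identical(with_comma, with_and)  # TRUE

# between() is inclusive: rt >= 600 & rt <= 708
in_range <- filter(toy, between(rt, 600, 708))

# "or" version: true for every non-missing rt, the NA row is dropped
either <- filter(toy, rt > 600 | rt < 708)
nrow(either)  # 4
```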

Continuous variables

filter(blomkvist_rt, rt > mean(rt, na.rm = TRUE))
# A tibble: 126 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     3 yes       62  639.
3     4 former    85  708 
4     9 former    83  737.
5    12 no        88  889 
# … with 121 more rows
mean(blomkvist_rt$rt, na.rm = TRUE)
[1] 637

NB. What’s na.rm = TRUE?

# Data with missing values
y <- c(100, 1150, 200, 43, NA, 15)
mean(y)
[1] NA
mean(y, na.rm = TRUE)
[1] 302
sd(y, na.rm = TRUE)
[1] 480

Categorical variables

filter(blomkvist_rt, smoker == "yes")
# A tibble: 30 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     3 yes       62  639.
2    10 yes       58  550.
3    13 yes       62  884.
4    24 yes       57  612.
5    28 yes       59  586.
# … with 25 more rows
unique(blomkvist_rt$smoker)
[1] "former" "no"     "yes"    NA      

Categorical variables

filter(blomkvist_rt, smoker != "yes")
# A tibble: 315 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     4 former    85  708 
4     5 former    73  607.
5     6 no        65  542.
# … with 310 more rows
unique(blomkvist_rt$smoker)
[1] "former" "no"     "yes"    NA      

Categorical variables

filter(blomkvist_rt, smoker %in% c("yes", "former"))
# A tibble: 132 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     3 yes       62  639.
3     4 former    85  708 
4     5 former    73  607.
5     8 former    49  509.
# … with 127 more rows
filter(blomkvist_rt, !(smoker %in% c("yes", "former")))
# A tibble: 222 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     2 no        37  471.
2     6 no        65  542.
3     7 no        30  571.
4    11 no        25  548 
5    12 no        88  889 
# … with 217 more rows

Missing data

unique(blomkvist_rt$smoker)
[1] "former" "no"     "yes"    NA      
filter(blomkvist_rt, smoker == "NA") # ooops!!!
# A tibble: 0 × 4
# … with 4 variables: id <dbl>, smoker <chr>, age <dbl>, rt <dbl>
filter(blomkvist_rt, smoker == NA) # double ooops!!!
# A tibble: 0 × 4
# … with 4 variables: id <dbl>, smoker <chr>, age <dbl>, rt <dbl>

Missing data

filter(blomkvist_rt, is.na(smoker))
# A tibble: 9 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1    41 <NA>      52  592 
2   144 <NA>      36  501.
3   149 <NA>      34  479.
4   178 <NA>      26  466 
5   183 <NA>      25  888.
# … with 4 more rows
filter(blomkvist_rt, !is.na(smoker))
# A tibble: 345 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# … with 340 more rows
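
Why does smoker == NA return nothing? A minimal sketch with a hypothetical four-row tibble (`toy`): any comparison with NA yields NA rather than TRUE or FALSE, and filter() only keeps rows where the condition is TRUE. This also explains why != drops missing values while !(... %in% ...) keeps them.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(id = 1:4, smoker = c("yes", "no", NA, "former"))

NA == NA      # NA, not TRUE
"yes" == NA   # NA as well

# != gives NA for the missing row, so that row is dropped
not_yes <- filter(toy, smoker != "yes")
nrow(not_yes)      # 2

# %in% never returns NA (NA %in% "yes" is FALSE), so the missing row is kept
not_in_yes <- filter(toy, !(smoker %in% "yes"))
nrow(not_in_yes)   # 3

# Keep the missing row explicitly
not_yes_or_na <- filter(toy, smoker != "yes" | is.na(smoker))
nrow(not_yes_or_na)  # 3
```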

Continue with exercise 3

Mutating data

Adding new variables to the data or changing existing ones.

mutate(blomkvist_rt, rt_2 = rt * rt)
# A tibble: 354 × 5
     id smoker   age    rt    rt_2
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702. 492336.
2     2 no        37  471. 221527.
3     3 yes       62  639. 407895.
4     4 former    85  708  501264 
5     5 former    73  607. 368854.
# … with 349 more rows
mutate(blomkvist_rt, rt_2 = rt^2)
# A tibble: 354 × 5
     id smoker   age    rt    rt_2
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702. 492336.
2     2 no        37  471. 221527.
3     3 yes       62  639. 407895.
4     4 former    85  708  501264 
5     5 former    73  607. 368854.
# … with 349 more rows

Mutating data

mutate(blomkvist_rt, log_rt = log(rt))
# A tibble: 354 × 5
     id smoker   age    rt log_rt
  <dbl> <chr>  <dbl> <dbl>  <dbl>
1     1 former    84  702.   6.55
2     2 no        37  471.   6.15
3     3 yes       62  639.   6.46
4     4 former    85  708    6.56
5     5 former    73  607.   6.41
# … with 349 more rows

Mutating data

mutate(blomkvist_rt, is_slow = rt > 700)
# A tibble: 354 × 5
     id smoker   age    rt is_slow
  <dbl> <chr>  <dbl> <dbl> <lgl>  
1     1 former    84  702. TRUE   
2     2 no        37  471. FALSE  
3     3 yes       62  639. FALSE  
4     4 former    85  708  TRUE   
5     5 former    73  607. FALSE  
# … with 349 more rows

Mutating data

mutate(blomkvist_rt, mean_rt = mean(rt, na.rm = TRUE))
# A tibble: 354 × 5
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    637.
2     2 no        37  471.    637.
3     3 yes       62  639.    637.
4     4 former    85  708     637.
5     5 former    73  607.    637.
# … with 349 more rows

Mutating data

mutate(blomkvist_rt, mean_rt = mean(rt, na.rm = TRUE),
                     is_slow = rt > mean_rt)
# A tibble: 354 × 6
     id smoker   age    rt mean_rt is_slow
  <dbl> <chr>  <dbl> <dbl>   <dbl> <lgl>  
1     1 former    84  702.    637. TRUE   
2     2 no        37  471.    637. FALSE  
3     3 yes       62  639.    637. TRUE   
4     4 former    85  708     637. TRUE   
5     5 former    73  607.    637. FALSE  
# … with 349 more rows
mutate(blomkvist_rt, # or both in one go
       is_slow = rt > mean(rt, na.rm = TRUE))

Mutating data: recode()

mutate(blomkvist_rt, smoker_recoded = recode(smoker, former = "former smoker",
                                                     yes = "smoker",
                                                     no = "non-smoker"))
# A tibble: 354 × 5
     id smoker   age    rt smoker_recoded
  <dbl> <chr>  <dbl> <dbl> <chr>         
1     1 former    84  702. former smoker 
2     2 no        37  471. non-smoker    
3     3 yes       62  639. smoker        
4     4 former    85  708  former smoker 
5     5 former    73  607. former smoker 
# … with 349 more rows

Mutating data: cut()

mutate(blomkvist_rt, age_cat = cut(age, 
                                   breaks = 3, 
                                   labels = c("low", "middle", "high")))
# A tibble: 354 × 5
     id smoker   age    rt age_cat
  <dbl> <chr>  <dbl> <dbl> <fct>  
1     1 former    84  702. high   
2     2 no        37  471. low    
3     3 yes       62  639. middle 
4     4 former    85  708  high   
5     5 former    73  607. high   
# … with 349 more rows
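
With breaks = 3, cut() splits the observed range into three equal-width intervals. When the boundaries should be fixed in advance, a vector of cut-offs can be supplied instead. A sketch with hypothetical cut-offs (40 and 70) and a toy tibble:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(age = c(84, 37, 62, 85, 73))

# Explicit boundaries: (0, 40], (40, 70], (70, Inf]
toy_cat <- mutate(toy, age_cat = cut(age,
                                     breaks = c(0, 40, 70, Inf),
                                     labels = c("low", "middle", "high")))
toy_cat
```

Intervals are closed on the right by default, so an age of exactly 40 would land in "low".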

Mutating data: case_when()

# Schematic: case_when(condition ~ value), similar to nested ifelse()
mutate(blomkvist_rt, age_cat = case_when(age > 70 ~ "high",
                                         age > 40 ~ "middle",
                                         is.na(smoker) ~ "dunno",
                                         TRUE ~ "on the young side"))
# A tibble: 354 × 5
     id smoker   age    rt age_cat          
  <dbl> <chr>  <dbl> <dbl> <chr>            
1     1 former    84  702. high             
2     2 no        37  471. on the young side
3     3 yes       62  639. middle           
4     4 former    85  708  high             
5     5 former    73  607. high             
# … with 349 more rows
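
case_when() checks its conditions from top to bottom and uses the first one that is TRUE, so the order matters. A minimal sketch on a hypothetical toy tibble (column names right_order and wrong_order are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- tibble(age = c(84, 37, 62, NA))

toy_cat <- mutate(toy,
  # Most specific condition first
  right_order = case_when(age > 70   ~ "high",
                          age > 40   ~ "middle",
                          is.na(age) ~ "dunno",
                          TRUE       ~ "on the young side"),
  # age > 40 catches everyone over 70 too, so "high" is never reached;
  # without an is.na() check the missing age falls through to TRUE
  wrong_order = case_when(age > 40 ~ "middle",
                          age > 70 ~ "high",
                          TRUE     ~ "on the young side"))
toy_cat
```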

Mutating data

Continue with exercise 4

Grouping data with group_by()

Perform an action (function) for each level of the grouping variable individually.

blomkvist_grouped <- group_by(blomkvist_rt, smoker) # Group by smoker
blomkvist_grouped 
# A tibble: 354 × 4
# Groups:   smoker [4]
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# … with 349 more rows
# How many groups and what are they?

Mutate grouped data

mutate(blomkvist_rt, # not grouped
       mean_rt = mean(rt, na.rm = TRUE))
# A tibble: 354 × 5
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    637.
2     2 no        37  471.    637.
3     3 yes       62  639.    637.
4     4 former    85  708     637.
5     5 former    73  607.    637.
# … with 349 more rows
mutate(blomkvist_grouped, # grouped data
       mean_rt = mean(rt, na.rm = TRUE))
# A tibble: 354 × 5
# Groups:   smoker [4]
     id smoker   age    rt mean_rt
  <dbl> <chr>  <dbl> <dbl>   <dbl>
1     1 former    84  702.    676.
2     2 no        37  471.    623.
3     3 yes       62  639.    628.
4     4 former    85  708     676.
5     5 former    73  607.    676.
# … with 349 more rows
# What's the grouping variable?
# What's the difference?

Never forget to ungroup your data

otherwise you will keep performing by-group operations, even if you didn’t intend to.

blomkvist_grouped # :(
# A tibble: 354 × 4
# Groups:   smoker [4]
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# … with 349 more rows
ungroup(blomkvist_grouped) # :)
# A tibble: 354 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  702.
2     2 no        37  471.
3     3 yes       62  639.
4     4 former    85  708 
5     5 former    73  607.
# … with 349 more rows
# Spot the difference!
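
A sketch of the full group, mutate, ungroup cycle on a hypothetical toy tibble (names and values are made up). is_grouped_df() from dplyr reports whether a tibble still carries a grouping.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(smoker = c("former", "no", "yes", "former", "no"),
              rt = c(702, 471, 639, 708, 607))

grouped    <- group_by(toy, smoker)
with_means <- mutate(grouped, group_mean = mean(rt))  # one mean per smoker level
is_grouped_df(with_means)  # TRUE: the grouping is still attached

ungrouped <- ungroup(with_means)
is_grouped_df(ungrouped)   # FALSE

# After ungrouping, mean() is calculated across all rows again
final <- mutate(ungrouped, grand_mean = mean(rt))
final
```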

Summarise (grouped) data

… using descriptive tools.

summarise(blomkvist_rt, # not grouped
          mean_rt = mean(rt, na.rm = TRUE),
          N = n())
# A tibble: 1 × 2
  mean_rt     N
    <dbl> <int>
1    637.   354
summarise(blomkvist_grouped, # grouped data
          mean_rt = mean(rt, na.rm = TRUE),
          N = n())
# A tibble: 4 × 3
  smoker mean_rt     N
  <chr>    <dbl> <int>
1 former    676.   102
2 no        623.   213
3 yes       628.    30
4 <NA>      549.     9

Summarise data

Summarising data using descriptive tools.

summarise(blomkvist_rt, # not grouped
          mean_rt = mean(rt, na.rm = TRUE),
          sd_rt = sd(rt, na.rm = TRUE),
          min_rt = min(rt, na.rm = TRUE),
          max_rt = max(rt, na.rm = TRUE),
          N = n())
# A tibble: 1 × 5
  mean_rt sd_rt min_rt max_rt     N
    <dbl> <dbl>  <dbl>  <dbl> <int>
1    637.  204.   379.   2076   354
summarise(blomkvist_grouped, # grouped data
          mean_rt = mean(rt, na.rm = TRUE),
          sd_rt = sd(rt, na.rm = TRUE),
          min_rt = min(rt, na.rm = TRUE),
          max_rt = max(rt, na.rm = TRUE),
          N = n())
# A tibble: 4 × 6
  smoker mean_rt sd_rt min_rt max_rt     N
  <chr>    <dbl> <dbl>  <dbl>  <dbl> <int>
1 former    676.  231.   390.  2076    102
2 no        623.  192.   379.  1552.   213
3 yes       628.  202.   427.  1527     30
4 <NA>      549.  133.   466    888.     9

Grouping data

  • Continue with exercise 5

Mutate data with across()

Create a new column.

mutate(blomkvist_rt, log_rt = log(rt))
# A tibble: 354 × 5
     id smoker   age    rt log_rt
  <dbl> <chr>  <dbl> <dbl>  <dbl>
1     1 former    84  702.   6.55
2     2 no        37  471.   6.15
3     3 yes       62  639.   6.46
4     4 former    85  708    6.56
5     5 former    73  607.   6.41
# … with 349 more rows

Replace with transformed variable.

mutate(blomkvist_rt, across(rt, log))
# A tibble: 354 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84  6.55
2     2 no        37  6.15
3     3 yes       62  6.46
4     4 former    85  6.56
5     5 former    73  6.41
# … with 349 more rows

The columns selected in the first argument of across() are passed as the first argument to the function (here log), which is supplied as the second argument of across().
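
A minimal sketch of this equivalence, using a made-up tibble (`toy` is hypothetical) and `log10` so the values are easy to read: overwriting a column by hand and doing it with `across()` should give the same column.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1:3, rt = c(100, 1000, 10000))

# Overwrite rt by hand
by_hand <- mutate(toy, rt = log10(rt))

# Same thing: across() passes the rt column as first argument to log10()
with_across <- mutate(toy, across(rt, log10))

all.equal(by_hand$rt, with_across$rt)
```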

Mutate data with across()

# Instead of
mutate(blomkvist, rt_hand_d = log(rt_hand_d), 
                  rt_hand_nd = log(rt_hand_nd))
# Do
mutate(blomkvist, across(c(rt_hand_d, rt_hand_nd), log))
# A tibble: 342 × 7
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1     1    84 former      6.55       6.66     1009        963.
2     2    37 no          6.15       6.21      738.       692.
3     3    62 yes         6.46       6.46      878        786 
4     4    85 former      6.56       6.46      902.      1374.
5     5    73 former      6.41       6.48      923        805 
# … with 337 more rows

Mutate data with across()

mutate(blomkvist, across(where(is.numeric), log))
# A tibble: 342 × 7
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1 0      4.43 former      6.55       6.66      6.92       6.87
2 0.693  3.61 no          6.15       6.21      6.60       6.54
3 1.10   4.13 yes         6.46       6.46      6.78       6.67
4 1.39   4.44 former      6.56       6.46      6.80       7.23
5 1.61   4.29 former      6.41       6.48      6.83       6.69
# … with 337 more rows
# Not ideal here: see `id` column

Mutate data with across()

mutate(blomkvist, across(starts_with("rt_"), log))
Rows: 342
Columns: 7
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, …
$ age        <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27, 60,…
$ smoker     <chr> "former", "no", "yes", "former", "former", "no", "no", "for…
$ rt_hand_d  <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, 6.8,…
$ rt_hand_nd <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, 6.6,…
$ rt_foot_d  <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, 6.7,…
$ rt_foot_nd <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, 6.8,…

Mutate data with across()

mutate(blomkvist, across(starts_with("rt_"), log, .names = "log_{.col}"))
Rows: 342
Columns: 11
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, …
$ age            <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27,…
$ smoker         <chr> "former", "no", "yes", "former", "former", "no", "no", …
$ rt_hand_d      <dbl> 702, 471, 639, 708, 607, 542, 571, 509, 737, 550, 548, …
$ rt_hand_nd     <dbl> 780, 497, 638, 639, 652, 499, 527, 547, 865, 569, 507, …
$ rt_foot_d      <dbl> 1009, 738, 878, 902, 923, 687, 778, 743, 750, 629, 653,…
$ rt_foot_nd     <dbl> 963, 692, 786, 1374, 805, 600, 750, 797, 797, 800, 718,…
$ log_rt_hand_d  <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, …
$ log_rt_hand_nd <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, …
$ log_rt_foot_d  <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, …
$ log_rt_foot_nd <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, …

Mutate data with across()

mutate(blomkvist, across(starts_with("rt_"), list(lg = log, sqr = ~.^2), .names = "{.fn}_{.col}"))
Rows: 342
Columns: 15
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, …
$ age            <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58, 25, 88, 62, 27,…
$ smoker         <chr> "former", "no", "yes", "former", "former", "no", "no", …
$ rt_hand_d      <dbl> 702, 471, 639, 708, 607, 542, 571, 509, 737, 550, 548, …
$ rt_hand_nd     <dbl> 780, 497, 638, 639, 652, 499, 527, 547, 865, 569, 507, …
$ rt_foot_d      <dbl> 1009, 738, 878, 902, 923, 687, 778, 743, 750, 629, 653,…
$ rt_foot_nd     <dbl> 963, 692, 786, 1374, 805, 600, 750, 797, 797, 800, 718,…
$ lg_rt_hand_d   <dbl> 6.6, 6.2, 6.5, 6.6, 6.4, 6.3, 6.3, 6.2, 6.6, 6.3, 6.3, …
$ sqr_rt_hand_d  <dbl> 492336, 221527, 407895, 501264, 368854, 293403, 325660,…
$ lg_rt_hand_nd  <dbl> 6.7, 6.2, 6.5, 6.5, 6.5, 6.2, 6.3, 6.3, 6.8, 6.3, 6.2, …
$ sqr_rt_hand_nd <dbl> 608920, 247009, 407044, 407895, 425104, 248668, 278080,…
$ lg_rt_foot_d   <dbl> 6.9, 6.6, 6.8, 6.8, 6.8, 6.5, 6.7, 6.6, 6.6, 6.4, 6.5, …
$ sqr_rt_foot_d  <dbl> 1018081, 544152, 770884, 814205, 851929, 471511, 605803…
$ lg_rt_foot_nd  <dbl> 6.9, 6.5, 6.7, 7.2, 6.7, 6.4, 6.6, 6.7, 6.7, 6.7, 6.6, …
$ sqr_rt_foot_nd <dbl> 926727, 479325, 617796, 1886960, 648025, 359600, 562000…
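
How `.names` builds the new column names can be checked on a small example. This is a sketch with made-up data (`toy`, `rt_a`, `rt_b` are hypothetical): `{.fn}` is replaced by the name given in the list and `{.col}` by the original column name.

```r
library(dplyr)

# Hypothetical toy data with two rt_ columns
toy <- tibble(rt_a = c(1, 10), rt_b = c(100, 1000))

out <- mutate(toy,
              across(starts_with("rt_"),
                     list(lg = log10, sqr = ~ .^2),  # named list of functions
                     .names = "{.fn}_{.col}"))       # e.g. lg_rt_a, sqr_rt_a

names(out)
```

The original columns are kept, and the new columns are added column by column (both functions for `rt_a`, then both for `rt_b`).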

What’s “~.”?

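# Extra arguments passed after the function (deprecated since dplyr 1.1.0; the ~ form is preferred)
mutate(blomkvist_rt, across(rt, round, 0))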
# A tibble: 354 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84   702
2     2 no        37   471
3     3 yes       62   639
4     4 former    85   708
5     5 former    73   607
# … with 349 more rows
mutate(blomkvist_rt, across(rt, ~round(., 0)))
# A tibble: 354 × 4
     id smoker   age    rt
  <dbl> <chr>  <dbl> <dbl>
1     1 former    84   702
2     2 no        37   471
3     3 yes       62   639
4     4 former    85   708
5     5 former    73   607
# … with 349 more rows

What’s “~.”?

  • “~”: turns the expression into an anonymous function, so the position of the argument can be made explicit.
  • “.”: the location of the argument.
  • Always possible, but necessary for operator functions (>, ==, ^, +) and when the argument is not in the first position of the supplied function.
mutate(blomkvist_rt, 
       across(rt, ~round(. , 0)), # optional
       across(age, ~.^2),   # operator
       across(rt, ~paste("RT is", .))) # position
# A tibble: 354 × 4
     id smoker   age rt       
  <dbl> <chr>  <dbl> <chr>    
1     1 former  7056 RT is 702
2     2 no      1369 RT is 471
3     3 yes     3844 RT is 639
4     4 former  7225 RT is 708
5     5 former  5329 RT is 607
# … with 349 more rows
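
The `~` shorthand can be read as an ordinary anonymous function. A small sketch with made-up values (`toy` is hypothetical) showing three spellings that should give the same result:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(rt = c(701.6, 470.7, 638.9))

# Formula shorthand with "." as the placeholder
res_dot <- mutate(toy, across(rt, ~ round(., 0)))

# Same shorthand; ".x" is an alternative spelling of the placeholder
res_dotx <- mutate(toy, across(rt, ~ round(.x, 0)))

# Ordinary anonymous function, no shorthand
res_fun <- mutate(toy, across(rt, function(x) round(x, 0)))

res_dot$rt
```

All three round `rt` to whole numbers (702, 471, 639); which spelling to use is a matter of taste.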

Mutate data with c_across()

Apply a function across different columns.

Example: get the mean of all rts for each participant (row).

# Long way
mutate(blomkvist, mean_rt = (rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)/4)
# A tibble: 342 × 8
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd mean_rt
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1    84 former      702.       780.     1009        963.    863.
2     2    37 no          471.       497       738.       692.    599.
3     3    62 yes         639.       638       878        786     735.
4     4    85 former      708        639.      902.      1374.    906.
5     5    73 former      607.       652       923        805     747.
# … with 337 more rows

Mutate data with c_across()

# Long way
mutate(blomkvist, mean_rt = (rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)/4)
# Still a lot of typing :(
blomkvist_rowwise <- rowwise(blomkvist) # each row / ppt is a group
mutate(blomkvist_rowwise, mean_rt = mean(c_across(c(rt_hand_d, rt_hand_nd, rt_foot_d, rt_foot_nd))))
# A tibble: 342 × 8
# Rowwise: 
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd mean_rt
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1    84 former      702.       780.     1009        963.    863.
2     2    37 no          471.       497       738.       692.    599.
3     3    62 yes         639.       638       878        786     735.
4     4    85 former      708        639.      902.      1374.    906.
5     5    73 former      607.       652       923        805     747.
# … with 337 more rows

Mutate data with c_across()

# Long way
mutate(blomkvist, mean_rt = (rt_hand_d + rt_hand_nd + rt_foot_d + rt_foot_nd)/4)
# That's better :)
blomkvist_rowwise <- rowwise(blomkvist) # each row / ppt is a group
mutate(blomkvist_rowwise, mean_rt = mean(c_across(starts_with("rt_"))))
# A tibble: 342 × 8
# Rowwise: 
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd mean_rt
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1    84 former      702.       780.     1009        963.    863.
2     2    37 no          471.       497       738.       692.    599.
3     3    62 yes         639.       638       878        786     735.
4     4    85 former      708        639.      902.      1374.    906.
5     5    73 former      607.       652       923        805     747.
# … with 337 more rows

Mutate data with c_across()

# Without rowwise grouping, this returns the grand average:
mutate(blomkvist, mean_rt = mean(c_across(starts_with("rt_"))))
# A tibble: 342 × 8
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd mean_rt
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>   <dbl>
1     1    84 former      702.       780.     1009        963.    760.
2     2    37 no          471.       497       738.       692.    760.
3     3    62 yes         639.       638       878        786     760.
4     4    85 former      708        639.      902.      1374.    760.
5     5    73 former      607.       652       923        805     760.
# … with 337 more rows
# which is not what we want.
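
Since `rowwise()` is a kind of grouping, the rule about ungrouping applies here too. A sketch on made-up data (`toy` is hypothetical): compute the by-row mean, then drop the rowwise grouping so later operations work on the whole data set again.

```r
library(dplyr)

# Hypothetical toy data: two participants, two rt_ columns
toy <- tibble(
  id        = 1:2,
  rt_hand_d = c(700, 500),
  rt_foot_d = c(900, 700)
)

toy_rowwise <- rowwise(toy)  # each row is its own group
toy_means   <- mutate(toy_rowwise,
                      mean_rt = mean(c_across(starts_with("rt_"))))
toy_means   <- ungroup(toy_means)  # drop the rowwise grouping again

toy_means$mean_rt
```

Each participant gets their own mean (800 and 600), and the result is a plain tibble again.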

Filter data with across()

filter(blomkvist, rt_hand_d > 1000, rt_hand_nd > 1000, rt_foot_d > 1000, rt_foot_nd > 1000)
# A tibble: 12 × 7
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1    70    85 no         1102.      1352.     1458       1335.
2    96    77 no         1241.      1651.     1217.      1266.
3   127    96 no         1179.      1090      1546       4820 
4   152    68 no         1030.      1095      1137       1075.
5   171    92 no         1025.      1026.     1379.      1310.
# … with 7 more rows

Filter data with across()

# Note: recent dplyr versions deprecate across() inside filter() in favour of if_all()
filter(blomkvist, across(starts_with("rt_"), ~ . > 1000 ))
# A tibble: 12 × 7
     id   age smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1    70    85 no         1102.      1352.     1458       1335.
2    96    77 no         1241.      1651.     1217.      1266.
3   127    96 no         1179.      1090      1546       4820 
4   152    68 no         1030.      1095      1137       1075.
5   171    92 no         1025.      1026.     1379.      1310.
# … with 7 more rows
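
For filtering on several columns at once, dplyr (version 1.0.4 or later, as far as I know) also provides `if_all()` and `if_any()`. A sketch on made-up data (`toy` is hypothetical) showing the difference between "all columns" and "at least one column":

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  id        = 1:3,
  rt_hand_d = c(1200,  900, 700),
  rt_foot_d = c(1500, 1300, 800)
)

# Keep rows where every rt_ column exceeds 1000
all_slow <- filter(toy, if_all(starts_with("rt_"), ~ . > 1000))

# Keep rows where at least one rt_ column exceeds 1000
any_slow <- filter(toy, if_any(starts_with("rt_"), ~ . > 1000))

all_slow$id
any_slow$id
```

Only participant 1 is above 1000 in both columns; participants 1 and 2 are above 1000 in at least one.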

Summarise variables with across()

summarise(blomkvist, rt_hand_d_mean = mean(rt_hand_d),
                     rt_hand_nd_mean = mean(rt_hand_nd))
# A tibble: 1 × 2
  rt_hand_d_mean rt_hand_nd_mean
           <dbl>           <dbl>
1           638.            640.

Summarise variables with across()

summarise(blomkvist, across(c(rt_hand_d, rt_hand_nd), mean))
# A tibble: 1 × 2
  rt_hand_d rt_hand_nd
      <dbl>      <dbl>
1      638.       640.

Summarise variables with across()

summarise(blomkvist, across(starts_with("rt_"), mean))
# A tibble: 1 × 4
  rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
      <dbl>      <dbl>     <dbl>      <dbl>
1      638.       640.      893.       871.

Summarise variables with across()

summarise(blomkvist, across(starts_with("rt_"), list(mean = mean, sd = sd)))
# A tibble: 1 × 8
  rt_hand_d_mean rt_hand_d_sd rt_hand_nd_mean rt_hand_nd_sd rt_foot_d_mean
           <dbl>        <dbl>           <dbl>         <dbl>          <dbl>
1           638.         206.            640.          188.           893.
# … with 3 more variables: rt_foot_d_sd <dbl>, rt_foot_nd_mean <dbl>,
#   rt_foot_nd_sd <dbl>
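
When columns contain missing values, the functions in the list can be written with `~` so that extra arguments such as `na.rm = TRUE` can be set. A sketch on made-up data (`toy` and the name `n_missing` are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with missing values
toy <- tibble(
  rt_hand_d = c(700, NA, 500),
  rt_foot_d = c(900, 700, NA)
)

toy_summary <- summarise(toy,
                         across(starts_with("rt_"),
                                list(mean      = ~ mean(., na.rm = TRUE),
                                     n_missing = ~ sum(is.na(.)))))

names(toy_summary)
```

Without `.names`, the new columns are named column first, then function name (e.g. `rt_hand_d_mean`), as in the output above.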

Using across() to mutate and summarise variables

Open exercise 6 script.

Pivoting data: tidy data

Tidy data (each variable in its own column, each observation in its own row) is also called a long data format.

Pivoting data: wide format

blomkvist_wide <- select(blomkvist, id, starts_with("rt_"))
blomkvist_wide
# A tibble: 342 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# … with 337 more rows

RT is distributed across 4 columns: useful for certain analyses but isn’t tidy.

Pivoting data to long format

has never been easier

blomkvist_wide
# A tibble: 342 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# … with 337 more rows
pivot_longer(blomkvist_wide, cols = starts_with("rt_"))
# A tibble: 1,368 × 3
     id name       value
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# … with 1,363 more rows

Pivoting data to long format

pivot_longer(blomkvist_wide, cols = -id) # shorthand
# A tibble: 1,368 × 3
     id name       value
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# … with 1,363 more rows
pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = "variable", 
             values_to = "rt")
# A tibble: 1,368 × 3
     id variable      rt
  <dbl> <chr>      <dbl>
1     1 rt_hand_d   702.
2     1 rt_hand_nd  780.
3     1 rt_foot_d  1009 
4     1 rt_foot_nd  963.
5     2 rt_hand_d   471.
# … with 1,363 more rows

Pivoting data to long format

pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = c(".value", 
                          "response_by", 
                          "dominant"), 
             names_pattern = "(.+)_(.+)_(.+)")
# A tibble: 1,368 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# … with 1,363 more rows

Pivoting data to long format

# Flexible naming pattern
pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = c(".value", 
                          "response_by", 
                          "dominant"), 
             names_pattern = "(.+)_(.+)_(.+)")
# A tibble: 1,368 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# … with 1,363 more rows
# Simplifies to
pivot_longer(blomkvist_wide, 
             cols = -id, 
             names_to = c(".value", 
                          "response_by", 
                          "dominant"), 
             names_sep = "_")
# A tibble: 1,368 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# … with 1,363 more rows
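
What `names_sep` and `".value"` do can be traced on a tiny example. This is a sketch with made-up data (`toy_wide` is hypothetical): each column name is split at `_` into three pieces; the first piece (`rt`) becomes the name of the value column because of `".value"`, the other two become new variables.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in wide format
toy_wide <- tibble(
  id         = 1:2,
  rt_hand_d  = c(700, 500),
  rt_hand_nd = c(780, 520),
  rt_foot_d  = c(1000, 740)
)

toy_long <- pivot_longer(toy_wide,
                         cols = -id,
                         names_to = c(".value",       # "rt"   -> value column
                                      "response_by",  # "hand" / "foot"
                                      "dominant"),    # "d" / "nd"
                         names_sep = "_")

toy_long
```

Two participants with three rt columns each give six rows in long format.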

Pivoting data back to wide format

bk_long <- pivot_longer(blomkvist_wide, 
                        cols = -id, 
                        names_to = c(".value", "response_by", "dominant"), 
                        names_sep = "_")
bk_long
# A tibble: 1,368 × 4
     id response_by dominant    rt
  <dbl> <chr>       <chr>    <dbl>
1     1 hand        d         702.
2     1 hand        nd        780.
3     1 foot        d        1009 
4     1 foot        nd        963.
5     2 hand        d         471.
# … with 1,363 more rows

Pivoting data back to wide format

pivot_wider(bk_long, 
            names_from = response_by, 
            values_from = rt)
# A tibble: 684 × 4
     id dominant  hand  foot
  <dbl> <chr>    <dbl> <dbl>
1     1 d         702. 1009 
2     1 nd        780.  963.
3     2 d         471.  738.
4     2 nd        497   692.
5     3 d         639.  878 
# … with 679 more rows

Pivoting data back to wide format

pivot_wider(bk_long, 
            names_from = c(response_by, dominant), 
            values_from = rt)
# A tibble: 342 × 5
     id hand_d hand_nd foot_d foot_nd
  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1     1   702.    780.  1009     963.
2     2   471.    497    738.    692.
3     3   639.    638    878     786 
4     4   708     639.   902.   1374.
5     5   607.    652    923     805 
# … with 337 more rows

Pivoting data back to wide format

pivot_wider(bk_long, 
            names_from = c(response_by, dominant), 
            values_from = rt, 
            names_prefix = "rt_")
# A tibble: 342 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1     1      702.       780.     1009        963.
2     2      471.       497       738.       692.
3     3      639.       638       878        786 
4     4      708        639.      902.      1374.
5     5      607.       652       923        805 
# … with 337 more rows
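
`pivot_wider()` and `pivot_longer()` undo each other. A round-trip sketch on made-up data (`toy_long` is hypothetical): the prefix added with `names_prefix` when going wide is stripped again with `names_prefix` when going back to long.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in long format
toy_long <- tibble(
  id          = c(1, 1, 2, 2),
  response_by = c("hand", "foot", "hand", "foot"),
  rt          = c(700, 1000, 500, 740)
)

# Long to wide: one rt_ column per response type
toy_wide <- pivot_wider(toy_long,
                        names_from = response_by,
                        values_from = rt,
                        names_prefix = "rt_")

# Wide back to long: remove the "rt_" prefix from the names again
toy_back <- pivot_longer(toy_wide,
                         cols = -id,
                         names_to = "response_by",
                         names_prefix = "rt_",
                         values_to = "rt")

names(toy_wide)
```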

Pivoting data

Continue with exercise 7

Combine data: overview

# Combine two datasets side-by-side
bind_cols(data_1, data_2)
# Stacking two data sets
bind_rows(data_1, data_2)
# Keep all rows of data_1 and add data_2
left_join(data_1, data_2)
# Keep all rows of data_2 and add data_1
right_join(data_1, data_2)
# Include all rows of both data sets
full_join(data_1, data_2)
# Include data that is present in both data sets
inner_join(data_1, data_2)

Combine data: bind_cols()

bk_id_age <- select(blomkvist, id, age)
bk_id_age
# A tibble: 354 × 2
     id   age
  <dbl> <dbl>
1     1    84
2     2    37
3     3    62
4     4    85
5     5    73
# … with 349 more rows
bk_med_smoke <- select(blomkvist, medicine, smoker)
bk_med_smoke
# A tibble: 354 × 2
  medicine smoker
     <dbl> <chr> 
1        8 former
2        1 no    
3        0 yes   
4        4 former
5        5 former
# … with 349 more rows

Combine data: bind_cols()

# Combine two datasets side-by-side
bind_cols(bk_id_age, bk_med_smoke)
# A tibble: 354 × 4
     id   age medicine smoker
  <dbl> <dbl>    <dbl> <chr> 
1     1    84        8 former
2     2    37        1 no    
3     3    62        0 yes   
4     4    85        4 former
5     5    73        5 former
# … with 349 more rows

Combine data: bind_rows()

bk_former_smokers <- filter(blomkvist, smoker == "former")
bk_smokers <- filter(blomkvist, smoker == "yes")

Combine data: bind_rows()

# Stacking two data sets
bk_smoking <- bind_rows(bk_former_smokers, bk_smokers)
slice_head(bk_smoking, n = 1)
# A tibble: 1 × 9
     id sex     age medicine smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr> <dbl>    <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1     1 male     84        8 former      702.       780.      1009       963.
slice_tail(bk_smoking, n = 1)
# A tibble: 1 × 9
     id sex     age medicine smoker rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl> <chr> <dbl>    <dbl> <chr>      <dbl>      <dbl>     <dbl>      <dbl>
1   342 male     60        3 yes          708        686       909        854

This also works when columns are not in the same order as long as the names and types match.
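
A sketch of this name matching on made-up data (`part_a` and `part_b` are hypothetical): the columns of the second data set are in a different order and include one extra column, which `bind_rows()` fills with NA for the rows of the first data set.

```r
library(dplyr)

# Hypothetical toy data sets
part_a <- tibble(id = 1:2, smoker = c("former", "former"))
part_b <- tibble(smoker = "yes", id = 3L, age = 60)  # other order, extra column

stacked <- bind_rows(part_a, part_b)

stacked
```

Columns are matched by name, not by position; `age` is NA for the two rows that came from `part_a`.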

Create two example data sets

bk_smoker <- select(blomkvist, smoker, age)
bk_smoker_o40 <- filter(bk_smoker, age > 40)
bk_sex <- select(blomkvist, sex, age)
bk_sex_u50 <- filter(bk_sex, age < 50)

Create two example data sets

bk_smoker_o40
# A tibble: 248 × 2
  smoker   age
  <chr>  <dbl>
1 former    84
2 yes       62
3 former    85
4 former    73
5 no        65
# … with 243 more rows
bk_sex_u50
# A tibble: 144 × 2
  sex      age
  <chr>  <dbl>
1 female    37
2 female    30
3 female    49
4 female    25
5 male      27
# … with 139 more rows

Combine data: left_join()

# Keep all rows of bk_sex_u50
bk_joined <- left_join(bk_sex_u50, 
                       bk_smoker_o40, by = "age")
bk_joined
# A tibble: 290 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    37 <NA>  
2 female    30 <NA>  
3 female    49 former
4 female    49 no    
5 female    49 former
# … with 285 more rows
range(bk_joined$age)
[1] 20 49

Combine data: right_join()

# Keep all rows of bk_smoker_o40
bk_joined <- right_join(bk_sex_u50, 
                        bk_smoker_o40, by = "age")
bk_joined
# A tibble: 394 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    49 former
2 female    49 no    
3 female    49 former
4 female    49 former
5 female    49 no    
# … with 389 more rows
range(bk_joined$age)
[1] 41 99

Combine data: full_join()

# Include all rows of both data sets
bk_joined <- full_join(bk_sex_u50, 
                       bk_smoker_o40, by = "age")
bk_joined
# A tibble: 500 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    37 <NA>  
2 female    30 <NA>  
3 female    49 former
4 female    49 no    
5 female    49 former
# … with 495 more rows
range(bk_joined$age)
[1] 20 99

Combine data: inner_join()

# Include only info that is present in both data sets
bk_joined <- inner_join(bk_sex_u50, 
                        bk_smoker_o40, by = "age")
bk_joined
# A tibble: 184 × 3
  sex      age smoker
  <chr>  <dbl> <chr> 
1 female    49 former
2 female    49 no    
3 female    49 former
4 female    49 former
5 female    49 no    
# … with 179 more rows
range(bk_joined$age)
[1] 41 49
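
The four joins above return more rows than either input because `age` is not unique: every row is matched with every row of the other data set that has the same age. A sketch on made-up data (`sex_tbl` and `smoker_tbl` are hypothetical) where the row counts can be traced by hand:

```r
library(dplyr)

# Hypothetical toy data sets; age 49 occurs twice in smoker_tbl
sex_tbl    <- tibble(age = c(30, 49),     sex    = c("female", "male"))
smoker_tbl <- tibble(age = c(49, 49, 60), smoker = c("no", "former", "yes"))

joined_left  <- left_join(sex_tbl, smoker_tbl, by = "age")   # all rows of sex_tbl
joined_right <- right_join(sex_tbl, smoker_tbl, by = "age")  # all rows of smoker_tbl
joined_full  <- full_join(sex_tbl, smoker_tbl, by = "age")   # all rows of both
joined_inner <- inner_join(sex_tbl, smoker_tbl, by = "age")  # matching rows only

c(left  = nrow(joined_left),  right = nrow(joined_right),
  full  = nrow(joined_full),  inner = nrow(joined_inner))
```

Here the row for age 49 is duplicated in every join because it has two matches. In practice, joining by a unique identifier such as `id` avoids this kind of multiplication.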

Continue with exercise 8

The pipe: %>%

  • %>% moves or “pipes” the result forward into the next function
  • f(x) is the same as x %>% f()
  • Shortcut in RStudio: Ctrl + Shift + M
select(data, myvar1, myvar2)
# or
data %>% select(myvar1, myvar2)

*assumes first argument is data
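
A sketch of the equivalence on made-up data (`toy` is hypothetical): the nested call and the piped version should give the same result. When the data is not the first argument of a function, the `.` placeholder of `%>%` says where the piped object should go.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = 1:3, rt = c(702, 471, 639))

# Nested calls, read inside-out
nested <- filter(select(toy, id, rt), rt > 500)

# Piped calls, read top to bottom
piped <- toy %>%
  select(id, rt) %>%
  filter(rt > 500)

all.equal(as.data.frame(nested), as.data.frame(piped))

# lm() takes the formula first, so the data is placed with "."
fit <- toy %>% lm(rt ~ id, data = .)
```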

The pipe: %>%

# Instead of 
data_1 <- first_step(data)
data_2 <- second_step(data_1)
data_3 <- third_step(data_2)
data_4 <- fourth_step(data_3)

# or
fourth_step(
  third_step(
    second_step(
      first_step(data)
    )
  )
)
# Just do
data %>% 
  first_step() %>%
  second_step() %>%
  third_step() %>%
  fourth_step()

The pipe: %>%

bk <- read_csv("../data/blomkvist.csv")
bk_rts <- select(bk, id, starts_with("rt_"))
bk_rts_flt <- filter(bk_rts, rt_hand_d > 1500)
bk_rts_flt
# A tibble: 4 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1   181     1552.      1072.     1068       1196.
2   262     1595.      1162.     1508       1496.
3   290     2076       1845     17094.      3874.
4   305     1527       1416.     1713.      2471.
read_csv("../data/blomkvist.csv") %>% 
  select(id, starts_with("rt_")) %>% 
  filter(rt_hand_d > 1500)
# A tibble: 4 × 5
     id rt_hand_d rt_hand_nd rt_foot_d rt_foot_nd
  <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
1   181     1552.      1072.     1068       1196.
2   262     1595.      1162.     1508       1496.
3   290     2076       1845     17094.      3874.
4   305     1527       1416.     1713.      2471.

The pipe: %>%

Continue with exercise 9

Recommended reading

References

Andrews, Mark, and Jens Roeser. 2021. Psyntur: Helper Tools for Teaching Statistical Data Analysis. https://CRAN.R-project.org/package=psyntur.

Blomkvist, Andreas W., Fredrik Eika, Martin T. Rahbek, Karin D. Eikhof, Mette D. Hansen, Malene Søndergaard, Jesper Ryg, Stig Andersen, and Martin G. Jørgensen. 2017. “Reference Data on Reaction Time and Aging Using the Nintendo Wii Balance Board: A Cross-Sectional Study of 354 Subjects from 20 to 99 Years of Age.” PLoS One 12 (12): e0189598. https://doi.org/10.1371/journal.pone.0189598.

Roeser, Jens, Sven De Maeyer, Mariëlle Leijten, and Luuk Van Waes. 2021. “Modelling Typing Disfluencies as Finite Mixture Process.” Reading and Writing, 1–26.