Handling and exploring data in the tidyverse

About me: Jens Roeser

early career research fellow @ psychology department (Nottingham Trent University)
theory: psycholinguistics; language production / comprehension / acquisition
methods: Bayesian modelling (talk to me about mixture models, Roeser et al. 2021) in Stan; keystroke logging; eyetracking
teaching: statistics – R of course (psyntur, Andrews and Roeser 2021); cognitive psychology; language acquisition
twitter: https://twitter.com/jens_roeser

Outline for today

Data wrangling with tidyverse (50%)
Data viz with ggplot2 (40%)
R-Markdown (10%)
Lots of hands-on exercises

Why should I care?

Why use R (or code in general) to handle data?

**What do you think?**

> 70% to 80% of data analysis is data wrangling
Open source: R is and always will be free of charge
Reduce human error
Reduce manual work
Reproducibility: publish your code and look at code of other researchers
Flexibility: different ways of looking at data
Quickly growing number of available add-ons (packages) for data analysis
Speed: faster than manually transforming data in spreadsheets
Scale: large data sets quickly become impractical to process in spreadsheets
Large community of friendly peer support

Rules!

Never change your data manually; document everything in code.
- Retrospective amendments made easy
- Documentation / reproducibility
Organized working environment
- .Rproj with one directory per project and sub-directories for scripts, data, plots, etc.
- Short scripts: less code with one clear purpose is always better (the test: does the name of your script suggest a specific or a general purpose?)
Comment your code (# Ceci n'est pas un comment!)
If possible, use tidyverse instead of base R.

Download repository

Download: https://github.com/jensroes/hallam-r-workshop
Click on: Code > Download ZIP > unzip directory on your machine.
Open project by double clicking on hallam-r-workshop.Rproj
wrangling/exercises/: exercises associated with each topic
data/: scripts read data from here
wrangling/slides.Rmd: these slides in R markdown format (.html format provided as well)

Goals of data wrangling

Data come in various formats (long, wide) and file types (xlsx, ods, json, csv, sav)
No format is suitable for every goal
Fluency in data wrangling gives you a lot of power.
Make the data format suitable for its use: e.g. statistical models (correlations, linear regression), functions, data viz, summary tables
Calculate new variables, filter or combine data
Reveal information
Summarise information
(also creating counterbalanced, randomised stimulus lists)

`tidyverse`

Collection of R packages for data science that share:

common data philosophies
grammar
data structures
best practice
designed to work together

`tidyverse`

# Installs the whole collection of tidyverse packages
install.packages("tidyverse")

# Loads only the core packages (e.g. dplyr, tidyr, readr, ggplot2)
library(tidyverse)

`tidyverse`

Tidy data

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
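
To make the three rules concrete, here is a minimal sketch with made-up data. The tibble `untidy` and its column names are hypothetical and not part of the workshop data. Its two reaction-time columns hold the same variable measured in two conditions, so `pivot_longer()` is used to get one observation per row.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: one row per participant, one column per condition
untidy <- tibble(
  id = c(1, 2),
  rt_easy = c(510, 480),
  rt_hard = c(720, 690)
)

# Tidy version: one row per observation (participant x condition)
tidy <- pivot_longer(
  untidy,
  cols = c(rt_easy, rt_hard),
  names_to = "condition",
  names_prefix = "rt_",
  values_to = "rt"
)

tidy
```

After reshaping, `condition` and `rt` are each a single column, which is the layout most tidyverse functions expect.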

Why?

Placing variables in columns takes advantage of R’s vectorised nature (faster processing, more compact code).
A consistent data structure makes related tools easier to learn because they share underlying principles (they expect similar input structures).

`tidyverse`: verbs

Functions that do specific things to our data.
Must know: read_csv, write_csv, glimpse, select, filter, mutate, group_by / ungroup, summarise, pivot_wider / _longer, _join, bind_rows / _cols
Also important: count, pull, slice, across, recode, unique, n, where, everything, ~ and ., map, starts_with, ends_with, contains, separate, unite, transmute
There are more but these are the most important ones.
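
As a rough illustration of how some of these verbs and helpers combine, the sketch below uses a small made-up tibble. `toy` and its columns are hypothetical, chosen only so the example runs on its own.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (not the workshop data set)
toy <- tibble(
  id = 1:6,
  group = c("a", "a", "a", "b", "b", "b"),
  rt_hand = c(500, 520, 540, 600, 620, 640),
  rt_foot = c(700, 720, 740, 800, 820, 840)
)

# How many rows are there per group?
count(toy, group)

# Mean of every column whose name starts with "rt_", separately per group
group_means <- toy %>%
  group_by(group) %>%
  summarise(across(starts_with("rt_"), mean)) %>%
  ungroup()

group_means

# Extract a single column as a plain vector
pull(group_means, rt_hand)
```

Here `starts_with()` selects the columns, `across()` applies `mean()` to each of them, and `group_by()` makes `summarise()` return one row per group.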

Example data set: Blomkvist et al. (2017)

Age-related changes in cognitive performance through adolescence and adulthood in a real-world task.

Real-world task: StarCraft 2

Real-time strategy video game
Nintendo Wii Balance Board

Example data set: Blomkvist et al. (2017)

blomkvist <- read_csv("../data/blomkvist.csv")
glimpse(blomkvist)

Rows: 354
Columns: 11
$ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,…
$ sex         <chr> "male", "female", "female", "female", …
$ age         <dbl> 84, 37, 62, 85, 73, 65, 30, 49, 83, 58…
$ medicine    <dbl> 8, 1, 0, 4, 5, 0, 0, 0, 11, 0, 0, 4, 3…
$ smoker      <chr> "former", "no", "yes", "former", "form…
$ pal_work    <dbl> NA, 2, NA, NA, NA, 1, 3, 1, NA, 4, 2, …
$ pal_leisure <dbl> 1, 2, 2, 2, 3, 3, 2, 2, 1, 3, 3, 2, 1,…
$ rt_hand_d   <dbl> 702, 471, 639, 708, 607, 542, 571, 509…
$ rt_hand_nd  <dbl> 780, 497, 638, 639, 652, 499, 527, 547…
$ rt_foot_d   <dbl> 1009, 738, 878, 902, 923, 687, 778, 74…
$ rt_foot_nd  <dbl> 963, 692, 786, 1374, 805, 600, 750, 79…

Average reaction time (rt) of dominant (_d) or non-dominant (_nd) hand or foot in msecs
medicine: number of drugs used daily
pal: physical activity level from 1 (least active) to 4 (most active)
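
The four `rt_` columns encode two variables in their names (limb and dominance). The sketch below shows one possible way to separate them. It rebuilds the first two participants by hand from the `glimpse()` output so that it runs without the data file; the object names `rt_wide`, `rt_long` and `limb_means` are made up for this illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# First two participants, values copied from the glimpse() output
rt_wide <- tibble(
  id = c(1, 2),
  rt_hand_d = c(702, 471),
  rt_hand_nd = c(780, 497),
  rt_foot_d = c(1009, 738),
  rt_foot_nd = c(963, 692)
)

# Split e.g. "rt_hand_nd" into limb = "hand" and dominance = "nd"
rt_long <- pivot_longer(
  rt_wide,
  cols = starts_with("rt_"),
  names_prefix = "rt_",
  names_to = c("limb", "dominance"),
  names_sep = "_",
  values_to = "rt"
)

# Mean reaction time per limb across these two participants
limb_means <- summarise(group_by(rt_long, limb), mean_rt = mean(rt))
limb_means
```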

tbls (tibble)

The tidyverse operates on tibbles
A tibble is a type of data frame
Easier to read in the console
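
A small sketch of the relationship between the two structures, using made-up values (`df` and `tbl` are hypothetical names):

```r
library(tibble)

# The same data as a base R data frame and as a tibble
df <- data.frame(id = 1:3, rt = c(702, 471, 639))
tbl <- as_tibble(df)

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame"

# A tibble is still a data frame, so base R functions keep working
is.data.frame(tbl)
nrow(tbl)

# Printing a tibble shows its dimensions and the type of each column
tbl
```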

# Imports data as data frame
data_as_frame <- read.csv("path_to_data/data.csv")
# Imports data as tibble
data_as_tibble <- read_csv("path_to_data/data.csv")

.csv: comma-separated values file
readr package: e.g. read_csv, read_delim, read_tsv
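
To try the readr functions without depending on a particular file, the sketch below writes a made-up tibble to a temporary file and reads it back. The object names and the data are hypothetical.

```r
library(readr)
library(tibble)

# Hypothetical data written to a temporary file so the example is self-contained
toy <- tibble(id = c(1, 2, 3), sex = c("male", "female", "female"))
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# read_csv() guesses a type for each column and returns a tibble
toy_csv <- read_csv(path, show_col_types = FALSE)

# read_delim() does the same once the delimiter is given explicitly
toy_delim <- read_delim(path, delim = ",", show_col_types = FALSE)

toy_csv
```

`show_col_types = FALSE` only silences the message about guessed column types; leaving it out is fine when you want to check what readr inferred.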

For other data formats:

haven package: e.g. read_dta, read_sav, read_sas
readxl package: e.g. read_excel, read_xls, read_xlsx

# Summarise data structure in base R
str(data_as_frame)
# Summarise data structure in tidyverse
glimpse(data_as_tibble)

Open exercise script 1

`tidyverse` functions

Functions follow the principle

function_name(data_name, argument)

where argument specifies which variable / condition etc. the function has to operate on.

# Picking out variables
select(data, variable1) 
# Subsetting data
filter(data, variable > 100) 
# Change / add variables
mutate(data, variable_sqr = variable^2)
# Aggregate data
summarise(data, mean_var = mean(variable))
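
The four calls above use placeholder names. A runnable version might look like the sketch below, which rebuilds the first four participants by hand from the `glimpse()` output. The object names (`dat`, `slow`, `dat_secs`, `rt_summary`) and the 600 ms cut-off are chosen only for illustration.

```r
library(dplyr)
library(tibble)

# First four participants, values copied from the glimpse() output
dat <- tibble(
  id = 1:4,
  age = c(84, 37, 62, 85),
  rt_hand_d = c(702, 471, 639, 708)
)

# Picking out variables
select(dat, id, rt_hand_d)

# Subsetting data: keep participants slower than 600 ms
slow <- filter(dat, rt_hand_d > 600)

# Change / add variables: reaction time in seconds
dat_secs <- mutate(dat, rt_hand_d_secs = rt_hand_d / 1000)

# Aggregate data: one row holding the mean reaction time
rt_summary <- summarise(dat, mean_rt = mean(rt_hand_d))
rt_summary
```

Each function takes the data as its first argument and returns a new tibble; `dat` itself is never modified.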