R Tidyverse - tutorials

Quick introduction to RStudio and R Markdown

Your first R markdown document

The first step is to open RStudio and click on the + button in the upper left corner.
Select R Markdown:

This is what you should get:

Now you are ready to start working!

R markdown options

Working in R markdown makes coding so much easier! What you need to know is quite simple:

You write code in chunks:

# This is a chunk

Anything you write outside chunks will not be run.

Headlines

You can use R Markdown to create nice summaries of your analyses. You can also use HTML to set up headlines, change the font, etc.

Begin a headline with a single pound symbol (#) followed by a space, then write your headline text.
For subheadlines, use multiple pound symbols (##, ###, ####, etc.) to indicate the hierarchy level.
Each additional pound symbol represents a deeper level of hierarchy.
Ensure there is a space between the pound symbols and the text of the headline or subheadline.

Changing Font and Face

Use asterisks (*) or underscores (_) to emphasize text.
Add one * or _ before and after the word/sentence. Make sure there is no space. Examples: *italics* or _italics_
Add two ** or __ before and after the word/sentence. Make sure there is no space. Examples: **bold** or __bold__
Add three *** or ___ before and after the word/sentence. Make sure there is no space. Examples: ***bold and italics*** or ___bold and italics___
Additionally, you can use backticks to indicate code or monospace font (e.g., `code`)

You can also freely decide which chunks should be printed in the knitted document, which should appear but not run, etc.

Common chunk options include:

echo = FALSE to suppress the display of code within the document but still show its output.
eval = FALSE to prevent code execution.
include = FALSE to hide both code and output.
message = FALSE to suppress messages generated by code execution.
warning = FALSE to suppress warnings generated by code execution.
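
Rather than repeating these options in every chunk, they can also be set once for the whole document. The following is a minimal sketch, assuming the knitr package is installed (it is the package R Markdown uses to knit documents). The particular defaults chosen here are only an illustration.

```r
# Typically placed in a first "setup" chunk of the .Rmd file.
# These values become the defaults for every chunk that follows;
# individual chunks can still override them.
knitr::opts_chunk$set(
  echo = TRUE,      # show the code
  message = FALSE,  # hide messages (e.g. from library())
  warning = FALSE   # hide warnings
)
```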

Getting started

Install packages

### Install packages 
# install.packages("ggplot2")
# install.packages("ggridges")
# install.packages("tidyverse")
# install.packages("janitor")
# install.packages("kableExtra")
# install.packages("unikn")
# install.packages("ggpubr")
# install.packages("sjPlot")
# install.packages("lme4")
# install.packages("lmerTest")
# install.packages("car")
# install.packages("readxl")

### Load libraries 
library(tidyverse)
library(ggplot2)
library(ggridges)
library(janitor) # helpful to clean col names `clean_names()`
library(kableExtra) # to display and edit tables 
library(unikn) # for uni Konstanz theme 
library(ggpubr) # to arrange plots 
library(sjPlot) # to visualize model estimates/predicted values 
library(lme4)
library(lmerTest)
library(car)
library(readxl) # read excel data 

# set options for tables 
bs_style <- c("striped", "hover", "condensed", "responsive")
options(kable_styling_bootstrap_options = bs_style)

Load data

It’s important that you save your .Rmd file in the same folder where your data are stored. This makes loading data much easier, and you won’t need to set a working directory.
Use the function read_csv("name_data.csv") from the package readr (loaded together with tidyverse) to load your data.
If your data are in .xlsx format, you can use the function read_excel() from the package readxl.

### load data 
df <- read_csv("complete.csv")
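
To see what read_csv() does without needing the tutorial's data file, here is a self-contained sketch. It writes a tiny, invented CSV to a temporary file and reads it back. The file contents and the object name toy are made up for illustration, and show_col_types = FALSE only silences the column-type message.

```r
library(readr)

# Write a small, invented CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("ID,age,group",
             "P01,20.3,DYS",
             "P02,14.2,TD"), path)

# read_csv() guesses the column types: ID and group are read as
# character, age as numeric
toy <- read_csv(path, show_col_types = FALSE)
toy
```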

Structure of the dataset:

ID: ID participants
age
L1_code: Participants’ native language
education: Participants’ education level
name_school: Name of the school where participants were tested
class: School grade (1 to 5 + uni)
group: DYS = dyslexics; TD = typically developing participants
other_diagnoses: Other diagnoses (beyond dyslexia for DYS)
n_other_diagnoses: Number of other diagnoses (beyond dyslexia for DYS)
age_diagnosis: Age of dyslexia diagnosis
AoO: Age of Onset for English (L2)
wr_time_z: Word reading time (z score)
wr_error_z: Word reading errors (z score)
wr_syl_z: Syllable per second in word reading (z score)
nwr_time_z: Nonword reading time (z score)
nwr_error_z: Nonword reading errors (z score)
nwr_syl_z: Syllable per second in nonword reading (z score)
group.exclusion: Another categorization of group (ignore this)
pa.rt, pa.acc, pa.bis: Phonological awareness measures (RT, accuracy and speed-accuracy trade off)
forward and backward: memory tests
lexita.acc and lexita.rt, lexita.bis: Italian vocabulary (RT, accuracy and speed-accuracy trade off)
lextale.acc and lextale.rt, lextale.bis: English vocabulary (RT, accuracy and speed-accuracy trade off)
it.ok.acc: Accuracy in Italian orthographic knowledge test
en.ok.acc, en.ok.rt, en.ok.bis: English orthographic knowledge (RT, accuracy and speed-accuracy trade off)
va.span.acc, va.span.d, va.span.c: Visual attention span measures
eng.prof: Self-assessed English proficiency
eng.use: Self-assessed English use
eng.read: Self-assessed English reading exposure
it.read: Self-assessed Italian reading exposure

Data manipulation with tidyverse

`select`

Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right) or type (e.g. where(is.numeric) selects all numeric columns).

Tidyverse selections implement a dialect of R where operators make it easy to select variables:

: for selecting a range of consecutive variables.

! for taking the complement of a set of variables.

& and | for selecting the intersection or the union of two sets of variables.

c() for combining selections.
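
The operators above can be tried out on a small toy data frame. This is only a sketch: the column names are invented for illustration and are not part of the tutorial's dataset.

```r
library(dplyr)

# Toy data with invented column names
toy <- tibble(id = c("a", "b"), age = c(20, 14),
              score_1 = c(1.5, 2.0), score_2 = c(0.3, 0.9),
              school = c("X", "Y"))

# `:` selects a range of consecutive columns: age, score_1, score_2
toy %>% select(age:score_2)

# `!` takes the complement: everything except school
toy %>% select(!school)

# `&` takes the intersection: numeric columns whose name starts with "score"
toy %>% select(where(is.numeric) & starts_with("score"))

# c() combines selections: id and school
toy %>% select(c(id, school))
```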

Examples: Let’s say we want to select the first three columns in our dataset

df %>% dplyr::select(1:3)

## # A tibble: 94 × 3
##    ID       age L1_code        
##    <chr>  <dbl> <chr>          
##  1 LAUVEN  20.3 IT             
##  2 LEVEN   20.5 IT             
##  3 LAE02   14.2 IT             
##  4 LAE03   14.6 IT             
##  5 LAE04   18.7 IT             
##  6 LAE05   17.4 IT and ROM (HL)
##  7 LAE06   14.3 IT             
##  8 LAE07   14.9 IT             
##  9 LAE08   14.1 IT             
## 10 LAE09   17.1 IT             
## # ℹ 84 more rows

Here, we want to select the columns ID, age, and the reading measures: wr_time_z, wr_error_z, nwr_time_z, nwr_error_z

### Option 1: type all cols names 
df %>% dplyr::select(ID, age, wr_time_z, wr_error_z, nwr_time_z, nwr_error_z)

## # A tibble: 94 × 6
##    ID       age wr_time_z wr_error_z nwr_time_z nwr_error_z
##    <chr>  <dbl>     <dbl>      <dbl>      <dbl>       <dbl>
##  1 LAUVEN  20.3     -0.58      -1.23      -1.05       -1.45
##  2 LEVEN   20.5     -0.23       0.1       -1.05        0.43
##  3 LAE02   14.2     -0.97       0.59      -0.34        0.99
##  4 LAE03   14.6      1.65       0.92       0.63        0.18
##  5 LAE04   18.7     -4.58       1.04      -0.99       -0.43
##  6 LAE05   17.4     -1.42       0.45      -0.03       -1.23
##  7 LAE06   14.3      0.42       0.59       0.04        0.72
##  8 LAE07   14.9     -1.4       -0.39      -1.17        0.45
##  9 LAE08   14.1     -0.33      -0.39      -0.44       -0.36
## 10 LAE09   17.1      0.32       1.04       0.46        1.15
## # ℹ 84 more rows

### Option 2: Since the four reading measures are consecutive, we can also select them as a range: 
df %>% dplyr::select(ID, age, wr_time_z:nwr_error_z)

## # A tibble: 94 × 6
##    ID       age wr_time_z wr_error_z nwr_time_z nwr_error_z
##    <chr>  <dbl>     <dbl>      <dbl>      <dbl>       <dbl>
##  1 LAUVEN  20.3     -0.58      -1.23      -1.05       -1.45
##  2 LEVEN   20.5     -0.23       0.1       -1.05        0.43
##  3 LAE02   14.2     -0.97       0.59      -0.34        0.99
##  4 LAE03   14.6      1.65       0.92       0.63        0.18
##  5 LAE04   18.7     -4.58       1.04      -0.99       -0.43
##  6 LAE05   17.4     -1.42       0.45      -0.03       -1.23
##  7 LAE06   14.3      0.42       0.59       0.04        0.72
##  8 LAE07   14.9     -1.4       -0.39      -1.17        0.45
##  9 LAE08   14.1     -0.33      -0.39      -0.44       -0.36
## 10 LAE09   17.1      0.32       1.04       0.46        1.15
## # ℹ 84 more rows

In the following example we want to select (1) all the numeric columns and (2) all the character columns

df %>% dplyr::select(where(is.numeric))

## # A tibble: 94 × 29
##      age n_other_diagnoses   AoO wr_time_z wr_error_z nwr_time_z nwr_error_z
##    <dbl>             <dbl> <dbl>     <dbl>      <dbl>      <dbl>       <dbl>
##  1  20.3                 1     8     -0.58      -1.23      -1.05       -1.45
##  2  20.5                NA     6     -0.23       0.1       -1.05        0.43
##  3  14.2                NA     6     -0.97       0.59      -0.34        0.99
##  4  14.6                NA     5      1.65       0.92       0.63        0.18
##  5  18.7                NA     4     -4.58       1.04      -0.99       -0.43
##  6  17.4                NA     5     -1.42       0.45      -0.03       -1.23
##  7  14.3                NA     5      0.42       0.59       0.04        0.72
##  8  14.9                NA     6     -1.4       -0.39      -1.17        0.45
##  9  14.1                NA     6     -0.33      -0.39      -0.44       -0.36
## 10  17.1                NA     6      0.32       1.04       0.46        1.15
## # ℹ 84 more rows
## # ℹ 22 more variables: pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>, forward <dbl>,
## #   backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>, lexita.bis <dbl>,
## #   lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>, it.ok.acc <dbl>,
## #   en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, va.span.acc <dbl>,
## #   va.span.d <dbl>, va.span.c <dbl>, eng.prof <dbl>, eng.use <dbl>,
## #   eng.read <dbl>, it.read <dbl>

df %>% dplyr::select(where(is.character))

## # A tibble: 94 × 9
##    ID    L1_code education name_school class group other_diagnoses age_diagnosis
##    <chr> <chr>   <chr>     <chr>       <chr> <chr> <chr>           <chr>        
##  1 LAUV… IT      BA        university  uni   DYS   disortografia   11           
##  2 LEVEN IT      BA        university  uni   TD    <NA>            -            
##  3 LAE02 IT      HS        LAE         1     TD    <NA>            -            
##  4 LAE03 IT      HS        LAE         1     TD    <NA>            -            
##  5 LAE04 IT      HS        LAE         4     DYS   <NA>            17           
##  6 LAE05 IT and… HS        LAE         4     TD    <NA>            -            
##  7 LAE06 IT      HS        LAE         1     TD    <NA>            -            
##  8 LAE07 IT      HS        LAE         1     TD    <NA>            -            
##  9 LAE08 IT      HS        LAE         1     TD    <NA>            -            
## 10 LAE09 IT      HS        LAE         4     TD    <NA>            -            
## # ℹ 84 more rows
## # ℹ 1 more variable: group.exclusion <chr>

For more information, run the following code, or type select in the help section and select dplyr::select

??dplyr::select

`filter`

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions.

Filter operators

There are many functions and operators that are useful when constructing the expressions used to filter the data:

== : values equal to
!= : values different from
> and >= : values greater than / greater than or equal to
< and <= : values smaller than / smaller than or equal to
& : and
| : or
! : not (negates a condition)
is.na() : is NA
!is.na() : is not NA
%in% : values contained in the specified vector, e.g. c(...)
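
A few of these operators in action, as a minimal sketch on invented toy data (the values are made up and only loosely mimic the tutorial's dataset):

```r
library(dplyr)

# Toy data with one missing age
toy <- tibble(id = c("a", "b", "c", "d"),
              group = c("DYS", "TD", "TD", "DYS"),
              age = c(20, 14, NA, 17))

# Rows where group is different from "DYS" (ids b and c)
toy %>% filter(group != "DYS")

# Conditions separated by a comma are combined with AND:
# age is not missing and is at least 17 (ids a and d)
toy %>% filter(!is.na(age), age >= 17)

# Rows where id is one of the listed values (ids a and c)
toy %>% filter(id %in% c("a", "c"))
```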

Examples

Filter only the rows where group == “DYS”

df %>% filter(group == "DYS")

## # A tibble: 32 × 38
##    ID       age L1_code education name_school class group other_diagnoses
##    <chr>  <dbl> <chr>   <chr>     <chr>       <chr> <chr> <chr>          
##  1 LAUVEN  20.3 IT      BA        university  uni   DYS   disortografia  
##  2 LAE04   18.7 IT      HS        LAE         4     DYS   <NA>           
##  3 LAE16   16.6 IT      HS        LAE         3     DYS   <NA>           
##  4 LAE17   17.1 IT      HS        LAE         3     DYS   <NA>           
##  5 LAE45   19   IT      HS        LAE         5     DYS   discalculia    
##  6 LAE46   18.5 IT      HS        LAE         5     DYS   disgrafia      
##  7 LAE47   18.1 IT      HS        LAE         5     DYS   discalculia    
##  8 LC03    16.4 IT      HS        LC          3     DYS   <NA>           
##  9 LC06    19.2 IT      HS        LC          5     DYS   disgrafia      
## 10 LC11    16.6 IT      HS        LC          3     DYS   disgrafia      
## # ℹ 22 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <chr>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …

Filter all rows where group == “TD” AND n_other_diagnoses is NA

df %>% filter(group == "TD" & is.na(n_other_diagnoses))

## # A tibble: 59 × 38
##    ID      age L1_code         education name_school class group other_diagnoses
##    <chr> <dbl> <chr>           <chr>     <chr>       <chr> <chr> <chr>          
##  1 LEVEN  20.5 IT              BA        university  uni   TD    <NA>           
##  2 LAE02  14.2 IT              HS        LAE         1     TD    <NA>           
##  3 LAE03  14.6 IT              HS        LAE         1     TD    <NA>           
##  4 LAE05  17.4 IT and ROM (HL) HS        LAE         4     TD    <NA>           
##  5 LAE06  14.3 IT              HS        LAE         1     TD    <NA>           
##  6 LAE07  14.9 IT              HS        LAE         1     TD    <NA>           
##  7 LAE08  14.1 IT              HS        LAE         1     TD    <NA>           
##  8 LAE09  17.1 IT              HS        LAE         4     TD    <NA>           
##  9 LAE10  17.7 IT              HS        LAE         4     TD    <NA>           
## 10 LAE11  15.4 IT              HS        LAE         2     TD    <NA>           
## # ℹ 49 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <chr>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …

Filter all rows where name_school is “LAE” OR “LC”

### Option 1
df %>% filter(name_school == "LAE" | name_school == "LC")

## # A tibble: 56 × 38
##    ID      age L1_code         education name_school class group other_diagnoses
##    <chr> <dbl> <chr>           <chr>     <chr>       <chr> <chr> <chr>          
##  1 LAE02  14.2 IT              HS        LAE         1     TD    <NA>           
##  2 LAE03  14.6 IT              HS        LAE         1     TD    <NA>           
##  3 LAE04  18.7 IT              HS        LAE         4     DYS   <NA>           
##  4 LAE05  17.4 IT and ROM (HL) HS        LAE         4     TD    <NA>           
##  5 LAE06  14.3 IT              HS        LAE         1     TD    <NA>           
##  6 LAE07  14.9 IT              HS        LAE         1     TD    <NA>           
##  7 LAE08  14.1 IT              HS        LAE         1     TD    <NA>           
##  8 LAE09  17.1 IT              HS        LAE         4     TD    <NA>           
##  9 LAE10  17.7 IT              HS        LAE         4     TD    <NA>           
## 10 LAE11  15.4 IT              HS        LAE         2     TD    <NA>           
## # ℹ 46 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <chr>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …

### Option 2
df %>% filter(name_school %in% c("LAE", "LC") )

## # A tibble: 56 × 38
##    ID      age L1_code         education name_school class group other_diagnoses
##    <chr> <dbl> <chr>           <chr>     <chr>       <chr> <chr> <chr>          
##  1 LAE02  14.2 IT              HS        LAE         1     TD    <NA>           
##  2 LAE03  14.6 IT              HS        LAE         1     TD    <NA>           
##  3 LAE04  18.7 IT              HS        LAE         4     DYS   <NA>           
##  4 LAE05  17.4 IT and ROM (HL) HS        LAE         4     TD    <NA>           
##  5 LAE06  14.3 IT              HS        LAE         1     TD    <NA>           
##  6 LAE07  14.9 IT              HS        LAE         1     TD    <NA>           
##  7 LAE08  14.1 IT              HS        LAE         1     TD    <NA>           
##  8 LAE09  17.1 IT              HS        LAE         4     TD    <NA>           
##  9 LAE10  17.7 IT              HS        LAE         4     TD    <NA>           
## 10 LAE11  15.4 IT              HS        LAE         2     TD    <NA>           
## # ℹ 46 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <chr>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …

Run the following code for more information about filter, or type filter in the help section

??dplyr::filter

`mutate`

mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).

Examples:

In what follows, we mutate the existing column age to make sure it is stored as a numeric vector (in this dataset it already is, so the values do not change):

### Option 1
df %>% 
  mutate(age = as.double(age))

## # A tibble: 94 × 38
##    ID       age L1_code        education name_school class group other_diagnoses
##    <chr>  <dbl> <chr>          <chr>     <chr>       <chr> <chr> <chr>          
##  1 LAUVEN  20.3 IT             BA        university  uni   DYS   disortografia  
##  2 LEVEN   20.5 IT             BA        university  uni   TD    <NA>           
##  3 LAE02   14.2 IT             HS        LAE         1     TD    <NA>           
##  4 LAE03   14.6 IT             HS        LAE         1     TD    <NA>           
##  5 LAE04   18.7 IT             HS        LAE         4     DYS   <NA>           
##  6 LAE05   17.4 IT and ROM (H… HS        LAE         4     TD    <NA>           
##  7 LAE06   14.3 IT             HS        LAE         1     TD    <NA>           
##  8 LAE07   14.9 IT             HS        LAE         1     TD    <NA>           
##  9 LAE08   14.1 IT             HS        LAE         1     TD    <NA>           
## 10 LAE09   17.1 IT             HS        LAE         4     TD    <NA>           
## # ℹ 84 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <chr>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …

### Option 2
df %>% 
  mutate(age = as.numeric(age))

## # A tibble: 94 × 38
##    ID       age L1_code        education name_school class group other_diagnoses
##    <chr>  <dbl> <chr>          <chr>     <chr>       <chr> <chr> <chr>          
##  1 LAUVEN  20.3 IT             BA        university  uni   DYS   disortografia  
##  2 LEVEN   20.5 IT             BA        university  uni   TD    <NA>           
##  3 LAE02   14.2 IT             HS        LAE         1     TD    <NA>           
##  4 LAE03   14.6 IT             HS        LAE         1     TD    <NA>           
##  5 LAE04   18.7 IT             HS        LAE         4     DYS   <NA>           
##  6 LAE05   17.4 IT and ROM (H… HS        LAE         4     TD    <NA>           
##  7 LAE06   14.3 IT             HS        LAE         1     TD    <NA>           
##  8 LAE07   14.9 IT             HS        LAE         1     TD    <NA>           
##  9 LAE08   14.1 IT             HS        LAE         1     TD    <NA>           
## 10 LAE09   17.1 IT             HS        LAE         4     TD    <NA>           
## # ℹ 84 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <chr>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …

Here, we want to create a new column (school_type), which follows these conditions: If name_school is LAE, then school_type is “Linguistic”, otherwise it’s “Other”

df %>% 
  mutate(school_type = if_else(name_school == "LAE", "Linguistic", "Other")) %>% 
  
  ### Check the result
  
  filter(ID %in% c("LAE03", "LC01", "MEN01", "VER01", "LEVEN")) %>% 
  dplyr::select(name_school, school_type)

## # A tibble: 5 × 2
##   name_school school_type
##   <chr>       <chr>      
## 1 university  Other      
## 2 LAE         Linguistic 
## 3 LC          Other      
## 4 MEN         Other      
## 5 VER         Other

The if_else function works as follows:

if_else(CONDITION, 
        VALUE IF CONDITION IS TRUE, 
        VALUE IF CONDITION IS FALSE)
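
One detail worth knowing is what happens when the condition is NA. The sketch below uses an invented vector, not the tutorial's data, and shows the default behaviour together with the optional missing argument of if_else():

```r
library(dplyr)

# Invented example values, including a missing one
school <- c("LAE", "LC", NA)

# By default, an NA in the condition gives NA in the result:
# "Linguistic" "Other" NA
if_else(school == "LAE", "Linguistic", "Other")

# The optional `missing` argument supplies a value for those cases:
# "Linguistic" "Other" "Unknown"
if_else(school == "LAE", "Linguistic", "Other", missing = "Unknown")
```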

It is also possible to use case_when when we have multiple conditions. Here, we have 5 levels under the name_school column: “LAE”, “LC”, “MEN”, “VER”, “university”.

case_when works as follows:

case_when(
  CONDITION ~ VALUE, 
  CONDITION ~ VALUE, 
  ... 
)

Let’s see a concrete example:

If name_school is LAE, school_type is “Linguistic”; if it is LC, “Scientific”; if it is MEN, “Agrarian”; if it is VER, “university”. In all other cases (here, university), the original value is kept:

df %>% 
  mutate(school_type = case_when(
    name_school == "LAE" ~ "Linguistic", 
    name_school == "LC" ~ "Scientific", 
    name_school == "MEN" ~ "Agrarian", 
    name_school == "VER" ~ "university", 
    TRUE ~ name_school
  )) %>% 
  
  ### check 
  filter(ID %in% c("LAE03", "LC01", "MEN01", "VER01", "LEVEN")) %>% 
  dplyr::select(name_school, school_type)

## # A tibble: 5 × 2
##   name_school school_type
##   <chr>       <chr>      
## 1 university  university 
## 2 LAE         Linguistic 
## 3 LC          Scientific 
## 4 MEN         Agrarian   
## 5 VER         university

`mutate_all`

You can use mutate_all to mutate all columns in a dataset. For instance, let’s select the four reading columns (wr_time_z, wr_error_z, nwr_time_z, nwr_error_z) and scale all columns

df %>% 
  dplyr::select(wr_time_z:nwr_error_z) %>% 
  mutate_all(~scale(.x))

## # A tibble: 94 × 4
##    wr_time_z[,1] wr_error_z[,1] nwr_time_z[,1] nwr_error_z[,1]
##            <dbl>          <dbl>          <dbl>           <dbl>
##  1         0.546         -0.115          0.156         -0.105 
##  2         0.680          0.527          0.156          0.683 
##  3         0.396          0.763          0.420          0.918 
##  4         1.40           0.923          0.781          0.578 
##  5        -0.994          0.981          0.178          0.322 
##  6         0.222          0.696          0.535         -0.0131
##  7         0.931          0.763          0.561          0.805 
##  8         0.230          0.290          0.112          0.691 
##  9         0.642          0.290          0.383          0.352 
## 10         0.892          0.981          0.717          0.985 
## # ℹ 84 more rows
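
The [,1] in the column names above appears because scale() returns a one-column matrix rather than a plain vector. Also, in current dplyr releases the scoped verbs such as mutate_all() are superseded by across(). The following sketch shows the same idea on invented toy data, wrapping scale() in as.numeric() so that the result is an ordinary numeric column:

```r
library(dplyr)

# Invented toy data
toy <- tibble(x = c(1, 2, 3), y = c(10, 20, 60))

# across(everything(), ...) applies the function to every column;
# as.numeric() drops the matrix structure that scale() returns
toy_scaled <- toy %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))

toy_scaled
```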

`mutate_if`

mutate_if is used to mutate all variables which meet a specific condition. For instance, let’s mutate all variables that are character into factors:

### Initial df 
df %>% select(where(is.character))

## # A tibble: 94 × 9
##    ID    L1_code education name_school class group other_diagnoses age_diagnosis
##    <chr> <chr>   <chr>     <chr>       <chr> <chr> <chr>           <chr>        
##  1 LAUV… IT      BA        university  uni   DYS   disortografia   11           
##  2 LEVEN IT      BA        university  uni   TD    <NA>            -            
##  3 LAE02 IT      HS        LAE         1     TD    <NA>            -            
##  4 LAE03 IT      HS        LAE         1     TD    <NA>            -            
##  5 LAE04 IT      HS        LAE         4     DYS   <NA>            17           
##  6 LAE05 IT and… HS        LAE         4     TD    <NA>            -            
##  7 LAE06 IT      HS        LAE         1     TD    <NA>            -            
##  8 LAE07 IT      HS        LAE         1     TD    <NA>            -            
##  9 LAE08 IT      HS        LAE         1     TD    <NA>            -            
## 10 LAE09 IT      HS        LAE         4     TD    <NA>            -            
## # ℹ 84 more rows
## # ℹ 1 more variable: group.exclusion <chr>

### Mutation of <chr> variables into <fct>
df %>% mutate_if(is.character, as.factor)

## # A tibble: 94 × 38
##    ID       age L1_code        education name_school class group other_diagnoses
##    <fct>  <dbl> <fct>          <fct>     <fct>       <fct> <fct> <fct>          
##  1 LAUVEN  20.3 IT             BA        university  uni   DYS   disortografia  
##  2 LEVEN   20.5 IT             BA        university  uni   TD    <NA>           
##  3 LAE02   14.2 IT             HS        LAE         1     TD    <NA>           
##  4 LAE03   14.6 IT             HS        LAE         1     TD    <NA>           
##  5 LAE04   18.7 IT             HS        LAE         4     DYS   <NA>           
##  6 LAE05   17.4 IT and ROM (H… HS        LAE         4     TD    <NA>           
##  7 LAE06   14.3 IT             HS        LAE         1     TD    <NA>           
##  8 LAE07   14.9 IT             HS        LAE         1     TD    <NA>           
##  9 LAE08   14.1 IT             HS        LAE         1     TD    <NA>           
## 10 LAE09   17.1 IT             HS        LAE         4     TD    <NA>           
## # ℹ 84 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <fct>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <fct>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …
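
The same conversion can be written with across() and where(), which is the style recommended in current dplyr releases. This is a sketch on invented toy data, not the tutorial's dataset:

```r
library(dplyr)

# Invented toy data: two character columns and one numeric column
toy <- tibble(id = c("a", "b"), group = c("DYS", "TD"), age = c(20, 14))

# where(is.character) picks the character columns; as.factor converts them
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

# id and group are now factors; age is left untouched
sapply(toy_fct, class)
```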

`mutate` across multiple columns

You can combine the function mutate with the function across to apply the same mutation to more than one column. To select the columns as a range with :, they must be next to each other in the data frame; otherwise, list them with c().

For instance, we want to mutate the three columns pa.acc, pa.rt and pa.bis and we want to scale them using the function scale:

df %>% mutate(across(pa.rt:pa.bis, ~scale(.x))) %>% 
  
  ### check
  dplyr::select(pa.rt:pa.bis)

## # A tibble: 94 × 3
##    pa.rt[,1] pa.acc[,1] pa.bis[,1]
##        <dbl>      <dbl>      <dbl>
##  1    -0.615      0.633      0.688
##  2    -0.642      0.633      0.703
##  3    -0.442      0.996      0.793
##  4    -0.939      0.996      1.07 
##  5    -0.855      0.814      0.920
##  6    -1.00       0.996      1.10 
##  7    -1.09       0.633      0.948
##  8    -0.794      0.270      0.586
##  9    -0.707      0.814      0.839
## 10    -0.972      0.996      1.08 
## # ℹ 84 more rows
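
When the columns are not next to each other, they can be listed with c() inside across(). A minimal sketch on invented toy data, where each selected column is centered on its mean:

```r
library(dplyr)

# Invented toy data: a and b are separated by a character column
toy <- tibble(a = c(1, 2, 3), label = c("x", "y", "z"), b = c(4, 6, 8))

# a and b are not adjacent, so they are listed with c() instead of a:b
toy_centered <- toy %>%
  mutate(across(c(a, b), ~ .x - mean(.x)))

toy_centered
```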

`group_by`

Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping. It is very important to ungroup after the manipulation is applied; otherwise, later operations will still be performed by group.

Imagine we want to create a new column using the function mutate, which has the mean value of the lexita.acc by group (DYS vs. TD):

df %>% 
  ### First, we group the df as we need it, here, by group
  group_by(group) %>% 
  mutate(mean.lexita = mean(lexita.acc)) %>% 
  ### ALWAYS ungroup
  ungroup() %>% 
  
  ### check result 
  dplyr::select(ID, group, lexita.acc, mean.lexita)

## # A tibble: 94 × 4
##    ID     group lexita.acc mean.lexita
##    <chr>  <chr>      <dbl>       <dbl>
##  1 LAUVEN DYS           48        52.7
##  2 LEVEN  TD            58        56.8
##  3 LAE02  TD            57        56.8
##  4 LAE03  TD            57        56.8
##  5 LAE04  DYS           55        52.7
##  6 LAE05  TD            58        56.8
##  7 LAE06  TD            59        56.8
##  8 LAE07  TD            56        56.8
##  9 LAE08  TD            56        56.8
## 10 LAE09  TD            56        56.8
## # ℹ 84 more rows
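
To illustrate why ungrouping matters, here is a sketch on invented toy data. The same second mutate() gives different results depending on whether ungroup() was called first:

```r
library(dplyr)

# Invented toy data
toy <- tibble(group = c("DYS", "DYS", "TD", "TD"),
              score = c(10, 20, 30, 50))

# Still grouped: the second mutate() is also computed per group
grouped <- toy %>%
  group_by(group) %>%
  mutate(group_mean = mean(score)) %>%
  mutate(overall_mean = mean(score))

# Ungrouped first: the second mutate() uses all rows
ungrouped <- toy %>%
  group_by(group) %>%
  mutate(group_mean = mean(score)) %>%
  ungroup() %>%
  mutate(overall_mean = mean(score))

grouped$overall_mean    # 15 15 40 40
ungrouped$overall_mean  # 27.5 27.5 27.5 27.5
```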

`summarize` or `summarise`

summarise() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

Let’s visualize the mean, sd and range of lexita.acc score for each group in a table using summarize

### Option 1: Specify the grouping variable using `group_by`
### range() returns two values (min and max), so we compute them as separate columns
df %>% 
  group_by(group) %>% 
  summarize(
    mean = mean(lexita.acc),
    sd = sd(lexita.acc),
    min = min(lexita.acc),
    max = max(lexita.acc)
  )

## # A tibble: 2 × 5
##   group  mean    sd   min   max
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 DYS    52.7  7.89    17    60
## 2 TD     56.8  2.34    47    60

### Option 2: Specify the grouping variable within the summarize function using .by = ""
### range() returns two values (min and max), so we compute them as separate columns
df %>% 
  summarize(
    mean = mean(lexita.acc),
    sd = sd(lexita.acc),
    min = min(lexita.acc),
    max = max(lexita.acc),
    
    .by = "group"
  )

## # A tibble: 2 × 5
##   group  mean    sd   min   max
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 DYS    52.7  7.89    17    60
## 2 TD     56.8  2.34    47    60
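
Summary tables often also report how many observations each group has, and real data may contain missing values. The sketch below uses invented toy data to show n() for counting rows and na.rm = TRUE for ignoring NA. As in Option 2, .by assumes a dplyr version that supports it.

```r
library(dplyr)

# Invented toy data with one missing score
toy <- tibble(group = c("DYS", "DYS", "TD", "TD", "TD"),
              score = c(10, NA, 30, 50, 40))

toy_summary <- toy %>%
  summarize(
    n = n(),                            # number of rows per group
    n_missing = sum(is.na(score)),      # how many scores are NA
    mean = mean(score, na.rm = TRUE),   # mean ignoring NA
    .by = group
  )

toy_summary
```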

`rename`

rename() changes the names of individual variables using new_name = old_name syntax

Example: We want to change the original ID column into ID_participant:

df %>% rename(ID_participant = ID)

## # A tibble: 94 × 38
##    ID_participant   age L1_code         education name_school class group
##    <chr>          <dbl> <chr>           <chr>     <chr>       <chr> <chr>
##  1 LAUVEN          20.3 IT              BA        university  uni   DYS  
##  2 LEVEN           20.5 IT              BA        university  uni   TD   
##  3 LAE02           14.2 IT              HS        LAE         1     TD   
##  4 LAE03           14.6 IT              HS        LAE         1     TD   
##  5 LAE04           18.7 IT              HS        LAE         4     DYS  
##  6 LAE05           17.4 IT and ROM (HL) HS        LAE         4     TD   
##  7 LAE06           14.3 IT              HS        LAE         1     TD   
##  8 LAE07           14.9 IT              HS        LAE         1     TD   
##  9 LAE08           14.1 IT              HS        LAE         1     TD   
## 10 LAE09           17.1 IT              HS        LAE         4     TD   
## # ℹ 84 more rows
## # ℹ 31 more variables: other_diagnoses <chr>, n_other_diagnoses <dbl>,
## #   age_diagnosis <chr>, AoO <dbl>, wr_time_z <dbl>, wr_error_z <dbl>,
## #   nwr_time_z <dbl>, nwr_error_z <dbl>, group.exclusion <chr>, pa.rt <dbl>,
## #   pa.acc <dbl>, pa.bis <dbl>, forward <dbl>, backward <dbl>,
## #   lexita.acc <dbl>, lexita.rt <dbl>, lexita.bis <dbl>, lextale.acc <dbl>,
## #   lextale.rt <dbl>, lextale.bis <dbl>, it.ok.acc <dbl>, en.ok.acc <dbl>, …

`relocate`

Use relocate() to change column positions, using the same syntax as select() to make it easy to move blocks of columns at once.

relocate(.data, ..., .before = NULL, .after = NULL)

Example: Move the column group after the column ID

df %>% relocate(group, .after = "ID")

## # A tibble: 94 × 38
##    ID     group   age L1_code        education name_school class other_diagnoses
##    <chr>  <chr> <dbl> <chr>          <chr>     <chr>       <chr> <chr>          
##  1 LAUVEN DYS    20.3 IT             BA        university  uni   disortografia  
##  2 LEVEN  TD     20.5 IT             BA        university  uni   <NA>           
##  3 LAE02  TD     14.2 IT             HS        LAE         1     <NA>           
##  4 LAE03  TD     14.6 IT             HS        LAE         1     <NA>           
##  5 LAE04  DYS    18.7 IT             HS        LAE         4     <NA>           
##  6 LAE05  TD     17.4 IT and ROM (H… HS        LAE         4     <NA>           
##  7 LAE06  TD     14.3 IT             HS        LAE         1     <NA>           
##  8 LAE07  TD     14.9 IT             HS        LAE         1     <NA>           
##  9 LAE08  TD     14.1 IT             HS        LAE         1     <NA>           
## 10 LAE09  TD     17.1 IT             HS        LAE         4     <NA>           
## # ℹ 84 more rows
## # ℹ 30 more variables: n_other_diagnoses <dbl>, age_diagnosis <chr>, AoO <dbl>,
## #   wr_time_z <dbl>, wr_error_z <dbl>, nwr_time_z <dbl>, nwr_error_z <dbl>,
## #   group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …

Move the columns wr_time_z, wr_error_z, nwr_time_z and nwr_error_z after the column ID, and then move age before wr_time_z:

df %>% 
  ### step 1
  relocate(wr_time_z:nwr_error_z, .after = "ID") %>% 
  ### step 2
  relocate(age, .before = "wr_time_z")

## # A tibble: 94 × 38
##    ID       age wr_time_z wr_error_z nwr_time_z nwr_error_z L1_code    education
##    <chr>  <dbl>     <dbl>      <dbl>      <dbl>       <dbl> <chr>      <chr>    
##  1 LAUVEN  20.3     -0.58      -1.23      -1.05       -1.45 IT         BA       
##  2 LEVEN   20.5     -0.23       0.1       -1.05        0.43 IT         BA       
##  3 LAE02   14.2     -0.97       0.59      -0.34        0.99 IT         HS       
##  4 LAE03   14.6      1.65       0.92       0.63        0.18 IT         HS       
##  5 LAE04   18.7     -4.58       1.04      -0.99       -0.43 IT         HS       
##  6 LAE05   17.4     -1.42       0.45      -0.03       -1.23 IT and RO… HS       
##  7 LAE06   14.3      0.42       0.59       0.04        0.72 IT         HS       
##  8 LAE07   14.9     -1.4       -0.39      -1.17        0.45 IT         HS       
##  9 LAE08   14.1     -0.33      -0.39      -0.44       -0.36 IT         HS       
## 10 LAE09   17.1      0.32       1.04       0.46        1.15 IT         HS       
## # ℹ 84 more rows
## # ℹ 30 more variables: name_school <chr>, class <chr>, group <chr>,
## #   other_diagnoses <chr>, n_other_diagnoses <dbl>, age_diagnosis <chr>,
## #   AoO <dbl>, group.exclusion <chr>, pa.rt <dbl>, pa.acc <dbl>, pa.bis <dbl>,
## #   forward <dbl>, backward <dbl>, lexita.acc <dbl>, lexita.rt <dbl>,
## #   lexita.bis <dbl>, lextale.acc <dbl>, lextale.rt <dbl>, lextale.bis <dbl>,
## #   it.ok.acc <dbl>, en.ok.acc <dbl>, en.ok.rt <dbl>, en.ok.bis <dbl>, …
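The three ways of placing columns can be compared side by side in the sketch below, which uses a hypothetical one-row tibble rather than the tutorial's data. It also shows what happens when neither .before nor .after is given: the selected columns are moved to the front.

```r
library(dplyr)

# Hypothetical toy data (NOT the tutorial's df)
toy <- tibble(
  ID      = "P01",
  age     = 20.3,
  score_a = 1,
  score_b = 2,
  group   = "TD"
)

# Move one column right after ID
after_id <- toy %>% relocate(group, .after = ID)

# Move a block of adjacent columns before age
block_before <- toy %>% relocate(score_a:score_b, .before = age)

# With neither .before nor .after, the column goes to the front
to_front <- toy %>% relocate(group)

names(after_id)
names(block_before)
names(to_front)
```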

`pivot_longer` & `pivot_wider`

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.
pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.

Examples

df %>% 
  dplyr::select(ID, wr_time_z:nwr_error_z) -> ex.1

### Original dataset
ex.1

## # A tibble: 94 × 5
##    ID     wr_time_z wr_error_z nwr_time_z nwr_error_z
##    <chr>      <dbl>      <dbl>      <dbl>       <dbl>
##  1 LAUVEN     -0.58      -1.23      -1.05       -1.45
##  2 LEVEN      -0.23       0.1       -1.05        0.43
##  3 LAE02      -0.97       0.59      -0.34        0.99
##  4 LAE03       1.65       0.92       0.63        0.18
##  5 LAE04      -4.58       1.04      -0.99       -0.43
##  6 LAE05      -1.42       0.45      -0.03       -1.23
##  7 LAE06       0.42       0.59       0.04        0.72
##  8 LAE07      -1.4       -0.39      -1.17        0.45
##  9 LAE08      -0.33      -0.39      -0.44       -0.36
## 10 LAE09       0.32       1.04       0.46        1.15
## # ℹ 84 more rows

### Pivot longer 
ex.1 %>% 
  pivot_longer(
    ### here we specify which columns to reshape: the second through the fifth
    cols = 2:5,
    names_to = "Reading Measure", 
    values_to = "Reading Value"
  )

## # A tibble: 376 × 3
##    ID     `Reading Measure` `Reading Value`
##    <chr>  <chr>                       <dbl>
##  1 LAUVEN wr_time_z                   -0.58
##  2 LAUVEN wr_error_z                  -1.23
##  3 LAUVEN nwr_time_z                  -1.05
##  4 LAUVEN nwr_error_z                 -1.45
##  5 LEVEN  wr_time_z                   -0.23
##  6 LEVEN  wr_error_z                   0.1 
##  7 LEVEN  nwr_time_z                  -1.05
##  8 LEVEN  nwr_error_z                  0.43
##  9 LAE02  wr_time_z                   -0.97
## 10 LAE02  wr_error_z                   0.59
## # ℹ 366 more rows
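Column positions such as 2:5 stop pointing at the right columns if the column order changes, so it can be safer to select the columns by name. The sketch below uses a hypothetical toy tibble and the tidyselect helper ends_with(); the column names measure and value are chosen only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the tutorial's ex.1)
toy <- tibble(
  ID         = c("P01", "P02"),
  wr_time_z  = c(-0.5, 0.2),
  wr_error_z = c(1.0, -0.3)
)

# Select the columns to reshape by name pattern instead of by position
toy_long <- toy %>%
  pivot_longer(
    cols      = ends_with("_z"),
    names_to  = "measure",
    values_to = "value"
  )

toy_long
```

Each participant now has one row per measure, so two participants with two measures give four rows.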

Example for pivot_wider

ex.1 %>% 
  pivot_longer(
    ### here we specify which columns to reshape: the second through the fifth
    cols = 2:5,
    names_to = "Reading Measure",
    values_to = "Reading Value"
  ) -> ex.2

### Original data
ex.2

## # A tibble: 376 × 3
##    ID     `Reading Measure` `Reading Value`
##    <chr>  <chr>                       <dbl>
##  1 LAUVEN wr_time_z                   -0.58
##  2 LAUVEN wr_error_z                  -1.23
##  3 LAUVEN nwr_time_z                  -1.05
##  4 LAUVEN nwr_error_z                 -1.45
##  5 LEVEN  wr_time_z                   -0.23
##  6 LEVEN  wr_error_z                   0.1 
##  7 LEVEN  nwr_time_z                  -1.05
##  8 LEVEN  nwr_error_z                  0.43
##  9 LAE02  wr_time_z                   -0.97
## 10 LAE02  wr_error_z                   0.59
## # ℹ 366 more rows

### Pivot wider 
ex.2 %>% 
  pivot_wider(names_from =  "Reading Measure", 
              values_from =  "Reading Value")

## # A tibble: 94 × 5
##    ID     wr_time_z wr_error_z nwr_time_z nwr_error_z
##    <chr>      <dbl>      <dbl>      <dbl>       <dbl>
##  1 LAUVEN     -0.58      -1.23      -1.05       -1.45
##  2 LEVEN      -0.23       0.1       -1.05        0.43
##  3 LAE02      -0.97       0.59      -0.34        0.99
##  4 LAE03       1.65       0.92       0.63        0.18
##  5 LAE04      -4.58       1.04      -0.99       -0.43
##  6 LAE05      -1.42       0.45      -0.03       -1.23
##  7 LAE06       0.42       0.59       0.04        0.72
##  8 LAE07      -1.4       -0.39      -1.17        0.45
##  9 LAE08      -0.33      -0.39      -0.44       -0.36
## 10 LAE09       0.32       1.04       0.46        1.15
## # ℹ 84 more rows
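As the example above suggests, pivot_wider() undoes pivot_longer(): going long and then wide again gives back the original table. A minimal sketch of that round trip, on a hypothetical toy tibble rather than the tutorial's data, could look like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the tutorial's ex.1)
toy <- tibble(
  ID         = c("P01", "P02"),
  wr_time_z  = c(-0.5, 0.2),
  wr_error_z = c(1.0, -0.3)
)

# Step 1: wide -> long
toy_long <- toy %>%
  pivot_longer(cols = -ID, names_to = "measure", values_to = "value")

# Step 2: long -> wide again
toy_wide <- toy_long %>%
  pivot_wider(names_from = measure, values_from = value)

# Check that the round trip gives back the same columns and values
same_names  <- identical(names(toy), names(toy_wide))
same_values <- isTRUE(all.equal(as.data.frame(toy), as.data.frame(toy_wide)))
```

Here cols = -ID means "every column except ID", which is another way of avoiding hard-coded column positions.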

R Tidyverse - tutorials

Ilaria Venagli (TA)

2024-03-11

Quick introduction to R studio/Markdown

Your first R markdown document

R markdown options

Getting started

Install packages

Load data

Data manipulation with tidyverse

`select`

`filter`

`mutate`

`mutate_all`

`mutate_if`

`mutate` across multiple columns

`group_by`

`summarize` or `summarise`

`rename`

`relocate`

`pivot_longer` & `pivot_wider`

Helpful resources
