knitr::opts_chunk$set(include = TRUE, echo = TRUE, message = FALSE, fig.width = 4, fig.height = 4, fig.align = 'center')
library(knitr)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(broom)
library(ez)
In this tutorial, we will go through basic RT analyses for typical psychophysical experiments. The analyses are done in the R language, so in the first part we will quickly go through the basic analysis workflow in R. If you are already familiar with R, you can skip Part I. In the second part, we will analyze the data you just collected.
You may have heard about R or RStudio. R is a programming language, while RStudio is an interface (integrated development environment, IDE) to it. RStudio helps you organize your code and data and manage R packages (collections of special functions). Basically, you need both. They can be downloaded at:
They have already been installed on your computer. R packages extend the functionality of R by providing additional functions. We will use some packages quite often, such as tidyverse and ez, particularly the tidyverse package, which provides a grammar-like approach to data analysis. Using tidyverse lets you focus more on your data than on programming itself. There are many free online resources for learning R. I recommend the following three:
To analyze data efficiently, you need a workflow. Typically, we need to carry out the following tasks in data analysis.
This has become a standard routine in data science.
include_graphics('img/tidy_process.png')
Importing experimental data is relatively easy. There are several functions available for this:

* Reading text or Excel data
    * read.csv()
    * read.xlsx() from the xlsx package
    * fread() from the data.table package
* Reading Matlab files
    * readMat() from the R.matlab package
For example:
dat = read.csv('data/stroop.csv')
head(dat, n = 3)
Very often we work with data tables organized like the one above: each trial is stored in one row, while each column stores one variable. A table in this format is called tidy data. If your raw data are not in this format, you may need to convert them first.
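As a minimal sketch (with a made-up wide table, not data from this tutorial), gather() from tidyr converts a wide table into this tidy format:

# hypothetical wide table: one row per subject, one column per condition
wide = tibble(sub = c(1, 2), static = c(0.52, 0.61), dynamic = c(0.70, 0.83))
# gather() stacks the condition columns into key-value pairs, one row per observation
gather(wide, key = 'block', value = 'rt', static, dynamic)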
Experimental raw data contain detailed information about the experimental design and observers' responses. Often we need to manipulate the data, for example selecting variables or averaging across participants. The tidyverse package provides five key verbs for data manipulation:
* `select()`
* `filter()`
* `arrange()`
* `mutate()`
* `summarize()`
These are the key functions for data manipulation. You can find a cheatsheet for them in RStudio under the Help - Cheatsheets sub-menu. Importantly, tidyverse also provides a connecting pipe operator (%>%) for these functions, so you can chain them together like a natural sentence.
knitr::include_graphics('figs/r_dplyr_pipes.pdf')
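As a quick sketch, here are the three verbs not used in the example below, applied to the same 'dat' table (the columns sub, rt, and acc are assumed from that example):

dat %>%
  select(sub, rt, acc) %>%        # keep only three columns
  mutate(rt_ms = rt * 1000) %>%   # add RT in milliseconds (assuming rt is in seconds)
  arrange(desc(rt)) %>%           # sort trials, slowest first
  head(n = 3)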
For example, suppose we want to take the 'dat' table, filter out the error trials, and then compute the mean reaction time for each participant. This analysis flow translates into R directly:
dat %>%
  filter(acc == TRUE) %>%
  group_by(sub) %>%
  summarise(mRT = mean(rt)) %>%
  spread(sub, mRT) %>%
  kable(., digits = 2)
tidyverse includes the visualization package ggplot2. Similar to the data-manipulation verbs, ggplot2 implements a grammar of graphics: a plot consists of a data set, a set of geoms, and a coordinate system:
ggplot(data, aes(x,y)) + geom_point() + geom_smooth(method = 'lm')
Note that it uses the plus operator (+), not the pipe (%>%). Here are some basic geoms (a runnable sketch follows the list):
geom_bar(stat = 'identity', position = 'dodge')
geom_line()
geom_errorbar(aes(ymin, ymax))
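For instance, a minimal runnable sketch with the built-in mtcars data (not part of this tutorial's data):

# scatter plot with a linear trend line, using built-in data
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = 'lm')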
We will learn more in the next part.
Sometimes you may find R a bit strange. For example, there are three assignment operators for storing values: =, <-, and ->; the latter two are directional. To make a vector of numbers, you need the concatenate function c(), for example c(1, 4, 5). R code can be written across multiple lines if the previous line ends with +, %>%, or a comma, and R does not require ';' at the end of a statement. In addition, you need to load packages at the very beginning of your code.
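A minimal sketch of these quirks:

# the three assignment operators store the same value
x <- 5
5 -> y
z = 5
# c() concatenates numbers into a vector
v = c(1, 4, 5)
# a pipe at the end of a line continues the expression on the next line
v %>%
  mean()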
This experiment replicates Horowitz & Wolfe (1998), who found that search efficiency, measured by the search slope (ms/item), did not change from the static to the dynamic display. That is, the slopes were similar for both static and dynamic displays; there was only an overall shift in RTs. Horowitz & Wolfe (1998) did not provide an account for this mysterious behavioral finding, but rather argued that visual search has no memory (the same for static and dynamic displays).
In this experiment, the search display was either static or dynamic (i.e., the locations of the items were reshuffled every 110 ms). The display was presented until a response was given, or for 5 seconds at maximum. In addition, the set size of the display varied among 8, 12, and 16 items, and the target was present or absent with equal probability (1:1). The static and dynamic search displays were presented in separate, counter-balanced blocks, and participants were asked to detect the target (absent vs. present).
Summary of the manipulated factors:

* Display type (block): static vs. dynamic (items reshuffled every 110 ms), presented in separate, counter-balanced blocks
* Set size: 8, 12, or 16 items
* Target: present vs. absent (1:1)
The raw file generated by Matlab is a comma-separated text file, and each participant has one individual file. The first step is to read the files into the workspace and combine them. The tidyverse package provides a useful function for this, map_df(a_list, function): you provide a list of files, the function is applied to each element of the list, and the results are bound together into one data frame.
# list of participants
subs = list.files(path = 'search', full.names = TRUE, pattern='*.txt')
raw = map_df(subs, read.csv, header = FALSE)
head(raw)
Now we have the raw data, but we need some further preprocessing, such as renaming the column variables, adding the participant number, and computing accuracy.
# rename columns using names()
names(raw) = c('target','setSize','block','resp','rt')
# calculate accuracy
raw$acc = raw$target == raw$resp
# convert target from number to readable factors
raw$target = factor(raw$target, labels = c('present','absent'))
# convert set size from coding index to real items
raw$setSize = raw$setSize*4+4 # 8, 12, 16 items
# convert block from number to readable factors.
raw$block = factor(raw$block, labels = c('dynamic','static'))
# add subject No. and factorize it
raw$sub = rep(1:length(subs), each = 420)
raw$sub = factor(raw$sub)
# show some data from the head
head(raw)
The first step of the data analysis is to remove the error trials. However, before doing this, we must make sure there is no speed-accuracy trade-off.
Question: Why should we care about the speed-accuracy trade-off (SATO)?
Your answer:
We use the pipe operator %>% to chain the analyses.
# we use piping ...
# please read the following lines
# and guess what columns are in the final table msuberrors
raw %>% group_by(sub, setSize, block) %>%
summarise(merror = 1 - mean(acc)) -> msuberrors
# visualize the errors
msuberrors %>% ggplot(aes(x = sub, y = merror)) + geom_bar(stat = 'identity', position = position_dodge()) +
facet_grid(setSize~block)
By visual inspection, we see high error rates in the dynamic blocks. This suggests there might be a speed-accuracy trade-off, so we run a repeated-measures ANOVA on the error rates.
anova_err = ezANOVA(msuberrors,
dv = merror,
within = .(setSize, block),
wid = sub)
## Warning: "setSize" will be treated as numeric.
## Warning: There is at least one numeric within variable, therefore aov()
## will be used for computation and no assumption checks will be obtained.
anova_err$ANOVA
Questions: Did the error rates differ across set sizes and blocks? Is there any interaction?
For the RT analysis, we usually include only the correct trials. Thus, we first need to filter out the error trials (we did the error analysis above already), and then compute the mean RT for each condition and each subject. Below is an example output.
# Please start from the `raw` data, compute mean correct RTs,
# and store them to `msubrts`
msubrts = raw %>% filter(acc == TRUE) %>%
group_by(sub,block, setSize, target) %>%
summarise(mrt = mean(rt))
head(msubrts)
To validate the original study (Horowitz & Wolfe, 1998), we visualize the mean RTs as a function of set size (line plot).
# please complete the following code
msubrts %>%
group_by(block, setSize, target) %>% # please fill the code here (not run)
summarise(mmrt = mean(mrt)) %>%
ggplot(aes(x = setSize, y = mmrt,
color = block, shape = target )) +
geom_point(size=3) + geom_line() + facet_wrap(~block)
As we can see, for the target-present condition the slopes for the static and dynamic displays are similar. But for the target-absent condition, the slope for the static display is clearly steeper than that for the dynamic display.
We use linear regression to estimate the search slopes.
# select one participant and one condition
# Task: please change the following subject name to yours.
sub1 = msubrts %>% filter(sub == 1 & target == 'present')
# estimate a linear model
lm(mrt ~ setSize, data = sub1)
##
## Call:
## lm(formula = mrt ~ setSize, data = sub1)
##
## Coefficients:
## (Intercept) setSize
## 0.57807 0.02993
Question: What is the slope of the above linear regression?
We have to estimate multiple linear models, so we'd better build a function for this. The linear regression function in R is lm(). If you are not familiar with linear regression, please look it up in 'Help'.
linear_model <- function(df){
lm(mrt ~ setSize, data = df)
}
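As a quick check, we can apply this function to the single-participant table from above and tidy the coefficients with broom::tidy() (this prefigures the per-group fits below):

# fit one participant's data and extract tidy coefficients
linear_model(sub1) %>% broom::tidy()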
Now we want to fit a linear regression for each condition and each individual participant. In a standard program, you would need loops. In tidyverse, you can use the group_by() and nest() functions plus the map() function. The following example shows how you can get the slopes easily within one pipeline.
msubrts %>% group_by(sub, block, target) %>%
  nest() %>%
  mutate(model = map(data, linear_model)) %>%
  mutate(tidied = map(model, broom::tidy)) %>% # broom::tidy() extracts the coefficients
  unnest(tidied, .drop = TRUE) -> linear_est
head(linear_est)
Once we have the linear estimates, we can do visualization, statistical analysis, etc. Let's first visualize the mean slopes.
linear_est %>% group_by(block) %>%
filter(term == 'setSize') %>%
summarise(slope = mean(estimate), n = n(), se = sd(estimate)/sqrt(n)) %>%
ggplot(aes(block, slope, fill = block, color = block)) +
geom_bar(stat = 'identity') +
geom_errorbar(aes(ymin = slope - se, ymax = slope + se), width = 0.5)
One easy way to run a repeated-measures ANOVA is the ez package, which provides the function ezANOVA(). Here is an example of how to compare the slopes:
linear_est %>% filter(term == 'setSize') %>%
ezANOVA(data = .,
dv = estimate,
within = .(target, block),
wid = sub)
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 2 target 1 9 0.1539633 0.703906815 0.005363524
## 3 block 1 9 14.1458892 0.004477398 * 0.127172076
## 4 target:block 1 9 9.7402785 0.012299938 * 0.156355355