Intro

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.7
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggthemes)
theme_set(theme_few())
sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(sum(!is.na(x)))}
ci <- function(x) {sem(x) * 1.96} # reasonable approximation 
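
As a quick illustration of these two helpers (the vector below is made up purely for the example):

x <- c(2, 4, 6, 8, NA)
sem(x)  # standard error of the mean, ignoring the NA
ci(x)   # half-width of an approximate 95% confidence interval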

This is problem set #4, in which we hope you will practice the visualization package ggplot2, as well as hone your knowledge of the packages tidyr and dplyr. You’ll look at two different datasets here.

First, data on children’s looking at social targets from Frank, Vul, Saxe (2011, Infancy).

Second, data from Sklar et al. (2012) on the unconscious processing of arithmetic stimuli.

In both of these cases, the goal is to poke around the data and make some plots to reveal the structure of the dataset.

Part 1

This part is a warmup; it should be relatively straightforward ggplot2 practice.

Load data from Frank, Vul, Saxe (2011, Infancy), a study in which we measured infants’ looking to hands in moving scenes. There were infants from 3 months all the way to about two years, and there were two movie conditions (Faces_Medium, in which kids played on a white background, and Faces_Plus, in which the backgrounds were more complex and the people in the videos were both kids and adults). An eye-tracker measured children’s attention to faces. This version of the dataset only gives two conditions and only shows the amount of looking at hands (other variables were measured as well).

fvs <- read_csv("data/FVS2011-hands.csv")
## Parsed with column specification:
## cols(
##   subid = col_integer(),
##   age = col_double(),
##   condition = col_character(),
##   hand.look = col_double()
## )

First, use ggplot to plot a histogram of the ages of children in the study. NOTE: this is a repeated measures design, so you can’t just take a histogram of every measurement.

fvs %>% 
  # group_by plus unique ensures there is one age per subid; if the same subid
  # somehow had two different ages, summarize would throw an error. Neat!
  group_by(subid) %>%
  summarize(age=unique(age)) %>%
  ggplot() +
  geom_histogram(aes(x=age)) +
  labs(
    title="Distribution of subjects' ages",
    x="Age (months)",
    y="Children"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Second, make a scatter plot showing the difference in hand looking by age and condition. Add appropriate smoothing lines. Take the time to fix the axis labels and make the plot look nice.

fvs %>%
  ggplot(aes(x=age, y=hand.look, color=condition)) +
  geom_point() +
  geom_smooth() +
  labs(
    title="Distribution of subjects' ages",
    x="Age (months)",
    y="Time looking at hands (units not specified)",
    color="Video"
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What do you conclude from this pattern of data?

Around 15 months of age, children begin to pay more attention to hands, and the amount of attention they pay depends on which video they are watching.

What statistical analyses would you perform here to quantify these differences?

Looking at the relationship between age and hand-looking time would be primary. If there is a reason to compare the two videos, then a mixed-effects model (with a random effect per child, since each child contributes both conditions) might be nice. (I am not certain of this, and this is part of the reason I am taking Psych 252.)
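
If I were to try that, a minimal sketch of the model (assuming the lme4 package, which this write-up does not otherwise load, and offered only as a starting point):

library(lme4)
# hand looking as a function of age and condition, with a random intercept
# per child to account for the repeated-measures design
lmer(hand.look ~ age * condition + (1 | subid), data = fvs)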

Part 2

Sklar et al. (2012) claim evidence for unconscious arithmetic processing - they prime participants with arithmetic problems and claim that participants are faster to repeat the answers. We're going to do a reanalysis of their Experiment 6, which is the primary piece of evidence for that claim. The data are generously shared by Asael Sklar. (You may recall these data from the tidyverse tutorial earlier in the quarter.)

Data Prep

First read in two data files and subject info. A and B refer to different trial order counterbalances.

subinfo <- read_csv("data/sklar_expt6_subinfo_corrected.csv") %>% 
  mutate(subid=as.character(subid))
## Parsed with column specification:
## cols(
##   subid = col_integer(),
##   presentation.time = col_integer(),
##   subjective.test = col_integer(),
##   objective.test = col_double()
## )
d_a <- read_csv("data/sklar_expt6a_corrected.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   prime = col_character(),
##   congruent = col_character(),
##   operand = col_character()
## )
## See spec(...) for full column specifications.
d_b <- read_csv("data/sklar_expt6b_corrected.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   prime = col_character(),
##   congruent = col_character(),
##   operand = col_character()
## )
## See spec(...) for full column specifications.

Gather these datasets into long ("tidy data") form. If you need to review tidying, here's the link to R4DS (bookmark it!). Remember that you can use select_helpers to help in your gathering.

Once you’ve tidied, bind all the data together. Check out bind_rows.

The resulting tidy dataset should look like this:

    prime prime.result target congruent operand distance counterbalance subid    rt
    <chr>        <int>  <int>     <chr>   <chr>    <int>          <int> <dbl> <int>
 1 =1+2+5            8      9        no       A       -1              1     1   597
 2 =1+3+5            9     11        no       A       -2              1     1   699
 3 =1+4+3            8     12        no       A       -4              1     1   700
 4 =1+6+3           10     12        no       A       -2              1     1   628
 5 =1+9+2           12     11        no       A        1              1     1   768
 6 =1+9+3           13     12        no       A        1              1     1   595
d_rt <- bind_rows(
  d_a %>% gather(subid, rt, `1`:`21`),
  d_b %>% gather(subid, rt, `22`:`42`)
)

Merge these with subject info. You will need to look into merge and its relatives, left_ and right_join. Call this dataframe d, by convention.

d <- left_join(d_rt, subinfo, by='subid')
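
A quick sanity check on the join (an extra step, not required by the problem set): a left join should keep every trial in d_rt, and if every subid matched the subject info, the subject-level columns should have no NAs.

nrow(d) == nrow(d_rt)           # the join should not drop or duplicate trials
sum(is.na(d$presentation.time)) # 0 if every subid found matching subject info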

Clean up the factor structure (just to make life easier). No need to, but if you want, you can make this more tidyverse-ish.

d$presentation.time <- factor(d$presentation.time)
levels(d$operand) <- c("addition","subtraction")
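
Note that operand was read in as a character column, so the levels() call above only attaches a levels attribute; the values themselves stay "A" and "S", which is why the summary table and plot below relabel them by hand. A more tidyverse-ish sketch of the presentation-time conversion, if you prefer that style, would be:

d <- d %>% mutate(presentation.time = factor(presentation.time))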

Data Analysis Preliminaries

Examine the basic properties of the dataset. First, show a histogram of reaction times.

ggplot(d) +
  geom_histogram(aes(x=rt)) +
  labs(
    title="Distribution of reaction times",
    x="Reaction Time",
    y="Problems"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 237 rows containing non-finite values (stat_bin).

Challenge question: what is the sample rate of the input device they are using to gather RTs?

I like to use graphical methods for intuition and mathematical methods for precision. The first step is to plot the data so that any sampling artifact is clearly visible.

reactions_over_time <- as_tibble(table(d$rt)) %>%
  rename(rt=Var1) %>%
  mutate(rt=as.integer(rt)) %>%
  right_join(tibble(rt=1:max(.$rt)), by='rt') %>%
  mutate(n=replace_na(n, 0))

ggplot(reactions_over_time) +
  geom_line(aes(x=rt, y=n))

The sampling artifacts are easily visible in the plot above. Reaction time should be continuous, but the recorded values cluster into regularly spaced spikes. There are 14 spikes between the 500 and 1000 ms marks, i.e. 14 cycles per 0.5 s, so this heuristic gives a sampling rate of about 28 Hz.

To calculate the sample rate more precisely, we can examine the frequency content of this signal using the Fourier transform.

reactions_frequency <- reactions_over_time %>%
  magrittr::use_series(n) %>%
  fft() %>%
  tibble(fft=.) %>%
  mutate(
    mag=abs(fft),
    ind=1:n()
  ) 

reactions_frequency %>%
  ggplot() +
  geom_line(aes(x=ind, y=mag))

At first, it appears that no single frequency dominates: there are many peaks of comparable amplitude, and the frequency-domain plot is very "spiky." However, this makes sense if we recognize that the per-millisecond count series we transformed is approximately a Dirac comb (also called an impulse train): a train of narrow spikes at regular intervals. Importantly, the Fourier transform of a Dirac comb is another Dirac comb, which explains the spikiness of the spectrum. With that fact, we can be confident that the first peak off zero in the frequency domain is the sampling frequency we are looking for.
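
For reference, the standard transform pair says a comb with period \(T\) in time maps to a comb with spacing \(1/T\) in frequency, \[ \mathcal{F}\left\{\sum_{k=-\infty}^{\infty} \delta(t - kT)\right\} = \frac{1}{T}\sum_{m=-\infty}^{\infty} \delta\!\left(f - \frac{m}{T}\right), \] so reaction times that cluster roughly every 35 ms should produce frequency-domain spikes roughly every \(1/0.035 \approx 28.6\) Hz, the first of which sits at the sampling rate. Zooming in on that first stretch of the spectrum, we find: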

peak <- 68
samples <- max(reactions_frequency$ind)
samples
## [1] 2336
reactions_frequency %>%
  ggplot() +
  geom_line(aes(x=ind, y=mag)) +
  xlim(0, 100) +
  annotate("point", x = reactions_frequency$ind[[peak]], y = reactions_frequency$mag[[peak]], colour = "blue")
## Warning: Removed 2236 rows containing missing values (geom_path).

The peak is at the 68th position in the array. Using the DFT formula given in the R documentation, \[y_h = \sum_{k=1}^n z_k\cdot e^{-2\pi i(h-1)\frac{k-1}{n}},\] the peak corresponds to \(h-1\) cycles in \(n\) entries. A short bit of dimensional analysis yields \[ f=\frac{h-1 \text{ cycles}}{n \text{ entries}} = \frac{67 \text{ cycles}}{2336 \text{ entries}} \times \frac{1000 \text{ entries}}{1 \text{ second}} \approx 28.68 \text{ Hz}. \] This is consistent with the 28 Hz we estimated from the number of spikes, so the sample rate is about 28.7 Hz.
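
The same arithmetic as a one-line check in R, using the peak and samples values defined above (the count series has one entry per millisecond, hence the factor of 1000):

(peak - 1) / samples * 1000  # ~28.68 Hz, matching the hand calculation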

Sklar et al. did two manipulation checks. Subjective - asking participants whether they saw the primes - and objective - asking them to report the parity of the primes (even or odd) to find out if they could actually read the primes when they tried. Examine both the unconscious and conscious manipulation checks. What do you see? Are they related to one another?

d %>% 
  group_by(subid) %>% 
  summarize(
    subjective.test=factor(unique(subjective.test)),
    objective.test=unique(objective.test)
  ) %>%
  ggplot(aes(x=objective.test, fill=subjective.test)) +
  geom_histogram(alpha=0.5, position="identity", bins=10)

By my reading of the plot, the two measures are related. All but four of the participants who claimed not to have seen the primes were right about the parity less than 60% of the time, while all but four of the participants who claimed to have seen them were right more than 60% of the time.
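
To put a rough number on that relationship (an extra check, not part of the assignment), we can compare mean objective-test accuracy between the two subjective-test groups:

d %>%
  distinct(subid, subjective.test, objective.test) %>%
  group_by(subjective.test) %>%
  summarize(
    mean_objective = mean(objective.test, na.rm = TRUE),
    n = n()
  )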

In Experiments 6, 7, and 9, we used the binomial distribution to determine whether each participant performed better than chance on the objective block and excluded from analyses all those participants who did (21, 30, and 7 participants in Experiments 6, 7, and 9, respectively). Note that, although the number of excluded participants may seem high, they fall within the normal range of long-duration CFS priming, in which successful suppression is strongly affected by individual differences (38). We additionally excluded participants who reported any subjective awareness of the primes (four, five, and three participants in Experiments 6, 7, and 9, respectively).

OK, let's turn back to the measure and implement Sklar et al.'s exclusion criterion. You need to have said you couldn't see the primes (subjective test) and also not be significantly above chance on the objective test (< .6 correct). Call your new data frame ds.

ds <- d %>%
  filter(
    subjective.test == 0 &
      objective.test < .6
  )

Replicating Sklar et al.’s analysis

Sklar et al. show a plot of a "facilitation effect" - the amount faster you are for prime-congruent naming compared with prime-incongruent naming. They then plot this difference score for the subtraction condition and for the two prime times they tested. Try to reproduce this analysis.

HINT: first take averages within subjects, then compute your error bars across participants, using the ci function (defined above). Sklar et al. use SEM (and do it incorrectly, actually), but CI is more useful for "inference by eye" as discussed in class.

HINT 2: remember that in class, we reviewed the common need to group_by and summarise twice, the first time to get means for each subject, the second time to compute statistics across subjects.

HINT 3: The final summary dataset should have 4 rows and 5 columns (2 columns for the two conditions and 3 columns for the outcome: reaction time, ci, and n).

ds_summary <- ds %>%
  group_by(congruent, operand, subid) %>%
  # It is not specified what to do with the NA values.
  # I assume we remove them, but this comes with the caveat that
  # we're only comparing completed trials.
  # What if there was a confounding effect on completion?
  summarize(mean_rt = mean(rt, na.rm = TRUE)) %>%
  group_by(congruent, operand) %>%
  summarize(
    rt = mean(mean_rt),
    ci = ci(mean_rt),
    n = n()
  )

knitr::kable(ds_summary)
congruent   operand         rt         ci    n
no          A         684.0067   41.22950   17
no          S         677.0909   53.51070   17
yes         A         688.4300   44.47832   17
yes         S         661.9278   51.66896   17

Now plot this summary, giving more or less the bar plot that Sklar et al. gave (though I would keep operation as a variable here). Make sure you get some error bars on there (e.g. geom_errorbar or geom_linerange).

dodge <- position_dodge(width=0.9)
ggplot(ds_summary, aes(x=operand, fill=congruent)) +
  geom_col(aes(y=rt), position=dodge) +
  geom_errorbar(aes(ymin = rt - ci, ymax= rt + ci), position=dodge, width=0.25) +
  labs(
    x="Operation",
    y="Reaction Time",
    fill="Congruent with Prime?",
    title="Effect of Congruence on Reaction Time",
    subtitle="Note there is no priming effect in either operation."
  ) +
  scale_x_discrete(
    labels=c(
      "A" = "Addition",
      "S" = "Subtraction"
    )
  ) + 
  ggthemes::theme_few() +
  theme(legend.position = "bottom")

What do you see here? How close is it to what Sklar et al. report? How do you interpret these data?

Across both operations, the confidence intervals overlap for a majority of their length. This is not close to what Sklar et al. report. I interpret these data as providing no evidence for an effect of priming on reaction time.
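
One way to back that reading up with a number: a sketch of a simple paired comparison on the subject-level facilitation scores (collapsing over operation and presentation time; this is not the exact analysis Sklar et al. ran):

subject_facilitation <- ds %>%
  group_by(subid, congruent) %>%
  summarize(mean_rt = mean(rt, na.rm = TRUE)) %>%
  spread(congruent, mean_rt) %>%
  mutate(facilitation = no - yes)  # positive values = faster on congruent trials

t.test(subject_facilitation$facilitation)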