library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggthemes)
theme_set(theme_few())
sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(sum(!is.na(x)))}
ci <- function(x) {sem(x) * 1.96} # reasonable approximation
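Note that the 1.96 multiplier assumes a roughly normal sampling distribution and a reasonably large n. A t-based version would be slightly wider for small samples (a sketch; ci_t is my name, not part of the problem set):
ci_t <- function(x) {
n <- sum(!is.na(x)) # number of non-missing observations
sem(x) * qt(0.975, df = n - 1) # t critical value instead of 1.96
}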
This is problem set #4, in which we hope you will practice the visualization package ggplot2, as well as hone your knowledge of the packages tidyr and dplyr. You’ll look at two different datasets here.
First, data on children’s looking at social targets from Frank, Vul, & Saxe (2011, Infancy).
Second, data from Sklar et al. (2012) on the unconscious processing of arithmetic stimuli.
In both of these cases, the goal is to poke around the data and make some plots to reveal the structure of the dataset.
This part is a warmup; it should be relatively straightforward ggplot2 practice.
Load data from Frank, Vul, Saxe (2011, Infancy), a study in which we measured infants’ looking to hands in moving scenes. There were infants from 3 months all the way to about two years, and there were two movie conditions (Faces_Medium, in which kids played on a white background, and Faces_Plus, in which the backgrounds were more complex and the people in the videos were both kids and adults). An eye-tracker measured children’s attention to faces. This version of the dataset only gives two conditions and only shows the amount of looking at hands (other variables were measured as well).
fvs <- read_csv("data/FVS2011-hands.csv")
## Parsed with column specification:
## cols(
## subid = col_integer(),
## age = col_double(),
## condition = col_character(),
## hand.look = col_double()
## )
First, use ggplot to plot a histogram of the ages of children in the study. NOTE: this is a repeated measures design, so you can’t just take a histogram of every measurement.
fvs_age <- fvs %>%
group_by(subid) %>%
summarise(n = n(), age = mean(age)) # I assume that each subid maps to only 1 age, so taking mean(age) should work
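A quick sanity check on that assumption (a sketch): this should return zero rows if every subid maps to a single age.
fvs %>%
group_by(subid) %>%
summarise(n_ages = n_distinct(age)) %>%
filter(n_ages > 1) # any child with more than one recorded age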
ggplot(fvs_age, aes(x = age)) +
geom_histogram() +
xlab("Age (months)") +
ylab("Number of Children")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Second, make a scatter plot showing the difference in hand looking by age and condition. Add appropriate smoothing lines. Take the time to fix the axis labels and make the plot look nice.
ggplot(fvs, aes(x = age, y = hand.look, col=condition)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Hand looking by age and condition") +
ylab("Proportion of Hand Looking") +
xlab("Age (months)") +
xlim(0, 30)
What do you conclude from this pattern of data?
First, there seems to be a main effect of age: as children grow older, the proportion of hand looking increases. Furthermore, age and condition seem to interact: the effect of age is more pronounced in Faces_Plus, as evidenced by the steeper slope. There may also be a main effect of condition, but it is not clear from this graph alone whether the proportion of hand looking is greater in the Faces_Plus condition than in the Faces_Medium condition. A bar plot of hand looking by condition with 95% confidence intervals, sketched below, would help tell.
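Something like the following might work (a sketch; it collapses over age and uses the ci helper defined at the top, treating measurements within a condition as independent across subjects):
fvs_cond <- fvs %>%
group_by(condition) %>%
summarise(mean.hand.look = mean(hand.look, na.rm = TRUE),
ci = ci(hand.look))
ggplot(fvs_cond, aes(x = condition, y = mean.hand.look, fill = condition)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean.hand.look - ci,
ymax = mean.hand.look + ci), width = 0.25) +
xlab("Condition") +
ylab("Proportion of Hand Looking")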
What statistical analyses would you perform here to quantify these differences?
I would fit a linear mixed-effects model with a random intercept for participant, and test whether the main effect of condition, the main effect of age, and the age-by-condition interaction are significant.
library(lmerTest)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## Loading required package: lme4
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
library(broom)
library(knitr)
mod <- lmer(data = fvs, hand.look ~ age * condition + (1|subid))
kable(tidy(summary(mod)$coef))
| term | Estimate | Std. Error | df | t value | Pr(>|t|) |
|---|---|---|---|---|---|
| (Intercept) | 0.0230142 | 0.0154159 | 221.3662 | 1.492887 | 0.1368905 |
| age | 0.0030590 | 0.0011758 | 221.5222 | 2.601550 | 0.0099059 |
| conditionFaces_Plus | -0.0289396 | 0.0196491 | 111.2632 | -1.472818 | 0.1436233 |
| age:conditionFaces_Plus | 0.0039999 | 0.0014994 | 111.9182 | 2.667652 | 0.0087720 |
It seems that the main effect of age and the age-by-condition interaction are significant; the main effect of condition is not.
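To attach uncertainty intervals to these estimates, one could also compute Wald confidence intervals for the fixed effects (a sketch):
confint(mod, parm = "beta_", method = "Wald") # 95% CIs for the fixed effects only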
Sklar et al. (2012) claim evidence for unconscious arithmetic processing - they prime participants with arithmetic problems and claim that participants are faster to repeat the answers to those problems. We’re going to do a reanalysis of their Experiment 6, which is the primary piece of evidence for that claim. The data are generously shared by Asael Sklar. (You may recall these data from the tidyverse tutorial earlier in the quarter).
First read in two data files and subject info. A and B refer to different trial order counterbalances.
subinfo <- read_csv("data/sklar_expt6_subinfo_corrected.csv")
## Parsed with column specification:
## cols(
## subid = col_integer(),
## presentation.time = col_integer(),
## subjective.test = col_integer(),
## objective.test = col_double()
## )
d_a <- read_csv("data/sklar_expt6a_corrected.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## prime = col_character(),
## congruent = col_character(),
## operand = col_character()
## )
## See spec(...) for full column specifications.
d_b <- read_csv("data/sklar_expt6b_corrected.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## prime = col_character(),
## congruent = col_character(),
## operand = col_character()
## )
## See spec(...) for full column specifications.
gather these datasets into long (“tidy data”) form and get rid of the Xs in the headers. If you need to review tidying, here’s the link to R4DS (bookmark it!). Remember that you can use select_helpers to help in your gathering.
Once you’ve tidied, bind all the data together. Check out bind_rows.
The resulting tidy dataset should look like this:
prime prime.result target congruent operand distance counterbalance subid rt
<chr> <int> <int> <chr> <chr> <int> <int> <dbl> <int>
1 =1+2+5 8 9 no A -1 1 1 597
2 =1+3+5 9 11 no A -2 1 1 699
3 =1+4+3 8 12 no A -4 1 1 700
4 =1+6+3 10 12 no A -2 1 1 628
5 =1+9+2 12 11 no A 1 1 1 768
6 =1+9+3 13 12 no A 1 1 1 595
d_a_tidy <- d_a %>%
gather(key = subid, value = rt, grep("\\d", colnames(d_a), value = T)) # the grep function returns all column names with a digit
d_b_tidy <- d_b %>%
gather(key = subid, value = rt, grep("\\d", colnames(d_b), value = T))
d_tidy <- bind_rows(d_a_tidy, d_b_tidy)
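The same gathering can be written with the select helpers the prompt mentions (a sketch, equivalent to the grep version above):
d_a_tidy <- d_a %>%
gather(key = subid, value = rt, matches("[0-9]")) # columns whose names contain a digit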
kable(head(d_tidy))
| prime | prime.result | target | congruent | operand | distance | counterbalance | subid | rt |
|---|---|---|---|---|---|---|---|---|
| =1+2+5 | 8 | 9 | no | A | -1 | 1 | 1 | 597 |
| =1+3+5 | 9 | 11 | no | A | -2 | 1 | 1 | 699 |
| =1+4+3 | 8 | 12 | no | A | -4 | 1 | 1 | 700 |
| =1+6+3 | 10 | 12 | no | A | -2 | 1 | 1 | 628 |
| =1+9+2 | 12 | 11 | no | A | 1 | 1 | 1 | 768 |
| =1+9+3 | 13 | 12 | no | A | 1 | 1 | 1 | 595 |
Merge these with subject info. You will need to look into merge and its relatives, left_ and right_join. Call this dataframe d, by convention.
d_tidy$subid <- as.integer(d_tidy$subid) # need to make subid column an integer before merging
d <- left_join(d_tidy, subinfo, by="subid")
Clean up the factor structure (just to make life easier). No need to, but if you want, you can make this more tidyverse-ish.
# make sure these variables are factors
d$presentation.time <- factor(d$presentation.time)
d$congruent <- factor(d$congruent)
d$operand <- factor(d$operand)
d$subjective.test <- factor(d$subjective.test)
# rename the operand and subjective.test levels to something more useful
levels(d$operand) <- c("addition","subtraction")
levels(d$subjective.test) <- c("no", "yes")
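A more tidyverse-ish version of the same cleanup (a sketch; like the levels<- calls above, it assumes the raw codes sort as A/S for operand and 0/1 for subjective.test):
d <- d %>%
mutate(presentation.time = factor(presentation.time),
congruent = factor(congruent),
operand = factor(operand, labels = c("addition", "subtraction")), # assumes raw levels A, S
subjective.test = factor(subjective.test, labels = c("no", "yes"))) # assumes raw levels 0, 1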
Examine the basic properties of the dataset. First, show a histogram of reaction times.
ggplot(d, aes(x = rt)) + geom_histogram() + ggtitle("RT Histogram") # warnings are thrown because there are some NA values, presumably non-responses
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 237 rows containing non-finite values (stat_bin).
ggplot(d, aes(x = rt, fill = congruent)) +
geom_histogram(binwidth = 40) +
facet_grid(operand ~ congruent) +
ggtitle("RT histogram by operand and congruency")
## Warning: Removed 237 rows containing non-finite values (stat_bin).
Challenge question: what is the sample rate of the input device they are using to gather RTs?
~35 ms, with 3 ms clusters. It’s obvious when you run sort(d$rt). We can also see this graphically by setting the histogram binwidth to 3. I’m not sure how to interpret the 3 ms clusters, but it seems like the device reports a reading every 35 ms.
ggplot(d, aes(x = rt)) + geom_histogram(binwidth = 3) + ggtitle("RT Histogram w/ binwidth = 3")
## Warning: Removed 237 rows containing non-finite values (stat_bin).
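The spacing can also be checked numerically (a sketch): unique collapses the repeated values, so diff gives the gaps between adjacent distinct RT values.
rt_vals <- sort(unique(na.omit(d$rt))) # distinct observed RTs, in order
table(diff(rt_vals)) # how often each gap size between neighboring RTs occurs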
Sklar et al. did two manipulation checks. Subjective - asking participants whether they saw the primes - and objective - asking them to report the parity of the primes (even or odd) to find out if they could actually read the primes when they tried. Examine both the unconscious and conscious manipulation checks. What do you see? Are they related to one another?
Subjective Test
ggplot(d %>% select(subid, subjective.test) %>% distinct(),
aes(x = subjective.test, fill = subjective.test)) +
geom_bar() +
xlab("Subjective Test") +
ggtitle("Participant Self Report")
There are equal numbers of participants (21 each) reporting that they did see the primes (yes) and that they did not (no).
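These counts can be verified directly (a sketch):
d %>%
distinct(subid, subjective.test) %>% # one row per participant
count(subjective.test)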
Objective Test
ggplot(d %>% select(subid, objective.test) %>% distinct(),
aes(x = objective.test)) +
geom_histogram(binwidth = 0.05) +
xlab("Objective Test") +
ggtitle("Participant Partity Recognition")
ggplot(d %>% select(subid, subjective.test, objective.test) %>% distinct(),
aes(x = subjective.test, y = objective.test, fill = subjective.test)) +
geom_boxplot() +
xlab("Subjective Test") +
ylab("Objective Test")
From the side-by-side boxplot comparison, it seems that those who reported seeing the primes (subjective.test = "yes") are more accurate at reporting the parity of the primes (higher objective.test scores), so the two measures are positively related. To test the significance of this relationship, we can run a simple linear model (p < 0.001).
d_cor <- d %>% select(subid, subjective.test, objective.test) %>% distinct()
summary(lm(data = d_cor, objective.test ~ subjective.test))
##
## Call:
## lm(formula = objective.test ~ subjective.test, data = d_cor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31615 -0.11109 -0.00320 0.07808 0.37911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.54276 0.03323 16.333 < 2e-16 ***
## subjective.testyes 0.21089 0.04700 4.487 5.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1523 on 40 degrees of freedom
## Multiple R-squared: 0.3348, Adjusted R-squared: 0.3182
## F-statistic: 20.14 on 1 and 40 DF, p-value: 5.966e-05
> In Experiments 6, 7, and 9, we used the binomial distribution to determine whether each participant performed better than chance on the objective block and excluded from analyses all those participants who did (21, 30, and 7 participants in Experiments 6, 7, and 9, respectively). Note that, although the number of excluded participants may seem high, they fall within the normal range of long-duration CFS priming, in which successful suppression is strongly affected by individual differences (38). We additionally excluded participants who reported any subjective awareness of the primes (four, five, and three participants in Experiments 6, 7, and 9, respectively).
OK, let’s turn back to the measure and implement Sklar et al.’s exclusion criterion. You need to have said you couldn’t see (subjective test) and also be not significantly above chance on the objective test (< .6 correct). Call your new data frame ds.
ds <- d %>%
filter(subjective.test == "no") %>%
filter(objective.test < .6)
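A quick check on how many participants survive both exclusions (a sketch; I haven’t cross-checked this count against the one Sklar et al. report):
n_distinct(ds$subid) # participants remaining after the exclusion criteria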
Sklar et al. show a plot of a “facilitation effect” - the amount faster you are for prime-congruent naming compared with prime-incongruent naming. They then plot this difference score for the subtraction condition and for the two presentation times they tested. Try to reproduce this analysis.
HINT: first take averages within subjects, then compute your error bars across participants, using the sem function (defined above).
HINT 2: remember that in class, we reviewed the common need to group_by and summarise twice, the first time to get means for each subject, the second time to compute statistics across subjects.
HINT 3: The final summary dataset should have 4 rows and 5 columns (2 columns for the two conditions and 3 columns for the outcome: reaction time, ci, and n).
ds_average <- ds %>%
group_by(subid, congruent, presentation.time, operand) %>%
summarise(mean.rt = mean(rt, na.rm = T)) %>%
spread(congruent, mean.rt) %>% # so that one row has both incongruent (no) and congruent (yes)
mutate(facilitation = no - yes) %>% # Facilitation = incongruent - congruent
group_by(presentation.time, operand) %>%
summarise(mean.facilitation = mean(facilitation), ci = ci(facilitation), num = n())
Now plot this summary, giving more or less the bar plot that Sklar et al. gave (though I would keep operation as a variable here). Make sure you get some error bars on there (e.g. geom_errorbar or geom_linerange).
# with 95% CI errorbars
ggplot(ds_average, aes(x = presentation.time, y = mean.facilitation, fill = presentation.time)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean.facilitation - ci, ymax = mean.facilitation + ci), position = "dodge", width = 0.25) +
facet_grid(. ~ operand) +
xlab("Presentation Duration") +
ylab("Facilitation (ms)") +
ggtitle("Facilitation Effect by operand and presentation time with 95% CI bars")
What do you see here? How close is it to what Sklar et al. report? Do the error bars match? How do you interpret these data?
Just as Sklar et al. reported, there seems to be no facilitation effect in the addition condition, and the mean facilitation effect for the subtraction condition looks very similar to the one in their figure. The error bars do not seem to match, however: the error bars here (95% CIs, i.e. roughly twice the SEM) are wider than those in Sklar et al.’s report. This could be a result of them removing reaction times that are outliers (some RTs here are extreme, e.g. > 2000 ms).
From the 95% CI plot, I infer that both the 1700 ms and 2000 ms presentation times have facilitation effects significantly greater than 0: the 95% CI error bars for both presentation times exclude 0, indicating a significant effect of congruent vs. incongruent primes in the subtraction condition. It seems there is no significant difference in the facilitation effect between the 1700 ms and 2000 ms presentation times.
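To probe the outlier hunch above, one could re-run the summary pipeline after trimming long RTs (a sketch; the 1000 ms cutoff is an arbitrary choice of mine, not a criterion from Sklar et al.):
ds_trimmed <- ds %>%
filter(rt < 1000) %>% # hypothetical trimming rule; also drops NA RTs
group_by(subid, congruent, presentation.time, operand) %>%
summarise(mean.rt = mean(rt, na.rm = TRUE)) %>%
spread(congruent, mean.rt) %>%
mutate(facilitation = no - yes) %>%
group_by(presentation.time, operand) %>%
summarise(mean.facilitation = mean(facilitation), ci = ci(facilitation))
# narrower ci values here than in ds_average would support the outlier explanation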
Challenge problem: verify Sklar et al.’s claim about the relationship between RT and the objective manipulation check.
Sklar et al. claim that the objective manipulation check measures participants’ lack of awareness of the primes: those who perform above chance level (objective.test > 0.6) would be aware of the primes, and those who do not would not be. Thus, by their logic, it is expected that those who are aware of the primes do not show the “facilitation effect.”
First, let’s look at whether there is an effect of awareness of the prime on RT.
d$aware <- d$objective.test > 0.6
ggplot(d, aes(x = rt)) + geom_histogram() + facet_grid(. ~ aware)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 237 rows containing non-finite values (stat_bin).
The overall shape and center of the RT distributions seem to be the same. Next, let’s see if there is a difference in the distribution of facilitation effects by awareness.
d_aware <- d %>%
group_by(subid, operand, congruent, aware) %>%
summarise(mean.rt = mean(rt, na.rm = T)) %>%
spread(congruent, mean.rt) %>%
mutate(facilitation = no - yes)
ggplot(d_aware, aes(x = aware, y = facilitation)) + geom_boxplot()
There doesn’t seem to be a difference in facilitation effect by awareness. Maybe we need to plot the distribution of the facilitation effect by awareness and operand, since the authors reported an effect only in the subtraction condition.
ggplot(d_aware, aes(x = aware, y = facilitation)) + geom_boxplot() + facet_grid(. ~ operand)
Again, there doesn’t seem to be a difference in facilitation effect by awareness. This is concerning, because it suggests that the objective.test exclusion criterion isn’t meaningful and is just an arbitrary decision. Two things might be happening: 1) objective.test is not actually capturing what it’s supposed to capture, namely awareness of the primes… or 2) the “facilitation effect,” as Sklar et al. call it, isn’t really an unconscious effect.
(1) seems less likely, so let’s test (2). We can test whether there is a negative relationship between objective.test and the facilitation effect: we would expect that the better participants do on the objective.test, the smaller their mean facilitation effect. I will only look at subtraction trials, since Sklar et al. suggested that addition may be too easy for the facilitation effect to show.
d_obj <- d %>%
group_by(subid, congruent, operand, objective.test) %>%
summarise(mean.rt = mean(rt, na.rm = T)) %>%
spread(congruent, mean.rt) %>% # so that one row has both incongruent (no) and congruent (yes)
mutate(facilitation = no - yes) %>% # facilitation = incongruent - congruent
filter(operand == "subtraction")
summary(lm(data = d_obj, facilitation ~ objective.test))
##
## Call:
## lm(formula = facilitation ~ objective.test, data = d_obj)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.835 -14.119 1.486 15.192 67.208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.248 13.591 -0.460 0.648
## objective.test 26.314 20.184 1.304 0.200
##
## Residual standard error: 23.84 on 40 degrees of freedom
## Multiple R-squared: 0.04076, Adjusted R-squared: 0.01678
## F-statistic: 1.7 on 1 and 40 DF, p-value: 0.1998
The results show that objective.test does not correlate significantly with the facilitation effect.
The next and final step I will take to investigate this question is to plot the mean facilitation effect as a function of the objective.test exclusion cutoff. Sklar et al. used 0.6 as the boundary between unaware and aware participants. By shifting the cutoff from 0.6 toward 1, we include more aware participants in the data, so we would expect the mean facilitation effect to decrease. I include this plot because I want to see how the mean facilitation effect changes as the 0.6 cutoff is moved to other values.
# computes the mean and SE of the facilitation effect by operand at a given objective.test exclusion cutoff
compute_mean_se <- function(d, exclusion_num){
ans <- d %>%
filter(objective.test < exclusion_num) %>%
group_by(subid, congruent, operand) %>%
summarise(mean.rt = mean(rt, na.rm = T)) %>%
spread(congruent, mean.rt) %>%
mutate(facilitation = no - yes) %>%
group_by(operand) %>%
summarise(mean.facilitation = mean(facilitation), se = sem(facilitation)) %>%
mutate(exclusion_num = exclusion_num)
return(ans)
}
# compute the summary at each candidate cutoff and stack the results into one data frame
exclusion_nums <- seq(0.45, 1, 0.05)
d_ans <- map_df(exclusion_nums, function(i) compute_mean_se(d, i))
ggplot(d_ans, aes(x = exclusion_num, y = mean.facilitation)) +
geom_point() +
facet_grid(. ~ operand) +
geom_errorbar(aes(ymin = mean.facilitation - se, ymax = mean.facilitation + se)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
We do see a negative relationship between the objective.test exclusion cutoff and mean facilitation. The effect is very weak and only appears in the subtraction condition, and it generally seems like it doesn’t matter where the objective.test cutoff is set, as long as it is between 0.6 and 1. The sharp early decline and the spike in the SEM error bars in the subtraction condition can probably be attributed to having too few participants (very few scored below 0.5).