library(tidyverse)
## -- Attaching packages ---------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## -- Conflicts ------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.5.1
theme_set(theme_few())
sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(sum(!is.na((x))))}
ci <- function(x) {sem(x) * 1.96} # reasonable approximation
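The 1.96 multiplier in ci is the normal-approximation critical value. As a minimal sketch of what "reasonable approximation" means, the block below compares it with a t-based half-width on made-up numbers. The helper ci_t and the values in x are illustrative assumptions, not part of the problem set.

```r
# sem() and ci() as defined above, repeated so this sketch runs on its own
sem <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
ci  <- function(x) sem(x) * 1.96

# ci_t is a hypothetical helper: same idea, but with a t critical value
ci_t <- function(x) {
  n <- sum(!is.na(x))
  sem(x) * qt(0.975, df = n - 1)
}

x <- c(612, 655, 590, 701, 640, NA, 668)  # made-up reaction times (ms)
normal_half_width <- ci(x)
t_half_width <- ci_t(x)
print(c(normal = normal_half_width, t = t_half_width))
```

With only a handful of observations the t-based interval is noticeably wider; the two converge as the number of participants grows.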
This is problem set #5, in which we hope you will practice the visualization package ggplot2, as well as hone your knowledge of the packages tidyr and dplyr. You’ll look at two different datasets here.
First, data on children’s looking at social targets from Frank, Vul, Saxe (2011, Infancy).
Second, data from Sklar et al. (2012) on the unconscious processing of arithmetic stimuli.
In both of these cases, the goal is to poke around the data and make some plots to reveal the structure of the dataset.
This part is a warmup; it should be relatively straightforward ggplot2 practice.
Load data from Frank, Vul, Saxe (2011, Infancy), a study in which we measured infants’ looking to hands in moving scenes. There were infants from 3 months all the way to about two years, and there were two movie conditions (Faces_Medium, in which kids played on a white background, and Faces_Plus, in which the backgrounds were more complex and the people in the videos were both kids and adults). An eye-tracker measured children’s attention to faces. This version of the dataset only gives two conditions and only shows the amount of looking at hands (other variables were measured as well).
fvs <- read_csv("data/FVS2011-hands.csv")
## Parsed with column specification:
## cols(
## subid = col_integer(),
## age = col_double(),
## condition = col_character(),
## hand.look = col_double()
## )
First, use ggplot to plot a histogram of the ages of children in the study. NOTE: this is a repeated measures design, so you can’t just take a histogram of every measurement.
# generate table of subid and age
fvs_ages <- fvs %>%
  select(subid, age) %>%
  group_by(subid) %>%
  summarize(age = mean(age, na.rm = TRUE))
# histogram of age
ggplot(fvs_ages, aes(x = age)) +
  geom_histogram(binwidth = 1)
Second, make a scatter plot showing the difference in hand looking by age and condition. Add appropriate smoothing lines. Take the time to fix the axis labels and make the plot look nice.
# scatterplot of hand looking by age and condition
ggplot(fvs, aes(x = age,
                y = hand.look,
                color = factor(condition, labels = c("Whole Person", "Multiple People")))) +
  geom_point() +
  labs(x = "Age (months)",
       y = "% looking to hands",
       color = "Condition") +
  geom_smooth(method = "lm") +
  ggtitle("Looking at hands by condition and age")
What do you conclude from this pattern of data?
It looks like the older infants may be looking longer at hands in general than the younger infants. It also looks like the older infants (but not the younger infants) may be looking longer at hands in the Multiple People movies where the backgrounds were more complex and the people in the videos were both kids and adults (vs the Whole Person movies where kids played on a white background).
What statistical analyses would you perform here to quantify these differences?
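A linear regression model with an age-by-condition interaction term, to test for a main effect of age and an interaction between age and condition on looking at hands. Because this is a repeated measures design, a mixed-effects model with a random intercept per child would additionally account for the non-independence of measurements from the same child.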
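As a rough sketch of the interaction model described above, the block below fits it with lm on simulated data. The data frame sim and every number in it are made up for illustration; only the column names mirror fvs.

```r
set.seed(1)
# Simulated stand-in for fvs: each child appears in both conditions (all values made up)
n_subj <- 40
sim <- data.frame(
  subid = rep(seq_len(n_subj), each = 2),
  age = rep(runif(n_subj, 3, 24), each = 2),
  condition = rep(c("Faces_Medium", "Faces_Plus"), times = n_subj)
)
sim$hand.look <- 0.02 + 0.002 * sim$age +
  0.003 * sim$age * (sim$condition == "Faces_Plus") +
  rnorm(nrow(sim), sd = 0.03)

# age * condition expands to age + condition + age:condition
fit <- lm(hand.look ~ age * condition, data = sim)
coef_table <- summary(fit)$coefficients
print(coef_table)
```

The age row estimates the age slope in the reference condition, and the age:conditionFaces_Plus row estimates how much that slope changes in the other condition. This sketch treats rows as independent, which the real repeated-measures data are not.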
Sklar et al. (2012) claim evidence for unconscious arithmetic processing - they prime participants with arithmetic problems and claim that the participants are faster to repeat the answers. We’re going to do a reanalysis of their Experiment 6, which is the primary piece of evidence for that claim. The data are generously shared by Asael Sklar. (You may recall these data from the tidyverse tutorial earlier in the quarter.)
First read in two data files and subject info. A and B refer to different trial order counterbalances.
subinfo <- read_csv("data/sklar_expt6_subinfo_corrected.csv")
## Parsed with column specification:
## cols(
## subid = col_integer(),
## presentation.time = col_integer(),
## subjective.test = col_integer(),
## objective.test = col_double()
## )
d_a <- read_csv("data/sklar_expt6a_corrected.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## prime = col_character(),
## congruent = col_character(),
## operand = col_character()
## )
## See spec(...) for full column specifications.
d_b <- read_csv("data/sklar_expt6b_corrected.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## prime = col_character(),
## congruent = col_character(),
## operand = col_character()
## )
## See spec(...) for full column specifications.
gather these datasets into long (“tidy data”) form. If you need to review tidying, here’s the link to R4DS (bookmark it!). Remember that you can use select_helpers to help in your gathering.
Once you’ve tidied, bind all the data together. Check out bind_rows.
The resulting tidy dataset should look like this:
  prime  prime.result target congruent operand distance counterbalance subid    rt
  <chr>         <int>  <int> <chr>     <chr>      <int>          <int> <dbl> <int>
1 =1+2+5            8      9 no        A             -1              1     1   597
2 =1+3+5            9     11 no        A             -2              1     1   699
3 =1+4+3            8     12 no        A             -4              1     1   700
4 =1+6+3           10     12 no        A             -2              1     1   628
5 =1+9+2           12     11 no        A              1              1     1   768
6 =1+9+3           13     12 no        A              1              1     1   595
# gather subject columns into rows
data_a_tidy <- d_a %>%
  gather(subid, rt, "1":"21")
data_b_tidy <- d_b %>%
  gather(subid, rt, "22":"42")
# stack a and b; they share the same columns, so bind_rows is enough
data_tidy <- bind_rows(data_a_tidy, data_b_tidy)
Merge these with subject info. You will need to look into merge and its relatives, left_join and right_join. Call this dataframe d, by convention.
# merge a+b with subinfo, make sure subid is integer in both a+b and subinfo
data_tidy$subid <- as.integer(data_tidy$subid)
d <- left_join(data_tidy, subinfo, by = "subid")
Clean up the factor structure (just to make life easier). No need to, but if you want, you can make this more tidyverse-ish.
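d$presentation.time <- factor(d$presentation.time)
# NOTE: operand was read in as character, so this only attaches a "levels"
# attribute; the stored values remain "A" and "S"
levels(d$operand) <- c("addition","subtraction")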
Examine the basic properties of the dataset. First, show a histogram of reaction times.
# histogram of rt
ggplot(data_tidy, aes(x = rt)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 237 rows containing non-finite values (stat_bin).
Challenge question: what is the sample rate of the input device they are using to gather RTs?
About 35 milliseconds between samples, i.e. a sample rate of roughly 29 Hz (1000 / 35).
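One way to estimate the sampling period, sketched here under the assumption that recorded RTs fall on a regular grid, is to look at the gaps between adjacent distinct RT values. The vector rt below is simulated on a 35 ms grid purely for illustration; with the real data you would pass d$rt instead.

```r
# Made-up RTs that fall on a 35 ms grid, standing in for d$rt
set.seed(2)
step_ms <- 35
rt <- 420 + step_ms * sample(0:20, size = 500, replace = TRUE)

# Gaps between adjacent distinct RT values; the smallest gap estimates the sampling period
gaps <- diff(sort(unique(rt)))
period_ms <- min(gaps)
rate_hz <- 1000 / period_ms
print(c(period_ms = period_ms, rate_hz = rate_hz))
```

On real data the gaps may be noisy, so tabulating them with table(gaps) and looking for the most common small gap is a safer read than taking the minimum alone.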
Sklar et al. did two manipulation checks. Subjective - asking participants whether they saw the primes - and objective - asking them to report the parity of the primes (even or odd) to find out if they could actually read the primes when they tried. Examine both the unconscious and conscious manipulation checks. What do you see? Are they related to one another?
cor.test(d$subjective.test, d$objective.test, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: d$subjective.test and d$objective.test
## t = 57.052, df = 6466, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5622115 0.5946395
## sample estimates:
## cor
## 0.5786542
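Yes, there’s a medium-sized positive correlation (r ≈ 0.58) between the subjective and objective manipulation checks. Note that this test was run on the trial-level rows of d, so each participant is counted many times; the degrees of freedom (6466) and the p-value overstate the evidence. Running the test on subinfo, which has one row per participant, would be more appropriate.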
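To illustrate why the unit of analysis matters for this test, here is a small sketch on simulated data. The data frame sub stands in for subinfo and trials for the joined trial-level data; all values, and the choice of 150 trials per participant, are made up.

```r
set.seed(3)
# Made-up subject table standing in for subinfo (one row per participant)
n_subj <- 42
sub <- data.frame(
  subid = seq_len(n_subj),
  subjective.test = rbinom(n_subj, 1, 0.3)
)
sub$objective.test <- pmin(1, pmax(0, 0.5 + 0.2 * sub$subjective.test +
                                      rnorm(n_subj, sd = 0.1)))

# Repeat each subject row many times to mimic joining onto trial-level data
trials <- sub[rep(seq_len(n_subj), each = 150), ]

by_trial <- cor.test(trials$subjective.test, trials$objective.test)
by_subject <- cor.test(sub$subjective.test, sub$objective.test)
print(c(df_trial = unname(by_trial$parameter),
        df_subject = unname(by_subject$parameter)))
```

The correlation estimate is the same either way, but the trial-level test reports thousands of degrees of freedom where only the number of participants minus two is justified.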
In Experiments 6, 7, and 9, we used the binomial distribution to determine whether each participant performed better than chance on the objective block and excluded from analyses all those participants who did (21, 30, and 7 participants in Experiments 6, 7, and 9, respectively). Note that, although the number of excluded participants may seem high, they fall within the normal range of long-duration CFS priming, in which successful suppression is strongly affected by individual differences (38). We additionally excluded participants who reported any subjective awareness of the primes (four, five, and three participants in Experiments 6, 7, and 9, respectively).
OK, let’s turn back to the measure and implement Sklar et al.’s exclusion criterion. You need to have said you couldn’t see (subjective test) and also be not significantly above chance on the objective test (< .6 correct). Call your new data frame ds.
ds <- d %>%
  filter(subjective.test == 0,
         objective.test < .6)
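The binomial criterion in the quoted passage can be sketched for a single participant with binom.test. The trial count and number correct below are hypothetical, since the length of the objective block is not given in this section; the 0.05 threshold is likewise an assumption.

```r
# Hypothetical numbers: the objective-block trial count is not given in this section
n_trials <- 64      # assumed number of two-choice parity judgments
n_correct <- 40     # assumed number answered correctly

# One-sided exact binomial test against chance (p = 0.5)
bt <- binom.test(n_correct, n_trials, p = 0.5, alternative = "greater")
above_chance <- bt$p.value < 0.05   # assumed alpha level
print(c(prop_correct = n_correct / n_trials, p_value = bt$p.value))
```

The fixed cutoff of .6 used in the filter above is a simpler stand-in for running this test separately for each participant.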
[Figure: Sklar et al. (2012) facilitation effect]
Sklar et al. show a plot of a “facilitation effect” - the amount faster you are for prime-congruent naming compared with prime-incongruent naming. They then plot this difference score for the subtraction condition and for the two prime times they tested. Try to reproduce this analysis.
HINT: first take averages within subjects, then compute your error bars across participants, using the ci function (defined above). Sklar et al. use SEM (and do it incorrectly, actually), but CI is more useful for “inference by eye” as discussed in class.
HINT 2: remember that in class, we reviewed the common need to group_by and summarise twice, the first time to get means for each subject, the second time to compute statistics across subjects.
HINT 3: The final summary dataset should have 4 rows and 5 columns (2 columns for the two conditions and 3 columns for the outcome: reaction time, ci, and n).
# step 1 - rt per congruent and operand conditions (per subject)
ds_subject <- ds %>%
  group_by(subid, congruent, operand) %>%
  summarize(subid_rt = mean(rt, na.rm = TRUE))
# step 2 - rt per congruent and operand conditions (across subjects)
ds_summary <- ds_subject %>%
  group_by(congruent, operand) %>%
  summarize(avg_rt = mean(subid_rt, na.rm = TRUE),
            ci = ci(subid_rt),
            n = n())
Now plot this summary, giving more or less the bar plot that Sklar et al. gave (though I would keep operation as a variable here). Make sure you get some error bars on there (e.g. geom_errorbar or geom_linerange).
# calculate per subject average rt across congruent, operand, presentation.time
ds_subject_withP <- ds %>%
  group_by(subid, congruent, operand, presentation.time) %>%
  summarize(subid_rt = mean(rt, na.rm = TRUE))
# subtraction only, calculate facilitation effect per subject, then calculate facilitation effect across presentation time
ds_facilitation <- ds_subject_withP %>%
  ungroup() %>%  # drop the grouping left over from summarize before reshaping
  filter(operand == "S") %>%
  spread(congruent, subid_rt) %>%
  mutate(diff = no - yes) %>%  # incongruent minus congruent RT
  group_by(presentation.time) %>%
  summarize(rt_diff = mean(diff), ci = ci(diff), n = n())
# bar graph of facilitation effect by presentation time
ggplot(ds_facilitation, aes(x = presentation.time, y = rt_diff)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_errorbar(aes(ymin = rt_diff - ci, ymax = rt_diff + ci),
                width = .2,
                position = position_dodge(.9)) +
  labs(title = "Experiment 6", x = "Presentation duration", y = "Facilitation (ms)")
What do you see here? How close is it to what Sklar et al. report? How do you interpret these data?
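The means match, but our error bars are much larger. The error bars for the two presentation durations overlap heavily, so these data give no clear evidence that facilitation differs between prime times. Overlapping intervals are not themselves a significance test, so this is best read as an absence of evidence for an effect of prime time rather than proof that there is none.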
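A direct comparison of the two presentation times is more informative than eyeballing overlap. The sketch below runs a Welch two-sample t-test on simulated per-subject facilitation scores. The data frame fac, the labels "short" and "long", the group sizes, and all values are made up; with the real data you would use the per-subject diff column computed before the final summarize.

```r
set.seed(4)
# Made-up per-subject facilitation scores (ms) for two presentation times
fac <- data.frame(
  presentation.time = factor(rep(c("short", "long"), each = 8)),
  diff = c(rnorm(8, mean = 15, sd = 40), rnorm(8, mean = 20, sd = 40))
)

# Each subject saw one presentation time, so this is a between-subjects comparison
tt <- t.test(diff ~ presentation.time, data = fac)  # Welch test by default
print(tt$conf.int)
```

The confidence interval is for the difference between the two group means, which is the quantity the overlap argument is trying to judge.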