This homework is due by Thursday, January 26th, 8:00pm. Upload a pdf file to Canvas called
2_wrangling.pdf
In this homework, you’ll combine data-wrangling and visualization to reverse-engineer a professional data visualization. The goal of this homework is to replicate Figure 1 as closely as possible. This is known as a “Copy the Master” assignment (Nolan and Perrett 2016). If you have any questions, feel free to come to section (Monday 3:30-4:30 Econ 139 or Tuesday 4:30-5:30pm Littlefield 103), sign up for OH (links on the course website), or post on Ed Discussion.
You’ll be working with a dataset collected in the high throughput psychophysics experimental tradition (Rajalingham et al. 2018). One variant of this approach is to collect data from hundreds, even thousands of participants, in order to generate reliable estimates of perceptual behaviors. Typically, this is accomplished using online data collection. In this dataset there were originally 345 subjects who collectively performed 42,113 trials of a visual discrimination task.
In order to generate the main plot, we’ll analyze a subset of this human behavior. When this is done, we’ll compare human performance with the behavior of neurons and a computational model performing the same task. The neural data is from a neuroscience/psychophysics experiment conducted in non-human primates (Majaj et al. 2015). The computational model data, in our case, was used to validate its correspondence with visual area IT. If you’re interested in this model class, (Yamins and DiCarlo 2016) provides a nice intro. The basic idea is that we can use biologically plausible computational models to understand the relationship between the brain and behavior.
These data were collected to generate a stimulus set with an interesting pattern of perceptual properties: 1. Early visual regions (e.g. primate area V4) can’t do the task at all 2. Late visual regions (e.g. primate aread IT) can do the task on some items 3. Neurotypical humans can do the task on all items (outperforming IT and V4)
The beautiful plot you want to recreate.
Panel A compares model predictions (x axis) with the neural predictions from V4 (y axis). Panel B shows the model predictions (x axis) compared with the neural predictions from IT (y axis), showing that the model is a good approximation for this visual region. Panel C shows the model predictions (x axis) with the behavior of neurotypical humans (y axis). Humans clearly outperform the model (and are clearly better than IT or V4). Panel D is a summary plot, with just the trends from the first three. Panel E is the relationship between reaction time (x axis) and the difference between human performance and neural area IT. This demonstrates that the degree to which human accuracy diverges from visual area IT (but not V4), the more time is required to perform these tasks. Together these results (alongside recent lesion studies) suggest that other brain regions, beyond these canonical sensory regions, are implicated in these kinds of behaviors.
What kind of behavior? Formally, they’re called concurrent visual discrimination (“oddity”) tasks that vary in difficulty. The task is to identify on each trial which one out of three images is the odd one. Here is a screenshot of an example trial. In this case, the lion is the odd one out.
Example trial. Elephant is the typical object while lion is the oddity object. In the control trials (V0), all objects are presented from the same viewpoint. In the experimental trials (V3), all objects are presented from different viewpoints. This trial here is one of the V3 elephant trials.
You’ll be working with two dataframes in this homework. First, we’ll provide you with a data frame that contains the human behavior you’ll be wrangling and summarizing. Second, a dataframe that contains the accuracy on those same items, but from several different sources: neural responses from two visual regions (V4 and IT) as well as a computational model.
Great. Let’s start by loading the human behavioral data frame:
df.human_wide = read_csv("data/human_behavior_V03.csv")
Let’s take a look at the first 40 column names:
df.human_wide %>%
names() %>%
head(40)
## [1] "index" "0-array_type" "0-choice"
## [4] "0-chosen_object" "0-collection" "0-database"
## [7] "0-iteration" "0-oddity_category" "0-oddity_index"
## [10] "0-oddity_name" "0-oddity_type" "0-reaction_time"
## [13] "0-trial_indices" "0-trial_number" "0-typical_category"
## [16] "0-typical_name" "0-variation_level" "0-worker_id"
## [19] "1-array_type" "1-choice" "1-chosen_object"
## [22] "1-collection" "1-database" "1-iteration"
## [25] "1-oddity_category" "1-oddity_index" "1-oddity_name"
## [28] "1-oddity_type" "1-reaction_time" "1-trial_indices"
## [31] "1-trial_number" "1-typical_category" "1-typical_name"
## [34] "1-variation_level" "1-worker_id" "10-array_type"
## [37] "10-choice" "10-chosen_object" "10-collection"
## [40] "10-database"
Oh, wow. That’s really untidy data: each row here is a subject, and each column contains 1) a prefix indicating which trial these data correspond to, and 2) a variable name indicating what information is present. So, for example, if we’re interested in the variable variation_level, and there are \(n\) trials, this means there are \(n\) columns containing variation_level :(
Let’s take a look at the content of a single trial, for example, trial 13:
Variables of interest we’ll be looking at:
V0 are the easy control trials & V3 are the challenging experimental trialsVariables that you need to create:
worker_id, typical_name, etc.). You might call this new data frame df.human_long. This may require two transformation steps. First, you’ll separate the current column headings into new columns: the trial and the variable name indicating what information present. Second, you’ll structure the dataset so that each row corresponds to a single trial, with the variable names indicating what information was present being columns.df.human_wide2 <- df.human_wide %>%
pivot_longer(cols = -index,
names_to = c("trial","variable"),
names_sep = "-",
values_to = "rating",
values_transform = list(rating = as.character)) %>%
pivot_wider(names_from = "variable",
values_from = "rating")
index column, rename the worker_id column to participant and put it as the left-most column of the data frame.df.human_wide3 <- df.human_wide2 %>%
select(index,oddity_name,chosen_object,typical_name,variation_level,reaction_time,worker_id) %>%
select(-index) %>%
relocate(worker_id) %>%
rename(participant = worker_id)
NA) in the participant column (i.e. the column that used to be the worker_id column before you renamed it in step 2).Tip: The
drop_na()function will be helpful here.
df.human_wide4 <- df.human_wide3 %>%
drop_na(participant)
<chr> columns that contain only numbers into numeric columns.df.human_wide5 <- df.human_wide4 %>%
mutate(across(.cols = c(participant,reaction_time),
.fns = ~ as.numeric(.)))
correct column by checking whether the object a participant chose (chosen_object) matches the oddity (oddity_name). That is, you want to check whether they correctly identified the odd one out.# 1 means the participant got the trial correct, 0 means they got it wrong.
df.human_wide6 <- df.human_wide5 %>%
mutate(correct = ifelse(chosen_object == oddity_name,1,0))
df.human_wide6 %>%
head(10)
Tip: Remember that the control trials are marked as “V0” in
variation_level.
#Find the total number of trials for each participant, remove participants' V0 trials if they scored less than 90% overall on it, then find the new number of total trials for each participant and remove participants' V3 trials if their V0 trials were removed
df.human_wide7 <- df.human_wide6 %>%
group_by(participant) %>%
mutate(pre_n = n()) %>%
group_by(participant,variation_level) %>%
mutate(n = n(),
total_number_correct = sum(correct),
percent_correct = (total_number_correct/n)*100) %>%
filter((variation_level == "V0" & percent_correct > 90) | variation_level == "V3") %>%
group_by(participant) %>%
mutate(post_n=n()) %>%
filter(pre_n == post_n)
typical_name). The estimates should be rounded to two decimal places.df.human_wide8 <- df.human_wide7 %>%
group_by(typical_name) %>%
summarize(n=n(),
total_trials_accurate=sum(correct),
human_accuracy = round(total_trials_accurate/n, digits=2),
human_reaction_time = round(mean(reaction_time, na.rm=T), digits=2)) %>%
select(typical_name, human_accuracy, human_reaction_time)
df.human_wide8 %>%
head(10)
If all went well, you should have created the same data frame that is also saved in data/human_accuracy _and_reaction_time_per_item.csv. If you weren’t quite able to create this data frame, you can load it in the code chunk below and continue to work with it.
df.human_long = read_csv("data/human_accuracy_and_reaction_time_per_item.csv")
Now, let’s load the data frame that contains the neural and computational performance on those same typical_name items.
df.neural_and_model = read_csv("data/neural_and_computational_predictions_per_item.csv")
df.human_long data frame with the df.neural_and_model data frame.Tip: Use the
typical_namecolumn to join the data frames.
df.human_wide9 <- df.human_wide8 %>%
full_join(df.neural_and_model,
by = "typical_name")
df.human_wide10 <- df.human_wide9 %>%
mutate(difference = human_accuracy-it)
df.human_wide10 %>%
head(10)
Tip: For panels A to D the model accuracy (in the column
model) is shown on the x-axis, and what is shown on the y-axis differs between the panels (A:v4, B:it, C:human_accuracy). For D, instead of showing a scatter plot, usegeom_smooth(method = "lm")to show the regression lines. For E, use a scatter plot to show the relationship between the difference in accuracy between human and it performance on the y-axis (which you’ve already calculated above), and reaction time on the x-axis.To combine the different plots into one figure panel, take a look at the
patchworkpackage. You can find an example for how to use that package in the notes for the Visualization 2 lecture (visualization2.Rmd).
Plot1 <- ggplot(df.human_wide10, aes(x=model, y=v4)) +
geom_point(pch=21, size=2.5, stroke=0.7, color="black", fill="grey") +
coord_cartesian(ylim=c(0.2,1),
xlim=c(0.2,1)) +
ylab("V4 Performance") +
xlab("Model Performance") +
labs(title = "Model outperforms neural area V4") +
geom_abline(lty="dashed",color="gray45") +
theme_classic()
Plot2 <- ggplot(df.human_wide10, aes(x=model, y=it)) +
geom_point(pch=21, size=2.5, stroke=0.7, color="black", fill="darkorange") +
coord_cartesian(ylim=c(0.2,1),
xlim=c(0.2,1)) +
ylab("IT Performance") +
xlab("Model Performance") +
labs(title = "Model Predicts neural area IT") +
geom_abline(lty="dashed",color="gray45") +
theme_classic()
Plot3 <- ggplot(df.human_wide10, aes(x=model, y=human_accuracy)) +
geom_point(pch=21, size=2.5, stroke=0.7, color="white", fill="black") +
coord_cartesian(ylim=c(0.2,1),
xlim=c(0.2,1)) +
ylab("Human Performance") +
xlab("Model Performance") +
labs(title = "Humans outperform model") +
geom_abline(lty="dashed",color="gray45") +
theme_classic()
df.human_long11 <- df.human_wide10 %>%
pivot_longer(cols = c("human_accuracy","v4","it"),
names_to = "plot_type",
values_to = "performance")
Plot4 <- ggplot(df.human_long11, aes(x=model, y=performance, color=plot_type)) +
geom_point(alpha=0, show.legend=F) +
geom_smooth(method="lm", se=0, show.legend=F, lwd=1.5) +
coord_cartesian(ylim=c(0.2,1),
xlim=c(0.2,1)) +
ylab("Performance") +
xlab("Model Performance") +
labs(title = "Stimuli exhibit ideal properties") +
geom_abline(lty="dashed",color="gray45") +
scale_color_manual(values=c("black","darkorange","grey")) +
theme_classic()
Plot5 <- ggplot(df.human_wide10, aes(x=human_reaction_time, y=difference)) +
geom_point(pch=21, size=2.5, stroke=0.7, color="black", fill="darkorange") +
coord_cartesian(ylim=c(0,0.6),
xlim=c(2000,4000)) +
ylab("Human - Neural") +
xlab("Reaction time") +
labs(title = "Difference between humans and IT \nis related to reaction time") +
scale_x_continuous(expand = expansion(mult=c(0.15,0))) +
scale_y_continuous(expand = expansion(mult=c(0.12,0))) +
theme_classic()
patchwork <- Plot1 + Plot2 + Plot3 + Plot4 + Plot5
patchwork <- (Plot1 | Plot2 | Plot3 | Plot4 | Plot5)
patchwork + plot_annotation(tag_levels = "A") &
theme(plot.tag = element_text(face="bold"))
## `geom_smooth()` using formula 'y ~ x'