Make sure that you load the following packages: dplyr, ggplot2, and readr.
Make sure that you install and load the package ggridges.
Make sure that you download the template for the lab.
Often, when processing data, you will be given multiple similar files to analyze. For example, each participant in an experiment may produce their own, individual datafile. To analyze all that data, we cannot look at each file individually. Instead, we need to combine the files, to analyze them as a group. Here, we will use R to do that.
The dataset for this part of the class is split into three large files, named p_values1.csv, p_values2.csv, and so on. We want to read each of these datafiles in, and then bind them together into a single dataframe. So, download them to your lab directory. (NB: if you find that your computer is too slow to process these files quickly, let me know.)
Here, we will do this using the functions read_csv() and bind_rows().
For example, the following R code reads in two of the files, and then uses the function bind_rows to bind them together into one large dataframe. Begin by implementing this in the chunk load_p_data.
```r
pval1 = read_csv("p_values1.csv")
pval2 = read_csv("p_values2.csv")
pval = bind_rows(pval1, pval2)
```

We can make the loading process a little more efficient if we change the code as follows (you should update the ... to refer to the other file).

```r
pval = bind_rows(
  read_csv("p_values1.csv"),
  read_csv("p_values2.csv"),
  ...
)
```

Before you continue, make sure that you have code that produces a dataframe called pval, which should have many thousands of rows!
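As an aside, here is a minimal sketch of an equivalent approach that builds the filenames programmatically and binds everything in one step. It assumes the third file is named p_values3.csv, following the naming pattern above; adjust if your filenames differ.

```r
library(readr)   # already loaded at the start of the lab
library(dplyr)   # already loaded at the start of the lab

# Build the three filenames, read each file, and bind the rows into one dataframe.
p_files = paste0("p_values", 1:3, ".csv")
pval = bind_rows(lapply(p_files, read_csv))
```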
In the chunk examine_pval, use the functions summary() and head() to examine the pval dataframe that you have created. The p values dataset provides a list of reported p values for a wide array of scientific and social-scientific disciplines, going all the way back to the 1960s.
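For example, a minimal sketch using just the two functions named above:

```r
summary(pval)  # a column-by-column summary of the dataframe
head(pval)     # the first few rows, so you can see the column names and example values
```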
We’d like to look at this dataset of p values, to see if there is a tendency for people to over-report p values close to 0.05.
We will use ggplot() to plot the set of p values. We will use a function called geom_density() to examine the probability distribution of all of the p values. We’ll do this in the chunk first_plot, by copying and then editing the code below. Remember that ggplot() takes the name of the dataset as the first argument, and then for a density plot we just tell it which column should go along the x axis.
```r
ggplot(dataset, aes(x = column_containing_p_values)) +
  geom_density()
```

Uh oh! Something looks very odd here. In particular, some of those p values seem very strange. Have you ever encountered p values greater than 1?
Clearly, there is an error in the dataset, in which some of the records have p values greater than 1. We need to clean the dataset, which we can do (in the chunk filter_pvals) by filtering out all of the rows where the p value is greater than 1. Remember that dplyr::filter() removes all rows that don’t meet a particular condition. Once you have implemented the filter, plot the figure again using the new dataset.
```r
pval = pval %>%
  filter(name_of_column_containing_p_values < critical_p_value) # Use the less-than symbol to exclude all the values greater than 1.
```

This looks a lot better! We can already see that there is a big spike of p values close to 0.05.
Now, let’s break it down by discipline. In the chunk colorful_plot, make a new graph. This time, using the aes() settings in your ggplot() function call, set a key:value pair, color = column_name, to specify that the color of the line should be determined by the column from your dataset that indicates the field of study (e.g., psychology, economics, etc.). You should be able to figure out the name of that column from the explorations that you conducted earlier, using summary() and head().
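If it helps, here is a minimal sketch of the shape of that call. The names p_value_column and field_column are placeholders, not the real column names; substitute whatever you found with summary() and head().

```r
# Placeholder column names only -- swap in the actual names from your pval dataframe.
ggplot(pval, aes(x = p_value_column, color = field_column)) +
  geom_density()
```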
Uggh! That looks very bad, right? There are so many different fields of study that the figure legend crowds out the actual density plot!
Let’s try a different type of plot, to make it easier to read. We will try a new ggplot() function, called geom_density_ridges(), which comes from the package ggridges that you loaded at the start of the lab. In the chunk ridge_plot, create a new ggplot() call in which you:

- replace geom_density() with geom_density_ridges(), and
- in aes(), set y to the name of the column that lists the different fields.

Once you have done this, your code should produce a new image like this:
That looks cool! Now, we can make the graph a little easier to read by augmenting your ggplot() code with some additional options: more informative x and y axis labels, and a truncated x axis.
The last part of this code, the line that reads scale_x_continuous(breaks = c(0, 0.15), limits = c(0, 0.2)), sets the x axis to run only from 0 to 0.2, with breaks (i.e., tick marks) at 0 and 0.15. You can play around with this setting if you like.

```r
ggplot(data, aes(aesthetics)) +
  geom_density_ridges() +
  xlab("informative x axis name") +
  ylab("informative y axis name") +
  scale_x_continuous(breaks = c(0, 0.15), limits = c(0, 0.2))
```

When you run this, you will see a warning along the lines of "Removed 225531 rows containing non-finite values (stat_density_ridges)". This is expected: limits = c(0, 0.2) drops every p value above 0.2 from the plot.
Unfortunately, this graph is still a little bit hard to read. In particular, if you are anything like me, it is hard to match the labels to the lines. It would be easier if the plot were bigger. You can do this by adding instructions about the figure size to the R code chunk header. Do this for the chunk big_ridge_plot, and add the previous ggplot() code in there as well.
```
{r fig.width=8, fig.height=12}
```
If you look in the dataset, you’ll see that there is a column called year. We can use this to examine if the p values reported in the scientific literature have changed over time.
One thing we could do is simply facet by year (i.e., display each year’s p values). But then we would have 40 or 50 different facets (as the data goes back to the 50s), and the resulting graph would be very hard to read.
Instead, let’s use dplyr::mutate to create a new column in your pval dataset, called decade, that returns the decade that each year of publication is in (so, 2009 was in the 2000s). The final task for this part of the lab is to create the function that returns the decade. Your dplyr call should look as follows:
```r
pval = pval %>%
  dplyr::mutate(decade = calculation_on_year)
```

Your goal now is to figure out how to do this (in the chunk calculate_decade). If the calculation makes sense to you, try doing it on your own. If not, read on…
If you are trying to figure out a calculation like this, it is best to start with a simple example, before you start modifying your dataframe. I’ve tried to walk you through that below.
```r
# Imagine that you want to calculate what decade 1999 is in.
1999

# The easiest way to do this might be to round 1999 down to the nearest 10 (i.e., to 1990).
# You could also round 1999 to the nearest 10 (i.e., to 2000), but that seems wrong.

# R has a function called floor(), which rounds numbers down to the nearest integer.
# So,
floor(1.5)    # = 1
floor(2005.5) # = 2005

# But the year 1999 is already an integer (there's no decimal place).
# We can make it non-integer by dividing it.
# e.g.,
1999/10 # = 199.9

# Now, what happens if we take the floor?
floor(1999/10) # = 199

# And if we multiply this by 10 again?
floor(1999/10) * 10 # = 1990
```

Now, to apply this to a full dataset, what you need to do is to write a mutate() call that creates a new column called decade. The value of decade is calculated by taking the column year, dividing it by 10, taking the floor() of that value, and then multiplying that by 10.
Shout if this doesn’t make sense.
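Putting the pieces together, one way to write the mutate() call described above is sketched below (this is essentially the answer, so only peek if you need to).

```r
# Apply the floor(year / 10) * 10 trick to every row: 1999 becomes 1990, 2009 becomes 2000, and so on.
pval = pval %>%
  dplyr::mutate(decade = floor(year / 10) * 10)
```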
Finally, in the chunk plot_decade, use facet_grid(.~decade) to facet your ggplot graph.
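For example, a minimal sketch of the faceted plot (p_value_column and field_column are again placeholders for the real column names in pval):

```r
# Placeholder column names only -- substitute the actual p value and field columns.
ggplot(pval, aes(x = p_value_column, y = field_column)) +
  geom_density_ridges() +
  facet_grid(. ~ decade)
```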
Go to the p curve website, and click on the link to the p curve app.
You’ll see a text box with some example statistics entered into it, a mix of t test results, correlation results (r), ANOVA results (F), and others.
Press the button to make the p curve. You should see a figure like the one below.
Do you remember what the different lines mean?
Now, I would like you to create your own p curve, based on the results sections of a set of papers that I have pasted below.
To create a p curve, you need to find a hypothesis that has been tested in a set of experiments, and that has a set of p values associated with it. These p values can come from one research paper, or from many research papers, but they must all be investigating a roughly coherent topic (e.g., Can children use top-down information in language processing?, Do children prefer people of the same gender?)
Here, your broad topic is pattern learning in infancy. In a seminal paper, Gary Marcus and his colleagues demonstrated that 7-month-old infants could learn abstract rules from simple verbal stimuli. In their study, infants heard an auditory stimulus that followed the pattern, for example, wo fe fe, ga ti ti, pa ku ku, ta lo lo. As you may have noticed, all of these strings follow the pattern ABB (the first syllable is different to the second and third). Marcus found that, when infants were exposed to a long string of these syllables, they learned the underlying pattern. In particular, using a head turn preference procedure, Marcus found that infants would listen longer to new strings that followed a different pattern (ABA), than to strings that followed the same pattern (ABB). This was the case even if the new set of strings were made up of completely different sounds (e.g., train on spoken syllables, test with strings made up of animal sounds, like woof, moo, woof). Marcus claimed that infants’ ability to learn these abstract patterns was evidence that they could learn an “abstract rule”. This finding is so important because, while Marcus argued that rule learning is easy for babies, rule learning appears to be hard for connectionist neural network models, which cannot easily replicate this behavior. In particular, it is very hard for those networks to generalize from one type of stimulus to a second type of stimulus.
Here, our hypothesis is that infants can learn abstract rules. Our study selection is all of the papers that have attempted to replicate and extend the finding of Marcus and his colleagues. Below, I have pasted results sections from six of those papers. In each of these papers, the authors measured infants’ responses to test trials defined by a “novel” rule compared to test trials defined by a “familiar” rule, i.e., the rule that they had been trained on.
Now, using the p-curve app, and these results sections, test the evidential value in the hypothesis that infants can learn abstract rules. Extract the critical, hypothesis-relevant test statistics from each paragraph, and enter them into the p-curve app. Remember that you should only include findings that relate to the core hypothesis. Remember that some papers may phrase the hypothesis differently from others (e.g., some papers might call rules Familiar vs Unfamiliar, other papers may call rules Trained vs Untrained, etc). Save the resulting picture and embed it in your RMarkdown document.
It should look like this:
Start building a p curve for the broader topic being investigated for your replication project. Try to use more than one research paper to do this – identify references from your replication paper in order to find additional resources.
A brief guide to creating a p curve can be found here.