This is a step-by-step guide for Pilots running a reproducibility check for the Cognition Open Data (COD) project. R code that you may need to run will appear in grey boxes. We strongly recommend you look through the pre-registration for this project before starting out. This will give you a good overview of the general workflow and may contain additional details that have not been included here.

Please note: your goal is to try to reproduce a set of target outcomes using the available data files, information provided in the original article, and any other additional documentation (e.g., codebook or analysis scripts). Your role is not to attempt alternative analyses that you believe are superior. We are only interested in reproducibility for the purposes of the current investigation.

You can e-mail Tom with any questions (tom.hardwicke[@]stanford.edu).

Good luck!

Step 1: Setting up

R

You must run your reproducibility check in R. You can download the latest version from here:

https://www.r-project.org/

R Studio

We highly recommend using the free R Studio software which you can download here:

https://www.rstudio.com/products/RStudio/

We will be using several of R Studio’s built-in features, such as R Markdown and Packrat. If you are an advanced user and want to do it your own way, you are welcome to. However, the rest of this guide assumes you are using R Studio.

Github

We will be using Github for version control and collaboration. We have our own Github ‘organisation’ set up for this project:

https://github.com/CognitionOpenDataProject

To be added to the team, send an e-mail to Tom with your Github ID.

We also highly recommend using the free Github Desktop software which you can download here:

https://help.github.com/desktop/guides/getting-started/installing-github-desktop/

You can find plenty of guides to using Github online. Here is a good place to start:

https://guides.github.com/

We will not be doing anything super fancy, so don’t panic if you are not familiar with this tool. If you do not want to use Github, you can switch to a non-Github workflow (exchanging .zip files with Tom via e-mail).

Again, if you are an advanced user, feel free to use your own git workflow. However, the rest of this guide assumes you are using the Github Desktop software.

The COD Reports package

We have put together a simple R package (‘CODreports’) that contains a custom R Markdown template and a couple of custom functions for you to use when preparing your reproducibility report. To install the CODreports package, you will first need to install another package called devtools:

install.packages("devtools") 

Next, run the following command to install CODreports directly from our project Github page:

devtools::install_github("CognitionOpenDataProject/CODreports")

To check that this has installed correctly, click on ‘file’, ‘new file’, and then ‘R markdown…’ in R Studio. Select ‘from template’ and you should see an option “COD Reproducibility Report”. If not, the installation has gone wrong somewhere. Otherwise, you’re good to go!

Step 2: Request an article

The coding team are currently manually assessing whether articles published in Cognition between March 2014 and March 2017 have available and comprehensible data. Articles that pass those checks are then assessed by one of the project leads who will identify a small, coherent subset of reported outcomes that will be the target of a reproducibility check.

Once these target outcomes have been identified, the project lead adds a repository (repo) to the project Github page:

https://github.com/CognitionOpenDataProject

Each Cognition article has been assigned a unique 5-letter ID code, and each repo follows the naming convention “set_articleID”, e.g., “set_rhTys”.

Each repo contains the original article pdf, a targetOutcomes.md file, and a data folder containing a data file or files (more on this below). The repo also comes with a .gitignore file so you do not have to set this up yourself.

To request an article, you must do one of the following:

  1. If you are in Psych 254, visit this Google Doc to find your assigned article: https://docs.google.com/document/d/1jVdDeabSB8A4gt5cfg_kaJBJkaI-KPjgIlGBNIAK2Mg/edit?usp=sharing

  2. If you are not in Psych 254, e-mail Tom with the subject header “COD Pilot Article Request”. You can leave the body of the e-mail blank (I won’t be offended). I’ll then run some R code which randomly selects an article from the currently available pool and e-mail you the ID code.

You can only request one article at a time. Each time you complete a reproducibility check you can request another article. If you are assigned an article with which you feel there may be some degree of conflict of interest (e.g., you are an author of the article or have previously reviewed it), you should return it to Tom, who will send a replacement instead.

Step 3. Fork and clone your repo

Once you have your article ID code, you need to fork the corresponding repository. You can do this by opening the repository on the Github website and clicking on the ‘fork’ button in the top-right. It may take a few minutes for the repo to be forked over to your account. When it has finished, you need to clone the repository to your personal computer. The quickest way to do this is to click the green ‘clone or download’ button and then click ‘Open in Desktop’. Make sure you are in the forked repository and not the original COD repository! The files will now be downloaded to your computer and you should see the repo in the Github Desktop software.

Step 4. Set up an R Studio project

Open up R Studio. Click ‘file’, ‘new project’, ‘existing directory’ and then browse to wherever you cloned the repo on your computer. Now click ‘create project’. If the R Studio project is set up correctly then you should see the files from your repo listed in the files section.

Step 5. Set up Packrat

We will now set up something called ‘Packrat’. This is a system that will help to ensure that our own reproducibility checks are themselves reproducible! Basically, it will set up a local store of all of the R packages we need for this reproducibility check in the repo itself, rather than just relying on packages installed on your computer system. This means that when somebody else wants to re-run your analysis, they can find all of the correct package versions they need in the repo. You don’t need to know much more than that, but if you’re interested you can read more here: https://rstudio.github.io/packrat/

You don’t need to install Packrat separately as it comes with R Studio. To set up Packrat for this repo, you need to click on the small blue cube next to your project name in the top right of the window. From the dropdown menu choose ‘project options’, choose ‘packrat’, and then click on the checkbox for ‘use packrat with this project’.

Make sure that only “Automatically snapshot changes” is selected. We do not want Github to ignore Packrat.

Now click on OK and Packrat will start installing a local package store, which might take a few minutes. When it’s finished you can move on to step 6.
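For reference only: if you prefer working from the console, Packrat’s own functions can do roughly the same setup. This is entirely optional — the GUI route above is all you need.

packrat::init()     # initialise a local, project-specific package library
packrat::snapshot() # record the packages (and versions) the project currently uses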

Step 6. ‘Commit’ and ‘Sync’ your changes with Github

It is up to you how much you use the Github version control features. It is good practice to ‘commit’ your changes fairly regularly. Each time you commit changes they are saved, and you can ‘roll back’ if you realise later on that you’ve made a mistake. I suggest you commit your changes now as there are probably a lot of Packrat files that have just been created.

To do this, open up Github Desktop and select your repo. Where it says ‘summary’ and ‘description’ you can enter some information about this commit so you can work out what you did later on. For example, you might call this ‘added packrat’ and in the description put something like ‘first install of packrat dependency management’ (just a summary is often sufficient). Now click on ‘commit to master’. Confusingly, this does not mean you are committing to the original master repo in the COD Github project; you are committing to the ‘master’ branch of your fork. Committing also means you have just saved the changes locally. It’s a good idea to now click on ‘sync’ in the top right, which will back up your changes on the Github website.

You can see a useful graphical representation of the original COD master and your fork master in the dark grey box. You are not making any changes to the original COD master right now, just your fork. But eventually we are going to connect these back up.

It is good practice to keep committing and syncing your changes regularly, but I’ll leave it up to you how often you do this.

OK, you’re almost ready to start the actual reproducibility check!

Step 7. Open a new R Markdown file

Now you will open up a new R Markdown file. R Markdown is an approach to ‘literate programming’. This is the idea that we interleave actual code with plain language commentary explaining what we are doing in sufficient detail such that someone who does not understand the code itself can still figure out what we have done. This is a key component of reproducible analysis, and I hope we can exemplify this best practice in our own analysis scripts. You should provide detailed commentary in plain text throughout your report.

If you are unfamiliar with R Markdown, there’s plenty of information available here:

http://rmarkdown.rstudio.com/lesson-1.html

It is very easy to use! You may also find this ‘cheatsheet’ useful:

https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf

To run code that you have entered in ‘chunks’, just click the green arrow.
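For example, here is what a trivial (made-up) chunk looks like — clicking the green arrow runs the R code between the delimiters:

```{r}
# compute the mean of three numbers
mean(c(1, 2, 3))
```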

I have put together a custom R Markdown template so we can keep the reproducibility reports in a fairly standardised format. To open the template, click on ‘file’, ‘new file’, and then ‘R markdown…’ in R Studio. Select ‘from template’ and you should see an option “COD Reproducibility Report” if the CODreports package installed correctly (see above). Click on OK.

The R Markdown file that opens will begin with a ‘yaml’ header between two sets of dashed lines. Leave this section as it is. Below that you’ll see some lines referring to various details about this reproducibility check e.g.,

“#### Article ID: [Insert article ID number]”

The ‘#’ here is the markdown way of saying ‘format this as a heading’. Four ‘#’s means heading level 4; one ‘#’ would be heading level 1.

Throughout the template, I have included some text in square brackets that you should either replace or delete before submitting your final report. Anything not in square brackets should remain in your report.

So in this case you should replace “[Insert article ID number]” with the article ID. Enter your name as the Pilot, and today’s date as the start date. The end date will be the date that you submit your report via a pull request (details on this below).

Save the R Markdown file with the name “pilotReport.Rmd”.

Step 8. Familiarise yourself with the article and associated files

Before we get into the details of the R Markdown template, let’s go and have a look at what is available in the repo. You should have a pdf of the article, a targetOutcomes.md file (.md stands for ‘markdown’), and a data folder containing a data file or files. If any of these are missing contact Tom.

The targetOutcomes.md file can be opened in any text editor, or you can view it in the repo on Github. It outlines exactly which outcomes in the paper you should try to reproduce.

Please note you may need more information than is included in the targetOutcomes.md file in order to run your reproducibility check. For example, there may be essential pre-processing steps that are listed in the article, but are not included in the targetOutcomes.md file.

You should read the entire article and develop a good understanding of the methods employed by the original authors. Make sure you download any supplementary information files to see if they contain additional important details. You may even find some analysis scripts. This is great news, of course, because they provide very concrete information that should help you run your reproducibility check. If you do find detailed information about the analysis used by the original authors, be sure to include quotations in your report to illustrate this (see below for details on how).

Please note: you must not directly edit the original data file. This cannot be emphasised enough! The original data file must remain as it was when you forked the repo. This is so that when the copilot starts working with you on the project they can reproduce everything you have done from scratch. If you need to make manual edits to a data file, you should save an additional file (see below for details). If you accidentally make changes to the original data file, you should roll back these changes using Github (this is why it’s important to regularly commit changes!).

Step 9. Start completing the R Markdown report: Methods and target outcomes

Your first steps will be to fill in the Methods summary and target outcomes section. You need to write the methods summary from scratch, but you can copy and paste the target outcomes from the targetOutcomes.md file.

The remainder of the report is divided into 5 key stages outlined below.

Step 10. Load packages

Load any necessary R packages. Some useful ones are already listed and you can add any additional ones that you need.
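For example, a minimal sketch (these particular packages are suggestions for illustration — the template lists its own):

library(tidyverse)  # data import, munging, and visualisation
library(knitr)      # kable() for formatted tables
library(CODreports) # custom functions: compareValues() and codReport()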

Step 11. Load data

Load data from the file or files in the data folder. You may need different functions for different types of file.

This cheatsheet may be helpful: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-import-cheatsheet.pdf
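For example, if the data come as a .csv file, a minimal sketch (the file name here is hypothetical — use whatever is in your repo’s data folder):

d <- read_csv("data/data.csv") # read_csv() is from readr, loaded with the tidyverse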

Step 12. Tidy data

Mung/wrangle (organise) the data into a format that facilitates subsequent analysis. We highly recommend learning the concept of ‘tidy data’. For resources see here: http://r4ds.had.co.nz/tidy-data.html

and here: https://www.jstatsoft.org/article/view/v059i10

This cheatsheet may also be helpful: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf
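To illustrate, here is a hypothetical sketch using tidyr’s gather() to reshape wide data (one column per condition) into tidy long format — the column names are invented:

# suppose columns conditionA and conditionB each hold reaction times
d_tidy <- gather(d, key = "condition", value = "rt", conditionA, conditionB)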

To the greatest possible extent, you should try to conduct data munging operations programmatically in R. In some cases, you may need to make manual adjustments to the data file in, for example, Excel. If you have to do this, you should detail the steps you have taken in your R Markdown report and save an additional data file with the name “data_manualClean”.

Step 13. Run analysis

This section is further sub-divided into pre-processing, descriptive statistics, and inferential statistics. Work systematically through the target outcomes, attempting to reproduce each reported outcome with the analyses described in the original article (and any supporting documents).
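For example, continuing with the invented columns from the sketch above (your actual analyses will of course depend on the target outcomes for your article):

# descriptive statistics: mean reaction time per condition
aggregate(rt ~ condition, data = d_tidy, FUN = mean)

# inferential statistics: two-sample t-test comparing the two conditions
t.test(rt ~ condition, data = d_tidy)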

Step 14. Recording errors

Whenever you identify a potential discrepancy between an outcome from your analysis and a reported outcome, you need to explicitly note the error in your report. Please make sure you are familiar with the Error Classification Scheme outlined in the pre-registration document.

Whenever you encounter a numerical discrepancy between a reported value and a value obtained in your analysis, you should use the compareValues() function to classify the type of error. This function takes three arguments: the reported outcome, the outcome obtained in your analysis, and “isP”. “isP” should be set to TRUE if you are comparing p-values; otherwise you do not need to specify anything (it will default to FALSE). The function will calculate the percentage error difference between the two values, classify the error type, and return a standardised reporting sentence that you should include in your report.

Here is an example comparing two p values that results in a MINOR NUMERICAL ERROR and a DECISION ERROR:

compareValues(reportedValue = .054, obtainedValue = .049, isP = T)
## [1] "DECISION ERROR and MINOR NUMERICAL ERROR. The reported value (0.054) and the obtained value (0.049) differed by 9.26%"

Here is an example comparing two other values where there is a MINOR NUMERICAL ERROR:

compareValues(reportedValue = 2.5, obtainedValue = 2.45)
## [1] "MINOR NUMERICAL ERROR. The reported value (2.5) and the obtained value (2.45) differed by 2%"

Here is an example comparing two other values where there is a MAJOR NUMERICAL ERROR:

compareValues(reportedValue = 52, obtainedValue = 75)
## [1] "MAJOR NUMERICAL ERROR. The reported value (52) and the obtained value (75) differed by 44.23%"

Note that there is a special, fourth type of error which does not involve comparing numerical values. The INSUFFICIENT INFORMATION ERROR applies to situations where the data analysis procedure reported in the original article (and any supporting documentation) is so unclear or incomplete that you cannot conduct your reproducibility check. Note that if the provided information is ambiguous and you are unsure what the original analysis entailed, you should not attempt to make an educated guess about what the original authors did. Please consult the pre-registration for details about our rationale for taking this approach.

There is no R function for these situations. You should simply type INSUFFICIENT INFORMATION ERROR in block capitals and then, underneath, provide commentary in as much detail as possible about what the issue is.

Step 15. Reporting Conclusions

There are two aspects to reporting your conclusions. Firstly, you should provide a verbal summary of the report. Identify and describe any issues you encountered in as much detail as possible.

Secondly, use the custom codReport() function included in the CODreports package to output a standardised report table and .csv file. The codReport() function takes 6 arguments. Firstly, you must specify the report type, which in your case is ‘pilot’ (if/when you get to the copiloting stage, you will output a second report and enter ‘joint’ here). The second argument is the article ID code. The final four arguments are the number of errors of each type that you encountered.

Manually read through your report, tally up the number of different errors, and enter the numbers into the codReport() function.

Here is an example:

codReport(Report_Type = 'pilot',
          Article_ID = 'fdsrW', 
          Insufficient_Information_Errors = 1,
          Decision_Errors = 0, 
          Major_Numerical_Errors = 3, 
          Minor_Numerical_Errors = 1)
##   Insufficient_Information_Errors Decision_Errors Major_Numerical_Errors Minor_Numerical_Errors Final_Outcome
## 1                               1               0                      3                      1       Failure

Notice that the function automatically works out what the final outcome of your report is (success or failure) based on the error types. Consult the pre-registration for more details about how this decision is made.

Step 16. Submitting your report

The final step in preparing your report is to ‘knit’ it. This produces a nice-looking html document. You can find the knit button towards the top of the window, next to a blue ball of string. When you click ‘knit’, R Studio will show you the html version of your report. Some of the formatting might look a little strange, in which case you should click on ‘open in browser’. Things should look ok then.
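Optionally, you can also knit from the console (the knit button uses the rmarkdown package behind the scenes); assuming your report is in the project folder:

rmarkdown::render("pilotReport.Rmd") # writes pilotReport.html to the project folder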

If you decide you need to make some changes that’s fine. Just remember to knit your report again right before you submit it so that the html file is up-to-date.

To submit your report, you should issue a pull request. This means you are requesting that the author of the original master repo (Tom) merges the changes you have made in your fork into the master. To issue the pull request, open up the Github Desktop software and select your repo. Make sure you have committed and synced all recent changes first. Now click on the ‘pull request’ button in the top right. In the ‘description’ box, write ‘Pilot reproducibility check is complete’. Then click ‘send pull request’.

That’s it! The piloting stage is over. Psych 254 students, your job is done (unless you decide to stick with the project). Other pilots: you will be contacted soon by a co-pilot who will verify your reproducibility check and work with you to try to resolve any issues, potentially through making contact with the original authors.

Step 17. Tips for a top-notch report

Here are a few additional tips for producing a top-notch reproducible report.

Commentary

Describe exactly what you are doing throughout in plain language interleaved with code chunks. Try to avoid jargon and acronyms where possible (unless they are clearly defined).

Quotations

It can be really useful to use quotations from the original article or associated files to illustrate exactly what the original authors say they did and what they found. To write a quotation in markdown, just use the ‘>’ symbol. For example:

“> This is a quote from the article”

will produce:

This is a quote from the article.

When quoting, make sure you note the source e.g.,

This is a quote from the article. (from Jones et al. p.18).

Tables

We recommend using ‘kable’ for outputting nicely formatted tables. kable is included in the knitr package, which is loaded at the start of the template.
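For example, assuming a data frame like the hypothetical d_tidy from the sketches above:

kable(head(d_tidy)) # display the first few rows as a neatly formatted table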

Images

There are instructions for including images in the R markdown documentation here: http://rmarkdown.rstudio.com/authoring_basics.html

You could, for example, include a screenshot of a figure/table from the original article and compare/contrast it with your own findings.