Reproducible Analysis

Researcher degrees of freedom are an issue in disciplines that rely heavily on fairly complicated statistical analysis. There are many different ways a dataset can be analysed, which can produce qualitatively different results (see Many analysts, one dataset). A way to combat this is to create reproducible workflows, so that given the files and analytical techniques used a result can be replicated and thus critiqued.

Hadley Wickham’s amazing book R for Data Science has the following image of the workflow:

Data Analysis Workflow

Read the book for a much more developed workflow practice.


Data Import

The information contained here is a quick and dirty way to get set up for data import, the first stage of the workflow. It is a stage that can be confusing and daunting given the many different applications that exist to help with streamlining and data sharing.

I will walk through three topics:

  1. R Projects

  2. Packrat

  3. GitHub

Projects

R Projects are amazing. Use them. They will save you time and allow you to organise all your work so that you, or anyone else, can pick up where you left off without any key files missing. There is great synergy between R Projects, Packrat and GitHub, such that you should be able to import, analyse and share data and code that clearly illustrates what you are trying to do; great for collaboration and great for reproducibility.

Projects allow one to use relative file addresses, such as Temporal_data.csv, which makes work easier to reproduce. A project automatically sets the working directory to the project folder, so we don’t have to specify which folder on the computer holds the dataset. If all the files of interest are in the project folder, we can just call the datafile by name.

Why is this useful? Well, consider what happens if you send your datafiles and scripts to someone else. They are going to have a different storage system to you; they’re unlikely to have folders laid out like "/PhD/Data/Dimensions of Natural Capital/Transect Data/Nat_Cap_Dim_Transect_Data.csv". So instead we organise our files into one project folder, use relative addresses and send the whole project folder over to our friend. They will then be able to open up a script and get straight to work, with no messing around looking for files.

#relative address
temporal <- read.csv("Temporal_data.csv")

#absolute address, won't be transferable and so hinders ease of reproducibility
Transect <- read.csv("~/PhD/Data/Dimensions of Natural Capital/Transect Data/Nat_Cap_Dim_Transect_Data.csv")
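To see why the relative address travels well, here is a minimal sketch. It assumes a throwaway folder under tempdir() stands in for the project folder; the two-row demo file and the names project_dir and temporal_demo are made up for illustration. setwd() is only used to imitate what opening an R Project does for you.

```r
# A temporary folder plays the role of the project folder (illustrative assumption)
project_dir <- file.path(tempdir(), "demo_project")
dir.create(project_dir, showWarnings = FALSE)

# A tiny made-up data file saved inside the "project"
write.csv(data.frame(X = 1:2, May = c(6, 37)),
          file.path(project_dir, "Temporal_data.csv"), row.names = FALSE)

# Opening an R Project sets the working directory for you; setwd() imitates that here
old_wd <- setwd(project_dir)

# The relative address now resolves against the project folder
temporal_demo <- read.csv("Temporal_data.csv")

# Put the working directory back where it was
setwd(old_wd)
```

Wherever the project folder ends up on another computer, the read.csv() line stays the same.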

Packrat

Packages make life so much easier. Sharing functions and code allows analyses to proceed incredibly fast, once you know what to do. An issue for reproducibility, though, is that packages change over time. Perhaps a function is changed to do something slightly different, and when you go to reproduce some work years later it produces a different result. Packrat gets around this issue.

Packrat is a dependency management tool that allows each project to have its own package folder, or private package library. It notes what version of each package has been used, so that if an analysis is reproduced years later, it will install the version of the package that was used when the analysis took place.

Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa. That’s because packrat gives each project its own private package library.

Portable: Easily transport your projects from one computer to another, even across different platforms. Packrat makes it easy to install the packages your project depends on.

Reproducible: Packrat records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.

To set up Packrat: go to the Packages tab and look for the Packrat sub tab. You may need to update RStudio for this to appear. Click on the Packrat tab and check the box to use Packrat with this project.

Now all the packages for this project are stored in a folder for this project. Great.
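If you prefer the console to the tick box, the same steps can be sketched in code. The helper names below are hypothetical and only wrap Packrat's own init(), snapshot() and restore() functions. Nothing runs until you call one of the helpers from inside the project you want to manage.

```r
#install.packages("packrat")

# Hypothetical helper: give the project its own private package library
start_packrat <- function(project_dir = ".") {
  packrat::init(project_dir)
}

# Hypothetical helper: record the package versions currently in use in the lockfile
record_versions <- function(project_dir = ".") {
  packrat::snapshot(project_dir)
}

# Hypothetical helper: reinstall the versions recorded in the lockfile,
# for example years later or on a collaborator's computer
reinstall_versions <- function(project_dir = ".") {
  packrat::restore(project_dir)
}
```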

GitHub

GitHub is a repository for code. It is useful as just that, a repository for all your code, which you can access from anywhere. Its main advantage over other cloud storage services is that it is built to assist in working on projects collaboratively. It is very useful in the computer programming world, but also useful when working on data analysis.

GitHub works off repositories, basically the same as projects in R. All the files related to one project are stored in a repository.

Introduction to Git

If you are new to GitHub it can be quite confusing. I was, and still am, a beginner, but I can now see why it can be useful and am eager to keep using it for all of my projects. For an introduction check out this tutorial. By following this tutorial you will be set up with a GitHub account and introduced to repositories.

Further tutorials

For our purposes it will mostly be a repository for code and allow ease of collaboration on analysis, but it can be used for a lot more besides.

Setting up Git with an R project:

It should be possible to set up Git with an already existing project, yet when I do this I can’t connect to my online GitHub account. I haven’t figured out why yet. So what I do is create a new project that uses Git from the start and then copy over any folders I want into the new Git initiated R project.

You will need a GitHub account to follow these instructions.

  1. Create a new project (top right tab in the RStudio user interface).
  2. Choose the Version Control option (the third option).
  3. Choose the local folder on your computer where you would like to create the project. I use the same address as the R project I created above and just make a new folder within that.
  4. On your GitHub account, create a new repository (green button on the right hand side of the repository screen).
  5. Name the repository (repo in Git jargon) whatever you would like. It is good to keep the names of the R project and Git repo the same.
  6. Copy and paste the URL of the newly created repo into the URL section of the New Project window in RStudio.
  7. Copy and paste all the files from the R project folder into the newly created Git enabled R project folder.
  8. To check that Git is now enabled for use with your R project, open up the R project and look for the Git tab in the top right window in RStudio, as shown:

Now that the project is set up and connected to your online GitHub profile you can commit all the files in the R project and then push them to the master branch online (read this as uploading the project).

  1. In the top right window of RStudio go to the Git tab.
  2. As this is the first time we are committing files (getting them ready to upload), click on the Commit sub tab.
  3. Check the files that you would like to commit, write a commit message (it is mandatory) and then click the Commit button.
  4. Close the box that pops up and then click the Push button. This will upload, or push, the committed files to the online repo.
  5. Check the online repo: have the files been uploaded?

Sharing Projects

Your project can now be shared with other people. All they have to do is copy and paste the URL of the repository when they are creating a new project.

Create a new version control project

Choose Git as the version control manager

Copy and paste the repository URL into the URL box

Alternatively, you can add collaborators to the online repository. Go to the settings tab of your repository on GitHub, then manage access and add collaborators.

Doing this will mean that you don’t have to clone the repository and create a new project every time someone else conducts some new analysis, uploads a new file and so on. This is the benefit of Git: you can pull any new updates that someone else has pushed, see what has changed, then do some of your own work, commit it and push it back to the online repository so your collaborators can access it.

When sharing files collaboratively, the process to follow is:

  1. Pull any new files from the online repo. This will update all the files you hold locally on your device.
  2. Commit the changes in the files you have worked on.
  3. Push the files to upload them to the online repo.

The Pull, Commit and Push workflow ensures that you stay up to date with how the project is developing while contributing your work.
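For anyone curious what the buttons do behind the scenes, the workflow can be sketched from the R console. This is only a sketch, assuming Git is installed and the working directory is a Git enabled project. The helper names git_steps and run_git_steps are hypothetical. Note that "add --all" stages every changed file, whereas the RStudio Git tab lets you tick files one by one.

```r
# Hypothetical helper: lists the git arguments for one pull, commit and push round
git_steps <- function(message) {
  list(
    "pull",                               # bring in what collaborators have pushed
    c("add", "--all"),                    # stage every changed file
    c("commit", "-m", shQuote(message)),  # record the staged changes with a message
    "push"                                # upload the new commit to the online repo
  )
}

# Hypothetical helper: runs each step in the working directory, stopping at the first failure
run_git_steps <- function(message) {
  for (args in git_steps(message)) {
    status <- system2("git", args)
    if (status != 0) stop("git ", args[1], " failed")
  }
  invisible(TRUE)
}

#run_git_steps("Describe what you changed here")
```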


Tidy Data

Tidy data (go and check out the link for Hadley Wickham’s explanation) is a way of formatting data that follows three rules:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Visually it looks like this:

Tidy data format

The benefit of using tidy datasets is that it provides a standardised way to format a dataset. This allows you to get familiar with one data structure and become fluent in handling data transformation and analysis, whatever dataset you are presented with. But it is not just an arbitrary standard that benefits the user. The tidyverse family of packages is built around tidy datasets, so using tidy data will help you make the most of that very powerful package family.

The data set below is not tidy. I have multiple observations in one row, the size of a network in May, June, July and August. I am really missing a variable called month and a variable called size. This dataset needs to be tidied.

If you would like to run this code on your computer make sure to remove the comments, the #, in front of the lines of code for installing packages.

#install.packages("kableExtra")
#install.packages("dplyr")
library(dplyr)
library(knitr) #provides kable()
library(kableExtra) #this is a package for making nice tables; to compare, try just running head(temporal)
kable(head(temporal)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
X  May  June  July  August  type   level
1    6    95    41      54  Urban  High Urban
2   37    40   113      23  Urban  High Urban
3    3     3    19      56  Urban  High Urban
4    6    67    99      63  Urban  Medium Urban
5    7    67    69      52  Urban  Medium Urban
6   12   114    51      55  Urban  Medium Urban

Fortunately, some very smart people have written packages that take all the pain out of doing this. The tidyr package contains functions such as pivot_longer which take an untidy dataset and transform it into a tidy dataset.

#install.packages("tidyr")
library("tidyr")

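#taking the four columns May, June, July and August and creating a column called month for the names and a column called size for the values.
#names_transform (tidyr 1.1.0 or later) turns month into a factor whose levels keep the column order; in recent tidyr versions names_ptypes only checks the type.
kable(temporal[1:2,] %>%
  pivot_longer(c("May", "June", "July", "August"), names_to = "month", values_to = "size",
               names_transform = list(month = function(x) factor(x, levels = unique(x))))) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))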
X  type   level       month   size
1  Urban  High Urban  May        6
1  Urban  High Urban  June      95
1  Urban  High Urban  July      41
1  Urban  High Urban  August    54
2  Urban  High Urban  May       37
2  Urban  High Urban  June      40
2  Urban  High Urban  July     113
2  Urban  High Urban  August    23

Ta da! Now we have tidy data.
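If you don’t have Temporal_data.csv to hand, here is a small sketch you can run instead. It types in the six rows printed in the first table above; the object names temporal_head and temporal_long are made up for illustration. It leaves month as plain text rather than converting it to a factor.

```r
#install.packages("tidyr")
library(tidyr)

# The six rows shown in the first table, typed in by hand
temporal_head <- data.frame(
  X      = 1:6,
  May    = c(6, 37, 3, 6, 7, 12),
  June   = c(95, 40, 3, 67, 67, 114),
  July   = c(41, 113, 19, 99, 69, 51),
  August = c(54, 23, 56, 63, 52, 55),
  type   = "Urban",
  level  = rep(c("High Urban", "Medium Urban"), each = 3)
)

# Gather the four month columns into a month column and a size column
temporal_long <- pivot_longer(temporal_head, c("May", "June", "July", "August"),
                              names_to = "month", values_to = "size")

# Each observation now has its own row: 6 original rows x 4 months
nrow(temporal_long)
```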


Transform

Data Analysis Workflow

Check out Hadley Wickham’s book for information on piping, vectors, functions and iteration. These are the nuts and bolts of manipulating and transforming data and I would recommend spending some time on each subject to get fluent in data manipulation. It will save you loads of time and even be enjoyable once you hit a certain escape velocity.

Tl;dr Storing objects in the environment is necessary but can be a bit cluttered, piping helps to reduce the clutter. Piping would bring joy to Marie Kondo.

One note on piping is that it changes how you think about using R. Whenever I was conducting multiple operations on a dataset I would store each step of the operation as an object in the environment: a = 7, a.1 = 6, a.2 = 8. In the process of conducting an analysis I could end up with hundreds of objects in the environment, most with arbitrary names, such as a.2 being 8, that don’t make sense to anyone else, or even to myself a week later. I then have to try and figure out which object stored which value and distinguish it from the myriad of other very similar objects I created. Was I interested in a or a.1? It is very easy to confuse a and a.1 in a piece of code, and if I don’t catch it, this could affect the analysis.

Piping helps to get around this issue, as each piece of code can stand alone and be self referential to some degree. It bundles up multiple operations, piping the result of the previous operation to the following operation. That way we don’t need to store lots and lots of objects in the environment. Using pipes declutters the environment, makes code less confusing and makes work easier to reproduce.
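Here is a small sketch of the difference, using the built-in faithful dataset and dplyr. The column names eruption_secs and mean_secs, the cut-off of 70 minutes and the object name long_wait_summary are arbitrary choices made for illustration.

```r
#install.packages("dplyr")
library(dplyr)

# Without pipes: every step leaves another object behind in the environment
a   <- filter(faithful, waiting > 70)
a.1 <- mutate(a, eruption_secs = eruptions * 60)
a.2 <- summarise(a.1, mean_secs = mean(eruption_secs))

# With pipes: the same three steps, one named result
long_wait_summary <- faithful %>%
  filter(waiting > 70) %>%
  mutate(eruption_secs = eruptions * 60) %>%
  summarise(mean_secs = mean(eruption_secs))
```

Both routes give the same answer, but the piped version reads top to bottom and leaves only one object behind.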

A second point is that code in a script is more ‘real’ than any object stored in the environment. For a more concise explanation of the realness of objects and code check out this what is real section of the R for Data Science book.


Visualisation

Data Analysis Workflow

I’m a convert to ggplot, after having used base R functions to create plots for years. To assist in your conversion check out this webpage of plots built using ggplot. Aren’t they beautiful?

An example of using ggplot

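#install.packages("ggplot2")
library(ggplot2)

#urbanpalette is assumed to be a vector of colours defined earlier, one per value of level
temporal %>%
  pivot_longer(c("May", "June", "July", "August"), names_to = "month", values_to = "size",
               names_transform = list(month = function(x) factor(x, levels = unique(x)))) %>%
  ggplot(aes(x = month, y = size)) +
    geom_jitter(aes(col = level), size = 3, width = 0.1) +
    scale_colour_manual(values = urbanpalette) +
    coord_cartesian(ylim = c(0, 150)) +
    labs(title = "", y = "Size", x = "", caption = "") +
    theme_bw()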

In this document I am experimenting with ggplot, trying to create beautiful looking PCA plots. Check out the many different ways of creating PCA plots and compare the base plots to the more customised ggplot based plots. The code to recreate all the plots in the document can be downloaded from this GitHub repo.


Communication

R is far more than a statistical analysis programme. Used through RStudio, an integrated development environment (IDE), it can create, develop and publish documents, widgets, webpages, books, theses, presentations and blogs, and the list keeps growing. It’s like PowerPoint, Word, LaTeX, Stata, WordPress and Wix all bundled into one. Best of all it’s open source, free and has a community that loves to share code and best practices, meaning that if you see some really cool figure, analysis or widget, you will most likely find a tutorial, or code, that will help you to reproduce it. Then you can modify it however you want and hopefully inspire someone to use your code.

R Markdown

R Markdown is a very versatile document formatting language. It can be translated into HTML (the language of webpages), PDFs through LaTeX, Word documents and numerous presentation style formats. People have created websites using R Markdown (link, link and link) and have written books in R Markdown. In fact there is a book about R Markdown written in R Markdown.

This guide has been written using R Markdown!
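To give a feel for what an R Markdown file looks like, here is a minimal sketch that writes a tiny document to a temporary folder and renders it to HTML. The file name and its contents are made up for illustration. The render step is skipped if the rmarkdown package or pandoc is not available on your machine.

```r
#install.packages("rmarkdown")

# Build the three-backtick chunk fence in code to keep this example readable
fence <- strrep("`", 3)

# A header, one line of text and one code chunk
rmd_lines <- c(
  "---",
  "title: \"A tiny R Markdown document\"",
  "output: html_document",
  "---",
  "",
  "Some ordinary text, followed by a code chunk.",
  "",
  paste0(fence, "{r}"),
  "summary(faithful)",
  fence
)

rmd_file <- file.path(tempdir(), "tiny_document.Rmd")
writeLines(rmd_lines, rmd_file)

# Render to HTML only if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
}
```

In RStudio you would normally just press the Knit button, which does the rendering for you.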

Markdown resources:

RPubs

RPubs is a free publishing service provided by RStudio the company, which makes it easy to share documents created in RStudio the application.

You are able to read this document because RPubs is hosting it on their servers.

Shiny

Shiny allows you to build interactive web applications directly from R. You can publish any apps you create for free at https://www.shinyapps.io/.

#install.packages("shiny")
library(shiny)

# Define UI for application that draws a histogram

ui <- fluidPage(
   
   # Application title
   titlePanel("Old Faithful Geyser Data"),
   
   # Sidebar with a slider input for number of bins 
   sidebarLayout(
      sidebarPanel(
         sliderInput("bins",
                     "Number of bins:",
                     min = 1,
                     max = 50,
                     value = 30)
      ),
      
      # Show a plot of the generated distribution
      mainPanel(
         plotOutput("distPlot")
      )
   )
)

# Define server logic required to draw a histogram

server <- function(input, output) {
   
   output$distPlot <- renderPlot({
      # generate bins based on input$bins from ui.R
      x    <- faithful[, 2] 
      bins <- seq(min(x), max(x), length.out = input$bins + 1)
      
      # draw the histogram with the specified number of bins
      hist(x, breaks = bins, col = 'darkgray', border = 'white')
   })
}

# The render functions (renderPlot here) are for the plots or text that you would like to change while the app is running

# Run the application

#shinyApp(ui = ui, server = server)

Making your own Website

Making websites with Git

Making websites with R

Writing a book or thesis with R

See this good blog on how to write a thesis in R

See this book chapter on how to write books

