Reproducible Analysis
Researcher degrees of freedom are an issue in disciplines that rely heavily on fairly complicated statistical analysis. There are many different ways a dataset can be analysed, which can produce qualitatively different results (see Many analysts, one dataset). A way to combat this is to create reproducible workflows, so that given the files and analytical techniques used a result can be replicated and thus critiqued.
Hadley Wickham’s amazing book R for Data Science has the following image of the workflow:
Read the book for a much more developed workflow practice.
Data Import
The information contained here is a quick and dirty way to get set up with data import, the first stage of the workflow. This stage can be confusing and daunting given the many different applications that exist to help with streamlining and sharing data.
I will walk through three topics:
R Projects
Packrat
GitHub
Projects
R Projects are amazing. Use them. They will save you time and allow you to organise all your work so that you, or anyone else, can pick up where you left off without any key files missing. There is great synergy between R Projects, Packrat and GitHub, such that you should be able to import, analyse and share data and code that clearly illustrates what you are trying to do; great for collaboration and great for reproducibility.
Projects allow you to use relative file addresses, such as Temporal_data.csv, which makes work easier to reproduce. A project automatically sets the working directory to the project folder, so if all the files of interest are in that folder we can call a data file by name without specifying where on the computer to look.
Why is this useful? Well, consider what happens if you send your data files and scripts to someone else. They are going to have a different storage system to you, and they are unlikely to have folders laid out like "/PhD/Data/Dimensions of Natural Capital/Transect Data/Nat_Cap_Dim_Transect_Data.csv". So instead we organise our files into one project folder, use relative addresses and send the whole project folder over to our friend. They will then be able to open up a script and get straight to work, with no messing around looking for files.
#relative address
temporal <- read.csv("Temporal_data.csv")
#absolute address, won't be transferable and so hinders ease of reproducibility
Transect <- read.csv("~/PhD/Data/Dimensions of Natural Capital/Transect Data/Nat_Cap_Dim_Transect_Data.csv")
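To see why relative addresses travel well, here is a minimal sketch in which a temporary folder stands in for a project folder. The folder name, the file contents and the object names (project_dir, temporal_demo) are made up for illustration. In a real R project RStudio sets the working directory for you, so the setwd() call would not be needed.

```r
# A temporary folder stands in for the project folder (hypothetical example).
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Put a small made-up dataset inside the "project".
write.csv(data.frame(site = 1:3, size = c(10, 20, 30)),
          file.path(project_dir, "Temporal_data.csv"),
          row.names = FALSE)

# RStudio does this automatically when you open a project.
old_wd <- setwd(project_dir)

# The relative address now works wherever the project folder lives.
temporal_demo <- read.csv("Temporal_data.csv")
print(temporal_demo)

# Restore the previous working directory.
setwd(old_wd)
```

Because the script never mentions anything above the project folder, moving or sending the folder does not break it.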
Packrat
Packages make life so much easier; sharing functions and code allows analyses to proceed incredibly fast once you know what to do. An issue for reproducibility, though, is that packages change over time. Perhaps a function is changed to do something slightly different, and years later, when you go to reproduce some work, it produces a different result. Packrat gets around this issue.
Packrat is a dependency management tool that allows each project to have its own package folder, or private package library. It notes what version of each package has been used, so that if an analysis is reproduced years later, it will install the version of the package that was used when the analysis took place.
Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa. That’s because packrat gives each project its own private package library.
Portable: Easily transport your projects from one computer to another, even across different platforms. Packrat makes it easy to install the packages your project depends on.
Reproducible: Packrat records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.
To set up Packrat: go to the Packages tab and look for the Packrat sub tab. You may need to update RStudio for this to appear. Click on the Packrat tab and check the box to use Packrat with this project.
Now all the packages for this project are stored in a folder for this project. Great.
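Packrat can also be driven from the R console rather than the RStudio tab. The calls shown in the comments below (packrat::init(), packrat::snapshot() and packrat::restore()) are left commented out because they modify the project folder. The runnable part is only a base R illustration of the idea behind a snapshot, namely recording which version of each package is in use. record_versions is a hypothetical helper, not part of Packrat.

```r
# Console equivalents of the RStudio check box (commented out: they change the project folder).
# install.packages("packrat")
# packrat::init()      # give this project its own private library
# packrat::snapshot()  # record the package versions currently in use
# packrat::restore()   # reinstall the recorded versions on another machine

# Hypothetical helper: note the installed version of each named package.
record_versions <- function(pkgs) {
  data.frame(
    package = pkgs,
    version = vapply(pkgs, function(p) as.character(packageVersion(p)), character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

versions <- record_versions(c("stats", "utils"))
print(versions)
```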
GitHub
GitHub is a repository for code. It is useful as just that, a repository for all your code, which you can access from anywhere. Its main advantage over other cloud storage services is that it is built to assist in working collaboratively on projects. This is very useful in the computer programming world, but also useful when working on data analysis.
GitHub works off repositories, which are basically the same as projects in R. All the files related to one project are stored in a repository.
Introduction to Git
If you are new to GitHub it can be quite confusing. I was, and still am, a beginner, but I can now see why it can be useful and am eager to keep using it for all of my projects. For an introduction check out this tutorial. By following this tutorial you will be set up with a GitHub account and introduced to repositories.
Further tutorials
For our purposes it will mostly be a repository for code that allows easy collaboration on analysis, but it can be used for a lot more besides.
Setting up Git with an R project:
It should be possible to set up Git with an already existing project, yet when I do this I can’t connect to my online GitHub account. I haven’t figured out why yet. So what I do is create a new project that uses Git from the start and then copy over any folders I want into the new Git-initiated R project.
You will need a GitHub account to follow these instructions.
- Create a new project (top right tab in the RStudio user interface).
- Choose the Version Control option (the third option).
- Choose the location on your computer where you would like to create the folder. I use the same address as the R project I created above and just make a new folder within that.
- On your GitHub account, create a new repository (green button on the right hand side of the repository screen).
- Name the repository (repo in Git jargon) whatever you would like; it is good to keep the names of the R project and Git repo the same.
- Copy and paste the URL of the newly created repo into the URL section of the New Project window in RStudio.
- Copy and paste all the files from the R project folder into the newly created Git-enabled R project folder.
- To check that Git is now enabled for use with your R project, open up the R project and look for the Git tab in the top right window in RStudio, as shown:

Now that the project is set up and connected to your online GitHub profile you can commit all the files in the R project and then push them to the master branch online (read this as uploading the project).
- In the top right window of RStudio go to the Git tab.
- As this is the first time we are committing files (getting files ready to upload), click on the commit sub tab.
- Check the files that you would like to commit, write a commit comment (it is mandatory), and then click the commit button.
- Close the box that pops up and then click the push button. This will upload, or push, the committed files to the online repo.
- Check the online repo: have the files been uploaded?
Sharing Projects
Your project can now be shared with other people. All they have to do is paste the URL of the repository when they are creating a new project.
Choose Git as the version control manager.
Alternatively, you can add collaborators to the online repository. Go to the settings tab of your repository on GitHub, then manage access and add collaborators.

Doing this will mean that you don’t have to clone the repository and create a new project every time someone else conducts some new analysis, uploads a new file and so on. This is the benefit of Git: you can pull any new updates that someone else has pushed, see what has changed, do some of your own work, commit it and then push it back to the online repository so your collaborators can access it.
When sharing files collaboratively, the process to follow is:
- Pull any new files from the online repo. This will update all the files you hold locally on your device.
- Commit the changes in the files you have worked on.
- Push the files to upload them to the online repo.
The Pull, Commit and Push workflow ensures that you stay up to date with how the project is developing while contributing your work.
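The same routine can be run from a terminal with the git command line tool. The sketch below only builds the commands as text, so it is safe to run anywhere and touches no repository. sync_commands is a hypothetical helper and the commit message is made up. Note that on the command line files are staged with git add before they are committed, which is what ticking the boxes in the Git tab does.

```r
# Hypothetical helper: list the terminal commands behind the Pull, Commit, Push routine.
sync_commands <- function(message) {
  c(
    "git pull",                                # 1. fetch your collaborators' changes
    "git add -A",                              # stage all changed files (like ticking every box)
    paste("git commit -m", shQuote(message)),  # 2. commit with a message
    "git push"                                 # 3. upload the commit to the online repo
  )
}

cmds <- sync_commands("Add transect analysis")
cat(cmds, sep = "\n")
```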
Tidy Data

Tidy data (go and check out the link for Hadley Wickham’s explanation) is a way of formatting data that follows three rules:
- Each variable must have its own column.
- Each observation must have its own row
- Each value must have its own cell.
Visually it looks like this:
The benefit of using tidy datasets is that they provide a standardised way to format a dataset. This allows you to get familiar with one data structure and become fluent in handling data transformation and analysis, whatever dataset you are presented with. But it is not just an arbitrary standard that benefits the user. The tidyverse family of packages is built around tidy datasets, so using tidy data will help you use that very powerful package family.
The data set below is not tidy. I have multiple observations in one row, the size of a network in May, June, July and August. I am really missing a variable called month and a variable called size. This dataset needs to be tidied.
If you would like to run this code on your computer make sure to remove the comments, the #, in front of the lines of code for installing packages.
#install.packages("kableExtra")
#install.packages("dplyr")
library(dplyr)
library(kableExtra) #this is a package for making nice tables, to compare try just running head(temporal)
kable(head(temporal)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| X | May | June | July | August | type | level |
|---|-----|------|------|--------|------|-------|
| 1 | 6 | 95 | 41 | 54 | Urban | High Urban |
| 2 | 37 | 40 | 113 | 23 | Urban | High Urban |
| 3 | 3 | 3 | 19 | 56 | Urban | High Urban |
| 4 | 6 | 67 | 99 | 63 | Urban | Medium Urban |
| 5 | 7 | 67 | 69 | 52 | Urban | Medium Urban |
| 6 | 12 | 114 | 51 | 55 | Urban | Medium Urban |
Fortunately, some very smart people have written packages that take all the pain out of doing this. The tidyr package contains functions such as pivot_longer, which takes an untidy dataset and transforms it into a tidy dataset.
#install.packages("tidyr")
library("tidyr")
#taking the four columns May, June, July and August and creating a column called month for the names and a column called size for the values.
kable(temporal[1:2,] %>%
pivot_longer(c("May", "June", "July", "August"), names_to = "month", values_to ="size", names_ptypes = list(month = factor()))) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| X | type | level | month | size |
|---|------|-------|-------|------|
| 1 | Urban | High Urban | May | 6 |
| 1 | Urban | High Urban | June | 95 |
| 1 | Urban | High Urban | July | 41 |
| 1 | Urban | High Urban | August | 54 |
| 2 | Urban | High Urban | May | 37 |
| 2 | Urban | High Urban | June | 40 |
| 2 | Urban | High Urban | July | 113 |
| 2 | Urban | High Urban | August | 23 |
Ta da! Now we have tidy data.
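If you do not have Temporal_data.csv to hand, the following self-contained sketch types in the first three rows of the untidy table by hand and tidies them. It assumes the tidyr package is installed; temporal_small, months and long are names made up for this example. Here the month column is converted to a factor with base R after pivoting, so that the months keep their calendar order rather than being sorted alphabetically.

```r
library(tidyr)

# First three rows of the untidy table, typed in by hand.
temporal_small <- data.frame(
  X      = 1:3,
  May    = c(6, 37, 3),
  June   = c(95, 40, 3),
  July   = c(41, 113, 19),
  August = c(54, 23, 56),
  type   = "Urban",
  level  = "High Urban"
)

months <- c("May", "June", "July", "August")

# One row per network per month: column names go to month, cell values to size.
long <- pivot_longer(temporal_small,
                     cols = all_of(months),
                     names_to = "month",
                     values_to = "size")

# Keep the months in calendar order.
long$month <- factor(long$month, levels = months)

print(long)
```

Three networks measured in four months give twelve rows, one observation per row.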
Visualisation
I’m a convert to ggplot, after having used base R functions to create plots for years. To help assist in your conversion check out this webpage of plots built using ggplot, aren’t they beautiful?
An example of using ggplot
library(ggplot2)
# urbanpalette is a vector of colours, one per value of level; it must be defined before running this code
temporal %>%
  pivot_longer(c("May", "June", "July", "August"), names_to = "month", values_to = "size", names_ptypes = list(month = factor())) %>%
  ggplot(aes(x = month, y = size)) +
  geom_jitter(aes(col = level), size = 3, width = 0.1) +
  scale_colour_manual(values = urbanpalette) +
  coord_cartesian(ylim = c(0, 150)) +
  labs(title = "", y = "Size", x = "", caption = "") +
  theme_bw()
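The plot above relies on an object called urbanpalette that is defined elsewhere. As an illustration only, the sketch below builds a small tidy data frame from two of the networks in the table shown earlier and uses a made-up palette. The colour values and the names demo and demo_palette are assumptions, not the colours used in the original figure. It assumes ggplot2 is installed.

```r
library(ggplot2)

months <- c("May", "June", "July", "August")

# Two networks from the table above, already in tidy form.
demo <- data.frame(
  month = factor(rep(months, times = 2), levels = months),
  size  = c(6, 95, 41, 54,    # network 1
            6, 67, 99, 63),   # network 4
  level = rep(c("High Urban", "Medium Urban"), each = 4)
)

# Hypothetical colours, one per value of level.
demo_palette <- c("High Urban" = "#D55E00", "Medium Urban" = "#0072B2")

p <- ggplot(demo, aes(x = month, y = size)) +
  geom_jitter(aes(colour = level), size = 3, width = 0.1) +
  scale_colour_manual(values = demo_palette) +
  coord_cartesian(ylim = c(0, 150)) +
  labs(y = "Size", x = NULL) +
  theme_bw()

print(p)
```

Naming the palette entries after the values of level means each colour stays attached to the same group even if the order of the data changes.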

In this document I am experimenting with ggplot, trying to create beautiful looking PCA plots. Check out the many different ways of creating PCA plots and compare the base plots to the more customised ggplot based plots. The code to recreate all the plots in the document can be downloaded from this GitHub repo.
Communication
R is far from just a statistical analysis programme. Used through RStudio, an integrated development environment (IDE), it can create, develop and publish documents, widgets, webpages, books, theses, presentations and blogs, and the list keeps growing. It’s like PowerPoint, Word, LaTeX, Stata, WordPress and Wix all bundled into one. Best of all, R is open source and free, and has a community that loves to share code and best practices, meaning that if you see some really cool figure, analysis or widget, you will most likely find a tutorial, or code, that will help you to reproduce it. Then you can modify it however you want and hopefully inspire someone to use your code.
R Markdown
R Markdown is a very versatile document formatting language. It can be translated into HTML (the language of webpages), PDFs through LaTeX, Word documents and numerous presentation formats. People have created websites using R Markdown (link, link and link) and have written books in R Markdown; in fact, there is a book about R Markdown written in R Markdown.
This guide has been written using R Markdown!
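For readers who have not seen one, an R Markdown file is just text with a short header followed by prose and code chunks. The sketch below writes a made-up example to a temporary file and, only if the rmarkdown package and pandoc are both available, renders it to HTML. The file contents and the names rmd_file, chunk_fence and out are invented for illustration.

```r
# Write a minimal, made-up R Markdown document to a temporary file.
rmd_file <- file.path(tempdir(), "example.Rmd")
chunk_fence <- strrep("`", 3)  # three backticks mark the start and end of a code chunk
writeLines(c(
  "---",
  "title: \"A tiny example\"",
  "output: html_document",
  "---",
  "",
  "Some text written in Markdown.",
  "",
  paste0(chunk_fence, "{r}"),
  "summary(faithful$waiting)",
  chunk_fence
), rmd_file)

# Render only when the tools are installed; otherwise just keep the source file.
out <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(rmd_file, quiet = TRUE)
}
print(out)
```

Changing the output line in the header (for example to word_document) is how the same source is sent to a different format.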
Markdown resources:
RPubs
RPubs is a free publishing service provided by RStudio (the company) that makes it easy to share documents created in RStudio (the application).
You are able to read this document because RPubs is hosting it on their servers.
Shiny
Shiny allows you to build interactive web applications directly from R. You can publish any apps you create for free at https://www.shinyapps.io/.
#install.packages("shiny")
library(shiny)
# Define UI for application that draws a histogram
ui <- fluidPage(
  # Application title
  titlePanel("Old Faithful Geyser Data"),
  # Sidebar with a slider input for number of bins
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins",
                  "Number of bins:",
                  min = 1,
                  max = 50,
                  value = 30)
    ),
    # Show a plot of the generated distribution
    mainPanel(
      plotOutput("distPlot")
    )
  )
)
# Define server logic required to draw a histogram
server <- function(input, output) {
  output$distPlot <- renderPlot({
    # generate bins based on input$bins from ui.R
    x <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    # draw the histogram with the specified number of bins
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
}
The render functions, such as renderPlot, wrap the plots or text that you would like to update while the app is running.
# Run the application
#shinyApp(ui = ui, server = server)
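To illustrate the point about render functions with text rather than a plot, here is a second, smaller sketch. It assumes shiny is installed. The input and output ids (name, greeting) and the object names (ui_text, server_text, app) are made up. The app object is built but not launched.

```r
library(shiny)

# UI: a text box and a place for the rendered text.
ui_text <- fluidPage(
  textInput("name", "Your name:", value = "world"),
  textOutput("greeting")
)

# Server: renderText re-runs whenever input$name changes.
server_text <- function(input, output) {
  output$greeting <- renderText({
    paste0("Hello, ", input$name, "!")
  })
}

app <- shinyApp(ui = ui_text, server = server_text)
# Launch interactively with: runApp(app)
```

The pairing is the same as in the histogram example: textOutput("greeting") in the UI is filled by output$greeting in the server, just as plotOutput("distPlot") is filled by output$distPlot.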
Making your own Website
Making websites with Git
Making websites with R
Writing a book or thesis with R
See this good blog on how to write a thesis in R
See this book chapter on how to write books
