Project Management for Economists

Wei Yang Tham

September 29, 2017

Intro

Goals

  • Understand what reproducibility is and why it’s important
  • Give you practical tips for organizing your project
  • Have concrete ideas for where to find help and new ideas

Caveat

  • I’m no reproducibility guru…
  • …but hopefully I can share a few tips to make your life easier

2nd years: Choosing a language

  • Stata
    • Heavily used in economics
  • R
    • Oriented towards data analysis
    • I use this :)
  • Python
    • More general purpose programming
  • Pick one and get really good at it; you can learn another language later

Reproducibility

What is reproducibility?

  • With the same data, you can reproduce the same set of results
  • NOT replicability
  • NOT correctness

Practical goals of reproducibility

  • You can run your entire workflow, from raw data to final paper, in one keystroke
  • Your project should be self-contained
    • i.e. you can reproduce your work on any machine in the world

Don’t let the perfect be the enemy of the good

  • Reproducibility tools take time and effort to learn
  • Sometimes project-specific constraints make it difficult to attain an idealized version of reproducibility

Reproducibility makes your life easier

  • Ensures “correctness” i.e. you’re doing what you think you’re doing
  • Easy for others to trace your steps and build on your work
  • “Others” includes future you!

Reproducibility workflow in R/RStudio

What if you’re not using R?

  • These principles are applicable across languages
  • Some will have similar tools or tools may be common across languages
  • E.g. check out Project Manager in Stata

Projects

  • Use RStudio’s Projects feature
  • Easy to keep your projects organized separately
  • Can open multiple sessions at the same time

Folder structure principles

  • Most basic: separate folders for code, data, output
  • Additional rules you can apply
    • separate raw and intermediate data
    • separate types of code e.g. cleaning and analysis

Creating a new project

Version control

  • Avoid this: file_final_v2_oops_reallyfinal_revised_v7.docx
  • Like time travel, or saving and loading in a video game
    • Can even create alternate timelines to test your code
  • Easy to keep code on multiple computers updated

Version control software

  • Git + Github
  • Git is a version control system
  • Github is an online version control repository

Git is intimidating

  • Start small; I use two commands (push and pull) 99% of the time
  • Using a Git client like GitKraken can make it more intuitive
  • More at this link for getting started with Git/Github

Dynamic documents

  • Don’t cut and paste figures/tables/numbers - easy to make mistakes
  • Integrate code and prose
  • Any changes to your data will be automatically reflected in the text
  • In R this is done with R Markdown

Example

  • I want to say: The flights dataset has N observations and K variables
library(nycflights13)
dim(flights)
## [1] 336776     19
  • The flights dataset has `r dim(flights)[1]` observations and `r dim(flights)[2]` variables
  • The flights dataset has 336776 observations and 19 variables

Coding

Coding style

  • Look up style guides for whatever language you use
  • You may get conflicting advice on style; just be consistent
  • E.g. in R, many style guides will say x <- 5, but a substantial minority say x = 5 - just choose one
  • Human-readable; with auto-complete, object names can be a bit longer
  • This is a problem with STATA wildcards

Defensive coding

  • Mistakes will happen - your goal is not to write mistake-free code, but to write code that catches those mistakes early
  • Insert tests in your code - e.g. after merging two datasets, a line that stops the code from running if some observations were not matched
  • Readable code makes mistakes easier to spot
  • Don’t transcribe

Projects set your home directory

# Type this in your console after opening a project
getwd()
## [1] "/Users/weiyangtham/Documents/WYT Projects/methods_workshop"
  • You can save a script in any subfolder in the project directory and it will always start from the project’s root directory

Use relative directories!

  • Suppose you want to load a file from the data folder
  • "data/file.csv"
  • Not "/Users/weiyangtham/Documents/Projects/file.csv"