Example Project Organization for Lab Meeting

Being organized in research is absolutely critical. Months (or even years) may pass between the times you look at a project. You may also need to communicate your work to others. It is our responsibility to ourselves, to our colleagues, and to the public who funds our work to be organized and traceable.

These are my thoughts on best practices for organizing research projects. Sometimes I depart from this format when it doesn’t really make sense (e.g. the directory this document is housed within isn’t really a “research project”, so I don’t intend to keep things organized in the same manner). This is what has worked for me after many years of trial and error. Overall, I would say that staying organized is the most important part, whatever scheme you use. Knowing where things go and, perhaps more importantly, where things (scripts, code, data, etc.) originated is critical to keeping track of your work and to making edits, corrections, and reporting go efficiently.

You will notice that when I name files or objects, I do not shy away from using plenty of text. Tab-complete is your friend and I, personally, prefer to know that a file called 01a_wrangling_plant_and_animal_data_to_tidy_format.R wrangles my plant and animal data into tidy format, even if it means a somewhat long filename. The alternative, calling it something like 01a_wrangle_data.R, might be meaningless when you have four different datasets that you’re wrangling in a project. (Of course, this is still way better than something ambiguous like data_cleaning.R.)

Directory Organization

I organize directories in a way that works well for how I name files as well. I follow this format:

Project home directory (e.g. “Denver Urban Parks Project”)
- data - only has subdirectories in it, no files live here
  - data_raw - contains only raw data files; do not edit them directly!
  - data_wrangling - contains scripts to manipulate raw data
  - data_output - contains the output of data_wrangling scripts
- analyses - contains scripts associated with analysis. The line between what belongs in data_wrangling and what belongs in analyses can get a bit blurry depending on the project, but that’s fine because the files are all named sequentially! (see below)
  - analyses_output - contains any outputs from analyses scripts
- figures - usually this directory contains no files, but I sometimes get sloppy here
  - figures_wrangling - scripts used to make figures from data_output or analyses_output files
  - figures_output - the output of figure wrangling
  - figures_edited - I often make final edits in Inkscape, so I save those files here. I’ll often save the final version of figures in this directory or in a separate directory
- docs - contains all Word and Markdown (and now Quarto) docs. Basically, written stuff. Sometimes I’ll make a subdirectory for the manuscript specifically
- incubator - a messy directory to house all the mess. I use this a lot during the planning phase of the project and will make folders for things like meeting notes, site data, budgets, flyers to recruit technicians, etc. Really just a junk pile that I do my best to keep in some sort of order. When I push my parent directory to GitHub, I use a .gitignore file to exclude this directory since it’s not intended to be public!
- src - I don’t always have this folder, but sometimes I will if there are functions or something that I want to reference across a bunch of scripts (e.g. I made a custom color palette for a project and want it to live somewhere I can load in at the top of every figure script or whatever).
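
If you would rather not create these folders by hand, the layout above can be scaffolded from R. This is only a minimal sketch using base R: the function name create_project_skeleton() and the project folder name are made up for illustration, and the example builds the skeleton in a temporary directory so it does not touch a real project.

```r
# Hypothetical helper (not part of any package): creates the directory
# skeleton described above underneath `project_root`.
create_project_skeleton <- function(project_root) {
  subdirs <- c(
    "data/data_raw",
    "data/data_wrangling",
    "data/data_output",
    "analyses/analyses_output",
    "figures/figures_wrangling",
    "figures/figures_output",
    "figures/figures_edited",
    "docs",
    "incubator",
    "src"
  )
  paths <- file.path(project_root, subdirs)
  for (p in paths) {
    # recursive = TRUE also creates parents such as data/ and figures/
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  # Keep the private incubator directory out of version control
  cat("incubator/\n",
      file = file.path(project_root, ".gitignore"),
      append = TRUE)
  invisible(paths)
}

# Example: build the skeleton in a temporary location (assumed folder name)
project_root <- file.path(tempdir(), "denver_urban_parks_project")
create_project_skeleton(project_root)
list.dirs(project_root, recursive = FALSE, full.names = FALSE)
```

Adjust the subdirs vector to taste; src, for example, is optional.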

Without the notes, a typical directory will look like this, with some example files:

Example Parent Directory
- data
  - data_raw
    - data_raw_example_dataset_animals.csv
    - data_raw_example_dataset_plants.csv
    - etc…
  - data_wrangling
    - 01a_data_wrangling_merging_plant_animal_data.R
    - 01b_data_wrangling_changing_to_tidy_format.R
  - data_output
    - output_data_01a_merged_plant_animal.Rdata
    - output_data_01b_tidy_plant_animal.csv
- analyses
  - 02a_analyses_plant_animal_linear_models.R
  - analyses_output
    - output_analyses_02a_plant_animal_model_results.Rdata
- figures
  - figures_wrangling
    - 03a_figure1_plant_animal.R
  - figures_output
    - output_figure_03a_plant_animal.svg
  - figures_edited
    - figure_03a_figure1_final.png
- docs
  - 2023-09-12-denver-manuscript-outline.doc
  - 2023-10-28-denver-manuscript-outline.doc
- incubator

You’ll notice that when I name files, I follow a format using numbers that match the directory (e.g. scripts beginning with 01 belong to data, 02 to analyses, etc.) and then I add letters to sequence within those numbers (e.g. 01a is the first script in the data wrangling sequence, 01b is second, etc.). This is really handy because you can then name your output to match this sequencing. If I have many outputs from a script, I just name them all using the same prefix, but they get unique descriptions (e.g. output_data_01a_mergedgenotypes.Rdata; output_data_01a_clean_site_names.Rdata). I tend to save data as an Rdata object if I’m only ever going to use it in R and as a csv or some other format if I might be exporting it out of R. I’m not very consistent there, but because we’ve named our files in a traceable manner, we can always go back to the source script and change our minds!
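
Here is a rough sketch of how outputs can carry the prefix of the script that made them. The helper make_output_name() is hypothetical, the data frame is made up, and the files are written to a temporary directory rather than a real data_output folder.

```r
# Hypothetical helper: build an output filename that carries the
# prefix of the script that produced it (e.g. "01a").
make_output_name <- function(stage, script_id, description, ext) {
  paste0("output_", stage, "_", script_id, "_", description, ".", ext)
}

# Made-up data standing in for the result of a wrangling script
df_merged_plant_animal <- data.frame(
  site = c("park_a", "park_b"),
  n_plant_species = c(12, 7),
  n_animal_species = c(4, 9)
)

# In a real project this would be data/data_output
output_dir <- tempdir()

# Object that will only ever be used in R: save as .Rdata
rdata_path <- file.path(
  output_dir,
  make_output_name("data", "01a", "merged_plant_animal", "Rdata")
)
save(df_merged_plant_animal, file = rdata_path)

# Data that might leave R: save as .csv
csv_path <- file.path(
  output_dir,
  make_output_name("data", "01b", "tidy_plant_animal", "csv")
)
write.csv(df_merged_plant_animal, csv_path, row.names = FALSE)

basename(c(rdata_path, csv_path))
```

Because the script ID is baked into the filename, anyone opening data_output can tell which script to reread.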

Using this naming format, you’ll also be able to, if you prefer, keep a more streamlined directory structure. Some people prefer to have all of their scripts together in one directory. If you do this, naming your scripts 01a, 01b, etc and then 02a, 02b, etc still keeps them organized even in one giant pool. I still would recommend subdirectories for outputs if you go this route.

I sometimes go back and forth on whether I name outputs as e.g. output_data_01a_... (as above) or 01a_output_data_.... It doesn’t really matter so long as you’re consistent. I generally prefer to put output first so I know it’s an output file and not a script and reserve numbers first only for scripts.

Sometimes I make slight divergences and will do something like 02a01 and 02a02 when I realize I need to break something into parts but want to keep all the other scripts in order. Again, I think this is fine so long as you track stuff properly; it keeps things in order. Also, if there is a big switch in what’s happening between stages of analysis, I’ll sometimes switch from calling everything in analysis 02a, 02b, etc. to 02a, 02b, then 03a, 03b, etc. I might do this if there’s a big jump in the data format or whatever. Then the figures would be 04a and so forth. The main objective is preserved: everything is sortable quickly by name.
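
A quick way to convince yourself that split scripts still sort sensibly is to sort a handful of names. The filenames below are invented for illustration. I use method = "radix" so that the ordering does not depend on your computer’s locale; your file browser may order edge cases a little differently.

```r
# Invented script names, deliberately out of order.
# 02a was split into 02a01 and 02a02; 02b was left alone.
v_script_names <- c(
  "03a_figure1_plant_animal.R",
  "02a02_linear_models_part2.R",
  "01b_data_wrangling_changing_to_tidy_format.R",
  "02b_model_diagnostics.R",
  "01a_data_wrangling_merging_plant_animal_data.R",
  "02a01_linear_models_part1.R"
)

# method = "radix" sorts character vectors in the C locale (byte order)
v_script_names_sorted <- sort(v_script_names, method = "radix")
v_script_names_sorted
```

The two split scripts land between the 01 scripts and 02b, which is the running order you want.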

Lastly, notice that I named my documents using year-month-day format to start their filename. Some people make better use of version control programs to ensure they have prior versions of files available, but I’m usually not good about that. So, instead, I name things with the date first and that usually ensures that I can efficiently sort and find the most recent version of my document even if it’s been several months since I’ve worked on it.
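
The reason year-month-day works is that sorting the names as plain text also sorts them by date. Here is a small sketch using the two document names from the example directory; the helper date_stamped_name() is hypothetical.

```r
# Document names from the example directory, listed out of order
v_doc_files <- c(
  "2023-10-28-denver-manuscript-outline.doc",
  "2023-09-12-denver-manuscript-outline.doc"
)

# Sorting as text is the same as sorting by date with this format
v_doc_files_sorted <- sort(v_doc_files, method = "radix")
latest_doc <- tail(v_doc_files_sorted, 1)
latest_doc

# Hypothetical helper: prefix a filename with a date (today by default)
date_stamped_name <- function(description, ext, date = Sys.Date()) {
  paste0(format(date, "%Y-%m-%d"), "-", description, ".", ext)
}

date_stamped_name("denver-manuscript-outline", "doc", as.Date("2023-10-28"))
```

Formats like day-month-year or month names do not have this property, so they sort out of order.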

Thoughts on Best Practices for Naming Objects in R

As above, I do not shy away from long-ish names for objects in R because there are no character limits and we can use tab complete.

Personally, I like to name objects with a clear reference to their object class (i.e. dataframe, vector, matrix, model output, etc).

So if I have a vector of bee species names, I’ll name it v_bee_species. If I have a dataframe of site data from Denver Parks, I’ll name it df_denver_site_data etc.

When I have filtered or otherwise wrangled data, I’ll name it explicitly, too. e.g. df_flt_denver_site_data.
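
In code, that convention might look something like this. The values and column names are made up purely for illustration.

```r
# v_ prefix: a vector (made-up example values)
v_bee_species <- c("Bombus huntii", "Apis mellifera", "Agapostemon virescens")

# df_ prefix: a data frame (made-up site data)
df_denver_site_data <- data.frame(
  site = c("park_a", "park_b", "park_c"),
  area_ha = c(12.5, 3.2, 40.1)
)

# df_flt_ prefix: a filtered version of the data frame above.
# The original object is left untouched, so both stay available.
df_flt_denver_site_data <-
  df_denver_site_data[df_denver_site_data$area_ha > 10, ]

df_flt_denver_site_data
```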

Because you can tab-complete these names, I think the small inconvenience of typing them is outweighed by the safety and efficiency they add.

Thoughts on Organizing Code Within Scripts

I like to organize all scripts in a similar manner and make use of the code sections feature in RStudio (on a Mac, press Cmd+Shift+R to insert a section).

I typically have the following sections (see example script):

A title for the script
Description of the script’s purpose
SECTION: PACKAGES
SECTION: DATA
SECTION(S): various sections of what I’m doing in that script
SECTION: OUTPUT
SECTION: OLD CODE
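
As a rough illustration of that layout (not the example script itself), a skeleton might look like the sketch below. RStudio treats a comment line ending in four or more dashes as a section. The data are made up and created inline so the sketch runs on its own; a real script would read files from data_raw and write to data_output rather than a temporary directory.

```r
# 01a_data_wrangling_merging_plant_animal_data.R
# Purpose: merge the plant and animal datasets by site.

# PACKAGES ----
# Base R only in this sketch; library() calls would go here.

# DATA ----
# Made-up stand-ins for the raw files in data/data_raw
df_plants <- data.frame(
  site = c("park_a", "park_b"),
  n_plant_species = c(12, 7)
)
df_animals <- data.frame(
  site = c("park_a", "park_b"),
  n_animal_species = c(4, 9)
)

# MERGE PLANT AND ANIMAL DATA ----
df_merged_plant_animal <- merge(df_plants, df_animals, by = "site")

# OUTPUT ----
# Temporary directory used here; normally data/data_output
output_path <- file.path(
  tempdir(),
  "output_data_01a_merged_plant_animal.Rdata"
)
save(df_merged_plant_animal, file = output_path)

# OLD CODE ----
# Retired attempts are kept here, commented out, e.g.:
# df_merged_plant_animal <- cbind(df_plants, df_animals)
```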

Final Thoughts on Code Organization/Style

There are many style guides for coding in R. Look them up and choose one you like. I think the important thing is to be consistent and choose something relatively clean. I kind of use a hybrid of styles I have seen over time. It’s probably not great but it works and is obvious to someone reviewing my code.