Learning Goals

  1. Understand where different files should be stored.
  2. Understand how files should be named.
  3. Understand and apply best practices for organizing, documenting, and sharing research data.
  4. Develop habits that make code, datasets, and project materials reproducible, accessible, and collaborative.

Data Storage Workflows

In the Wildfire Water Security project, we primarily use two tools to store and share files, each with its own benefits and limitations:

Box

Git/GitHub

Both tools are preferable to an organization-specific network drive because they:

  • allow easy collaboration across organizations

  • automatically back up work and save version history

I have a file. Where does it go?

Generally, the following rules apply:

  • Box: Large files, collaborative Office files, files not related to a specific research project

  • GitHub: Project files, code, manuscript files (other than the manuscript document itself)

If you’re unsure of where to put a file you can follow the flow chart below:

Where to NOT put files:

  • Network drive: even keeping shared GitHub repositories on a network drive can cause issues. Keep your own copies of files on a local drive (C:)
  • OneDrive
  • SharePoint
  • Google Drive
  • Dropbox

Data belongs to the project, not to individuals. Store it where the team can find and use it.

Activity:

Where should the following files be stored?

  1. Word document containing text for a manuscript

    • Box in 02_Nodes/your node/Publications_Presentations/Manuscripts/folder for your paper
  2. Meeting notes from Node 1

    • Box in 02_Nodes/01_Empirical/01_Meetings
  3. Exploratory figure associated with the Bedrock project

    • GitHub repo: WWS-Node1-BDRK-bedrock-microbes/figures/exploratory
  4. Large dataset associated with the Sonde project

    • GitHub repo: WWS-Node1-SONDE-postfire-sonde-network/data
  5. SOP for filtering water samples

    • GitHub repo: WWS-standard-methods/filtering
  6. Figure for a manuscript

    • GitHub repo for the project, on the branch for the associated manuscript, in the figures folder

Folder and File Naming Conventions

Keeping your files and folders organized makes it easier for everyone on the team to find what they need and avoids confusion down the road.

Folder Organization

  1. Git Repo First
    Your top-level project folder (root directory) should be a GitHub repository. This ensures you have version control and backups from the start.

  2. Limit the Number of Folders in the Root
    Aim for fewer than 10 top-level folders — for example: data/, code/, figures/, methods/

  3. Use Nested Folders for Subcategories
    For example inside data/:

    • raw-data/
    • processed-data/
    • metadata/
  4. Avoid Spaces & Special Characters
    Use - or _ instead of spaces. Avoid characters like .:*?"<>|[]&$.

  5. Descriptive Names
    Name folders so someone unfamiliar with your project can still guess what’s inside.

  6. Organize by Date (if needed)

  7. Force Folder Order with Numbers
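
As a rough sketch of the layout described above, the following R snippet builds the example folders under a temporary directory. The project name WWS-demo-project is a placeholder, and in a real project the root would be your cloned GitHub repository rather than tempdir().

```r
# Placeholder project root; in practice this is the cloned GitHub repository
project_root <- file.path(tempdir(), "WWS-demo-project")

# Top-level folders plus the nested data/ subfolders from the list above
folders <- c(
  file.path("data", c("raw-data", "processed-data", "metadata")),
  "code",
  "figures",
  "methods"
)

# Create each folder; recursive = TRUE also creates missing parent folders
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders that now exist in the project root
top_level <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
print(top_level)
```

Git does not track empty folders, so a folder only shows up on GitHub once it contains at least one file.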

File Organization

  1. Avoid Spaces & Special Characters
    Use - or _ instead of spaces. Avoid characters like .:*?"<>|[]&$.

  2. Be Concise but Descriptive

    • DON’T: use words like “the” or “and”

    • DO: use standard abbreviations and keywords

  3. Self-Contained File Names – name files so they still make sense outside the folder.

  4. Let Git (or Box) Handle Versions – don’t add dates, initials, or “final_v2” to file names.

    Note: Do use dates and initials when emailing files or working outside Git and Box.

  5. No Duplicate Files – edit the original and commit often instead of making copies.
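
One way to check names against these conventions is a small helper like the one below. It is only a sketch: the function name check_path_name is hypothetical, and it assumes a stricter rule than the list above, treating anything other than letters, digits, - and _ as a special character.

```r
# Hypothetical helper: flags spaces and special characters in a path.
check_path_name <- function(path) {
  # Split the path into its folder and file components
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]

  # Drop the file extension so the dot before it is not flagged
  stems <- tools::file_path_sans_ext(parts)

  problems <- character(0)

  if (any(grepl(" ", parts, fixed = TRUE))) {
    problems <- c(problems, "spaces in path")
  }

  # Assumption: only letters, digits, hyphens, and underscores are allowed.
  # Spaces are removed first because they are reported separately above.
  stems_no_spaces <- gsub(" ", "", stems, fixed = TRUE)
  if (any(grepl("[^A-Za-z0-9_-]", stems_no_spaces))) {
    problems <- c(problems, "special characters")
  }

  problems
}

check_path_name("data/raw-data/sonde-turbidity.csv")
# returns character(0)

check_path_name("sonde & other instrument data/08-12-2025 data_JS.csv")
# returns c("spaces in path", "special characters")
```

The helper cannot judge whether a name is descriptive or whether dates and initials are being used for versioning; those still need a human check.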

Activity:

What is wrong with these file paths?

  1. sonde & other instrument data/08-12-2025 data_JS.csv

    - special characters

    - spaces in path

    - initials used for versioning

    - non-descriptive name

  2. SWAT-modeling/final map (2).png

    - spaces in path

    - non-nested file structure

    - multiple copies of a single file

    - non-descriptive name

  3. Aqualog/methods/running EEM's analysis SOP_25_12_01.docx

    - special characters

    - spaces in path

    - dates used for versioning


Special File Types

1. Geospatial Data

  • Store .shp files and other ‘multi-file’ geospatial layers within their own folder

  • To share these files:

    • zip the files together before sharing

    • upload to the Data Sharing folder in Box

2. Publicly Available Datasets

Maintain Metadata

  • Immediately after downloading a dataset, create a readme.txt file which lists at a minimum:

    • When the file was downloaded

    • Where the file was downloaded from: the link and the owner in case the link breaks

    • A short description of the dataset
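
The readme can also be written from R at the moment of download, so the date is recorded automatically. The sketch below assumes a hypothetical helper named write_dataset_readme; the URL, owner, and description are placeholders, not a real data source.

```r
# Hypothetical helper: writes a minimal readme.txt next to a downloaded dataset.
write_dataset_readme <- function(folder, source_url, source_owner, description,
                                 downloaded_on = Sys.Date()) {
  readme_path <- file.path(folder, "readme.txt")
  writeLines(
    c(
      paste("Downloaded on:", format(downloaded_on, "%Y-%m-%d")),
      paste("Source link:", source_url),
      paste("Source owner:", source_owner),
      paste("Description:", description)
    ),
    con = readme_path
  )
  # Return the path invisibly so it can be reused without printing
  invisible(readme_path)
}

# Usage with placeholder values, written to a temporary folder
demo_folder <- file.path(tempdir(), "readme-demo")
dir.create(demo_folder, showWarnings = FALSE)

readme_path <- write_dataset_readme(
  folder = demo_folder,
  source_url = "https://example.org/daily-streamflow.csv",
  source_owner = "Example data provider",
  description = "Daily streamflow summaries (placeholder description)"
)

readme_lines <- readLines(readme_path)
print(readme_lines)
```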

Directly load when possible

  • For smaller datasets, read directly into the code to preserve file provenance:

    • You can read many types of files (.csv, .txt, .xlsx) directly into R by passing the link instead of a file path

    • If you can’t read a file in directly, consider downloading it to a temporary directory and then loading it

      • download.file(url, destfile = file.path(tempdir(), basename(url)))
    • There are many R packages which allow direct access to useful data:

      • dataRetrieval: USGS and Water Quality Portal data

      • FedData: Land cover database, SSURGO soil data, Daymet meteorological data

      • nhdplusTools: Stream and HUC layers

      • elevatr: DEM layers

      • climateR: Many different kinds of gridded climate data
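
The two base-R loading patterns can be sketched as follows. So that the example runs offline, it builds a small stand-in CSV and points a file:// link at it; in practice data_url would be the https:// link to the published dataset. The column names and values are made up for illustration.

```r
# Stand-in for an online dataset (assumption: the real file is a plain CSV)
local_copy <- tempfile(fileext = ".csv")
write.csv(
  data.frame(date = c("2025-01-01", "2025-01-02"), flow_cfs = c(12.5, 14.1)),
  local_copy,
  row.names = FALSE
)
data_url <- paste0("file://", normalizePath(local_copy, winslash = "/"))

# Option 1: read straight from the link, so the script records where the data came from
flow_direct <- read.csv(data_url)

# Option 2: download to a file inside the temporary directory, then read it.
# destfile must be a file path, not just a directory.
temp_file <- file.path(tempdir(), "daily-streamflow-demo.csv")
download.file(data_url, destfile = temp_file, mode = "wb", quiet = TRUE)
flow_downloaded <- read.csv(temp_file)

# Both routes give the same table
print(identical(flow_direct, flow_downloaded))
```

Files in tempdir() are removed when the R session ends, so nothing extra is left in the repository.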

Activity

  1. Download the daily streamflow summaries data set. Place it in the data folder of the test repository.
    • Create an appropriate readme.txt to accompany the data set.
  2. Open up the script code/load-data-nicely.R from the test repository
    • Example 1: Download the daily streamflow summaries using R.
    • Example 2: Download the same file to your temporary folder first, then read it in.
    • Example 3: Download data for Lookout Creek at HJ Andrews using dataRetrieval and make a plot to save.

3. Very Large Files (>1 GB)

Avoid storing very large files on Box unless needed. Instead:

  • Create a subset of the data to work with

    • This should be stored in GitHub or Box (depending on size)
  • Include information or code detailing:

    • where to get data

    • how to subset data
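
A subsetting script might look roughly like the sketch below. It uses a tiny made-up data frame in place of the real large file, and the source link, site names, and year cutoff are all placeholders. For files too large to read into memory at once, a chunked or database-backed approach would be needed instead.

```r
# Where to get the full dataset (placeholder link, not a real source)
full_data_source <- "https://example.org/full-dataset.csv"

# Stand-in for the full dataset so the sketch runs without the real file
full_data <- data.frame(
  site = rep(c("A", "B", "C"), each = 4),
  year = rep(2020:2023, times = 3),
  value = seq_len(12)
)

# How to subset: keep the rules in named variables so they are easy to find and change
sites_to_keep <- c("A", "C")
first_year <- 2022

subset_data <- full_data[
  full_data$site %in% sites_to_keep & full_data$year >= first_year,
]

# Save the subset; in a project this would go in the data folder of the repository
subset_path <- file.path(tempdir(), "subset-demo.csv")
write.csv(subset_data, subset_path, row.names = FALSE)

print(nrow(subset_data))
```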

Activity

  1. From the test repository open code/subsetting-vlarge-data.R
  2. Edit the code to perform the specified subsetting steps
  3. Save the file to the data folder in the test repository

4. Code

The goals for code we write are that it be:

  1. Replicable: Anyone should be able to pick up the code and have it run

    • Use R Projects, which automatically set the working directory to the project folder so file paths work for anyone who opens the project

    • Use relative file paths, which specify the location of a file relative to the project directory

  2. Understandable: Use comments to describe what the code is doing so that both future you and others can follow it

  3. Organized: Create functions and loops to avoid repeating the same code over and over again

  4. Flexible: Avoid ‘hard-coding’ or manually specifying values as these can be easy to overlook if your data changes
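
A small sketch of these four goals together is shown below. The file path, site names, and measurements are invented for illustration, and summarize_site is a hypothetical function name.

```r
# Replicable: a relative path built with file.path(), which resolves from the
# project folder when the script is run inside an R Project (path is illustrative)
sonde_path <- file.path("data", "raw-data", "sonde-demo.csv")

# Stand-in measurements so the sketch runs without that file
site_data <- list(
  upstream = c(2.1, 2.4, NA, 3.0),
  downstream = c(4.2, 3.9, 4.5, NA)
)

# Organized: one function instead of copy-pasted code for each site
summarize_site <- function(values) {
  c(
    mean = mean(values, na.rm = TRUE),  # average of the non-missing values
    n_missing = sum(is.na(values))      # number of missing values
  )
}

# Flexible: apply the function to whatever sites are present, rather than
# hard-coding site names or the number of sites
site_summary <- sapply(site_data, summarize_site)
print(site_summary)
```

If a third site is added to site_data, the summary picks it up without any other change to the code.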

Activity

  1. Open up the messy-code.R script from WWS-TEST-example-repo/code
  2. Fix the code so it runs (hint: check the file paths)
  3. Add comments to explain what the code is doing
  4. Is there a way to do the same thing with fewer lines of code?

5. Manuscripts

  • To avoid cluttering project repositories, create a new branch in the project repository

  • Name the branch: lastname-manuscript-year

  • All work should be stored on this branch

  • Keep files organized for a future data package:

    • Data
      • input

      • output

    • Figures
      • exploratory
      • manuscript
    • Code
  • If you’re co-writing, store the manuscript text on Box

    • 02_Nodes/your node/Publications_Presentations/Manuscripts
    • Add this link to the repository README so anyone working on the project can find the manuscript.

6. Standard Methods

Standard methods (SOPs and QA/QC scripts) are valuable files, allowing consistency and knowledge transfer across the project.

Don’t hide them within project folders.

  • Store in the standard-methods GitHub repository

  • Follow directions in the README for where to store files.

Activity

  1. Clone the standard-methods repository to your local computer
  2. Go find an SOP or QA/QC script for one of your projects
  3. Place it in the correct location in the repository
  4. Push your changes
  5. Check to ensure your file is now on GitHub

7. Analytical Data

Keeping detailed sample records is critical to ensure high-quality data.

  • Note method deviations

    • can help explain outliers during analysis

    • important in publishing high-quality data

  • Keep track of processing steps and storage locations so samples aren’t lost

To do this, create a copy of the sample-tracking.xlsx spreadsheet for your project.

  • Feel free to add columns as needed to keep detailed records

  • Store in Box within project folder so multiple people can edit

    • Place link in GitHub project README

Activity

  1. Open up the example-sample-tracking.xlsx.
  2. Add a new sample with your name (e.g., Katie01).
  3. Make the bottle number your favorite number.
  4. Set the storage location of the sample as your office number.
  5. Mark that it was filtered with a 0.22 µm filter.
  6. Make a note that something happened to the sample (e.g., your dog ate it; the lab elves drank half).