Lesson 5: File and Folder Organization

Learning Goals

Understand where different files should be stored.
Understand how files should be named.
Understand and apply best practices for organizing, documenting, and sharing research data.
Develop habits that make code, datasets, and project materials reproducible, accessible, and collaborative.

Data Storage Workflows

In the Wildfire Water Security project, we primarily use two tools to store and share files. Each has its benefits and limitations:

Box

Pros: Good for big files, real-time Office document collaboration, easy to use
Cons: Clunky backups, limited version history, confusing shared folders

Git/GitHub

Pros: Full version history, easy backups, easy collaboration on non-binary files
Cons: Steep learning curve, can’t store large files

Both tools are preferable to an organization-specific network drive because they:

allow easy collaboration across organizations

automatically back up work and save version history

I have a file. Where does it go?

Generally, the following rules apply:

Box: Large files, collaborative Office files, files not related to a specific research project
GitHub: Project files, code, manuscript files (other than the manuscript text itself)

If you’re unsure where to put a file, follow the flow chart below:

Where to NOT put files:

Network drive: even keeping shared GitHub repositories on the network drive can cause issues. Keep your own copies of files on a local drive (for example, C:).
OneDrive
Sharepoint
Google Drive
Dropbox

Data belongs to the project, not individuals. Store it where the team can find it.

Activity:

Where should the following files be stored?

Word document containing text for a manuscript
- Box in 02_Nodes/your node/Publications_Presentations/Manuscripts/folder for your paper
Meeting notes from Node 1
- Box in 02_Nodes/01_Empirical/01_Meetings
Exploratory figure associated with the Bedrock project
- GitHub repo: WWS-Node1-BDRK-bedrock-microbes/figures/exploratory
Large dataset associated with the Sonde project
- GitHub repo: WWS-Node1-SONDE-postfire-sonde-network/data
SOP for filtering water samples
- GitHub repo: WWS-standard-methods/filtering
Figure for a manuscript
- GitHub repo for project, on branch for associated manuscript in figure folder

Folder and File Naming Conventions

Keeping your files and folders organized makes it easier for everyone on the team to find what they need and avoids confusion down the road.

Folder Organization

Git Repo First
Your top-level project folder (root directory) should be a GitHub repository. This ensures you have version control and backups from the start.
Limit the Number of Folders in the Root
Aim for fewer than 10 top-level folders — for example: data/, code/, figures/, methods/
Use Nested Folders for Subcategories
For example inside data/:
- raw-data/
- processed-data/
- metadata/
Avoid Spaces & Special Characters
Use - or _ instead of spaces. Avoid characters like .:*?"<>|[]&$.
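Descriptive Names
Name folders so someone unfamiliar with your project can still guess what’s inside.
Organize by Date (if needed)
Write dates as YYYY-MM-DD so that folders sort chronologically.
Force Folder Order with Numbers
Prefix folder names with numbers (for example 01_, 02_) to control the order in which they are listed.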
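
As a rough illustration of the folder layout described above, the sketch below builds a small project skeleton in R. The project name is hypothetical, and a temporary directory stands in for the root of a cloned GitHub repository.

```r
# Sketch: create a project skeleton that follows the folder guidelines.
# "WWS-TEST-demo-project" is a hypothetical name; tempdir() stands in for
# the folder where the GitHub repository would be cloned.
project_root <- file.path(tempdir(), "WWS-TEST-demo-project")

folders <- c(
  "data/raw-data",
  "data/processed-data",
  "data/metadata",
  "code",
  "figures",
  "methods"
)

# recursive = TRUE creates the parent folder (e.g. data/) with the subfolder
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders to confirm the root stays small
top_level <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
print(top_level)
```

Keeping the folder list in one vector makes it easy to reuse the same layout for the next project.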

File Organization

Avoid Spaces & Special Characters
Use - or _ instead of spaces. Avoid characters like .:*?"<>|[]&$ (apart from the single period before the file extension).
Be Concise but Descriptive
- DON’T: use words like “the” or “and”
- DO: use standard abbreviations and keywords
Self-Contained File Names – name files so they still make sense outside the folder.
Let Git (or Box) Handle Versions – don’t add dates, initials, or “final_v2” to file names.
Note: Do use dates and initials when emailing files or working outside Git and Box.
No Duplicate Files – edit the original and commit often instead of making copies.
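
Some of these naming rules can be checked automatically. The sketch below is one possible helper, not an official project tool: the function name check_path and its patterns are assumptions, and the date and version checks are rough heuristics that can miss cases or raise false alarms. Both example paths are made up.

```r
# Sketch: flag common naming problems in a relative file path.
# check_path() is a hypothetical helper; its patterns are heuristics.
check_path <- function(path) {
  problems <- character(0)

  if (grepl(" ", path, fixed = TRUE)) {
    problems <- c(problems, "spaces in path")
  }
  # Characters listed in the guidelines (periods are left out here because
  # the file extension needs one)
  if (grepl("[\\[\\]:*?\"<>|&$]", path, perl = TRUE)) {
    problems <- c(problems, "special characters")
  }
  # Heuristic: date-like runs such as 08-12-2025 or 25_12_01
  if (grepl("[0-9]{2,4}[-_][0-9]{2}[-_][0-9]{2,4}", path)) {
    problems <- c(problems, "date in name")
  }
  # Heuristic: manual version markers such as "final", "v2", or "(2)"
  if (grepl("final|v[0-9]+|\\([0-9]+\\)", path, ignore.case = TRUE)) {
    problems <- c(problems, "manual version marker")
  }
  problems
}

# Made-up example paths
bad_path <- "field data/sensor & probe_final.csv"
good_path <- "data/raw-data/sonde-temperature.csv"

print(check_path(bad_path))
print(check_path(good_path))
```

A check like this cannot judge whether a name is descriptive, so it complements rather than replaces a quick read of the file list.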

Activity:

What is wrong with these file paths?

sonde & other instrument data/08-12-2025 data_JS.csv
- special characters
- spaces in path
- initials used for versioning
- non-descriptive name

SWAT-modeling/final map (2).png
- spaces in path
- non-nested file structure
- multiple copies of a single file
- non-descriptive name

Aqualog/methods/running EEM's analysis SOP_25_12_01.docx
- special characters
- spaces in path
- dates used for versioning

Special File Types

1. Geospatial Data

Store .shp files and other ‘multi-file’ geospatial layers within their own folder
To share these files:
- zip the files together before sharing
- upload to the Data Sharing folder in Box
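
The zipping step can also be done from R. The sketch below gathers every file that shares a shapefile's base name and bundles them with utils::zip, which relies on a zip program being installed on the computer, so that call is skipped when none is found. The helper name shapefile_parts and the "watershed" layer are hypothetical, and empty placeholder files stand in for a real layer.

```r
# Sketch: collect all parts of a shapefile layer and zip them together.
# shapefile_parts() is a hypothetical helper.
shapefile_parts <- function(shp_path) {
  base <- tools::file_path_sans_ext(basename(shp_path))
  candidates <- list.files(dirname(shp_path), full.names = TRUE)
  # Keep files named <base>.<something>, e.g. watershed.shp, watershed.dbf
  candidates[startsWith(basename(candidates), paste0(base, "."))]
}

# Placeholder layer in its own folder (empty files stand in for real data)
layer_dir <- file.path(tempdir(), "watershed-boundary")
dir.create(layer_dir, showWarnings = FALSE)
file.create(file.path(
  layer_dir,
  c("watershed.shp", "watershed.shx", "watershed.dbf", "watershed.prj", "notes.txt")
))

parts <- shapefile_parts(file.path(layer_dir, "watershed.shp"))
print(basename(parts))

# utils::zip() calls an external zip program, so only run it if one exists
zip_path <- file.path(tempdir(), "watershed.zip")
if (nzchar(Sys.which("zip"))) {
  # -j stores the files without their folder path
  utils::zip(zip_path, files = parts, flags = "-j9X")
}
```

The resulting zip file is what would then be uploaded to the Data Sharing folder in Box.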

2. Publicly Available Datasets

Maintain Metadata

Immediately after downloading a dataset, create a readme.txt file which lists at a minimum:
- When the file was downloaded
- Where the file was downloaded from: the link and the owner in case the link breaks
- A short description of the dataset
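
One way to make the readme a habit is to write it from the same script that handles the download. The sketch below is a minimal example: the link, owner, and description are placeholders rather than a real dataset, and a temporary directory stands in for the project's data folder.

```r
# Sketch: write a minimal readme.txt next to a downloaded dataset.
# All dataset details below are hypothetical placeholders.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

readme_lines <- c(
  paste("Downloaded:", format(Sys.Date(), "%Y-%m-%d")),
  "Source link: https://example.org/streamflow/daily-summaries.csv",
  "Source owner: Example Water Agency",
  "Description: Daily streamflow summaries, one row per site per day."
)

readme_path <- file.path(data_dir, "readme.txt")
writeLines(readme_lines, readme_path)

# Show what was written
cat(readLines(readme_path), sep = "\n")
```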

Directly load when possible

For smaller datasets, read the data directly in your code to preserve file provenance:
- You can read many types of files (.csv, .txt, .xlsx) directly into R using the link instead of a file path
- If you can’t read in directly, consider downloading to a temporary directory and then loading in
  - tmp <- tempfile(fileext = ".csv")
  - download.file(url, destfile = tmp)
- There are many R packages which allow direct access to useful data:
  - dataRetrieval: USGS and Water Quality Portal data
  - FedData: Land cover database, SSURGO soil data, Daymet meteorological data
  - nhdplusTools: Stream and HUC layers
  - elevatr: DEM layers
  - climateR: Many different kinds of gridded climate data
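
The two loading approaches above might look like the sketch below. So that it runs without an internet connection, a small local file addressed with a file:// link stands in for the online dataset; in real use data_url would be the dataset's https:// link. The column names and file names are made up.

```r
# Stand-in for an online dataset so the sketch runs offline.
source_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(date = c("2024-01-01", "2024-01-02"), flow_cfs = c(12.5, 14.1)),
  source_file,
  row.names = FALSE
)
# In real use this would be the dataset's https:// link
data_url <- paste0("file://", normalizePath(source_file, winslash = "/"))

# Option 1: read straight from the link, with no local copy kept
flow_direct <- read.csv(data_url)

# Option 2: download to the temporary directory, then read the copy.
# destfile must be a file path, not just the tempdir() folder.
tmp_file <- file.path(tempdir(), "daily-streamflow.csv")
download.file(data_url, destfile = tmp_file, quiet = TRUE)
flow_downloaded <- read.csv(tmp_file)

# Both routes should give the same table
print(identical(flow_direct, flow_downloaded))
```

Either way, the link stays in the script, so anyone rerunning the code can see exactly where the data came from.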

Activity

Download the daily streamflow summaries data set. Place it in the data folder of the test repository.
- Create an appropriate readme.txt to accompany the data set.
Open up the script code/load-data-nicely.R from the test repository
- Example 1: Download the daily streamflow summaries using R.
- Example 2: Download the same file to your temporary folder first, then read it in.
- Example 3: Download data for Lookout Creek at HJ Andrews using dataRetrieval and make a plot to save.

3. Very Large Files (>1 GB)

Avoid storing very large files on Box unless needed. Instead:

Create a subset of the data to work with
- This should be stored in GitHub or Box (depending on size)
Include information or code detailing:
- where to get data
- how to subset data
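
A subsetting script could follow the pattern sketched below. A tiny made-up table stands in for the very large file, and the site names, year cutoff, and output file name are all hypothetical. The sketch reads the whole file into memory, which assumes the computer can hold it.

```r
# Stand-in for a very large file; in practice this is the full dataset
# obtained from its original source (record that source alongside the code).
full_path <- tempfile(fileext = ".csv")
full_data <- data.frame(
  site = rep(c("site-a", "site-b", "site-c"), each = 4),
  year = rep(2020:2023, times = 3),
  value = seq_len(12)
)
write.csv(full_data, full_path, row.names = FALSE)

# Subsetting rules kept as named values so the steps are documented in code
keep_sites <- c("site-a", "site-c")
first_year <- 2022

full <- read.csv(full_path)
subset_data <- full[full$site %in% keep_sites & full$year >= first_year, ]

# Save the smaller working copy; this is the file to store in GitHub or Box
subset_path <- file.path(tempdir(), "example-subset.csv")
write.csv(subset_data, subset_path, row.names = FALSE)
print(nrow(subset_data))
```

Because the rules live in the script, anyone with access to the original source can rebuild the same subset.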

Activity

From the test repository open code/subsetting-vlarge-data.R
Edit the code to perform the specified subsetting steps
Save the file to the data folder in the test repository

4. Code

The goals for code we write are that it be:

Replicable: Anyone should be able to pick up the code and have it run
- Use R Projects, which automatically set the working directory to the project folder so file paths work for anyone who opens the project
- Use relative file paths, which specify the location of a file relative to the project directory
Understandable: Use comments to describe what the code is doing so that both you (later on) and others can follow it
Organized: Create functions and loops to avoid repeating the same code over and over again
Flexible: Avoid ‘hard-coding’ (manually specifying) values, as these are easy to overlook if your data changes
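
The short sketch below tries to show these four goals in one place. Everything in it is made up for illustration: the file path, the temperature threshold, the function summarise_site, and the small data frame that stands in for a file read from the project's data folder.

```r
# Replicable: a relative path, resolved from the project folder once the
# R Project is open (hypothetical file, not read in this sketch)
data_path <- file.path("data", "processed-data", "sonde-daily.csv")

# Flexible: name the threshold once instead of typing 25 throughout the script
max_temp_c <- 25

# Organized: one function instead of a copy-pasted block per site
summarise_site <- function(df, site_name, threshold) {
  site_rows <- df[df$site == site_name, ]
  data.frame(
    site = site_name,
    mean_temp_c = mean(site_rows$temp_c),
    n_above = sum(site_rows$temp_c > threshold)
  )
}

# Small made-up data frame standing in for read.csv(data_path)
sonde <- data.frame(
  site = c("upper", "upper", "lower", "lower"),
  temp_c = c(18, 27, 22, 24)
)

# Loop over whatever sites are in the data rather than listing them by hand
site_summaries <- do.call(
  rbind,
  lapply(unique(sonde$site), summarise_site, df = sonde, threshold = max_temp_c)
)
print(site_summaries)
```

If a new site appears in the data or the threshold changes, only one line needs editing.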

Activity

Open up the messy-code.R script from WWS-TEST-example-repo/code
Fix the code so it runs (hint: check the file paths)
Add comments to explain what the code is doing
Is there a way to do the same thing with fewer lines of code?

5. Manuscripts

To avoid cluttering project repositories, create a new branch in the project repository
Name the branch: lastname-manuscript-year
All work should be stored on this branch
Keep files organized for future data package:
- Data
  - input
  - output
- Figures
  - exploratory
  - manuscript
- Code
If you’re co-writing, store the manuscript text on Box
- 02_Nodes/your node/Publications_Presentations/Manuscripts
- Add a link to this folder in the repository README so anyone working on the project can find the manuscript.

6. Standard Methods

Standard methods (SOPs and QA/QC scripts) are valuable files, allowing consistency and knowledge transfer across the project.

Don’t hide them within project folders.

Store in the standard-methods GitHub repository
Follow directions in the README for where to store files.

Activity

Clone the standard-methods repository to your local computer
Go find an SOP or QA/QC script for one of your projects
Place it in the correct location in the repository
Push your changes
Check to ensure your file is now on GitHub

7. Analytical Data

Keeping detailed sample records is critical to ensuring high-quality data.

Note method deviations
- can help explain outliers during analysis
- important in publishing high-quality data
Keep track of processing steps and storage locations so samples aren’t lost

To do this, create a copy of the sample-tracking.xlsx spreadsheet for your project.

Feel free to add columns as needed to keep detailed records
Store in Box within project folder so multiple people can edit
- Place link in GitHub project README

Activity

Open up the example-sample-tracking.xlsx.
Add a new sample with your name (e.g., Katie01).
Make the bottle number your favorite number.
Set the storage location of the sample as your office number.
Mark that it was filtered with a 0.22 µm filter.
Make a note that something happened to the sample (e.g., your dog ate it; the lab elves drank half).
