Lesson 4: Working with Large Files

Estimated Time to Complete: 90 minutes

Learning Goals

Learn how to avoid generating large files that need special handling.
Know where to store large files and how to maintain version history.

Identifying Large Files

GitHub caps file sizes at 100 MB, which makes it less ideal for storing certain kinds of files. Most project files will be smaller than this, but we still want to be able to:

Keep version control on large files
Make these files easily shareable

In general there are only a few types of files that will exceed the size limit:

Massive .csv files
.zip files
Large geospatial files
Photos

How will I know if my file is too big for GitHub?

GitHub Desktop will warn you!

Do not click “Commit anyway”; doing so will prevent you from pushing your commit to GitHub.
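
You can also check file sizes from R before you commit. The following is a minimal sketch in base R, assuming the 100 MB cap described above; the function name `find_large_files` and its `limit_mb` argument are hypothetical and chosen only for illustration:

```r
# List files under `dir` whose size exceeds `limit_mb` megabytes.
# `find_large_files` is an illustrative helper, not part of any package.
find_large_files <- function(dir = ".", limit_mb = 100) {
  # recursive = TRUE returns files only; hidden folders such as .git are skipped
  paths <- list.files(dir, recursive = TRUE, full.names = TRUE)
  sizes_mb <- file.size(paths) / 1024^2
  too_big <- !is.na(sizes_mb) & sizes_mb > limit_mb
  data.frame(path = paths[too_big], size_mb = round(sizes_mb[too_big], 1))
}

# Example: check the data folder of the current project
# find_large_files("data")
```

An empty result means nothing in that folder is over the limit.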

Deciding How to Handle Large Files

Once you’ve identified that you have a large file, you’ll have to either:

Store the file on Box in the respective project folder. This is a good option for static files (e.g., photos, maps) since Box has more limited version control.
Find a way to decrease its size. This is the preferred method for anything data related that you might want to pull into R (e.g., datasets, geospatial files) so it can remain in the R project.

Method 1: Storing on Box

We went over how to upload and store files on Box in Module 3, but what about when you need to use a large file that’s stored on Box for a project? You have a few options:

Use Box Drive to access the file directly (via the R package boxrdrive).
- Strongly preferred, as it keeps the version of the file you’re using in sync with the version everyone else has access to.
Download the file manually to your project repository and add the file to .gitignore.
- You need to clearly specify, in the project readme.md and in the code, where to get the file and where to put it so others using your project can get the same data.

Note that while Box links are great for sharing data, you cannot download directly using the link, so you cannot use R to download a Box file via the link.
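
One way to picture the Box Drive option is to read the file by its path inside the synced Box Drive folder. This is only a sketch under stated assumptions: it assumes Box Drive syncs to a local folder such as `~/Box` (the location varies by operating system and setup), and `box_file_path` and the project folder name are hypothetical. The boxrdrive package may offer its own helpers for this; they are not shown here.

```r
# Build a path to a file inside the local Box Drive folder.
# ASSUMPTION: Box Drive syncs to `box_root` (often "~/Box"); adjust for your machine.
# `box_file_path` is an illustrative helper, not a function from boxrdrive.
box_file_path <- function(..., box_root = "~/Box") {
  path <- file.path(path.expand(box_root), ...)
  if (!file.exists(path)) {
    stop("File not found in Box Drive folder: ", path,
         "\nIs Box Drive installed and signed in?")
  }
  path
}

# Example (hypothetical project folder name):
# big_data <- read.csv(box_file_path("my-project", "example-large-file.csv"))
```

Failing early with a clear message helps collaborators notice when Box Drive is not set up on their machine.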

Activity:

First we’ll access the file via Box directly:

Go to Box to view the large files you want to download (example-large-file.csv and example-large-raster.tif). Note that their file sizes are too large for GitHub.
Locate and open the large-files.R file in WWS-TEST-example-repo/code.
- Clone the repository or merge the master into your branch if you don’t see the file
Run lines 13-24 to load the large dataset and raster file into R.

Now try the other method: download the files manually and put them in your repository:

Go to Box and download the large files.
Put the files in your repository under data.
Try to commit your changes to the repository. You’ll get a warning that the files are too large.
We’ll have to tell Git to ignore the files. Note that once you do this, any changes you make to those files will stay on your computer and will not be backed up. This is why we want to avoid this method for files that change.
We also need to tell others where to get these files. We’ll add it to the readme of the repository and in the code:
- Put the following text in the readme.md and as a comment at the top of the code:
  - This repository contains large files (`example-large-file.csv` and `example-large-raster.tif`) stored on Box (https://oregonstate.app.box.com/folder/336696034807). Please download and place in the data folder.
  - Then update the code to pull in the files using their relative file path.
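
The ignore step and the relative-path step can be sketched in base R as follows. The helper name `ignore_files` is hypothetical; editing .gitignore by hand in a text editor does the same thing.

```r
# Append patterns to .gitignore, skipping any that are already listed.
# `ignore_files` is an illustrative helper, not part of any package.
ignore_files <- function(patterns, gitignore = ".gitignore") {
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  new <- setdiff(patterns, existing)
  if (length(new) > 0) {
    writeLines(c(existing, new), gitignore)
  }
  invisible(new)
}

# Run from the repository root:
# ignore_files(c("data/example-large-file.csv", "data/example-large-raster.tif"))

# Then read the file using a path relative to the project root:
# large_data <- read.csv(file.path("data", "example-large-file.csv"))
```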

Method 2: Decreasing a File’s Size

Especially for data you might need in your analysis, we want to do everything we can to keep it in the GitHub repository so everyone has access to it and you’re not asking people to download files and move them around to make your code work.

There are a few strategies that often help you get around storing large files while maintaining reproducible workflows:

Subset the data
- Include directions as a readme.txt or code on how data was subset.
- Can either break into multiple files or just keep what you need.
Pull in directly so you don’t have to store the file
- Read the file directly into R
  - You can read many types of files (.csv, .txt, .xlsx) directly into R using the link instead of a file path
- Download data via an R package
  - dataRetrieval: USGS and Water Quality Portal data
  - FedData: Land cover database, SSURGO soil data, Daymet meteorological data
  - nhdplusTools: Stream and HUC layers
  - elevatr: DEM layers
  - climateR: Many different kinds of gridded climate data
- Download the file via R to a temporary folder as part of the script.
  - download.file(url, destfile = file.path(tempdir(), basename(url)))
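
The temporary-download strategy might look like the sketch below. Note that `destfile` in `download.file()` must be a file path rather than a folder, so the sketch uses `tempfile()`. The helper name `read_remote_csv` and the example URL are placeholders, not real resources.

```r
# Download a file to a temporary location and read it, so nothing large is
# saved in the repository. `read_remote_csv` is an illustrative helper.
read_remote_csv <- function(url, ...) {
  dest <- tempfile(fileext = ".csv")  # a file path, not a folder
  on.exit(unlink(dest))               # remove the temporary copy afterwards
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  read.csv(dest, ...)
}

# Example with a placeholder URL (replace with a real direct-download link):
# flows <- read_remote_csv("https://example.com/data/flows.csv")

# For plain-text formats, read.csv() also accepts the URL directly:
# flows <- read.csv("https://example.com/data/flows.csv")
```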

Activity:

Let’s practice these strategies for decreasing a file’s size:

Practice subsetting the dataset: in the large-files.R file in WWS-TEST-example-repo/code, run lines 29-38.
- Notice how that decreased our data file from 104.4 MB to 6.3 MB, a much more reasonable size for GitHub.
- Our raster file decreased from 122.2 MB to 0.732 MB.
If we want to keep all our data, we could store a separate file for each station and then merge them together when loading them in.
- Try this using lines 41-51 in large-files.R.
If we were doing these steps in a real project, we’d want to make sure our large-files.R script was included in the repository so others understand how we subset our data and so we can redo it if the starting file changes.
We’ll practice the other methods of dealing with large files (reading in via URL and via R package) in Module 5.
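
For reference, the two subsetting strategies from this activity can be sketched in base R with toy data. This is not the code in large-files.R; the column names `station`, `date`, and `flow` and the output folder are hypothetical.

```r
# Toy data standing in for a large dataset (hypothetical columns).
big <- data.frame(
  station = rep(c("A", "B", "C"), each = 4),
  date    = rep(as.character(as.Date("2024-01-01") + 0:3), times = 3),
  flow    = c(5, 6, 7, 8, 2, 3, 4, 5, 9, 8, 7, 6)
)

# In a real project this might be a folder such as data/stations.
out_dir <- file.path(tempdir(), "stations")
dir.create(out_dir, showWarnings = FALSE)

# Strategy 1: keep only the rows and columns you need.
small <- subset(big, station == "A", select = c(station, date, flow))

# Strategy 2: write one file per station...
pieces <- split(big, big$station)
for (name in names(pieces)) {
  write.csv(pieces[[name]],
            file.path(out_dir, paste0("station-", name, ".csv")),
            row.names = FALSE)
}

# ...and merge the files back together when loading.
files <- list.files(out_dir, pattern = "^station-.*\\.csv$", full.names = TRUE)
merged <- do.call(rbind, lapply(files, read.csv))
```

Each per-station file stays small enough for GitHub, while `merged` contains every row of the original data.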

Dealing with Very Large Files (>1 GB)

Avoid storing very large files (> 1 GB) on Box unless absolutely needed. These take a long time to upload and download, and there are few cases in which this amount of data is actually needed. Instead, use the subsetting tricks above to store just the data you need.

Create a subset of the data to work with
- This should be stored in GitHub or Box (depending on size)
Include information or code detailing:
- where to get data
- how to subset data

Dynamic Large Files: Box-LFS

If your large files are updated frequently and you’re unable to subset your file to store it on GitHub, you might consider using Box-LFS.

This is an R package that works similarly to Git LFS. It links large files stored on Box with GitHub so that Git maintains versioning, while the actual files live outside GitHub’s storage limits.

How it works:

Large files are stored on Box, inside a box-lfs folder within your project.
GitHub only stores .boxtracker pointer files, which record the file’s location and version history.
When working in a GitHub repository, Box-LFS makes sure your local copy has the correct large files.

With Box-LFS:

If Box Drive is installed:
- Files will be automatically moved between Box and your Git projects
Otherwise, you will still need to download and upload files to Box manually.

The package is new, can be buggy, and doesn’t currently support branches. For that reason I would only recommend it to more experienced GitHub/R users.

If you’d like to learn more about Box-LFS you can follow this tutorial, or ask Katie for more info about using it.

Wildfire and Water Security Project
