Estimated Time to Complete: 90 minutes
GitHub caps file sizes at 100 MB, which makes it less ideal for storing certain kinds of files. Most project files will be smaller than this, but we still want to be able to:
Keep version control on large files
Make these files easily shareable
In general there are only a few types of files that will exceed the size limit:
Massive .csv files
.zip files
Large geospatial files
Photos
GitHub Desktop will warn you!
Do not click "Commit anyway". Doing so will prevent you from pushing your commit to GitHub.
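If you'd rather check for oversized files before committing, a short base R sketch like the one below can list them. The helper name `find_large_files` and the `limit_mb` argument are made up for this example, and the 100 MB threshold is taken from the limit described above.

```r
# List files under `dir` that are larger than `limit_mb` megabytes.
# NOTE: `find_large_files` is a hypothetical helper, not part of any package.
find_large_files <- function(dir = ".", limit_mb = 100) {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  # Skip Git's internal files in case they were picked up
  files <- files[!grepl("(^|/)\\.git/", files)]
  sizes_mb <- file.size(files) / 1024^2
  too_big <- !is.na(sizes_mb) & sizes_mb > limit_mb
  out <- data.frame(
    file = files[too_big],
    size_mb = round(sizes_mb[too_big], 1),
    stringsAsFactors = FALSE
  )
  # Largest files first
  out[order(-out$size_mb), , drop = FALSE]
}
```

Running `find_large_files(".")` from the top of your repository returns a data frame of any files over the limit, or zero rows if everything is small enough.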
Once you’ve identified you have a large file you’ll have to decide to either:
Store the file on Box in the respective project folder. This is a good option for static files (e.g., photos, maps) since Box has more limited version control.
Find a way to decrease its size. This is the preferred method for anything data related that you might want to pull into R (e.g., datasets, geospatial files) so it can remain in the R project.
We went over how to upload and store files on Box in Module 3, but what about when you need to use a large file that’s stored on Box for a project? You have a few options:
Use Box Drive to access the file directly (via R package boxrdrive).
Download the file manually to your project repository and put the file in the .gitignore.
Document in the readme.md and in the code where to access the file and where to put it, so others using your project are able to get the same data.
Note that while Box links are great for sharing data, you cannot download directly using the link, so you cannot use R to download a Box file via the link.
First we’ll access the file via Box directly:
Go to Box to view the large files you want to download (example-large-file.csv and example-large-raster.tif). Note that their file sizes are too large for GitHub.
Locate and open the large-files.R file in
WWS-TEST-example-repo/code.
Run lines 13-24 to load the large dataset and raster file into R.
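For readers who want a sense of what reading from Box Drive can look like, here is a rough base R sketch. It is not the boxrdrive API and not necessarily what large-files.R does. The helper names `box_path` and `read_box_csv`, the `BOX_ROOT` environment variable, and the default `~/Box` location are all assumptions; the real sync folder depends on your operating system and Box Drive setup.

```r
# Build a path to a file inside the local Box Drive folder.
# ASSUMPTION: Box Drive syncs to the folder given in `box_root`. The default
# ("~/Box", or the BOX_ROOT environment variable) is only a guess.
box_path <- function(..., box_root = Sys.getenv("BOX_ROOT", unset = "~/Box")) {
  file.path(path.expand(box_root), ...)
}

# Read a CSV from Box Drive, failing early with a clear message if the
# file is not where we expect it. Hypothetical helper for illustration.
read_box_csv <- function(..., box_root = Sys.getenv("BOX_ROOT", unset = "~/Box")) {
  path <- box_path(..., box_root = box_root)
  if (!file.exists(path)) {
    stop("Could not find '", path, "'. Is Box Drive installed and signed in?")
  }
  read.csv(path)
}
```

A call might look like `read_box_csv("my-project", "example-large-file.csv")`, where `"my-project"` is a placeholder folder name. The same `box_path()` result could be handed to whichever raster-reading function your project uses.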
Now try the other method: download the files manually and put them in your repository:
Go to Box and download the large files.
Put the files in your repository under
data.
Try to commit your changes to the repository. You’ll get a warning that the files are too large.
We’ll have to tell Git to ignore the files. It’s important to note that once you do this, any change you make to those files will stay on your computer and not be backed up. This is why we want to avoid this method for files that are changing.
We also need to tell others where to get these files. We’ll add it to the readme of the repository and in the code:
Put the following text in the readme.md and as a
comment at the top of the code:
This repository contains large files
(`example-large-file.csv` and
`example-large-raster.tif`) stored on Box (https://oregonstate.app.box.com/folder/336696034807).
Please download and place in the data folder.
Then update the code to pull in the files using their relative file path.
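The steps above could be sketched in base R roughly as follows. Both helpers (`ignore_in_git` and `require_large_file`) are hypothetical names invented for this example; you can just as easily edit .gitignore by hand. The second helper shows one way to make the code itself tell collaborators where to get a missing file.

```r
# Append entries to .gitignore, skipping any that are already listed.
# NOTE: `ignore_in_git` is a hypothetical helper, not part of Git or any package.
ignore_in_git <- function(entries, gitignore = ".gitignore") {
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  new <- setdiff(entries, existing)
  if (length(new) > 0) {
    writeLines(c(existing, new), gitignore)
  }
  invisible(new)
}

# Stop with download instructions if a required large file is missing;
# otherwise return the path unchanged so it can be passed to a reader.
require_large_file <- function(path, box_url) {
  if (!file.exists(path)) {
    stop("Missing '", path, "'. Download it from ", box_url,
         " and place it in the data folder.")
  }
  path
}

# Example usage, run from the top of the repository:
# ignore_in_git(c("data/example-large-file.csv", "data/example-large-raster.tif"))
# dat <- read.csv(require_large_file(
#   "data/example-large-file.csv",
#   "https://oregonstate.app.box.com/folder/336696034807"
# ))
```

Because the paths are relative to the repository, the same code works on anyone's computer once they have placed the files in the data folder.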
Especially for data you might need in your analysis, we want to do everything we can to keep it in the GitHub repository so everyone has access to it and you’re not asking people to download files and move them around to make your code work.
There are a few strategies that often help you get around storing large files while maintaining reproducible workflows:
Subset the data
Document in the readme.txt or code how the data was subset.
Pull in directly so you don’t have to store the file
Read files (.csv, .txt, .xlsx) directly into R using the link instead of a file path
download.file(url, destfile = tempfile())
Let’s practice these strategies for decreasing a file’s size:
Let’s practice subsetting the dataset: in the large-files.R file in WWS-TEST-example-repo/code, run lines 29-38.
Notice how that decreased our data file from 104.4 MB to 6.3 MB, a much more reasonable size for GitHub.
Our raster file decreased from 122.2 MB to 0.732 MB.
Maybe we want to keep all our data. We could store a file for each station individually and then merge them together when loading them in.
large-files.R.
If we were doing these steps, we’d want to make sure our large-files.R script was included in the repository so others understand how we subset our data, and so you can redo it if your starting file changes.
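To make the two ideas above concrete, here is a small base R sketch using toy data. It is not the code in large-files.R; the column names (`station`, `year`, `value`), the filter, and the output file names are all assumptions chosen for illustration.

```r
# Toy data standing in for the large dataset.
# ASSUMPTION: the column names here are invented for this sketch.
dat <- data.frame(
  station = rep(c("A", "B", "C"), each = 4),
  year    = rep(2019:2022, times = 3),
  value   = seq_len(12)
)

# Strategy 1: keep only the rows and columns the analysis needs.
dat_small <- subset(dat, year >= 2021, select = c(station, value))

# Strategy 2: keep everything, but write one smaller file per station.
out_dir <- tempfile("stations")  # in a real project this might be "data/"
dir.create(out_dir)
for (s in unique(dat$station)) {
  write.csv(
    dat[dat$station == s, ],
    file.path(out_dir, paste0("station-", s, ".csv")),
    row.names = FALSE
  )
}

# Merge the per-station files back together when loading them in.
files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
dat_merged <- do.call(rbind, lapply(files, read.csv))
```

Keeping a script like this in the repository records exactly how the smaller files were produced, so the subset can be regenerated if the starting file changes.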
We’ll practice the other methods of dealing with large files (reading in via URL and via R package) in Module 5.
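As a preview of the URL approach, the sketch below downloads a file to a temporary location and reads it, so nothing large is saved in the repository. The helper name `read_csv_from_url` is hypothetical, and the URL must point directly at the file itself; as noted earlier, Box share links do not work this way.

```r
# Download a CSV to a temporary file, read it, then clean up.
# NOTE: `read_csv_from_url` is a hypothetical helper for illustration.
read_csv_from_url <- function(url) {
  # destfile must be a file path, not a directory, so use tempfile()
  dest <- tempfile(fileext = ".csv")
  on.exit(unlink(dest), add = TRUE)
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  read.csv(dest)
}
```

For plain .csv files, `read.csv()` can also be given a URL directly; downloading first is mainly useful for formats whose reader functions expect a local file.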
Avoid storing very large files (> 1 GB) on Box unless absolutely needed. These take a long time to upload and download, and there are few cases in which this amount of data is actually needed. Instead, use the subsetting tricks above to store just the data you need.
Create a subset of the data to work with
Include information or code detailing:
where to get data
how to subset data
If your large files are updated frequently and you’re unable to subset your file to store it on GitHub, you might consider using Box-LFS.
This is an R package that works similarly to Git LFS. It links large files stored on Box with GitHub so that Git maintains versioning, while the actual files live outside GitHub’s storage limits.
How it works:
Large files are stored on Box, inside a
box-lfs folder within your project.
GitHub only stores .boxtracker pointer
files, which record the file’s location and version
history.
When working in a GitHub repository, Box LFS makes sure your local copy has the correct large files.
With Box LFS:
If Box Drive is installed:
Otherwise you will still manually download and upload files to Box.
The package is new, can be buggy, and doesn’t currently support branches. For that reason I would only recommend it to more experienced GitHub/R users.
If you’d like to learn more about Box-LFS you can follow this tutorial, or ask Katie for more info about using it.