Imagine you’ve graduated and an entrepreneur, recognizing the soft skills gained from a career in research, wants to hire you to run their cutting-edge pizza shop.
They’ve decided they want automated machinery to do the work for you.
They’re also planning on eventually building an empire of automatic pizza shops, so you need to make sure everyone can tell exactly how this first shop works.
Your job is to figure out what equipment you need and how to make it work together!
Document everything: git/GitHub
Connecting steps automatically: snakemake
Requesting the resources: Longleaf
git can help.
* 9a2b3c4 - Add published version of the paper (2024-04-29)
* 8f7e6d5 - Revise submission after additional feedback, version 2 (2024-04-25)
* 7d6c5b4 - Update submission based on post-submission feedback (2024-04-20)
* 6c5b4a3 - Prepare final version for submission (2024-04-15)
* 5b4a392 - Finalize draft after thorough review (2024-04-10)
* 4a39881 - Incorporate feedback from final review (2024-04-05)
* 3928717 - Update draft, incorporate feedback from John (2024-04-01)
* 2871606 - Add second draft of the paper (2024-03-28)
* 1760505 - Initial draft of the paper (2024-03-25)
git!
git is a version control system used to record changes to files
GitHub uses git to help users host/review code and manage projects
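To make the history above less abstract, here is a minimal sketch of how a log in that shape can be produced. The throwaway repository, the file name paper.txt, and the author details are all made up for illustration; only the commit messages echo the example history.

```shell
# Build a throwaway repository so the example is self-contained.
repo=$(mktemp -d)
git init -q "$repo"

# Hypothetical file; the author name and email are placeholders.
echo "first draft" > "$repo/paper.txt"
git -C "$repo" add paper.txt
git -C "$repo" -c user.name="Example Author" -c user.email="author@example.com" \
    commit -q -m "Initial draft of the paper"

# Change the file and record a second snapshot.
echo "second draft" > "$repo/paper.txt"
git -C "$repo" -c user.name="Example Author" -c user.email="author@example.com" \
    commit -q -am "Add second draft of the paper"

# One line per commit: short hash, message, date.
log_output=$(git -C "$repo" log --pretty=format:'* %h - %s (%ad)' --date=short)
echo "$log_output"
```

Each commit is a saved version you can go back to, so there is no need for files named final_v2_REAL.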
git/GitHub matter because they:
branch
snake doesn’t make
Snakemake is a workflow management tool used to automate data analysis pipelines.
You give Snakemake a list of things you want, plus a list of rules that turn inputs into outputs. Snakemake figures out which scripts need to be run and in what order.
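As a rough sketch of the "things you want" plus "rules" idea, the commands below write a tiny Snakefile into a temporary folder. The rule name count_lines, the file paths, and the demo folder are hypothetical and not part of this project; the dry run at the end only happens if Snakemake is installed.

```shell
# Work in a temporary folder so nothing real gets touched.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/data"
printf 'a\nb\nc\n' > "$demo_dir/data/raw.txt"

# Write a two-rule Snakefile (hypothetical rule and file names).
cat > "$demo_dir/Snakefile" <<'EOF'
# "rule all" lists the things you want.
rule all:
    input:
        "results/line_count.txt"

# This rule says how to turn an input into an output.
rule count_lines:
    input:
        "data/raw.txt"
    output:
        "results/line_count.txt"
    shell:
        "wc -l < {input} > {output}"
EOF

# Dry run: show the planned jobs without running them.
if command -v snakemake >/dev/null 2>&1; then
    (cd "$demo_dir" && snakemake --cores 1 --dry-run) || echo "dry run failed"
fi
```

Because rule all asks for results/line_count.txt and only count_lines produces it, Snakemake can work out that count_lines has to run first.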
Reasons to use Snakemake:
bash, R, and Python
Longleaf is UNC’s high-performance computing cluster (HPC). It’s basically a ton of computers/storage.
Accessible from anywhere with internet
Labs typically start with 40 TB of storage, users get 10 TB
Many scripts can be run at once, with your computer off
A LOT more resources than a typical computer (including Gremlin and Sphinx)
Easy to share files!
cp -r data/i/want data/where/i/want/to/put/it
Your computer could explode and it’d have no impact on the project
If you’ve done everything right, you could:
bash snake.sh
And the whole project would be reproduced!
Linux/macOS users can use the terminal; Windows needs third-party remote computing software, like VS Code or MobaXterm
This tutorial will assume that you’re on Windows 11 and using VS Code
VS Code supports SSH (Secure Shell), which is a secure way to connect to a remote computer
Let’s go ahead and open up VS Code
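For the connection itself, SSH tools (including VS Code's Remote-SSH extension) can read host entries from an SSH config file. Here is a sketch of what an entry might look like. The hostname longleaf.unc.edu and the placeholder username your_onyen are assumptions, so check them against UNC's own instructions; the entry is written to a demo file rather than your real ~/.ssh/config.

```shell
# Written to a demo file so nothing in ~/.ssh is overwritten.
ssh_config_demo=$(mktemp)

# Assumed hostname and placeholder username; replace with your own details.
cat > "$ssh_config_demo" <<'EOF'
Host longleaf
    HostName longleaf.unc.edu
    User your_onyen
EOF

# Show the entry that would go into ~/.ssh/config.
cat "$ssh_config_demo"
```

With an entry like this in place, the short name longleaf stands in for the full address when you connect.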
Let’s pivot to the GitHub side of things
We’re going to do two main things:
To start, we need to get our SSH key from Longleaf. Run the following lines in the terminal that we just opened in VS Code.
ls -al ~/.ssh
This looks for any existing keys associated with your LL account
You should see a file called id_rsa.pub. Print it with cat ~/.ssh/id_rsa.pub, then copy the output
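If no id_rsa.pub shows up, a key pair can be generated with ssh-keygen. The sketch below uses a temporary folder in place of ~/.ssh so it cannot overwrite a real key, and the email in the comment field is a placeholder; on Longleaf you would point -f at ~/.ssh/id_rsa instead.

```shell
# key_dir stands in for ~/.ssh here so the demo cannot overwrite a real key.
key_dir=$(mktemp -d)

if [ -f "$key_dir/id_rsa.pub" ]; then
    echo "Found an existing public key"
else
    # -N "" sets an empty passphrase; -C adds a comment (placeholder email).
    ssh-keygen -q -t rsa -b 4096 -N "" -C "your.email@example.com" -f "$key_dir/id_rsa"
fi

# Print the public half, which is the part you copy.
cat "$key_dir/id_rsa.pub"
```

Only the .pub file is meant to be shared; the matching id_rsa file is the private half and stays on Longleaf.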
Now go back to GitHub
In the Longleaf terminal, tell git which GitHub account your commits belong to:
git config --global user.name "your-github-username"
git config --global user.email your.email.linkedwith.github
A repository, or repo, is a self-contained project on GitHub
They can be private or public and managed by one person or many.
GitHub is not for file storage!
README file, which we can make into the intro to the project for anyone visiting our repo
.gitignore is a way to tell git that you don’t want it to look at certain files. It’s a good way to keep data from accidentally getting uploaded
giting the repo onto LL
All you need to do is run one line to clone the repo
Navigate to wherever you want the directory to exist, then run:
git clone URL-THAT-YOU-COPIED
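Once a repo exists on LL, it is worth checking that .gitignore really hides what you expect. This sketch uses a throwaway repo with a made-up data/ folder and file names, and asks git whether a given file is ignored.

```shell
# Throwaway repo with hypothetical files, standing in for a cloned project.
ignore_repo=$(mktemp -d)
git init -q "$ignore_repo"
mkdir -p "$ignore_repo/data"
echo "id,value" > "$ignore_repo/data/raw.csv"
echo "# My project" > "$ignore_repo/README.md"

# Tell git to skip everything inside data/.
printf 'data/\n' > "$ignore_repo/.gitignore"

# check-ignore reports which .gitignore pattern matches a path.
git -C "$ignore_repo" check-ignore -v data/raw.csv

# status lists what git can still see; data/ should not appear.
git -C "$ignore_repo" status --short
```

The README still shows up as an untracked file, while nothing under data/ is listed, which is exactly what you want for large or private data.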
UNC lets you run interactive RStudio sessions here: https://ondemand.rc.unc.edu
Open an RStudio session and, to get 16 GB of RAM, enter this in the Additional Job Submission Arguments: --mem 16gb
It’ll take a moment to create the session, connect when it’s ready
R Projects are a great way to keep your projects separate and tidy