git, snakemake, and longleaf - oh my!

Brian Gural

Reproducible science… in silico??

  • Bioinformaticians are people too
  • We need to make sure our research is well documented and reproducible just like bench scientists
  • Projects can get complex, messy, and very computationally demanding

That’s amore

Imagine you’ve graduated and an entrepreneur, recognizing the soft skills gained by a career in research, wants to hire you to run their cutting edge pizza shop.

They decided they want automated machinery to do the work for you.

They’re also planning on eventually building an empire of automatic pizza shops, so you need to make sure everyone can tell exactly how this first shop works.

Your job is to figure out what equipment you need and how to make it work together!

How do you approach this?

  • Document everything

  • Connect the steps automatically

  • Request the resources

git/GitHub

snakemake

Longleaf

Suffering from manual version control? git can help.

Before GitHub

paper_draft1.doc
paper_draft2.doc
paper_reviewed_by_john.doc
paper_draft3_comments_incorporated.doc
paper_final_draft.doc
paper_final_reviewed.doc
paper_final_submission.doc
paper_final_submission_revised.doc
paper_final_submission_revised_v2.doc
paper_published_version.doc

After GitHub

* 9a2b3c4 - Add published version of the paper (2024-04-29)
* 8f7e6d5 - Revise submission after additional feedback, version 2 (2024-04-25)
* 7d6c5b4 - Update submission based on post-submission feedback (2024-04-20)
* 6c5b4a3 - Prepare final version for submission (2024-04-15)
* 5b4a392 - Finalize draft after thorough review (2024-04-10)
* 4a39881 - Incorporate feedback from final review (2024-04-05)
* 3928717 - Update draft, incorporate feedback from John (2024-04-01)
* 2871606 - Add second draft of the paper (2024-03-28)
* 1760505 - Initial draft of the paper (2024-03-25)

Go on, git!

git is a version control system used to record changes to files

GitHub uses git to help users host/review code and manage projects

git/GitHub matter because they:

  • Track every version of every script
  • Publicly document your work
  • Let new versions of a project branch off
  • Make it easy to collaborate
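
To make the tracking idea concrete, here is a minimal sketch of the everyday git loop, run in a throwaway directory. The file name, the demo user, and the commit messages are made up for illustration; only the git commands themselves are standard.

```shell
# Work in a scratch directory so nothing real is touched
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Start tracking this directory and set a demo identity for this repo only
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

# First version of the file: stage it, then record it
echo "first draft" > paper.txt
git add paper.txt
git commit --quiet -m "Initial draft of the paper"

# Edit the same file and record the new version (-a stages tracked files)
echo "second draft" > paper.txt
git commit --quiet -am "Add second draft of the paper"

# One file on disk, full history in the log
git log --oneline
```

Instead of ten copies of the paper, there is one paper.txt and a log entry for each version.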

Help, my snake doesn’t make

Snakemake is a workflow management tool used to automate data analysis pipelines.

You give Snakemake a list of things you want, plus a list of rules that turn inputs into outputs. Snakemake figures out which scripts need to be run and in what order.

Reasons to use Snakemake:

  • Gives a rule book for how your project should be run so it can be reproduced
  • Integrates scripts from many languages, like bash, R, and Python
  • Makes it easy to scale your project, since it can ask for many things to be run in parallel

Longleaf: The darling of UNC bioinformaticians

Longleaf is UNC’s high-performance computing cluster (HPC). It’s basically a ton of computers/storage.

Accessible from anywhere with internet

Labs typically start with 40 TB of storage, and users get 10 TB

Why use Longleaf?

Many scripts can be run at once, with your computer off

A LOT more resources than a typical computer (including Gremlin and Sphinx)

Easy to share files!

cp -r data/i/want data/where/i/want/to/put/it

Getting the band together

Your computer could explode and it’d have no impact on the project

If you’ve done everything right, you could:

  • log into Longleaf
  • clone the GitHub repo for your project
  • copy in your data
  • run bash snake.sh

And the whole project would be reproduced!

Setting it all up

Setting it all up: Longleaf

Linux and macOS users can use the built-in terminal. Windows needs third-party remote computing software, like VS Code or MobaXterm

This tutorial will assume that you’re on Windows 11 and using VS Code

VS Code supports SSH (Secure Shell), which is a secure way to connect your computer to a remote one

Let’s go ahead and open up VS Code

VS Code to Longleaf

Use the extensions tab on the left to find pre-built tools to connect via SSH remotely

In the new Remote Explorer tab, find the settings for SSH

VS Code to Longleaf

Open up the config file (not the one that says ssh_config!)

Add these lines to specify where you want to connect and who you want to log in as. Feel free to change “unc” in the Host line to whatever name you want for this connection
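
For reference, an entry of this kind usually looks like the sketch below. The HostName longleaf.unc.edu is an assumption here (confirm it with UNC Research Computing), and your_onyen is a placeholder for your own ONYEN. The sketch writes to a scratch file so it is safe to run anywhere; in VS Code you would type the three lines into the config file itself.

```shell
# Written to a scratch file here; the real file is usually ~/.ssh/config
config_file="$(mktemp)"

# Host is the nickname for the connection, HostName is the server (assumed),
# and User is the account you log in as (placeholder)
cat >> "$config_file" <<'EOF'
Host unc
    HostName longleaf.unc.edu
    User your_onyen
EOF

# Show what was written
cat "$config_file"
```

With an entry like this in place, running ssh unc in a terminal would be shorthand for ssh your_onyen@longleaf.unc.edu.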

VS Code to Longleaf

Find your new SSH connection in the Remotes tab, then click this arrow to connect your current window

You’ll need to enter your ONYEN password, then you’re on!

GitHub and Longleaf

Let’s pivot to the GitHub side of things

We’re going to do two main things:

  • Introduce our GitHub and Longleaf accounts to each other
  • Set up our first repository (project) on GitHub

GitHub and Longleaf

To start, we need to get our SSH key from Longleaf. Run the following lines in the terminal that we just opened in VS Code

ls -al ~/.ssh

This lists any existing keys associated with your Longleaf (LL) account

You should see a file called id_rsa.pub. Print it with cat ~/.ssh/id_rsa.pub, then copy the output
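
If id_rsa.pub is not in the listing, you can generate a key pair with ssh-keygen. The sketch below wraps that check in a small helper; the function name ensure_ssh_key and the "longleaf" comment label are made up for illustration, and the demo runs against a scratch directory so it cannot overwrite a real key.

```shell
# Create an RSA key pair in the given directory unless one already exists
ensure_ssh_key() {
    key_dir="$1"
    mkdir -p "$key_dir"
    chmod 700 "$key_dir"
    if [ -f "$key_dir/id_rsa.pub" ]; then
        echo "Found existing key: $key_dir/id_rsa.pub"
    else
        # -N "" sets an empty passphrase; leave -N out to be prompted for one
        ssh-keygen -q -t rsa -b 4096 -N "" -C "longleaf" -f "$key_dir/id_rsa"
    fi
}

# Demo on a scratch directory; on Longleaf you would pass "$HOME/.ssh"
scratch="$(mktemp -d)"
ensure_ssh_key "$scratch/.ssh"

# Print the public half, which is the part that gets pasted into GitHub
cat "$scratch/.ssh/id_rsa.pub"
```

Only the .pub file is meant to be shared. The matching id_rsa file is the private key and should stay on Longleaf.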

Now go back to GitHub

GitHub SSH Keys

Open the settings of your GitHub account

Find the settings for SSH Keys

GitHub SSH Keys

Go ahead and add a new key

GitHub SSH Keys

Name that sucker and paste the key you copied from Longleaf a minute ago

Back to Longleaf

In the Longleaf terminal, tell git who you are so your commits are linked to your GitHub account:

git config --global user.name "your-github-username"

git config --global user.email your.email.linkedwith.github

Baby’s first repo

A repository, or repo, is a self-contained project on GitHub

They can be private or public and managed by one person or many.

GitHub is not for file storage!

Baby’s first repo

Go to your Repositories and make a new one

You’ll want these settings, which we’ll go over in a moment

Baby’s first repo

Copy the link to your new repo

What did we just do?

  • Asked for a README file, which we can make into the intro to the project for anyone visiting our repo
  • Added a .gitignore, which is a way to tell git that you don’t want it to look at certain files. It’s a good way to keep data from accidentally getting uploaded
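
Here is a small sketch of a .gitignore doing its job. The patterns and file names are hypothetical examples for a bioinformatics project, not the contents of the template GitHub offers.

```shell
# Set up a scratch repo with one script and some "data"
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
mkdir -p data
touch analysis.R data/big.csv reads.fastq.gz

# Each line is a pattern that git should ignore
cat > .gitignore <<'EOF'
# Everything inside the data directory
data/
# Compressed sequencing reads anywhere in the repo
*.fastq.gz
EOF

# Ask git whether specific files are ignored (exit status 0 means yes)
git check-ignore -q data/big.csv && echo "data/big.csv is ignored"
git check-ignore -q analysis.R || echo "analysis.R is tracked as usual"

# Only .gitignore and analysis.R show up as files git can see
git status --short
```

Because the data files never appear in git status, they cannot be added and pushed by accident.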

giting the repo onto LL

All you need to do is run one line to clone the repo

Navigate to wherever you want the directory to exist, then run:

git clone URL-THAT-YOU-COPIED

Using RStudio on LongLeaf

UNC lets you run interactive RStudio sessions. Go here: https://ondemand.rc.unc.edu

Open an RStudio session and enter this in the Additional Job Submission Arguments field to request 16 GB of RAM: --mem 16gb

It’ll take a moment to create the session, connect when it’s ready

Rprojects

Rprojects are a great way to keep your projects separate and tidy