This tutorial was originally constructed as a part of Titus Brown’s Next Generation Sequencing Data Analysis Workshop Week 3 that took place at Michigan State University’s Kellogg Biological Station between August 24-28, 2015.
Github for this website
The git lesson here is heavily based on Software Carpentry’s core curriculum on git, entitled “Version Control with Git”, which is maintained by Ivan Gonzalez and Daisie Huang.
Could I replicate Figure 1 from your last publication, grant proposal, or presentation?
If not, what would you and your co-authors need to provide or do so that I could replicate your Figure 1?
“Robust research is about doing small things that stack the deck in your favor to prevent mistakes.” ~Vince Buffalo
Robust: strong and healthy; vigorous (adjective)
How can we make our research strong, healthy and vigorous?
Reproduce: produce again (verb)
Reproducible: able to be reproduced or copied (adjective)
So, reproducible research may be repeated by other researchers with the same results.
It takes a lot of effort to be robust and reproducible. However, it will make your life (and science) easier!
Code readability is very important.
If your code is more readable, then:
Let your computer do the work for you
Format your data so it’s easily read by your computer, not by you or other humans.
Add tests within your code to make sure your code is doing what it is supposed to do.
Assertions are statements that something holds true. Assertions:
1. Ensure that if something goes wrong, the program will stop.
2. Explain what the program is doing.
stopifnot()
The testthat package is made for this! Check out the testthat package here.
assert()
Read-only is important because:
The Reproducible-Science-Curriculum Github repo for Reproducible Research Project Initialization is a great place to start a reproducible research project.
It is simple. Without your code and data, your research is not reproducible.
Bottom line: Adopt a computing notebook that is as good as a wet-lab notebook.
To fully reproduce a study, each step of analysis must be described in much more detail than can be included in a publication.
Include a record of your steps, where files are, where they came from, and what they contain.
Include session_info() in your document, preferably at the bottom. Session info lists the version of R that you’re using plus all of the packages you’ve loaded.
session_info()
For example, all the above information could be stored in a README file.
Using inline code can make the creation of tables much easier if the data changes!
Do not rely on hard-coded absolute paths (e.g. /Users/marschmi/Data/seq-data.csv or even ~/Data/seq-data.csv).
Relative paths (e.g. Data/seq-data.csv) or command line arguments are better alternatives.
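A minimal sketch of the command line argument approach, written as a shell script. The helper name count_rows and the throwaway CSV file are assumptions made for illustration; they are not part of the original workshop materials.

```shell
#!/bin/sh
# Hypothetical helper: count the data rows of a CSV whose path is passed
# in as an argument instead of being hard-coded inside the script.
count_rows() {
    data_file="$1"
    # Stop early with a clear message if the file is missing.
    if [ ! -f "$data_file" ]; then
        echo "Cannot find data file: $data_file" >&2
        return 1
    fi
    # Skip the header line, then count what remains.
    tail -n +2 "$data_file" | wc -l | tr -d ' '
}

# Build a small example project so the sketch runs anywhere.
project_dir=$(mktemp -d)
mkdir "$project_dir/Data"
printf 'sample,reads\nA,10\nB,20\n' > "$project_dir/Data/seq-data.csv"

# Work from the project root and refer to the data with a relative path.
cd "$project_dir"
n_rows=$(count_rows Data/seq-data.csv)
echo "Rows of data: $n_rows"
```

Because the path is relative to the project root, the same call should work for a collaborator who copies the project folder to a different location.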
If there are any randomizations of data or simulations, use set.seed() in the first code chunk.
Karl Broman suggests opening R, typing runif(1, 0, 10^8), and then pasting the resulting large number into set.seed() in the first code chunk. If you do this, then the random aspects of your analysis should be repeated the same way every time.
- How do you currently keep track of versions of the same document for (1) data analysis and (2) writing a paper/grant proposal?
- When working with a collaborator simultaneously, how do you keep track of versions of the same document?
Months ago, you submitted a scientific paper to a journal for publication and you’ve finally received your reviews back. The reviewers give your paper minor revisions and suggest that you modify one of the first steps in your data analysis and therefore re-create every figure.
The deadline for the revisions is quickly approaching and you do not have much time. How do you stack the deck in your favor?
Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).
Do you want to be your own friend in a year?
Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.
Examples from my own research: https://github.com/marschmi
Go to the original Software Carpentry lesson on automated version control: http://swcarpentry.github.io/git-novice/01-basics/
Open up the Unix shell. What happens when you run:
1. git config --list
2. git config
Now, let’s set up Git by going to http://swcarpentry.github.io/git-novice/02-setup/
Learning Goals
A local repository means that we are creating a repository on our own computer. Our computer is local: it especially likes attending the weekly farmers market and supporting local businesses.
- In the shell, navigate to your home directory and create a new directory called git_repos. cd into the git_repos folder.
- In the git_repos folder, make another new directory called earth_523. cd into earth_523.
- Type ls -aF.
- What files do you see?
- Now, initialize the repository by typing git init.
- What files do you see now that earth_523 is a repository under version control?
- To check that everything is set up correctly, ask git to tell us the status of our project by typing git status.
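The steps above can be sketched as one short shell session. As an assumption made for safety, the sketch works inside a throwaway directory from mktemp rather than your real home directory.

```shell
#!/bin/sh
# Use a throwaway directory in place of the home directory (an assumption
# made so the sketch does not touch real files).
workdir=$(mktemp -d)
mkdir -p "$workdir/git_repos/earth_523"
cd "$workdir/git_repos/earth_523"

# Before initialization only "./" and "../" are listed.
ls -aF

# Put the directory under version control; -q keeps the output short.
git init -q

# A hidden ".git/" directory now holds the repository's history.
ls -aF

# Ask git for the status of the project.
git status
```

The only visible difference between the two listings is the new .git/ directory; deleting it would throw away the project's history.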
Learning Goals
nano README.md (or notepad README.md or npp README.md)
- Write down the author, date
- Click here for help with markdown language syntax.
git status
So far the changes are “untracked” with git. Now we need to tell git to keep track of the changes we have made - in other words, it’s time for the first add and commit!
git add README.md
Git insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we’re adding a few citations to our supervisor’s work to our thesis. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we’re doing on the conclusion (which we haven’t finished yet).
To allow for this, Git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed.
If you think of Git as taking snapshots of changes over the life of a project:
- git add specifies what will go in a snapshot (putting things in the staging area), and
- git commit then actually takes the snapshot and makes a permanent record of it (as a commit).
If you don’t have anything staged when you type git commit, git will prompt you to use git commit -a or git commit --all, which is kind of like gathering everyone for the picture! However, it’s almost always better to explicitly add things to the staging area, because you might commit changes you forgot you made. (Going back to snapshots, you might get the extra with incomplete makeup walking on the stage for the snapshot because you used -a!) Try to stage things manually, or you might find yourself searching the web for “git undo commit” more than you would like!
Let’s watch as our changes to a file move from our editor to the staging area and into long-term storage.
git status
Git now knows that it’s supposed to keep track of README.md, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:
git commit -m "Created README.md to keep track of documentation of this repository."
When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is f22b25e (Your commit will have another unique identifier.)
We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.
Good commit messages start with a brief (<50 characters) summary of changes made in the commit. If you want to go into more detail, add a blank line between the summary line and your additional notes.
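One way to follow this convention without opening an editor is to pass -m twice: git treats each -m as its own paragraph, so the second one ends up after a blank line. This is a minimal sketch in a throwaway repository; the author name, email, and message text are placeholders.

```shell
#!/bin/sh
repo=$(mktemp -d)
cd "$repo"
git init -q
# Identity set for this repository only; the name and email are placeholders.
git config user.name "Example Author"
git config user.email "author@example.com"

echo "Author: Example Author" > README.md
git add README.md

# First -m: short summary line. Second -m: the more detailed notes.
git commit -q -m "Create README.md for documentation" \
    -m "The README records who created the repository and when."

# Print the full commit message, then keep just the summary line.
message=$(git log -1 --format=%B)
echo "$message"
summary=$(git log -1 --format=%s)
```

Tools that show one line per commit display only the summary, which is why keeping it short and specific pays off.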
If we run git status now:
git status
Now, git tells us everything is up to date.
- Open the readme and write down the purpose of this repository
git status
git diff: This shows us the differences between the current state of the file and the most recently saved version:
The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. If we break it down into pieces:
The first line tells us that git is producing output similar to the Unix diff command comparing the old and new versions of the file.
The second line tells exactly which versions of the file git is comparing; df0654a and 315bf3a are unique computer-generated labels for those versions.
The third and fourth lines once again show the name of the file being changed.
The remaining lines are the most interesting: they show us the actual differences and the lines on which they occur. In particular, the + markers in the first column show where we have added lines.
After reviewing our change, it’s time to commit it:
git add README.md
git status
git commit -m "Added purpose of this repo to readme file"
git status
- If we want to know what we’ve done recently, we can ask git to show us the project’s history using git log.
git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.
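The whole cycle of edit, diff, add, commit, and log can be replayed in a throwaway repository. This is a sketch under assumptions: the identity and file contents are placeholders, and --no-pager is used only so the output prints straight to the terminal.

```shell
#!/bin/sh
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

# First commit: create the README.
echo "Author: Example Author" > README.md
git add README.md
git commit -q -m "Created README.md"

# Modify the file, then look at the change before staging it.
echo "Purpose: practice version control" >> README.md
git --no-pager diff     # the added line is marked with a leading +

# Stage and commit the change.
git add README.md
git commit -q -m "Added purpose of this repo to readme file"

# Most recent commit first, one line per commit.
git --no-pager log --oneline
n_commits=$(git rev-list --count HEAD)
```

The --oneline listing shows each commit's short identifier followed by the summary line of its message.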
- Make another change and go through:
git add
git status
git commit -m "your message"
git status
To recap, when we want to add changes to our repository, we first need to add the changed files to the staging area (git add) and then commit the staged changes to the repository (git commit):
git remote -v
- Does anything happen?
Learning Goals
Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.
Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, BitBucket or GitLab.
Let’s start by sharing the changes we’ve made to our current project with the world.
- Log in to GitHub.
- Click on “Repositories”, and then click on the icon in the top right corner to create a new repository called earth_523 (Note: this name should be exactly the same as the name of the folder on your computer):
As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:
This effectively does the following on GitHub’s servers:
mkdir earth_523
cd earth_523
git init
Our local repository still contains our earlier work on README.md, but the remote repository on GitHub doesn’t contain any files yet:
The next step is to connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it:
- Click on the ‘HTTPS’ link to change the protocol from SSH to HTTPS.
HTTPS vs SSH
We use HTTPS here because it does not require additional configuration. After the workshop you may want to set up SSH access, which is a bit more secure, by following the great tutorial from GitHub.
- Copy that HTTPS URL from the browser and go into the local earth_523 repository.
- Run this command:
git remote add origin https://github.com/marschmi/earth_523.git
Make sure to use the URL for your repository rather than marschmi’s.
We can check that the command has worked by running git remote -v:
git remote -v
The name origin is a local nickname for your remote repository: we could use something else if we wanted to, but origin is by far the most common choice.
Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:
git push -u origin master
origin: the name of the remote repo
master: a branch name
Proxy
If the network you are connected to uses a proxy, there is a chance that your last command failed with “Could not resolve hostname” as the error message. To solve this issue you need to tell Git about the proxy:
git config --global http.proxy http://user:password@proxy.url
git config --global https.proxy http://user:password@proxy.url
When you connect to another network that doesn’t use a proxy you will need to tell Git to disable the proxy using:
git config --global --unset http.proxy
git config --global --unset https.proxy
Password Managers
If your operating system has a password manager configured, git push will try to use it when it needs your username and password. If you want to type your username and password at the terminal instead of using a password manager, type:
unset SSH_ASKPASS
You may want to add this command at the end of your ~/.bashrc to make it the default behavior.
Our local and remote repositories are now in this state:
We can pull changes from the remote repository to the local one as well:
git pull origin master
Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.
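The connect, push, and pull steps can be rehearsed offline. In this sketch a bare repository on disk stands in for the empty GitHub repository, which is an assumption made so that no account or network is needed; with GitHub you would use your HTTPS URL instead of the local path.

```shell
#!/bin/sh
base=$(mktemp -d)

# A bare repository on disk plays the role of the empty GitHub repository.
git init -q --bare "$base/remote.git"

# The local repository with one commit.
mkdir "$base/earth_523"
cd "$base/earth_523"
git init -q
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"
echo "Author: Example Author" > README.md
git add README.md
git commit -q -m "Created README.md"
# Name the branch master so it matches the commands in this lesson.
git branch -M master

# Connect the two repositories and check the nickname.
git remote add origin "$base/remote.git"
git remote -v

# Push the local history, then pull; the pull finds nothing new.
git push -q -u origin master
git pull -q origin master
```

After the push, both repositories point at the same commit, which is why the pull has nothing to download.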
A helpful resource describing version control with git basics from Software Carpentry can be found here.
- git config
- git status
- git add
- git commit
- git log
- git diff: Allows us to look at older versions of the file compared to the current (non-staged) version of the file
- git diff HEAD~2 filename.R: Compares the current version of the file with the version from 2 commits ago
- git push: Push changes from the local repository up to the remote repository.
- git pull: Pull changes from the remote repository down to the local repository.
When adding a local to a new remote repo:
- git remote add origin ___(url)___
- git remote -v: to check if the URL is correct
- git push -u origin master
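To see what git diff HEAD~2 from the summary above reports, here is a small sketch that builds three commits in a throwaway repository and then compares the working copy with the version from two commits back. The file contents and identity are placeholders chosen for illustration.

```shell
#!/bin/sh
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

# Three commits, each adding one line to a hypothetical script.
for step in 1 2 3; do
    echo "# step $step" >> filename.R
    git add filename.R
    git commit -q -m "Add step $step"
done

# Compare the file as it was two commits ago with the current working copy.
changes=$(git --no-pager diff HEAD~2 filename.R)
echo "$changes"
```

HEAD~2 is the first of the three commits here, so the lines added in the second and third commits appear with a leading +.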
knitr to make it easy to create reproducible web-based reports.
A convenient tool for reproducible and dynamic reports with R!
knitr.
YAML Header: A set of key-value pairs at the start of your file. Begin and end the header with a line of three dashes (---).
The RStudio template writes the YAML header for you.
output: html_document
output: pdf_document
output: word_document
output: beamer_presentation (beamer slideshow - pdf)
output: ioslides_presentation (ioslides presentation - html)
For example: Here’s the YAML header for this webpage with a table of contents.
---
title: "Introductory Version Control with Git"
subtitle: "Earth 523 - Metagenomics"
author: "Marian L. Schmidt, @micro_marian, marschmi@umich.edu"
date: "February 2nd, 2017"
output:
  html_document:
    code_folding: show
    highlight: haddock
    keep_md: yes
    theme: united
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: yes
    toc_depth: 2
---
Markdown is a simple formatting language that is easy to use
* or + sign
*italics* and **bold**. Can even include tables:
| First Header | Second Header |
|---|---|
| Content Cell | Content Cell |
| Content Cell | Content Cell |
Go to Help –> “Cheatsheets”
Code blocks display with fixed-width font
# quick summary
library(ggplot2)
min(diamonds$price)
## [1] 326
mean(diamonds$price)
## [1] 3932.8
max(diamonds$price)
## [1] 18823
You can name the code chunk.
echo = TRUE: The code will be displayed.
eval = TRUE: Yes, execute the code.
You may want to use the same set of chunk options throughout a document and you don’t want to retype those options in every chunk.
Global chunk options are for you!
You can evaluate expressions inline by enclosing the expression within a single back-tick qualified with r.
Inline code is underappreciated!
Last night, I saw 7 shooting stars!
rmarkdown::render("<filepath>")
When you render, R will:
Execute each embedded code chunk and insert the results into your report.
Build a new version of your report in the output file type.
Open a preview of the output file in the viewer pane.
Save the output file in your working directory.