Warm up discussion:
1. How do you currently keep track of versions of the same document for (1) data analysis and (2) writing a paper/grant proposal?
2. When working with a collaborator simultaneously, how do you keep track of versions of the same document?
The main page for the core lessons from the Software Carpentry Foundation can be found at http://software-carpentry.org/lessons/
The lesson here is based on Software Carpentry’s core curriculum on git entitled “Version Control with Git” and is maintained by Ivan Gonzalez and Daisie Huang.
The main lesson link can be found at http://swcarpentry.github.io/git-novice/
Months ago, you submitted a scientific paper to a journal for publication and you’ve finally received your reviews back. The deadline for your revisions is quickly approaching and you are working with your collaborators to meet it.
As the first author, you are re-running analyses in R and working with your collaborators on re-writing the paper as per the reviewers’ comments. For the written document, you’re quickly passing a Word document back and forth and trying to keep up with each other to meet the deadline. In the midst of all of the changes to the document, a paragraph of the results is lost. Which version of the Word document was it in? Which version of your code file were those analyses in?
In this moment you are reminded of a time when you attended a Software Carpentry Workshop in January 2017. It was a busy time, between the workshop and adjusting to the start of another semester. However, there was one thing you learned: always commit your changes to your GitHub repo with a meaningful commit message. Since you’ve followed this principle during this process, you know you can rely on your commits and version control for finding those results!
But how?
Version control is a tool for managing changes to a set of files. Each set of changes creates a new commit of the files; the version control system allows users to recover old commits reliably, and helps manage conflicting changes made by different users.
To commit is to record the current state of a set of files (a group of changes) in a version control repository. As a noun, a commit is the result of committing, i.e. a recorded group of changes in a repository. If a commit contains changes to multiple files, all of the changes are recorded together.
A version control repository is a storage area where a version control system stores the full history of commits of a project and information about who changed what, when.
Multiple versions of a document can be merged into one.
Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).
Do you want to be your own friend in a year?
Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.
Examples from my own research: https://github.com/marschmi
Go to the original Software Carpentry lesson on automated version control: http://swcarpentry.github.io/git-novice/01-basics/
A note on the shell and why it is important: It’s like the air traffic control tower, allowing airlines from all over the globe to work together to get people around. In this case, bash talks directly to your computer through Unix shell commands to run many programs. If you need to run multiple programs at once, an efficient way to do it automatically is to use the shell. Here we will use the shell to run nano, git, Unix commands, and R side by side!
Open up the Unix shell. What happens when you run:
1. git config --list
2. git config
Now, let’s set up Git by going to http://swcarpentry.github.io/git-novice/02-setup/
Learning Goals
A local repository means that we are creating a repository on our own computer. Our computer is local - it especially likes attending the weekly farmers market and supporting local businesses.
- In the shell, navigate to your home directory and create a new directory called git_repos. cd into the git_repos folder.
- In the git_repos folder, make another new directory called swc_workshop. cd into swc_workshop.
- Type ls -a.
- What files do you see?
- Now, initialize the repository by typing git init.
- What files do you see now that swc_workshop is a repository under version control?
- To check that everything is set up correctly, ask git to tell us the status of our project by typing git status.
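The steps above can be sketched as one shell session. This is a minimal sketch that assumes a Unix-style shell with git installed; using mkdir -p (so the commands do not fail if a folder already exists) is a choice made here, not part of the original exercise.

```shell
# Start in the home directory and create the folder that will hold our repositories
cd ~
mkdir -p git_repos
cd git_repos

# Create the project folder and move into it
mkdir -p swc_workshop
cd swc_workshop

# List everything in the folder, including hidden files
ls -a

# Turn the folder into a repository under version control
git init

# List again: a hidden .git directory is now present
ls -a

# Ask git for the status of the project
git status
```

The only visible difference after git init is the hidden .git directory, which is why ls -a is used rather than plain ls.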
Learning Goals
- Copy the gapminder_analysis.R file and the gapminder-FiveYearData.csv file from yesterday’s R lesson into the swc_workshop folder.
- Take a look at the gapminder_analysis.R file by typing less gapminder_analysis.R into the shell. Take a look at gapminder-FiveYearData.csv by using head -n 10 gapminder-FiveYearData.csv.
- Type git status. What does git tell us now?
- So far the changes are “untracked” with git. Now we need to tell git to keep track of the changes we have made - in other words, it’s time for the first commit!
- Type git add gapminder_analysis.R and git status. What happened?
- Type git add gapminder-FiveYearData.csv and git status. What happened?
Git now knows that it’s supposed to keep track of gapminder_analysis.R and gapminder-FiveYearData.csv, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:
git commit -m "Adding gapminder_analysis.R and gapminder-FiveYearData.csv files to repository"
When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is f22b25e. (Your commit will have a different unique identifier.)
We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.
Good commit messages start with a brief (<50 characters) summary of changes made in the commit. If you want to go into more detail, add a blank line between the summary line and your additional notes.
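As a sketch of that layout: each -m flag passed to git commit becomes its own paragraph, so a second -m produces a summary line, a blank line, and then the additional notes. The throwaway folder, the name and email, and the message text below are placeholders for illustration only.

```shell
# Work in a throwaway repository so no real project is touched
cd "$(mktemp -d)"
git init
# Placeholder identity, set for this repository only
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

# Something to commit (a stand-in for the real script)
echo "# Date: January 18th, 2017" > gapminder_analysis.R
git add gapminder_analysis.R

# First -m: brief summary. Second -m: longer notes.
# Git separates the two with a blank line.
git commit -m "Add dated header to analysis script" \
           -m "The header records when the analysis was last revised."

# Show the full message of the most recent commit
git log -1
```

For anything longer than a sentence or two, running git commit without -m and writing the message in the editor is usually more comfortable.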
If we run git status now:
git status
Git tells us everything is up to date. If we want to know what we’ve done recently, we can ask git to show us the project’s history using git log:
git log
git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.
Where Are My Changes? If we run ls at this point, we will still see only the two files we added, gapminder_analysis.R and gapminder-FiveYearData.csv. That’s because git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem doesn’t become cluttered (and so that we can’t accidentally edit or delete an old version).
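A minimal sketch of the first commit from start to finish, run in a throwaway folder. The two files here are one-line stand-ins created with echo rather than the real gapminder files, and the name and email are placeholders.

```shell
# Work in a throwaway repository
cd "$(mktemp -d)"
git init
# Placeholder identity, set for this repository only
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

# One-line stand-ins for the real workshop files
echo "gapminder <- read.csv('gapminder-FiveYearData.csv')" > gapminder_analysis.R
echo "country,year,pop,continent,lifeExp,gdpPercap" > gapminder-FiveYearData.csv

# Both files start out untracked
git status

# Stage both files, then record them together as one commit
git add gapminder_analysis.R gapminder-FiveYearData.csv
git commit -m "Adding gapminder_analysis.R and gapminder-FiveYearData.csv files to repository"

# Nothing is left to commit, and the history holds a single commit
git status
git log

# ls shows only our two files; ls -a also shows .git, where the history lives
ls
ls -a
```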
Now let’s add more information to the file. (Again, we’ll edit with nano and then cat the file to show its contents; you may use a different editor, and don’t need to cat.)
- Make a commented header to the file with some identifying information. For example, # Date: January 18th, 2017. Or comment some of the code.
- Save the changes to gapminder_analysis.R.
- Run cat gapminder_analysis.R to see the changes.
- Now run git status.
The last line is the key phrase: “no changes added to commit”. We have changed this file, but we haven’t told git we will want to save those changes (which we do with git add) nor have we saved them (which we do with git commit). So let’s do that now. It is good practice to always review our changes before saving them. We do this using git diff. This shows us the differences between the current state of the file and the most recently saved version:
git diff
The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. If we break it down into pieces:
The first line tells us that git is producing output similar to the Unix diff command comparing the old and new versions of the file.
The second line tells exactly which versions of the file git is comparing; df0654a and 315bf3a are unique computer-generated labels for those versions.
The third and fourth lines once again show the name of the file being changed.
The remaining lines are the most interesting: they show us the actual differences and the lines on which they occur. In particular, the + markers in the first column show where we have added lines.
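To see those pieces in a small case, the sketch below commits a one-line file, appends a comment, and runs git diff. The throwaway folder, file contents, and identity are placeholders; the labels in the index line will differ on your machine.

```shell
# Work in a throwaway repository
cd "$(mktemp -d)"
git init
# Placeholder identity, set for this repository only
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

# Commit a first version of the script (stand-in content)
echo "gapminder <- read.csv('gapminder-FiveYearData.csv')" > gapminder_analysis.R
git add gapminder_analysis.R
git commit -m "Add analysis script"

# Change the file without staging it
echo "# Date: January 18th, 2017" >> gapminder_analysis.R

# Reading the output:
#   line 1, "diff --git ..."      the old and new versions being compared
#   line 2, "index ..."           the labels of those two versions
#   lines 3-4, "---" and "+++"    the name of the file being changed
#   remaining lines               the change itself; + marks an added line
git diff
```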
After reviewing our change, it’s time to commit it:
git commit -m "Added comment on ______"
git status
Whoops: Git won’t commit because we didn’t use git add first. Let’s fix that:
git add gapminder_analysis.R
git commit -m "Added comment on ______"
git status
Git insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we’re adding a few citations to our supervisor’s work to our thesis. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we’re doing on the conclusion (which we haven’t finished yet).
To allow for this, Git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed.
If you think of Git as taking snapshots of changes over the life of a project:
git add specifies what will go in a snapshot (putting things in the staging area), and git commit then actually takes the snapshot and makes a permanent record of it (as a commit).
If you don’t have anything staged when you type git commit, git will prompt you to use git commit -a or git commit --all, which is kind of like gathering everyone for the picture! However, it’s almost always better to explicitly add things to the staging area, because otherwise you might commit changes you forgot you made. (Going back to snapshots, you might get the extra with incomplete makeup walking on the stage for the snapshot because you used -a!) Try to stage things manually, or you might find yourself searching for “git undo commit” more than you would like!
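A small sketch of why the staging area is useful: two files are changed, but only one is staged and committed. The file names thesis.txt and conclusion.txt are hypothetical, chosen to echo the thesis example above, and the identity is a placeholder.

```shell
# Work in a throwaway repository
cd "$(mktemp -d)"
git init
# Placeholder identity, set for this repository only
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

# Put two files under version control
echo "Chapter 1" > thesis.txt
echo "Draft conclusion" > conclusion.txt
git add thesis.txt conclusion.txt
git commit -m "Add thesis and conclusion drafts"

# Edit both files
echo "Citation: Supervisor et al." >> thesis.txt
echo "Unfinished thoughts" >> conclusion.txt

# Stage only the finished change
git add thesis.txt
git status

# The snapshot contains the change to thesis.txt only
git commit -m "Cite supervisor's work in thesis"

# conclusion.txt is still listed as modified and has not been committed
git status
```

Running git commit -a at the second commit would have swept the unfinished conclusion into the same snapshot.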
Let’s watch as our changes to a file move from our editor to the staging area and into long-term storage.
- Add ggplot(data = gapminder, aes(x = year, y = lifeExp, color = continent)) + geom_point() to the gapminder_analysis.R file for a plot with year on the x-axis and life expectancy on the y-axis.
git diff
So far, so good: we’ve added one line to the end of the file (shown with a + in the first column). Now let’s put that change in the staging area and see what git diff reports:
git add gapminder_analysis.R
git diff
There is no output: as far as git can tell, there’s no difference between what it’s been asked to save permanently and what’s currently in the directory. However, if we do this:
git diff --staged
it shows us the difference between the last committed change and what’s in the staging area. Let’s save our changes:
git commit -m "Added code to plot gapminder data with x = year and y = life expectancy."
Check the status:
git status
and look at the history of what we’ve done so far:
git log
To recap, when we want to add changes to our repository, we first need to add the changed files to the staging area (git add) and then commit the staged changes to the repository (git commit).
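The recap can be followed step by step in a throwaway repository. In this sketch the starting file content and the identity are placeholders; the point is where the change is visible at each stage: git diff shows changes that are not yet staged, and git diff --staged shows what the next commit will contain.

```shell
# Work in a throwaway repository
cd "$(mktemp -d)"
git init
# Placeholder identity, set for this repository only
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

# A first commit to build on (stand-in content)
echo "library(ggplot2)" > gapminder_analysis.R
git add gapminder_analysis.R
git commit -m "Add analysis script"

# 1. Edit: the change exists only in the working directory
echo "ggplot(data = gapminder, aes(x = year, y = lifeExp, color = continent)) + geom_point()" >> gapminder_analysis.R
git diff            # shows the new line

# 2. Stage: the change moves to the staging area
git add gapminder_analysis.R
git diff            # prints nothing
git diff --staged   # shows the new line

# 3. Commit: the change becomes part of the history
git commit -m "Plot life expectancy by year"
git diff --staged   # prints nothing again
git log --oneline
```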
Challenge Question 1: Choosing a Commit Message
Which of the following commit messages would be most appropriate for the last commit made to mars.txt?
1. “Changes”
2. “Added line ‘But the Mummy will appreciate the lack of humidity’ to mars.txt”
3. “Discuss effects of Mars’ climate on the Mummy”
Challenge Question 2: Committing Changes to Git
Which command(s) below would save the changes of myfile.txt to my local git repository?
1. $ git commit -m "my recent changes"
2. $ git init myfile.txt
$ git commit -m "my recent changes"
3. $ git add myfile.txt
$ git commit -m "my recent changes"
4. $ git commit -m myfile.txt "my recent changes"
Challenge 3: Create a project description
Create a new file called README.txt. Write a three-line description of the swc_workshop repository, commit your changes, then:
- modify one line
- add a line
- and display the differences between its updated state and its original state.
Learning Goals
Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.
Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, BitBucket or GitLab.
Let’s start by sharing the changes we’ve made to our current project with the world.
- Log in to GitHub.
- Click on “Repositories”, and then click on the icon in the top right corner to create a new repository called swc_workshop. (Note: this name should be exactly the same as the name of the folder on your computer.)
As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:
This effectively does the following on GitHub’s servers:
mkdir swc_workshop
cd swc_workshop
git init
Our local repository still contains our earlier work on gapminder_analysis.R, but the remote repository on GitHub doesn’t contain any files yet:
The next step is to connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it:
- Click on the ‘HTTPS’ link to change the protocol from SSH to HTTPS.
HTTPS vs SSH
We use HTTPS here because it does not require additional configuration. After the workshop you may want to set up SSH access, which is a bit more secure, by following the great tutorial from GitHub.
- Copy that HTTPS URL from the browser, then go into the local swc_workshop repository.
- Run this command:
git remote add origin https://github.com/marschmi/swc_workshop.git
Make sure to use the URL for your repository rather than marschmi’s.
We can check that the command has worked by running git remote -v:
git remote -v
The name origin is a local nickname for your remote repository: we could use something else if we wanted to, but origin is by far the most common choice.
Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:
git push origin master
Proxy
If the network you are connected to uses a proxy, there is a chance that your last command failed with “Could not resolve hostname” as the error message. To solve this issue you need to tell Git about the proxy:
git config --global http.proxy http://user:password@proxy.url
git config --global https.proxy http://user:password@proxy.url
When you connect to another network that doesn’t use a proxy you will need to tell Git to disable the proxy using:
git config --global --unset http.proxy
git config --global --unset https.proxy
Password Managers
If your operating system has a password manager configured, git push will try to use it when it needs your username and password. If you want to type your username and password at the terminal instead of using a password manager, type:
unset SSH_ASKPASS
You may want to add this command at the end of your ~/.bashrc to make it the default behavior.
Our local and remote repositories are now in this state:
We can pull changes from the remote repository to the local one as well:
git pull origin master
Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.
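The push and pull steps can be rehearsed without a network connection. In this sketch a bare repository in a temporary folder stands in for the GitHub repository; that substitution, the folder names, the file content, and the identity are all assumptions made for illustration. With GitHub, the path given to git remote add would be your repository’s HTTPS URL instead.

```shell
# A temporary workspace holding both the stand-in remote and the local repository
workdir="$(mktemp -d)"

# Stand-in for the empty repository created on GitHub
git init --bare "$workdir/remote.git"

# The local repository with one commit
mkdir "$workdir/swc_workshop"
cd "$workdir/swc_workshop"
git init
# Placeholder identity, set for this repository only
git config user.name "Workshop Learner"
git config user.email "learner@example.com"
echo "# Date: January 18th, 2017" > gapminder_analysis.R
git add gapminder_analysis.R
git commit -m "Add analysis script"
# Make sure the branch is called master, as in the lesson
git branch -M master

# Connect the two repositories and check the nickname
git remote add origin "$workdir/remote.git"
git remote -v

# Send the local commits to the remote, then ask it for anything new
git push origin master
git pull origin master
```

After the push, both repositories hold the same commit, so the pull has nothing to download.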
Challenge 1: Remote vs Local Repositories
1. Define remote repository.
2. Define local repository.
3. What is the difference between remote and local repositories?
Challenge 2: git push vs git pull
1. What happened when you performed git push?
2. What happened when you ran git pull?
3. How are git push and git pull different?
Challenge 3: GitHub Timestamp
Create a repository on GitHub, clone it, add a file, push those changes to GitHub, and then look at the timestamp of the change on GitHub. How does GitHub record times, and why?
In this lesson we will navigate to http://swcarpentry.github.io/git-novice/05-history/
- Version control is better than mailing files back and forth! (It is also better than Dropbox.)
- Do you want to be your (or your advisor’s/collaborator’s) friend in 6 months? A year? 2 years?
A helpful resource describing version control with git basics from Software Carpentry can be found here.
- git config
- git status
- git add
- git commit
- git log
- git diff: Compares the current (non-staged) version of a file with an older version.
- git diff HEAD~2 filename.R: Compares the current version of filename.R with the version from 2 commits ago.
- git push: Push changes from the local repository up to the remote repository.
- git pull: Pull changes from the remote repository down to the local repository.
When adding a local repository to a new remote repo:
- git remote add origin ___(url)___
- git remote -v: to check that the URL is correct
- git push -u origin master
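To make the HEAD~2 notation in the git diff entry concrete, the sketch below builds a three-commit history in a throwaway folder and compares the current file with the version from two commits back. The file contents and identity are placeholders.

```shell
# Work in a throwaway repository
cd "$(mktemp -d)"
git init
# Placeholder identity, set for this repository only
git config user.name "Workshop Learner"
git config user.email "learner@example.com"

# Three commits, each adding one line to the file
echo "line from commit 1" > filename.R
git add filename.R
git commit -m "First version"

echo "line from commit 2" >> filename.R
git add filename.R
git commit -m "Second version"

echo "line from commit 3" >> filename.R
git add filename.R
git commit -m "Third version"

# HEAD is the latest commit; HEAD~2 is two commits before it (the first one here).
# The diff lists the two lines added since that commit.
git diff HEAD~2 filename.R
```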