Warm up discussion:
1. How do you currently keep track of versions of the same document?
2. When working with a collaborator simultaneously, how do you keep track of versions of the same document?
The main page for the core lessons from the Software Carpentry Foundation can be found at http://software-carpentry.org/lessons/
The lesson here is based on Software Carpentry’s core curriculum on git entitled “Version Control with Git” and is maintained by Ivan Gonzalez and Daisie Huang.
The main lesson link can be found at http://swcarpentry.github.io//git-novice/
You are working on a project with your advisor/PI/post-doc/staff member/PhD, Masters, or undergraduate student and have a deadline for a paper quickly approaching. To get the paper where it needs to be for publication you and your collaborator must work on the document simultaneously. You’re quickly passing documents back and forth and trying to keep up with each other to meet the deadline. In the midst of all of the changes to the document, something is lost. Which version of the document was it in?
In this moment you are reminded of the Software Carpentry Workshop you attended in January 2016. It was a busy time, between the workshop and getting ready for the semester. However, there is one thing you remember learning: version control is better than mailing files back and forth.
But why?
Version control is a tool for managing changes to a set of files. Each set of changes creates a new commit of the files; the version control system allows users to recover old commits reliably, and helps manage conflicting changes made by different users.
To commit is to record the current state of a set of files (a group of changes) in a version control repository. As a noun, a commit is the result of committing, i.e., a recorded group of changes in a repository. If a commit contains changes to multiple files, all of the changes are recorded together.
A version control repository is a storage area where a version control system stores the full history of commits of a project and information about who changed what, when.
Multiple versions of a document can be merged into one.
Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).
Do you want to be your own friend in a year?
Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.
Examples from my own research: https://github.com/marschmi
Go to the original Software Carpentry lesson on automated version control: http://swcarpentry.github.io/git-novice/01-basics.html
A note on bash and why it is important: it’s like an air traffic control tower, which allows airlines from all over the globe to work together to get people around. In this case, bash talks directly to your computer through Unix shell commands to run many programs. If you need to run multiple programs, an efficient way to do it automatically is to use your bash shell. Here we will use the bash shell to run nano, git, Unix commands, and R together!
Open up your bash shell. What happens when you run:
1. git config --list
2. git config
Now, let’s set up Git by going to http://swcarpentry.github.io/git-novice/02-setup.html
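As a minimal sketch of what the setup step involves, the commands below record a name and an email address and then read them back. The name, the email address, and the scratch repository are placeholders assumed for illustration. The sketch stores the settings at the repository level so that it leaves your real configuration alone, whereas the setup lesson stores them for all of your projects with the --global flag.

```shell
# Scratch repository so the sketch does not touch your real settings.
scratch=$(mktemp -d)
cd "$scratch"
git init

# Placeholder identity (hypothetical values); use your own name and email.
git config user.name "Example Learner"
git config user.email "learner@example.org"

# Read single values back, then list every setting Git can see from here.
git config user.name
git config user.email
git config --list
```

The scratch folder can be deleted once you are done experimenting.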
Learning Goals
A local repository means that we are creating a repository on our own computer. Our computer is local: it especially likes attending the weekly farmers market and supporting local businesses.
- In bash, navigate to the Desktop -> SWC -> SWC_R folder from yesterday afternoon’s R lesson.
- In the SWC_R folder, type ls -a. The -a flag shows us the hidden items within the directory.
- What files do you see?
- Now, initialize the repository by typing git init.
- What files do you see now that SWC_R is a repository under version control?
- To check that everything is set up correctly, ask Git to tell us the status of our project by typing git status.
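The steps above can be tried safely in a throwaway folder first. This is a sketch under assumptions: the temporary folder stands in for SWC_R, and the one-line variables.R is placeholder content, not the file from the R lesson.

```shell
# Scratch folder standing in for SWC_R (hypothetical location).
project=$(mktemp -d)
cd "$project"
echo "x <- 5" > variables.R   # placeholder file contents

ls -a        # before: ., .., and variables.R
git init
ls -a        # after: a hidden .git directory has appeared
git status   # variables.R is listed as untracked
```

The .git directory is where Git keeps the repository's history, so deleting it would discard that history.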
Learning Goals
Let’s look at variables.R by typing cat variables.R into the bash shell. What happens? What command does this remind you of?
git status
- Open up the variables.R file with nano or notepad.
- Add a commented header to the file. For example,
# This is the document where I learn introductory R.
- Save the changes.
git status
- So far the changes are “untracked” with git. Now we need to tell git to keep track of the changes we have made - in other words, it’s time for the first commit!
git add variables.R
git status
- What happened?
Git now knows that it’s supposed to keep track of variables.R, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:
git commit -m "Adding commented header"
When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is f22b25e. (Your commit may have another identifier.)
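A minimal sketch of the add-then-commit sequence in a scratch repository, ending with a command that prints the short identifier of the commit just made. The temporary folder and the name and email address are placeholders assumed for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"

echo "# This is the document where I learn introductory R." > variables.R

git add variables.R                       # stage the file
git commit -m "Adding commented header"   # record the staged snapshot

# Short identifier of the commit just made; yours will differ from f22b25e.
git rev-parse --short HEAD
git status   # nothing left to commit
```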
We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.
Good commit messages start with a brief (<50 characters) summary of changes made in the commit. If you want to go into more detail, add a blank line between the summary line and your additional notes.
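One way to write a summary line plus additional notes without opening an editor is to pass -m twice: Git joins the values as separate paragraphs, which supplies the blank line. The sketch below assumes a scratch repository, and the file contents and message text are made up for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"
echo "x <- 5" > variables.R
git add variables.R

# Each -m becomes its own paragraph, so Git inserts the blank line for us.
git commit -m "Add starting value for x" \
           -m "x is a placeholder used by the later examples in this file."

git log -1 --format=%s   # summary line only
git log -1 --format=%b   # the additional notes
```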
If we run git status now:
git status
it tells us everything is up to date. If we want to know what we’ve done recently, we can ask Git to show us the project’s history using git log:
git log
git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.
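To see the ordering for yourself, the sketch below makes two commits in a scratch repository and then prints the history in two forms. The file contents and commit messages are placeholders assumed for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"

echo "x <- 5" > variables.R
git add variables.R
git commit -m "Add x"

echo "y <- 10" >> variables.R
git add variables.R
git commit -m "Add y"

git log             # full identifier, author, date, and message per commit
git log --oneline   # one line per commit: short identifier and summary
```

In both listings the newest commit ("Add y") comes first.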
Where Are My Changes?
If we run ls at this point, we will still see just one file called variables.R. That’s because Git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem doesn’t become cluttered (and so that we can’t accidentally edit or delete an old version).
Now let’s add more information to the file. (Again, we’ll edit with nano and then cat the file to show its contents; you may use a different editor, and don’t need to cat.)
- Using nano/notepad, comment some more of the code in variables.R and save the file.
- Run cat variables.R to see the changes.
- Now run git status.
The last line is the key phrase: “no changes added to commit”.
We have changed this file, but we haven’t told Git we will want to save those changes (which we do with git add) nor have we saved them (which we do with git commit). So let’s do that now. It is good practice to always review our changes before saving them. We do this using git diff. This shows us the differences between the current state of the file and the most recently saved version:
git diff
The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. If we break it down into pieces:
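1. The first line tells us that Git is producing output similar to the Unix diff command, comparing the old and new versions of the file.
2. The second line tells exactly which versions of the file Git is comparing; df0654a and 315bf3a are unique computer-generated labels for those versions.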
3. The third and fourth lines once again show the name of the file being changed.
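The pieces of the output can be matched up against a small diff you generate yourself. This is a sketch in a scratch repository with placeholder file contents; the labels on the index line will differ from df0654a and 315bf3a.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"

printf 'x <- 5\n' > variables.R
git add variables.R
git commit -m "Add x"

printf '# starting value\n' >> variables.R   # an uncommitted change
git diff

# Reading the output from the top:
#   diff --git a/variables.R b/variables.R   the two versions being compared
#   index OLD..NEW MODE                      labels for those two versions
#   --- a/variables.R                        the old version of the file
#   +++ b/variables.R                        the new version of the file
#   @@ ... @@ and the lines after it         the change; "+" marks an added line
```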
After reviewing our change, it’s time to commit it:
git commit -m "Added comment on ______"
git status
Whoops: Git won’t commit because we didn’t use git add first. Let’s fix that:
git add variables.R
git commit -m "Added comment on ______"
git status
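The refused commit and its fix can be reproduced in a scratch repository. In this sketch the file contents and messages are placeholders; the point is that git commit reports failure when nothing is staged and succeeds once git add has run.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"

echo "x <- 5" > variables.R
git add variables.R
git commit -m "Add x"

echo "# a new comment" >> variables.R

# Nothing is staged yet, so Git refuses and makes no new commit.
if ! git commit -m "Added comment"; then
    echo "commit refused: nothing staged"
fi

git add variables.R
git commit -m "Added comment"   # succeeds now that the change is staged
git status
```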
Git insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we’re adding a few citations to our supervisor’s work to our thesis. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we’re doing on the conclusion (which we haven’t finished yet).
To allow for this, Git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed.
Staging area
If you think of Git as taking snapshots of changes over the life of a project, git add specifies what will go in a snapshot (putting things in the staging area), and git commit then actually takes the snapshot, and makes a permanent record of it (as a commit). If you don’t have anything staged when you type git commit, Git will prompt you to use git commit -a or git commit --all, which is kind of like gathering everyone for the picture! However, it’s almost always better to explicitly add things to the staging area, because you might commit changes you forgot you made. (Going back to snapshots, you might get the extra with incomplete makeup walking on the stage for the snapshot because you used -a!) Try to stage things manually, or you might find yourself searching for “git undo commit” more than you would like!
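Here is a small sketch of staging one change while leaving another out of the snapshot, loosely following the thesis example. The file names bibliography.txt and conclusion.txt and their contents are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"

echo "draft" > conclusion.txt
echo "Smith 2010" > bibliography.txt
git add conclusion.txt bibliography.txt
git commit -m "Start thesis files"

echo "Jones 2015" >> bibliography.txt            # finished change
echo "half-written sentence" >> conclusion.txt   # unfinished change

git add bibliography.txt                 # only this file goes in the snapshot
git commit -m "Add Jones 2015 to bibliography"

git status --short   # conclusion.txt is still modified and uncommitted
```

Running git commit -a at the last step instead would have swept the unfinished conclusion into the commit as well.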
Let’s watch as our changes to a file move from our editor to the staging area and into long-term storage.
- Add dist10 <- rnorm(10) at the bottom of the variables.R file to draw 10 random numbers from a normal distribution.
git diff
So far, so good: we’ve added one line to the end of the file (shown with a + in the first column). Now let’s put that change in the staging area and see what git diff reports:
git add variables.R
git diff
There is no output: as far as Git can tell, there’s no difference between what it’s been asked to save permanently and what’s currently in the directory. However, if we do this:
git diff --staged
it shows us the difference between the last committed change and what’s in the staging area. Let’s save our changes:
git commit -m "Created dist10, a random normal distribution of 10 numbers."
Check the status:
git status
and look at the history of what we’ve done so far:
git log
To recap, when we want to add changes to our repository, we first need to add the changed files to the staging area (git add) and then commit the staged changes to the repository (git commit):
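The whole round trip, from editor to staging area to repository, can be sketched in a scratch repository. The starting contents of variables.R are a placeholder; the added line and the commit message are the ones used in this lesson.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"

echo "x <- 5" > variables.R
git add variables.R
git commit -m "Add x"

echo "dist10 <- rnorm(10)" >> variables.R

git diff            # shows the new line: working copy vs. staging area
git add variables.R
git diff            # no output: working copy and staging area now match
git diff --staged   # shows the new line: staging area vs. last commit
git commit -m "Created dist10, a random normal distribution of 10 numbers."
git diff --staged   # no output: everything staged has been committed
```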
Challenge Question 1: Committing Changes to Git
Which command(s) below would save the changes of myfile.txt to my local Git repository?
1. git commit -m "my recent changes"
2. git init myfile.txt
   git commit -m "my recent changes"
3. git add myfile.txt
   git commit -m "my recent changes"
4. git commit -m myfile.txt "my recent changes"
Challenge 2: Create a project description
Create a new file called README.txt. Write a three-line description of the SWC_R repository, commit your changes, then modify one line, add a line, and display the differences between its updated state and its original state.
Learning Goals
Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.
Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, BitBucket or GitLab.
Let’s start by sharing the changes we’ve made to our current project with the world.
- Log in to GitHub.
- Click on the icon in the top right corner to create a new repository called SWC_R.
As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:
This effectively does the following on GitHub’s servers:
mkdir SWC_R
cd SWC_R
git init
Our local repository still contains our earlier work on variables.R, but the remote repository on GitHub doesn’t contain any files yet:
The next step is to connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it:
- Click on the ‘HTTPS’ link to change the protocol from SSH to HTTPS.
HTTPS vs SSH
We use HTTPS here because it does not require additional configuration. After the workshop you may want to set up SSH access, which is a bit more secure, by following the great tutorial from GitHub.
- Copy that HTTPS URL from the browser, then go into the local SWC_R repository.
- Run this command:
git remote add origin https://github.com/marschmi/SWC_R.git
Make sure to use the URL for your repository rather than marschmi’s.
We can check that the command has worked by running git remote -v:
git remote -v
The name origin is a local nickname for your remote repository: we could use something else if we wanted to, but origin is by far the most common choice.
Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:
git push origin master
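To practice connecting and pushing without a network connection, the sketch below uses a bare repository in a temporary folder as a stand-in for the empty repository on GitHub. That stand-in, the placeholder identity, and the file contents are assumptions made for illustration; in the lesson the remote is your GitHub HTTPS URL.

```shell
# A local bare repository stands in for the empty repository on GitHub.
remote=$(mktemp -d)
git init --bare "$remote"

repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"
echo "x <- 5" > variables.R
git add variables.R
git commit -m "Add variables.R"
git branch -M master   # make sure the branch is named master, as in the lesson

git remote add origin "$remote"
git remote -v            # lists origin twice: once for fetch, once for push
git push origin master   # copy the local commits to the remote
```

After the push, the remote's master branch points at the same commit as the local one.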
Proxy
If the network you are connected to uses a proxy there is a chance that your last command failed with “Could not resolve hostname” as the error message. To solve this issue you need to tell Git about the proxy:
git config --global http.proxy http://user:password@proxy.url
git config --global https.proxy http://user:password@proxy.url
When you connect to another network that doesn’t use a proxy you will need to tell Git to disable the proxy using:
git config --global --unset http.proxy
git config --global --unset https.proxy
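A minimal sketch of checking that a proxy setting was stored and later removed. The proxy address is a made-up placeholder, and the sketch uses --local in a scratch repository (an assumption made here so that nothing in your global configuration changes) instead of the --global flag shown above.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init

# Hypothetical proxy address; replace it with your network's details.
git config --local http.proxy http://proxy.example.org:8080
git config --local --get http.proxy   # prints the stored address

git config --local --unset http.proxy
# --get exits with an error once the setting is gone.
git config --local --get http.proxy || echo "no proxy configured"
```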
Password Managers
If your operating system has a password manager configured, git push will try to use it when it needs your username and password. If you want to type your username and password at the terminal instead of using a password manager, type:
unset SSH_ASKPASS
You may want to add this command at the end of your ~/.bashrc to make it the default behavior.
Our local and remote repositories are now in this state:
We can pull changes from the remote repository to the local one as well:
git pull origin master
Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.
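The case where someone else has pushed changes can be simulated with two working copies that share one remote. In this sketch a local bare repository stands in for GitHub and a second temporary folder plays the collaborator; both, along with the names and file contents, are assumptions made for illustration.

```shell
# Local bare repository standing in for the repository on GitHub.
remote=$(mktemp -d)
git init --bare "$remote"

# Our working copy: make one commit and push it.
ours=$(mktemp -d)
cd "$ours"
git init
git config user.name "Example Learner"        # placeholder identity
git config user.email "learner@example.org"
echo "x <- 5" > variables.R
git add variables.R
git commit -m "Add variables.R"
git branch -M master
git remote add origin "$remote"
git push origin master

# A (hypothetical) collaborator clones, makes a change, and pushes it.
theirs=$(mktemp -d)
git clone --branch master "$remote" "$theirs"
cd "$theirs"
git config user.name "Example Collaborator"
git config user.email "collaborator@example.org"
echo "y <- 10" >> variables.R
git add variables.R
git commit -m "Add y"
git push origin master

# Back in our copy, pull downloads the collaborator's commit.
cd "$ours"
git pull origin master
cat variables.R   # now contains both lines
```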
Challenge 1: Remote vs Local Repositories
1. Define remote repository.
2. Define local repository.
3. What is the difference between remote and local repositories?
Challenge 2: git push vs git pull
1. What happened when you performed git push?
2. What happened when you ran git pull?
3. How are git push and git pull different?
Challenge 3: GitHub Timestamp
Create a repository on GitHub, clone it, add a file, push those changes to GitHub, and then look at the timestamp of the change on GitHub. How does GitHub record times, and why?
- Version control is better than mailing files back and forth! (It is also better than Dropbox.)
- Do you want to be your (or your advisor’s/collaborator’s) friend in 6 months? A year? 2 years?
A helpful resource describing version control with git basics from Software Carpentry can be found here.
git config
git status
git add
git commit
git log
git diff
git push
git pull