Version Control For Data Science Projects

Part 1: Introduction to VCSs and Git

Introduction

This blog post explains version control systems (VCSs), the Git version control tool, its workflow, and the most powerful commands used in collaborative projects.

Getting Started

We begin with a brief background on version control tools, then look at how to get Git up and running on your computer and set it up for first use. We will also see why Git exists, why you should use it, and how it fits into version control for any data science project, especially in RStudio.
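As a minimal sketch of that first-time setup, assuming Git is already installed, the commands below check the installation and tell Git who you are. The name and email are placeholders, and the throwaway HOME is only there so the sketch does not touch your real settings.

```shell
# Use a throwaway HOME so this sketch does not touch your real settings;
# on your own machine, skip this line.
export HOME="$(mktemp -d)"

# Confirm that Git is installed
git --version

# Tell Git who you are (placeholder values); this is recorded in every commit
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# Read the values back to confirm they were saved
git config --global --get user.name
git config --global --get user.email
```

You only need to do this once per computer; every repository you create afterwards picks up these values.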

  • What is a Version Control System?

Version control is a system that tracks and records changes made to a file or set of files over time so that you can easily recall specific versions later. It is usually used to coordinate work among programmers collaboratively developing source code, and it also supports virtual collaboration on data science projects. Git, the tool this post focuses on, was designed with speed, data integrity, and support for distributed, non-linear workflows as its goals.

To manage a version, each change (addition, modification, or removal) to the files in a project must be tracked. Version control records each change made to a file (or a group of files) and offers a way to undo or roll back each change.

For effective version control, you use tools called Version Control Systems. They help you navigate between changes and quickly let you go back to a previous version when something isn’t right.

  • Importance of Version Control Systems in Data Science Projects

One of the most important advantages of using version control is teamwork. When more than one person contributes to a project, tracking changes becomes complicated, and the probability of overwriting another person’s changes increases greatly. With version control, multiple people can each work on their own copy of the project (called a branch) and only merge (link up) those changes into the main project when they (or the other team members) are satisfied with the work done.

  • What are the Choices?

There are many kinds of Version Control Systems (VCS), each with its own pros and cons. A VCS can be local, centralized, or distributed. This blog post uses Git because it is fast and works well with large projects. The Git community is very active, and there are many contributors involved in its development. For more details, visit here.

Why Git?

  • First, it works great for tracking changes. It lets you:

    • Go back and forth between project versions
    • Review the differences between those versions
    • Check the change history of a file
    • Tag a specific version for quick referencing
    • Develop non-linearly, with strong support for thousands of parallel branches
    • Handle large projects like the Linux kernel efficiently (speed and data size)
  • One of the main features of Git is its branching system. A branch works like a copy of the project on which you can work independently without messing with the entire project repository. With Git, you can easily and efficiently:

    • Exchange “changesets” between repositories
    • Review the changes made by other project collaborators

Furthermore, Git branching comes with merging, which is the act of copying the changesets made in an individual branch back into the main project source. Simply put, you create a branch to test a new feature and merge that branch back when you are satisfied with the work done.
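To make the tracking features listed earlier more concrete, here is a small sketch in a throwaway repository. The file name analysis.R, the tag name v1.0, and the demo identity are made up for illustration.

```shell
# Throwaway repository so nothing real is touched
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"

# Version 1 of a hypothetical analysis script
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Add first version of analysis"
git tag v1.0                                # tag this version for quick referencing

# Version 2 of the same script
echo 'y <- 2' >> analysis.R
git commit -q -am "Add second variable"

git log --oneline -- analysis.R             # change history of the file
git diff v1.0 HEAD -- analysis.R            # differences between the two versions
git show v1.0:analysis.R                    # the file as it was at the tagged version
```

The tag acts as a readable name for a version, so you do not have to remember the commit's checksum to compare against it or look at it again.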

Git’s Powerful Features and Commands

Before listing the Git commands and what they do, let us look at another extremely powerful Git feature that supports teamwork: stashing. Stashing is the act of safely putting away your current edits in order to have a clean environment to work on something completely different. For example, you might want to use stashing when you are testing a feature but urgently need to work on another, higher-priority feature. So, you stash your changes away and begin to write that priority feature. After you complete the task, you can get your changes back and apply them to your current working environment.
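The stashing scenario can be sketched as follows in a throwaway repository. The file names and the demo identity are placeholders chosen for illustration.

```shell
# Throwaway repository with one committed file
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"
echo 'step 1' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# A half-finished edit for the feature you were testing
echo 'experimental step' >> notes.txt

git stash                  # put the edit away; the working directory is clean again
git status --short         # prints nothing

# Work on the priority feature and commit it
echo 'urgent fix' > fix.txt
git add fix.txt
git commit -q -m "Add urgent fix"

git stash pop              # bring the stashed edit back
git status --short         # notes.txt shows as modified again
```

Here git stash pop both re-applies the edit and removes it from the stash; git stash apply would re-apply it while keeping a copy in the stash.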

Cool! Below are some powerful Git commands and their respective tasks:

$ git init # Initialize a new git database/repository
$ git clone # Copy an existing database/repository
$ git status # Check the status of the local project within database/repository
$ git diff # Review the changes done to the project
$ git add # Stage a new or changed file for the next snapshot
$ git commit # Save the current state of the project to database/repo
$ git push # Send local commits to a remote server
$ git pull # Fetch changes from a remote database/repo and merge them into the local one
$ git log # Check the history of the project
$ git branch # List, create or delete branches
$ git merge # Merge the history of two branches together
$ git stash # Keep the current changes stashed away to be used later

How does Git Work?

Unlike many other Version Control Systems, Git works with snapshots, not differences. This means that it does not track the difference between two versions of a file or project, but takes a picture of the current state of the file or project. This makes Git very fast compared to other VCSs; it is also why switching between versions and branches is so fast and easy.

How does Git know which files changed? When Git takes a snapshot, it performs a checksum on it, so it knows which files were changed by comparing the checksums. This is why Git can track changes between files and directories easily, and it also lets Git check for any file corruption.
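The checksum idea can be seen directly with git hash-object, which prints the checksum Git would use to store a file’s content. This is a small sketch with made-up file names and contents: identical content gives identical checksums, and a one-character change gives a different one.

```shell
work="$(mktemp -d)"
cd "$work"

printf 'alpha = 0.05\n' > a.txt
printf 'alpha = 0.05\n' > b.txt      # same content, different file name
printf 'alpha = 0.01\n' > c.txt      # one character changed

# Compute the checksum of each file's content
hash_a="$(git hash-object a.txt)"
hash_b="$(git hash-object b.txt)"
hash_c="$(git hash-object c.txt)"

echo "a.txt $hash_a"
echo "b.txt $hash_b"
echo "c.txt $hash_c"

[ "$hash_a" = "$hash_b" ] && echo "a.txt and b.txt have identical content"
[ "$hash_a" != "$hash_c" ] && echo "c.txt differs from a.txt"
```

Because the checksum depends only on the content, comparing two checksums is enough for Git to tell whether a file changed between snapshots.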

Git has a three-state system made up of the working directory, the staging area, and the Git directory:

  • The working directory is just the current snapshot that you are working on.
  • The staging area is where modified files are marked in their current version, ready to be stored in the database or repo.
  • The git directory is the database where the history is stored.

Simply, Git works as follows:

  • You modify the files, then add each file you want to include in the snapshot to the staging area (git add).
  • You then take the snapshot and add it to the database (git commit). In Git terminology, a modified file added to the staging area is “staged” and a file added to the database is “committed.” So, a file goes from “modified” to “staged” to “committed.”
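The path from “modified” to “staged” to “committed” can be followed with git status in a throwaway repository. This is only a sketch; the file content and the demo identity are placeholders.

```shell
# Throwaway repository with one committed file
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"
echo 'Project description' > README.md
git add README.md
git commit -q -m "Add README"

# 1. Modified: the change exists only in the working directory
echo 'Contributors: Alexa' >> README.md
git status --short         # " M README.md" (M in the second column)

# 2. Staged: the change is marked for the next snapshot
git add README.md
git status --short         # "M  README.md" (M in the first column)

# 3. Committed: the snapshot is stored in the Git directory (.git)
git commit -q -m "Add Alexa to contributors"
git status --short         # prints nothing: the working directory is clean
```

In the short status format, the first column describes the staging area and the second column describes the working directory, which is why the M moves from right to left after git add.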

What is the Git Workflow?

Git workflow

Scenario: You join a collaborative project and are tasked with adding your name to an existing project description file. Since this is your first day, a project lead is there to review your code. The first thing you must do is get the project’s source code from the server that houses it. Most project collaborators use a GitHub server to house the project source code. This means that the Git database is stored on a remote server hosted by GitHub, and you can access it by URL or directly on the GitHub web site. Here, we are going to use the git clone command to get the database or repo locally. Alternatively, you could just download the project from the GitHub web site as a zip file, but that file contains only the project files, without the Git history.

$ git clone https://github.com/mgisa/thebestwebsite.git

git clone then downloads a copy of the repository into a new directory inside your current directory. After that, you can enter the new directory, check its contents, and use the log command to show the history of recent changes made to the project.

$ git log

Wow! Right away, you should create a new branch to work on so that you don’t mess up the entire project. You can create a new branch by using the branch command and switch to it with the checkout command.

$ git branch add-new-dev-name-to-readme
$ git checkout add-new-dev-name-to-readme
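As a side note, the two steps can be combined: git checkout -b creates a branch and switches to it in one command (newer Git versions, 2.23 and later, also offer git switch -c for the same purpose). A small sketch in a throwaway repository, with a placeholder identity:

```shell
# Throwaway repository with one commit to branch from
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"
echo 'Project description' > README.md
git add README.md
git commit -q -m "Add README"

# Create the branch and switch to it in one step
git checkout -b add-new-dev-name-to-readme

# Confirm which branch is checked out; the asterisk marks the current one
git branch
```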

Now that a new branch is created, you can start to modify the files. You can use whatever editor you want; Git will track all the changes via checksums. Once you have made the necessary changes, it is time to put them in the staging area. As a reminder, the staging area is where you put modified files that are ready to be snapshotted by Git. If we modified the “README.md” file, we can add it to the staging area by using the add command.

$ git add README.md

You don’t need to add every file you modified to the staging area, only those you want included in the snapshot. Now that the file is staged, it is time to commit it, that is, to put its changes in the database or repo. We do this by using the commit command and attaching a short description to it.

$ git commit -m "Add Alexa to the list of data scientists"

The changes you made are now safely stored in the database or repo, but only on your local computer. The others on your team can’t see your work because you worked in your own repository and on a different branch. To show your work to the team, you have to push your commits to the remote server. But you have to show the code to the project lead before making a push. If they are okay with it, you can merge your branch with the main snapshot of the project (called the master or main branch). So first you must navigate back to that branch by using the checkout command.

$ git checkout main # use master instead if that is the project's main branch

You are now on the master or main branch, where all the team’s work is housed. But while you worked on your fix, the project may have changed, meaning that a team member may have changed some files. You should retrieve those changes before merging your own changes into the main branch. This will limit the risk of “conflicts,” which can happen when two or more contributors change the same file. To get the changes, you have to pull the project from the remote server (also called origin).

$ git pull origin main # use master instead if that is the project's main branch

Even if another team member changed the same file as you did, the risk of conflicts is low. Conflicts generally arise only when the same part of a file has been modified by multiple people on the team. If you and your team member changed different parts of the file, then no conflict arises. Now it’s time to bring your version into the main branch. You can merge your branch with the merge command.
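To see the no-conflict case for yourself, here is a sketch in which two branches edit different parts of the same file and Git combines both edits automatically. The file team.txt, the branch my-change, and the demo identity are made up for illustration.

```shell
# Throwaway repository
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"
main_branch="$(git symbolic-ref --short HEAD)"   # main or master, depending on your setup

# A ten-line file shared by both contributors
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > team.txt
git add team.txt
git commit -q -m "Add team file"

# You edit the last line on your own branch
git checkout -q -b my-change
sed '10s/.*/line 10 edited by me/' team.txt > tmp.txt && mv tmp.txt team.txt
git commit -q -am "Edit last line"

# Meanwhile, a teammate edits the first line on the main branch
git checkout -q "$main_branch"
sed '1s/.*/line 1 edited by a teammate/' team.txt > tmp.txt && mv tmp.txt team.txt
git commit -q -am "Edit first line"

# Different parts of the same file: Git combines both edits without a conflict
git merge --no-edit my-change
cat team.txt
```

If both branches had edited the same line instead, git merge would stop and ask you to resolve the conflict by hand before committing.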

$ git merge add-new-dev-name-to-readme

Wow! The commit has been merged back into the main branch, so it is time to push the changes to the remote server. We do that by using the push command.

$ git push

It’s that simple! And again, don’t worry: you will get to know and understand all the Git commands as you practice regularly by working on collaborative data science projects.
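If you want to rehearse the whole workflow without touching a real project, the sketch below replays it entirely on your own machine. It assumes a bare repository on disk (server.git) as a stand-in for the GitHub server; the directory names and identities are placeholders.

```shell
work="$(mktemp -d)"
cd "$work"

# Stand-in for the GitHub server: a bare repository on disk
git init -q --bare server.git

# Setup only: the project lead seeds the project with a first commit
git clone -q server.git lead
cd lead
git config user.name "Project Lead"        # placeholder identity
git config user.email "lead@example.com"
echo 'Data scientists:' > README.md
git add README.md
git commit -q -m "Add project description"
git push -q origin HEAD
cd ..

# Your first day: clone the project and create a branch
git clone -q server.git my-copy
cd my-copy
git config user.name "Alexa"               # placeholder identity
git config user.email "alexa@example.com"
main_branch="$(git symbolic-ref --short HEAD)"   # main or master, depending on your setup
git checkout -q -b add-new-dev-name-to-readme

# Modify, stage, and commit
echo '- Alexa' >> README.md
git add README.md
git commit -q -m "Add Alexa to the list of data scientists"

# After the review: go back to the main branch, pull, merge, and push
git checkout -q "$main_branch"
git pull -q origin "$main_branch"
git merge add-new-dev-name-to-readme
git push -q origin "$main_branch"

# The stand-in server now holds your commit
git -C "$work/server.git" log --oneline
```

Because the main branch did not change while you worked, the merge here is a simple fast-forward; on a busy project, the pull step is where your teammates’ commits would arrive.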

Summary

This was only a small introduction to Version Control Systems, the Git workflow, and the powerful Git features that you will learn more about throughout your data science career. On your journey, here are some questions to ask yourself before moving forward:

  • How will Git help me in my projects?
  • Which features are the most important?
  • Will Git improve my workflow?

Cool! Hope you enjoyed the blog.


References

  • Mariot Tsitoara (2020), Beginning Git and GitHub: A Comprehensive Guide to Version Control, Project Management, and Teamwork for the New Developer, Antananarivo, Madagascar, https://doi.org/10.1007/978-1-4842-5313-7