class: center, middle, title-slide

.title[
# Reproducible Research and Version Control
]
.subtitle[
## JSC 370: Data Science II
]
.date[
### January 13, 2025
]

---

## Repeatability vs Reproducibility vs Replicability

These terms are often used interchangeably, but they are different.

Repeatability: Generating the exact same results when using the same data by the same person.

Reproducibility: Generating the exact same results when using the same data by a different person or group.

If we can't reproduce a study, how can we replicate it?

Replicability: Repeating a study by independently performing another study on new data.

---

## Repeatability vs Reproducibility vs Replicability

<img src="data:image/png;base64,#reproduce-pyramid.png" width="85%" style="display: block; margin: auto;" />

---

## Reproducibility

A different analyst/researcher re-performs the analysis with the

- *same code* and
- *same data* and
- obtains the *same result*.

If your results are not repeatable then they will not be reproducible.
---

## Reproducibility

Several factors come into play for reproducibility to occur:

- Data creation or acquisition
- Data storage and transfer
- Computing power
- Complexity of methods
- How output and results are presented

---

## Reproducibility

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#reproduce-comic.png" alt="From https://jhudatascience.org" width="70%" />
<p class="caption">From https://jhudatascience.org</p>
</div>

---

## Reproducibility

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#reproduce-comic2.png" alt="From https://jhudatascience.org" width="70%" />
<p class="caption">From https://jhudatascience.org</p>
</div>

---

## Reproducibility

Barriers to doing reproducible work:

- Poor documentation
- Manual steps
- Non-transferable tools
- Incorrect training
- Time

---

## Reproducible Workflow

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#reproduce-flow.png" alt="From https://jhudatascience.org" width="90%" />
<p class="caption">From https://jhudatascience.org</p>
</div>

---

## Reproducible Research

In academic research, we often push for publication. Published work is often poorly reproducible, so the push is to publish code, data, and the tools needed to re-run analyses.

<img src="data:image/png;base64,#reproduce.png" width="70%" style="display: block; margin: auto;" />

---

## Reproducible Research

In computational sciences and data analysis, what is reproducibility?

- The data and code used to make a finding are available and they are presented in such a way that it is (relatively) straightforward for an independent researcher to recreate the finding.

---

## Reproducible Research

This actually seldom happens.
Consider two interesting articles by [Tim Vines](https://dataseer.ai/about/):

- The Availability of Research Data Declines Rapidly with Article Age
  “of 516 articles published between 2 and 22 years ago…the odds of a data set being extant fell by 17% per year.”
- Recommendations for utilizing and reporting population genetic analyses: the reproducibility of genetic clustering using the program structure
  “we reanalysed data sets gathered from papers using the software package ‘structure’… 30% of analyses were unable to reproduce the same number of population clusters.”

---

## Reproducible Research

Scientific articles have fairly detailed methods sections, but those are typically insufficient to actually reproduce an analysis.

Roger Peng and Stephanie Hicks [wrote](https://www.annualreviews.org/doi/pdf/10.1146/annurev-publhealth-012420-105110) "Reproducibility is typically thwarted by a lack of availability of the original data and computer code."

--

Scientists owe it to themselves and their community to have an explicit record of all the steps in an analysis done at a computer.

---

## Reproducible Research Do's

- Start with a good question: make sure it is focused and is something you're interested in.
- Teach your computer to do the work from beginning to end!
- Use version control.
- Keep track of your software environment, from what is in your toolchain (software: Python, R, Tableau) to version numbers.
- Set your seed for any random number generation or sampling! This is needed when splitting up your training and test sets.
- Think about the entire pipeline.

---

## Reproducible Research Don'ts

Do NOT do things by hand! This includes:

- Editing spreadsheets to clean them up (e.g. removing outliers, your own QA/QC)
- Editing tables or figures
- Downloading data from a website by clicking links in a web browser
- Splitting data and moving it around
- If anything is done by hand because there is no other way, document it!
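
As a minimal sketch of the points above (not a prescribed course workflow), the script below does a small cleaning step in code instead of by hand. The file names, the inline sample data, and the `value <= 100` outlier rule are all hypothetical, and `sha256sum` is assumed to be available (on macOS, `shasum -a 256` plays the same role).

```sh
#!/bin/sh
# Sketch only: every step is scripted so it can be re-run from beginning to end.
set -eu

# Work in a scratch directory (hypothetical location created by mktemp)
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a scripted download, so nothing is fetched by clicking links
printf 'id,value\n1,3.2\n2,4.1\n3,250.0\n' > raw.csv

# Record a checksum so a later run can confirm the raw data is unchanged
sha256sum raw.csv > raw.csv.sha256
sha256sum -c raw.csv.sha256

# Clean in code, not in a spreadsheet: keep the header and rows with value <= 100
# (hypothetical rule). The raw file itself is never edited.
awk -F, 'NR == 1 || $2 <= 100' raw.csv > clean.csv

# Keep a simple record of the system the analysis ran on
uname -sr > environment.txt
```

Because the raw file is left untouched and the rule lives in the script, anyone can re-run the cleaning step and see exactly which rows were dropped and why.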
---

## Reproducible Research Don'ts

In data science, try not to use point and click software or other interactive software.

- This type of work is not easily reproduced because there is no trace of the steps. If you have to use it, write down the steps!

--

- Save the data and code that generated the output, rather than the output itself.

---

## Reproducible Research Challenges

- Data size
  - Build ways to manage large datasets into your code, for example with data.table and parallel processing.
  - Store data in smaller chunks and write code that pulls the data files automatically, combining them when needed for analysis.
  - Write metadata, and use tools that help with data organization.

---

## Reproducible Research Challenges

Data complexity

- Try to incorporate smaller snippets of data in your workflow to check reproducibility
- Training, validation sets
- Diagnostic visualizations

--

Workflow complexities

- Use readme files!!

---

## What is version control?

<img src="data:image/png;base64,#phdcomic.png" width="35%" height="15%" style="display: block; margin: auto;" />

---

## What is version control?

<div style="text-align: center;">
<table>
<col width="40%">
<col width="40%">
<tr>
<td style="text-align: left;">
[I]s the <strong>management of changes</strong> to documents [...] <strong>Changes are usually identified</strong> by a number or letter code, termed the "revision number", "revision level", or simply "revision". For example, an initial set of files is "revision 1". When the first change is made, the resulting set is "revision 2", and so on. <strong>Each revision is associated with a timestamp and the person making the change</strong>. Revisions can be <strong>compared</strong>, <strong>restored</strong>, and with some types of files, <strong>merged</strong>.
-- <a href="https://en.wikipedia.org/w/index.php?title=Version_control&oldid=948839536" target="_blank">Wiki</a>
</td>
<td>
<img src="https://upload.wikimedia.org/wikipedia/commons/a/af/Revision_controlled_project_visualization-2010-24-02.svg" alt="Diagram of version control" width="35%">
</td>
</tr>
</table>
</div>

---

## Why do we care?

Have you ever:

- Made a *change to code*, realised it was a *mistake* and wanted to *revert* back?
- *Lost code* or had a backup that was too old?
- Had to *maintain multiple versions* of a product?
- Wanted to see the *difference between* two (or more) *versions* of your code?
- Wanted to prove that a particular *change broke or fixed* a piece of code?
- Wanted to *review the history* of some code?

---

## Why do we care? (cont'd)

- Wanted to submit a *change* to *someone else's code*?
- Wanted to *share your code*, or let other people work on your code?
- Wanted to see *how much work* is being done, and where, when and by whom?
- Wanted to *experiment* with a new feature *without interfering* with working code?

In these cases, and no doubt others, a version control system should make your life easier.

-- [Version Control](https://www.atlassian.com/git/tutorials/what-is-version-control)

---

## Why do we care? (cont'd)

<img src="data:image/png;base64,#fig/git-flow.png" width="55%" style="display: block; margin: auto;" />

---

## Git: The stupid content tracker

<div style="text-align: center;">
<figure>
<a href="https://commons.wikimedia.org/wiki/File:Git-logo.svg" target="_blank"><img style="width: 200px;vertical-align: middle;" src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Git-logo.svg/500px-Git-logo.svg.png" hspace="20px" alt="Git logo"></a>
<a href="https://en.wikipedia.org/wiki/Linus_Torvalds" target="_blank"><img style="width: 200px;vertical-align: middle;" hspace="20px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/LinuxCon_Europe_Linus_Torvalds_03_%28cropped%29.jpg/345px-LinuxCon_Europe_Linus_Torvalds_03_%28cropped%29.jpg" alt="Linus Torvalds"></a>
</figure>
<figcaption><b>Git logo and Linus Torvalds, creator of git</b></figcaption>
</div>

---

## Git: The stupid content tracker

- During this class (and perhaps, the entire program) we will be using [Git](https://git-scm.com).
- Git is used by [most developers in the world](https://survey.stackoverflow.co/2024#developer-profile); in fact, [93% of developers use git](https://stackoverflow.blog/2023/01/09/beyond-git-the-other-version-control-systems-developers-use/).
- A great reference about the tool can be found [here](https://git-scm.com/book/en/v2).
- More on what's stupid about git [here](https://en.wikipedia.org/wiki/Git#Naming).

---

## How can I use Git

There are several ways to include Git in your work-pipeline. A few are:

- Through command line
- Through one of the available Git Clients:
  - RStudio [(link)](https://happygitwithr.com/rstudio-git-github.html)
  - GitHub Desktop [(link)](https://desktop.github.com/)
  - GitKraken [(link)](https://www.gitkraken.com/)

More alternatives [here](https://git-scm.com/download/gui).
---

## What Git does

<img src="data:image/png;base64,#git-image.png" width="85%" height="85%" style="display: block; margin: auto;" />

---

## Git workflow

<img src="data:image/png;base64,#fig/git.svg" width="65%" height="65%" style="display: block; margin: auto;" />

---

## Setting up the workflow

- Go to GitHub and sign in to your account.
- Create a repository (name it, choose public, add a README).
- Clone it (create a copy and put it on your local computer).

Note: We are assuming that you already [installed git on your system](https://git-scm.com).

---

## Setting up the workflow

<img src="data:image/png;base64,#git-clone.png" width="85%" height="85%" style="display: block; margin: auto;" />

---

## Workflow for an existing repo

1. Start the session by pulling (possible) updates: `git pull`

2. Make changes:

    a) (optional) Add untracked (possibly new) files: `git add [target file]`

    b) (optional) Stage tracked files that were modified: `git add [target file]`

    c) (optional) Revert changes on a file: `git checkout [target file]`

3. Move changes to the staging area (optional): `git add [target file]`

---

## Workflow for an existing repo (cont'd)

4. Commit:

    a) If nothing pending: `git commit -m "Your comments go here."`

    b) If modifications not staged: `git commit -a -m "Your comments go here."`

5. Upload the commit to the remote repo: `git push`.

---

## Hands-on 0: Introduce yourself

Set up your Git installation with `git config`. Start by telling Git who you are:

```sh
$ git config --global user.name "Meredith Franklin"
$ git config --global user.email "mfranklin@email.com"
```

If you have already set up git previously, you can check your settings:

```sh
$ git config --list
```

(to get out of the list in terminal, press q)

Try it yourself (5 minutes) (more on how to configure git <a href="https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration" target="_blank">here</a>)

---

## Hands-on 1: Remote repository

We will start by working on our very first project.
To do so, you are required to start using Git and GitHub so you can share your code with your team. For this exercise, you need to:

a. Create a new (empty) repository on GitHub (you can try `JSC370`). Make sure to include a README.md (checkbox).

b. Go to the local directory where you want to store the files for this repo.

c. Clone the repository (in GitHub copy the repo link): `git clone https://github.com/...`.

d. Back in terminal, edit the README.md. You can use nano in the terminal or open it in another app such as RStudio or SublimeText.

e. Add the edited README.md file to the tree using the `git add` command, and check the status.

f. Make the first commit using the `git commit` command, adding a message, e.g.

```sh
$ git commit -m "My first commit ever!"
```

---

## Hands-on 1: Remote repository

You can use `git log` to see the history. You can also use `git status` to see the list of items that might be pending in your `git` workflow.

---

## Hands-on 1: Remote repository

The following code is fully executable (copy-pastable):

```sh
# (a) Creating the folder for the project (and getting in there)
mkdir -p ~/JSC370
cd ~/JSC370

# (b) Initializing git
git init

# (c) Creating the Readme file
echo An empty line > README.md

# (d) Adding the file to the tree
git add README.md
git status

# (e) Committing and checking the history
git commit -m "My first commit ever!"
git log
```

---

## Hands-on 1: Remote repository

If you add the wrong file to the tree, you can remove it from the tree using `git rm --cached`. For example, imagine that you added the file `class-notes.docx` (which you are not supposed to track). You can remove it using

```sh
$ git rm --cached class-notes.docx
```

This will remove the file from the tree **but not from your computer**.
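
To try this without touching a real project, here is a small sketch that runs in a throwaway repository. The temporary directory, the second file `README.md`, and the file contents are assumptions made for illustration only.

```sh
#!/bin/sh
set -eu

# Work in a throwaway repository (hypothetical location created by mktemp)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create two files and add both to the tree, one of them by mistake
echo "Project notes" > README.md
echo "not meant to be tracked" > class-notes.docx
git add README.md class-notes.docx

# Remove the mistaken file from the tree only; the copy on disk is kept
git rm --cached -q class-notes.docx

# List the files git is still tracking
tracked=$(git ls-files)
echo "tracked: $tracked"

# The file is still on disk, and git status now reports it as untracked
ls class-notes.docx
git status --short
```

Running `git ls-files` before and after the `git rm --cached` step is a quick way to confirm which files are in the tree.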
You can go further and ask git to ignore docx files using the [.gitignore file](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository#_ignoring).

---

## Hands-on 1: Remote repository

<img src="data:image/png;base64,#fig/git1a-2024.png" width="85%" height="85%" style="display: block; margin: auto;" />

---

## Hands-on 1: Remote repository

<img src="data:image/png;base64,#fig/git1.png" width="85%" height="85%" style="display: block; margin: auto;" />

---

## Example for .gitignore

Example extracted directly from Pro-Git [(link)](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository#_ignoring).

<pre style="font-size: 12pt;">
# ignore all .a files
*.a
# but do track lib.a, even though you're ignoring .a files above
!lib.a
# only ignore the TODO file in the current directory, not subdir/TODO
/TODO
# ignore all files in any directory named build
build/
# ignore doc/notes.txt, but not doc/server/arch.txt
doc/*.txt
# ignore all .pdf files in the doc/ directory and any of its subdirectories
doc/**/*.pdf
</pre>

---

# Resources

- Git's everyday commands: type `man giteveryday` in your terminal/command line, and see the very nice [cheatsheet](https://github.github.com/training-kit/).
- My personal choice for nightstand book: The Pro-git book (free online) [(link)](https://git-scm.com/book)
- GitHub's website of resources [(link)](https://try.github.io/)
- The "Happy Git with R" book [(link)](https://happygitwithr.com/)
- Roger Peng's Mastering Software Development book, Section 3.9, Version control and GitHub [(link)](https://bookdown.org/rdpeng/RProgDA/version-control-and-github.html)
- Git exercises by Wojciech Frącz and Jacek Dajda [(link)](https://gitexercises.fracz.com/)
- Check out GitHub's Training YouTube Channel [(link)](https://www.youtube.com/user/GitHubGuides)

---

# Other tools to explore

- Project management tool [Jira](https://www.atlassian.com/software/jira) provides a platform for teams to plan, track and manage work. It includes issue and task tracking, customizable workflows, and integration with Git through [GitKraken](https://www.gitkraken.com/git-integration-for-jira).