Workshop Overview
Purpose
To introduce ecologists and other life scientists to better practices for sharing their code and results, managing their projects, and improving reproducibility of their computational work. No prerequisite knowledge is necessary.
Instructional objectives
- Share & collaborate
- Learn how to do version control for code and data files
- Tools: shell, git & GitHub via the RStudio terminal pane
- Project management & reproducibility
- Learn how to organize projects, use coding better practices, and intermediate R skills (for loops, functions)
- Tools: RStudio projects, dplyr/tidyr/ggplot2
- Documentation & publication
- Learn how to write project READMEs and publicly publish reports
- Tools: markdown, Rmarkdown files, RPubs
General structure
This workshop was taught as a series of 8 two-hour virtual sessions, meeting twice a week for four weeks. This content could be taught in two 8-hour long days, or split up in various other ways to total 16 hours of instructional time.
Relatively small workshop size (< 15) helped keep one-on-one troubleshooting manageable for two instructors and an occasional helper.
The workshop format is code-along, and we use R/RStudio for the entire workshop. Students should install R and RStudio prior to the first class.
Lessons
Lesson 1
Objective
Learners should be comfortable with their file structure and navigating around it using command line commands in RStudio Terminal pane.
Teaching notes
- If not already installed, learners should install R and RStudio
- Need to install git initially for Windows users to have shell commands
- Separate into 2 breakout rooms to install git/GitBash for Windows and Macs
- Restart RStudio after installation
- For Windows, open RStudio Tools/Global Options/Terminal and make sure GitBash is selected
- Download shell-lesson-data.zip into temporary location, this won’t be referenced in later lessons
- Emphasize purposes for learning shell
- Helpful for using git and GitHub
- Easier to deal with installation problems
- Understanding file structure and file paths
- If there’s additional time, can also cover Software Carpentry Unix Shell episode #3
Homework
Exercises in shell lesson #2
Lesson 2
Topic
Version control with git
Objective
Learners will practice how changes to code are tracked with version control and become familiar with command-line git within the RStudio IDE.
Teaching notes
- When setting up git for the first time, remind students to use the same email as their existing GitHub account or to select an email that will be used for GitHub
- Reinforce using
cd and ls -al to move around the file structure and see contents
Homework
Create a GitHub account if you don’t already have one
Lesson 3
Topic
Sharing with GitHub
Objective
Learners will share their local repository on GitHub and learn how to sync some files while ignoring others systematically.
Installation & materials
- Software Carpentry Version Control with Git episode #6
- Software Carpentry Version Control with Git episode #7
- Software Carpentry Version Control with Git episode #8
Modifications
- Provide a large .csv file, ask students to move into pilot-analyses repo within a data folder
- Use RStudio interface to create a new text file, save as .gitignore; explore how to suppress a specific file, files within a folder, and a specifc kind of file extensions
- Start #7 with SSH setup.
ssh-keygen -t ed25519 -C "username@laptop" will create a .ssh file if one does not already exist
- Then proceed to connecting local repository to GitHub repository
- The only part of #8 that was taught was cloning a forked repository. Ask students to fork an instructor-created repository, then clone locally. This will be the repo for downstream files.
- Use
git remote add upstream xxxx to connect repo to upstream. If time permits, demonstrate a pull request (PR).
Teaching notes
- Practice an add-commit cycle after creating the .gitignore
- If
ssh -T git@github.com doesn’t work, could be an issue with OpenSSH. Try a different ssh flavor:
which -a ssh
- Use a non OpenSSH version, e.g.,
/usr/bin/ssh -T git@github.com
- Easy to check on learner progress by looking for who has and hasn’t forked repo
- Here is a good figure and blog post on the relationship, vocabulary, and workflow of working with forked repositories
Lesson 4
Topic
Project management and coding best practices
Objective
Learners will learn about and practice managing their projects using file structure and RStudio projects, and about current best practices and style guides for R coding. They will also get more comfortable with version control through learning and doing pull requests and reinforcement of git cycle.
Teaching notes
- After introducing concept of R projects, walk through turning their local copy of forked repo into an R project
- To practice git, add new
*.Rproj to .gitignore, then commit and push to remote repo
- Discuss what new concepts learners discovered in tidyverse style guide
- Can start homework during session
- Then add, commit, push these changes and then walk through opening a pull request to the upstream (instructor’s) repo
- This is another nice way to keep track of learners’ progress
Homework
Fix up example R script in forked repo
Lesson 5
Objective
Learn how to reproducibly clean up dataframes using tidyverse R packages.
Installation & materials
- Install R packages ‘dplyr’, ‘tidyr’, ‘readr’ and ‘udunits2’
- Data carpentry R ecology lesson - #3 ‘Manipulating, analyzing, and exporting data’
Modifications
- Introduced
readr::read_csv() in lieu of read.csv(), which reads character strings as characters (not factors) and imports time/date columns
- Introduced
ud.convert() as a less error-prone way of performing conversions, compatible within mutate()
- Updated to the more modern
tidyr::pivot_longer() and tidyr::pivot_wider()
surveys_gw <- surveys %>% filter(!is.na(weight)) %>% group_by(plot_id, genus) %>% summarize(mean_weight = mean(weight))
surveys_wide <- surveys_gw %>% pivot_wider(names_from(genus), values_from = mean_weight)
surveys_long <- surveys_wide %>% pivot_longer(-plot_id, names_to = "genus", values_to = "mean_weight")
- Used
readr::write_csv(), which eliminates need for row.names = FALSE
- Ended with saving script and doing an add-commit-push cycle to sync local repository with GitHub repository
Teaching notes
- Have students install each library separately rather than as the tidyverse - can be important to know which functions come from which packages
Lesson 6
Topic
Reproducibility of R code I
Objective
Learners will learn how to make their R code more reproducible using functions and control flow approaches.
Teaching notes
- Showed learners how to use issues so they could use for homework assignment
- Forked repos don’t have issues included by default, change in repo settings
- Create new R script for functions content, and another new R script for control flow (ifelse and for loops) content
- Demonstrate git add, commit, and push files or file changes at the end of each set of material
Homework
Start to create a list of at least three specific ways to implement the skills learned here in research projects
Examples:
- “I will rewrite the code on lines 16-25 of analysis.R into a function”
- “I will turn my plotting script into an Rmarkdown file with descriptions of the plots”
Lesson 7
Topic
Reproducibility of R code II and plotting I
Objective
Learners will learn how to make their R code more reproducible using control flow approaches and understand the components of creating data visualizations with ggplot.
Installation & materials
- Data Carpentry for Biologists for loop lecture
- Data Carpentry R ecology lesson - #4 ‘Data visualization’
- Install ‘ggplot2’
Modifications
- For for loops lecture:
- At “Do Tasks 3-4…”, do “Basic For Loop” exercises #2 & #3
- Skip sections “Looping over multiple values” and “Looping with functions”
- If sufficient time, do “Storing loop results in a data frame”, otherwise stop there
- Use the end of #3 (creating surveys_complete dataframe) at the start of plotting part of lesson
- Demonstrate the hindfoot_length vs. weight scatter plot within a for loop, using ggsave
Teaching notes
- ggplot content took about an hour in this lesson and an hour in the next lesson
- Currently getting warnings from “Looping over files section” but wasn’t able to troubleshoot, so used as a teaching opportunity to discuss warnings vs. errors
- Show how to create a new
plots folder in the RStudio interface for the for loop outputs
- Good time to troubleshoot and reinforce the syntax of a for loop
Lesson 8
Topic
Plotting II and documentation
Objective
Learners will practice different ways to make multiple plots (for loops, faceting) and learn how and why to document their repositories and code.
Installation & materials
- Data carpentry R ecology lesson - #4 ‘Data visualization’
- Data Carpentry for Biologists knitr lecture
- Install R packages
rmarkdown and knitr
Modifications
Plotting:
- Make a time series plot, but with mean weights across a particular species (
species_id == "DM")
- Reinforce the
group_by() and summarize() functions from an earlier lesson
- Demonstrate piping datasets into
ggplot()
- Challenge students to filter by and plot all members of the genus Dipodomys; what are some choices for differentiating between species?
- Students may have used
color = species_id within aes()
- Show
facet_wrap() as an option, particularly with scales = 'free_y'
- Additional challenge of adding a third factor of sex (in addition to year and species_id)
- Show
facet_grid() as an option, with species by row and sex by column
- What are the advantages and disadvantages of plotting together versus separately?
- Final note of caution: it can be important to depict the number of samples that when into the sample mean and sd
- Can be added with
geom_text() to the graph itself
- Or as a color axis when combined with
facet_grid()
Documentation:
- No lesson materials for first section about READMEs
- For Rmarkdown lesson:
- Simplify text used to spend less time typing
- Save .Rmd file in newly created
docs folder (to demonstrate project management)
- Skip “Citations” section
- Stop at “R Presentations” section
Teaching notes
Plotting:
- Students may be familiar with the basics of ggplot already, this portion of the lesson emphasizes how to build plots and customize viewing options to maximize students’ own understanding of their data (as opposed to how to make publication-quality graphs)
Documentation:
- Define README
- Come from software engineering
- File that introduces and explains a project
- Contains info about other files and folders in a repo
- Also package versions and installation instructions, how to contribute
- Should go in main folder of project
- Usually a plain text file
- Name is in all upper case
- Show an example of a README
- GitHub encourages use of README
- Automatically asks to include one when creating new repo
- Automatically displays one on repo main page if there is one
- Either have them create a new README from scratch, or we included one in the repo they forked so we had them open up the README in RStudio to edit later
- Explanation of markdown
- What we can use to format README files and other plain text files
- History
- Plain text formatting, can be converted to HTML
- Compare file in RStudio to GitHub render
- GUI vs. command line type approach
- Can do same things in Word/Google Doc but is more explicit and less point and click
- Also better for version control
- Simple and easy to read and format
- GitHub markdown specifically: https://guides.github.com/features/mastering-markdown/
- Practice markdown and README concepts by adding a “Project files” section to their forked local repo README, which can look similar to the example below:
### Project files
- `data_raw` folder
- `portal_data_joined.csv`: example data from [Portal project](https://portal.weecology.org/)
- `scripts` folder
- `01-data-manipulation.R`: introduction to dplyr
- `plots` folder
- Switch to introducing Rmarkdown
- Purpose is to share results as report, and possibly turn into manuscripts
- Combines markdown-formatted text with code
- Do DC Rmarkdown lesson
- Add, commit, and push .Rmd file
- If time, demonstrate publishing Rmarkdowns on RPubs
- Show an example
- In RStudio, click blue publish button and select RPubs option
Technical notes
- One early problem occurred when students were unsure of which ‘Documents’ or ‘Desktop’ folder they were saving to, which made navigating filepaths tricky. Note that on some PCs, the Windows OneDrive backup folders are saved to by default, but students should navigate to
C:/Users/username/Documents or C:/Users/username/Desktop
- R packages sometimes did not install properly.
- One reason was that some usernames included special characters, which were not interpreted correctly.
- Another included R being installed on a remote server, rather directly on the local machine
Acknowlegements
Thanks to the Ecological Society of America, and in particular Teresa Mourad and Fred Abbott, for inviting us to teach the first half of the Critical Skills to Scale Up Ecology workshop, which resulted in this set of curriculum. A huge thanks to The Carpentries organization for generating all of the volunteer-created content we used here, and for training both of us in pedagogy.