05 December, 2019

Introduction

A project is a collection of files and folders for one task. Organizing a project by file type, and doing so consistently, helps you find and use things later.

Seminar Goal

  • Organize a project in GitLab using RMarkdown

    1. Put each project in its own directory, which is named after the project.
    2. Create an overview of your project in the README file.
    3. Put the project source code in the src directory.
    4. Put the created functions in the R directory.
    5. Put text documents associated with the project in the doc directory.
    6. Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
    7. Keep the project portable.
    8. Name all files to reflect their content or function.
    9. You may need to have pic and presentation directories.

  • Some tips around using RMarkdown for data analysis and writing reports
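
The directory layout listed above can be sketched in base R. This is only an illustration under stated assumptions: `scaffold_project()` is a hypothetical helper (not part of usethis or devtools), "cointoss" is a made-up project name, and the folders are built under a temporary directory so that no real project is touched.

```r
# Hypothetical helper: create the seminar's directory layout under `root`.
scaffold_project <- function(root,
                             dirs = c("src", "R", "doc", "data", "results")) {
  for (d in dirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  # Add an empty README.md if the project does not have one yet.
  readme <- file.path(root, "README.md")
  if (!file.exists(readme)) {
    file.create(readme)
  }
  invisible(root)
}

# Demonstration in a throwaway location; "cointoss" is a made-up name.
demo_root <- file.path(tempdir(), "cointoss")
scaffold_project(demo_root)
list.files(demo_root)
```

In a real project you would pass the project directory instead of a temporary one, and add "pic" or "presentation" to `dirs` if you need them.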

Required tools and packages

  • Have RStudio installed on your computer.
  • Open RStudio and install usethis and devtools packages.
    • The devtools package is used to develop a package in R.
    • The usethis package automates project and package setup tasks, including RStudio and Git projects.
    # install from CRAN
    install.packages("devtools")
    install.packages("usethis")

Project Organization

1. Put each project in its own directory

  • An RStudio project is a directory plus some minimal metadata contained in an .Rproj file.

  • Creating an RStudio project opens a new instance of RStudio.

  • Having a new instance of RStudio enables you to locally reference and work with the files in the project directory.

  • To create a new RStudio project:

    • In a local RStudio: File -> New Project -> New Directory, or
    • In the R console: usethis::create_project("Path to the project")
  • This directory should contain everything needed for an analysis and be completely portable and reproducible.

2. Create an overview of your project in the README file

  • Have a short file in the project’s home directory that explains the purpose of the project.

  • A great place to remind yourself and other people of:

    • Project title
    • Project objectives
    • Links to important resources
    • Up-to-date contact information
    • One or two examples of how to run various cleaning or analysis tasks (if applicable)
  • You can create a README file:

    • On GitLab, or
    • Use the command line: usethis::use_readme_md()
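
As a rough illustration of what such a README might contain, the sketch below writes a minimal README.md with base R. `write_readme()` is a hypothetical helper, and the title, objective, and contact values are placeholders; the file is written to a temporary directory rather than to a real project.

```r
# Hypothetical helper: write a minimal README.md with the items listed above.
write_readme <- function(path, title, objective, contact, links = character()) {
  lines <- c(
    paste("#", title),
    "",
    "## Objectives",
    objective,
    "",
    "## Important resources",
    if (length(links) > 0) paste("-", links) else "- (none yet)",
    "",
    "## Contact",
    contact
  )
  writeLines(lines, path)
  invisible(path)
}

readme_path <- file.path(tempdir(), "README.md")
write_readme(
  readme_path,
  title = "Coin toss simulation",           # made-up project title
  objective = "Simulate tossing a fair or unfair coin.",
  contact = "Your Name <you@example.com>"   # placeholder contact
)
readLines(readme_path)
```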

Your Turn 1

Make a project and README file

  • Step 1. Create a project on GitLab and name it as you like.

    • Project name should contain only numbers/letters/“.”, start with a letter, contain at least two characters, and not end with “.”
    • Go to http://gitlab.weyer.com/ -> New project
    • Initialize repository with a README file

  • Step 2. Clone the project on your computer.

  • Step 3. Open R and create an R project in your project directory.

    • You can either use the command line usethis::create_project("Path to the project") or go to File -> New Project.
    • Using the command line to create a project is a better option since an R folder is created and you can put your R scripts in it (You will see the benefits of having an R folder later…)
  • Step 4. Have not created a README file? Use the command line usethis::use_readme_md() to create one.

  • Step 5. Add a brief description of the project in the README file.

How to edit a README file

  • The README file is an .md (Markdown) file that can be edited in:
    • RStudio (recommended)
      • In RStudio: File -> Open File -> Navigate to the README file
      • Looking for the markdown syntax? Here is the cheatsheet (Will dive into this later…)
    • GitLab (not recommended since you need to pull it to your computer later)
      • In GitLab repository: Click on README.md -> Edit
    • Text editor

3. Put the project source code in the src directory

  • src contains all of the code written for the project.

    • Programs written in interpreted languages such as R or Python
    • Those in compiled languages like Fortran, C++, or Java
    • As well as shell scripts, snippets of SQL used to pull information from databases
    • Other code needed to regenerate the results
  • You need to create the src directory manually in the project directory.

4. Put the created functions in the R directory

  • Code that consists mostly of R function definitions goes in R, not src.
  • To be clear, only R function definitions should go in R/, inside .R files. R code that runs functions, simulations, or analyses should go somewhere else (src).
  • How to put .R scripts in R/ directory?
    • Manually in RStudio: File -> New File -> R Script and save it in R/ directory
    • Use the command line: usethis::use_r("FunctionName.R")
  • Does it matter what the Function.R file is called?
    • No, but good practice is to give the file the same name as the function itself
  • If the R/ directory does not exist (you did not use the command line to create your .R functions or R project), you need to create it manually.

How to load the function.R scripts

  • You need to create a DESCRIPTION file first.

    • It stores the important metadata about your project (e.g. What packages are used)
    • Use the command line: usethis::use_description() to create one
    • Need more information? Here you go!
  • Then, you can load all the functions in R/ directory and any documentation you’ve written.

    • Use the command line: devtools::load_all()
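
If devtools is not available, a rough base-R substitute is to source every script in the R/ directory. This sketch is not what devtools::load_all() does internally (it ignores the DESCRIPTION file and any documentation); it only illustrates the idea. The temporary project folder and the `say_hello()` function are made up for the demonstration.

```r
# Build a throwaway project with one function file in R/.
project_dir <- file.path(tempdir(), "load_demo")
dir.create(file.path(project_dir, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(
  'say_hello <- function(name) paste("Hello,", name)',
  file.path(project_dir, "R", "say_hello.R")
)

# Source every .R file found in the R/ directory.
r_files <- list.files(file.path(project_dir, "R"),
                      pattern = "\\.[Rr]$", full.names = TRUE)
for (f in r_files) {
  source(f)
}

say_hello("seminar")
```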

On a side note

Create documentation for your functions

  • You need to add some special comments at the beginning of your .R function script.
  • Use the command line devtools::document() to generate the documentation.
  • Documentation files (.Rd files) are automatically created in a folder named man.
  • You can use ?FunctionName or help(FunctionName) to view the generated documentation for the function.

General format of special comments for documentation

  • Need more information? Here you go!

    #' Title of the function
    #'
    #' Description of the function
    #'
    #' Specify more details here
    #'
    #' @param arg_name Describe the input parameter here (one @param line per argument; arg_name is a placeholder)
    #'
    #' @examples
    #' # Show some examples of your function here
    #'
    #' @return Describe the output of the function here

Your Turn 2

Generate functions in R

  • Step 1. Create a function that simulates the outcome of tossing a coin n times. Let’s start with creating an R script in the R/ directory.
    • Use the command: usethis::use_r("coin_toss_sim.R"), or
    • Create an .R script in RStudio and save it in the R/ directory. Name the file coin_toss_sim.R
  • If there is no R/ directory, go ahead and create one in the project directory manually.

  • Step 2. Copy and paste the following code in the coin_toss_sim.R script and save it.

    coin_toss_sim <- function(n, p) {
    
        # this function simulates the outcome of tossing a coin n times
        # inputs:
        ## n = number of tosses
        ## p = probability of getting a head
        # output: integer vector of length n (1 = head, 0 = tail)
    
        rbinom(n = n, size = 1, prob = p)
    
    }

  • Step 3. Create a DESCRIPTION file using the command: usethis::use_description().
  • Step 4. Load the function using the command: devtools::load_all().
  • Step 5. If you start typing the function name in the Console, RStudio autocompletes it for you.

Now you see that R recognizes the function!!!
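
A quick way to sanity-check the function is to toss the coin many times and look at the share of heads. The sketch below repeats the function definition so that it runs on its own; the seed value and the number of tosses are arbitrary choices made for the illustration.

```r
# Same function as in Step 2, repeated so this snippet runs on its own.
coin_toss_sim <- function(n, p) {
  rbinom(n = n, size = 1, prob = p)
}

set.seed(2019)                               # arbitrary seed, for repeatable draws
tosses <- coin_toss_sim(n = 1000, p = 0.5)

table(tosses)   # counts of tails (0) and heads (1)
mean(tosses)    # share of heads; should be close to p for large n
```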

On a side note - adding documentation

  • Add the following comments to the beginning of the coin_toss_sim.R script and save it again.

    #' Coin Toss Simulation
    #'
    #' This function simulates the outcomes of tossing a coin n times
    #'
    #' The coin can be unfair, meaning that the probability of 
    #' getting a head or tail can be unequal.
    #' The probability of getting a head can be passed in the 
    #' function using the "p" argument.
    #'
    #'
    #' @param n number of tosses
    #' @param p probability of getting a head
    #'
    #'
    #'
    #' @return Outcomes of tossing an unfair coin n times
    #'
    #' @examples
    #'
    #' coin_toss_sim(n = 10, p = 0.5)

  • To create the documentation, use the command:

    devtools::document()
  • Check the project directory!

    • There is a man folder created with the function documentation in .Rd format
  • Now use the code below to see the function description in the help pane.

    ?coin_toss_sim
    
    # tossing a fair coin 20 times
    coin_toss_sim(n = 20, p = 0.5)

You should see the documentation in the help pane now! Note that this function is only available within this project. You would need to add the script to other projects for access there.

5. Put text documents in the doc directory

  • This includes files for manuscripts, documentation for source code, and/or an electronic lab notebook recording your experiments.
  • Subdirectories may be created for these different classes of files in large projects.
  • If you only have one file in here, it’s OK to put it in the main project directory instead.
  • I’ve also seen suggestions for this to be called analysis.
  • The doc directory needs to be created manually.

6. data and results directories

  • The data directory includes only the raw data.

  • The results directory includes:

    • Generated intermediate results such as cleaned data sets or simulated data
    • Final results such as figures and tables
    • It will usually require additional subdirectories for all but the simplest projects
  • These two directories need to be created manually.

What format should we save it in?

  • CSV is good for:
    • Portability
    • Cleaned and tabular data
    • It is human readable
  • RDS is good for:
    • Complicated data structures
    • It is quick to load
    • Saving intermediate objects when you don’t need to share with people that don’t use R
    • Save RDS with readr::write_rds()
    • When you need it later? Import with readr::read_rds()
  • Why not RDA? You can’t choose the object’s name when you load it.
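
The trade-off between the two formats can be seen in a small round trip. The sketch below uses base R's saveRDS()/readRDS(), which readr::write_rds() and readr::read_rds() wrap, and base write.csv()/read.csv(); the `measurements` data frame is made up for the illustration.

```r
# A made-up data frame with a factor column that has an unused level.
measurements <- data.frame(
  id = 1:3,
  group = factor(c("a", "b", "a"), levels = c("a", "b", "c"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(measurements, rds_path)
write.csv(measurements, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# RDS keeps the R object as it was, including the factor and its levels.
str(from_rds)
# CSV stores plain text, so the factor comes back as a character column.
str(from_csv)
```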

Some examples

  • Save the big_data table in RDS format, e.g. to save in current directory:

    readr::write_rds(big_data, "big_data.rds")
  • When you need it later:

    MyData <- readr::read_rds("big_data.rds")
  • Why not RDA? You have to remember what you called it when you saved it.

    x <- big_data
    save(x, file = "y.rda")
    rm(list = ls()) # delete everything from environment 
    load("y.rda") # where is it? It's in an object called x
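
If you do receive an .rda file, you do not have to guess the object names: load() returns them, and loading into a separate environment avoids overwriting objects you already have. A minimal sketch, in which `x` stands in for big_data and the file lives in a temporary directory:

```r
x <- 1:5                                   # stand-in for big_data
rda_path <- file.path(tempdir(), "y.rda")
save(x, file = rda_path)
rm(x)

# load() returns the names of the restored objects (invisibly).
restored_names <- load(rda_path)
restored_names

# Loading into a separate environment keeps your workspace untouched.
sandbox <- new.env()
load(rda_path, envir = sandbox)
ls(sandbox)
```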

7. Keep the project portable

  • Never hard code any file paths above the project directory.

  • E.g. This is BAD, because it will never work for anyone else, or if I move my project directory:

    setwd("/Users/Documents/Projects/cointoss/")
    my_data <- read.csv("data/coin_data.csv")
    
    # OR also bad
    read.csv("/Users/Documents/Projects/cointoss/data/coin_data.csv")
  • Rely on RStudio projects and the here package.

    • RStudio projects move the working directory to the project directory
    • Moving the project doesn’t affect the subdirectory structure so things should still work
    • You can use the here() function from the here package to navigate to a specific file in your project directory
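
For contrast with the hard-coded paths above, here is a sketch of the portable alternative. It assumes the data/coin_data.csv file from the earlier example; the here package is used only if it is installed, and the read.csv() call is left commented out because the file may not exist on your machine.

```r
# Path relative to the project root; works when the working directory is
# the project directory (as it is inside an RStudio project).
relative_path <- file.path("data", "coin_data.csv")

# With the here package (if installed), the path is resolved from the
# project root regardless of the current working directory.
if (requireNamespace("here", quietly = TRUE)) {
  rooted_path <- here::here("data", "coin_data.csv")
} else {
  rooted_path <- relative_path
}

relative_path
rooted_path
# my_data <- read.csv(rooted_path)   # uncomment once the file exists
```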

Use projects and the here package

  • You can avoid using setwd() at the top of a script when you want to read or save something.

  • The here() function builds file paths starting from the project directory, so you can navigate from there without changing the working directory.

  • It makes your project portable.

  • Run:

    install.packages("here")
    library(here)
    here()

    What happens?

  • More info on here package.

  • To create a reproducible project, start from a fresh R process with the working directory set to the project directory.
  • By default, R offers to save your workspace on exit and reloads a saved workspace at the next start. The workspace includes all the variables and functions you created.
    • you may inadvertently use variables from your previous R code
    • your code might depend on objects or packages loaded in an earlier session that are not available in a fresh one

  • To get a fresh R process, change a few settings:
    • Go to Tools -> Global Options
    • Uncheck "Restore .RData into workspace at startup"
    • Set "Save workspace to .RData on exit" to Never

8. Name all files to reflect their content or function

  • For example, use names such as bird_count_table.csv, manuscript.md, or sightings_analysis.py.
  • Do not use sequential numbers (e.g., result1.csv, result2.csv) or a location in a final manuscript (e.g., fig_3_a.png).

Adding dependencies

  • You can formally add packages your project relies on with usethis::use_package(). This adds a line to the DESCRIPTION file.

    usethis::use_package("tidyr")
  • It is good practice to look at the DESCRIPTION file when you get a project from someone else: it lists information such as the author’s details and the required packages.
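
One way to inspect a DESCRIPTION file without opening it is base R's read.dcf(). The sketch below first writes a small example file to a temporary directory; its field values are made up, and a real DESCRIPTION created by usethis::use_description() will contain more fields.

```r
# Write a small example DESCRIPTION file; the field values are made up.
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(
  c(
    "Package: cointoss",
    "Title: Coin Toss Simulation",
    "Version: 0.0.0.9000",
    "Imports: tidyr"
  ),
  desc_path
)

# read.dcf() returns a character matrix with one column per requested field.
project_info <- read.dcf(desc_path, fields = c("Package", "Title", "Imports"))
project_info[1, ]
```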

9. pic and presentation directories

  • If you have figures that are not generated by your R code and you would like to use them in your report, you may need to create the pic directory.

  • You can also put your presentation files in the presentation directory.

  • You should create these subdirectories manually.

RMarkdown/GitLab

About RMarkdown

Advantages of using RMarkdown

  • RMarkdown files (.Rmd) enable us to create dynamic documents, reports, and presentations from R.
  • You can generate a reproducible file with descriptive texts, code chunks, and code outputs.
  • You can export the .Rmd files to shareable formats like PDF, HTML, and Word.
  • To create a new .Rmd file: Go to File -> New File -> R Markdown.

Set up the starting chunk

Output on GitLab

  • To generate readable outputs on GitLab, .md files should be uploaded along with the .html files.
    • The output should be set as html_document
    • keep_md should be set to true in the starting chunk (the YAML header at the top of the file)

Global setting

  • The setup chunk sets the global settings for the RMarkdown file.
  • The global chunk setting can be set in knitr::opts_chunk$set(...)
  • It is recommended to load the packages and data that you want to work with in this section.

Some tips

  • You can format the text, include headings, figures, and tables in your RMarkdown document. Here is the cheatsheet that you can use to quickly find some commands.
  • You can run code (in R, Python, etc.) in RMarkdown files by adding code chunks.
    • RMarkdown will run the chunks and embed the results beneath them in your final report
    • The R code chunk results can be the outputs of a model, pictures, tables, etc.
    • There are some chunk settings that customize the chunk outputs (Please see the cheatsheet for more information) like:
      • include = FALSE excludes code and results from your final report
      • echo = FALSE excludes the code chunk but shows the results
      • eval = FALSE excludes the results but shows the code chunk

Rendering outputs

  • You can use the R console or the Knit button to generate the output.
    • If you use the console, you can generate the output in another subdirectory.
    • If you use the Knit button, the results are generated in the same directory where the .Rmd file is located.
  • Using the Knit button:
    • Go to Knit -> Knit to HTML/PDF/Word

  • Using the console to render the output in another subdirectory:
    • rmarkdown::render("Path to your rmarkdown file/yourfile.Rmd", output_file = "Path to the folder you would like to save the outputs/FinalName.html")
    • If you use the here function, you can navigate using the project location as your origin. The output format is taken from the YAML header unless you pass the output_format argument, so give the output file a matching extension (.html, .md, etc.).
    • Need more info? Here you go!
  • You can find more information about rendering outputs here.
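
The console route can be wrapped in a small function. The sketch below is one possible arrangement, not the seminar's own code: `render_to_dir()` is a hypothetical helper, the file names in the example call are placeholders, and the call itself is commented out because it needs an existing .Rmd file and a working rmarkdown installation.

```r
# Hypothetical helper: render an .Rmd file into a separate output directory.
render_to_dir <- function(rmd_file, out_dir) {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is required.")
  }
  # Make sure the output folder exists and use its absolute path.
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out_dir <- normalizePath(out_dir)

  rmarkdown::render(
    input = rmd_file,
    output_dir = out_dir   # the output format comes from the YAML header
  )
}

# Example call (file names are placeholders):
# render_to_dir(file.path("src", "yourfile.Rmd"), "results")
```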

Your Turn 3

Let’s organize a project (IRIS Data)

  • Step 1. Create the src directory if you have not created it.
  • Step 2. Create the data, doc, and results directories.
  • Step 3. Download the famous IRIS data in RDS format and put it in the data directory.
  • Step 4. Download the documentation for the IRIS data and put it in the doc directory.
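
If the download is not at hand, R ships with a built-in `iris` data frame that matches the description quoted below, so an equivalent file can be written locally. A sketch, assuming the built-in data is an acceptable substitute for the downloaded file; `project_dir` is a temporary stand-in for your project root.

```r
# Stand-in for your project root; replace with the real project directory.
project_dir <- file.path(tempdir(), "iris_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# `iris` is the data frame built into R.
iris_path <- file.path(project_dir, "data", "iris.rds")
saveRDS(iris, iris_path)

# Read it back to confirm the file is usable.
iris_data <- readRDS(iris_path)
str(iris_data)
```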

“This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.”

In this exercise, we build a linear regression model that relates sepal length (response variable) to sepal width and species (explanatory variables). We can also load the coin toss function and play around with it.

  • Step 5. Open up an RMarkdown file:
    • Go to File -> New File -> R Markdown
    • Write a title in the title box and specify the author’s name (optional)
    • You have three options for the output format:
      • HTML
      • PDF
      • Word
    • The recommended format for the authors is HTML. You can switch to other formats anytime later
    • Hit OK

  • Step 6. Specify the output format to make readable results on GitLab
    • Output should be .md and .html
    • Change the first few lines of the .Rmd file:
    output: 
      html_document:
        keep_md: true
  • Step 7. Change the global setting.
    • For the sake of this exercise, let’s say we do not want to see any R warnings and messages in our report. Add the arguments warning = FALSE and message = FALSE to knitr::opts_chunk$set()

  • Step 8. Think about the libraries you would need and load them.
    • Add library(here) and library(readr)
    • If you do not have those packages installed, go ahead and install them first in Console: install.packages("readr") and install.packages("here")
  • Step 9. Load the coin toss function in the setup chunk.
    • Add devtools::load_all() to load all the functions in the R/ directory, or
    • Use source(here::here("R", "coin_toss_sim.R")) to load the coin_toss_sim() function only
  • Step 10. Read the data in the setup chunk too.
    • Add iris_data <- read_rds(here("data", "iris.rds"))

Setup chunk
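
After Steps 7 to 10, the body of the setup chunk might look roughly like the sketch below. It is meant to sit inside the setup chunk of the .Rmd file, and it assumes the project from this exercise: an R/ directory with coin_toss_sim.R, a DESCRIPTION file, data/iris.rds, and the here, readr, knitr, and devtools packages installed.

```r
# Body of the setup chunk (a sketch; adjust to your project).
knitr::opts_chunk$set(
  echo = TRUE,        # show code by default
  warning = FALSE,    # Step 7: hide warnings
  message = FALSE     # Step 7: hide messages
)

library(here)         # Step 8
library(readr)        # Step 8

devtools::load_all()  # Step 9: load every function in R/

iris_data <- read_rds(here("data", "iris.rds"))  # Step 10
```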

Step 11. Coin toss function

  • Add a heading in the RMarkdown document
    • Add a heading to the document: ## Coin toss
    • Add an R chunk to work with the function: Go to Insert -> R

  • Other options to create an R code chunk:
    • Type:

      ```{r}

      ```

    • Hit Ctrl + Alt + I

  • Let’s simulate tossing a fair coin 5 times (copy and paste the code in the R code chunk):

    # tossing a fair coin 5 times
    coin_toss_sim(n = 5, p = 0.5)
  • You can run the code in the chunk by hitting the green arrow.

Step 12. IRIS data analysis

  • Now let’s build a linear regression using the iris data:
    • Add a heading to the document: ## IRIS data analysis
    • Let’s add a brief description of the data in this heading section. Copy and paste the paragraph below:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

  • Add an R code chunk and copy and paste the code below in it:

    # model (linear regression)
    iris_model <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris_data)
    
    # model summary
    summary(iris_model)
    
    # residual plots
    plot(iris_model, which = 1)
  • Run the code chunk. What do you see?

  • Step 13. Let’s save what you have created so far.
    • Hit Ctrl+s, or
    • Go to File -> Save As
    • Navigate to the src subdirectory in your project directory, give the file a name, and save it
    • If you go to the src folder, you should be able to see a yourfile.Rmd file now
  • Step 14. Let’s render the output in the results subdirectory.
    • Copy and paste this code in Console: rmarkdown::render(here::here("src", "yourfilename.Rmd"), output_file = here::here("results", "yourfilename.html"))
  • Step 15. Push all files to GitLab.

Advanced topics

Add a caption to a figure

If you are creating a figure in an R chunk, set the caption with the fig.cap chunk option. Also set fig_caption: yes for the output format in the YAML header at the top of the file.

If a picture is saved on your local drive and you want to include it in your RMarkdown file with a caption, read the image in an R chunk (for example with knitr::include_graphics()) and set the caption with fig.cap as described above.

Change figure size

  • You can adjust the size of a figure in your final report by adding some arguments to R chunks like:
    • fig.height =
    • fig.width =
    • out.width =
    • out.height =

See the cheatsheet for more information.

Create tables

  • If you want to create beautiful tables from code output (e.g. the summary of a linear model), the pander package is an option.
    • Other methods are explained in more detail in this link.
  • If you want to create a table manually, this link is a useful tool.
    • You can copy and paste your table from an excel spreadsheet or make your table on the website. If you click copy to clipboard and paste it in your rmarkdown document, the table is generated.
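
As one alternative to pander, knitr::kable() can turn a model's coefficient matrix into a table. This sketch swaps in kable() because knitr is already present wherever an .Rmd file is knitted; it uses R's built-in `iris` data rather than the iris_data object from the exercise, and the caption text is made up.

```r
# Fit the same model as in the exercise, using R's built-in iris data.
iris_model <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Coefficient matrix: estimate, standard error, t value, p value.
coef_table <- coef(summary(iris_model))

if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(coef_table, digits = 3,
               caption = "Coefficients of the iris regression model")
} else {
  round(coef_table, 3)   # plain fallback when knitr is not installed
}
```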

Add appendix with all your codes at the end

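Sometimes you would prefer to hide the code in the body of your document and move it to an appendix. To do that, create an Appendix heading and add a chunk under it that collects the code from the other chunks (for example with the chunk options ref.label = knitr::all_labels(), echo = TRUE, eval = FALSE).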

README with quick navigation on GitLab

After pushing all files to GitLab, it is recommended to add a quick navigation section to the README file. You can add links that take you and other people to the important files of the project.

  • On GitLab project repository go to the file you would like to refer to in the README file.
  • Copy the path to the file from the web browser.
  • Open the README.md file using RStudio and add a section (Quick Navigation) to it.
  • Add the link to the Quick Navigation section.

You can find more information on how to add a link to an RMarkdown file in the cheatsheet.

Your Turn 4

Organize the README file

  • Step 1. Open the README.md file in RStudio.
  • Step 2. Add a heading (Quick Navigation) beneath the short description in the README file.
  • Step 3. Let’s add a navigation link for the IRIS report.
    • On GitLab go to the results folder and click on the .md version of the IRIS report
    • Copy the path to the file from the web browser

  • Step 4. Add the link to the README.md file

  • Step 5. Push everything to GitLab again.
  • Step 6. Go to the project repository. What do you see?

Thank you