Disclaimer: The contents of this document come from Chapter 27, R Markdown, of R for Data Science (Grolemund & Wickham, 2017). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any questions, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.
source: http://www.storybench.org/getting-started-with-tidyverse-in-r/

Motivation

  1. Until now, we have used R script files (.R) for our work on our local machines.
  2. When it comes to sharing our work with others (even with future you), it’s good to create tidy html files that document what we did, why we did it, what else we tried but chose not to use (because it didn’t work), etc.
  3. These html files are similar to lab notebooks in the natural sciences or engineering. Lab researchers keep track of various kinds of activities in lab notebooks, which then benefit not only their future selves but also other team members and a broader audience.
  4. Although we can add #comments to R script files, html files are easier to read, and we can use several tricks to control what to show and what to hide in them. Also, html files work better for non-technical users (e.g., planning commissioners or city officials, not GIS analysts).
  5. We can also publish these html files online (that’s the point of creating html files), meaning that we don’t need to share a set of files, but URLs.
  6. R markdown files (.Rmd) allow us to create html files out of our R work in (relatively) easy and intuitive ways, and they do not require an understanding of html, css, etc.

Examples:

  1. R Markdown Gallery - You can find source code for many of their good examples.
  2. CP6521 course webpage - As you know, we have used R markdown files throughout this semester.
  3. rpubs.com - This website provides a free publishing service for your html files (you don’t need to set up a server), although you need to make them open to the public. Many classes use this website as a tool for homework submission, and you will find interesting analyses when you explore recently published articles.

Before we begin today’s tutorial:

  1. Creating a simple R markdown file and publishing it online (e.g., on rpubs.com) is not very challenging. However, as you learn more about the variety of available options, logic, and tricks, you’ll find that it gets more interesting and, at times, confusing. That is, the more you want to customize your html file, the more you need to explore those options and tricks. The learning curve may look steep at first, but it pays off well in the long run (even within weeks, you will find it very useful).
  2. Several online resources are available for creating a professional-looking personal website on Github (free!) using only R markdown files. If you are interested, check the following webpages. You can make a pretty website with Hugo themes, which are beyond my knowledge, but highly recommended.

R Markdown Basics

Step 1. Create a new .Rmd file

  1. Go to R Markdown basics.
  2. Copy the sample code and paste it into a new .Rmd file. To create a new .Rmd file, go to File > New File > R Markdown on the menu, accept the default setting (by clicking yes), and delete all the default text.
  3. Once you paste the sample code to the .Rmd file, save it.

Step 2. Check the individual elements of the sample code:

  1. An (optional) YAML header surrounded by ---s.
  2. Chunks of R code surrounded by ```.
  3. Text mixed with simple text formatting like # heading and _italics_.
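
The three elements listed above can be seen together in a very small file. The sketch below is not the sample code from R Markdown basics; it simply writes a hypothetical minimal_example.Rmd (the title, heading, and chunk name are made up for illustration) to a temporary folder using base R, so you can open it and compare it with the sample.

```r
# Build the three building blocks of an .Rmd file as plain text lines.
fence <- strrep("`", 3)  # the three-backtick delimiter that surrounds a chunk

rmd_lines <- c(
  # Element 1: an (optional) YAML header surrounded by ---
  "---",
  'title: "My first report"',   # hypothetical title
  "output: html_document",
  "---",
  "",
  # Element 3: text with simple formatting (a heading and italics)
  "# A first heading",
  "",
  "Some text with _italics_.",
  "",
  # Element 2: a chunk of R code surrounded by three backticks
  paste0(fence, "{r cars-summary}"),
  "summary(cars)",
  fence
)

# Write the file to a temporary folder and show its contents.
rmd_file <- file.path(tempdir(), "minimal_example.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```

Opening the resulting file in RStudio and knitting it should give a one-heading html page with the output of summary(cars).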

In the .Rmd file, code and output are interleaved. You can run (part of) the code in a few different ways:

  1. Execute each line: put your cursor in the line you want to execute and press ctrl/cmd + Enter.
  2. Execute each code chunk: in a code chunk, press ctrl/cmd + Shift + Enter to run all lines in that chunk. Each code chunk has a green “play” button in its top right corner, which does the same thing. You will see the output right below the code chunk, not in the Plots pane.

Step 3. Publish an html file online

  1. To execute the entire document, press ctrl/cmd + Shift + K or click the Knit icon. Depending on your local setup, you’ll see an html file either in the Viewer pane or in a separate window.
  2. On your html file, click Publish on the top right corner. This will lead you to the RPubs website, on which you need to log in (if you don’t have an account there, create one) and specify your preferred URL. Now, you see your work published online and you can share it with anyone with your URL.
  3. You may want to delete your webpage by clicking Delete at the bottom left corner.

Wait, what’s going on under the hood?

  1. When you knit the document, R Markdown sends the .Rmd file to knitr, http://yihui.name/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output.
  2. The markdown file generated by knitr is then processed by pandoc, http://pandoc.org/, which is responsible for creating the finished file.
  3. The advantage of this two-step workflow is that you can create a very wide range of output formats (to create a pdf file from R markdown, you need to install LaTeX, which goes beyond today’s tutorial).
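
To make the two steps above concrete, here is a rough sketch that runs them one at a time on a tiny, hypothetical two_step_demo.Rmd file. It assumes the knitr and rmarkdown packages and pandoc are installed (RStudio ships with pandoc); if they are not, the sketch skips the corresponding step.

```r
# Hypothetical input: a tiny .Rmd file written to a temporary folder.
fence <- strrep("`", 3)
rmd_file <- file.path(tempdir(), "two_step_demo.Rmd")
writeLines(c("# Demo", "", paste0(fence, "{r}"), "1 + 1", fence), rmd_file)

md_file <- NULL
if (requireNamespace("knitr", quietly = TRUE)) {
  # Step 1: knitr executes the chunks and writes a markdown (.md) file
  # that contains both the code and its output.
  md_file <- knitr::knit(
    rmd_file,
    output = file.path(tempdir(), "two_step_demo.md"),
    quiet = TRUE
  )

  # Step 2: pandoc converts the markdown file into the finished document.
  if (requireNamespace("rmarkdown", quietly = TRUE) &&
      rmarkdown::pandoc_available()) {
    rmarkdown::pandoc_convert(
      md_file,
      to = "html",
      output = file.path(tempdir(), "two_step_demo.html")
    )
  }
}
```

In everyday use you do not need to call the two steps separately: rmarkdown::render() performs both in one call.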

Text Formatting with Markdown

  1. Prose in .Rmd files is written in Markdown, a lightweight set of conventions for formatting plain text files.
  2. Markdown is designed to be easy to read and easy to write. It is also very easy to learn.
  3. The guide below shows how to use Pandoc’s Markdown, a slightly extended version of Markdown that R Markdown understands.
  4. Click Help > Markdown Quick Reference.
# this block is written inside a code chunk, to avoid actual effects 
# these tricks work outside of the code chunk 

Text formatting 
------------------------------------------------------------
*italic*  or _italic_
**bold**   __bold__
`code`
superscript^2^ and subscript~2~

Headings
------------------------------------------------------------
# 1st Level Header
## 2nd Level Header
### 3rd Level Header

Lists
------------------------------------------------------------
*   Bulleted list item 1
*   Item 2
    * Item 2a
    * Item 2b
1.  Numbered list item 1
1.  Item 2. The numbers are incremented automatically in the output.

Links and images
------------------------------------------------------------
<http://example.com>
[linked phrase](http://example.com)
![optional caption text](path/to/img.png)

Tables 
------------------------------------------------------------
First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

Exercise

Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.

Code chunks

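To run code inside an .Rmd file, you need to insert a code chunk. There are three ways to do so:

  1. The keyboard shortcut ctrl/cmd + Alt + I
  2. The “insert” button icon in the editor toolbar
  3. Manually typing ```{r} (to open a code chunk) and ``` (to close the chunk).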

Chunk Name

  1. Naming a code chunk is useful because we can easily navigate to specific chunks.
  2. Type the name, here by-name, in the opening line: ```{r by-name}.

Chunk Options

  1. Chunk output can be customised with options, which are arguments supplied to the chunk header.
  2. Knitr provides almost 60 options that you can use to customize your code chunks.
  3. Here we’ll cover the most important chunk options that you’ll use frequently.
  4. You can see the full list at http://yihui.name/knitr/options/.
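
As an illustration of how options are supplied, the sketch below shows one hypothetical chunk. The header is written as a comment (so that it has no actual effect here), and the chunk name mpg-summary is made up. The options shown (echo, message, warning) are standard knitr options; other frequently used ones include eval = FALSE (show the code but do not run it) and include = FALSE (run the code but show neither code nor results).

```r
# Hypothetical chunk header:
#   {r mpg-summary, echo = FALSE, message = FALSE, warning = FALSE}
#
#   mpg-summary      the chunk name
#   echo = FALSE     hide the code in the html file, but keep its results
#   message = FALSE  do not show messages in the html file
#   warning = FALSE  do not show warnings in the html file

# The body of the chunk is ordinary R code.
mpg_summary <- summary(mtcars$mpg)
mpg_summary
```

With this header, a reader of the html file would see only the summary of mpg, not the code that produced it.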

Table

knitr::kable(
  mtcars[1:5, ], 
  caption = "A knitr kable."
)
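A knitr kable.

|                   | mpg  | cyl | disp | hp  | drat | wt    | qsec  | vs | am | gear | carb |
|-------------------|------|-----|------|-----|------|-------|-------|----|----|------|------|
| Mazda RX4         | 21.0 | 6   | 160  | 110 | 3.90 | 2.620 | 16.46 | 0  | 1  | 4    | 4    |
| Mazda RX4 Wag     | 21.0 | 6   | 160  | 110 | 3.90 | 2.875 | 17.02 | 0  | 1  | 4    | 4    |
| Datsun 710        | 22.8 | 4   | 108  | 93  | 3.85 | 2.320 | 18.61 | 1  | 1  | 4    | 1    |
| Hornet 4 Drive    | 21.4 | 6   | 258  | 110 | 3.08 | 3.215 | 19.44 | 1  | 0  | 3    | 1    |
| Hornet Sportabout | 18.7 | 8   | 360  | 175 | 3.15 | 3.440 | 17.02 | 0  | 0  | 3    | 2    |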

(Advanced) Caching

  1. Normally, each knit of a document starts from a completely clean slate.
  2. This is great for reproducibility, because it ensures that you’ve captured every important computation in code.
  3. However, it can be painful if you have some computations that take a long time (e.g., Google API query).
  4. The solution is cache = TRUE. When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.
  5. The caching system must be used with care, because by default it is based on the code only, not its dependencies.
#```{r raw_data} <- the header of the first chunk 
rawdata <- readr::read_csv("a_very_large_file.csv")
#```
#```{r processed_data, cache = TRUE} <- the header of the second chunk 
processed_data <- rawdata %>% 
  filter(!is.na(import_var)) %>% 
  mutate(new_variable = complicated_transformation(x, y, z))
#```
  1. Above, caching the processed_data chunk means that it will get re-run if the dplyr pipeline is changed, but it won’t get re-run if the read_csv() call changes. How can we avoid this problem?
  2. Below, dependson should contain a character vector of every chunk that the cached chunk depends on.
  3. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.
#```{r processed_data, cache = TRUE, dependson = "raw_data"} <- the new header of the second chunk 
processed_data <- rawdata %>% 
  filter(!is.na(import_var)) %>% 
  mutate(new_variable = complicated_transformation(x, y, z))
#```
  1. What if the input data file, a_very_large_file.csv, changes, but the R code does not?
  2. The cache.extra option is an R expression that will invalidate the cache whenever it changes.
  3. Combine cache.extra with file.info(): it returns a bunch of information about the file including when it was last modified.
#```{r raw_data, cache.extra = file.info("a_very_large_file.csv")} <- the new header of the first chunk 
rawdata <- readr::read_csv("a_very_large_file.csv")
#```
  1. As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with knitr::clean_cache().
  2. In the examples above, each chunk is named after the primary object that it creates. This makes it easier to understand the dependson specification.
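
The cache.extra trick above works because the result of file.info() changes whenever the file changes. The sketch below demonstrates this with base R only; it uses a small stand-in file in a temporary folder (not a real large file) and simulates an update by appending one row.

```r
# A small stand-in for a_very_large_file.csv, written to a temporary folder.
data_file <- file.path(tempdir(), "a_very_large_file.csv")
writeLines(c("x,y", "1,2"), data_file)

# File information (size, modification time, etc.) before the change.
info_before <- file.info(data_file)

# Simulate an update of the input data: append one more row.
cat("3,4\n", file = data_file, append = TRUE)

# File information after the change.
info_after <- file.info(data_file)

# The two results differ, so a chunk with
# cache.extra = file.info("a_very_large_file.csv") would be re-run.
identical(info_before, info_after)
info_after$size > info_before$size
```

Because knitr compares the value of cache.extra between knits, any difference in what file.info() returns (for example, the file size or the modification time) is enough to invalidate the cache.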

Global Options

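# Set default options for every chunk in the document.
knitr::opts_chunk$set(
  comment = "#>",   # prefix each line of output with #>
  collapse = TRUE   # show code and its output together in one block
)

# Alternatively, hide the code by default (e.g., when preparing a report).
knitr::opts_chunk$set(
  echo = FALSE
)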

Inline Code

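You can embed R code directly into the text by wrapping it in backticks and starting it with r. For example, this text in the .Rmd file:

We have data about `r nrow(diamonds)` diamonds. Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats. The distribution of the remainder is shown below:

appears as follows in the html file, with the results of the computations inserted into the text:

We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below: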
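
The objects used in the inline code above must exist before the sentence is knit, typically because an earlier chunk created them. The sketch below shows one way this could look. It assumes that diamonds is the dataset from the ggplot2 package and that smaller holds the diamonds of at most 2.5 carats; the definition of smaller is not shown in this document, so treat it as an assumption based on the sentence itself.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # The diamonds dataset from ggplot2 (assumed source of `diamonds`).
  diamonds <- ggplot2::diamonds

  # Assumption: `smaller` keeps the diamonds of at most 2.5 carats.
  smaller <- diamonds[diamonds$carat <= 2.5, ]

  # The two values that the inline code inserts into the text.
  n_all <- nrow(diamonds)
  n_large <- nrow(diamonds) - nrow(smaller)
  print(n_all)
  print(n_large)
}
```

Computing such values in a chunk first and then referring to them in the text keeps the numbers in the prose in sync with the data.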

comma <- function(x) format(x, digits = 2, big.mark = ",")
comma(3452345)
#> [1] "3,452,345"
comma(.12358124331)
#> [1] "0.12"

Troubleshooting

  1. Troubleshooting R Markdown documents can be challenging because you are no longer in an interactive R environment.
  2. The first thing you should always try is to recreate the problem in an interactive session.
  3. Restart R, then “Run all chunks” (either from the Code menu, under Run region, or with the keyboard shortcut Ctrl + Alt + R). If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.
  4. If that doesn’t help, there must be something different between your interactive environment and the R markdown environment. You’re going to need to systematically explore the options.
  5. The most common difference is the working directory: the working directory of an R Markdown document is the directory in which it lives. Check that the working directory is what you expect by including getwd() in a chunk.
  6. Next, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your R markdown session. The easiest way to do that is to set error = TRUE on the chunk causing the problem, then use print() and str() to check that settings are as you expect.
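
The checks described above could be collected in one diagnostic chunk, as in the sketch below. The chunk name debug-env is made up, and the file name is the placeholder used earlier in this document; the items printed are only examples of settings you might compare between your interactive session and the knit.

```r
# Hypothetical chunk header: {r debug-env, error = TRUE}
# With error = TRUE, knitting continues even if a line fails,
# so the error message itself appears in the html file.

# Where is the code running? When knitting, this is the folder of the .Rmd file.
wd <- getwd()
print(wd)

# Can the input file be found from that folder?
print(file.exists("a_very_large_file.csv"))

# Which package libraries and attached packages does this session see?
str(.libPaths())
print(search())
```

Running the same lines in the console and comparing the two outputs is a quick way to spot what differs between the two environments.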