R Markdown

Disclaimer: The contents of this document come from Chapter 27, R Markdown, of R for Data Science (Wickham & Grolemund, 2017). This document was prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech, in Spring 2019. For any questions, contact the instructor, Yongsung Lee, Ph.D., via yongsung.lee(at)gatech.edu.

source: http://www.storybench.org/getting-started-with-tidyverse-in-r/

Motivation

Until now, we have used R script files (.R) for our work on our local machines.
When it comes to sharing our work with others (even with future you), it helps to create tidy html files that document what we did, why we did it, and what else we tried but abandoned (because it didn’t work).
These html files are similar to lab notebooks in the natural sciences or engineering. Lab researchers keep track of their activities in lab notebooks, which then benefit not only their future selves but also other team members and a broader audience.
Although we can add # comments to R script files, html files are easier to read, and we can control what to show and what to hide in them. Html files are also better suited to non-technical readers (e.g., planning commissioners or city officials rather than GIS analysts).
We can also publish these html files online (that’s the point of creating html files), so we only need to share a URL rather than a set of files.
R Markdown files (.Rmd) let us create html files from our R work in (relatively) easy and intuitive ways, and they do not require an understanding of html, css, etc.

Examples:

R Markdown Gallery - You can find source code for many of their good examples.
CP6521 course webpage - As you know, we have used R markdown files throughout this semester.
rpubs.com - This website provides a free publishing service for your html files (you don’t need to set up a server), but everything you publish there is open to the public. Many classes use this website as a tool for homework submission, and you will find interesting analyses when you explore recently published articles.

Before we begin today’s tutorial:

Creating a simple R Markdown file and publishing it online (e.g., on rpubs.com) is not very challenging. However, as you learn more about the available options, logic, and tricks, you’ll find it gets more interesting and, at times, more confusing. That is, the more you want to customize your html file, the more you need to explore those options and tricks. The learning curve may look steep at first, but it pays off in the long run (even within weeks, you will find it very useful).
- R Markdown cheatsheet - very useful summary of commonly used techniques
Several online resources are available for creating a professional-looking personal website on Github (free!) with only R Markdown files. If you are interested, check the following webpages. You can make a pretty website with Hugo themes, which are beyond my knowledge, but highly recommended.
- Making free websites with RStudio’s R Markdown
- blogdown package on Github
- blogdown: Creating Websites with R Markdown - One of its authors, Yihui Xie, is the creator of the knitr package, which is the core element of R Markdown.

R Markdown Basics

You need the rmarkdown package, but you don’t need to explicitly install it or load it, as RStudio automatically does both when needed.
Let’s create a simple R Markdown file and publish it online.

Step 1. Create a new .Rmd file

Go to R Markdown basics.
Copy the sample code and paste it into a new .Rmd file. To create a new .Rmd file, go to File > New File > R Markdown on the menu, accept the default settings (by clicking yes), and delete all the default text.
Once you paste the sample code to the .Rmd file, save it.

Step 2. Check the individual elements of the sample code:

An (optional) YAML header surrounded by ---s.
Chunks of R code surrounded by ```.
Text mixed with simple text formatting like # heading and _italics_.

In the .Rmd file, code and output are interleaved. You can run (part of) the code in a few different ways:

Execute each line: put your cursor on the line you want to execute and press Ctrl/Cmd + Enter.
Execute each code chunk: in a code chunk, press Ctrl/Cmd + Shift + Enter to run all lines in that chunk. Each code chunk has a green “play” button in its top right corner, which does the same thing. You will see the outcome right below the code chunk, not in the Plots pane.

Step 3. Publish an html file online

To execute the entire document, press Ctrl/Cmd + Shift + K or click the Knit icon. Depending on your local setup, you’ll see an html file either in the Viewer pane or in a separate window.
On your html file, click Publish in the top right corner. This will lead you to the RPubs website, where you need to log in (if you don’t have an account there, create one) and specify your preferred URL. Now your work is published online, and you can share it with anyone by giving them your URL.
If you later want to delete your webpage, click Delete in the bottom left corner.

Wait, what’s going on under the hood?

When you knit the document, R Markdown sends the .Rmd file to knitr, http://yihui.name/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output.
The markdown file generated by knitr is then processed by pandoc, http://pandoc.org/, which is responsible for creating the finished file.
The advantage of this two-step workflow is that you can create a very wide range of output formats (to create a pdf file from R Markdown, you need to install LaTeX, which goes beyond today’s tutorial).
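The two steps described above can be reproduced by hand. The following is only a rough sketch: it assumes the knitr and rmarkdown packages are installed and that pandoc is available (RStudio bundles a copy), and the tiny document and the file names (demo.Rmd, demo.md, demo.html) are made up for illustration.

```r
# A tiny, made-up .Rmd document: YAML header, one sentence, one code chunk.
fence <- strrep("`", 3)  # three backticks, built here to keep the example readable
rmd_lines <- c(
  "---",
  "title: \"Two-step demo\"",
  "---",
  "",
  "Some *Markdown* text.",
  "",
  paste0(fence, "{r}"),
  "summary(cars$speed)",
  fence
)

# Work in a temporary folder so nothing in your project is touched.
work_dir <- tempfile("rmd-demo-")
dir.create(work_dir)
rmd_file  <- file.path(work_dir, "demo.Rmd")
md_file   <- file.path(work_dir, "demo.md")
html_file <- file.path(work_dir, "demo.html")
writeLines(rmd_lines, rmd_file)

# Step 1: knitr runs the chunks and writes plain Markdown (.md).
knitr::knit(rmd_file, output = md_file, quiet = TRUE)

# Step 2: pandoc converts the Markdown to the finished format (html here).
if (rmarkdown::pandoc_available()) {
  rmarkdown::pandoc_convert(md_file, to = "html", output = html_file)
}
```

In everyday use you would not call these two functions yourself: the Knit button (or rmarkdown::render()) runs both steps in one go.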

Text Formatting with Markdown

Prose in .Rmd files is written in Markdown, a lightweight set of conventions for formatting plain text files.
Markdown is designed to be easy to read and easy to write. It is also very easy to learn.
The guide below shows how to use Pandoc’s Markdown, a slightly extended version of Markdown that R Markdown understands.
Click Help > Markdown Quick Reference.

# The syntax below is shown inside a code chunk so that it is displayed literally rather than rendered.
# In your own document, use it in the text outside of code chunks.

Text formatting 
------------------------------------------------------------
*italic*  or _italic_
**bold** or __bold__
`code`
superscript^2^ and subscript~2~

Headings
------------------------------------------------------------
# 1st Level Header
## 2nd Level Header
### 3rd Level Header

Lists
------------------------------------------------------------
*   Bulleted list item 1
*   Item 2
    * Item 2a
    * Item 2b
1.  Numbered list item 1
1.  Item 2. The numbers are incremented automatically in the output.

Links and images
------------------------------------------------------------
<http://example.com>
[linked phrase](http://example.com)
![optional caption text](path/to/img.png)

Tables 
------------------------------------------------------------
First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

Exercise

Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.

Code chunks

Three ways of inserting a code chunk

Ctrl/Cmd + Alt + I
The “insert” button icon in the editor toolbar
Manually type ```{r} (to open a code chunk) and ``` (to close the chunk).

Chunk Name

Naming a code chunk is useful because we can easily navigate to specific chunks.
To name a chunk by-name, type the name in the opening line: ```{r by-name}.

Chunk Options

Chunk output can be customized with options, which are arguments supplied to the chunk header.
Knitr provides almost 60 options that you can use to customize your code chunks.
Here we’ll cover the most important chunk options that you’ll use frequently.
You can see the full list at http://yihui.name/knitr/options/.

eval = FALSE prevents code from being evaluated. (And obviously if the code is not run, no results will be generated). This is useful for displaying example code, or for disabling a large block of code without commenting each line.
include = FALSE runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.
echo = FALSE prevents code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.
message = FALSE or warning = FALSE prevents messages or warnings from appearing in the finished file.
results = 'hide' hides printed output; fig.show = 'hide' hides plots.
error = TRUE causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your .Rmd. It’s also useful if you’re teaching R and want to deliberately include an error. The default, error = FALSE, causes knitting to fail if there is a single error in the document.
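To see what these options do without creating a file, you can knit a one-chunk document from text and look at the Markdown that comes back. This is a minimal sketch, assuming knitr is installed; knit_chunk() is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper: knit a single chunk containing `1 + 1` with the
# given chunk options and return the resulting Markdown as text.
knit_chunk <- function(options) {
  fence <- strrep("`", 3)  # three backticks
  src <- c(paste0(fence, "{r, ", options, "}"), "1 + 1", fence)
  knitr::knit(text = src, quiet = TRUE)
}

shown   <- knit_chunk("echo = TRUE")      # code and result
no_code <- knit_chunk("echo = FALSE")     # result only
no_run  <- knit_chunk("eval = FALSE")     # code only, nothing is run
silent  <- knit_chunk("include = FALSE")  # code runs, nothing is shown

# Print one of them to inspect the Markdown knitr produced.
cat(no_code, "\n")
```

Printing the other three objects the same way shows which parts (code, result, or neither) survive under each option.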

Table

If you prefer that data be displayed with additional formatting, you can use the knitr::kable function.

knitr::kable(
  mtcars[1:5, ], 
  caption = "A knitr kable."
)

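A knitr kable.

|                   | mpg  | cyl | disp | hp  | drat | wt    | qsec  | vs | am | gear | carb |
|-------------------|------|-----|------|-----|------|-------|-------|----|----|------|------|
| Mazda RX4         | 21.0 | 6   | 160  | 110 | 3.90 | 2.620 | 16.46 | 0  | 1  | 4    | 4    |
| Mazda RX4 Wag     | 21.0 | 6   | 160  | 110 | 3.90 | 2.875 | 17.02 | 0  | 1  | 4    | 4    |
| Datsun 710        | 22.8 | 4   | 108  | 93  | 3.85 | 2.320 | 18.61 | 1  | 1  | 4    | 1    |
| Hornet 4 Drive    | 21.4 | 6   | 258  | 110 | 3.08 | 3.215 | 19.44 | 1  | 0  | 3    | 1    |
| Hornet Sportabout | 18.7 | 8   | 360  | 175 | 3.15 | 3.440 | 17.02 | 0  | 0  | 3    | 2    |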
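kable() also takes a few formatting arguments. The sketch below, assuming knitr is installed, rounds the numbers with digits and relabels the columns with col.names; the column labels and the caption are my own wording, not part of the mtcars data.

```r
# Keep three columns, round to one decimal, and use readable column labels.
tbl <- knitr::kable(
  mtcars[1:3, c("mpg", "wt", "qsec")],
  digits = 1,
  col.names = c("Miles per gallon", "Weight (1000 lbs)", "Quarter-mile time (s)"),
  caption = "A knitr kable with rounded values."
)
tbl
```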

(Advanced) Caching

Normally, each knit of a document starts from a completely clean slate.
This is great for reproducibility, because it ensures that you’ve captured every important computation in code.
However, it can be painful if you have some computations that take a long time (e.g., Google API query).
The solution is cache = TRUE. When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.
The caching system must be used with care, because by default it is based on the code only, not its dependencies.

#```{r raw_data} <- the header of the first chunk 
rawdata <- readr::read_csv("a_very_large_file.csv")
#```

#```{r processed_data, cache = TRUE} <- the header of the second chunk 
processed_data <- rawdata %>% 
  filter(!is.na(import_var)) %>% 
  mutate(new_variable = complicated_transformation(x, y, z))
#```

Above, caching the processed_data chunk means that it will get re-run if the dplyr pipeline is changed, but it won’t get re-run if the read_csv() call changes. How can we avoid this problem?
Use the dependson chunk option, as shown below. It should contain a character vector of every chunk that the cached chunk depends on.
Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.

#```{r processed_data, cache = TRUE, dependson = "raw_data"} <- the new header of the second chunk 
processed_data <- rawdata %>% 
  filter(!is.na(import_var)) %>% 
  mutate(new_variable = complicated_transformation(x, y, z))
#```

What if the input file, a_very_large_file.csv, changes, but the R code does not?
The cache.extra option is an R expression that will invalidate the cache whenever it changes.
Combine cache.extra with file.info(): it returns a bunch of information about the file including when it was last modified.

#```{r raw_data, cache.extra = file.info("a_very_large_file.csv")} <- the new header of the first chunk 
rawdata <- readr::read_csv("a_very_large_file.csv")
#```
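To get a feel for what file.info() reports, and why its result changes when the data file changes, here is a small sketch in base R. It uses a throwaway temporary file as a stand-in for a_very_large_file.csv.

```r
# Create a small temporary file standing in for the real data file.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), csv_file)

before <- file.info(csv_file)
before[, c("size", "mtime")]  # file size in bytes and last-modified time

# Simulate an updated data file by appending one row.
cat("3,4\n", file = csv_file, append = TRUE)
after <- file.info(csv_file)

# The two results differ; a difference like this is what invalidates the cache.
identical(before, after)
```

If timestamps are unreliable in your setup (for example, files copied between machines), a content hash such as tools::md5sum() could be passed to cache.extra instead. That is a suggestion of mine, not something the source covers.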

Additional tricks

As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with knitr::clean_cache().
In the examples above, each chunk is named after the primary object that it creates. This makes it easier to understand the dependson specification.

Global Options

You can set global options up front. They apply to all code chunks, unless a chunk specifies something different in its own header.
When you want code and output kept close to each other,

knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE
)

When you write a report and do not want to include your source code,

knitr::opts_chunk$set(
  echo = FALSE
)
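You can check which global options are currently in effect from the console. The sketch below assumes knitr is installed; in an actual .Rmd file, the opts_chunk$set() call would typically sit in a first chunk with include = FALSE so that the setup code does not clutter the report.

```r
# Look at the default, change a few options globally, and confirm the change.
default_echo <- knitr::opts_chunk$get("echo")

knitr::opts_chunk$set(echo = FALSE, comment = "#>", collapse = TRUE)
changed <- knitr::opts_chunk$get(c("echo", "comment", "collapse"))
str(changed)

# Reset to knitr's defaults so the interactive session is left as it was found.
knitr::opts_chunk$restore()
```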

Inline Code

There is one other way to embed R code in an R Markdown document: directly in the text, with `r `.
The first quote below, taken from the sample .Rmd file, renders as the second one. Note that the first quote uses single quotation marks in place of backticks so that the code is displayed here rather than evaluated.

We have data about ‘r nrow(diamonds)’ diamonds. Only ‘r nrow(diamonds) - nrow(smaller)’ are larger than 2.5 carats. The distribution of the remainder is shown below:

We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:
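You can try inline code without the sample file. The sketch below, assuming knitr is installed, mirrors the sentence above but uses the mtcars data that ships with R; the four_cyl object and the wording of the sentence are made up for illustration.

```r
fence <- strrep("`", 3)  # three backticks

# A hidden setup chunk followed by one sentence with two inline expressions.
src <- c(
  paste0(fence, "{r, include = FALSE}"),
  "four_cyl <- mtcars[mtcars$cyl == 4, ]",
  fence,
  "",
  "We have data about `r nrow(mtcars)` cars. Only `r nrow(four_cyl)` have four cylinders."
)

# knit() replaces each inline expression with its value.
out <- knitr::knit(text = src, quiet = TRUE)
cat(out, "\n")
```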

When inserting numbers into text, format() is your friend.

comma <- function(x) format(x, digits = 2, big.mark = ",")
comma(3452345)
#> [1] "3,452,345"
comma(.12358124331)
#> [1] "0.12"

Troubleshooting

Troubleshooting R Markdown documents can be challenging because you are no longer in an interactive R environment.
The first thing you should always try is to recreate the problem in an interactive session.
Restart R, then “Run all chunks” (either from Code > Run Region, or with the keyboard shortcut Ctrl + Alt + R). If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.
If that doesn’t help, there must be something different between your interactive environment and the R Markdown environment. You’re going to need to systematically explore the options.
The most common difference is the working directory: the working directory of an R Markdown document is the directory in which it lives. Check that the working directory is what you expect by including getwd() in a chunk.
Next, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your R markdown session. The easiest way to do that is to set error = TRUE on the chunk causing the problem, then use print() and str() to check that settings are as you expect.
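One way to make that comparison systematic is to collect the same facts in both places and compare them side by side. The following is a minimal sketch in base R; report_environment() is a hypothetical helper, not part of knitr or rmarkdown, and the list of things it checks is only a starting point.

```r
# Hypothetical helper: gather a few settings that often differ between an
# interactive session and a knit.
report_environment <- function(files = character()) {
  list(
    working_dir = getwd(),           # where relative paths are resolved
    interactive = interactive(),     # TRUE in the console, FALSE when knitting
    r_version   = R.version.string,
    attached    = search(),          # packages and environments on the search path
    files_found = setNames(file.exists(files), files)
  )
}

# Run this once in the console and once inside a chunk, then compare the output.
env <- report_environment("a_very_large_file.csv")
str(env)
```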