---
title: "R For Data Science Chapter 23 R Markdown"
output: html_notebook
---
# Introduction
# R Markdown
R Markdown provides a unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more.

R Markdown files are designed to be used in three ways:

1. For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.

2. For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).

3. As an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.

Cheatsheets:

* [R Markdown Reference Sheet](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf)

* [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf)


# R Markdown Basics

```{r}
library(ggplot2)
library(dplyr)

smaller <- diamonds %>% 
  filter(carat <= 2.5)
```


We have data about `r nrow(diamonds)` diamonds. Only 
`r nrow(diamonds) - nrow(smaller)` are larger than
2.5 carats. The distribution of the remainder is shown
below:

```{r, echo = FALSE}
smaller %>% 
  ggplot(aes(carat)) + 
  geom_freqpoly(binwidth = 0.01)
``` 


An R Markdown file like this one contains three important types of content:

1. An (optional) YAML header surrounded by `---`s.
2. Chunks of R code surrounded by `` ``` ``.
3. Text mixed with simple text formatting like `# heading` and `_italics_`.




R notebook files show the output inside the editor, while hiding the console. R Markdown files show the output inside the console and do not show output inside the editor. They differ in the value of `output` in their YAML headers. The YAML header for the R notebook is

> output: html_notebook

while the header for the R Markdown file is

> output: html_document

Other common values of `output` are *word_document* for Word documents and *pdf_document* for PDF documents, alongside *html_document* for HTML documents.
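
As a rough illustration of where the `output` value lives, the sketch below writes a small throwaway document to a temporary file and reads its header back with `rmarkdown::yaml_front_matter()`. The file contents and the names `demo_rmd` and `front_matter` are made up for this example.

```{r}
# Write a tiny stand-alone .Rmd file to a temporary location.
# The title and body text are placeholders invented for this sketch.
demo_rmd <- tempfile(fileext = ".Rmd")
writeLines(
  c(
    "---",
    "title: \"Output format demo\"",
    "output: html_document",
    "---",
    "",
    "Some text."
  ),
  demo_rmd
)

# Parse the YAML header and look at the declared output format.
front_matter <- rmarkdown::yaml_front_matter(demo_rmd)
front_matter$output
```

Changing the `output:` line in the temporary file to `html_notebook` or `word_document` and re-running the last two lines should show the new value.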


# Text Formatting with Markdown



Text formatting 


> *italic*  or _italic_

> **bold**   __bold__

> `code`

> superscript^2^ and subscript~2~




> # 1st Level Header

> ## 2nd Level Header

> ### 3rd Level Header

Lists

> *   Bulleted list item 1

> *   Item 2

>    * Item 2a

>    * Item 2b

> 1.  Numbered list item 1

> 1.  Item 2. The numbers are incremented automatically in the output.

Links and images


> <http://example.com>

> [linked phrase](http://example.com)

> ![optional caption text]

Tables 


> First Header  | Second Header
> ------------- | -------------
> Content Cell  | Content Cell
> Content Cell  | Content Cell


# Chunk name

Chunks can be given an optional name: `` ```{r by-name} ``. This has three advantages:

1. You can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:

<img src="http://r4ds.had.co.nz/screenshots/rmarkdown-chunk-nav.png" />


2. Graphics produced by the chunks will have useful names that make them easier to use elsewhere. More on that in other important options.

3. You can set up networks of cached chunks to avoid re-performing expensive computations on every run. More on that below.

There is one chunk name that imbues special behaviour: `setup`. When you're in notebook mode, the chunk named `setup` will be run automatically once, before any other code is run.

## Chunk Options

Chunk output can be customised with options, arguments supplied to the chunk header. Knitr provides almost 60 options that you can use to customise your code chunks. Here we'll cover the most important chunk options that you'll use frequently. You can see the full list at <http://yihui.name/knitr/options/>.

The most important set of options controls whether your code block is executed and what results are inserted in the finished report:


* `eval = FALSE` prevents code from being evaluated. (And obviously if the code is not run, no results will be generated.) This is useful for displaying example code, or for disabling a large block of code without commenting each line.

* `include = FALSE` runs the code, but doesn't show the code or results in the final document. Use this for setup code that you don't want cluttering your report.

* `echo = FALSE` prevents code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don't want to see the underlying R code.

* `message = FALSE` or `warning = FALSE` prevents messages or warnings from appearing in the finished file.

* `results = 'hide'` hides printed output; `fig.show = 'hide'` hides plots.

* `error = TRUE` causes the render to continue even if code returns an error. This is rarely something you'll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your .Rmd. It's also useful if you're teaching R and want to deliberately include an error. The default, `error = FALSE`, causes knitting to fail if there is a single error in the document.
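
To see which values are in effect for the options listed above, one possibility is to query `knitr::opts_chunk$get()`. This is only a small inspection sketch; the variable names `option_names` and `current_values` are invented here.

```{r}
# Names of the chunk options discussed in the list above.
option_names <- c(
  "eval", "include", "echo", "message", "warning", "results", "fig.show"
)

# opts_chunk$get() returns the values currently in effect as a named list.
current_values <- knitr::opts_chunk$get(option_names)
str(current_values)
```

Options written in an individual chunk header override these values for that chunk only.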


# Table

By default, R Markdown prints data frames and matrices as you'd see them in the console:

```{r}
mtcars[1:5, ]
```

If you prefer that data be displayed with additional formatting you can use the `knitr::kable()` function. The code below generates Table 27.1.

```{r}
knitr::kable(
  mtcars[1:5, ], 
  caption = "A knitr kable."
)
```

Read the documentation for `?knitr::kable` to see the other ways in which you can customise the table. For even deeper customisation, consider the xtable, stargazer, pander, tables, and ascii packages. Each provides a set of tools for returning formatted tables from R code.
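
As one example of that customisation, the sketch below uses the `digits` and `col.names` arguments of `knitr::kable()`. The column subset and the header labels are choices made for this illustration, not something the data set prescribes.

```{r}
# Round to one decimal place and use more descriptive column headers.
# The header labels are illustrative; they are not part of mtcars.
formatted_table <- knitr::kable(
  mtcars[1:5, c("mpg", "wt", "qsec")],
  digits = 1,
  col.names = c("Miles per gallon", "Weight (1000 lbs)", "Quarter-mile time (s)"),
  caption = "A kable with rounded values and custom column names."
)
formatted_table
```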


There is also a rich set of options for controlling how figures are embedded. You'll learn about these in saving your plots.


# Caching

Normally, each knit of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you've captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is `cache = TRUE`. When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn't, it will reuse the cached results.
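
The idea of reusing a stored result when the code is unchanged can be sketched in a few lines of plain R. This toy version is not how knitr implements caching; the helper `run_cached()` and the directory `toy_cache_dir` are hypothetical and exist only to show that the lookup key comes from the code text alone.

```{r}
# Toy cache: NOT knitr's implementation, only the same idea in miniature.
toy_cache_dir <- tempfile("toy-cache-")
dir.create(toy_cache_dir)

run_cached <- function(code_text) {
  # Derive the cache key from the text of the code and nothing else.
  code_file <- tempfile()
  writeLines(code_text, code_file)
  key <- unname(tools::md5sum(code_file))
  cache_file <- file.path(toy_cache_dir, paste0(key, ".rds"))

  # Reuse the stored value when this exact code has been seen before.
  if (file.exists(cache_file)) {
    return(list(value = readRDS(cache_file), from_cache = TRUE))
  }

  # Otherwise evaluate the code and store the result for next time.
  value <- eval(parse(text = code_text))
  saveRDS(value, cache_file)
  list(value = value, from_cache = FALSE)
}

first  <- run_cached("sum(1:10)")  # computed and stored
second <- run_cached("sum(1:10)")  # same code, so the stored result is reused
third  <- run_cached("sum(1:11)")  # code changed, so it is recomputed

c(first$from_cache, second$from_cache, third$from_cache)
```

Only the middle call is served from the cache. Anything the code reads from outside, such as a file or an object created elsewhere, plays no part in the key.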

The caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the processed_data chunk depends on the raw_data chunk:


```{r}
rawdata <- readr::read_csv("test.csv")
```


```{r processed_data, cache = TRUE}
processed_data <- rawdata %>% 
  filter(!is.na(import_var)) %>% 
  mutate(new_variable = complicated_transformation(x, y, z))
```


Caching the processed_data chunk means that it will get re-run if the dplyr pipeline is changed, but it won't get rerun if the read_csv() call changes. You can avoid that problem with the dependson chunk option:

```{r processed_data1, cache = TRUE, dependson = "raw_data"}
processed_data <- rawdata %>% 
  filter(!is.na(import_var)) %>% 
  mutate(new_variable = complicated_transformation(x, y, z))
```

`dependson` should contain a character vector of every chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.

Note that the chunks won't update if `a_very_large_file.csv` changes, because knitr caching only tracks changes within the .Rmd file. If you want to also track changes to that file you can use the `cache.extra` option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is `file.info()`: it returns a bunch of information about the file including when it was last modified. Then you can write:

```{r raw_data, cache.extra = file.info("a_very_large_file.csv")}
rawdata <- readr::read_csv("a_very_large_file.csv")
```
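
To see why `file.info()` works as a change detector, the sketch below compares its result before and after a file is modified. The temporary file and its contents are stand-ins invented for this example.

```{r}
# Create a small stand-in data file; its contents are made up for this sketch.
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), demo_csv)
info_before <- file.info(demo_csv)

# Append a row, as if the data had been updated outside the .Rmd file.
write("3,4", demo_csv, append = TRUE)
info_after <- file.info(demo_csv)

# The size differs (and normally so does the modification time), so the
# two results are no longer identical.
info_before$size
info_after$size
identical(info_before, info_after)
```

Because the value of the `cache.extra` expression differs between the two states of the file, knitr treats the cached result as out of date.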

As your caching strategies get progressively more complicated, it's a good idea to regularly clear out all your caches with `knitr::clean_cache()`.

# Global Options

As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them. You can do this by calling knitr::opts_chunk$set() in a code chunk. For example, when writing books and tutorials I set:

```{r}
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE
)
```

This uses my preferred comment formatting, and ensures that the code and output are kept closely entwined. On the other hand, if you were preparing a report, you might set:

```{r}
knitr::opts_chunk$set(
  echo = FALSE
)

```

That will hide the code by default, showing only the chunks you deliberately choose to show (with `echo = TRUE`). You might consider setting `message = FALSE` and `warning = FALSE`, but that would make it harder to debug problems because you wouldn't see any messages in the final document.
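
If you do silence messages and warnings, it may help to keep the previous values so they can be put back when debugging. The following is a minimal sketch of that pattern; the variable name `previous` is an assumption of this example.

```{r}
# Remember the values currently in effect so they can be put back afterwards.
previous <- knitr::opts_chunk$get(c("message", "warning"))

# Silence messages and warnings for the chunks that follow.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
knitr::opts_chunk$get("message")

# Restore the earlier values, for example once debugging output is wanted again.
do.call(knitr::opts_chunk$set, previous)
knitr::opts_chunk$get("message")
```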

# Inline Code

There is one other way to embed R code into an R Markdown document: directly into the text, with: `r `. This can be very useful if you mention properties of your data in the text. For example, in the example document I used at the start of the chapter I had:

We have data about `r nrow(diamonds)` diamonds. Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats. The distribution of the remainder is shown below:


When inserting numbers into text, format() is your friend. It allows you to set the number of digits so you don't print to a ridiculous degree of accuracy, and a big.mark to make numbers easier to read. I'll often combine these into a helper function:

```{r}
comma <- function(x) format(x, digits = 2, big.mark = ",")
comma(3452345)
```
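
When several numbers are formatted in one call, `format()` pads the results to a common width, which can leave stray spaces in a sentence. A possible variant of the helper, here called `comma_trim()` (a name made up for this sketch), passes `trim = TRUE` to drop that padding.

```{r}
# Same helper as above, but without padding to a common width.
comma_trim <- function(x) {
  format(x, digits = 2, big.mark = ",", trim = TRUE)
}

# Compare the padded and the trimmed results for a vector of two numbers.
padded  <- format(c(5, 3452345), digits = 2, big.mark = ",")
trimmed <- comma_trim(c(5, 3452345))

nchar(padded)
nchar(trimmed)
```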

# Troubleshooting 

Troubleshooting R Markdown documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. The first thing you should always try is to recreate the problem in an interactive session. Restart R, then "Run all chunks" (either from the Code menu, under Run region, or with the keyboard shortcut Ctrl + Alt + R). If you're lucky, that will recreate the problem, and you can figure out what's going on interactively.

If that doesn't help, there must be something different between your interactive environment and the R Markdown environment. You're going to need to systematically explore the options. The most common difference is the working directory: the working directory of an R Markdown document is the directory in which it lives. Check the working directory is what you expect by including `getwd()` in a chunk.

Next, brainstorm all the things that might cause the bug. You'll need to systematically check that they're the same in your R session and your R Markdown session. The easiest way to do that is to set `error = TRUE` on the chunk causing the problem, then use `print()` and `str()` to check that settings are as you expect.
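
A diagnostic chunk along these lines could be run once in the console and once inside the document, and the two outputs compared. Which settings to collect is a judgment call; the list below is only a starting point, and `session_facts` is a name invented for this sketch.

```{r}
# Collect a few settings that commonly differ between sessions.
session_facts <- list(
  working_directory = getwd(),
  r_version         = R.version.string,
  library_paths     = .libPaths(),
  search_path       = search()
)

# str() gives a compact overview that is easy to compare by eye.
str(session_facts)
```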

# YAML Header
You can control many other "whole document" settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it originally stood for "yet another markup language" (the official expansion is now "YAML Ain't Markup Language"), and it is designed for representing hierarchical data in a way that's easy for humans to read and write. R Markdown uses it to control many details of the output. Here we'll discuss two: document parameters and bibliographies.

## Parameters
R Markdown documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the `params` field.

For example, a `my_class` parameter can determine which class of cars to display. Parameters are available within the code chunks as a read-only list named `params`.

You can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with !r. This is a good way to specify date/time parameters. 
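
The way a chunk consumes a parameter can be sketched without a parameterised header by using an ordinary list in its place. Everything named `demo_params` below is a stand-in invented for this example; the values `"suv"` and the date are assumptions, not settings taken from this document.

```{r}
library(ggplot2)  # provides the mpg data set
library(dplyr)

# Stand-in for the read-only `params` list. In a real parameterised report
# R Markdown builds this list from the `params:` field of the YAML header;
# the values below are assumptions made for this sketch.
demo_params <- list(
  my_class = "suv",
  start    = as.Date("2015-01-01")  # the kind of value `!r` could supply
)

# Use the parameter to decide which class of cars to keep.
selected_cars <- mpg %>%
  filter(class == demo_params$my_class)

nrow(selected_cars)
class(demo_params$start)
```

In an actual parameterised report the same filter would read `params$my_class`, and a different value can be supplied at render time through the `params` argument of `rmarkdown::render()`.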

## Bibliographies and Citations 

Pandoc can automatically generate citations and a bibliography in a number of styles. To use this feature, specify a bibliography file using the `bibliography` field in your file's header. The field should contain a path from the directory that contains your .Rmd file to the bibliography file:

> bibliography: rmarkdown.bib

You can use many common bibliography formats including BibLaTeX, BibTeX, EndNote, and MEDLINE.

To create a citation within your .Rmd file, use a key composed of '@' + the citation identifier from the bibliography file. Then place the citation in square brackets. Here are some examples:

* Separate multiple citations with a `;`: Blah blah [@smith04; @doe99]. 

* You can add arbitrary comments inside the square brackets: 
Blah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].

* Remove the square brackets to create an in-text citation: @smith04 
says blah, or @smith04 [p. 33] says blah.

* Add a `-` before the citation to suppress the author's name: 
Smith says blah [-@smith04].


When R Markdown renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as # References or # Bibliography.

You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the csl field:

> bibliography: rmarkdown.bib
> csl: apa.csl

# References