This is the markdown part of R Markdown. This will appear as formatted text where you can describe what you’re doing, comment on results, etc. Pretty much like your results section, except that it’s in the same document as you R code.

It’s formatted with plain text symbols, yielding italics, bold, math: \(x^2, \sqrt{y}, \sum_i^k m_i\)

\[ \frac{n!}{k!(n-k)!} = \binom{n}{k} \]

This is a heading

This is a subheading

And a subsubheading

The following is an R chunk. You can either insert a new one by typing the first and last lines of the following source code, or just click the ‘insert’ button at the top of the viewer window. As you can see, the information in the header does not end up in the rendered html.

If your cursor is inside this chunk, you can run just that chunk by hitting CMD+SHIFT+ENTER (just like an R script). Also, see the little icons on the side? You can click those to run things too.

myFunction <- function(x){
  return(x+x^2)
}

myFunction(3)

## [1] 12

mean(iris$Sepal.Length)

## [1] 5.843333

model1 <- lm(Sepal.Length ~ Species, data=iris)

summary(model1)

## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

But while you might run individual chunks to check they work, the main point is to produce a nicely formatted document containing both your descriptive text and your R outputs. Hit ‘Knit’ on the toolbar of this panel to do that (or CMD+SHIFT+K).

Your rendered html file should pop up a new window. Notice the ‘Publish’ button top right of that window? If you’ve set up up an RPubs account, you can click this to publish your webpage. Go to rpubs.com/justinsulik to see it.

Here’s another R chunk. It just loads the libraries I’ll need for this session. Hitting ‘Knit’ doesn’t add anything to your global environment, though, so if you want to mess around with the console, you need to load libraries there too.

library(boot)
library(dplyr)
library(tidyr)
library(ggplot2)
library(pander)
library(stargazer)

Avoid copy+pasting or rounding errors

You can also include inline R code. This is great for reporting your findings. You already had to write a command to calculate the mean of your favorite variable. Why not simply include that code in an .Rmd document to produce a report?

The mean of Sepal.Length is 5.8433333 (SD=0.8280661). But who needs all these decimal places? Let’s round the mean: 5.843.

You can extract parameters from the above linear model and report them here, e.g. for species versicolor (\(\beta\) = 0.93, SE = 0.1029579, t = 9.0328194, p = 8.770194210^{-16}).

Again, these need rounding. However, a common error in reported stats is incorrect rounding of p values, and incorrect use of sign (e.g. mixing up ‘=’ and ‘<’). See @nuijten16, for instance. Correct rounding of the p values can be handled automatically.

I’ve added a name to the following chunk header (reportP). You don’t have to but it helps with debugging.

# Define a custom function for reporting p values automaticaly

reportP <- function(pValue){
  if (pValue < 0.001){
    result <- "p < 0.001"
  } else {
    result <- sprintf("p = %.3f", pValue) # inserts a float into a string and simultaneously do rounding
  }
  return(result)
}

modelPvalue <- summary(model1)$coefficients[2,4]

Much nicer to be able to report that the relevant value is p < 0.001. Want to change the number of decimal places or the cutoff? Make one change to the above function, knit, and all the p values in your document will be updated.

Citations and bibliography

There was a citation above [@nuijten16]. To get this to display properly, simply add a bibliography: tag to the YAML header that points to your bibliography file (I use BibDesk to create a .bib file).

Chunk options

You might decide that nobody needs to see your reportP function, though, because it’s not actually part of your analysis. It’s easy control how much detail is shown for a given chunk. For instance, we could add include=FALSE in the header of this chunk, which would exclude it from the output. There should be no code after this paragraph but the code is still run so you can later do stuff with modelPvalue.

Or you might want to output all results, but hide the code that produced them. This is common for tables or graphs.

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

These options can be set once for the whole document, for instance by putting the following chunk at the start of the document (after the YAML header). The eval=FALSE argument in the header means ‘Don’t run this code at all. Just display it’.

opts_chunk$set(echo = FALSE)

Fiddling with these options in the global setting means that it should take you just seconds to produce two documents:

1. Your results section for a manuscript
2. Supplementary material that shows all the working, which can be shared online

One of my favorite chunk options: caching

Sometimes your R chunks might take a while to run. The following bootstrap takes a few seconds to run, but some of my other bootsraps can much longer (e.g. bootstrapping confidence intervals for parameters for complex glmer models). I can tell knitr to cache the results so that it runs once, stores the results, and doesn’t run again unless I make changes to the chunk.

# Function to bootstrap: provides bootstrap samples for the mean of a particular column in the data frame

myBootFunction <- function(data, i, column){
  data <- data[i,]
  bootMean <- mean(data[[column]])
  return(bootMean)
}

# Bootstrap command that calls the above function, specifies which column I'm interested it, and runs the function 10000 times

bootResults <- boot(data=iris, statistic=myBootFunction, column='Sepal.Length', R=10000)

# Extract the CIs (here, using Bias-Corrected Accelerated method) and create a nicely formatted string that can be placed in my report

CIstring <- function(bootOutput){
  CIs <- boot.ci(bootOutput, type='bca')
  lower <- CIs$'bca'[[4]]
  upper <- CIs$'bca'[[5]]
  CIstring <- sprintf("(bootstrapped 95%% CIs [%.3f, %.3f])", lower, upper) # This is the bit that creates the nice, reportable string
  return(CIstring)
}

Mean sepal length in the iris data is 5.843 (bootstrapped 95% CIs [5.716, 5.976]). Obviously it would take only a few more lines of code to create a single function to report both mean and CIs.

Some fancier formatting

Number automatically by adding updating your YAML header as follows:

output: 
  html_document:
    number_sections: true

It’s pretty easy to add a menu bar to skip from section to section.

output: 
  html_document:
    toc: true
    toc_depth: 
    toc_float: true

Since you’re producing html, you can point to a CSS file in the header.

output:
  html_document:
    css: styles.css

You can also include CSS in your .Rmd file. See ‘Custom CSS’ here. Of course, you could just include html tags, for instance making this phrase red.

Tables

The previous table is quite ugly, so how about sprucing it up a little:

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
Min. :4.300	Min. :2.000	Min. :1.000	Min. :0.100	setosa :50
1st Qu.:5.100	1st Qu.:2.800	1st Qu.:1.600	1st Qu.:0.300	versicolor:50
Median :5.800	Median :3.000	Median :4.350	Median :1.300	virginica :50
Mean :5.843	Mean :3.057	Mean :3.758	Mean :1.199	NA
3rd Qu.:6.400	3rd Qu.:3.300	3rd Qu.:5.100	3rd Qu.:1.800	NA
Max. :7.900	Max. :4.400	Max. :6.900	Max. :2.500	NA

We can use the same command for a model summary, but there are specialised libraries for formatting model outputs.

	Estimate	Std. Error	t value	Pr(>\|t\|)
Speciesversicolor	0.93	0.103	9.033	8.77e-16
Speciesvirginica	1.582	0.103	15.37	2.215e-32
(Intercept)	5.006	0.0728	68.76	1.134e-113

Fitting linear model: Sepal.Length ~ Species
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
150	0.5148	0.6187	0.6135


	Dependent variable:

	Sepal.Length

Speciesversicolor	0.930^***
	(0.103)

Speciesvirginica	1.582^***
	(0.103)

Constant	5.006^***
	(0.073)


Observations	150
R²	0.619
Adjusted R²	0.614
Residual Std. Error	0.515 (df = 147)
F Statistic	119.265^*** (df = 2; 147)

Note:	p<0.1; p<0.05; p<0.01

Plotting

One strength of R Markdown is that you can directly include plots directly in your report.

p <- ggplot(data=iris, aes(x=Petal.Length,y=Sepal.Length, color=Species))+geom_point()
p

Want to update your graph without having to export it and then insert in a report?

Now that you have a new graph, you don’t have to export it and then insert it into your Word doc. It’s enough to just write the code and knit it! But if you do want to reuse this graph elsewhere, go look at ./example_files/figure-html/betterPlot.png on your local drive: the file name matches the R chunk name.

Obviously you can hide the above code if you want, but it’s often useful for people to see how you made your graph, especially if it took you hours to work out how to do it.

The size of the above image is specified in the chunk header, but you can also specify global settings in the YAML header.

output:
  html_document:
    fig_width: 5
    fig_height: 5

Smart data processing and manipulation

The packages dplyr and tidyr have nothing to do with markdown or knitr, so this is not an explanation on how to use them. Rather, it’s an illustration of how they might be useful in an .Rmd file, given the present goals (reproducible science, efficient workflow)

They make it easy to keep track of what you’re doing to your data to produce a certain output
They avoid having to
- recursively embed functions
- create temporary objects
- make changes to your dataframe

Simple example: let’s say you want to

create a model
get its summary
format that nicely

This involves functions

lm()
summary()
pander()

# Multiple embedding

pander(summary(lm(Sepal.Length ~ Species, data=iris)))

# Create temporary objects

lm1 <- lm(Sepal.Length ~ Species, data=iris)
summary1 <- summary(lm1)
pander(summary1)

These are not so bad, but imagine you wanted to filter the data so that you’re only looking at one species, calculate area=length*width, center that variable, and then create a model. Lots of embedding (hard to write) or lots of temporary objects (messy, easy to loose track of what temporary file you’re working with). The smart alternative: piping.

Basically piping means “do X, feed the output to Y, feed the output of that to Z, etc.”, so something like X \(\rightarrow\) Y \(\rightarrow\) Z. Much more human-readable than Z(Y(X)) and it doesn’t involve cluttering up your global environment.

dplyr uses %>% for piping, so X %>% Y %>% Z. The above simple example then be:

iris %>% lm(Sepal.Length~Species, .) %>% summary %>% pander

	Estimate	Std. Error	t value	Pr(>\|t\|)
Speciesversicolor	0.93	0.103	9.033	8.77e-16
Speciesvirginica	1.582	0.103	15.37	2.215e-32
(Intercept)	5.006	0.0728	68.76	1.134e-113

Fitting linear model: Sepal.Length ~ Species
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
150	0.5148	0.6187	0.6135

And the more complex example would be:

iris %>%
  filter(Species=='versicolor') %>% # Exclude some data
  mutate(area=Sepal.Length*Sepal.Width, # Mutate creates a new variable
         areaC=scale(area)) %>% #And another one - a scaled, centred version of the previous
  lm(Petal.Length~areaC, .) %>%  
  summary %>% 
  pander

This doesn’t involve changing your original data set in any way, it avoids embedding, and you needn’t create any temporary objects. You don’t have to make a note of what you did, because dplyr piping shows your working in detail. It’s a bit like archaeology, but upside down. A reader can see that you extracted a subset of the data, and then centred the independent variable.

If you just saved (or shared) the output of the lm, you’d lose this information. And if you want to make changes to this in the future, you can just change the relevant line and rerun the anlysis.

The main thing to remember is that, by default, dplyr assumes that the output of the previous function is the first argument in the current function, so the following are equivalent:

# Normal
filter(iris, Species=='versicolor')

# With piping
iris %>% filter(Species=='versicolor')

If you want the output placed anywhere other than as the first argument of the function, use a . to mark its position. So the following are equivalent:

# Normal
lm(Y~X, iris)

# With piping
iris %>% lm(Y~X, .) # Not `iris %>% lm(Y~X)` which would mean `lm(iris, Y~Z)`

The point is that you can pipe your data through a whole bunch of functions before piping it into your model, and you can share how you did that rather than just sharing the model output.

Another great function in dplyr is summarise:

iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Sepal.Length), 
            sd = sd(Sepal.Length),
            min = min(Sepal.Length), 
            max = max(Sepal.Length)) %>%
  pander

Species	mean	sd	min	max
setosa	5.006	0.3524897	4.3	5.8
versicolor	5.936	0.5161711	4.9	7.0
virginica	6.588	0.6358796	4.9	7.9

Be careful. It matters what order you load packages in (library()). Some other packages have a summarise command, so if you load them after dplyr, it would override this function.

Another package, tidyr, allows you to reshape the dataframe (long \(\rightarrow\) wide or wide \(\rightarrow\) long). I’d always struggled a bit with the other alternatives (melt or dcast).

# reshape wide to long with `gather`
# gathers names of columns that start with 'Sepal' into a new column 'varName', with the relevant values in a new column 'value'
# gets rid of the Species column
# prints just the top 10 and bottom 10 columns
# formats it nicely

iris %>% 
  gather(varName, value, starts_with('Sepal')) %>%
  select(-Species) %>% 
  filter(row_number() < 10 | row_number() >290) %>% 
  pander

Petal.Length	Petal.Width	varName	value
1.4	0.2	Sepal.Length	5.1
1.4	0.2	Sepal.Length	4.9
1.3	0.2	Sepal.Length	4.7
1.5	0.2	Sepal.Length	4.6
1.4	0.2	Sepal.Length	5
1.7	0.4	Sepal.Length	5.4
1.4	0.3	Sepal.Length	4.6
1.5	0.2	Sepal.Length	5
1.4	0.2	Sepal.Length	4.4
5.6	2.4	Sepal.Width	3.1
5.1	2.3	Sepal.Width	3.1
5.1	1.9	Sepal.Width	2.7
5.9	2.3	Sepal.Width	3.2
5.7	2.5	Sepal.Width	3.3
5.2	2.3	Sepal.Width	3
5	1.9	Sepal.Width	2.5
5.2	2	Sepal.Width	3
5.4	2.3	Sepal.Width	3.4
5.1	1.8	Sepal.Width	3

For more examples see here or here or here.

Bibliography

A list of citations will be included here if I include bibliography: ../../writing/bib1.bib in the YAML header.

But back to the slides to wrap up!

Introduction to R Markdown

Justin Sulik

March 15, 2017