"R Markdown allows you to create documents that serve as a record of your analysis. In the world of reproducible research, we want other researchers to easily understand what we did in our analysis, otherwise nobody can be certain that you analysed your data properly. You might choose to create an RMarkdown document as an appendix to a paper or project assignment that you are doing, upload it to an online repository such as Github, or simply to keep as a personal record so you can quickly look back at your code and see what you did.
RMarkdown presents your code alongside its output (graphs, tables, etc.) with conventional text to explain it, a bit like a notebook."
RMarkdown uses Markdown syntax. Markdown is a very simple ‘markup’ language which provides methods for creating documents with headers, images, links etc. from plain text files, while keeping the original plain text file easy to read."
-from Coding Club
Helpful resources:
Install markdown using install.packages(“rmarkdown”,dependencies = TRUE). Open a new markdown docuent by clicking the dropdown icon where you would normally go to create a new script.
library(rmarkdown)
Above is our first chunk. If you’re looking at this in R studio rather than in HTML, you’ll see that code chunks are offset by the grave accents (```) at the beginning at end. The “r” inside curly brackets denotes that you’re using R code, rather than another type of code. Inside the curly brackets, you can also name the chunk and add rules.
If you’re looking at this in HTML, here is a reproduction of the structure of a chunk: ># {r name of chunk here} ># yourcode<-goes here >#
Chunks can be run alone by clicking the green arrow (far right icon if you’re viewing in R Studio). You can also run all chunks above using the downward arrow (right center icon). Finally, you can run a markdown document like you would a normal script. All text outside of the code chunks will be ignored. No more adding and deleting # or seperating important things with #############.
YAML Header The first part of a markdown document is the YAML header, which is enclosed above and below by three dashes (—). You can edit this and set a variety of rules in the header that will alter the entire document.
Here is an example of what a standard YAML header looks like: > — > title: ‘Name of your Markdown Document’ > author: “Your Name” > date: “6/6/2019” > output: html_document > —
Chunks Within chunks, you can write and run code just like you would anywhere else. Within the curly brackets, you can title your chunks (ie: “setup” below). You will not see these titles in the rendered version of the document. You can also add rules here:
include = FALSE means that you do not want to include this code or its outputs in the rendered document (the code will still run).
echo = FALSE means that you want to include the output in your rendered document, but you do not want to see the code. This is useful when you want to display a figure or table, but don’t necessarily need to show the code you used to create it.
warnings = FALSE means “please do not show me”There were 50 warnings()" when I render this."
messages = FALSE means that you don’t want to see the various messages that pop up when your code runs (even the ones that are not errors), but you still want to see the code and output.
fig.align = “left,right,or center” change the position of your rendered figure.
You can use these rules alone or together.
Figures By default, RMarkdown will place graphs by maximising their height, while keeping them within the margins of the page and maintaining aspect ratio. If you have a particularly tall figure, this can mean a really huge graph. To manually set the figure dimensions, you can insert an instruction into the curly braces:
fig.width and fig.height adjust size of rendered figures
Tables Tables can be included in their raw format by calling the table, but they can also be beautified using kable() in the knitr package or formattable() in the formattable package.
Text Finally, you can format text in the final rendered documents using the rules explained thoroughly in the Formatting Style Guide. You can even format equations!
To begin, I always have a “setup” chunk. This is the best place to load packages. I usually use include=FALSE because loading packages always produces weird outputs and messages. This is also a good place to do things like set a shared path, bring in outside functions, etc
For this example, we will use the “ramen” ratings dataset of #tidytuesday. We will use packages in the tidyverse, as well as a few other related helpers.
| variable | class | description |
|---|---|---|
| review_number | integer | Ramen review number, increasing from 1 |
| brand | character | Brand of the ramen |
| variety | character | The ramen variety, eg a flavor, style, ingredient |
| style | character | Style of container (cup, pack, tray, |
| country | character | Origin country of the ramen brand |
| stars | double | 0-5 rating of the ramen, 5 is best, 0 is worst |
If you’re looking at this in markdown, the table above will look quite ugly. If you’re looking at it in HTML however, it will look very clean and neat!
Download data from Git.
ramen_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-04/ramen_ratings.csv")
## Parsed with column specification:
## cols(
## review_number = col_double(),
## brand = col_character(),
## variety = col_character(),
## style = col_character(),
## country = col_character(),
## stars = col_double()
## )
There is one less than ideal quality of loading the entire tidyverse (often I only load dplyr for this very reason). Dplyr, plyr, tidyr, and stats have some conflicts (non-uniquely named functions). Depending on the order you load them in the library (or if you load the entire tidyverse), you may run into this. The easiest way to deal with this is to always load dplyr last.
However, the most sure-fire way to deal with this is to use the “conflicted” package to point out where you duplicated function names may be causing you issues. For example, if you call “filter”, conflicted will write a message in your output.
Error: [conflicted]
filterfound in 2 packages. Either pick the one you want with::* dplyr::filter * stats::filter Or declare a preference withconflict_prefer()* conflict_prefer(“filter”, “dplyr”) * conflict_prefer(“filter”, “stats”)
To remedy this, you can use the “conflict_prefer” function to tell R which package you want to use when you call a specific function.
conflict_prefer("filter","dplyr")
## [conflicted] Will prefer dplyr::filter over any other package
ramen_ratings %>%
filter(stars >= 3.75 & country == "Canada")
## # A tibble: 6 x 6
## review_number brand variety style country stars
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 3076 Rooster Chili Chicken Flavour Nood~ Bowl Canada 3.75
## 2 2997 Rooster Chili Seafood Flavour Pack Canada 4.25
## 3 2958 Rooster Kimchee Flavour Noodle Soup Bowl Canada 3.75
## 4 2073 Great Value Spicy Ramen Noodles Pack Canada 4
## 5 1549 Sapporo Ic~ Chow Mein Japanese Style N~ Pack Canada 5
## 6 1341 Plats Du C~ Cuisine Adventures Chicken~ Bowl Canada 5
Another way to do this is to use double colons to explictly refer to the package within the pipe. This can be quite verbose, but can also be useful if you need to alternate between functions/packages.
ramen_ratings %>%
dplyr::filter(stars >= 4.25 & country == "Vietnam")
## # A tibble: 7 x 6
## review_number brand variety style country stars
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 2354 Vifon Viet Cuisine Bun Rieu Cua So~ Bowl Vietnam 5
## 2 855 Vina Ace~ Bestcook Hot & Sour Shrimp Bowl Vietnam 4.25
## 3 589 Vifon South Korean Style Kim Chee Bowl Vietnam 4.25
## 4 447 Ve Wong Little Prince Bacon Pack Vietnam 4.75
## 5 396 Vina Ace~ Bestcook Hot spicy Tom Yum S~ Bowl Vietnam 4.25
## 6 381 Saigon V~ Kung Fu Artificial Beef Rice~ Pack Vietnam 4.25
## 7 362 Vina Ace~ Daily Hot & Sour With Shrimp Pack Vietnam 4.5
If you want more information about what dplyr is doing here, you can load the package “tidylog” to provide messages about each step, but note that it has some annoying conflicts.
I’ll provide a brief explanation of what each step is doing: (except for the very self-explanatory ones)
gather - go from wide to long format, the “category” is the current column headers, the “value” is the contents of each cell, the minus signs means “don’t include this column when gathering”
group_by - group your data by any the levels of any variable
top_n - select the top (15 in our case) cases
mutate - add (or overwrite) a column called “value” and do something to it. In this case, we are using fct_reorder from “forcats” to order our values by n.
facet_wrap - facets by category, scales = are scales shared across all facets (the default, “fixed”), or do they vary across rows (“free_x”), columns (“free_y”), or both rows and columns (“free”)?
The code for this wrangling/figure is from [@drob](https://github.com/dgrtwo/data-screencasts). The rest of the analysis was inspired by his screencast.
ramen_ratings %>%
gather(category,value, -review_number, -stars) %>%
count(category, value) %>%
group_by(category) %>%
top_n(15,n) %>%
ungroup() %>%
mutate(value = fct_reorder(value,n)) %>%
ggplot(aes(value,n)) +
geom_col() +
facet_wrap(~category, scales = "free_y") +
coord_flip()
Those are some ugly plots, but what I wanted to know was if there was a lot of redundancy among these categories, which there is not.
Here, we will use tidy text techniques to mine the “variety” name for details about this ramen. More on tidytext.
filter - remove all instances where the ramen does not have a star value assigned
unnest_tokens - take apart the variety column and put each piece in a column called “word”
summarise - calculate the mean star value associated with each word, count the n
arrange the word values in descending order by n
print first 10 values in this table.
ramen_ratings %>%
filter(!is.na(stars)) %>%
unnest_tokens(word,variety) %>%
group_by(word) %>%
summarise(avg_rating = mean(stars),n = n()) %>%
arrange(desc(n)) %>%
head(n = 10)
## # A tibble: 10 x 3
## word avg_rating n
## <chr> <dbl> <int>
## 1 noodles 3.57 793
## 2 noodle 3.78 659
## 3 instant 3.58 510
## 4 ramen 3.84 484
## 5 flavour 3.61 459
## 6 flavor 3.68 407
## 7 chicken 3.43 378
## 8 spicy 3.74 349
## 9 beef 3.57 276
## 10 soup 3.61 259
I went through the table and picked out all the words associated with meats. I’m sure you could automate this, but I did it by hand for this example. I made a vector of all the meat possibilities.
meatsvector<-c(as.character(expression(beef, carne, pork, bacon, chicken, chikin, pollo, duck, shrimp, crab, abalone, lobster, fish, scallop, clam, oyster, anchovy, seafood, prawn)))
I filtered the original text_mined dataset to include any ramen that included a word in meatsvector.
rename - renames the column “word” and calls it “meat” - so basically its backwards of how I would think you’d rename somthing, but okay.
mutate_at - rounds the values in column 2 to position 2
formattable - creates a fancy looking table
ramen_ratings %>%
filter(!is.na(stars)) %>%
unnest_tokens(word,variety) %>%
filter(word %in% meatsvector) %>%
group_by(word) %>%
summarise(avg_rating = mean(stars),n = n()) %>%
arrange(desc(n)) %>%
rename("meat" = "word") %>%
mutate_at(2,round,2) %>%
formattable(align = rep("l",3), list(`avg_rating` = color_bar(color = "lightgray", fun = "proportion")))
| meat | avg_rating | n |
|---|---|---|
| chicken | 3.43 | 378 |
| beef | 3.57 | 276 |
| shrimp | 3.50 | 140 |
| seafood | 3.82 | 130 |
| pork | 3.60 | 128 |
| crab | 3.87 | 40 |
| prawn | 4.19 | 30 |
| abalone | 3.02 | 15 |
| duck | 2.74 | 14 |
| lobster | 3.66 | 11 |
| chikin | 4.72 | 10 |
| pollo | 3.52 | 10 |
| fish | 3.67 | 9 |
| scallop | 3.75 | 8 |
| carne | 3.55 | 5 |
| clam | 2.46 | 5 |
| oyster | 2.80 | 5 |
| anchovy | 3.06 | 4 |
| bacon | 3.75 | 4 |
Select the ramens that have a meat in their name. Rename the column “word” to variety.
ramen_by_protein<-ramen_ratings %>%
filter(!is.na(stars)) %>%
unnest_tokens(word,variety) %>%
filter(word %in% meatsvector) %>%
rename("variety" = "word")
Select the ramens that DO NOT have a meat in their name. We do this by using “anti_join” which selects all the ramen reviews that are not in the “ramen by protein” table. Assign the column “variety” to contain “other.”
ramen_no_protein<-ramen_ratings %>%
anti_join(ramen_by_protein, by = "review_number") %>%
mutate(variety = "other")
Bind these two datasets back together, and relevel to set “other” as the reference group.
ramen_ratings_processed<-bind_rows(ramen_by_protein,ramen_no_protein) %>%
mutate(variety = fct_relevel(variety, "other"))
Library “broom” helps make your model outputs more usable.
Here, I wrote the a lm, then used the function “tidy” in broom to save the model output to a tibble. I removed the “intercept”, arranged the estimates in descending order, then plotted the estimates and their confidence intervals.
lm(stars ~ variety, ramen_ratings_processed) %>%
tidy(conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = fct_reorder(term,estimate)) %>%
ggplot(aes(estimate,term)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(lty = 2, xintercept = 0) +
labs(x = "Estimated effect on ramen rating", y = "", title = "Protein type as a predictor of ramen rating", subtitle = "Varieties without named protein are used as the reference level") +
theme_bw()
Conclusion: People like misspelled meat names (yeah, I’m talking about you “chikin”), they don’t want to know what unspecified mystery meats are in their ramen, or they simply prefer vegetarian ramen.
For fast reference, the tidyverse cheat sheets are really helpful:
My personal favorite: dplyr