R Markdown Tutorial

This is a tutorial on how to use R markdown for reproducible research.

Here, we can type long passages or descriptions of our data without the need of “hashing” out our comments with the # symbol. In our first example, we will be using the ToothGrowth dataset. In this experiment, Guinea pigs (literal) were given different amounts of vitamin C to see the effects on the animal’s tooth growth.

To run R code in a markdown file, we need to denote the section that is considered R code. We call these “Code chunks.”

Below is a Code Chunk:

Toothdata <- ToothGrowth

head(Toothdata)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

As you can see, from running the “play” button on the code chunk, the results are printed inline of the R Markdown file.

fit <- lm(len ~ dose, data = ToothGrowth)

b <- fit$coefficients

plot(len ~ dose, data = ToothGrowth)

abline(lm(len ~ dose, data = ToothGrowth))

The slope of the regression is 9.7635714.

Section Headers

We can also put sections and subsections in an R Markdwon file, similar to numbers or bullet points in a word document. This is done with the “#” that we used previously to denote text in a script.

First level Header

Second Level Header

Third Level Header

Make sure you put a space after the hashtag, otherwise it will not work!

We can also add bullet point-type marks.

  • one item
  • one item
  • one item
    • one more item
    • one more item
    • one more item

Its important to note here, in R Markdown, INDENTATION MATTERS!

We can similarly do ordered lists.

  1. First item
  2. Second item
  3. Third item
  • subitem 1
  • subitem 2
  • subitem 3

Block Quotes

We can also put nice quotes into the markdown package. We do this by using the “>” symbol.

“Genes are like the store, and DNA is the language that the story is written in.”

— Sam Kean

Formulas

We can also put nice formatted formulas into Markdown using two dollar signs.

Hardy-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\]

And you can get really complex as well!

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

If you are more interested in how to do these formulas in R Markdown, you can check out this LaTex cheatsheet.

Code chunk options

There are also options for your R Markdown file on how knitr interperits the chunk. There are the following options:

Eval (T or F): whether or not to evaluate the code chunk

Echo (T or F): whether or not to show the code for the chunk, results will still print.

Cache: If enabled, the same code chunk will not be evaluated the next time the document is run. Great for long running processes, but can pose a problem if you change any values above the code.

fig.width or fig.height: The (graphical device) size of R plots in inches. The figures are first written to the knitr document then to files that are saved seperatly.

out.width or out.height: The output size of the R plots in the output document.

fig.cap: the words for the figure caption.

Table of Contents

We can also add a table of contents to our HTML document. We do this by altering our YAML code (the weird code chunk at the VERY top of the document.) We can add this:

title: “HTML Markdown Handbook” author: “Dr. VDB” date: “3/1/2022” output: html_document: toc: true toc_float: true

This will give us a very nice floating table of contents on the right side of our HTML file.

If you wan to change it to having the TOC always show and not appear as you scroll, you can change the setting “collapsed” to FALSE.

Alternatively, you can have numbered sections by using the number_sections: TRUE

Data Wrangling with R

First thing is to load the library and look at the top of the data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
??flights

my_data <- nycflights13::flights

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

First we will just look at the data on October 14th.

filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

If we want to subset this into a new variable, we do the following:

oct_14_flight <- filter(my_data, month == 10, day == 14)

What if you want to do both, print and save the variable?

(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

If you want to filter based on different operators, you can use the following:

Equals: == Not Equal to: != Greater than: > Less than: < Greater than or equal to >= Less than or equal to <=

(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 252,474 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

If we don’t use the == to mean equals, we get this:

(oct_14_flight <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You can also use logical operators to be more selective

And: & Or: | Not: !

Let’s use the “or” function to pick flights in March and April

March_April_Flights <- filter(my_data, month == 3 | month == 4)

March_April_Flights <- filter(my_data, month == 3 & day == 4)

Non_jan_flights <- filter(my_data, month!= 1)

Arrange

Arrange allows us to arrange the dataset based on the variables we desire.

arrange(my_data, year, day, month)

We can also do this in descending fashion

descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the dataframe regardless of ascending or descending.

Select

We can also select specific columns that we want to look at.

calendar <- select(my_data, year, month, day) print(calendar)

We can also look at a range of columns

calendar2 <- select(my_data, year:day)

Let’s look at all columns years through carrier

calendar3 <- select(my_data, year:carrier)

We can also choose which columns NOT to include

everything_else <- select(my_data, -(year:day))

In this instance, we can also use the “not” operator !

everything_else2 <- select(my_data, !(year:day))

There are also some other helper functions that can help you select the columns or data you’re looking for.

starts_with(“xyz”) – will select the values that start with xyz

ends_with(“xyz”) – will select the values that end with xyz

contains(“xyz”) – will select values that contain xyz

matches(“xyz”) – will match the indentical value xyz

Renaming

head(my_data)

rename(my_data, departure_time = dep_time)

my_data <- rename(my_data, departure_time = dep_time)

Mutate

What if you want to add new columns to your data frame? we have the mutate() function for that

First, Let’s make a smaller data frame so we can see what we’re doing

my_data_small <- select(my_data, year:day, distance, air_time)

Let’s calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

What if we wanted to create a new dataframe with ONLY your calculations (transmute)

airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)

Summarize and by_group()

we can use summarize to run a function on a data column to get a single return

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))

So we can see here that the average delay is about 12 minutes

we gain additional value in summarize by pairing it with by_group()

by_day <- group_by(my_data, year, month, day) summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

As you can see, we now have the delay by the days of the year

Missing Data

What happens when we don’t tell R what to do with the missing data?

summarize(by_day, delay = mean(dep_delay))

We can also filter our data based on NA ( which in this dataset was cancelled flights)

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

Let’s run summarize again on this data

summarize(not_cancelled, delay = mean(dep_delay))

counts

we can also count the number of variables that are NA

sum(is.na(my_data$dep_delay))

We can also count the numbers that are NOT NA

sum(!is.na(my_data$dep_delay))

Piping

With tibble datasets (more on them soon), we can pipe results to get rid of the need to use the dollar signs

we can then summarize the number of flights by minutes delayed

my_data %>% group_by(year, month, day) %>% summarize(mean = mean(departure_time, na.rm = TRUE))

Tibbles

library(tibble)

Now we will take the time to explore tibbles. Tibbles are modified data frames which tweak some

of the older features from data frames, R is an old language, and useful things frome 20 years

ago are not as useful anymore.

as_tibble(iris)

As we can see, we have the same data frame, but we have different features

You can also create a tibble from scratch with tibble()

tibble( x = 1:5, y = 1, z = x ^ 2 + y )

You can also use tribble() for basic data table creation

tribble( ~genea, ~ geneb, ~ genec, ######################### 110, 112, 114, 6, 5, 4 )

Tibbles are built to not overwhelm your console when printing data, only showing

the first few lines.

This is how a data frame prints

print(by_day) as.data.frame(by_day) head(by_day)

nycflights13::flights %>% print(n=10, width = Inf)

Subsetting

Subsetting tibbles is easy, similar to data.frames

df_tibble <- tibble(nycflights13::flights)

df_tibble

We can subset by column name using the $

df_tibble$carrier

We can subset by position using [[]]

df_tibble[[2]]

If you want to use this in a pipe, you need to use the “.” placeholder

df_tibble %>% .$carrier

Some older functions do not like tibbles, thus you might have to convert them back to dataframes

class(df_tibble)

df_tibble_2 <- as.data.frame(df_tibble)

class(df_tibble_2)

df_tibble

head(df_tibble_2)

tidyr

library(tidyverse)

How do we make a tidy dataset? Well the tidyverse follows three rules.

1 - each variable must have its own column

2 - each observation has its own row

3 - each value has its own cell

It is impossible to satisfy two of the three rules.

This leads to the following instructions for tidy data

1 - put each dataset into a tibble

2 - put each variable into a column

3 - profit

Picking one consistent method of data storage makes for easier understanding

of your code and what is happening “under the hood” or behind the scenes.

Let’s now look at working with tibbles

bmi <- tibble(women)

bmi %>% mutate(bmi = (703 * weight)/(height)^2)

Gathering

Sometimes you’ll find data sets that don’t fit well into a tibble

We’ll use the built in data from tidyverse for this part

table4a

As you can see from this data, we have one variable in column A (country)

but columns b and c are two of the same. Thus, there are two observations

in each row.

To fix this, we can use the gather function

table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’)

Let’s look at another example

table4b

As you can see we have the same problem in table 4b

table4b %>% gather(“1999”, “2000”, key = “year”, value = “population”)

Now what if we want to join these two tables? We can use dplyr

table4a <- table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’) table4b <- table4b %>% gather(“1999”, “2000”, key = “year”, value = “population”)

left_join(table4a, table4b)

Spreading

Spreading is the opposite of gathering. Let’s look at table 2

print(table2)

You can see that we have redundant info in columns 1 and 2

We can fix that by combining rows 1&2, 3&4, etc.

spread(table2, key = type, value = count)

Type is the key of what we are turning into columns, the value is what becomes rows/observations.

In summary, spread makes long tables shorter and wider

gather makes wide tables narrower and longer

Separating and pull

Now what happens when we have two observations stuck in one column?

table3

As you can see, the rate is just the population and cases combined.

We can use separate to fix this

table3 %>% separate(rate, into = c(“cases”, “population”))

However, if you notice, column type is not correct.

table3 %>% separate(rate, into = c(“cases”, “populate”), conver = TRUE)

You can specify what you want to separate based on.

table3 %>% separate(rate, into = c(“cases”, “populate”), sep = “/”, conver = TRUE)

Let’s make this more tidy

table3 %>% separate ( year, into = c(“century”,“year”), convert = TRUE, sep = 2 )

Unite

what happens if we want to do the inverse of separate?

table5

table5 %>% unite(date, century, year)

table5 %>% unite(date, century, year, sep = ““)

Missing values

There can be two types of missing values, NA (explicit) or just no entry(implicit)

gene_data <- tibble ( gene = c(‘a’, ‘a’, ‘a’, ‘a’, ‘b’, ‘b’, ‘b’), nuc = c(20, 22, 24, 25, NA, 42, 67), run = c(1,2,3,4,2,3,4) )

gene_data

The nucleotide count for gene b run 2 is explicit missing.

the nucleotide count for gene b run 1 is implicitly missing.

One way we can make implicit missing values explicit is by putting runs in columns

gene_data %>% spread(gene, nuc)

If we want to remove the missing values, we can use spread and gather, and na.rm = true

gene_data %>% spread(gene, nuc) %>% gather(gene,nuc, ‘a’:‘b’, na.rm = TRUE)

Another way we can make missing values explicit is complete()

gene_data %>% complete(gene, run)

Sometimes an NA is present to represent a value being carried forward

treatment <- tribble(
person, ~treatment, ~response, #################################################### “Isaac”, 1, 7, NA, 2, 10, NA, 3, 9, “VDB”, 1, 8, NA, 2, 11, NA, 3, 10, )

treatment

What we can do here is use the fill() option

treatment %>% fill(person)

DPLYR

It is rare that you will be working with a simple data table. The DPLYR package allows

you to join two data tables based on common values.

Mutate joins - add new variables to one data frame from the matching observations in another.

Filtering joins - filters observations from one data frame based on whether or not

# they are present in another. # Set operations - treats observations as they are set elements

library(tidyverse) library(nycflights13)

Let’s pull full carrier names based on letter codes

print(airlines)

Let’s get info about airports

print(airports)

Let’s get info about each plane

print(planes)

Let’s get some info on the weather at the airports

print(weather)

Let’s get info on singular flights

print(flights)

Let’s look at how these tables connect

Flights -> planes based on tail number

flights -> airlines through carrier

flights -> airports through origin AND destination

flights -> weather via origin, year/month/day/hour

keys

keys are unique identifiers per observation

primary keys uniquely identifies an observation in its own table.

One way to identify a primary key is as follows:

planes %>% count(tailnum) %>% filter(n>1)

This indicates that the tail number is unique.

planes %>% count(model) %>% filter(n>1)

Mutate join

flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2

flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = ‘carrier’)

We’ve now added the airline name to our data frame from the airline data frame

Other types of joins

Inner joins (inner_join) matches a pair of observations when their key is equal

Outer joins (outer_join) keeps observations that appear in at least one table.

Detect Matches

str_detect() returns a logical vector the same length of input

y <- c(“apple”, “banana”, “pear”)

str_detect(y, “e”)

How many common words contain the letter e?

words

sum(str_detect(words, “e”))

Let’s get more complex, what proportion of words end in a vowel?

mean(str_detect(words, “1”))

Let’s find all the words that don’t contain “o” or “u”

no_o <- !str_detect(words,“[ou]”) no_o

Now let’s extract

words[!str_detect(words,“[ou]”)]

You can also use str_count() to say how many matches there are in string

x <- c(“ATTAGA”, “CGCCCCCGGAT”, “TATTA”) x

str_count(x, “[GC]”)

Let’s couple this with mutate

df <- tibble( word = words, count = seq_along(word) )

df

df %>% mutate( vowels = str_count(words, “[aeiou]”), constonants = str_count(words, “[^aeiou]”) )


  1. aeiou↩︎