This is a tutorial on how to use R Markdown for reproducible research.

Here we can type long passages or descriptions of our data without needing to comment out each line with the hash symbol (#). In our first example, we will use the ToothGrowth dataset. In this experiment, guinea pigs (literally) were given different doses of vitamin C to see the effect on the animals’ tooth growth.

To run R code in a Markdown file, we need to mark the sections that contain R code. We call these “code chunks”.

Below is a code chunk:

Toothdata <- ToothGrowth

head(Toothdata)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

As you can see, when you press the play button on the code chunk, the results are printed inline in the R Markdown file.

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(lm(len ~ dose, data = Toothdata))

Figure 1: The tooth growth of guinea pigs given varying amounts of vitamin C

The slope of the regression line is 9.7635714.
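The slope quoted above comes from the fitted model's coefficients. As a minimal sketch (the variable names `slope` and `intercept` are illustrative, not from the original), the values can be pulled out by name with `coef()` instead of by position:

```r
# Refit the model from the text on the built-in ToothGrowth data.
fit <- lm(len ~ dose, data = ToothGrowth)

# coef() returns a named vector with entries "(Intercept)" and "dose".
slope <- unname(coef(fit)["dose"])
intercept <- unname(coef(fit)["(Intercept)"])

# Print the slope rounded for display.
round(slope, 4)
```

Selecting by name is a small design choice: it keeps working if the model formula later gains more terms and the position of `dose` changes.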

Section Headers

We can also put sections and subsections in our R Markdown file, similar to numbers or bullet points in a Word document. This is done with the “#” symbol that we previously used to mark comments in an R script.

First level header

Second level header

Third level header

Make sure that you put a space after the hashtag, otherwise it will not work!

We can also add bullet point-type marks in our R Markdown file.

one item
one item
one item
- one more item
- one more item
- one more item
  - one last item

It’s important to note here that in R Markdown, indentation matters!

First item
Second item
Third item

sub item 1
sub item 2
sub item 3

Block Quotes

We can put really nice quotes into the markdown document. We do this by using the “>” symbol at the start of the line.

“Genes are like the story, and DNA is the language that the story is written in.”

— Sam Kean

Hyperlinks

Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they are in a web browser and will redirect the reader to the material that you are interested in showing them. Here we will use the link to R Markdown’s homepage for this example.

RMarkdown

Formulas

We can also put nicely formatted formulas into Markdown using two dollar signs.

Hardy-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\]

And you can really get complex as well!

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

Code Chunks

Code Chunk Options

There are also options for your R Markdown file that control how knitr interprets a code chunk. The main options are the following.

eval (TRUE or FALSE): whether or not to evaluate the code chunk.

echo (TRUE or FALSE): whether or not to show the code for the chunk; the results will still print either way.

cache: if enabled, the same code chunk will not be evaluated again the next time that knitr is run, as long as the chunk has not changed. Great for code that has long run times.

fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document then to files that are saved separately.

out.width or out.height: the output size of the R plot in the final document.

fig.cap: the text of the figure caption.
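The options above can be set per chunk, but they can also be set once for the whole document. A minimal sketch, assuming the knitr package is installed and using illustrative values (the numbers below are not from the original tutorial):

```r
library(knitr)

# Set document-wide defaults; individual chunks can still override them.
# The specific values here are only examples.
opts_chunk$set(
  eval = TRUE,       # run the code
  echo = TRUE,       # show the code in the output
  cache = FALSE,     # re-run chunks on every knit
  fig.width = 6,     # plot width in inches
  fig.height = 4     # plot height in inches
)

# Read one option back to confirm it was stored.
opts_chunk$get("fig.width")
```

A call like this usually sits in a setup chunk near the top of the .Rmd file so that every later chunk inherits the same defaults.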

We can also add a table of contents to our HTML Document. We do this by altering the YAML code (the weird code chunk at the very top of the document). We can add this:

title: "HTML_Tutorial"
author: "Jacob Johnson"
date: "2025-11-13"
output:
  html_document:
    toc: true
    toc_float: true

This will give us a very nice floating table of contents on the left-hand side of the document.

Tabs

You can also add tabs to a report. To do this, place “{.tabset}” after the header of the section that should hold the tabs. Every subsequent header one level below it becomes a new tab.

Themes

You can also add themes to your HTML document that change the highlighting color and hyperlink color of your html output. This can be nice aesthetically. To do this, you change your theme in the YAML to one of the following:

cerulean journal flatly readable spacelab united cosmo lumen paper sandstone simplex yeti null

You can also change the color by specifying highlight:

default tango pygments kate monochrome espresso zenburn haddock textmate

Code Folding

You can also use the code folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:

code_folding: hide

Summary

There are a TON of options and ways to customize your R code using HTML format. This is also a great way to display a “portfolio” of work if you are trying to market yourself to interested parties.

Data Wrangling with R

First thing is to load the library and look at the top of the data.

library(tidyverse)


??flights

my_data <- nycflights13::flights

head(my_data)

First we will just look at the data on October 14th

filter(my_data, month == 10, day == 14)

If we want to subset this into a new variable, we do the following:

oct_14_flight <- filter(my_data, month == 10, day == 14)

What if you want to both print and save the result? Wrap the assignment in parentheses:

(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))

If you want to filter based on different operators, you can use the following:

Equals ==
Not equal to !=
Greater than >
Less than <
Greater than or equal to >=
Less than or equal to <=

(flight_through_september <- filter(my_data, month < 10))

If we use = instead of == to mean equals, filter() returns an error. The correct comparison is:

(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))

You can also use logical operators to be more selective:

and &
or |
not !

Let’s use the “or” operator to pick flights in March and April.

March_April_Flights <- filter(my_data, month == 3 | month == 4)

The “and” operator picks only the flights on March 4th.

March_4_Flights <- filter(my_data, month == 3 & day == 4)

Non_jan_flights <- filter(my_data, month != 1)
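To see exactly which rows each logical operator keeps, it can help to try them on a table small enough to check by eye. The sketch below uses a made-up five-row tibble (`toy` and its values are hypothetical, not taken from the flights data):

```r
library(dplyr)

# Hypothetical toy data: five flights identified only by month and day.
toy <- tibble(
  month = c(1, 3, 3, 4, 10),
  day   = c(14, 4, 9, 4, 14)
)

# "or": keep rows where either condition is true (March or April).
march_or_april <- filter(toy, month == 3 | month == 4)

# "and": keep rows where both conditions are true (March 4th only).
march_4 <- filter(toy, month == 3 & day == 4)

# "not equal": keep every row that is not in January.
not_january <- filter(toy, month != 1)

nrow(march_or_april)
nrow(march_4)
nrow(not_january)
```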

Arrange

Arrange allows us to arrange the data set based on the variables we desire.

arrange(my_data, year, day, month)

We can also do this in descending fashion

descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the data frame regardless of ascending or descending.
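The behaviour of missing values is easy to confirm on a tiny example. This is a minimal sketch with a hypothetical three-row tibble:

```r
library(dplyr)

# Hypothetical toy data with one missing value.
toy <- tibble(x = c(2, NA, 1))

# Ascending order: the NA row is placed last.
ascending <- arrange(toy, x)$x

# Descending order: the NA row is still placed last.
descending_x <- arrange(toy, desc(x))$x

ascending
descending_x
```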

Select

We can also select specific columns that we want to look at.

calendar <- select(my_data, year, month, day)

print(calendar)

We can also look at a range of columns.

calendar2 <- select(my_data, year:day)

Let’s look at all columns from year through carrier.

calendar3 <- select(my_data, year:carrier)

We can also choose which columns NOT to include.

everything_else <- select(my_data, -(year:day))

In this instance we can also use the “not” operator!

everything_else2 <- select(my_data, !(year:day))

There also are some other helper functions that can help you select the columns or data you’re looking for.

starts_with("xyz") – selects the columns whose names start with xyz.
ends_with("xyz") – selects the columns whose names end with xyz.
contains("xyz") – selects the columns whose names contain xyz.
matches("xyz") – selects the columns whose names match the regular expression xyz.
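The helper functions act on column names, not on the values in the columns. A short sketch on a one-row tibble (the table `toy` is hypothetical; its column names simply mimic the flights data):

```r
library(dplyr)

# Hypothetical one-row table; only the column names matter here.
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = "UA"
)

# Columns whose names start with "dep".
dep_cols <- names(select(toy, starts_with("dep")))

# Columns whose names end with "delay".
delay_cols <- names(select(toy, ends_with("delay")))

# Columns whose names contain "_t".
time_cols <- names(select(toy, contains("_t")))

# Columns whose names match a regular expression.
arr_cols <- names(select(toy, matches("^arr_")))

dep_cols
delay_cols
time_cols
arr_cols
```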

Renaming

head(my_data)

rename(my_data, departure_time = dep_time)

my_data <- rename(my_data, departure_time = dep_time)

Mutate

What if you wanted to add new columns to your data frame? We have the mutate() function for that.

First, let’s make a smaller data frame so we can see what we are doing.

my_data_small <- select(my_data, year:day, distance, air_time)

Let’s calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

What if we wanted to create a new data frame with ONLY our calculations? We can use transmute() for that.

airspeed <- transmute(my_data_small, speed = distance / air_time * 60, speed2 = distance / air_time)

Summarize and group_by()

We can use summarize to run a function on a data column to get a single return.

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))

So we can see here that the average delay is about 12 mins.

We gain additional value in summarize by pairing it with group_by().

by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

As you can see, we now have the delay by the days of the year.

Missing Data

What happens if we don’t tell R what to do with the missing data?

summarize(by_day, delay = mean(dep_delay))
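The effect of a missing value on a grouped mean can be checked on a tiny table. A minimal sketch with hypothetical data (`toy` and `by_day_toy` are illustrative names):

```r
library(dplyr)

# Hypothetical toy data: day 1 has one missing delay, day 2 has none.
toy <- tibble(
  day = c(1, 1, 2),
  dep_delay = c(10, NA, 30)
)
by_day_toy <- group_by(toy, day)

# Without na.rm, any group that contains an NA returns NA.
with_na <- summarize(by_day_toy, delay = mean(dep_delay))

# With na.rm = TRUE, the NA is dropped before the mean is taken.
without_na <- summarize(by_day_toy, delay = mean(dep_delay, na.rm = TRUE))

with_na
without_na
```

Only the group that actually contains a missing value is affected, which is why the summary for the full flights data shows NA for some days and not others.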

We can also filter our data based on NA (which in this data set was canceled flights)

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

Let’s run summarize again on this data.

summarize(not_cancelled, delay = mean(dep_delay))

Counts

sum(is.na(my_data$dep_delay))

We can also count the numbers that are NOT NA

sum(!is.na(my_data$dep_delay))

Piping

With tibble data sets (more on those soon), we can pipe results from one function to the next, so we do not need intermediate variables or dollar signs. Here we summarize the mean departure time for each day.

my_data %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(departure_time, na.rm = TRUE))

Tibbles

library(tibble)

Now we will take the time to explain Tibbles. Tibbles are modified data frames which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.

as_tibble(iris)

As we can see, we have the same data frame, but we have different features.

You can also create a tibble from scratch with tibble()

tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)

You can also use tribble() for basic data table creation.

tribble(
  ~genea, ~geneb, ~genec,
  ######################
  110,      112,     114,
  6,        5,       4
)

Tibbles are built to not overwhelm your console when printing data, only showing the first few lines.

Compare how a tibble and a data frame print:

print(by_day)

as.data.frame(by_day)

head(by_day)

nycflights13::flights %>%
  print(n=10, width = Inf)

Subsetting

Subsetting Tibbles is easy, similar to data.frames

df_tibble <- tibble(nycflights13::flights)

df_tibble

We can subset by column name using the $

df_tibble$carrier

We can subset by position using [[]]

df_tibble[[2]]

If you want to use this in a pipe, you need to use the “.” placeholder.

df_tibble %>%
  .$carrier

Some older functions do not like tibbles, thus you might have to convert them back to data frames.

class(df_tibble)

df_tibble_2 <- as.data.frame(df_tibble)

class(df_tibble_2)

df_tibble

head(df_tibble_2)

Tidyr

library(tidyverse)

How do we make a tidy dataset? The tidyverse follows 3 rules:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

It is impossible to satisfy only 2 of the 3 rules. This leads to the following instructions for tidy data.

Put each data set into a tibble
Put each variable into a column
Profit

Picking one consistent method of data storage makes for easier understanding of your code and what is happening “under the hood” or “behind the scenes”.

Let’s now look at working with tibbles.

bmi <- tibble(women)

bmi %>%
  mutate(bmi = (703 * weight) / (height)^2)

Spreading and Gathering

Sometimes we will find data sets that don’t fit well into a tibble. We will use the built in data from tidyverse for this part.

table4a

As you can see from this data, we have one variable in the first column (country), but the columns 1999 and 2000 are values of the same variable (year) rather than variables themselves. Thus, there are two observations in each row.

To fix this, we can use the gather function

table4a %>%
  gather('1999', '2000', key = 'year', value = 'cases')

Let’s look at another example.

table4b

As you can see, we have the same problem in table 4b.

table4b %>%
  gather('1999', '2000', key = 'year', value = 'population')

Now, what if you wanted to join these two tables? We can use dplyr.

table4a <- table4a %>%
  gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>%
  gather('1999', '2000', key = 'year', value = 'population')

left_join(table4a, table4b)

Spreading

Spreading is the opposite of gathering. Let’s look at table 2.

table2

You can see that we have redundant information in columns 1 and 2. We can fix that by combining rows 1 and 2, 3 and 4, etc.

spread(table2, key = type, value = count)

The key (type) names the column whose values become the new column names, and the value (count) names the column whose values fill those new columns. In summary, spread makes long tables shorter and wider, and gather makes wide tables narrower and longer.
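Because spread and gather are opposites, a round trip should bring a table back to its original shape. The sketch below assumes a small made-up table (`wide`, with hypothetical countries "A" and "B") rather than the built-in tables:

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per country, one column per year.
wide <- tibble(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(30, 40)
)

# gather(): the two year columns become key/value pairs (longer, narrower).
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")

# spread(): the year values become column names again (shorter, wider).
back <- spread(long, key = year, value = cases)

long
back
```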

Separating and Pull

Now what happens when we have two variables stuck in one column?

table3

As you can see, the rate column is just the cases and population combined. We can use separate to fix this.

table3 %>%
  separate(rate, into = c('cases', 'population'))

However, if you notice, the column types are not correct (both new columns are character). We can use convert = TRUE to fix this.

table3 %>%
  separate(rate, into = c('cases', 'population'), convert = TRUE)

You can specify what you want to separate based on.

table3 %>%
  separate(rate, into = c('cases', 'population'), sep = '/', convert = TRUE)

Let’s make this look more tidy.

table3 %>%
  separate(
    year,
    into = c('century', 'year'),
    convert = TRUE,
    sep = 2  # a number splits by position: after the second character
    )
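The two ways of using sep, a separator character versus a position, behave quite differently. A minimal sketch on a hypothetical two-row table (the values in `toy` are made up, not taken from table3):

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in the same layout as table3.
toy <- tibble(
  year = c(1999, 2000),
  rate = c("10/200", "30/400")
)

# A character sep splits wherever that character occurs;
# convert = TRUE turns the resulting strings into integers.
by_char <- separate(toy, rate, into = c("cases", "population"),
                    sep = "/", convert = TRUE)

# A numeric sep splits at a position: here, after the second character.
by_pos <- separate(toy, year, into = c("century", "year"),
                   sep = 2, convert = TRUE)

by_char
by_pos
```

Note that splitting 2000 by position gives a year of 0 after conversion, since the text "00" is read as a number.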

Unite

What happens if we want to do the inverse of separate?

table5

table5 %>%
  unite(data, century, year)

table5 %>%
  unite(data, century, year, sep = '')

Missing Values

There are two types of missing values: NA (explicit) or just no entry (implicit).

gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b','b','b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1, 2, 3, 4, 2, 3, 4)
)

gene_data

The nucleotide count for Gene b run 2 is explicitly missing. The nucleotide count for Gene b run 1 is implicitly missing.

One way we can make implicit missing values explicit is by putting genes in columns.

gene_data %>%
  spread(gene, nuc)

If we want to remove the missing values we can use spread and gather, and na.rm = TRUE

gene_data %>%
  spread(gene, nuc) %>%
  gather(gene, nuc, 'a':'b', na.rm = TRUE)

Another way we can make missing values explicit is complete()

gene_data %>%
  complete(gene, run)

Sometimes an NA is present to represent a value being carried forward.

treatment <- tribble(
  ~ person,       ~treatment,      ~response,
  ############################################
  'Isaac',             1,                 7,
  NA,                  2,                 10,
  NA,                  3,                 9,
  'VDB',               1,                 8,
  NA,                  2,                 11,
  NA,                  3,                 10,
)

treatment

What we can do here is use the fill() option

treatment %>%
  fill(person)

Dplyr

It is rare that you will be working with a single data table. The dplyr package allows you to join two data tables based on common values.

Mutating joins: add new variables to one data frame from the matching observations in another.
Filtering joins: filter observations from one data frame based on whether or not they are present in another.
Set operations: treat observations as if they were set elements.

library(tidyverse)
library(nycflights13)

Let’s pull full carrier names based on letter codes.

airlines

Let’s look at info about airports.

airports

Let’s get info about each plane.

planes

Let’s get some info about the weather at the airports.

weather

Let’s get info on singular flights.

flights

Let’s look at how these tables connect.

Flights -> Planes via the tail number.
Flights -> Airlines via the carrier column.
Flights -> Airports via origin and destination.
Flights -> Weather via origin and year/month/day/hour.

Keys

Keys are unique identifiers for each observation. A primary key uniquely identifies an observation in its own table.

One way to identify a primary key is as follows:

planes %>%
  count(tailnum) %>%
  filter(n>1)

No rows are returned, which indicates that each tail number is unique.

planes

planes %>%
  count(model) %>%
  filter(n>1)
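The count-then-filter check can be read as: a column is a valid primary key when no value appears more than once. A minimal sketch on a hypothetical three-row table (`toy_planes` and its values are made up):

```r
library(dplyr)

# Hypothetical toy table: tail numbers are unique, models are not.
toy_planes <- tibble(
  tailnum = c("N1", "N2", "N3"),
  model   = c("A320", "A320", "B737")
)

# Zero rows back means every tailnum occurs exactly once.
dup_tailnum <- toy_planes %>%
  count(tailnum) %>%
  filter(n > 1)

# One row back ("A320", n = 2) means model cannot be a primary key.
dup_model <- toy_planes %>%
  count(model) %>%
  filter(n > 1)

nrow(dup_tailnum)
dup_model
```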

Mutate Join

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2

flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = 'carrier')

We have now added the airline name to our data frame from the airline data frame.

Other types of joins: Inner joins (inner_join) match a pair of observations when their keys are equal and keep only the matched rows. Outer joins (left_join, right_join, full_join) keep observations that appear in at least one table.
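The difference between the join types shows up most clearly in how many rows survive. A minimal sketch with two hypothetical tables, `x` and `y`, that share only some keys:

```r
library(dplyr)

# Hypothetical toy tables: key 3 exists only in x, key 4 only in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Inner join: only keys present in both tables (1 and 2).
inner <- inner_join(x, y, by = "key")

# Left join: every row of x; val_y is NA where y has no match (key 3).
left <- left_join(x, y, by = "key")

# Full join: every key from either table (1, 2, 3, 4).
full <- full_join(x, y, by = "key")

nrow(inner)
nrow(left)
nrow(full)
```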

Stringr

library(tidyverse)
library(stringr)

You can create strings using single or double quotes.

string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite'

string1
string2

If you forget to close your string, you’ll get this:

string3 <- "where is this string going?"

string3

Just hit escape and try again.

Multiple strings are stored in character vectors

string4 <- c("one", "two", "three")

string4

Measuring string length:

str_length(string3)

str_length(string4)

Let’s combine two strings.

str_c("X", "Y")

str_c(string1, string2)

You can use sep to control how they are separated

str_c(string1, string2, sep = " ")

str_c("x", "y", "z", sep = "_")

Subsetting Strings

You can subset a string using str_sub().

HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4,6)

This extracts characters 4 through 6, which drops the first 3 letters from the strings. Or you can use negatives to count back from the end.

str_sub(HSP, -3, -1)
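Positive and negative positions can select the same characters, which is a quick way to check your understanding of the indices. A minimal sketch on a single string (the variable names are illustrative):

```r
library(stringr)

id <- "HSP123"

# Characters 4 through 6, counting from the start.
from_start <- str_sub(id, 4, 6)

# The last three characters, counting back from the end.
from_end <- str_sub(id, -3, -1)

# Characters 1 through 3: the prefix that the calls above leave out.
prefix <- str_sub(id, 1, 3)

from_start
from_end
prefix
```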

You can convert the case of strings as follows:

HSP
str_to_lower(HSP)

str_to_upper(HSP)

Regular Expression

install.packages("htmlwidgets")

x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')

str_view(x, "G")

str_view(x, "TA")

The next step is “.”, which matches any single character (except a newline).

str_view(x, ".G.")

Anchors allow you to match at the start or the ending.

str_view(x, "^TA")

str_view(x, "TA$")

Character classes/ alternatives

\d matches any digit
\s matches any whitespace
[abc] matches a, b, or c

str_view(x, "TA[GT]")

[^abc] matches anything BUT a, b, or c

str_view(x, "TA[^T]")

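You can also use | to pick between two alternatives. Group the alternatives with parentheses; inside square brackets, | is treated as a literal character.

str_view(x, "TA(G|T)")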
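Alternation with | only works outside of square brackets. Inside a character class, | is just another literal character to match. The sketch below uses str_detect() rather than str_view() so the result is a plain logical vector; the test string "TA|" is a made-up example:

```r
library(stringr)

x <- c("ATTAGA", "CGCCCCCGGAT", "TATTA")

# Parentheses group the alternatives: "TA" followed by G or T.
grouped <- str_detect(x, "TA(G|T)")

# In a character class, | is literal, so "TA|" matches as well.
class_matches_pipe <- str_detect("TA|", "TA[G|T]")

# The grouped form does not match the literal pipe.
group_matches_pipe <- str_detect("TA|", "TA(G|T)")

grouped
class_matches_pipe
group_matches_pipe
```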

Detect Matches

str_detect() returns a logical vector the same length as the input.

y <- c("apple", "banana", "pear")

str_detect(y, "e")

How many common words start with the letter e? (words is a character vector of common words that ships with stringr.)

sum(str_detect(words, "^e"))

Let’s get more complex. What proportion of words end in a vowel? What proportion start with one?

mean(str_detect(words, "[aeiou]$"))

mean(str_detect(words, "^[aeiou]"))

Let’s find all the words that don’t contain “o” or “u”.

no_o <- !str_detect(words, "[ou]")

no_o

Now let’s extract.

words[!str_detect(words, "[ou]")]

You can also use str_count() to say how many matches there are in a string.

str_count(x, "[GC]")

Let’s couple this with mutate.

df <- tibble(
  word = words,
  i = seq_along(word)
)

df

df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )

Final Project

Jacob Johnson

2025-11-13