This is a tutorial on how to use R Markdown for reproducible research.

Here we can type long passages or descriptions of our data without the need to “hash” out our comments with the # symbol. In the first example, we will be using the ToothGrowth dataset. In this experiment, guinea pigs (literally) were given different amounts of vitamin C to see the effect on the animals’ tooth growth.

To run R code in a markdown file, we need to denote the section that is considered R code. We call these sections “code chunks.”

Below is a code chunk:

Toothdata <- ToothGrowth

head(Toothdata)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

As you can see from pressing the “play” button on the code chunk, the results are printed inline in the R Markdown file.

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(lm(len ~ dose, data = Toothdata))
Figure 1: The tooth growth of guinea pigs when given variable amounts of vitamin C

The slope of the regression line is 9.7635714.
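The slope reported above can be read off the fitted model by name rather than typed in by hand. The following is a minimal sketch using only base R and the built-in ToothGrowth data; the variable name `slope` is an illustrative assumption and does not appear in the original chunk.

```r
# Fit the same linear model that is drawn in Figure 1
fit <- lm(len ~ dose, data = ToothGrowth)

# coef() returns a named vector with "(Intercept)" and "dose"
slope <- coef(fit)[["dose"]]

# Round for display, as one might inside inline R code
round(slope, 4)
```

Extracting the coefficient by name keeps the reported number in sync with the model if the data or formula ever change.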

Section Headers

We can also put sections and subsections in our R Markdown file, similar to numbered headings or bullet points in a Word document. This is done with the “#” that we previously used to denote comments in an R script.

# First level header

## Second level header

### Third level header

Make sure you put a space after the hashtag, otherwise it will not work!

We can also add bullet point-type marks in our R Markdown file.

  • one item
  • one item
  • one item
    • one more item
    • one more item
    • one more item
      • one last item

It’s important to note here that in R Markdown indentation matters!

  1. First item
  2. Second item
  3. Third item
     1. subitem 1
     2. subitem 2
     3. subitem 3

Block Quotes

We can put really nice quotes into the markdown document. We do this by using the “>” symbol.

“I have no special talents. I am only passionately curious.”

— Albert Einstein

Formulas

We can also put nice formatted formulas into Markdown using two dollar signs.

Hardy-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\]

And you can get really complex as well!

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

Code Chunks

Code Chunk Options

There are also options in your R Markdown file that control how knitr interprets each code chunk. Some of the most common options are listed below.

eval (TRUE or FALSE): whether or not to evaluate the code chunk.

echo (TRUE or FALSE): whether or not to show the code for the chunk; the results will still print.

cache: if enabled, the same code chunk will not be re-evaluated the next time the document is knit, as long as the chunk has not changed. Great for code that has LONG run times.

fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document then to files that are saved separately.

out.width or out.height: the output size of the R plots in the final output document.

fig.cap: the words for the figure caption.
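Besides setting these options on individual chunks, the same options can be given document-wide defaults from R code. The following is a minimal sketch that assumes the knitr package is installed; the particular values (6 by 4 inches, hidden code) are illustrative choices, not settings from the original tutorial.

```r
# Assumes the knitr package is installed
library(knitr)

# Set defaults for every chunk that follows in the document
opts_chunk$set(
  echo = FALSE,    # hide the code, keep the results
  eval = TRUE,     # still run the code
  fig.width = 6,   # graphics device width in inches
  fig.height = 4   # graphics device height in inches
)

# Inspect one of the defaults that was just set
opts_chunk$get("echo")
```

A call like this usually sits in a setup chunk near the top of the document, and any single chunk can still override a default in its own header.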

Table of Contents

We can also add a table of contents to our HTML document. We can do this by altering the YAML header (the block of settings at the VERY top of the document). We can add this:

title: "HTML Tutorial"
author: "Rebecca Tingle"
date: "2024-11-11"
output:
  html_document:
    toc: true
    toc_float: true

This will give us a very nice floating table of contents on the left side of the document.

Tabs

You can also add TABS in the report. To do this, place “{.tabset}” after the header of the section that should contain the tabs. Every subsequent header one level below it will become a new tab.

Themes

You can also add themes to your HTML Document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this you change your theme in the YAML to one of the following:

cerulean journal flatly readable spacelab united cosmo lumen paper sandstone simplex yeti null

You can also change the syntax highlighting style by specifying highlight:

default tango pygments kate monochrome espresso zenburn haddock textmate

Code Folding

You can also use the code-folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:

code_folding: hide

Summary

There are a TON of options and ways for you to customize your R code using the HTML format. This is also a great way to display a “portfolio” of your work if you are trying to market yourself to interested parties.

Data Wrangling with R

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
??flights

my_data <- nycflights13::flights

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

First we will just look at the data on the 14th of October.

filter (my_data, month == 10, day == 14)
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we want to subset this into a new variable, we do the following:

oct_14_flight <- filter (my_data, month == 10, day == 14)

What if you want to both print and save the variable?

(oct_14_flight_2 <- filter (my_data, month == 10, day == 14))

If you want to filter based on different operations, you can use the following:

Equals ==

Not equal to !=

Greater than >

Less than <

Greater than or equal to >=

Less than or equal to <=

(flight_through_september <- filter(my_data, month < 10))

If we use = instead of == to test for equality, filter() returns an error, so the line below is commented out:

#(oct_14_flight_2 <- filter (my_data, month = 10, day = 14))

You can also use logical operators to be more selective:

and &

or |

not !

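Let’s use the “or” operator to pick flights in March and April

March_April_Flights <- filter (my_data, month == 3 | month == 4)

Using “and” on the same column returns zero rows, because no flight can be in both March and April:

March_And_April_Flights <- filter (my_data, month == 3 & month == 4)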

March_4th_Flights <- filter (my_data, month == 3 & day == 4)

Non_Jan_Flights <- filter (my_data, month != 1)
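The difference between the logical operators is easier to see on a data frame small enough to check by eye. The following is a minimal sketch assuming dplyr is installed; `toy_flights` and its values are made up for illustration and are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data: one row per flight
toy_flights <- tibble(
  month = c(1, 3, 3, 4, 10),
  day   = c(5, 4, 9, 4, 14)
)

# "or": rows from March or April
march_or_april <- filter(toy_flights, month == 3 | month == 4)

# "and" on the same column: no month is both 3 and 4
march_and_april <- filter(toy_flights, month == 3 & month == 4)

# "and" across two columns: March 4th only
march_4th <- filter(toy_flights, month == 3 & day == 4)

# "not equal": everything except January
non_jan <- filter(toy_flights, month != 1)

nrow(march_or_april)   # 3
nrow(march_and_april)  # 0
nrow(march_4th)        # 1
nrow(non_jan)          # 4
```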

Arrange

Arrange allows us to arrange the dataset based on the variables we desire.

arrange (my_data, year, day, month)

We can also do this in descending order

descending <- arrange (my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the dataframe regardless of ascending or descending.
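The behavior of missing values can be checked on a tiny example. The following is a minimal sketch assuming dplyr is installed; the tibble `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical column with one missing value
toy <- tibble(x = c(2, NA, 1, 3))

ascending  <- arrange(toy, x)
descending_toy <- arrange(toy, desc(x))

ascending$x       # 1 2 3 NA
descending_toy$x  # 3 2 1 NA
```

In both orderings the NA ends up in the last row.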

Select

We can also select specific columns we want to look at.

calendar <- select (my_data, year, month, day)
print(calendar)

We can also look at a range of columns.

calendar2 <- select (my_data, year:day)

Let’s look at all columns from year through carrier

calendar3 <- select (my_data, year:carrier)

We can also choose which columns NOT to include

everything_else <- select (my_data, -(year:day))

In this instance we can also use the “not” operator !

everything_else2 <- select (my_data, !(year:day))

There are also some other helper functions that can help you select the columns you’re looking for.

starts_with("xyz"): selects columns whose names start with xyz

ends_with("xyz"): selects columns whose names end with xyz

contains("xyz"): selects columns whose names contain xyz

matches("xyz"): selects columns whose names match the regular expression xyz
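The helpers above act on column names, not on the values inside the columns. The following is a minimal sketch assuming dplyr is installed; the tibble `toy` only borrows column names from the flights data and its values are made up.

```r
library(dplyr)

# Hypothetical one-row tibble; only the column names matter here
toy <- tibble(dep_time = 1, dep_delay = 2, arr_time = 3,
              arr_delay = 4, carrier = "AA")

starts <- names(select(toy, starts_with("dep")))
ends   <- names(select(toy, ends_with("delay")))
has_rr <- names(select(toy, contains("rr")))
regex  <- names(select(toy, matches("^(dep|arr)_time$")))

starts  # "dep_time" "dep_delay"
ends    # "dep_delay" "arr_delay"
has_rr  # "arr_time" "arr_delay" "carrier"
regex   # "dep_time" "arr_time"
```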

Renaming

head(my_data)

rename(my_data, departure_time = dep_time)

my_data <- rename(my_data, departure_time = dep_time)

Mutate

What if you want to add new columns to your dataframe? We have the mutate() function for that.

First, let’s make a smaller dataframe so that we can see what we’re doing.

my_data_small <- select(my_data, year:day, distance, air_time)

Let’s calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

What if we wanted to create a new dataframe with ONLY our calculations? We can use transmute().

airspeed <- transmute(my_data_small, speed = distance / air_time * 60, speed2 = distance / air_time)
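The difference between mutate() and transmute() shows up in which columns survive. The following is a minimal sketch assuming dplyr is installed; the tibble `toy` is hypothetical, with distance in miles and air_time in minutes as in the flights data.

```r
library(dplyr)

# Hypothetical flights: distance in miles, air_time in minutes
toy <- tibble(distance = c(100, 300), air_time = c(30, 60))

# mutate() keeps the existing columns and adds the new one
with_speed <- mutate(toy, speed = distance / air_time * 60)

# transmute() keeps only the columns that are computed
only_speed <- transmute(toy, speed = distance / air_time * 60)

names(with_speed)  # "distance" "air_time" "speed"
names(only_speed)  # "speed"
only_speed$speed   # 200 300 (miles per hour)
```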

Summarize and group_by()

We can use summarize to run a function on a data column and get a single return value

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))

So we can see here that the average delay is about 12 minutes.

We gain additional value from summarize by pairing it with group_by()

by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

As you can see, we now have the delay by the days of the year.
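The same grouping pattern can be verified by hand on a few rows. The following is a minimal sketch assuming dplyr is installed; `toy` and `by_day_toy` are hypothetical names and values.

```r
library(dplyr)

# Hypothetical delays (minutes) for two days, with one missing value
toy <- tibble(
  day       = c(1, 1, 2, 2),
  dep_delay = c(10, 20, NA, 6)
)

# One summary row per day; na.rm = TRUE drops the missing delay
by_day_toy <- group_by(toy, day)
daily <- summarize(by_day_toy, delay = mean(dep_delay, na.rm = TRUE))

daily$delay  # 15 6
```

Day 1 averages 10 and 20; day 2 has only one non-missing delay, so its mean is that single value.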

Missing Data

What happens if we don’t tell R what to do with the missing data?

summarize(by_day, delay = mean(dep_delay))

We can also filter our data based on NA (which in this dataset represents cancelled flights)

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

Let’s run summarize again on this data

summarize(not_cancelled, delay = mean(dep_delay))

Counts

We can also count the number of values that are NA

sum(is.na(my_data$dep_delay))

We can also count the values that are NOT NA

sum(!is.na(my_data$dep_delay))

Piping

With tibble datasets, we can pipe the result of one function into the next, which gets rid of the need to use the dollar signs

We can then summarize the mean departure time for each day.

my_data %>% group_by(year, month, day) %>% summarize(mean = mean(departure_time, na.rm = TRUE))

Tibbles

library(tibble)

Now we will take the time to explore tibbles. Tibbles are modified dataframes which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.

as_tibble(iris)

As we can see, we have the same data frame, but we have different features

You can also create a tibble from scratch with tibble()

tibble( x = 1:5, y = 1, z = x ^ 2 + y )

You can also use tribble() for basic data table creation

tribble(
  ~genea, ~geneb, ~genec,
  #######################
  110, 112, 114,
  6, 5, 4
)

Tibbles are built to not overwhelm your console when printing data, only showing the first few lines.

print(by_day)
as.data.frame(by_day)
head(by_day)

nycflights13::flights %>% print(n = 10, width = Inf)

Subsetting

Subsetting tibbles is easy, similar to data.frames

df_tibble <- tibble(nycflights13::flights)

df_tibble

We can subset by column name using the $

df_tibble$carrier

We can subset by position using [[]]

df_tibble[[2]]

If you want to use this in a pipe, you need to use the “.” placeholder.

df_tibble %>% .$carrier

Some older functions do not like tibbles, so you might have to convert them back to a data frame

class(df_tibble)

df_tibble_2 <- as.data.frame(df_tibble)

df_tibble

head(df_tibble_2)

Tidyr

library(tidyverse)

How do we make a tidy dataset? Well, the tidyverse follows three rules.

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

It is impossible to satisfy only two of the three rules.

This leads to the following instructions for tidy data:

  1. Put each dataset into a tibble
  2. Put each variable into a column
  3. Profit

Picking one consistent method of storing data makes for easier understanding of your code and what is happening “under the hood” or behind the scenes.

Now let’s look at working with tibbles.

bmi <- tibble(women)
bmi %>% mutate(bmi = (703 * weight)/(height)^2)

Spreading and Gathering

Sometimes you will find datasets that don’t fit well into a tibble

We’ll use the built-in data from tidyverse for this part

table4a

As you can see from this data, we have one variable in the first column (country), but the second and third columns (1999 and 2000) hold values of the same variable. Thus, there are two observations in each row.

To fix this, we can use the gather function

table4a %>% gather('1999', '2000', key = 'year', value = 'cases')

Let’s look at another example

table4b

As you can see we have the same problem in table 4b

table4b %>% gather('1999', '2000', key = 'year', value = 'population')

Now what if we want to join these two tables? We can use dplyr

table4a <- table4a %>% gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>% gather('1999', '2000', key = 'year', value = 'population')

left_join(table4a, table4b)

Spreading

Spreading is the opposite of gathering. Let’s look at table 2

table2

You can see that we have redundant info in columns 1 and 2. You can fix that by combining rows 1 and 2, 3 and 4, etc.

spread(table2, key = type, value = count)

Type is the key of what we are turning into columns, the value is what becomes rows/observations

In summary, spread makes long tables shorter and wider.

Gather makes wide tables narrower and longer.
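The change in shape can be confirmed with a round trip on a very small table. The following is a minimal sketch assuming tidyr and tibble are installed; the tibble `wide` and its counts are invented for illustration and only mimic the layout of table4a.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one column per year
wide <- tibble(country = c("A", "B"), `1999` = c(1, 2), `2000` = c(3, 4))

# gather(): the two year columns become key/value pairs (longer, narrower)
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")
dim(long)  # 4 rows, 3 columns

# spread(): the key column becomes column names again (shorter, wider)
back <- spread(long, key = year, value = cases)
dim(back)  # 2 rows, 3 columns
```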

Separating and pull

Now what happens when we have two values stuck in one column?

table3

As you can see, the rate is just the cases and population combined.

We can use separate to fix this

table3 %>% separate(rate, into = c('cases', 'population'))

However, if you notice, the column type is not correct: the new columns are character. Setting convert = TRUE fixes this.

table3 %>% separate(rate, into = c('cases', 'population'), convert = TRUE)

You can specify what you want to separate based on.

table3 %>% separate(rate, into = c('cases', 'population'), sep = "/", convert = TRUE)

Let’s make this look more tidy

table3 %>% separate(year, into = c('century', 'year'), convert = TRUE, sep = 2)

Unite

What happens if you want to do the inverse of separate?

table5

table5 %>% unite(date, century, year)

table5 %>% unite(date, century, year, sep = '')

Missing Values

There are two types of missing values: NA (explicit) or just no entry (implicit)

gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1, 2, 3, 4, 2, 3, 4)
)

gene_data

The nucleotide count for gene b, run 2 is explicitly missing.

The nucleotide count for gene b, run 1 is implicitly missing.

One way we can make implicit missing values explicit is by spreading the genes into columns

gene_data %>% spread(gene, nuc)

If we want to remove the missing values, we can use spread and gather, and na.rm = TRUE

gene_data %>% spread(gene, nuc) %>% gather(gene, nuc, a:b, na.rm = TRUE)

Another way that we can make implicit values explicit is complete()

gene_data %>% complete(gene, run)

Sometimes an NA is present to represent a value being carried forward

treatment <- tribble(
  ~person, ~treatment, ~response,
  #################################################
  "Harrison", 1, 7,
  NA, 2, 10,
  NA, 3, 9,
  "Becca", 1, 8,
  NA, 2, 11,
  NA, 3, 10
)

treatment

What we can do here is use the fill() function

treatment %>% fill(person)

Dplyr

It is rare that you will be working with a single data table. The dplyr package allows you to join two data tables based on common values.

Mutating joins - add new variables to one data frame from the matching observations in another

Filtering joins - filter observations from one data frame based on whether or not they are present in another

Set operations - treat observations as if they were set elements

library(tidyverse)
library(nycflights13)

Let’s pull full carrier names based on letter codes

airlines

Let’s get info about airports

airports

Let’s get info about each plane

planes

Let’s get some info on the weather at the airports

weather

Let’s get info on singular flights

flights

Let’s look at how these tables connect

Flights -> planes based on tailnum

Flights -> airlines through carrier

Flights -> airports through origin AND dest

Flights -> weather via origin, year/month/day/hour

Keys

Keys are unique identifiers per observation

A primary key uniquely identifies an observation in its own table.

One way to identify a primary key is as follows:

planes %>% count(tailnum) %>% filter(n>1)

This returns no rows, which indicates that tailnum is unique

planes %>% count(model) %>% filter(n>1)
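The count-then-filter check behaves differently for a column that is a key and one that is not. The following is a minimal sketch assuming dplyr is installed; `toy_planes` and its values are hypothetical stand-ins for the planes table.

```r
library(dplyr)

# Hypothetical planes table
toy_planes <- tibble(
  tailnum = c("N1", "N2", "N3"),
  model   = c("A320", "A320", "737")
)

# No tailnum appears more than once, so it can serve as a primary key
dup_tailnum <- toy_planes %>% count(tailnum) %>% filter(n > 1)
nrow(dup_tailnum)  # 0

# model repeats, so it cannot identify a single plane
dup_model <- toy_planes %>% count(model) %>% filter(n > 1)
dup_model$model    # "A320"
```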

Mutate Join

flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2

flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = 'carrier')

We’ve now added the airline name to our dataframe from the airlines dataframe

Other types of joins

Inner joins (inner_join) keep only the observations whose keys match in both tables

Outer joins (left_join, right_join, full_join) keep observations that appear in at least one table.
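Row counts make the difference between the join types visible. The following is a minimal sketch assuming dplyr is installed; `toy_flights` and `toy_airlines`, including the carrier code "ZZ", are made up for illustration.

```r
library(dplyr)

# Hypothetical tables: "ZZ" has no airline entry, "DL" has no flight
toy_flights  <- tibble(flight = 1:3, carrier = c("AA", "UA", "ZZ"))
toy_airlines <- tibble(carrier = c("AA", "UA", "DL"),
                       name    = c("American", "United", "Delta"))

inner <- inner_join(toy_flights, toy_airlines, by = "carrier")
left  <- left_join(toy_flights, toy_airlines, by = "carrier")
full  <- full_join(toy_flights, toy_airlines, by = "carrier")

nrow(inner)  # 2: only carriers present in both tables
nrow(left)   # 3: every flight, with NA name for "ZZ"
nrow(full)   # 4: every flight plus the unmatched "DL" airline
```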

Stringr

library(tidyverse)
library(stringr)

You can create strings using single or double quotes

string1 <- "this is a string"
string2 <- 'to put a double "quote" in your string, use the opposite'

string1
string2

If you forget the closing quote on a string like the one below, the console will show a + prompt and wait for more input:

string3 <- "where is this string going?"

string3

Just hit escape and try again

Multiple strings are stored in character vectors

string4 <- c("one", "two", "three")
string4

Measuring string length

str_length(string3)

str_length(string4)

Let’s combine two strings

str_c("X", "Y")

str_c(string1, string2)

You can use sep to control how they are separated

str_c(string1, string2, sep = " ")

str_c("x", "y", "z", sep = "_")

Subsetting strings

You can subset a string using str_sub()

HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4, 6)

This keeps characters 4 through 6, which just drops the first three letters from the strings

Or you can use negatives to count back from the end

str_sub(HSP, -3, -1)
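Positive and negative positions can point at the same characters, which is easy to confirm. The following is a minimal sketch assuming stringr is installed; the result names `from_start`, `from_end`, and `prefix` are illustrative.

```r
library(stringr)

HSP <- c("HSP123", "HSP234", "HSP456")

# Characters 4 to 6, counted from the start
from_start <- str_sub(HSP, 4, 6)

# The last three characters, counted back from the end
from_end <- str_sub(HSP, -3, -1)

# The first three characters
prefix <- str_sub(HSP, 1, 3)

from_start  # "123" "234" "456"
from_end    # "123" "234" "456"
prefix      # "HSP" "HSP" "HSP"
```

Counting from the end is handy when the strings have different lengths but a fixed-width suffix.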

You can convert the case of strings as follows:

HSP
str_to_lower(HSP)

str_to_upper(HSP)

Regular Expression

install.packages("htmlwidgets")

x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')

str_view(x, "G")

str_view(x, "TA")

The next step is “.”, where the “.” matches any single character

str_view(x, ".G.")

Anchors allow you to match at the start or the end of a string

str_view(x, "^TA")

str_view(x, "TA$")

Character classes/alternatives

\d matches any digit

\s matches any whitespace

[abc] matches a, b, or c

str_view(x, "TA[GT]")

[^abc] matches anything but a, b, or c

str_view(x, "TA[^T]")

You can also use | to pick between two alternatives

str_view(x, "TA(G|T)")
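Because str_view() only highlights matches visually, it can help to check the same patterns with str_detect(), which returns TRUE or FALSE for each string. The following is a minimal sketch assuming stringr is installed; the result names are illustrative.

```r
library(stringr)

x <- c("ATTAGA", "CGCCCCCGGAT", "TATTA")

starts_ta   <- str_detect(x, "^TA")      # TA at the start
ends_ta     <- str_detect(x, "TA$")      # TA at the end
ta_then_gt  <- str_detect(x, "TA[GT]")   # TA followed by G or T
ta_not_t    <- str_detect(x, "TA[^T]")   # TA followed by anything but T
ta_g_or_t   <- str_detect(x, "TA(G|T)")  # same idea using alternation

starts_ta   # FALSE FALSE  TRUE
ends_ta     # FALSE FALSE  TRUE
ta_then_gt  #  TRUE FALSE  TRUE
ta_not_t    #  TRUE FALSE FALSE
ta_g_or_t   #  TRUE FALSE  TRUE
```

Note that "TA[^T]" does not match "TATTA": the first TA is followed by T, and the final TA has no character after it.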

Detect Matches

str_detect() returns a logical vector the same length as the input

y <- c("apple", "banana", "pear")
y

str_detect(y, "e")

How many common words contain the letter e?

words

sum(str_detect(words, "e"))

Let’s get more complex: what proportion of words end in a vowel?

mean(str_detect(words, "[aeiou]$"))

mean(str_detect(words, "^[aeiou]"))

Let’s find all the words that don’t contain “o” or “u”

no_o <- !str_detect(words, "[ou]")

no_o

Now let’s extract

words[!str_detect(words, "[ou]")]

You can also use str_count() to say how many matches there are in a string

x

str_count(x, "[GC]")

Let’s couple this with mutate

df <- tibble(word = words, i = seq_along(word))

df

df %>% mutate(vowels = str_count(word, "[aeiou]"), consonants = str_count(word, "[^aeiou]"))