This is a tutorial on how to use R Markdown for reproducible research.

Here we can type long passages or descriptions of our data without needing to “hash” out our comments with the # symbol. In our first example, we will use the ToothGrowth dataset. In this experiment, guinea pigs (literal ones) were given different amounts of vitamin C to see the effect on the animals’ tooth growth.

To run R code in a markdown file, we need to denote the section that is considered R code. We call these “code chunks.”

Below is a code chunk:

Toothdata <- ToothGrowth

head(Toothdata)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

As you can see from clicking the “play” button on the code chunk, the results are printed inline in the R Markdown file.

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(lm(len ~ dose, data = Toothdata))
Figure 1: The tooth growth of Guinea Pigs when given variable amounts of Vitamin C


The slope of the regression line is 9.7635714.
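The slope quoted above comes straight from the fitted model object. As a minimal sketch (the variable name `slope` is ours and is not part of the original chunk), the coefficient can be pulled out by name with coef():

```r
# Refit the same model used for Figure 1
Toothdata <- ToothGrowth
fit <- lm(len ~ dose, data = Toothdata)

# coef() returns a named vector with entries "(Intercept)" and "dose"
slope <- coef(fit)[["dose"]]
round(slope, 4)
```

In an R Markdown document, a value like this can be dropped into running text with inline code, written as a backtick, the letter r, the expression, and a closing backtick.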

Section Headers

We can also put sections and subsections in our R Markdown file, similar to numbered headings or bullet points in a Word document. This is done with the “#” symbol that we previously used to mark comments in an R script.

# First level header

## Second level header

### Third level header

Make sure that you put a space after the hashtag, otherwise it will not work!

We can also add bulleted lists to our R Markdown file.

  • one item
  • one item
  • one item
    • one more item
    • one more item
    • one more item
      • one last item

It’s important to note here that in R Markdown indentation matters!

  1. First Item
  2. Second Item
  3. Third Item
  1. subitem 1
  2. subitem 2
  3. subitem 3

Block Quotes

We can put really nice quotes into the Markdown document. We do this by using the “>” symbol.

“Genes are like the story, and DNA is the language that that story is written in.”

— Sam Kean

Formulas

We can also put nicely formatted formulas into Markdown using two dollar signs.

Hardy-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\]

And you can get really complex as well!

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

print("Hello World")

code chunks

Code chunk options

There are also options that control how knitr handles each code chunk. The main ones are listed below.

eval (TRUE or FALSE): whether or not to evaluate the code chunk.

echo (TRUE or FALSE): whether or not to show the code for the chunk; the results will still print.

cache: if enabled, the same code chunk will not be re-evaluated the next time the document is knitted, as long as the chunk has not changed. Great for code that has LONG run times.

fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document and then to files that are saved separately.

out.width or out.height: the output size of the R plots IN THE RENDERED DOCUMENT.

fig.cap: the text of the figure caption.
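As a rough sketch of how these options are used, defaults for every chunk can be set once with knitr's opts_chunk$set(), usually in a setup chunk near the top of the document. The specific values below are illustrative assumptions, not settings taken from this tutorial.

```r
library(knitr)

# Set document-wide defaults; individual chunks can still override them
opts_chunk$set(
  eval = TRUE,       # run the code
  echo = FALSE,      # hide the code, keep the results
  cache = FALSE,     # re-run chunks on every knit
  fig.width = 6,     # device width in inches
  fig.height = 4     # device height in inches
)

# Inspect a current default
opts_chunk$get("fig.width")
```

A single chunk can override a default in its own header, for example by adding echo=TRUE or fig.cap="My caption" after the r in the chunk's opening braces.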

Table of Contents

We can also add a table of contents to our HTML document. We do this by altering the YAML header (the weird block of settings at the VERY top of the document). We can add this:

title: "HTML_Tutorial"
author: "Lacey Battlefield"
date: "2024-10-04"
output:
  html_document:
    toc: true
    toc_float: true

This will give us a very nice floating table of contents on the left-hand side of the document.

Tabs

You can also add TABS to your report. To do this, place {.tabset} after the header of the section whose subsections should become tabs. Every subsequent header one level below it will be a new tab.

Themes

You can also add themes to your HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this, you set theme in the YAML to one of the following:

cerulean journal flatly readable spacelab united cosmo lumen paper sandstone simplex yeti null

You can also change the syntax highlighting style of the code by specifying highlight:

default tango pygments kate monochrome espresso zenburn haddock textmate

Code Folding

You can also use the code_folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:

code_folding: hide

Summary

There are a TON of options and ways for you to customize your R Markdown output using the HTML format. This is also a great way to display a “portfolio” of your R work if you are trying to market yourself to interested parties.

# Data wrangling with R

The first thing to do is load the library and look at the top of the data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
??flights

my_data <- nycflights13::flights

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

First we will look only at the flights on October 14th.

filter(my_data, month == 10, day ==14)
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we want to subset this into a new variable, we do the following:

oct_14_flight <- filter(my_data, month == 10, day == 14)

What if you want to both print and save the variable? Wrap the assignment in parentheses:

(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If you want to filter based on different operators, you can use the following:

  • Equals ==
  • Not equal to !=
  • Greater than >
  • Less than <
  • Greater than or equal to >=
  • Less than or equal to <=
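To see the comparison operators on something small, here is a sketch that reuses the built-in ToothGrowth data (60 rows, three dose levels) rather than the flights table. The object names are illustrative only.

```r
library(dplyr)

# Doses of at least 1
high_dose <- filter(ToothGrowth, dose >= 1)

# Everything except the lowest dose (the same rows as above)
not_lowest <- filter(ToothGrowth, dose != 0.5)

# Comma-separated conditions are combined with "and"
vc_top <- filter(ToothGrowth, supp == "VC", dose == 2)

nrow(high_dose)
nrow(vc_top)
```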

(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

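If we use a single = instead of == inside filter(), dplyr stops with an error rather than returning rows, because = is read as naming an argument. The correct call uses ==: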

(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

You can also use logical operators to be more selective:

  • and &
  • or |
  • not !

Let’s use the “or” operator to pick flights in March and April.

March_April_Flights <- filter(my_data, month == 3 | month == 4)

March_4th_Flights <- filter(my_data, month == 3 & day == 4)

Non_jan_flights <- filter(my_data, month != 1)

Arrange

# Arrange allows us to arrange the dataset based on the variable we desire.

arrange(my_data, year, month)

# we can also do this in descending fashion

descending <- arrange(my_data, desc(year), desc(day), desc(month))

# Missing values are always placed at the end of the dataframe regardless of ascending or descending.
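A tiny sketch of that NA behaviour, using a made-up three-row tibble (the data and object names are hypothetical):

```r
library(dplyr)

toy <- tibble(x = c(2, NA, 1))

# Ascending: NA is last
asc <- arrange(toy, x)

# Descending: NA is still last
dsc <- arrange(toy, desc(x))

asc$x
dsc$x
```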

Select

# We can also select specific columns that we want to look at.

calendar <- select(my_data, year, month, day)

print(calendar)

# We can also look at a range of columns

calendar2 <- select(my_data, year:day)

# Lets look at all columns year through carrier

calendar3 <- select(my_data, year:carrier)

# we can also choose which columns NOT to include

everything_else <- select(my_data, -(year:day))

# In this instance we can also use the “not” operator !

everything_else2 <- select(my_data, !(year:day))

# There are also some other helper functions that can help you select the columns or data you’re looking for

# starts_with("xyz") will select the columns whose names start with xyz

# ends_with("xyz") will select the columns whose names end with xyz

# contains("xyz") will select the columns whose names contain xyz

# matches("xyz") will select the columns whose names match the regular expression xyz
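The helpers are easiest to understand on a data frame with patterned column names. A sketch using the built-in iris data, whose columns are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species:

```r
library(dplyr)

names(select(iris, starts_with("Sepal")))
names(select(iris, ends_with("Width")))
names(select(iris, contains("etal")))

# matches() takes a regular expression rather than a fixed string
names(select(iris, matches("^Petal\\.")))
```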

Renaming

head(my_data)

rename(my_data, departure_time = dep_time)

my_data <- rename(my_data, departure_time = dep_time)

Mutate

# what if you want to add new columns to your data frame? we have the mutate() function for that.

# First, lets make smaller data frame so we can see what we are doing.

my_data_small <- select(my_data, year:day, distance, air_time)

# Lets calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

# What if we wanted to create a new dataframe with ONLY your calculations (transmute)

airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)

summarize and group_by()

# we can use summarize to run a function on a data column to get a single return

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))

# so we can see here that the average delay is about 12 minutes

# we gain additional value in summarize by pairing it with group_by()

by_day <- group_by(my_data, year, month, day)

summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

# as you can see, we now have the delay by the days of the year

Missing Data

# what happens if we don’t tell R what to do with the missing data?

summarize(by_day, delay = mean(dep_delay))

# we can also filter our data based on NA (which in this dataset was canceled flights)

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

# Lets run summarize again on this data

summarize(not_cancelled, delay = mean(dep_delay))
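The reason the unfiltered summary returns NA is that any arithmetic involving NA is itself NA. A base R sketch with a made-up vector (the values are hypothetical):

```r
delays <- c(5, NA, 10)

# One missing value makes the whole mean missing
mean(delays)

# Dropping missing values first gives an ordinary mean
mean(delays, na.rm = TRUE)

# Count how many values are missing
sum(is.na(delays))
```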

counts

# we can also count the number of values that are NA

sum(is.na(my_data$dep_delay))

# We can also count the number of values that are NOT NA

sum(!is.na(my_data$dep_delay))

Piping

# with tibble datasets (more on them soon), we can pipe results from one step to the next, which removes the need to pull columns out with the dollar sign

# we can then summarize the mean departure time for each day

my_data %>% group_by(year, month, day) %>% summarize(mean = mean(departure_time, na.rm = TRUE))
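The pipe passes the left-hand result in as the first argument of the next call, so a piped chain and the nested form give the same answer. A sketch on the built-in mtcars data, chosen only because it is small:

```r
library(dplyr)

# Nested: read from the inside out
nested <- summarize(group_by(mtcars, cyl), n = n())

# Piped: read top to bottom
piped <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n())

piped
```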

Tibbles

library(tibble)

Now we will take the time to explore tibbles. Tibbles are modified data frames which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.

as_tibble(iris)

# As we can see, we have the same data frame, but we have different features

# You can also create a tibble from scratch with tibble()

tibble( x = 1:5, y = 1, z = x ^ 2 + y )

# You can also use tribble() for basic data table creation

tribble(
  ~genea, ~geneb, ~genec,
  #######################
  110, 112, 114,
  6, 5, 4
)

# Tibbles are built to not overwhelm your console when printing data, only showing the first few lines.

# This is how a data frame prints

print(by_day)

as.data.frame(by_day)

head(by_day)

nycflights13::flights %>% print(n=10, width = Inf)

Subsetting

# Subsetting tibbles is easy, similar to data.frames

df_tibble <- tibble(nycflights13::flights)

df_tibble

# We can subset by column name using the $

df_tibble$carrier

# we can subset by position using [[]]

df_tibble[[2]]

# If you want to use this in a pipe, you need to use the “.” placeholder

df_tibble %>% .$carrier
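Inside a pipe, dplyr's pull() is an alternative to the "." placeholder. A sketch using iris converted to a tibble (the object names are illustrative):

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

# Three ways to extract the same column as a vector
by_dollar <- iris_tbl$Species
by_index  <- iris_tbl[[5]]
by_pull   <- iris_tbl %>% pull(Species)

identical(by_dollar, by_pull)
```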

# some older functions do not like tibbles, thus you might have to convert them back to data frames

class(df_tibble)

df_tibble_2 <- as.data.frame(df_tibble)

class(df_tibble_2)

head(df_tibble_2)

tidyr

library(tidyverse)

# how do we make a tidy dataset? well the tidyverse follows three rules

#1 - Each variable must have its own column

#2 - Each observation must have its own row

#3 - Each value must have its own cell

# It is impossible to satisfy only two of the three rules.

# This leads to the following instructions for tidy data

#1 put each dataset into a tibble

#2 put each variable into a column

#3 profit

# Picking one consistent method of data storage makes for easier understanding of your code and what is happening “under the hood” or behind the scenes

# Lets now look at working with tibbles

bmi <- tibble(women)

bmi %>% mutate(bmi = (703 * weight)/(height)^2)

Spreading and Gathering

# Sometimes you’ll find datasets that don’t fit well into a tibble

# we’ll use the built-in data from tidyverse for this part

table4a

# As you can see from this data, we have one variable in the first column (country), but the columns 1999 and 2000 are values of the same variable (year). Thus, there are two observations in each row.

# To fix this, we can use the gather function

table4a %>% gather('1999', '2000', key = 'year', value = 'cases')

# lets look at another example

table4b

# As you can see we have the same problem in table 4b

table4b %>% gather("1999", "2000", key = "year", value = "population")

# Now what if we want to join these two tables? We can use dplyr

table4a <- table4a %>% gather('1999', '2000', key = 'year', value = 'cases')

table4b <- table4b %>% gather("1999", "2000", key = "year", value = "population")

left_join(table4a, table4b)

Spreading

# Spreading is the opposite of gathering

table2

# You can see that we have redundant info in columns 1 and 2

# We can fix that by combining rows 1&2, 3&4, etc.

spread(table2, key = type, value = count)

# type is the key: its values become the new column names; count is the value: it fills the cells of those new columns

# In summary, spread makes long tables shorter and wider

# gather makes wide tables narrower and longer.
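In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(), although the older functions still work. As a sketch of the equivalent calls on the same built-in tables (referenced here as tidyr::table4a and tidyr::table2 so that earlier reassignments do not interfere):

```r
library(tidyr)

# gather() equivalent: the two year columns become year/cases pairs
long <- pivot_longer(tidyr::table4a, c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")

# spread() equivalent: one new column per value of `type`
wide <- pivot_wider(tidyr::table2, names_from = type, values_from = count)

dim(long)
names(wide)
```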

Separating and pull

# Now what happens when we have two observations stuck in one column?

table3

# As you can see, the rate is just the cases and population combined.

# we can use separate to fix this

table3 %>% separate(rate, into = c("cases", "population"))

# However, if you notice, the column type is not correct (both new columns are character).

table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE)

# You can specify what you want to separate based on.

table3 %>% separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# Lets make this look more tidy

table3 %>% separate(year, into = c("century", "year"), convert = TRUE, sep = 2)

Unite

# what happens if we want to do the inverse of separate?

table5

table5 %>% unite(date, century, year)

table5 %>% unite(date, century, year, sep = "")

Missing Values

# There can be two types of missing values. NA (explicit) or just no entry (implicit)

gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1, 2, 3, 4, 2, 3, 4)
)

gene_data

# The nucleotide count for gene b run 2 is explicitly missing.

# The nucleotide count for gene b run 1 is implicitly missing.

# one way we can make implicit missing values explicit is by putting the genes in columns; gathering back with na.rm = TRUE then drops the missing values again

gene_data %>% spread(gene, nuc) %>% gather(gene, nuc, a:b, na.rm = TRUE)

# Another way that we can make implicit values explicit is complete()

gene_data %>% complete(gene, run)

# sometimes an NA is present to represent a value being carried forward

treatment <- tribble(
  ~person, ~treatment, ~response,
  ###############################
  "Isaac", 1, 7,
  NA, 2, 10,
  NA, 3, 9,
  "VDB", 1, 8,
  NA, 2, 11,
  NA, 3, 10
)

treatment

# what we can do here is use the fill() option

treatment %>% fill(person)

dplyr

# It is rare that you will be working with a single data table. The dplyr package allows you to join two data tables based on common values.

# Mutating joins - add new variables to one data frame from the matching observations in another

# Filtering joins - filter observations from one data frame based on whether or not they are present in another.

library(tidyverse)

library(nycflights13)

# lets pull full carrier names based on letter codes

airlines

# lets get info about airports

airports

# lets get info about each plane

planes

# lets get info on the weather at the airports

weather

# lets get info on singular flights

flights

# lets look at how these tables connect

# Flights -> planes based on tail number

# Flights -> airlines through carrier

# Flights -> airports origin and dest

# Flights -> weather via origin, year/month/day/hour

keys

# keys are unique identifiers per observation

# primary key uniquely identifies an observation in its own table.

# One way to identify a primary key is as follows:

planes %>% count(tailnum) %>% filter(n>1)

# no rows come back, which indicates that each tail number is unique, so tailnum works as a primary key

planes %>% count(model) %>% filter(n>1)

Mutate join

flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2

flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = 'carrier')

# We’ve now added the airline name to our data frame from the airline data frame

# Other types of joins

# Inner joins (inner_join) match a pair of observations when their keys are equal

# Outer joins (left_join, right_join, full_join) keep observations that appear in at least one table.
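A compact way to compare the join types is with the two small demo tables that ship with dplyr, band_members and band_instruments, which share a name column. This is only a sketch; the result names are ours.

```r
library(dplyr)

# Keeps only names present in both tables
inner <- inner_join(band_members, band_instruments, by = "name")

# Keeps every row of the left table
left <- left_join(band_members, band_instruments, by = "name")

# Keeps every name from either table
full <- full_join(band_members, band_instruments, by = "name")

c(inner = nrow(inner), left = nrow(left), full = nrow(full))
```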

Stringr

library(tidyverse)

library(stringr)

# You can create strings using single or double quotes

string1 <- "this is a string"

string2 <- 'to put a "quote" in your string, use the opposite'

string1

string2

# if you forget to close your string, the console shows a + prompt and waits for the rest of the input

string3 <- "where is this string going?"

string3

# if that happens, just hit escape and try again

# multiple strings are stored in character vectors

string4 <- c("one", "two", "three")

string4

# measuring string length

str_length(string3)

str_length(string4)

# lets combine two strings

str_c("X", "Y")

str_c(string1, string2)

# you can use sep to control how they are separated

str_c(string1, string2, sep = " ")

str_c("x", "y", "z", sep = "_")

Subsetting strings

# you can subset a string using str_sub()

HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4, 6)

# This keeps characters 4 through 6, dropping the first three letters from each string

# Or you can use negatives to count back from the end

str_sub(HSP, -3, -1)

# you can convert the case of strings as follows:

HSP

str_to_lower(HSP)

# str_to_upper()

Regular Expression

install.packages("htmlwidgets")

x <- c("ATTAGA", "CGCCCCCGGAT", "TATTA")

str_view(x, "G")

str_view(x, "TA")

# The next step is ".", which matches any single character

str_view(x, ".G.")

# Anchors allow you to match at the start or the end

str_view(x, "^TA")

str_view(x, "TA$")

# character classes/alternatives

# \d matches any digit

# \s matches any whitespace

# [abc] matches a, b, or c

str_view(x, "TA[GT]")

[^abc] matches anything BUT a, b, or c

str_view(x, "TA[^T]")

you can also use | to pick between two alternatives

str_view(x, "TA(G|T)")
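Inside square brackets, | is just another literal character, so alternation needs parentheses instead. A sketch showing the difference; the test string "TA|" is contrived purely for illustration:

```r
library(stringr)

x <- c("ATTAGA", "CGCCCCCGGAT", "TATTA")

# TA followed by G or T
hits <- str_detect(x, "TA(G|T)")
hits

# A character class containing "|" also matches a literal pipe
class_match <- str_detect("TA|", "TA[G|T]")

# Grouped alternation does not
group_match <- str_detect("TA|", "TA(G|T)")

c(class_match, group_match)
```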

Detect Matches

str_detect() returns a logical vector the same length as the input

y <- c("apple", "banana", "pear")

y

str_detect(y, "e")

How many common words contain the letter e?

words

sum(str_detect(words, "e"))

lets get more complex, what proportion of words end in a vowel?

mean(str_detect(words, "[aeiou]$"))

mean(str_detect(words, "^[aeiou]"))

lets find all the words that don’t contain “o” or “u”

no_o <- !str_detect(words, "[ou]")

no_o

now lets extract

words[!str_detect(words, "[ou]")]

you can also use str_count() to say how many matches there are in a string

x

str_count(x, "[GC]")

lets couple this with mutate

df <- tibble(word = words, count = seq_along(word))

df

df %>% mutate(vowels = str_count(word, "[aeiou]"), consonants = str_count(word, "[^aeiou]"))

