This is a tutorial on how to use R Markdown for reproducible research.
Here we can type long passages or descriptions of our data without needing to comment out each line with the # symbol. In our first example, we will use the ToothGrowth dataset. In this experiment, guinea pigs (literal ones) were given different amounts of vitamin C to see the effect on the animals’ tooth growth.
To run R code in a markdown file, we need to denote the section that is considered R code. We call these sections “code chunks.”
Below is a code chunk:
Toothdata <- ToothGrowth
head(Toothdata)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
As you can see, after clicking the “play” button on the code chunk, the results are printed inline in the R Markdown file.
fit <- lm(len ~ dose, data = Toothdata)
b <- fit$coefficients
plot(len ~ dose, data = Toothdata)
abline(lm(len ~ dose, data = Toothdata))
Figure 1: The tooth growth of Guinea Pigs when given variable amounts of Vitamin C
The slope of the regression line is 9.7635714.
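The slope quoted above can be read directly from the fitted model. The following is a minimal sketch, assuming the same ToothGrowth model as in the chunk above; the name `slope` is introduced here only for illustration.

```r
# Fit the same linear model used for Figure 1
fit <- lm(len ~ dose, data = ToothGrowth)

# Extract the coefficient on dose by name (the slope of the regression line)
slope <- coef(fit)[["dose"]]
slope
```

Extracting the coefficient by name rather than by position keeps the code correct if more terms are later added to the model.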
We can also put sections and subsections in our R Markdown file, similar to numbered headings or bullet points in a Word document. This is done with the “#” symbol that we previously used to mark comments in an R script.
Make sure that you put a space after the # symbol, otherwise it will not work!
We can also add bullet points in our R Markdown file.
It’s important to note here that in R Markdown indentation matters!
We can put really nice quotes into the markdown document. We do this by using the “>” symbol.
“Genes are like the story, and DNA is the language that that story is written in.”
— Sam Kean
Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they are in a web browser and will redirect the reader to the material that you are interested in showing them. Here we will use the link to R Markdown’s homepage for this example. RMarkdown
We can also put nicely formatted formulas into Markdown using two dollar signs.
Hardy-Weinberg Formula
\[p^2 + 2pq + q^2 = 1\]
And you can get really complex as well!
\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
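As a quick numerical check of the Hardy-Weinberg formula above, here is a small sketch in R. The allele frequency p = 0.7 is an assumed, illustrative value (not taken from any dataset), and the genotype labels AA, Aa and aa are likewise chosen only for this example.

```r
# Assumed allele frequencies (illustrative values only)
p <- 0.7
q <- 1 - p

# Expected genotype frequencies under Hardy-Weinberg equilibrium
genotype_freqs <- c(AA = p^2, Aa = 2 * p * q, aa = q^2)
genotype_freqs

# The three terms add up to 1, as the formula states
total <- sum(genotype_freqs)
total
```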
print("Hello World")
There are also chunk options that control how knitr handles a code chunk. Some of the most common are:
eval (TRUE or FALSE): whether or not to evaluate the code chunk.
echo (TRUE or FALSE): whether or not to show the code for the chunk; the results will still print.
cache: if enabled, the code chunk will not be re-evaluated the next time the document is knit unless the chunk has changed. Great for code that has LONG run times.
fig.width or fig.height: the (graphical device) size of the R plots in inches. The plots are first recorded by a graphical device in knitr and then written out to files that are saved separately.
out.width or out.height: the size at which the R plots are displayed IN THE OUTPUT DOCUMENT.
fig.cap: the text of the figure caption.
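These options can be written in the header of an individual chunk, or set once for the whole document. As one possible sketch, the call below uses knitr's `opts_chunk$set()` to define document-wide defaults; the specific values are illustrative assumptions, not recommendations.

```r
library(knitr)

# Set default options for every chunk that follows (values are illustrative)
opts_chunk$set(
  echo = TRUE,      # show the code
  eval = TRUE,      # run the code
  cache = FALSE,    # do not cache results
  fig.width = 6,    # graphical device width in inches
  fig.height = 4    # graphical device height in inches
)

# Look up one of the defaults that was just set
opts_chunk$get("fig.width")
```

A call like this usually lives in a setup chunk near the top of the document, so that individual chunks only need to state the options that differ from the defaults.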
We can also add a table of contents to our HTML document. We do this by altering the YAML header (the block at the VERY top of the document). We can add this:
title: "HTML_Tutorial"
author: "Lacey Battlefield"
date: "2024-10-04"
output:
  html_document:
    toc: true
    toc_float: true
This will give us a very nice floating table of contents on the left-hand side of the document.
You can also add tabs to your report. To do this, place {.tabset} after the header of the section whose subsections you want to turn into tabs. Every subheading beneath that header will become a new tab.
You can also add themes to your HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this, you set theme in the YAML to one of the following:
cerulean journal flatly readable spacelab united cosmo lumen paper sandstone simplex yeti null
You can also change the syntax highlighting style by specifying highlight:
default tango pygments kate monochrome espresso zenburn haddock textmate
You can also use the code_folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:
code_folding: hide
There are a TON of options and ways for you to customize how your R code is displayed in the HTML format. This is also a great way to display a “portfolio” of your work if you are trying to market yourself to interested parties.
# Data wrangling with R
The first thing to do is load the library and look at the top of the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
??flights
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
First we will just look at the data for October 14th.
filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we want to subset this into a new variable, we do the following:
oct_14_flight <- filter(my_data, month == 10, day == 14)
What if you want to both print and save the variable? Wrap the assignment in parentheses:
(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If you want to filter based on different operators, you can use the following:
Equals ==
Not equal to !=
Greater than >
Less than <
Greater than or equal to >=
Less than or equal to <=
(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we use a single = instead of == to mean equals, filter() stops with an error. The correct call uses ==:
(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
You can also use logical operators to be more selective:
and &
or |
not !
Let’s use the “or” operator to pick flights in March and April.
March_April_Flights <- filter(my_data, month == 3 | month == 4)
March_4th_Flights <- filter(my_data, month == 3 & day == 4)
Non_jan_flights <- filter(my_data, month != 1)
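When several “or” conditions test the same column, `%in%` is a compact alternative. The sketch below uses a small made-up tibble (`toy_flights`, a hypothetical stand-in for the flights data) so that it runs quickly and the result is easy to inspect.

```r
library(dplyr)

# A tiny, made-up stand-in for the flights data
toy_flights <- tibble(
  month = c(1, 3, 3, 4, 10),
  day   = c(1, 4, 15, 4, 14)
)

# These two filters select the same rows
with_or <- filter(toy_flights, month == 3 | month == 4)
with_in <- filter(toy_flights, month %in% c(3, 4))
identical(with_or, with_in)

# & requires both conditions to hold for a row to be kept
march_4th <- filter(toy_flights, month == 3 & day == 4)
march_4th
```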
# Arrange allows us to arrange the dataset based on the variable we desire.
arrange(my_data, year, month)
# We can also do this in descending fashion
descending <- arrange(my_data, desc(year), desc(day), desc(month))
# Missing values are always placed at the end of the data frame, regardless of ascending or descending order.
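The behaviour of missing values under arrange() is easiest to see on a tiny example. This is a minimal sketch using a made-up one-column tibble (`toy`, hypothetical); the names `asc_sorted` and `desc_sorted` are introduced only for illustration.

```r
library(dplyr)

# One column with a missing value in the middle
toy <- tibble(x = c(2, NA, 1))

# Ascending order: the NA is placed last
asc_sorted <- arrange(toy, x)
asc_sorted$x

# Descending order: the NA is still placed last
desc_sorted <- arrange(toy, desc(x))
desc_sorted$x
```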
# We can also select specific columns that we want to look at.
calendar <- select(my_data, year, month, day)
print(calendar)
# We can also look at a range of columns
calendar2 <- select(my_data, year:day)
# Let's look at all columns from year through carrier
calendar3 <- select(my_data, year:carrier)
# We can also choose which columns NOT to include
everything_else <- select(my_data, -(year:day))
# In this instance we can also use the "not" operator !
everything_else2 <- select(my_data, !(year:day))
# There are also some other helper functions that can help you select the columns or data you're looking for
# starts_with("xyz") - selects the columns whose names start with xyz
# ends_with("xyz") - selects the columns whose names end with xyz
# contains("xyz") - selects the columns whose names contain xyz
# matches("xyz") - selects the columns whose names match the regular expression xyz
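To make the helper functions concrete, here is a short sketch on a one-row tibble whose column names mimic a few of the flights columns. The tibble `toy` and its values are made up for illustration.

```r
library(dplyr)

# One made-up row with flights-style column names
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = "UA"
)

dep_cols   <- names(select(toy, starts_with("dep")))  # names beginning with "dep"
delay_cols <- names(select(toy, ends_with("delay")))  # names ending with "delay"
time_cols  <- names(select(toy, contains("time")))    # names containing "time"
arr_cols   <- names(select(toy, matches("^arr_")))    # names matching a regular expression

dep_cols
delay_cols
time_cols
arr_cols
```

Note that contains() looks for a literal substring, while matches() takes a regular expression, so anchors such as `^` only have their special meaning in matches().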
head(my_data)
rename(my_data, departure_time = dep_time)
my_data <- rename(my_data, departure_time = dep_time)
# What if you want to add new columns to your data frame? We have the mutate() function for that.
# First, let's make a smaller data frame so we can see what we are doing.
my_data_small <- select(my_data, year:day, distance, air_time)
# Let's calculate the speed of the flights.
mutate(my_data_small, speed = distance / air_time * 60)
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)
# What if we wanted to create a new data frame with ONLY our calculations? (transmute)
airspeed <- transmute(my_data_small, speed = distance / air_time * 60, speed2 = distance / air_time)
# We can use summarize to run a function on a data column to get a single return value
summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
# So we can see here that the average delay is about 12 minutes
# We gain additional value in summarize by pairing it with group_by()
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
# As you can see, we now have the delay by the days of the year
# What happens if we don't tell R what to do with the missing data?
summarize(by_day, delay = mean(dep_delay))
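The effect of na.rm is easiest to see on a short vector. This is a minimal sketch with made-up delay values (`delays` is hypothetical): a single missing value makes the mean missing unless it is removed first.

```r
# Three made-up delays, one of them missing
delays <- c(5, NA, 10)

# Without na.rm, the missing value propagates to the result
mean_with_na <- mean(delays)
mean_with_na

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(delays, na.rm = TRUE)
mean_without_na
```

The same thing happens inside summarize(): any group that contains at least one NA gets an NA summary unless na.rm = TRUE is passed.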
# We can also filter our data based on NA (which in this dataset indicates cancelled flights)
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
# Let's run summarize again on this data
summarize(not_cancelled, delay = mean(dep_delay))
# We can also count the number of values that are NA
sum(is.na(my_data$dep_delay))
# We can also count the number of values that are NOT NA
sum(!is.na(my_data$dep_delay))
# With tibble datasets (more on them soon), we can pipe results to get rid of the need to use the dollar sign
# We can then summarize the mean departure time for each day.
my_data %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(departure_time, na.rm = TRUE))
library(tibble)
as_tibble(iris)
# As we can see, we have the same data frame, but we have different features
# You can also create a tibble from scratch with tibble()
tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
# You can also use tribble() for basic data table creation
tribble(
  ~genea, ~geneb, ~genec,
  #######################
  110, 112, 114,
  6, 5, 4
)
# Tibbles are built to not overwhelm your console when printing data, only showing
# the first few lines.
# This is how a data frame prints
print(by_day)
as.data.frame(by_day)
head(by_day)
nycflights13::flights %>% print(n=10, width = Inf)
# Subsetting tibbles is easy, similar to data.frames
df_tibble <- tibble(nycflights13::flights)
df_tibble
# We can subset by column name using the $
df_tibble$carrier
# We can subset by position using [[]]
df_tibble[[2]]
# If you want to use this in a pipe, you need to use the "." placeholder
df_tibble %>% .$carrier
# some older functions do not like tibbles, thus you might have to convert them back to data frames
class(df_tibble)
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
head(df_tibble_2)
library(tidyverse)
# How do we make a tidy dataset? Well, the tidyverse follows three rules
#1 - Each variable must have its own column
#2 - Each observation must have its own row
#3 - Each value must have its own cell.
# It is impossible to satisfy only two of the three rules.
# This leads to the following instructions for tidy data
#1 put each dataset into a tibble
#2 put each variable into a column
#3 profit
# Picking one consistent method of data storage makes for easier understanding
# of your code and what is happening "under the hood" or behind the scenes
# Lets now look at working with tibbles
bmi <- tibble(women)
bmi %>% mutate(bmi = (703 * weight)/(height)^2)
# Sometimes you’ll find datasets that don’t fit well into a tibble
# we’ll use the built-in data from tidyverse for this part
table4a
# As you can see from this data, we have one variable in the first column (country),
# but the 1999 and 2000 columns hold values of the same variable. Thus, there are two observations in
# each row.
# To fix this, we can use the gather function
table4a %>% gather('1999', '2000', key = 'year', value = 'cases')
# Let's look at another example
table4b
# As you can see we have the same problem in table4b
table4b %>% gather("1999", "2000", key = "year", value = "population")
# Now what if we want to join these two tables? We can use dplyr
table4a <- table4a %>% gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>% gather("1999", "2000", key = "year", value = "population")
left_join(table4a, table4b)
# Spreading is the opposite of gathering
table2
# You can see that we have redundant info in columns 1 and 2
# We can fix that by combining rows 1&2, 3&4, etc.
spread(table2, key = type, value = count)
# type is the key holding the names of the new columns; count supplies the values that fill them
# In summary, spread makes long tables shorter and wider;
# gather makes wide tables narrower and longer.
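In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(). The sketch below repeats the reshaping steps from this section with the newer functions, using the example tables that ship with tidyr; the result names `long_cases` and `wide_counts` are introduced here for illustration.

```r
library(tidyr)

# gather() equivalent: turn the 1999 and 2000 columns into year/cases pairs
long_cases <- pivot_longer(
  tidyr::table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
long_cases

# spread() equivalent: turn each value of type into its own column
wide_counts <- pivot_wider(
  tidyr::table2,
  names_from = type,
  values_from = count
)
wide_counts
```

The tables are referenced as `tidyr::table4a` and `tidyr::table2` so that the example uses the packaged data even if an object with the same name has been overwritten in the workspace.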
# Now what happens when we have two observations stuck in one column?
table3
# As you can see, the rate is just the cases and population combined.
# We can use separate to fix this
table3 %>% separate(rate, into = c("cases", "population"))
# However, if you notice, the column types are not correct (both are character).
table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE)
# You can specify what you want to separate based on.
table3 %>% separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
# Let's make this look more tidy
table3 %>% separate(year, into = c("century", "year"), convert = TRUE, sep = 2)
# What happens if we want to do the inverse of separate?
table5
table5 %>% unite(date, century, year)
table5 %>% unite(date, century, year, sep = "")
# There can be two types of missing values. NA (explicit) or just no entry (implicit)
gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1, 2, 3, 4, 2, 3, 4)
)
gene_data
# The nucleotide count for gene b run 2 is explicitly missing.
# The nucleotide count for gene b run 1 is implicitly missing.
# One way we can make implicit missing values explicit is by putting genes in columns with spread().
# Gathering back with na.rm = TRUE then drops every NA, turning the explicit missing values implicit again.
gene_data %>% spread(gene, nuc) %>% gather(gene, nuc, 'a':'b', na.rm = TRUE)
# Another way that we can make implicit values explicit is complete()
gene_data %>% complete(gene, run)
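To see what complete() does to the gene data, the sketch below rebuilds the same `gene_data` tibble and counts rows and missing values afterwards; the name `completed` is introduced here for illustration.

```r
library(tibble)
library(tidyr)

# Same data as above: gene b has no row for run 1, and an NA for run 2
gene_data <- tibble(
  gene = c("a", "a", "a", "a", "b", "b", "b"),
  nuc  = c(20, 22, 24, 25, NA, 42, 67),
  run  = c(1, 2, 3, 4, 2, 3, 4)
)

# complete() adds a row for every gene/run combination, filling gaps with NA
completed <- complete(gene_data, gene, run)
completed

nrow(completed)            # 2 genes x 4 runs
sum(is.na(completed$nuc))  # the original NA plus the newly explicit one
```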
# sometimes an NA is present to represent a value being carried forward
treatment
# what we can do here is use the fill() option
treatment %>% fill(person)
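For readers who do not have a `treatment` table at hand, here is a self-contained sketch of fill() on a small made-up table. The table `treatment_example` and the person labels are hypothetical and only mimic the shape of data where an NA means "same as the row above".

```r
library(tibble)
library(tidyr)

# Made-up data: the person is only written on the first row of each block
treatment_example <- tribble(
  ~person,    ~treatment, ~response,
  "Person A", 1,          7,
  NA,         2,          10,
  NA,         3,          9,
  "Person B", 1,          4
)

# fill() carries the last non-missing value downward by default
filled <- fill(treatment_example, person)
filled$person
```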
# It is rare that you will be working with a single data table. The dplyr package allows
# you to join two data tables based on common values.
# Mutating joins - add new variables to one data frame from the matching observations in another
# Filtering joins - filter observations from one data frame based on whether or not
# they are present in another.
library(tidyverse)
library(nycflights13)
# Let's pull full carrier names based on letter codes
airlines
# Let's get info about airports
airports
# Let's get info about each plane
planes
# Let's get info on the weather at the airports
weather
# Let's get info on singular flights
flights
# Let's look at how these tables connect
# Flights -> planes based on tail number
# Flights -> airlines through carrier
# Flights -> airports origin and dest
# Flights -> weather via origin, year/month/day/hour
# Keys are unique identifiers per observation
# A primary key uniquely identifies an observation in its own table.
# One way to identify a primary key is as follows:
planes %>% count(tailnum) %>% filter(n>1)
# this indicates that the tail number is unique
planes %>% count(model) %>% filter(n>1)
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
flights2
flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = 'carrier')
# We've now added the airline name to our data frame from the airlines data frame
# Other types of joins
# Inner joins (inner_join) match a pair of observations when their keys are equal
# Outer joins (left_join, right_join, full_join) keep observations that appear in at least one table.
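The difference between the join types shows up in the number of rows each one keeps. This is a minimal sketch on two made-up tibbles (`tbl_x` and `tbl_y`, hypothetical) that share keys 1 and 2 but each have one key the other lacks.

```r
library(dplyr)

# Two small made-up tables with partly overlapping keys
tbl_x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
tbl_y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

inner <- inner_join(tbl_x, tbl_y, by = "key")  # only keys present in both tables
left  <- left_join(tbl_x, tbl_y, by = "key")   # every row of tbl_x
full  <- full_join(tbl_x, tbl_y, by = "key")   # every key from either table

c(inner = nrow(inner), left = nrow(left), full = nrow(full))
```

Rows without a match in the other table get NA in the columns that come from it, which is how the outer joins keep observations that an inner join would drop.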
library(tidyverse)
library(stringr)
# You can create strings using single or double quotes
string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite'
string1
string2
# If you forget to close your string, the console shows a + prompt and waits for more input; the string below is closed properly:
string3 <- "where is this string going?"
string3
# just hit escape and try again
# Multiple strings are stored in character vectors
string4 <- c("one", "two", "three")
string4
# measuring string length
str_length(string3)
str_length(string4)
# lets combine two strings
str_c("X", "Y")
str_c(string1, string2)
# You can use sep to control how they are separated
str_c(string1, string2, sep = " ")
str_c("x", "y", "z", sep = "_")
# You can subset a string using str_sub()
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4, 6)
# This keeps characters 4 through 6, dropping the first three letters from the strings
# Or you can use negatives to count back from the end
str_sub(HSP, -3, -1)
# You can convert the case of strings as follows:
HSP
str_to_lower(HSP)
# str_to_upper()
install.packages("htmlwidgets")
x <- c("ATTAGA", "CGCCCCCGGAT", "TATTA")
str_view(x, "G")
str_view(x, "TA")
# The next step is ".", where the "." matches any character
str_view(x, ".G.")
# Anchors allow you to match at the start or the end
str_view(x, "^TA")
str_view(x, "TA$")
# character classes/alternatives
# \d matches any digit
# \s matches any whitespace
# [abc] matches a, b, or c
str_view(x, "TA[GT]")
str_view(x, "TA[^T]")
str_view(x, "TA[G|T]")
y <- c("apple", "banana", "pear")
y
str_detect(y, "e")
words
sum(str_detect(words, "e"))
mean(str_detect(words, "[aeiou]$"))
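Because str_detect() returns a logical vector, sum() counts the matches and mean() gives the proportion of matches. The sketch below shows this on a three-element vector with the same values as `y` above; the names `fruit_names` and `has_e` are introduced here for illustration.

```r
library(stringr)

fruit_names <- c("apple", "banana", "pear")

# TRUE where the string contains an "e"
has_e <- str_detect(fruit_names, "e")
has_e

sum(has_e)   # number of strings containing "e"
mean(has_e)  # proportion of strings containing "e"
```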
mean(str_detect(words, "^[aeiou]"))
no_o <- !str_detect(words, "[ou]")
no_o
words[!str_detect(words, "[ou]")]
x
str_count(x, "[GC]")
df <- tibble(
  word = words,
  count = seq_along(word)
)
df
df %>%
  mutate(
    vowels = str_count(words, "[aeiou]"),
    consonants = str_count(words, "[^aeiou]")
  )