This is a tutorial on how to use R markdown for reproducible research.
Here, we can type long passages or descriptions of our data without the need of “hashing” out our comments with the # symbol. In our first example, we will be using the ToothGrowth dataset. In this experiment, Guinea pigs (literal) were given different amounts of vitamin C to see the effects on the animal’s tooth growth.
To run R code in a markdown file, we need to denote the section that is considered R code. We call these “Code chunks.”
Below is a Code Chunk:
Toothdata <- ToothGrowth
head(Toothdata)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
As you can see, from running the “play” button on the code chunk, the results are printed inline of the R Markdown file.
fit <- lm(len ~ dose, data = ToothGrowth)
b <- fit$coefficients
plot(len ~ dose, data = ToothGrowth)
abline(lm(len ~ dose, data = ToothGrowth))
The slope of the regression is 9.7635714.
We can also put sections and subsections in an R Markdwon file, similar to numbers or bullet points in a word document. This is done with the “#” that we used previously to denote text in a script.
Make sure you put a space after the hashtag, otherwise it will not work!
We can also add bullet point-type marks.
Its important to note here, in R Markdown, INDENTATION MATTERS!
We can similarly do ordered lists.
We can also put nice quotes into the markdown package. We do this by using the “>” symbol.
“Genes are like the store, and DNA is the language that the story is written in.”
— Sam Kean
Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they are in a web browser and will redirect the reader to the material that you are interested in showing. Here we for this example we will link to the R Markdown homepage for your reference. RMarkdown
We can also put nice formatted formulas into Markdown using two dollar signs.
Hardy-Weinberg Formula
\[p^2 + 2pq + q^2 = 1\]
And you can get really complex as well!
\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
If you are more interested in how to do these formulas in R Markdown, you can check out this LaTex cheatsheet.
There are also options for your R Markdown file on how knitr interperits the chunk. There are the following options:
Eval (T or F): whether or not to evaluate the code chunk
Echo (T or F): whether or not to show the code for the chunk, results will still print.
Cache: If enabled, the same code chunk will not be evaluated the next time the document is run. Great for long running processes, but can pose a problem if you change any values above the code.
fig.width or fig.height: The (graphical device) size of R plots in inches. The figures are first written to the knitr document then to files that are saved seperatly.
out.width or out.height: The output size of the R plots in the output document.
fig.cap: the words for the figure caption.
We can also add a table of contents to our HTML document. We do this by altering our YAML code (the weird code chunk at the VERY top of the document.) We can add this:
title: “HTML Markdown Handbook” author: “Dr. VDB” date: “3/1/2022” output: html_document: toc: true toc_float: true
This will give us a very nice floating table of contents on the right side of our HTML file.
If you wan to change it to having the TOC always show and not appear as you scroll, you can change the setting “collapsed” to FALSE.
Alternatively, you can have numbered sections by using the number_sections: TRUE
First thing is to load the library and look at the top of the data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
??flights
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
First we will just look at the data on October 14th.
filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
If we want to subset this into a new variable, we do the following:
oct_14_flight <- filter(my_data, month == 10, day == 14)
What if you want to do both, print and save the variable?
(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
If you want to filter based on different operators, you can use the following:
Equals: == Not Equal to: != Greater than: > Less than: < Greater than or equal to >= Less than or equal to <=
(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 252,474 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
If we don’t use the == to mean equals, we get this:
(oct_14_flight <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You can also use logical operators to be more selective
And: & Or: | Not: !
Let’s use the “or” function to pick flights in March and April
March_April_Flights <- filter(my_data, month == 3 | month == 4)
March_April_Flights <- filter(my_data, month == 3 & day == 4)
Non_jan_flights <- filter(my_data, month!= 1)
arrange(my_data, year, day, month)
descending <- arrange(my_data, desc(year), desc(day), desc(month))
calendar <- select(my_data, year, month, day) print(calendar)
calendar2 <- select(my_data, year:day)
calendar3 <- select(my_data, year:carrier)
everything_else <- select(my_data, -(year:day))
everything_else2 <- select(my_data, !(year:day))
head(my_data)
rename(my_data, departure_time = dep_time)
my_data <- rename(my_data, departure_time = dep_time)
my_data_small <- select(my_data, year:day, distance, air_time)
mutate(my_data_small, speed = distance / air_time * 60)
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)
airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)
summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(my_data, year, month, day) summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
summarize(by_day, delay = mean(dep_delay))
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
summarize(not_cancelled, delay = mean(dep_delay))
sum(is.na(my_data$dep_delay))
sum(!is.na(my_data$dep_delay))
my_data %>% group_by(year, month, day) %>% summarize(mean = mean(departure_time, na.rm = TRUE))
library(tibble)
as_tibble(iris)
tibble( x = 1:5, y = 1, z = x ^ 2 + y )
tribble( ~genea, ~ geneb, ~ genec, ######################### 110, 112, 114, 6, 5, 4 )
print(by_day) as.data.frame(by_day) head(by_day)
nycflights13::flights %>% print(n=10, width = Inf)
df_tibble <- tibble(nycflights13::flights)
df_tibble
df_tibble$carrier
df_tibble[[2]]
df_tibble %>% .$carrier
class(df_tibble)
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
df_tibble
head(df_tibble_2)
library(tidyverse)
bmi <- tibble(women)
bmi %>% mutate(bmi = (703 * weight)/(height)^2)
table4a
table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’)
table4b
table4b %>% gather(“1999”, “2000”, key = “year”, value = “population”)
table4a <- table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’) table4b <- table4b %>% gather(“1999”, “2000”, key = “year”, value = “population”)
left_join(table4a, table4b)
print(table2)
spread(table2, key = type, value = count)
table3
table3 %>% separate(rate, into = c(“cases”, “population”))
table3 %>% separate(rate, into = c(“cases”, “populate”), conver = TRUE)
table3 %>% separate(rate, into = c(“cases”, “populate”), sep = “/”, conver = TRUE)
table3 %>% separate ( year, into = c(“century”,“year”), convert = TRUE, sep = 2 )
table5
table5 %>% unite(date, century, year)
table5 %>% unite(date, century, year, sep = ““)
gene_data <- tibble ( gene = c(‘a’, ‘a’, ‘a’, ‘a’, ‘b’, ‘b’, ‘b’), nuc = c(20, 22, 24, 25, NA, 42, 67), run = c(1,2,3,4,2,3,4) )
gene_data
gene_data %>% spread(gene, nuc)
gene_data %>% spread(gene, nuc) %>% gather(gene,nuc, ‘a’:‘b’, na.rm = TRUE)
gene_data %>% complete(gene, run)
treatment
treatment %>% fill(person)
# they are present in another. # Set operations - treats observations as they are set elements
library(tidyverse) library(nycflights13)
print(airlines)
print(airports)
print(planes)
print(weather)
print(flights)
planes %>% count(tailnum) %>% filter(n>1)
planes %>% count(model) %>% filter(n>1)
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
flights2
flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = ‘carrier’)
y <- c(“apple”, “banana”, “pear”)
str_detect(y, “e”)
words
sum(str_detect(words, “e”))
mean(str_detect(words, “1”))
no_o <- !str_detect(words,“[ou]”) no_o
words[!str_detect(words,“[ou]”)]
x <- c(“ATTAGA”, “CGCCCCCGGAT”, “TATTA”) x
str_count(x, “[GC]”)
df <- tibble( word = words, count = seq_along(word) )
df
df %>% mutate( vowels = str_count(words, “[aeiou]”), constonants = str_count(words, “[^aeiou]”) )
aeiou↩︎