This is a tutorial on how to use R Markdown for reproducible research.
Here we can type long passages or descriptions of our data without needing to “hash out” our comments with the # symbol. In the first example, we will be using the ToothGrowth dataset. In this experiment, guinea pigs (literally) were given different amounts of vitamin C to see the effect on the animals’ tooth growth.
To run R code in a markdown file, we need to denote the section that is considered R code. We call these sections “code chunks.”
Below is a code chunk:
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
As you can see after pressing the “play” button on the code chunk, the results are printed inline in the R Markdown file.
# Assumes Toothdata holds the built-in ToothGrowth data (e.g. Toothdata <- ToothGrowth) from the earlier chunk
fit <- lm(len ~ dose, data = Toothdata)
b <- fit$coefficients
plot(len ~ dose, data = Toothdata)
abline(fit)
Figure 1: The tooth growth of guinea pigs when given variable amounts of vitamin C
The slope of the regression line is 9.7635714.
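The slope reported above can be reproduced outside the document. The following is a minimal sketch, assuming that Toothdata is simply a copy of R's built-in ToothGrowth data (the tutorial does not show how it was created):

```r
# Assumption: Toothdata is a copy of the built-in ToothGrowth data set
Toothdata <- ToothGrowth

# Fit tooth length as a linear function of vitamin C dose
fit <- lm(len ~ dose, data = Toothdata)

# Coefficients are named "(Intercept)" and "dose"; the slope is the "dose" entry
b <- coef(fit)
slope <- b[["dose"]]
print(slope)
```

In an R Markdown file, the same value can be placed inside a sentence with an inline expression such as `r b[2]`, which is presumably how the sentence above was generated.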
We can also put sections and subsections in our R Markdown file, similar to numbers or bullet points in a Word document. This is done with the “#” that we previously used to denote comments in an R script.
Make sure you put a space after the hashtag, otherwise it will not work!
We can also add bullet point-type marks in our r markdown file.
It’s important to note here that in R Markdown indentation matters!
We can put really nice quotes into the markdown document. We do this by using the “>” symbol.
“I have no special talents. I am only passionately curious.”
— Albert Einstein
Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they are viewed in a web browser and will redirect the reader to the material that you are interested in showing them. Here we will use the link to R Markdown’s homepage for this example. RMarkdown
We can also put nicely formatted formulas into Markdown using two dollar signs.
Hardy-Weinberg Formula
\[p^2 + 2pq + q^2 = 1\]
And you can get really complex as well!
\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
There are also chunk options that control how knitr handles each code chunk. Some of the most common are listed below.
eval (TRUE or FALSE): whether or not to evaluate the code chunk.
echo (TRUE or FALSE): whether or not to show the code for the chunk; the results will still print.
cache: if enabled, the same code chunk will not be re-evaluated the next time the document is knit, as long as the chunk is unchanged. Great for code that has LONG run times.
fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first drawn on the graphical device and then written to files that are saved separately.
out.width or out.height: the output size of the R plots in the final document.
fig.cap: the text of the figure caption.
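These options can be written in an individual chunk header, or set once for the whole document. Below is a minimal sketch of the second approach using knitr's `opts_chunk$set()`, which is typically placed in a setup chunk at the top of the file. The specific values are illustrative assumptions, not settings taken from this tutorial:

```r
library(knitr)

# Set document-wide defaults (illustrative values, not the tutorial's own)
opts_chunk$set(
  eval = TRUE,        # run the code in each chunk
  echo = FALSE,       # hide the code, still show the results
  cache = FALSE,      # re-run chunks on every knit
  fig.width = 6,      # graphics device width in inches
  fig.height = 4,     # graphics device height in inches
  out.width = "80%"   # size of the plot in the final document
)

# Inspect a default that was just set
opts_chunk$get("fig.width")
```

Options written in a single chunk header override these defaults for that chunk only.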
We can also add a table of contents to our HTML document. We can do this by altering the YAML header (the odd-looking block at the VERY top of the document). We can add this:
title: "HTML Tutorial"
author: "Rebecca Tingle"
date: "2024-11-11"
output:
  html_document:
    toc: true
    toc_float: true
This will give us a very nice floating table of contents on the left side of the document.
You can also add TABS in the report. To do this you need to mark the section that should hold the tabs by placing “{.tabset}” after its header. Every subheader one level below it will become a new tab.
You can also add themes to your HTML Document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this you change your theme in the YAML to one of the following:
cerulean journal flatly readable spacelab united cosmo lumen paper sandstone simplex yeti null
You can also change the syntax highlighting style of the code by specifying highlight:
default tango pygments kate monochrome espresso zenburn haddock textmate
You can also use the code-folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:
code_folding: hide
There are a TON of options and ways for you to customize your R code using the HTML format. This is also a great way to display a “portfolio” of your work if you are trying to market yourself to interested parties.
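The same HTML options can also be supplied from R instead of the YAML header. The following is a minimal sketch, assuming the rmarkdown package is installed; the file name `tutorial.Rmd` is hypothetical, and the theme and highlight values are just examples from the lists above:

```r
library(rmarkdown)

# Build an HTML output format carrying the options discussed above
fmt <- html_document(
  toc = TRUE,             # add a table of contents
  toc_float = TRUE,       # float it beside the main text
  theme = "flatly",       # one of the themes listed above
  highlight = "tango",    # syntax highlighting style
  code_folding = "hide"   # code hidden by default, with a toggle
)

# "tutorial.Rmd" is a hypothetical file name; rendering needs pandoc installed
if (file.exists("tutorial.Rmd")) {
  render("tutorial.Rmd", output_format = fmt)
}
```

Keeping the options in the YAML header is usually simpler; building the format in R is mainly useful when rendering from a script.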
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# First we will just look at the data on the 14th of October.
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# If we want to save this subset in a new variable, we do the following:
oct_14_flight <- filter(my_data, month == 10, day == 14)
# Wrapping an assignment in parentheses also prints the result
(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
(flight_through_september <- filter(my_data, month < 10))
# A single = is an error here; comparisons need ==
# (oct_14_flight_2 <- filter(my_data, month = 10, day = 14))
# You can also use logical operators to be more selective
March_April_Flights <- filter(my_data, month == 3 | month == 4)
# & requires both conditions at once, so this returns zero rows
March_and_April_Flights <- filter(my_data, month == 3 & month == 4)
March_4th_Flights <- filter(my_data, month == 3 & day == 4)
Non_Jan_Flights <- filter(my_data, month != 1)
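The difference between `|` and `&` in the filters above is easier to see on a tiny table. This is a minimal sketch using a hypothetical four-row tibble named `toy` in place of `my_data`:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for my_data
toy <- tibble(month = c(3, 3, 4, 10), day = c(4, 15, 4, 14))

# OR: rows from March or April
either <- filter(toy, month == 3 | month == 4)

# AND on the same column can never be true, so no rows come back
both <- filter(toy, month == 3 & month == 4)

# %in% is a compact way to write the OR version
either_in <- filter(toy, month %in% c(3, 4))

nrow(either)     # 3
nrow(both)       # 0
nrow(either_in)  # 3
```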
# Sort rows by year, then day, then month
arrange(my_data, year, day, month)
# desc() sorts a column in descending order
descending <- arrange(my_data, desc(year), desc(day), desc(month))
# Keep only the named columns
calendar <- select(my_data, year, month, day)
print(calendar)
# year:day selects every column from year through day
calendar2 <- select(my_data, year:day)
calendar3 <- select(my_data, year:carrier)
# A minus sign or ! drops the selected columns instead
everything_else <- select(my_data, -(year:day))
everything_else2 <- select(my_data, !(year:day))
head(my_data)
# rename() only changes the data if the result is assigned back
rename(my_data, departure_time = dep_time)
my_data <- rename(my_data, departure_time = dep_time)
my_data_small <- select(my_data, year:day, distance, air_time)
# mutate() adds new columns; speed is in miles per hour
mutate(my_data_small, speed = distance / air_time * 60)
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)
# transmute() keeps only the newly created columns
airspeed <- transmute(my_data_small, speed = distance / air_time * 60, speed2 = distance / air_time)
# na.rm = TRUE drops missing values before averaging
summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
# Without na.rm, any group containing an NA has an NA mean
summarize(by_day, delay = mean(dep_delay))
# Alternatively, remove the rows with missing delays first
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
summarize(not_cancelled, delay = mean(dep_delay))
# Count the missing and non-missing departure delays
sum(is.na(my_data$dep_delay))
sum(!is.na(my_data$dep_delay))
my_data %>% group_by(year, month, day) %>% summarize(mean = mean(departure_time, na.rm = TRUE))
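The effect of `na.rm` in the grouped summaries above can be checked on a small example. This is a minimal sketch with a hypothetical tibble named `toy`, in which one delay is missing:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one delay is missing on day 2
toy <- tibble(
  day = c(1, 1, 2, 2),
  dep_delay = c(2, 4, NA, 10)
)

toy_by_day <- group_by(toy, day)

# Without na.rm, the mean for day 2 is NA
with_na <- summarize(toy_by_day, delay = mean(dep_delay))

# With na.rm = TRUE, the missing value is skipped
without_na <- summarize(toy_by_day, delay = mean(dep_delay, na.rm = TRUE))

with_na$delay     # 3 NA
without_na$delay  # 3 10
```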
library(tibble)
as_tibble(iris)
tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
tribble(
  ~genea, ~geneb, ~genec,
  #######################
  110, 112, 114,
  6, 5, 4
)
print(by_day)
as.data.frame(by_day)
head(by_day)
nycflights13::flights %>% print(n = 10, width = Inf)
df_tibble <- tibble(nycflights13::flights)
df_tibble
df_tibble$carrier
df_tibble[[2]]
df_tibble %>% .$carrier
class(df_tibble)
df_tibble_2 <- as.data.frame(df_tibble)
df_tibble
head(df_tibble_2)
library(tidyverse)
# 1 - Each variable must have its own column
# 2 - Each observation must have its own row
# 3 - Each value must have its own cell
# 1 - Put each dataset into a tibble
# 2 - Put each variable into a column
# 3 - Profit
bmi <- tibble(women)
bmi %>% mutate(bmi = (703 * weight) / (height)^2)
table4a
table4a %>% gather('1999', '2000', key = 'year', value = 'cases')
table4b
table4b %>% gather('1999', '2000', key = 'year', value = 'population')
table4a <- table4a %>% gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>% gather('1999', '2000', key = 'year', value = 'population')
left_join(table4a, table4b)
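Recent tidyr releases describe `gather()` and `spread()` as superseded by `pivot_longer()` and `pivot_wider()`. As a hedged alternative to the steps above, here is a minimal sketch of the same reshaping and join with `pivot_longer()`. It reads the example tables directly from tidyr so that it does not depend on the overwritten copies, and the names `cases`, `population`, and `tidy_table` are introduced only for illustration:

```r
library(tidyr)
library(dplyr)

# table4a and table4b ship with tidyr; backticks are needed because
# the column names 1999 and 2000 start with a digit
cases <- tidyr::table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

population <- tidyr::table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# Join on the shared key columns, named explicitly
tidy_table <- left_join(cases, population, by = c("country", "year"))
tidy_table
```

Naming the key columns in `by` avoids relying on the message that `left_join()` prints when it has to guess them.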
table2
spread(table2, key = type, value = count)
table3
table3 %>% separate(rate, into = c('cases', 'population'))
table3 %>% separate(rate, into = c('cases', 'population'), convert = TRUE)
table3 %>% separate(rate, into = c('cases', 'population'), sep = "/", convert = TRUE)
table3 %>% separate(year, into = c('century', 'year'), convert = TRUE, sep = 2)
#############################################
# Unite
#############################################
table5
table5 %>% unite(date, century, year)
table5 %>% unite(date, century, year, sep = '')
gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1, 2, 3, 4, 2, 3, 4)
)
gene_data
gene_data %>% spread(gene, nuc)
gene_data %>% spread(gene, nuc) %>% gather(gene, nuc, 'a':'b', na.rm = TRUE)
gene_data %>% complete(gene, run)
treatment <- tribble(
  ~person, ~treatment, ~response,
  #################################################
  "Harrison", 1, 7,
  NA, 2, 10,
  NA, 3, 9,
  "Becca", 1, 8,
  NA, 2, 11,
  NA, 3, 10
)
treatment
treatment %>% fill(person)
# are present in another.
# set operations - treat observations as if they were set elements.
library(tidyverse)
library(nycflights13)
airlines
airports
planes
weather
flights
planes %>% count(tailnum) %>% filter(n>1)
planes %>% count(model) %>% filter(n>1)
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
flights2
flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = 'carrier')
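What `left_join()` does with unmatched keys is easier to see on a small case. This is a minimal sketch with two hypothetical tibbles, `toy_flights` and `toy_airlines`, standing in for `flights2` and `airlines`; the carrier code "ZZ" is made up so that one row has no match:

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables standing in for flights2 and airlines
toy_flights <- tibble(flight = c(1, 2, 3), carrier = c("AA", "UA", "ZZ"))
toy_airlines <- tibble(
  carrier = c("AA", "UA"),
  name = c("American Airlines Inc.", "United Air Lines Inc.")
)

# Every row of toy_flights is kept; "ZZ" has no match, so its name is NA
joined <- left_join(toy_flights, toy_airlines, by = "carrier")
joined
```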
library(tidyverse)
library(stringr)
string1 <- "this is a string"
string2 <- 'to put a double "quote" in your string, use the opposite'
string1
string2
string3 <- "where is this string going?"
string3
string4 <- c("one", "two", "three")
string4
str_length(string3)
str_length(string4)
str_c("X", "Y")
str_c(string1, string2)
str_c(string1, string2, sep = " ")
str_c("x", "y", "z", sep = "_")
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4, 6)
str_sub(HSP, -3, -1)
HSP
str_to_lower(HSP)
install.packages("htmlwidgets")
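x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')
str_view(x, "G")
str_view(x, "TA")
# . matches any single character
str_view(x, ".G.")
# ^ anchors the match to the start of the string, $ to the end
str_view(x, "^TA")
str_view(x, "TA$")
# [GT] matches G or T; [^T] matches anything except T
str_view(x, "TA[GT]")
str_view(x, "TA[^T]")
# Inside [ ], | is a literal character, so [G|T] also matches "|"
str_view(x, "TA[G|T]")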
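`str_view()` shows matches visually; the same patterns can be checked programmatically with `str_detect()`. This is a minimal sketch on the same three sequences, where the variable names and the test string "TA|" are introduced only for illustration:

```r
library(stringr)

x <- c("ATTAGA", "CGCCCCCGGAT", "TATTA")

# Starts with TA
starts_ta <- str_detect(x, "^TA")

# Ends with TA
ends_ta <- str_detect(x, "TA$")

# TA followed by G or T anywhere in the string
ta_gt <- str_detect(x, "TA[GT]")

# Inside brackets, | is an ordinary character, so only [G|T] matches "TA|"
pipe_class <- str_detect("TA|", "TA[G|T]")
pipe_plain <- str_detect("TA|", "TA[GT]")

starts_ta  # FALSE FALSE TRUE
ends_ta    # FALSE FALSE TRUE
ta_gt      # TRUE FALSE TRUE
```

For alternation between whole patterns, `|` belongs outside the brackets, as in "TAG|TAT".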
y <- c("apple", "banana", "pear")
y
str_detect(y, "e")
words
# Number of words containing an e
sum(str_detect(words, "e"))
# Proportion of words ending in a vowel
mean(str_detect(words, "[aeiou]$"))
# Proportion of words starting with a vowel
mean(str_detect(words, "^[aeiou]"))
no_o <- !str_detect(words, "[ou]")
no_o
words[!str_detect(words, "[ou]")]
x
str_count(x, "[GC]")
df <- tibble(
  word = words,
  i = seq_along(word)
)
df
df %>% mutate(
  vowels = str_count(word, "[aeiou]"),
  consonants = str_count(word, "[^aeiou]")
)