This document covers how to use R Markdown in RStudio to create reproducible research.
In R Markdown, you can write long passages and descriptions without commenting out every line. In this first example, we will use the ToothGrowth dataset, in which guinea pigs were given different doses of vitamin C to observe the effects on tooth growth.
To run R in a markdown file, you need to denote the section to be treated as R code. This is a code chunk. Code chunks open with ```{r} on their first line and close with ``` on their last line.
Below is an example code chunk:
Toothdata <- ToothGrowth
head(Toothdata)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
The play button on code chunks allows you to run all code within the chunk. Results are printed inline in the markdown file.
The abline() function adds a straight line to an existing plot; when given a fitted linear model, it draws the corresponding regression line:
fit <- lm(len ~ dose, data = Toothdata)  # fit a linear model of length on dose
b <- fit$coefficients                    # store the intercept and slope
plot(len ~ dose, data = Toothdata)       # scatterplot of the raw data
abline(fit)                              # add the fitted regression line
The slope of the regression line is 9.7635714.
You can insert variables directly into the text, as above, using inline code: a backtick, the letter r, a space, the expression, and a closing backtick. To pull a single element out of a vector, follow the variable with its position in brackets.
The syntax is: variable[position].
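For instance, the slope sentence above could be written in the source like this (a minimal sketch; b is the coefficient vector stored earlier, and b[2] is the slope):

The slope of the regression line is `r b[2]`.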
You can put section headers into the final document by placing a # before the desired header text.
Headers require a space after the # or the line will not format as a header. Additionally, you can append {.unlisted .unnumbered} to a header to keep it excluded from the table of contents, which I’ll go over later.
You can also add tabs to the report. To do this, place {.tabset} after a header line; every subsequent lower-level header becomes a new tab, as shown in the sketch below.
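Below is a minimal sketch of the header and tabset syntax (the section names are made up for illustration):

# Results {.tabset}

## First tab

Content for the first tab.

## Second tab

Content for the second tab.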
Below are some example bullet points. Dashes (-) followed by a space and the desired characters are used to create bullet points, and indentation produces sub-bullets.
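For example (a minimal sketch):

- First point
  - A sub-point produced by indentation
- Second point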
Really nice quotes can be placed by using >, followed by a space and the desired quote.
“Genes are like the story, and DNA is the language that the story is written in.”
— Sam Kean
Hyperlinks can be embedded into these files using the syntax [RMarkdown](http://etc): the link text goes in square brackets, immediately followed by the URL in parentheses. When done correctly with a functional link, it renders as clickable text.
You can also make nice-looking formulas by enclosing text in dollar signs (or \[ and \] for display math). This uses LaTeX syntax, including ^ to create exponents and the ampersand to separate columns inside matrix environments.
Hardy-Weinberg Formula
\[p^2 + 2pq + q^2 = 1\]
You can also get more complex using backslash followed by a character name. This will print the desired character in our formula. Additionally, new lines can be used to create additional lines for the equation.
\[\theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
After ```{r, multiple arguments can be used to adjust how code chunks are evaluated, if at all. echo= governs whether the chunk’s code is printed in the output file; by default it is set to TRUE. Below is a chunk that has echo= set to true and eval= set to false.
Echo set to true and eval set to false:
print('Hello World')
eval= governs whether a code chunk is actually run. Below is a chunk that has echo= set to false and eval= set to true.
Echo set to false and eval set to true:
## [1] "Hello World"
In addition to eval= and echo=, the following are other useful chunk arguments (an example chunk header follows this list):
cache: if enabled, the chunk is not re-evaluated the next time knitr runs; the results of the first run are stored and reused. Good for code with long run times.
fig.width and fig.height: the size, in inches, of the graphical device used to record R plots. The figures are recorded for the knitr document, then written to image files that are saved separately.
out.width and out.height: the display size of the plots in the output document itself.
fig.cap: the text for the figure captions.
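Below is a sketch of a chunk header combining several of these options (the sizes and caption text are made up for illustration):

```{r, echo=TRUE, cache=TRUE, fig.width=6, fig.height=4, fig.cap='Tooth length by vitamin C dose'}
plot(len ~ dose, data = Toothdata)
```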
You can also use the code_folding option in the YAML header (the unique block at the very top of the document) to let the reader toggle between displaying and hiding the code. Setting code_folding to hide or show enables the toggle either way; the value sets the default state. Below is the YAML header for this document, as an example of this and other arguments:
title: "R Markdown in R Studio"
author: "Zachary Baham"
date: "2022-05-10"
output:
html_document:
code_folding: show
toc: true
toc_float: true
theme: simplex
highlights: monochrome
You can also add a table of contents to a document using the toc and toc_float arguments, as shown in the Code Folding tab. Pasted below is the same YAML header; note that toc and toc_float are set to true. This gives a very nice floating table of contents on the right-hand side of the document.
title: "R Markdown in R Studio"
author: "Zachary Baham"
date: "2022-05-10"
output:
html_document:
code_folding: show
toc: true
toc_float: true
theme: simplex
highlights: monochrome
Themes and highlighting styles can be specified in the YAML header as well, allowing you to add a splash of color and some fitting text formatting to your report.
Themes broadly change your highlight, hyperlink, and header colors and formats, as well as making some small spacing adjustments. To change this, set the theme argument in your YAML header to one of the Bootstrap themes bundled with R Markdown: default, cerulean, journal, flatly, darkly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, or yeti.
You can also change the code coloring by setting highlight to one of the bundled styles: default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, breezedark, or textmate.
There are many ways to customize an R Markdown report rendered to HTML, which makes it a great way to display a portfolio of your work if you’re marketing yourself to interested parties. Next, we’ll go over the main methods for wrangling raw data in R to prepare it for analysis.
We will be using the tidyverse for much of our data organization, so we’ll begin by using the library() command to load the tidyverse package into our R session. We’ll also be using nycflights13, a package of data on flights to and from New York City in 2013.
library(tidyverse)                # load the tidyverse packages
install.packages("nycflights13")  # install the flights data (only needed once)
Next, we’ll load the flights dataset from nycflights13 into a variable. To do this, we use the package name followed by :: and the desired object. We will then use the head() command, which prints the first six rows of our data.
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
In this section, we’re going to filter the data to focus on flights on October 14th. We will use the filter() command to do so, which takes the data frame plus any conditions to narrow the data by.
filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The command will print our filtered data, but if we want to manipulate it, we should store the subset in a variable like so:
oct_14_flight <- filter(my_data, month == 10, day == 14)
You can wrap the entire line of code above in an extra set of parentheses to both print the filtered data and store the subset in a variable.
Note that we use a double equals sign rather than a single one in the commands above: in R, == tests for equality, while a single = performs assignment or names a function argument. The comparison operators take the following forms: == (equal to), != (not equal to), < (less than), <= (less than or equal to), > (greater than), and >= (greater than or equal to).
As an example, below is the less-than operator being used to show flights through September:
(flight_thru_sept <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 252,474 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
In addition to the comparison operators, logical operators can also be used. These include & (and), | (or), and ! (not).
Below is another example, where we look specifically at flights in March and April.
March_April_Flight <- filter(my_data, month == 3 | month == 4)
Think for a moment about why we didn’t use the & operator. R, like many programming languages, takes conditions literally: month == 3 & month == 4 asks for individual flights that occurred in both months simultaneously, which is impossible and returns nothing. Thus, we use the | operator to look at all individual flights that occurred in March OR April.
Finally, let’s look at an example using the ! operator, which we already briefly covered in the normal operators section with “Not equal to” being !=. Below, we will get a subset of all flights except those in January.
Non_Jan_Flight <- filter(my_data, month != 1)
In R, there are a number of functions that enable the reorganization, selection, and filtering of data beyond the options above.
arrange() lets us order our data by given variables in sequence. For example, the code below sorts our dataset by year, then day, then month. By default, data is sorted in ascending order.
arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Alternatively, you can use desc() to sort in descending order instead.
descending <- arrange(my_data, desc(year), desc(day), desc(month))
print(descending)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 31 13 2359 14 439 437
## 2 2013 12 31 18 2359 19 449 444
## 3 2013 12 31 26 2245 101 129 2353
## 4 2013 12 31 459 500 -1 655 651
## 5 2013 12 31 514 515 -1 814 812
## 6 2013 12 31 549 551 -2 925 900
## 7 2013 12 31 550 600 -10 725 745
## 8 2013 12 31 552 600 -8 811 826
## 9 2013 12 31 553 600 -7 741 754
## 10 2013 12 31 554 550 4 1024 1027
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
select() allows you to specify the columns you want to see or manipulate. The code below pulls out just the year, month, and day columns of our dataset and stores them in a variable called calendar.
calendar <- select(my_data, year, month, day)
print(calendar)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
select() also allows you to take a range of columns. The variable below includes all columns between month and carrier, inclusive. Notice that the rows shown are missing their year column.
calendar2 <- select(my_data, month:carrier)
print(calendar2)
## # A tibble: 336,776 × 9
## month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <dbl> <int> <int>
## 1 1 1 517 515 2 830 819
## 2 1 1 533 529 4 850 830
## 3 1 1 542 540 2 923 850
## 4 1 1 544 545 -1 1004 1022
## 5 1 1 554 600 -6 812 837
## 6 1 1 554 558 -4 740 728
## 7 1 1 555 600 -5 913 854
## 8 1 1 557 600 -3 709 723
## 9 1 1 557 600 -3 838 846
## 10 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 2 more variables: arr_delay <dbl>,
## # carrier <chr>
select() can also make use of the ! operator to exclude specific columns or ranges of columns. The line below stores all columns except year, day, and the columns in between. Note that in addition to !, some functions like select() also accept - as a ‘not’ operator.
everything_else <- select(my_data, !(year:day))
print(everything_else)
## # A tibble: 336,776 × 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 517 515 2 830 819 11 UA
## 2 533 529 4 850 830 20 UA
## 3 542 540 2 923 850 33 AA
## 4 544 545 -1 1004 1022 -18 B6
## 5 554 600 -6 812 837 -25 DL
## 6 554 558 -4 740 728 12 UA
## 7 555 600 -5 913 854 19 B6
## 8 557 600 -3 709 723 -14 EV
## 9 557 600 -3 838 846 -8 B6
## 10 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
You can also rename elements of your data using rename(). This function renames the column on the right-hand side of the = to the new name on the left. Note that in the example we overwrite my_data with the rename() result, replacing the previous variable. This doesn’t change the raw file, but it does permanently change the variable’s contents for the session.
my_data <- rename(my_data, departure_time = dep_time)
head(my_data)
## # A tibble: 6 × 19
## year month day departure_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
If you need to add new columns to your data frame, the mutate function can do so. To begin, however, we’re going to make our data frame smaller to better facilitate this example.
my_data_s <- select(my_data, year:day, distance, air_time)
Next, we will calculate the average speed of the flights using mutate. To do this, we will create a new column whose values are equal to the distance divided by the air_time multiplied by 60.
my_data_speed <- mutate(my_data_s, speed = distance/air_time*60)
head(my_data_speed)
## # A tibble: 6 × 6
## year month day distance air_time speed
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2013 1 1 1400 227 370.
## 2 2013 1 1 1416 227 374.
## 3 2013 1 1 1089 160 408.
## 4 2013 1 1 1576 183 517.
## 5 2013 1 1 762 116 394.
## 6 2013 1 1 719 150 288.
What if we want to create a new data frame containing only calculated columns? We can use the transmute() function for this, which performs the calculations and keeps only the new columns. You can also create multiple new columns with different calculations in the same call. We do this in the example below with miles per hour and miles per minute:
airspeed <- transmute(my_data_s, speed_mph = distance/air_time*60 , speed_mpm = distance/air_time)
head(airspeed)
## # A tibble: 6 × 2
## speed_mph speed_mpm
## <dbl> <dbl>
## 1 370. 6.17
## 2 374. 6.24
## 3 408. 6.81
## 4 517. 8.61
## 5 394. 6.57
## 6 288. 4.79
summarize() collapses a data column into a single value by applying a function to it. In the example below, we find the mean delay time. Note the na.rm argument, short for “NA remove”: setting it to TRUE tells the function to ignore missing (NA) values.
summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
As we can see, the average delay time was ~12.6 minutes.
We can get additional value out of summarize() by using group_by(), which groups the data by the values of other columns. Below is an example where we group the data by year, month, and day, then compute the same delay mean for each group.
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # … with 355 more rows
As you can see, we now have a mean delay by each specific day, with the first value showing a mean delay of 11.5 minutes on January 1st, 2013.
There are a number of other helpful functions for data selection and organization; a sketch of two of them follows. The examples use quotation marks to denote string values, but numeric values can be used as well.
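As a sketch, here are two such helpers, between() from dplyr and the base %in% operator (standing in for whichever helpers the original list covered):

filter(my_data, between(month, 6, 8))        # flights from June through August
filter(my_data, carrier %in% c('AA', 'UA'))  # flights on either of two carriers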
As was briefly mentioned for summarize(), some data can be NA (“not available”, i.e., missing). These are essentially empty values or data that R cannot parse. NA values can be detrimental to many analyses, since any summary function or operation that includes an NA value will itself return NA. We can see this in the example below, where we try to get the mean delay of our grouped data without the na.rm argument:
summarize(by_day, delay = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # … with 355 more rows
As you can see, every calculated delay is NA. This is because cancelled flights in this dataset have NA as their delay. We can check which rows hold NA values using the is.na() function. Below, we check airspeed for NA values. (head() is used here because the full is.na() result would print hundreds of lines.)
check_na <- is.na(airspeed)
head(check_na)
## speed_mph speed_mpm
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE
## [6,] FALSE FALSE
We can remove NA values using the filter() function that we went over previously. Below, we keep only the rows of my_data where is.na() is FALSE, dropping anything with a missing departure or arrival delay.
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
head(not_cancelled)
## # A tibble: 6 × 19
## year month day departure_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Now, with the NA rows removed, we can rerun summarize() on the new variable without the na.rm argument and get the same result as the original summarize() call (12.6 minutes).
summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
The final result is actually slightly different, because filtering on arr_delay as well removed a few extra rows, but it rounds to the same approximate value of 12.6.
Counting can be used to tally quantities of value ranges, value types, or exact values. For example, we can count the number of dep_delay values that are NA.
sum(is.na(my_data$dep_delay))
## [1] 8255
This shows us that 8,255 values of dep_delay in my_data are NA. Likewise, you can count how many values are not NA.
sum(!is.na(my_data$dep_delay))
## [1] 328521
Note the use of the dollar sign: following a data frame with $ and a variable name targets that specific column for the function.
When using tibbles (which will be discussed more later in this document), results can be piped to avoid the dollar-sign targeting of specific variables. Using this, we can compute per-day summaries of the data.
Below, the %>% operator (the pipe) passes the dataset in front of it into everything that follows, letting us chain grouping and summarizing; here we compute the mean departure time for each day.
my_data %>%
group_by(year, month, day) %>%
summarize(mean = mean(departure_time, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 1385.
## 2 2013 1 2 1354.
## 3 2013 1 3 1357.
## 4 2013 1 4 1348.
## 5 2013 1 5 1326.
## 6 2013 1 6 1399.
## 7 2013 1 7 1341.
## 8 2013 1 8 1335.
## 9 2013 1 9 1326.
## 10 2013 1 10 1333.
## # … with 355 more rows
Tibbles are essentially modernized data frames that adjust some of the legacy behaviors of data frames. R is an old language, and tibbles give data frames functionality appropriate for modern use. Below, we will begin by loading the tibble package and converting the iris dataset to a tibble.
library(tibble)
as_tibble(iris)
Tibbles can also be created from scratch using tibble(). An example is shown below:
tibble(
x = 1:5,
y = 1,
z = x^2 + y
)
## # A tibble: 5 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
You can also use tribble() to create small data tables laid out row by row. These are more ‘drawn out’ when written; an example is given below:
tribble(
~gene_a, ~gene_b, ~gene_c,
##########################
110, 112, 114,
6, 5, 4
)
## # A tibble: 2 × 3
## gene_a gene_b gene_c
## <dbl> <dbl> <dbl>
## 1 110 112 114
## 2 6 5 4
Tibbles are designed not to overwhelm your console when printing, showing only the first few rows by default. Recall the omission of the full is.na() printout; plain data frames will print many lines before reaching the print limit unless you use head().
Subsetting tibbles is much like subsetting from a data frame. Below, we set up a tibble for the flights data from nycflights13.
df_tibble <- tibble(nycflights13::flights)
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
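If you want to see more than the default ten rows, the tibble print() method accepts n and width arguments (a minimal sketch):

print(df_tibble, n = 15, width = Inf)  # show 15 rows and every column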
We can subset by column name using the dollar sign. Below, we can show the subset of all carriers from our flight data. (head was used to truncate the print)
head(df_tibble$carrier)
## [1] "UA" "UA" "AA" "B6" "DL" "UA"
We can also subset by position using brackets. Below, we will use the column number 6 to look at the subset of dep_delay. (head was used to truncate as above)
head(df_tibble[[6]])
## [1] 2 4 2 -1 -6 -4
To use this kind of subsetting in a pipe, you must use . as a placeholder before the dollar sign. Below is an example of the carrier subset using piping.
head(df_tibble %>%
.$carrier)
## [1] "UA" "UA" "AA" "B6" "DL" "UA"
As nice as tibbles are, some older functions do not play nice with them, so you might have to convert them back into data frames. This is easily done by simply using the function as.data.frame(). See below for an example of this conversion and a comparison of tibbles and data frames:
class(df_tibble)
## [1] "tbl_df" "tbl" "data.frame"
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble_2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
Tidy data makes for readable and easy analysis. To make a dataset tidy, the tidyverse has three main rules: each variable gets its own column, each observation gets its own row, and each value gets its own cell.
Following these rules consistently makes for readable and understandable code, and it makes deciphering your own older code far easier when you revisit past projects.
But, before anything else, we have to start by loading the tidyverse library:
library(tidyverse)
Next, we’ll store the built-in dataset called ‘women’, a small set of the heights and weights of fifteen women, in a tibble, then mutate it to add a bmi column.
bmi <- tibble(women)
bmi %>%
mutate(bmi = (703*weight)/(height)^2)
## # A tibble: 15 × 3
## height weight bmi
## <dbl> <dbl> <dbl>
## 1 58 115 24.0
## 2 59 117 23.6
## 3 60 120 23.4
## 4 61 123 23.2
## 5 62 126 23.0
## 6 63 129 22.8
## 7 64 132 22.7
## 8 65 135 22.5
## 9 66 139 22.4
## 10 67 142 22.2
## 11 68 146 22.2
## 12 69 150 22.1
## 13 70 154 22.1
## 14 71 159 22.2
## 15 72 164 22.2
Sometimes, you’ll run into datasets that aren’t structured tidily. We’ll use a table built into the tidyverse for this, called table4a, which records the number of tuberculosis cases in Afghanistan, Brazil, and China.
print(table4a)
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
As you can see in the printed table, column A holds one variable (country), but columns B and C are both values of the same variable (year). Each row therefore holds two observations, breaking rule #2 for tidy data. To fix this, we can use the gather() function to restructure the table so that year becomes its own column.
table4a_gathered <- table4a %>%
gather('1999','2000', key = 'year', value = 'cases')
table4a_gathered
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
As you can see in the code itself, we start by listing the columns to be gathered, namely the two year columns. Next, the key argument names the new column that will hold the old column names. Finally, value names the column under which the data originally stored beneath 1999 and 2000 will be organized.
In the new tibble, we can see that each variable has a unique column, and the case numbers remained coupled with their year and country.
Let’s look at another example using table4b.
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
We have much the same issues as we did with table4a here, so let’s gather it.
table4b_gathered <- table4b %>%
gather('1999','2000', key = 'year', value = 'population')
table4b_gathered
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
Now, with our tables reorganized, what if we want to join them together by their common variables? For this, we can use dplyr’s left_join() function.
table4 <- left_join(table4a_gathered, table4b_gathered)
## Joining, by = c("country", "year")
table4
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
Spreading is the opposite of gathering. This can be useful with datasets that have redundant data, such as in table2.
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
The repeating country and year values make for untidy data, since each observation for a given country and year is split across two rows that differ only by ‘type’. To fix this, we will use the spread() function to spread the type column into two new variables - cases and population - moving each count into the column named by its type. We can see the results below:
spread(table2, key = 'type', value = 'count')
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Similar to gather(), key defines the column whose values become the new column names, while value defines the column whose data fills those new columns.
spread() tends to make long tables shorter and wider by giving repeated observation types their own columns, while gather() makes wide tables narrower and longer by collecting columns into key-value rows; a sketch of the modern equivalents follows.
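As an aside, in current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same two reshapes:

# modern equivalent of the gather() call on table4a
pivot_longer(table4a, c('1999', '2000'), names_to = 'year', values_to = 'cases')

# modern equivalent of the spread() call on table2
pivot_wider(table2, names_from = type, values_from = count)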
Now, what do we do if two observation types share the same cell? We see this issue in table3, whose rate column holds both cases and population in a single cell.
table3
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The separate() function will split the two observations into their own columns using the names we give it. See the example below:
table3_fix <- table3 %>%
separate(rate, into=c('cases','population'))
table3_fix
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Despite this appearing to fully fix our tibble, recall that class matters for analysis: cases and population retained the chr (character) type despite holding integers. This can be fixed by adding the convert = TRUE argument to separate(), as seen below:
table3_fix <- table3 %>%
separate(rate, into=c('cases', 'population'), convert = TRUE)
table3_fix
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Importantly, it must be noted that separate() will, by default, split at any non-alphanumeric character. If you want to specify the separation boundary explicitly, you can use the sep argument. An example is shown below:
table3_fix <- table3 %>%
separate(rate, into=c('cases','population'), sep = '/', convert = TRUE)
table3_fix
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Here’s another example, where we split year into century and year using the sep value 2. A numeric sep splits by position, so year is split after its second character.
table3_cent <- table3 %>%
separate(year, into=c('century','year'), sep = 2, convert = TRUE)
table3_cent
## # A tibble: 6 × 4
## country century year rate
## <chr> <int> <int> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 0 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 0 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 0 213766/1280428583
What do we need to do if we want the inverse effect of separate? Let’s start with table5, which is much like the table we previously created by separating the year, except without conversion. Below is a print of the original table followed by the same table with the century and year united.
table5
## # A tibble: 6 × 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5_un <- table5 %>%
unite(date, century, year, sep='')
table5_un$date <- as.integer(table5_un$date)
table5_un
## # A tibble: 6 × 3
## country date rate
## <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Note that in this code, I used the sep argument to set an empty string as the separator; in unite(), sep defines the symbol placed between the united values, which is an underscore (_) by default. Additionally, I added the line beginning with table5_un$date, which converts the united character values into their intended integer class.
There are two forms of missing values: explicit (flagged with NA) and implicit (simply absent from the data). Below, I’ll construct a gene_data tibble for illustration purposes:
gene_data <- tibble(
gene = c('a','a','a','a','b','b','b'),
nuc = c(20,22,24,25,NA,42,67),
run = c(1,2,3,4,2,3,4)
)
gene_data
## # A tibble: 7 × 3
## gene nuc run
## <chr> <dbl> <dbl>
## 1 a 20 1
## 2 a 22 2
## 3 a 24 3
## 4 a 25 4
## 5 b NA 2
## 6 b 42 3
## 7 b 67 4
The nucleotide count for gene b in run 2 is explicitly missing, while the nucleotide count for gene b in run 1 is implicitly missing. Implicit missing data can be troublesome as there is no way to tell that it is missing without reading the tibble directly.
One way to make implicit missing values explicit is by putting runs in columns via spread().
gene_data_2 <- gene_data %>%
spread(gene, nuc)
gene_data_2
## # A tibble: 4 × 3
## run a b
## <dbl> <dbl> <dbl>
## 1 1 20 NA
## 2 2 22 NA
## 3 3 24 42
## 4 4 25 67
With this layout, we can clearly see that the implicit missing value for b has been made explicit. To remove these values, we can spread and re-gather the data with na.rm = TRUE.
gene_data_3 <- gene_data %>%
spread(gene, nuc) %>%
gather(gene, nuc, 'a':'b', na.rm = TRUE)
gene_data_3
## # A tibble: 6 × 3
## run gene nuc
## <dbl> <chr> <dbl>
## 1 1 a 20
## 2 2 a 22
## 3 3 a 24
## 4 4 a 25
## 5 3 b 42
## 6 4 b 67
By first spreading the data, we made all implicit and explicit values for gene b visible. Then, using gather, we reorganized the gene a and gene b columns back into a single gene column, ignoring any NA values thanks to the na.rm = TRUE argument. Now, we have a clean and tidy version of our gene_data dataset.
As an aside, another way to make implicit values explicit is to use the complete() function.
gene_data %>%
complete(gene, run)
## # A tibble: 8 × 3
## gene run nuc
## <chr> <dbl> <dbl>
## 1 a 1 20
## 2 a 2 22
## 3 a 3 24
## 4 a 4 25
## 5 b 1 NA
## 6 b 2 NA
## 7 b 3 42
## 8 b 4 67
Sometimes a missing value, explicit or implicit, means that the value above it should be carried forward in the dataset. To illustrate this, I will construct a tribble:
treatment <- tribble(
~ person, ~ treatment, ~ response,
##############################################
"Isaac", 1, 7,
NA, 2, 10,
NA, 3, 9,
"VDB", 1, 8,
NA, 2, 11,
NA, 3, 10,
)
treatment
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 VDB 1 8
## 5 <NA> 2 11
## 6 <NA> 3 10
We can see that the NA values here are simply empty spaces meant to represent the value above them. In order to fill these values so the dataset can be used properly, we can use the fill() function.
treatmentfill <- treatment %>%
fill(person)
treatmentfill
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 Isaac 2 10
## 3 Isaac 3 9
## 4 VDB 1 8
## 5 VDB 2 11
## 6 VDB 3 10
Oftentimes, you will be working with multiple data tables. The dplyr package allows for the joining of two data tables based on common variables, which we’ve already seen with left_join() in the Gathering section. There are three families of verbs for working with pairs of tables:
Mutating joins add new variables to one data frame from matching observations in another.
Filtering joins filter observations in one data frame based on whether they match observations in the other (see the sketch after this list).
Set operations treat observations as if they were set elements.
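As a sketch of a filtering join, using dplyr’s semi_join() and anti_join() on the flights data:

library(dplyr)
library(nycflights13)

# the ten most common destinations
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)

# semi_join() keeps flights whose destination appears in top_dest
semi_join(flights, top_dest, by = 'dest')

# anti_join() keeps flights whose destination does not appear in top_dest
anti_join(flights, top_dest, by = 'dest')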
To demonstrate some of the things that dplyr can do, let’s begin by loading nycflights13.
library(nycflights13)
Keys are unique identifiers for observations. A primary key uniquely identifies an observation within its own table; in other words, a primary key is a variable with a unique value on every row. We can check whether a variable is suitable as a primary key using the code below, which tests whether any tailnum in the planes dataset of nycflights13 occurs more than once.
planes %>%
count(tailnum) %>%
filter(n>1)
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
The lack of any rows here means that no two tailnum values in planes are identical, making tailnum perfectly suitable as a primary key. Below is an example of a variable that isn’t usable as a primary key:
planes %>%
count(model) %>%
filter(n>1)
## # A tibble: 79 × 2
## model n
## <chr> <int>
## 1 717-200 88
## 2 737-301 2
## 3 737-3G7 2
## 4 737-3H4 105
## 5 737-3K2 2
## 6 737-3L9 2
## 7 737-3Q8 5
## 8 737-3TO 2
## 9 737-401 4
## 10 737-4B7 18
## # … with 69 more rows
First, let’s select the relevant data for our purposes here:
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 × 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # … with 336,766 more rows
Next, we’re going to drop the origin and destination columns, then join the table with the airlines data by carrier to attach each carrier’s full name to its flights.
flight_carrier <- flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = 'carrier')
flight_carrier
## # A tibble: 336,776 × 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # … with 336,766 more rows
stringr is important for many bioinformatic purposes, since genetic sequences are handled as strings. First, as always, we load the necessary library:
library(stringr)
Strings in R can be created via single or double quotes:
string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite quotation type'
string1
## [1] "this is a string"
string2
## [1] "to put a \"quote\" in your string, use the opposte quotation type"
Multiple strings can be stored in collections as character vectors:
string3 <- c('one','two','three')
string3
## [1] "one" "two" "three"
String length can be measured using str_length():
str_length(string1)
## [1] 16
str_length(string3)
## [1] 3 3 5
To combine strings, you can use str_c() with a comma separating both strings. They are combined directly with no additional characters.
str_c('X','y')
## [1] "Xy"
str_c(string1, string2)
## [1] "this is a stringto put a \"quote\" in your string, use the opposte quotation type"
You can use the sep argument to control what is placed between combined strings:
str_c(string1, string2, sep = ". If you want ")
## [1] "this is a string. If you want to put a \"quote\" in your string, use the opposte quotation type"
Strings can be subsetted using the str_sub() function. For the variable below, we’re going to pull out just the numbers from each string. To do this, we give str_sub() the start and end positions of the range of characters to extract.
HSP <- c('HSP123','HSP234','HSP456')
HSP
## [1] "HSP123" "HSP234" "HSP456"
HSPnum <- str_sub(HSP,4,6)
HSPnum
## [1] "123" "234" "456"
Alternatively, you can use negative numbers to count backwards from the rightmost part of the string:
HSPmun <- str_sub(HSP,-3,-1)
HSPmun
## [1] "123" "234" "456"
Strings are often case-sensitive, so there are functions to change case: str_to_lower() converts all characters to lowercase, and str_to_upper() converts them to uppercase. An example of each is shown below.
HSPLow <- str_to_lower(HSP)
HSPLow
## [1] "hsp123" "hsp234" "hsp456"
In this section, we’re going to look at techniques to improve your gene analysis and binding-motif-finding capabilities using stringr. Below, we create a variable with a few short gene sequences and load a library to help with visualization:
library(htmlwidgets)
x <- c('ATTAGA','CGCCCCCGGAT','TATTA')
str_view(x, 'G')
With str_view(), we can highlight the first match of a pattern in each string. Periods act as wildcards that match any single character, letting you widen the highlighted region:
str_view(x, '.G.')
You can also use ^ and $ to make ‘anchors’, which lets you specify that highlights only appear if a string starts or ends specifically with the desired characters:
str_view(x, '^TA')
str_view(x, 'AT$')
Other useful pattern options exist for str_view(), including:
\d: matches any digit (written '\\d' inside an R string)
\s: matches any whitespace character (written '\\s')
[abc]: matches any one of a, b, or c
In the below example, the search will essentially look for TAG or TAT since G and T are placed within the brackets and therefore are interchangeable.
str_view(x, 'TA[GT]')
Conversely, [^abc] will match for anything other than one of a, b, or c:
str_view(x, 'TA[^T]')
str_view() has shown us matches visually, but str_detect() reports matches as TRUE/FALSE values instead of printing highlights. Let’s start by making a dummy set of strings, then seeing whether we can detect the letter e.
y <- c('apple','banana','pear')
y
## [1] "apple" "banana" "pear"
str_detect(y, 'e')
## [1] TRUE FALSE TRUE
As you can see, the function determines whether e is present in each string, returning TRUE if the string contains it and FALSE if it does not. Next, let’s see how many words contain the letter e in the built-in words dataset.
sum(str_detect(words, 'e'))
## [1] 536
Let’s get more complex; what proportion of words in the set start with a vowel, and what proportion end with a vowel?
mean(str_detect(words, '^[aeiou]'))
## [1] 0.1785714
mean(str_detect(words, '[aeiou]$'))
## [1] 0.2765306
17.9% of words in this dataset start with a vowel, and 27.7% end with one. Next, let’s find all words that don’t contain ‘o’ or ‘u’ by negating str_detect() when indexing the words set:
wrds <- words[!str_detect(words, '[ou]')]
head(wrds)
## [1] "a" "able" "accept" "achieve" "act" "active"
In addition to str_detect(), str_count() can be used to count the number of matches found in each string. Let’s use the x dataset we made:
x
## [1] "ATTAGA" "CGCCCCCGGAT" "TATTA"
str_count(x,'[GC]')
## [1] 1 9 0
Let’s couple string matching with mutate. First, let’s create a tibble for the words dataset:
df <- tibble(
word = words,
wordnum = seq_along(word)
)
head(df)
## # A tibble: 6 × 2
## word wordnum
## <chr> <int>
## 1 a 1
## 2 able 2
## 3 about 3
## 4 absolute 4
## 5 accept 5
## 6 account 6
Next, let’s mutate the tibble to add a vowels and consonants count for each word:
df %>%
mutate(
vowels = str_count(words, '[aeiou]'),
consonants = str_count(words, '[^aeiou]')
)
## # A tibble: 980 × 4
## word wordnum vowels consonants
## <chr> <int> <int> <int>
## 1 a 1 1 0
## 2 able 2 2 2
## 3 about 3 3 2
## 4 absolute 4 4 4
## 5 accept 5 2 4
## 6 account 6 3 4
## 7 achieve 7 4 3
## 8 across 8 2 4
## 9 act 9 1 2
## 10 active 10 3 3
## # … with 970 more rows
Note that df itself is unchanged, since we didn’t assign the result of mutate():
head(df)
## # A tibble: 6 × 2
## word wordnum
## <chr> <int>
## 1 a 1
## 2 able 2
## 3 about 3
## 4 absolute 4
## 5 accept 5
## 6 account 6
Microarrays, though now somewhat dated, are still widely analyzed, since many older experiments used this method of genetic data collection. The method often required knowing in advance which genes to test for, hence the modern tendency toward full-genome sequencing. Nonetheless, microarrays remain in use and are important for broad analyses like meta-analyses.
Now, before we get into it proper, we have to set our working directory and load a number of libraries:
library(ath1121501.db)
library(ath1121501cdf)
library(annotate)
library(affy)
library(limma)
library(oligo)
library(dplyr)
library(stats)
library(reshape)
library(imager)
library(stringr)
## setwd("~/Desktop/classroom/myfiles")
Naturally, your working directory may vary; set it to wherever you have Bric16_Targets.csv and the other necessary files saved. Because of how our RStudio is set up, we cannot change the working directory, so the setwd() line is commented out.
Now that we have our directory set and libraries loaded, we can read our file and begin the analysis.
targets <- readTargets("Bric16_Targets.csv", sep=',', row.names = 'filename')  # sample metadata
ab <- ReadAffy()  # read all .CEL files in the working directory
With the data read and stored in variables, we’ll plot a basic histogram:
hist(ab)
One important thing to note: in microarray experiments, measurements are based on the light given off by each probe. This light is directly proportional to the amount of bound RNA, and one sample may simply contain more RNA on average than another, so we must normalize our data.
eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
## sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL 1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL 2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL 3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL 4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL 5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL 6
hist(eset)
As you can see, compared to the last graph, the distributions now fit much more tightly together; the differences in overall signal have been normalized away, giving us relative values that can be compared across microarray experiments. With that done, let’s characterize our data next.
We will start by getting the probe names from our dataset. featureNames() returns the actual probe names, and getSYMBOL() finds the gene symbols for those probes using one of the databases we loaded. Lastly, we use read.delim() to create a data frame from a large Arabidopsis array-annotation table.
ID <- featureNames(eset)
Symbol <- getSYMBOL(ID, 'ath1121501.db')
affydata <- read.delim("affy_ATH1_array_elements.txt")
First, we need a variable representing the ground or flight condition of each sample. We set this up by storing the gravity column of our targets file as the condition variable, then use model.matrix() to build the design matrix encoding those conditions. colnames() sets the column names once the conditions have been categorized.
condition <- targets$gravity
design <- model.matrix(~factor(condition))
colnames(design) <- c('flight','ground')
With our flight and ground variable set up, we will first fit a linear model for each gene in our data given a series of arrays using lmFit and eBayes. lmFit establishes the linear model and eBayes will compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors. In other words, eBayes will show us if there is significant differential expression in our linear model. This code is shown below:
fit <- lmFit(eset, design)
fit <- eBayes(fit)
Next, we will make the data look a little nicer with options(digits=2) to limit the number of digits rendered, as well as topTable to have a list of the most differentially expressed genes. We will also use an FDR (false discovery rate) adjustment as our multiple-test correction.
options(digits = 2)
top <- topTable(fit, coef = 2, n = Inf, adjust='fdr')
Next, let’s create a data frame to put everything together. Each line of the construction below creates a column from our annotation database’s column of the same name. This frame includes gene names, KEGG pathway annotations, gene ontology data, symbols, and accession numbers.
After making this data frame, we merge it with our topTable of significantly differentially expressed genes; by.x = 0 and by.y = 0 tell merge() to match on row names. Then we merge that result with affydata, matching the row names against affydata’s array_element_name column. Lastly, we set ACCNUM to character class.
Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ', '),
KEGG = sapply(contents(ath1121501PATH), paste, collapse = ', '),
GO = sapply(contents(ath1121501GO), paste, collapse = ','),
SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ', '),
ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ', '))
all <- merge(Annot, top, by.x=0, by.y=0, all = TRUE)
all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")
all2$ACCNUM <- as.character(all2$ACCNUM)
Finally, with our table compiled, we write the resulting data to a file. Note the sep argument, which places tabs as separators; despite the .csv extension, the file is tab-delimited.
write.table(all2, file="BRIC_16_FINAL.csv", row.names = TRUE, col.names = TRUE, sep = '\t')
Next, we’re going to use the org.At.tair.db database, which includes much more diverse information on the differentially expressed genes we were looking at. Below is a list of the columns of data in the database:
columns(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
With this new database, we will add a column to our all2 dataframe called entrez using mapIds. mapIds looks at the arabidopsis database that we’re using and matches ACCNUMs to their entrezID. The keytype denotes that ACCNUM is TAIR-type, and multiVals is an argument that tells the function to use the first variable if multiple are present.
We’ll also store the fold change as its own variable for our next section on Pathview, which will give biological context to our data.
all2$entrez = mapIds(org.At.tair.db,
keys = all2$ACCNUM,
column = "ENTREZID",
keytype = "TAIR",
multiVals = 'first')
foldchanges <- as.data.frame(all2$logFC)
First, let’s load our libraries and set up some annotation data. pathview builds excellent visual pathway diagrams, and the gage package’s name stands for Generally Applicable Gene-set Enrichment.
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)
Next, we will assign genes with entrezIDs to their appropriate fold change values:
foldchanges = all2$logFC
names(foldchanges) = all2$entrez
head(foldchanges)
## <NA> <NA> <NA> <NA> <NA> <NA>
## -0.742 -0.826 -1.214 -0.568 0.120 0.055
In this next step, we are going to map genes to our KEGG annotations. We start by setting kegg.ath (our Arabidopsis KEGG data) to the result of kegg.gsets() for Arabidopsis thaliana (‘ath’) with entrez IDs, then keep only the signaling and metabolic gene sets using the sigmet.idx index. See this step below:
kegg.ath = kegg.gsets('ath', id.type = 'entrez')
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
To get the results of our analysis, we will use the gage() function with foldchanges and our gene set kegg.ath.sigmet. Note the same.dir argument: setting it to TRUE tests for gene sets whose members change in the same direction, reporting up- and down-regulated pathways separately:
keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## ath03010 Ribosome 2.4e-27 11.5 2.4e-27 2.7e-25
## ath00195 Photosynthesis 7.6e-06 4.6 7.6e-06 4.3e-04
## ath04145 Phagosome 3.2e-05 4.1 3.2e-05 1.2e-03
## ath01230 Biosynthesis of amino acids 5.5e-05 3.9 5.5e-05 1.5e-03
## ath00190 Oxidative phosphorylation 6.0e-04 3.3 6.0e-04 1.3e-02
## ath00020 Citrate cycle (TCA cycle) 1.4e-03 3.1 1.4e-03 2.6e-02
## set.size exp1
## ath03010 Ribosome 261 2.4e-27
## ath00195 Photosynthesis 44 7.6e-06
## ath04145 Phagosome 71 3.2e-05
## ath01230 Biosynthesis of amino acids 201 5.5e-05
## ath00190 Oxidative phosphorylation 97 6.0e-04
## ath00020 Citrate cycle (TCA cycle) 57 1.4e-03
##
## $less
## p.geomean stat.mean p.val
## ath04016 MAPK signaling pathway - plant 0.0035 -2.7 0.0035
## ath00906 Carotenoid biosynthesis 0.0198 -2.1 0.0198
## ath04141 Protein processing in endoplasmic reticulum 0.0283 -1.9 0.0283
## ath00592 alpha-Linolenic acid metabolism 0.0552 -1.6 0.0552
## ath00071 Fatty acid degradation 0.1160 -1.2 0.1160
## ath04122 Sulfur relay system 0.1415 -1.1 0.1415
## q.val set.size exp1
## ath04016 MAPK signaling pathway - plant 0.4 129 0.0035
## ath00906 Carotenoid biosynthesis 1.0 28 0.0198
## ath04141 Protein processing in endoplasmic reticulum 1.0 189 0.0283
## ath00592 alpha-Linolenic acid metabolism 1.0 36 0.0552
## ath00071 Fatty acid degradation 1.0 45 0.1160
## ath04122 Sulfur relay system 1.0 11 0.1415
##
## $stats
## stat.mean exp1
## ath03010 Ribosome 11.5 11.5
## ath00195 Photosynthesis 4.6 4.6
## ath04145 Phagosome 4.1 4.1
## ath01230 Biosynthesis of amino acids 3.9 3.9
## ath00190 Oxidative phosphorylation 3.3 3.3
## ath00020 Citrate cycle (TCA cycle) 3.1 3.1
We use lapply with head here to get a look at the first rows of each element of the results list.
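If you want only the pathways that pass a significance cutoff, the $greater and $less matrices can be filtered directly by their q.val column; a sketch, assuming a cutoff of 0.05:
sig_up <- keggres$greater[!is.na(keggres$greater[, "q.val"]) & keggres$greater[, "q.val"] < 0.05, ]   # significantly upregulated pathways
head(sig_up)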
Thanks to gage, our keggres data is separated into downregulated (less) and upregulated (greater) sections. With this in mind, we can separate these into their own variables, then write each into a csv file for storage or later analysis:
greater <- keggres$greater
lesser <- keggres$less
write.csv(greater, file = 'BRIC_16_pathview_Greater_Pathways.csv')
write.csv(lesser, file = 'BRIC_16_pathview_Lesser_Pathways.csv')
To begin the final stage of the mapping process, we will take our upregulated pathways, keep only the first five, and convert their IDs to characters. In other words, we’re taking the first five rows of keggres$greater, formatting them as a new dataframe, then formatting the pathway IDs as characters.
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tibble::as_tibble() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
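Since keggres$greater is already ordered by significance (as the head() output above shows), the same five IDs can also be grabbed with one line of base R:
keggrespathways = head(rownames(keggres$greater), 5)   # equivalent to the pipeline above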
Next, we’ll trim off the full names of each pathway, leaving only the eight-character pathway identifiers. This cleans up our pathview diagram nicely. We’ll also go ahead and store the character-class logFC values from all2 as our genedata variable for later use.
keggresids_greater = substr(keggrespathways, start=1, stop=8)
genedata <- as.character(all2$logFC)
Finally, we will define the function that produces our pathview diagram. We will call this function plot_pathway; it takes a pathway ID (pid) and calls pathview to visually show fold changes along the genetic pathways of Arabidopsis thaliana. See this definition below:
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = 'ath', new.signature = FALSE)
When we plot the actual pathways, we will store the output in a throwaway variable called tmp. sapply() applies the function to each of the five pathway IDs in keggresids_greater, generating diagrams of upregulation/downregulation on the relevant pathways.
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = 'ath'))
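As a side note, since plot_pathway already wraps pathview, the sapply call can reuse it directly; the only difference from the inline version above is that plot_pathway also sets new.signature = FALSE:
tmp = sapply(keggresids_greater, plot_pathway)   # equivalent, reusing the function defined above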
These files will be saved to disk, with the tmp variable being a throwaway object list. With that, we will have five different pathview diagrams that elegantly show the fold changes of the five pathways we selected.
I will also use file.rename to rename the pathview images, keeping them separate from the pathview images we will produce later.
To render these files in our markdown document, we can run the following code to tell R what files to use, then, in the next chunk, tell knitr to include the images stored in the micro_filenames object.
# Rename the new pathview images so later runs will not overwrite them
file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"),
str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"), pattern='pathview.png', 'pathview.microarray.png'))
# Collect the renamed images; load.image comes from the imager package
micro_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.microarray.png")
allmicro_images <- lapply(micro_filenames, load.image)
dev.off()
knitr::include_graphics(micro_filenames)
As always, we’ll start this section by loading our required packages, including org.Mm.eg.db, a database of mouse genomic annotation.
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
To begin the analysis, we’re going to load our raw data, create a group variable to define ground or flight categories, then utilize the DGEList() function to create a differential gene expression list via edgeR.
rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_id")
group <- factor(c('1','1','1','1','1','1','2','2','2','2','2','2'))
dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
## ..@ .Data:List of 2
## .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
## .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
## .. ..$ :'data.frame': 12 obs. of 3 variables:
## .. .. ..$ group : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
## .. .. ..$ lib.size : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
## .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ names: chr [1:2] "counts" "samples"
This code chunk prints quite a bit of information about our list, including the class of the object and a preview of the data stored in it.
Next, we want to filter out lowly expressed genes. The following code keeps only genes whose counts-per-million (CPM) exceed 2 in at least four samples.
keep <- rowSums(cpm(dgeGlm)>2) >=4
dgeGlm <- dgeGlm[keep,]
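To see how many genes pass the filter, tabulate the logical keep vector:
table(keep)   # FALSE = filtered out, TRUE = retained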
With this narrowed dataset, we’ll only be looking at genes that are reasonably expressed across the samples. The below section of code shows the names of our differential expression data, and then prints the sample table as a snapshot into the state of our data.
names(dgeGlm)
## [1] "counts" "samples"
dgeGlm[["samples"]]
## group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23 1 4.0e+07 1
## Mmus_C57.6J_KDN_FLT_Rep2_M24 1 4.1e+07 1
## Mmus_C57.6J_KDN_FLT_Rep3_M25 1 3.8e+07 1
## Mmus_C57.6J_KDN_FLT_Rep4_M26 1 3.9e+07 1
## Mmus_C57.6J_KDN_FLT_Rep5_M27 1 3.7e+07 1
## Mmus_C57.6J_KDN_FLT_Rep6_M28 1 3.6e+07 1
## Mmus_C57.6J_KDN_GC_Rep1_M33 2 3.7e+07 1
## Mmus_C57.6J_KDN_GC_Rep2_M34 2 3.7e+07 1
## Mmus_C57.6J_KDN_GC_Rep3_M35 2 4.0e+07 1
## Mmus_C57.6J_KDN_GC_Rep4_M36 2 3.6e+07 1
## Mmus_C57.6J_KDN_GC_Rep5_M37 2 3.8e+07 1
## Mmus_C57.6J_KDN_GC_Rep6_M38 2 3.5e+07 1
Now, to begin using edgeR proper, we need to set up our data in a form the package can interpret. To achieve this, we’ll create a design matrix using model.matrix based on our group variable.
Then, we will:
- calculate the common negative binomial dispersion parameter
- estimate the abundance-dispersion trend
- compute an empirical Bayes estimate of the negative binomial dispersion parameter for each tag, with expression levels specified via a log-linear model
This is shown in the below code chunk:
design <- model.matrix(~group)
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
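As a side note, current versions of edgeR can perform all three dispersion estimates in a single call with estimateDisp(); a minimal sketch of the equivalent step:
dgeGlmDisp <- estimateDisp(dgeGlm, design)   # common, trended, and tagwise dispersions in one call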
With these values calculated, let’s plot dgeGlmTagDisp to give some context to all this statistical jargon:
plotBCV(dgeGlmTagDisp)
This plot shows the distribution of the variation between samples. Good data should show a fairly uniform curve, and we see that our data indeed has a relatively uniform appearance. However, there is more analysis to do.
For the next stage of our analysis, we’ll fit a linear regression and log-linear model to our data using glmFit and glmLRT.
Then, we will use topTags to pull out the most differentially expressed genes from our object in a similar fashion to topTable from our microarray analysis.
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef=2)
ttGlm <- topTags(lrt, n = Inf)
Using decideTestsDGE(), we can flag genes that meet a desired significance threshold, here p < 0.1 with the false-discovery-rate (fdr) adjustment; wrapping the call in summary() tabulates how many genes are down, up, or not significant.
We can then store only the significant genes using the logical output of deGlm: any gene flagged as significant (a non-zero value) is selected from the row names of dgeGlmTagDisp.
summary(deGlm <- decideTestsDGE(lrt, p=0.1, adjust = 'fdr'))
## group2
## Down 64
## NotSig 13390
## Up 159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
Next, we’ll create a smear plot (edgeR’s MA-style plot), which shows the logFC of every gene against its average abundance, with red points marking the differentially expressed genes. Good data should have an even distribution of these red points; a concentration of red points in one region would indicate a problem with the experiment or dataset.
plotSmear(lrt, de.tags=tagsGlm)
Next, we’re going to pull out the data whose false-discovery rate is less than 0.1, then write a csv with that data in it.
hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")
To show our data off in Pathview, we need to do some legwork to prep it and associate the genes with their annotations and pathways.
To start, we’ll set up our database, set up a dataframe version of ttGlm’s table, then map out symbols, entrezIDs, and gene names in a similar fashion to the microarray analysis:
columns(org.Mm.eg.db)
ttGlm2 <- as.data.frame(ttGlm$table)
ttGlm2$symbol = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
ttGlm2$entrez = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
ttGlm2$name = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="GENENAME",
keytype="ENSEMBL",
multiVals="first")
head(ttGlm2)
After loading our packages and datasets required for the analysis, we’ll create a separate variable for our fold changes and associate each logFC datapoint with ttGlm2’s entrezIDs.
library(pathview)
library(gage)
library(gageData)
library(imager)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
With that done, we’ll next associate genes with their annotation data, following the same sigmet approach we used for the microarray data. With our sigmet gene sets and foldchanges, we can then get our results (keggres).
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]
keggres = gage(foldchanges, gsets=kegg.mm.sigmet, same.dir=TRUE)
lapply(keggres, head)
## $greater
## p.geomean
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04350 TGF-beta signaling pathway 7.7e-03
## mmu04330 Notch signaling pathway 1.0e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04350 TGF-beta signaling pathway 2.5
## mmu04330 Notch signaling pathway 2.4
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## p.val q.val
## mmu03010 Ribosome 9.5e-05 0.022
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.230
## mmu04350 TGF-beta signaling pathway 7.7e-03 0.589
## mmu04330 Notch signaling pathway 1.0e-02 0.589
## mmu04390 Hippo signaling pathway 2.0e-02 0.809
## mmu00830 Retinol metabolism 2.1e-02 0.809
## set.size
## mmu03010 Ribosome 127
## mmu04550 Signaling pathways regulating pluripotency of stem cells 100
## mmu04350 TGF-beta signaling pathway 72
## mmu04330 Notch signaling pathway 51
## mmu04390 Hippo signaling pathway 125
## mmu00830 Retinol metabolism 37
## exp1
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04350 TGF-beta signaling pathway 7.7e-03
## mmu04330 Notch signaling pathway 1.0e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
##
## $less
## p.geomean stat.mean p.val q.val set.size
## mmu04145 Phagosome 0.0019 -2.9 0.0019 0.33 121
## mmu04714 Thermogenesis 0.0048 -2.6 0.0048 0.33 207
## mmu04110 Cell cycle 0.0051 -2.6 0.0051 0.33 110
## mmu04217 Necroptosis 0.0056 -2.6 0.0056 0.33 113
## mmu00910 Nitrogen metabolism 0.0087 -2.6 0.0087 0.41 13
## mmu00190 Oxidative phosphorylation 0.0118 -2.3 0.0118 0.41 125
## exp1
## mmu04145 Phagosome 0.0019
## mmu04714 Thermogenesis 0.0048
## mmu04110 Cell cycle 0.0051
## mmu04217 Necroptosis 0.0056
## mmu00910 Nitrogen metabolism 0.0087
## mmu00190 Oxidative phosphorylation 0.0118
##
## $stats
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04350 TGF-beta signaling pathway 2.5
## mmu04330 Notch signaling pathway 2.4
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## exp1
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04350 TGF-beta signaling pathway 2.5
## mmu04330 Notch signaling pathway 2.4
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
As before, we’ll separate the data into upregulated (greaters) and downregulated (lessers) genes:
greaters <- keggres$greater
lessers <- keggres$less
In a fashion similar to the microarray analysis, we’ll create two keggrespathways objects by making a dataframe with the upregulated and downregulated genes, respectively.
First, upregulated genes:
keggrespathways1 = data.frame(id = rownames(keggres$greater), keggres$greater)%>%
tibble::as_tibble() %>%
filter(row_number()<=5)%>%
.$id %>%
as.character()
keggrespathways1
## [1] "mmu03010 Ribosome"
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04350 TGF-beta signaling pathway"
## [4] "mmu04330 Notch signaling pathway"
## [5] "mmu04390 Hippo signaling pathway"
keggresidsgreater = substr(keggrespathways1, start = 1, stop = 8)
keggresidsgreater
## [1] "mmu03010" "mmu04550" "mmu04350" "mmu04330" "mmu04390"
Then, downregulated genes:
keggrespathways2 = data.frame(id = rownames(keggres$less), keggres$less)%>%
tibble::as_tibble() %>%
filter(row_number()<=5)%>%
.$id %>%
as.character()
keggrespathways2
## [1] "mmu04145 Phagosome" "mmu04714 Thermogenesis"
## [3] "mmu04110 Cell cycle" "mmu04217 Necroptosis"
## [5] "mmu00910 Nitrogen metabolism"
keggresidslesser = substr(keggrespathways2, start = 1, stop = 8)
keggresidslesser
## [1] "mmu04145" "mmu04714" "mmu04110" "mmu04217" "mmu00910"
Finally, we’ll define our pid function and plot our data using pathview and sapply.
First, the upregulated pathways:
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE)
tmp = sapply(keggresidsgreater, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE))
Then, the downregulated pathways:
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE)
tmp = sapply(keggresidslesser, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE))
Finally, we’ll print the diagrams into our markdown file like so:
file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"),
str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"), pattern='pathview.png', 'pathview.edger.png'))
edger_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.edger.png")
alledger_images <- lapply(edger_filenames, load.image)
dev.off()
knitr::include_graphics(edger_filenames)
For this analysis, we’ll need the following libraries loaded:
library("DESeq2")
library("dplyr")
library('apeglm')
Now we need to load our raw data, which comes in two files: counts and phenotype data.
countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("PHENO_DATA_Mouse.csv", sep= ",", row.names=1)
Our phenotype data needs factor levels to be analyzed by DESeq. In other words, if the phenotype is a yes/no observation, such as observing spots or no spots, DESeq needs each phenotype coded as its own level. For this, we will use 0 and 1:
colData$group <- factor(colData$group, levels = c("0","1"))
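A quick tabulation shows how many samples fall into each level:
table(colData$group)   # samples per factor level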
Below is the head of our countData. Note that the counts are not guaranteed to be integers, and our first column is the gene names.
head(countData)
## X Mmus_C57_6J_KDN_FLT_Rep1_M23 Mmus_C57_6J_KDN_FLT_Rep2_M24
## 1 ENSMUSG00000000001 2811 2742
## 2 ENSMUSG00000000003 0 0
## 3 ENSMUSG00000000028 59 49
## 4 ENSMUSG00000000031 26 19
## 5 ENSMUSG00000000037 20 26
## 6 ENSMUSG00000000049 0 1
## Mmus_C57_6J_KDN_FLT_Rep3_M25 Mmus_C57_6J_KDN_FLT_Rep4_M26
## 1 3578 3118
## 2 0 0
## 3 102 72
## 4 17 23
## 5 37 30
## 6 5 10
## Mmus_C57_6J_KDN_FLT_Rep5_M27 Mmus_C57_6J_KDN_FLT_Rep6_M28
## 1 3610 3102
## 2 0 0
## 3 80 84
## 4 26 18
## 5 22 20
## 6 0 4
## Mmus_C57_6J_KDN_GC_Rep1_M33 Mmus_C57_6J_KDN_GC_Rep2_M34
## 1 3094 3112
## 2 0 0
## 3 56 75
## 4 15 21
## 5 20 27
## 6 1 4
## Mmus_C57_6J_KDN_GC_Rep3_M35 Mmus_C57_6J_KDN_GC_Rep4_M36
## 1 3195 3158
## 2 0 0
## 3 60 64
## 4 21 25
## 5 8 20
## 6 10 2
## Mmus_C57_6J_KDN_GC_Rep5_M37 Mmus_C57_6J_KDN_GC_Rep6_M38
## 1 3339 2958
## 2 0 0
## 3 56 79
## 4 24 18
## 5 30 36
## 6 1 1
DESeq only allows integers, not fractional numbers, so we need to round our data for use with DESeq:
# Round up and assign the result back; the pipe alone would not modify countData
countData <- countData %>%
mutate_if(is.numeric, ceiling)
countData[, 2:13] <- sapply(countData[, 2:13], as.integer)
In addition, we want our row names to be the data stored in the first column, so we’ll run the following code to set the row names, then remove the first column as it is redundant:
row.names(countData) <- countData[, 1]
countData <- countData[2:13]
Finally, let’s check that the sample names in colData match the column names of countData, in the same order:
rownames(colData) == colnames(countData)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
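If you want a script to stop immediately on mismatched sample order rather than just printing TRUEs, the same check can be written as an assertion:
stopifnot(all(rownames(colData) == colnames(countData)))   # errors out if names or order differ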
With that, we can perform the analysis proper. To start, we will create a DESeq data set using countData, colData, and group as our design. Then, the second line filters out genes with consistently low counts, keeping only those with more than 2 counts in at least four samples. Finally, the DESeq function is called to perform the differential gene expression analysis, and the results function is called to extract the results. This is all accomplished in the below code chunk:
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- dds[rowSums(counts(dds)>2)>=4]
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
Next, we’ll use our differential gene expression data and lfcShrink to adjust our log fold change values, then reorder our results by the p-value significance:
resLFC <- lfcShrink(dds, coef=2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resOrdered <- res[order(res$padj), ]
summary(res)
##
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 325, 1.5%
## LFC < 0 (down) : 327, 1.5%
## outliers [1] : 15, 0.068%
## low counts [2] : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Let’s pull the log2 fold change data out of our res object into its own dataframe:
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_Vs_GC)
## res$log2FoldChange
## 1 0.0421
## 2 -0.1334
## 3 -0.0185
## 4 -0.0882
## 5 -0.0079
## 6 0.1136
Next, let’s create an MA plot, the same style of plot we made with plotSmear in the edgeR analysis. We’ll use the plotMA function to plot our resLFC object to see whether our distribution is relatively uniform, then save the plot to a pdf file.
plotMA(resLFC, ylim = c(-2, 2))
# To save the plot, open the pdf device, draw into it, then close the device
pdf(file = "MA_Plot_FLT_vs_GC.pdf")
plotMA(resLFC, ylim = c(-2, 2))
dev.off()
## png
## 2
Let’s also save our results to its own csv file:
write.csv(as.data.frame(resOrdered), file = "Mouse_DESeq.csv")
To plot our data as a PCA, we’ll need to do a custom transformation. We will start by using estimateSizeFactors() to compute normalization factors for our counts, then use SummarizedExperiment to hold the log2-transformed normalized data.
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) +1), colData = colData(dds))
With all of that set up, we can plot the data as a PCA:
plotPCA(DESeqTransform(se), intgroup='group')
We would want to see flight (0) and ground (1) each grouped together to show how tightly characterized the differential expression is on biological replicates. This is unfortunately not a great example of a PCA, which is frequently a problem with spaceflight data due to low numbers of replicates.
We’ll need to load a couple libraries to start with for this section:
library(AnnotationDbi)
library(org.Mm.eg.db)
Next, we’ll perform a series of steps that should now be very familiar. We are going to create a dataframe using our fold change data from the res object, then associate those fold changes with gene symbol, entrezID, and gene name. This is carried out below:
columns(org.Mm.eg.db)
foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
res$symbol = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = 'first')
res$entrez = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "ENTREZID",
keytype = "ENSEMBL",
multiVals = 'first')
res$name = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "GENENAME",
keytype = "ENSEMBL",
multiVals = 'first')
Let’s start by loading our packages, then create a separate variable for our fold changes, associating each log2FoldChange datapoint with the appropriate entrezID:
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
Next, we’ll prepare our sigmet data using the kegg.gsets mouse data using entrezIDs:
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]
With the sigmet data and foldchanges, we can use gage as before to create our keggres object:
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## mmu03010 Ribosome 0.0056 2.6 0.0056 0.9
## mmu04022 cGMP-PKG signaling pathway 0.0309 1.9 0.0309 0.9
## mmu04360 Axon guidance 0.0382 1.8 0.0382 0.9
## mmu04330 Notch signaling pathway 0.0518 1.6 0.0518 0.9
## mmu04658 Th1 and Th2 cell differentiation 0.0538 1.6 0.0538 0.9
## mmu04350 TGF-beta signaling pathway 0.0544 1.6 0.0544 0.9
## set.size exp1
## mmu03010 Ribosome 133 0.0056
## mmu04022 cGMP-PKG signaling pathway 153 0.0309
## mmu04360 Axon guidance 176 0.0382
## mmu04330 Notch signaling pathway 55 0.0518
## mmu04658 Th1 and Th2 cell differentiation 78 0.0538
## mmu04350 TGF-beta signaling pathway 87 0.0544
##
## $less
## p.geomean stat.mean p.val
## mmu04657 IL-17 signaling pathway 0.011 -2.3 0.011
## mmu04110 Cell cycle 0.020 -2.1 0.020
## mmu04145 Phagosome 0.027 -1.9 0.027
## mmu04621 NOD-like receptor signaling pathway 0.042 -1.7 0.042
## mmu04625 C-type lectin receptor signaling pathway 0.044 -1.7 0.044
## mmu04115 p53 signaling pathway 0.048 -1.7 0.048
## q.val set.size exp1
## mmu04657 IL-17 signaling pathway 0.9 75 0.011
## mmu04110 Cell cycle 0.9 121 0.020
## mmu04145 Phagosome 0.9 145 0.027
## mmu04621 NOD-like receptor signaling pathway 0.9 151 0.042
## mmu04625 C-type lectin receptor signaling pathway 0.9 99 0.044
## mmu04115 p53 signaling pathway 0.9 71 0.048
##
## $stats
## stat.mean exp1
## mmu03010 Ribosome 2.6 2.6
## mmu04022 cGMP-PKG signaling pathway 1.9 1.9
## mmu04360 Axon guidance 1.8 1.8
## mmu04330 Notch signaling pathway 1.6 1.6
## mmu04658 Th1 and Th2 cell differentiation 1.6 1.6
## mmu04350 TGF-beta signaling pathway 1.6 1.6
Next, we’ll combine a bunch of previous steps into one large chunk, as you should be very familiar with this process by now. Below, we’ll separate our keggres object into upregulated (greaters) and downregulated (lessers) objects and create new objects for their top pathways:
greaters <- keggres$greater
lessers <- keggres$less
keggresgreaterpathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tibble::as_tibble() %>%
filter(row_number()<=3) %>%
.$id %>%
as.character()
keggresidsg <- substr(keggresgreaterpathways, start=1, stop=8)
keggreslesserpathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tibble::as_tibble() %>%
filter(row_number()<=3) %>%
.$id %>%
as.character()
keggresidsl <- substr(keggreslesserpathways, start=1, stop=8)
Finally, we can plot our data with pathview as we have in the previous two analyses:
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse', new.signature = FALSE)
tmp = sapply(keggresidsg, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse'))
tmp = sapply(keggresidsl, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse'))
And here, we can use this chunk to insert our pathview images:
file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"),
str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"), pattern='pathview.png', 'pathview.deseq.png'))
deseq_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.deseq.png")
alldeseq_images <- lapply(deseq_filenames, load.image)
dev.off()
knitr::include_graphics(deseq_filenames)
EdgeR and DESeq are two very different pipelines, so it is worthwhile to compare both methodologies on one dataset for a broader perspective on biological interpretation. As always, we’ll begin by loading any libraries needed for this exercise and importing our relevant datasets:
library(tidyverse)
library(GOplot)
EdgeR <- read.csv('Mouse_EdgeR_Results_Zero_vs_1.csv')
DESeq <- read.csv('Mouse_DESeq.csv')
We never originally trimmed the Mouse_DESeq.csv file down to only the significant genes, so we’ll have to do that before we can compare the two datasets. This can be done by creating a new object that filters DESeq to an adjusted p-value (padj) < 0.1. This is done below:
DESeq2 <- DESeq %>%
filter(padj < 0.1)
Next, we want to keep only the gene names and log fold change data. To do so, we’ll create new objects containing just the relevant columns:
DESeqTrim <- DESeq2[,c(1,3)]
EdgeRTrim <- EdgeR[, 1:2]
Though we have the appropriate data in our objects, the names of each column will need to be changed for GOVenn (part of GOplot) to use them. The specific names must be ID and logFC, which we set below:
colnames(DESeqTrim) <- c("ID", "logFC")
colnames(EdgeRTrim) <- c("ID", "logFC")
With our data formatted and trimmed, we can now use the GOVenn function to compare which genes were considered differentially expressed by each package. Note that in the below code chunk, we set plot to FALSE. GOVenn has this set to TRUE by default, in which case it draws the plot directly and does not return the underlying table. By setting it to FALSE, we get an object holding both the plot and the table.
comp <- GOVenn(DESeqTrim, EdgeRTrim, label = c("DESeq", "EdgeR"), title = "Comparison of DESeq and EdgeR for Differentially Expressed Genes", plot = FALSE)
comp$plot
As you can see in the diagram, 122 upregulated and 61 downregulated genes are shared between both methods. In addition, EdgeR is much more conservative with what it considers significantly differentially expressed, as it only has 37 and 3 genes uniquely upregulated or downregulated respectively.
In a biological context, the shared genes are the most likely candidates for being biologically significant in the context of differential expression in spaceflight, but that doesn’t completely invalidate the genes unique to either DESeq or EdgeR. These genes are still flagged by one of the methods, so it may be worth reanalyzing them with other approaches, depending on their genetic context.
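Because we set plot = FALSE, comp also carries the underlying table of shared and unique genes alongside the plot. The exact structure can vary by GOplot version, so a safe way to inspect what came back is:
str(comp$table, max.level = 1)   # overview of the shared/unique gene tables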
In this section, I’ll show how to properly perform a multiple protein alignment using the msa (multiple sequence alignment) package. We’ll be using the hglobin.fa data for this exercise.
library(msa)
seq <- readAAStringSet("hglobin.fa")
Next, we’ll call msa to use the ClustalW algorithm to generate alignments for our amino acid sequences, then print the sequences with their alignments as a pdf via msaPrettyPrint(). Note the arguments: output sets the file format, showNames places the sequence names on the left, showLogo controls whether a sequence logo is drawn, askForOverwrite suppresses the prompt to overwrite existing files, verbose prints progress, and file names the output pdf:
alignments <- msa(seq, method='ClustalW')
msaPrettyPrint(alignments, output='pdf', showNames='left',
showLogo='none', askForOverwrite=FALSE,
verbose=TRUE, file='whole_align.pdf')
We can also restrict the diagram to specific amino acids. To do this, before our output argument, we pass a vector of the first and last alignment positions we want included, in order. Below, we’ll generate another diagram with only positions 10 through 30:
msaPrettyPrint(alignments, c(10,30), output='pdf', showNames='left',
file='Zoomed_align.pdf', showLogo='top', askForOverwrite=FALSE,
verbose=TRUE)
If we want to make a tree using our alignments, we’ll have to add on a few more libraries, which we’ll load below:
library(ape)
library(seqinr)
Next, we need to convert our alignments to a seqinr data type via msaConvert(). This will allow seqinr to properly interface with our data.
alignment_seqinr <- msaConvert(alignments, type='seqinr::alignment')
With the data converted, we need to calculate the distances on our tree using the alignment_seqinr data, then use ape to perform a neighbor-joining tree-estimation process. With our distances and neighbor-joining set up, we will finally plot the tree. This whole process is outlined below:
distances1 <- seqinr::dist.alignment(alignment_seqinr, 'identity')
tree <- ape::nj(distances1)
plot(tree, main='Phylogenetic Tree of HBA Sequences')
Notably, our results imply some peculiar conclusions because the dataset is quite limited. Since we are looking at a single hemoglobin gene across these animals, humans and gorillas come out as more closely related to porpoises than to capuchins. In a broad sense, porpoises aren’t closer to humans and gorillas than capuchins are, but on strictly this single gene, they actually are.
Synteny is the concept that features such as gene order and blocks of genes inherited from common ancestors are conserved through evolution. In this module, we’re going to look at genome synteny by running a script to compare regions of synteny between genomes.
As always, we’ll begin by loading the necessary library and our data, plastid_genomes.fa:
library(DECIPHER)
long_seq <- readDNAStringSet('plastid_genomes.fa')
Note that long_seq is a DNAStringSet object.
Next, we’re going to build a temporary SQLite database to use the data with this package. By doing so, we’ll be able to accomplish a number of analyses using this data.
Seqs2DB(long_seq, 'XStringSet', 'long_db', names(long_seq))
## Adding 5 sequences to the database.
##
## Added 5 new sequences to table Seqs.
## 45 total sequences in table Seqs.
## Time difference of 0.21 secs
With this database established, we can find regions of synteny with FindSynteny(), visualize them with pairs() and plot(), align the syntenic blocks with AlignSynteny(), and finally extract the first set of aligned blocks and write them out to a FASTA file:
synteny <- FindSynteny('long_db')
pairs(synteny)
plot(synteny)
alignment <- AlignSynteny(synteny, 'long_db')
block <- unlist(alignment[[1]])
writeXStringSet(block, 'genome_blocks_out.fa')
With this output file, the dashes show sections with no alignment, while characters show conserved bases.
# Unannotated Gene Regions
Oftentimes, in more novel or less commonly used genomes, annotations may not be available. In this section, we’ll look at how to map transcripts to unannotated regions. We’ll start with loading our libraries:
library(locfit)
library(Rsamtools)
Next, we’ll need to create a function that will load the gene region’s information in a GFF file, then convert it into a bioconductor GRanges object. This is outlined below:
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff,'GRanges')
}
With our function set, we can move on to getting counts from 500bp windows across the genome. To do so, we’ll use csaw::windowCounts(), which counts the reads falling into each window.
csaw::readParam() will then be used to set which reads qualify for counting. Of the arguments included, minq sets the minimum mapping-quality score for a read to count, dedup = TRUE removes reads marked as duplicates, and pe = 'both' tells csaw the data is paired-end and to use both reads of each pair.
whole_genome <- csaw::windowCounts(
file.path('windows.bam'),
bin=TRUE,
filter=0,
width=500,
param=csaw::readParam(
minq=20,
dedup=TRUE,
pe='both'
)
)
Since the result has a single sample column, we’ll give it a descriptive name:
colnames(whole_genome) <- c('small_data')
Before going any further, let’s also establish a separate object for our annotated regions using the function we defined earlier:
annotated_regions <- get_annotated_regions_from_gff(file.path('genes.gff'))
Now that we have our windows of high expression, we need to check if any of those windows overlap with the annotated regions we defined. To accomplish this, we’ll import another two libraries we’ll need:
library(IRanges)
library(SummarizedExperiment)
Now, to actually find those overlaps, we’ll run IRanges::overlapsAny on SummarizedExperiment::rowRanges of our whole_genome data. This creates a logical vector storing TRUE/FALSE for whether each window overlaps an annotated region.
windows_in_genes <- IRanges::overlapsAny(
SummarizedExperiment::rowRanges(whole_genome),
annotated_regions
)
We can then subset the whole_genome object into annotated and unannotated regions using our windows_in_genes object:
annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]
assay() can then be used to extract the actual counts from the non_annotated_window_counts object:
assay(non_annotated_window_counts)
## small_data
## [1,] 0
## [2,] 0
## [3,] 24
## [4,] 25
## [5,] 0
## [6,] 0
Now, we can use Rsamtools’ pileup() function to get a per-base coverage dataframe in which each row represents a single nucleotide position in the reference, and the count column gives the depth of coverage at that point. We’ll also need to load a new library called bumphunter for a later step in this process:
library(bumphunter)
pile_df <- Rsamtools::pileup(file.path('windows.bam'))
In this step, we’ll group reads within a certain distance of each other into clusters. We pass the sequence names, the positions, and the maximum gap allowed between positions in the same cluster:
clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap=100)
table(clusters)
## clusters
## 1 2 3
## 1486 1552 1520
Finally, we’ll map the reads to the regions we found for the genome:
bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff=1)
## getSegments: segmenting
## getSegments: splitting
## chr start end value area cluster indexStart indexEnd L clusterL
## 3 Chr1 4503 5500 10.4 15811 3 3039 4558 1520 1520
## 1 Chr1 502 1500 10.0 14839 1 1 1486 1486 1486
## 2 Chr1 2501 3500 8.7 13436 2 1487 3038 1552 1552
As you can see in the table, we’re given the chromosomal location, start and end of the region, and other useful data. With that, we have fully completed our mapping of the unannotated region of this data.
In this section, we’ll take a look at creating nice-looking phylogenetic trees with treeio and ggplot. First, we’ll load our data and required packages:
library(ggplot2)
library(ggtree)
library(treeio)
library(BAMMtools)
itol <- ape::read.tree('itol.nwk')
Note that our data is in Newick format, so we had to use ape::read.tree.
With our raw data, we can print a very basic phylogenetic tree with it:
ggtree(itol)
While this tree certainly exists, options and parameters can be specified to make it look much better.
First, let’s look at setting the tree up as a circular diagram:
ggtree(itol, layout='circular')
We can also change the left/right & up/down orientations of the tree. Below, we’ll flip the original tree vertically and rotate it horizontally:
ggtree(itol) + coord_flip() + scale_x_reverse()
While changing the appearance and organization of the tree is handy, names would help in giving context. We can add them by appending geom_tiplab() to ggtree, as shown in the below code chunk:
ggtree(itol) + geom_tiplab(color = 'aquamarine4', size = 2)
As you can see, the labels are overlapping and hard to read. To help remedy this, in the next step we will reduce the size of our labels and make the diagram circular:
ggtree(itol, layout='circular') + geom_tiplab(color = 'blueviolet', size = 1)
Having the names looks nice, but we can also annotate specific clades or families in our diagram using geom_strip(). The taxa numbers were chosen arbitrarily for the below example:
ggtree(itol) + geom_strip(13, 14, color = 'chartreuse', barsize = 1)
Similarly, we can use geom_hilight() to highlight specific groups in unrooted trees. For this, we use nodes instead of specific taxa, which was chosen arbitrarily for the below example:
ggtree(itol, layout='unrooted') + geom_hilight(node = 11, type = 'encircle', fill = 'steelblue')
Note the type argument, which determines the shape that the highlight will take.
Now, with the knowledge of how to annotate and highlight using nodes and taxa, it’s important to consider how to find specific nodes or taxa without scouring the dataset itself. One way to do this is the getMRCA() function, which returns the node number of the most recent common ancestor of a set of tip names. An example is given below, in which we use getMRCA() to find a node, then highlight a section of our tree using it:
nodeMRCA <- getMRCA(itol, tip = c('Photorhabdus_luminescens', 'Blochmannia_floridanus'))
ggtree(itol, layout='unrooted') + geom_hilight(node = nodeMRCA, type = 'encircle', fill = 'indianred')
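As a complement to getMRCA(), you can also display every node number directly on the tree; a quick sketch, relying on ggtree exposing a node column in its plot data (small text to limit overlap):
ggtree(itol) + geom_text(aes(label = node), size = 2, hjust = -0.2)   # label each node with its number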
In this section, we’re going to quantify differences between separate phylogenetic trees using treespace. Analyzing differences like this gives scientists the opportunity to understand how different genes progress through an evolutionary line.
Let’s begin with our packages and loading our data. Note that we’ll be loading our tree files first as a list, then reading all of the trees into the tree_list object, then converting it into a multiPhylo object. multiPhylo objects are required for treespace to work with our data.
library(ape)
library(adegraphics)
library(treespace)
treefiles <- list.files(file.path(getwd(),'gene_trees'), full.names=TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list) <- 'multiPhylo'
Next, we’ll compute the Kendall-Colijn distances between the trees. To do so, we’ll use the treespace function with our tree_list data and the number of principal components retained (nf) set to 3. The Kendall-Colijn metric used here is particularly suitable for rooted trees, which is what we have.
When running this function, treespace will run a pairwise comparison of all trees in the input, ordinate the distances, and return the results as a list of objects where $D contains the pairwise metric between trees and $pco contains the principal coordinates analysis.
comparisons <- treespace(tree_list, nf=3)
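Before plotting, you can peek at the raw pairwise distances; $D is a dist object, so it converts cleanly to a matrix (assuming at least four trees were loaded):
round(as.matrix(comparisons$D)[1:4, 1:4], 2)   # distances between the first four trees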
With this stage of the analysis complete, we can now plot the pairwise distances between trees with the table.image() function. Note nclass in the below code chunk where we plot the data; this defines how many shades are used in the heatmap of our pairwise comparisons. A greater nclass improves resolution, but the shades become much harder to tell apart by eye.
table.image(comparisons$D, nclass= 25)
Finally, we’ll plot the ordination and clusters using plotGroves(). This will show us how the group of trees clusters. First, we’ll use plotGroves on our comparisons object’s $pco element to plot the points alone by their distance from one another.
plotGroves(comparisons$pco, lab.show=TRUE, lab.cex=1.5)
Note the lab.show and lab.cex options. Setting lab.show to TRUE prints the labels for each point, and lab.cex adjusts the size of those labels.
Now that we have seen the general distribution of points, let’s create “groves”, or clusters of our trees given a certain distance. To do so, we’ll create a new object using findGroves() on our comparisons object with a nclust value of 5:
groves <- findGroves(comparisons, nclust=5)
plotGroves(groves)
nclust values define how many clusters the function should try to fit our data into. With some datasets, it’s worth playing around with the number of clusters to minimize odd shapes, unintended relationships, and grove overlap. Below is an example where I’ve reduced the nclust value to 2:
groves <- findGroves(comparisons, nclust=2)
plotGroves(groves)
As you can see, with fewer groves total, the function incorporates a number of previously separate trees into larger groves. Below is what I feel to be the ‘sweet spot’ for this data set, with nclust set to 3:
groves <- findGroves(comparisons, nclust=3)
plotGroves(groves)
In some settings, you may not need to use an entire phylogenetic tree like we did in our Treeio section. By creating subtrees from larger trees, you can minimize the quantity of data you are using, making the processing of your file faster and your diagrams significantly more readable. All of this can be done using one package called ape, so we’ll load that along with our Newick tree data:
library(ape)
newick <- read.tree('mammal_tree.nwk')
l <- subtrees(newick)
Note the usage of the subtrees() function. This function is self-explanatory and does, indeed, separate our data into subtrees. Let’s plot the raw data to see how it looks:
plot(newick)
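Each element of l is itself a phylo object, one per internal node of the original tree; checking the length shows how many subtrees are available to plot:
length(l)   # number of subtrees (one per internal node)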
To subset our collection of subtrees (stored as the object l), we can provide the plot() function additional information. We’ll start by specifying that we want to plot element [[4]] of l, with the subtitle “Node 4”:
plot(l[[4]], sub='Node 4')
Recall that for general usage, getMRCA can be very useful to locate specific nodes that you might want to focus in on.
We can also extract subtrees manually with the extract.clade() function, specifying the dataset to use and the node we want to ‘cut’ the tree at.
small_tree <- extract.clade(newick, 9)
plot(small_tree)
We can even bind two separate trees together. For the sake of an example, we’re going to extract another arbitrarily selected section of our first tree, then bind it with the tree we extracted above using bind.tree(). Note the final number of bind.tree(), which specifies the node that the two trees are joined at:
small_tree_2 <- extract.clade(newick, 10)
new_tree <- bind.tree(small_tree_2, small_tree, 3)
plot(new_tree)
Of course, these example trees are a scientific mess, but they are hopefully a good proof of concept for how to subset and bind trees to better manipulate tree data.
In some cases, you may be provided only with sequences and no pre-made trees. If you need to construct your own, this section will cover that. Let’s begin as usual by loading our necessary packages and sequence data:
library(Biostrings)
library(msa)
library(phangorn)
seqs <- readAAStringSet('abc.fa')
Next, we’ll construct the alignment data using msa and ClustalOmega, which we have seen and used before in our Multiple Protein Alignment section. In addition, we’ll need to convert our alignment data into a phyDat object. The whole process is outlined in the below chunk:
aln <- msa::msa(seqs, method=c('ClustalOmega'))
aln <- as.phyDat(aln, type='AA')
Note that type is set to ‘AA’. This stands for amino acid and defines the format of the data for as.phyDat to properly convert.
With all of our data now prepped, we’ll get started on the tree-making process. Trees are made from a distance matrix, which we’ll compute for our data with dist.ml(). Note that the ml in the function name stands for maximum likelihood.
dist_mat <- dist.ml(aln)
Next, we pass the distance matrix to the upgma() function to make a UPGMA (unweighted pair group method with arithmetic mean) tree:
upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main='UPGMA')
Conversely, we can pass the distance matrix to a neighbor-joining function:
nj_tree <- NJ(dist_mat)
plot(nj_tree, main='NJ')
As you can see, both methods produce fairly different trees. Depending on the type of data you’re using and the goals of the tree you’re producing, one of these methods will fit. Typically, longer sequences and more data can eliminate some differences between these methods as the core data provides higher specificity.
Finally, we will use bootstrap.phyDat() to compute bootstrap support for the branches of the tree. The first argument is the object (aln), and the second is the function. Bootstraps are essentially confidence measures for how each clade is placed on a given tree.
To create and plot these, we’ll use bootstrap.phyDat with our aln data, a custom function that builds a neighbor-joining tree from dist.ml distances, and a bootstrap count of 100. This can be seen in the below code chunk:
bootstraps <- bootstrap.phyDat(aln, FUN = function(x){NJ(dist.ml(x))}, bs=100)
plotBS(nj_tree, bootstraps, p = 10)
The numbers you can see on this tree give the confidence that a given join is located in the specified place on your tree. Since we used 100 replicates, they read much like percentages: 100 means the grouping appeared in every bootstrap replicate, 83 means it appeared in 83% of them, and so on.
As this tutorial was created as part of a course at Louisiana Tech University and I am a graduate student, I am tasked with performing an additional RNASeq analysis. For this analysis, I will be using DESeq to evaluate differential expression in data from the NASA GeneLab, specifically GLDS-38. In this experiment, plants from five plates were pooled and used as genetic material samples, with spaceflight plants preserved using RNAlater.
To begin this analysis, we’ll begin as always with our relevant packages to ensure all we’ll need is ready to go:
library("DESeq2")
library("dplyr")
library('apeglm')
Much like our previous DESeq exercise, we’ll load the data from two files: counts and the flight/ground grouping.
countData <- read.csv("GLDS-38_rna_seq_Unnormalized_Counts_short.csv")
colData <- read.csv("Flight_DATA.csv", sep= ",", row.names=1)
Just as in our DESeq example before, we need to establish levels for our Flight_DATA.csv file to define flight and ground groups. We will be using 0 and 1 as the levels for this factor:
colData$group <- factor(colData$group, levels = c("0","1"))
As we know, DESeq only allows integers, not fractional numbers, so we need to round our data:
# Round up and assign the result back; the pipe alone would not modify countData
countData <- countData %>%
mutate_if(is.numeric, ceiling)
countData[, 2:7] <- sapply(countData[, 2:7], as.integer)
Next, we’ll use the below code to set the row names to be equal to our first column, then remove the redundant first column:
row.names(countData) <- countData[, 1]
countData <- countData[2:7]
Finally, let’s check that the sample names in colData match the column names of countData, in the same order:
rownames(colData) == colnames(countData)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
With that, we can perform the analysis proper. To start, we will create a DESeq data set using countData, colData, and group as our design. Then, the second line filters out genes with consistently low counts, keeping only those with more than 2 counts in at least four samples. Finally, the DESeq function is called to perform the differential gene expression analysis, and the results function is called to extract the results. This is all accomplished in the below code chunk:
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- dds[rowSums(counts(dds)>2)>=4]
dds <- DESeq(dds)
res <- results(dds)
Next, we’ll use our differential gene expression data and lfcShrink to adjust our log fold change values, then reorder our results by the p-value significance:
resLFC <- lfcShrink(dds, coef=2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resOrdered <- res[order(res$padj), ]
summary(res)
##
## out of 25687 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2293, 8.9%
## LFC < 0 (down) : 2754, 11%
## outliers [1] : 19, 0.074%
## low counts [2] : 3486, 14%
## (mean count < 15)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Next, we’ll pull the log2 fold change data out of res and store it as its own object:
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_Vs_GC)
## res$log2FoldChange
## 1 0.5557
## 2 -0.3808
## 3 -0.3554
## 4 0.1410
## 5 0.0056
## 6 -0.4582
Now, we’ll create an MA plot to check our distribution. As we can see, the plot shows a relatively uniform distribution with fairly high fold changes observed.
plotMA(resLFC, ylim = c(-4, 4))
# To save the plot, open the pdf device, draw into it, then close the device
pdf(file = "MA_Plot_FLT_vs_GC_Grad.pdf")
plotMA(resLFC, ylim = c(-4, 4))
dev.off()
Let’s also save our results to its own csv file:
write.csv(as.data.frame(resOrdered), file = "ArabidThali_DESeq.csv")
Let’s also perform a Principal Component Analysis (PCA) for this data. We’ll begin by using estimateSizeFactors(), then SummarizedExperiment to normalize our data:
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) +1), colData = colData(dds))
With all of that set up, we can plot the data as a PCA:
plotPCA(DESeqTransform(se), intgroup='group')
We would want to see flight (0) and ground (1) each grouped together to show how tightly characterized the differential expression is across biological replicates. Due to the low number of replicates, it’s difficult to make a call on how well grouped our data is here, but unlike in the mouse analysis, points from different groups at least do not sit near one another. The general trend of blue points towards the upper areas of the graph and red points towards the lower areas suggests this data may be decently characterized.
In setting up for Pathview, we’ll need to load a package and our Arabidopsis thaliana database:
library(AnnotationDbi)
library(org.At.tair.db)
Next, we’ll perform the same series of steps we have previously with the other RNASeq and Microarray analyses, except using TAIR instead of ENTREZ. We are going to create a dataframe using our fold change data from the res object, then associate those fold changes with gene symbol, TAIR, and gene name. This is carried out below:
columns(org.At.tair.db)
foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
res$symbol = mapIds(org.At.tair.db,
keys = row.names(res),
column = "SYMBOL",
keytype = "TAIR",
multiVals = 'first')
res$entrez = mapIds(org.At.tair.db,
keys = row.names(res),
column = "ENTREZID",
keytype = "TAIR",
multiVals = 'first')
res$name = mapIds(org.At.tair.db,
keys = row.names(res),
column = "GENENAME",
keytype = "TAIR",
multiVals = 'first')
Let’s start by loading our packages, then create a separate variable for our fold changes, associating each log2FoldChange datapoint with the appropriate entrezID:
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
Next, we’ll take a number of steps to set up our foldchanges with their appropriate pathways via the KEGG database. In the first step, we set up our sigmet data using kegg.gsets for Arabidopsis thaliana’s entrezIDs. Next, we use gage to map the KEGG pathways to our foldchanges data. Lastly, we separate the upregulated (greaters) and downregulated (lessers) pathways into their own objects and write csv files for them for other usage.
kegg.ath = kegg.gsets('ath', id.type = 'entrez')
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean
## ath00940 Phenylpropanoid biosynthesis 5.7e-06 4.5
## ath02010 ABC transporters 9.2e-03 2.4
## ath00040 Pentose and glucuronate interconversions 1.6e-02 2.2
## ath00966 Glucosinolate biosynthesis 3.6e-02 1.8
## ath00908 Zeatin biosynthesis 4.7e-02 1.7
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis 1.1e-01 1.2
## p.val q.val set.size
## ath00940 Phenylpropanoid biosynthesis 5.7e-06 0.00067 126
## ath02010 ABC transporters 9.2e-03 0.53682 32
## ath00040 Pentose and glucuronate interconversions 1.6e-02 0.61979 101
## ath00966 Glucosinolate biosynthesis 3.6e-02 1.00000 26
## ath00908 Zeatin biosynthesis 4.7e-02 1.00000 32
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis 1.1e-01 1.00000 23
## exp1
## ath00940 Phenylpropanoid biosynthesis 5.7e-06
## ath02010 ABC transporters 9.2e-03
## ath00040 Pentose and glucuronate interconversions 1.6e-02
## ath00966 Glucosinolate biosynthesis 3.6e-02
## ath00908 Zeatin biosynthesis 4.7e-02
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis 1.1e-01
##
## $less
## p.geomean stat.mean p.val
## ath03010 Ribosome 6.0e-25 -10.9 6.0e-25
## ath00195 Photosynthesis 3.0e-10 -7.1 3.0e-10
## ath01240 Biosynthesis of cofactors 4.4e-09 -5.9 4.4e-09
## ath01230 Biosynthesis of amino acids 6.0e-07 -4.9 6.0e-07
## ath03030 DNA replication 6.1e-07 -5.3 6.1e-07
## ath00860 Porphyrin and chlorophyll metabolism 1.1e-05 -4.5 1.1e-05
## q.val set.size exp1
## ath03010 Ribosome 7.0e-23 308 6.0e-25
## ath00195 Photosynthesis 1.7e-08 46 3.0e-10
## ath01240 Biosynthesis of cofactors 1.7e-07 234 4.4e-09
## ath01230 Biosynthesis of amino acids 1.4e-05 240 6.0e-07
## ath03030 DNA replication 1.4e-05 46 6.1e-07
## ath00860 Porphyrin and chlorophyll metabolism 2.2e-04 51 1.1e-05
##
## $stats
## stat.mean exp1
## ath00940 Phenylpropanoid biosynthesis 4.5 4.5
## ath02010 ABC transporters 2.4 2.4
## ath00040 Pentose and glucuronate interconversions 2.2 2.2
## ath00966 Glucosinolate biosynthesis 1.8 1.8
## ath00908 Zeatin biosynthesis 1.7 1.7
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis 1.2 1.2
greaters <- keggres$greater
lessers <- keggres$less
write.csv(greaters, file = 'ArabidThali_pathview_Greater_Pathways.csv')
write.csv(lessers, file = 'ArabidThali_pathview_Lesser_Pathways.csv')
Lastly, we’ll combine a bunch of previous steps into one large chunk as before. Below, we’ll create new objects for the pathways of keggres for both the upregulated and downregulated parts of the dataset, then substring them to keep just the pathway IDs.
keggresgreaterpathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tibble::as_tibble() %>%
filter(row_number()<=3) %>%
.$id %>%
as.character()
keggresidsg <- substr(keggresgreaterpathways, start=1, stop=8)
keggreslesserpathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tibble::as_tibble() %>%
filter(row_number()<=3) %>%
.$id %>%
as.character()
keggresidsl <- substr(keggreslesserpathways, start=1, stop=8)
Finally, we can plot our data with pathview as we have in the previous two analyses:
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'ath', new.signature = FALSE)
tmp = sapply(keggresidsg, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'ath'))
tmp = sapply(keggresidsl, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'ath'))
And here, we can use this chunk to insert our pathview images:
file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"),
str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"), pattern='pathview.png', 'pathview.grad.png'))
grad_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.grad.png")
allgrad_images <- lapply(grad_filenames, load.image)
dev.off()
knitr::include_graphics(grad_filenames)
R is an incredibly useful, diverse, and approachable programming language that can perform the much-needed statistical analyses required for many biological studies. With genetics and personalized medicine on the rise and data generation hitting huge rates, the need for these analyses is greater than ever. With R and the right people, this data can be wrangled, contained, analyzed, and turned into usable data products for research, informatics, and more.