Introduction and Basics

This document covers how to use R Markdown in RStudio to create reproducible research.

In R Markdown, you can type long passages and descriptions without prefixing every line with a # comment marker, as you would in a plain R script. In this first example, we will use the ToothGrowth dataset, from an experiment in which guinea pigs were given different doses of vitamin C to observe the effect on tooth growth.

To run R in a markdown file, you need to mark the section that should be treated as R code. This is a code chunk. A code chunk opens with ```{r} on its first line and closes with ``` on its last.
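Putting those pieces together, an empty chunk looks like this in the source file:

```{r}
# R code goes here
```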

Below is an example code chunk:

Toothdata <- ToothGrowth

head(Toothdata)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

The play button on code chunks allows you to run all code within the chunk. Results are printed inline in the markdown file.

The abline() function adds a straight line to an existing plot; given a fitted linear model, it draws the regression line (intercept a, slope b):

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(fit)  # reuse the model fitted above

The slope of the regression line is 9.7635714.

You can insert variable values directly into the text, as above, using inline code: a backtick, the letter r, a space, the expression, and a closing backtick. To pull a single element from a vector, index it with brackets.

The syntax for the indexing is: variable[position].
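For example, the slope sentence above can be written in the source as follows (b holds the model coefficients from the chunk above, and b[2] is the slope):

The slope of the regression line is `r b[2]`.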

Section Headers

You can put section headers into the final document by placing one or more # characters before the desired header text; the number of # characters sets the header level.

# First level header

## Second level header

### Third level header

Headers require a space after the # or the line will not be formatted as a header. Additionally, you can append {.unlisted .unnumbered} to a header to keep it out of the table of contents, which I’ll go over later.
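As a sketch, an appendix header kept out of the numbering and table of contents might look like this:

## Appendix {.unlisted .unnumbered}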

Other Text Formatting Options

Tabs

You can also add tabs to the report. To do this, place {.tabset} after the header of the section that should hold the tabs; every subsequent header one level deeper becomes a new tab.
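A minimal sketch (the section names are just illustrations): here, each third-level header under Results becomes its own tab.

## Results {.tabset}

### Plots

### Tables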

Bullet Points

Below are some example bullet points. Dashes (-) followed by a space and the desired characters are used to create bullet points, and indentation produces sub-bullets.

- item one
- item two
- item three
  - subitem one
  - subitem two
  - subitem three

Block Quotes

Block quotes can be added using >, followed by a space and the desired quote.

> “Genes are like the story, and DNA is the language that the story is written in.”
>
> — Sam Kean

Formulas

You can also make nice-looking formulas by enclosing the text in two dollar signs. Formulas use LaTeX syntax: for example, ^ creates exponents, and inside matrix environments the ampersand (&) separates columns.

Hardy-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\]

You can also get more complex: a backslash followed by a symbol name (such as \theta or \alpha) prints that symbol in the formula, and \\ starts a new line within an equation.

\[\theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

Knitr and Interpreting the Code Chunk

Echo and Eval

After the r in ```{r}, multiple arguments can be used to adjust how a code chunk is evaluated, if at all. echo= governs whether or not the chunk's code is printed inline in the output file; by default it is set to TRUE. Below is a chunk that has echo= set to TRUE and eval= set to FALSE.

Echo set to true and eval set to false:

print('Hello World')
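In the source file, the chunk above is written with its options inside the header:

```{r, echo=TRUE, eval=FALSE}
print('Hello World')
```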

eval= governs whether or not a code chunk is actually run. Below is a chunk that has echo= set to FALSE and eval= set to TRUE.

Echo set to false and eval set to true:

## [1] "Hello World"

Other Arguments

In addition to eval= and echo=, the following are other arguments and their functions:

- cache: if enabled, an unchanged code chunk will not be re-evaluated the next time the document is knit; the stored results from the first run are reused. Good for code with long run times.
- fig.width and fig.height: the size (in inches) of the R plots as drawn by the graphical device. The figures are written into the knitted document, and also to files that are saved separately.
- out.width and out.height: the size at which the R plots are displayed in the output document itself.
- fig.cap: the text for the figure caption.
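As a sketch combining several of these options with the ToothGrowth plot from earlier (the sizes and caption text are just illustrations):

```{r, fig.width=6, fig.height=4, fig.cap="Tooth length by vitamin C dose"}
plot(len ~ dose, data = Toothdata)
```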

Code Folding

You can also use the code_folding option in the YAML header (the metadata block at the very top of the document) to allow the reader to toggle between displaying and hiding the code. This is done by setting code_folding to hide or show; both options allow the toggle, but each sets the default state. Below is the YAML header for this document, as an example of this and other arguments:

title: "R Markdown in R Studio"
author: "Zachary Baham"
date: "2022-05-10"
output:
  html_document:
    code_folding: show
    toc: true
    toc_float: true
    theme: simplex
    highlight: monochrome

Tables of Contents

You can also add a table of contents to a document using the toc and toc_float arguments, as shown in the Code Folding tab. Pasted below is the same YAML header; note that the toc and toc_float arguments are set to true. This gives us a floating table of contents on the right-hand side of the document.

title: "R Markdown in R Studio"
author: "Zachary Baham"
date: "2022-05-10"
output:
  html_document:
    code_folding: show
    toc: true
    toc_float: true
    theme: simplex
    highlight: monochrome

Themes and Highlights

Themes and highlights can be specified as part of the YAML code as well, allowing you to add a splash of color and some fitting text formatting for your report.

Theme Options

Themes broadly change your highlight, hyperlink, and header colors and formats, and make some small spacing adjustments. To change the theme, set the theme argument in the YAML header to one of the following:

- cerulean
- journal
- flatly
- readable
- spacelab
- united
- cosmo
- lumen
- paper
- sandstone
- simplex
- yeti
- null

Highlights

You can also change the code syntax-highlighting colors by setting the highlight argument to one of the following options:

- default
- tango
- pygments
- kate
- monochrome
- espresso
- zenburn
- haddock
- textmate

Summary of the Basics

There are many ways to customize an R Markdown report rendered to HTML. This is a great way to display a portfolio of your work if you’re marketing yourself to interested parties. Next, we’ll go over the main methods for wrangling data in R to prepare a raw file for analysis.

Data Wrangling with R

We will be using tidyverse for much of our data organization, so we’ll begin by using the library() command to load the tidyverse package into our R session. We’ll also be using nycflights13, a dataset of flights departing New York City in 2013.

library(tidyverse)
# Install the flights data once from the console, rather than on every knit:
# install.packages("nycflights13")

Loading our Data

Next, we’ll load just the flights table from nycflights13 into a variable. To do this, we use the name of the package followed by :: and the desired dataset. We will then use the head() command, which prints the first 6 rows of our data.

my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Narrowing the Data

In this section, we’re going to filter the data to focus on flights on October 14th. We will use the filter() command, which takes the dataset plus any conditions to narrow the data by.

filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # … with 977 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The command will print our filtered data, but if we want to manipulate it, we should store the subset in a variable like so:

oct_14_flight <- filter(my_data, month == 10, day == 14)

You can add an extra set of parentheses around the entire line of code above to both print the filtered data and store the subset in a variable.
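For example:

(oct_14_flight <- filter(my_data, month == 10, day == 14))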

Operators in R

Note that we used double equals signs instead of a single one in the previous commands: in R, == tests for equality, while a single = is used for assignment and for naming function arguments. The comparison operators take the below forms:

- Equals: ==
- Not equal to: !=
- Greater than: >
- Less than: <
- Greater than or equal to: >=
- Less than or equal to: <=

As an example, below is the less-than operator being used to show flights through September:

(flight_thru_sept <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 252,474 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

In addition to the basic operators, logical operators can also be used. These include:

- And: &
- Or: |
- Not: !

Below is another example, where we are looking at specifically flights in both March and April.

March_April_Flight <- filter(my_data, month == 3 | month == 4)

Think for a moment about why we didn’t use the & operator. R, like many programming languages, takes operators very literally: month == 3 & month == 4 would ask for flights whose month is simultaneously 3 and 4, which is impossible. Thus, we use the | operator to look at all individual flights that occurred in March OR April.

Finally, let’s look at an example using the ! operator, which we briefly covered above with “Not equal to” being !=. Below, we get a subset of all flights except those in January.

Non_Jan_Flight <- filter(my_data, month != 1)
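For completeness, here is a small sketch combining & with a comparison: December flights that departed more than an hour late (the 60-minute cutoff is just an illustration).

Delayed_Dec_Flight <- filter(my_data, month == 12 & dep_delay > 60)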

Helper Functions

In R, there are a number of functions available to enable the reorganization, selection, and filtering of data in addition to the options above.

Arrange

Arrange lets us order our data by given variables in the set, sorting sequentially. For example, the below code will arrange our data set by year, then day, then month. By default, data is sorted in ascending order.

arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Alternatively, you can use desc() to sort in descending order instead.

descending <- arrange(my_data, desc(year), desc(day), desc(month))
print(descending)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    31       13           2359        14      439            437
##  2  2013    12    31       18           2359        19      449            444
##  3  2013    12    31       26           2245       101      129           2353
##  4  2013    12    31      459            500        -1      655            651
##  5  2013    12    31      514            515        -1      814            812
##  6  2013    12    31      549            551        -2      925            900
##  7  2013    12    31      550            600       -10      725            745
##  8  2013    12    31      552            600        -8      811            826
##  9  2013    12    31      553            600        -7      741            754
## 10  2013    12    31      554            550         4     1024           1027
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Select

Select allows you to specify columns that you want to see or manipulate. The below code will pull out just the year, month, and day columns of our dataset and store it in a variable called calendar.

calendar <- select(my_data, year, month, day)
print(calendar)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

Select also allows you to look at a range of columns. The below variable will include all columns between month and carrier, inclusive. Note that the year column is now gone from the printed rows.

calendar2 <- select(my_data, month:carrier)
print(calendar2)
## # A tibble: 336,776 × 9
##    month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     1     1      517            515         2      830            819
##  2     1     1      533            529         4      850            830
##  3     1     1      542            540         2      923            850
##  4     1     1      544            545        -1     1004           1022
##  5     1     1      554            600        -6      812            837
##  6     1     1      554            558        -4      740            728
##  7     1     1      555            600        -5      913            854
##  8     1     1      557            600        -3      709            723
##  9     1     1      557            600        -3      838            846
## 10     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 2 more variables: arr_delay <dbl>,
## #   carrier <chr>

Select can also make use of the ! operator to exclude specific columns or ranges of columns. Below, we store every column except year, day, and the columns in between. Note that in addition to !, some functions like select also accept - as a ‘not’ operator.

everything_else <- select(my_data, !(year:day))
print(everything_else)
## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Rename

You can also rename columns of your data using rename(), which uses the syntax new_name = old_name. Note that in the example, we replace my_data with the rename function’s result, overwriting the previous data. This doesn’t change the raw file, but it does permanently change the variable’s contents for this session.

my_data <- rename(my_data, departure_time = dep_time)
head(my_data)
## # A tibble: 6 × 19
##    year month   day departure_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>          <int>          <int>     <dbl>    <int>
## 1  2013     1     1            517            515         2      830
## 2  2013     1     1            533            529         4      850
## 3  2013     1     1            542            540         2      923
## 4  2013     1     1            544            545        -1     1004
## 5  2013     1     1            554            600        -6      812
## 6  2013     1     1            554            558        -4      740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Mutate

If you need to add new columns to your data frame, the mutate function can do so. To begin, however, we’re going to make our data frame smaller to better facilitate this example.

my_data_s <- select(my_data, year:day, distance, air_time)

Next, we will calculate the average speed of the flights using mutate. To do this, we create a new column whose values equal distance divided by air_time (which is in minutes), multiplied by 60 to give miles per hour.

my_data_speed <- mutate(my_data_s, speed = distance/air_time*60)
head(my_data_speed)
## # A tibble: 6 × 6
##    year month   day distance air_time speed
##   <int> <int> <int>    <dbl>    <dbl> <dbl>
## 1  2013     1     1     1400      227  370.
## 2  2013     1     1     1416      227  374.
## 3  2013     1     1     1089      160  408.
## 4  2013     1     1     1576      183  517.
## 5  2013     1     1      762      116  394.
## 6  2013     1     1      719      150  288.

Transmute

What if we want to create a new data frame containing only the calculated data? We can use the transmute function for this, repeating the calculation so that it is stored alone in a separate dataset. You can also create multiple new columns with different calculations in this fresh data frame, as we do below with miles per hour and miles per minute:

airspeed <- transmute(my_data_s, speed_mph = distance/air_time*60 , speed_mpm = distance/air_time)
head(airspeed)
## # A tibble: 6 × 2
##   speed_mph speed_mpm
##       <dbl>     <dbl>
## 1      370.      6.17
## 2      374.      6.24
## 3      408.      6.81
## 4      517.      8.61
## 5      394.      6.57
## 6      288.      4.79

Summarize and group_by

Summarize runs a function on a data column to collapse it into a single return value. In the below example, we will find the mean delay time. Note the na.rm argument, short for “NA remove” (NA is R’s marker for missing values); setting it to TRUE makes the function ignore NA data.

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

As we can see, the average delay time was ~12.6 minutes.

We can get additional value out of summarize by using group_by, which lets you group rows by the values of other columns. Below is an example where we group the data by year, month, and day, then compute the same delay mean on the grouped dataset.

by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # … with 355 more rows

As you can see, we now have a mean delay by each specific day, with the first value showing a mean delay of 11.5 minutes on January 1st, 2013.

Other Helper Functions

There are a number of other helpful functions for selecting and organizing data; a sketch follows the list. The examples below use quotation marks to denote string values, but numeric values can also be used.

- starts_with("xyz"): selects columns whose names start with xyz
- ends_with("xyz"): selects columns whose names end with xyz
- contains("xyz"): selects columns whose names contain xyz
- matches("xyz"): selects columns whose names match the regular expression xyz
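For instance, this selects every column of our flights data whose name ends in "time":

select(my_data, ends_with("time"))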

Missing Data

As was briefly mentioned for summarize, some data can be NA (“not available”): essentially empty values, or some type of data that R cannot parse. NA values can be detrimental to many analyses, as any summary function or operation that includes an NA value will itself return NA. We can see this in the below example, in which we try to get the mean delay of our grouped data without the na.rm argument:

summarize(by_day, delay = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # … with 355 more rows

As you can see, every visible calculated delay is NA. This is because cancelled flights in this dataset have NA as their delay. We can check which rows have NA values using the is.na function. Below, we check airspeed for values that are NA. (The full is.na output has been truncated with head() due to its length; it originally printed some 500 lines.)

check_na <- is.na(airspeed)
head(check_na)
##      speed_mph speed_mpm
## [1,]     FALSE     FALSE
## [2,]     FALSE     FALSE
## [3,]     FALSE     FALSE
## [4,]     FALSE     FALSE
## [5,]     FALSE     FALSE
## [6,]     FALSE     FALSE

We can filter out NA values using the filter function that we went over previously. Below, we keep only the rows of my_data where neither dep_delay nor arr_delay is NA.

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
head(not_cancelled)
## # A tibble: 6 × 19
##    year month   day departure_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>          <int>          <int>     <dbl>    <int>
## 1  2013     1     1            517            515         2      830
## 2  2013     1     1            533            529         4      850
## 3  2013     1     1            542            540         2      923
## 4  2013     1     1            544            545        -1     1004
## 5  2013     1     1            554            600        -6      812
## 6  2013     1     1            554            558        -4      740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Now, with the NA data removed, we can rerun summarize on the new variable without the na.rm argument and get essentially the same result as the original summarize call (12.6 minutes).

summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

The final result differs very slightly because we also dropped flights with a missing arr_delay, but it rounds to the same approximate value of 12.6.

Counts

Counting can be used to tally value ranges, value types, or exact values. For example, we can count the number of values that are NA.

sum(is.na(my_data$dep_delay))
## [1] 8255

This shows us that 8,255 values of dep_delay in my_data are NA. Likewise, you can count how many values are not NA.

sum(!is.na(my_data$dep_delay))
## [1] 328521

Note the usage of the dollar sign; following a data frame with a dollar sign and a variable name targets a specific column for the function.

Piping

When using tibble datasets (which will be discussed more later in this document), results can be piped to avoid the dollar-sign targeting of specific variables. As an example, we can compute the mean departure time for each day.

Below, the %>% operator passes the dataset in front of it into each function that follows, so everything below the symbol applies to that dataset.

my_data %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(departure_time, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 1385.
##  2  2013     1     2 1354.
##  3  2013     1     3 1357.
##  4  2013     1     4 1348.
##  5  2013     1     5 1326.
##  6  2013     1     6 1399.
##  7  2013     1     7 1341.
##  8  2013     1     8 1335.
##  9  2013     1     9 1326.
## 10  2013     1    10 1333.
## # … with 355 more rows

Tibbles

Tibbles are essentially modernized data frames that adjust some of the data frame’s older behaviors. R is an old language, and tibbles give data frames functionality better suited to modern use. Below, we will begin by loading the tibble package and converting the built-in iris dataset to a tibble.

library(tibble)
as_tibble(iris)

Tibbles can also be created from scratch using tibble(). An example is shown below; note that later columns can refer to earlier ones, as z does:

tibble(
  x = 1:5,
  y = 1,
  z = x^2 + y
)
## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

You can also use tribble() (a transposed tibble) to write small data tables row by row. These are more ‘drawn out’ when written, reading much like the final table; an example is given below:

tribble(
  ~gene_a, ~gene_b, ~gene_c,
  ##########################
  110,       112,     114,
  6,          5,       4
)
## # A tibble: 2 × 3
##   gene_a gene_b gene_c
##    <dbl>  <dbl>  <dbl>
## 1    110    112    114
## 2      6      5      4

Tibbles are designed to not overwhelm your console when printing, showing only the first few rows. Recall the omission of the full is.na printout: plain data frames will print many more lines unless you use head().
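If you do want to see more rows, the tibble print method accepts an n argument (a quick sketch using the iris tibble from above):

print(as_tibble(iris), n = 15)  # show the first 15 rows instead of the default 10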

Subsetting

Subsetting tibbles is much like subsetting from a data frame. Below, we set up a tibble for the flights data from nycflights13.

df_tibble <- tibble(nycflights13::flights)
df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We can subset by column name using the dollar sign. Below, we can show the subset of all carriers from our flight data. (head was used to truncate the print)

head(df_tibble$carrier)
## [1] "UA" "UA" "AA" "B6" "DL" "UA"

We can also subset by position using brackets. Below, we will use the column number 6 to look at the subset of dep_delay. (head was used to truncate as above)

head(df_tibble[[6]])
## [1]  2  4  2 -1 -6 -4

To subset like this within a pipe, you must use . as a placeholder for the data before the dollar sign. Below is the carrier subset rewritten using piping.

head(df_tibble %>%
       .$carrier)
## [1] "UA" "UA" "AA" "B6" "DL" "UA"

As nice as tibbles are, some older functions do not play nicely with them, so you might have to convert back to a plain data frame. This is easily done with as.data.frame(). See below for an example of the conversion and a comparison of how tibbles and data frames print:

class(df_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble_2)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##             time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00

Tidyr

Tidy data makes for readable and straightforward analysis. For a dataset to be tidy, tidyverse has 3 main rules:

    1. Each variable must have its own column
    2. Each observation must have its own row
    3. Each value must have its own cell

To adequately meet these requirements, the following instructions must be followed to produce tidy data:

    1. Put each dataset into a tibble
    2. Put each variable into a column
    3. Profit

Picking a consistent method of data storage makes for readable, understandable code, and it makes deciphering your own older code far easier when you look back at past projects.

But, before anything else, we have to start by loading the tidyverse library:

library(tidyverse)

Next, we’ll store the built-in dataset ‘women’, a small set of heights and weights for fifteen women, in a tibble, then mutate it to add a bmi column.

bmi <- tibble(women)

bmi %>%
  mutate(bmi = (703*weight)/(height)^2)
## # A tibble: 15 × 3
##    height weight   bmi
##     <dbl>  <dbl> <dbl>
##  1     58    115  24.0
##  2     59    117  23.6
##  3     60    120  23.4
##  4     61    123  23.2
##  5     62    126  23.0
##  6     63    129  22.8
##  7     64    132  22.7
##  8     65    135  22.5
##  9     66    139  22.4
## 10     67    142  22.2
## 11     68    146  22.2
## 12     69    150  22.1
## 13     70    154  22.1
## 14     71    159  22.2
## 15     72    164  22.2

Spreading and Gathering

Gathering

Sometimes you’ll run into datasets that aren’t tidy. We’ll use a table built into tidyverse for this, called table4a, which holds the number of cases of a pathogen in Afghanistan, Brazil, and China.

print(table4a)
## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

As you can see in the printed table, the first column holds one variable (country), but the 1999 and 2000 columns are really values of a single variable (year). Thus, we have two observations in each row, breaking rule #2 for tidy data. To fix this, we can use the gather function to restructure the table so year is its own column.

table4a_gathered <- table4a %>%
  gather('1999','2000', key = 'year', value = 'cases')
table4a_gathered
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

As you can see in the code itself, we start by listing the columns we want to combine into a single variable, those being the two year columns. Next, we use the key argument to name the new column that holds the old column names. Finally, we use value to name the column that collects the data originally stored under 1999 and 2000.

In the new tibble, we can see that each variable has a unique column, and the case numbers remained coupled with their year and country.

Let’s look at another example using table4b.

table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

We have much the same issues as we did with table4a here, so let’s gather it.

table4b_gathered <- table4b %>%
  gather('1999','2000', key = 'year', value = 'population')
table4b_gathered
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

Now, with our tables reorganized, what if we want to join them together via common variables? For this, we can use dplyr’s left_join() function.

table4 <- left_join(table4a_gathered, table4b_gathered)
## Joining, by = c("country", "year")
table4
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583

Spreading

Spreading is the opposite of gathering. This can be useful with datasets that have redundant data, such as in table2.

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

The repeating country and year values make for untidy data: each observation (a country in a given year) is split across two rows that differ only in ‘type’. To fix this, we will use the spread() function to spread the type column into two new variables - cases and population - while moving each count into the column named by its type. We can see the results of this below:

spread(table2, key = 'type', value = 'count')
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Similar to the gather() function, key defines the column whose values become the new column names, while value defines the column whose contents fill those newly created columns.

Summary

Spread tends to make long tables shorter and wider by condensing repeated observations into their own columns via common variables, while gather makes wide tables narrower and longer by distributing columns of data to a single column with new rows.

Separating and Uniting

Separate

Now, what do we do if we have two values in the same cell? We see this issue in table3, where the rate column holds both cases and population in a single cell.

table3
## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

The separate() function will split the two observations into their own columns using the names we give it. See the example below:

table3_fix <- table3 %>%
  separate(rate, into=c('cases','population'))
table3_fix
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Although this appears to fully fix our tibble, recall that class is important for analysis: our cases and population columns retained the chr (character) type despite holding integers. This can be fixed by adding the convert = TRUE argument to our separate function, as seen below:

table3_fix <- table3 %>%
  separate(rate, into=c('cases', 'population'), convert = TRUE)
table3_fix
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Importantly, it must be noted that separate() will, by default, split on any non-alphanumeric character. If you want to specify the separation boundary yourself, you can use the sep argument. An example is shown below:

table3_fix <- table3 %>%
  separate(rate, into=c('cases','population'), sep = '/', convert = TRUE)
table3_fix
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Here’s another example, where we split year into century and year by passing sep = 2. A numeric sep splits by position, so this splits year after its second character.

table3_cent <- table3 %>%
  separate(year, into=c('century','year'), sep = 2, convert = TRUE)
table3_cent
## # A tibble: 6 × 4
##   country     century  year rate             
##   <chr>         <int> <int> <chr>            
## 1 Afghanistan      19    99 745/19987071     
## 2 Afghanistan      20     0 2666/20595360    
## 3 Brazil           19    99 37737/172006362  
## 4 Brazil           20     0 80488/174504898  
## 5 China            19    99 212258/1272915272
## 6 China            20     0 213766/1280428583

Unite

What if we want the inverse effect of separate? Let’s start with table5, which looks much like the table we previously created by separating the year, except without conversion. Below is a print of the original table, followed by the same table with the century and year united.

table5
## # A tibble: 6 × 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5_un <- table5 %>%
  unite(date, century, year, sep='')
table5_un$date <- as.integer(table5_un$date)
table5_un
## # A tibble: 6 × 3
##   country      date rate             
##   <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Note that in this code, I used the sep argument to set an empty string as our separator; in unite, sep defines the symbol placed between the united values, which is an underscore (_) by default. Additionally, I added the line beginning with table5_un$date, which converts the previously character-class date values into their intended integer class.

Missing Values in Tibbles

There are two forms of missing values: explicit (recorded as NA) and implicit (simply never entered). Below, I’ll construct a gene_data tibble for illustration purposes:

gene_data <- tibble(
  gene = c('a','a','a','a','b','b','b'),
  nuc = c(20,22,24,25,NA,42,67),
  run = c(1,2,3,4,2,3,4)
)

gene_data
## # A tibble: 7 × 3
##   gene    nuc   run
##   <chr> <dbl> <dbl>
## 1 a        20     1
## 2 a        22     2
## 3 a        24     3
## 4 a        25     4
## 5 b        NA     2
## 6 b        42     3
## 7 b        67     4

The nucleotide count for gene b in run 2 is explicitly missing, while the nucleotide count for gene b in run 1 is implicitly missing. Implicit missing data can be troublesome as there is no way to tell that it is missing without reading the tibble directly.

Implicit to Explicit

One way to make implicit missing values explicit is by putting runs in columns via spread().

gene_data_2 <- gene_data %>%
  spread(gene, nuc)
gene_data_2
## # A tibble: 4 × 3
##     run     a     b
##   <dbl> <dbl> <dbl>
## 1     1    20    NA
## 2     2    22    NA
## 3     3    24    42
## 4     4    25    67

With this layout, we can clearly see that the implicit missing value for b has been made explicit. To remove the explicit NAs, we can spread and then gather with na.rm = TRUE.

gene_data_3 <- gene_data %>%
  spread(gene, nuc) %>%
  gather(gene, nuc, 'a':'b', na.rm = TRUE)
gene_data_3
## # A tibble: 6 × 3
##     run gene    nuc
##   <dbl> <chr> <dbl>
## 1     1 a        20
## 2     2 a        22
## 3     3 a        24
## 4     4 a        25
## 5     3 b        42
## 6     4 b        67

By first spreading the data, we made all implicit and explicit values for gene b visible. Then, using gather, we reorganized the gene a and gene b columns back into a single gene column, ignoring any NA values thanks to the na.rm = TRUE argument. Now, we have a clean and tidy version of our gene_data dataset.

As an aside, another way to make implicit values explicit is to use the complete() function.

gene_data %>%
  complete(gene, run)
## # A tibble: 8 × 3
##   gene    run   nuc
##   <chr> <dbl> <dbl>
## 1 a         1    20
## 2 a         2    22
## 3 a         3    24
## 4 a         4    25
## 5 b         1    NA
## 6 b         2    NA
## 7 b         3    42
## 8 b         4    67

NA as a Carried Value

Sometimes NA, either explicit or implicit, can be used to represent a value carried forward in a dataset. To illustrate this, I will construct a tribble:

treatment <- tribble(
  ~ person,       ~ treatment,       ~ response,
  ##############################################
  "Isaac",              1,                7,
  NA,                   2,                10,
  NA,                   3,                9,
  "VDB",                1,                8,
  NA,                   2,                11,
  NA,                   3,                10
)
treatment
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Isaac          1        7
## 2 <NA>           2       10
## 3 <NA>           3        9
## 4 VDB            1        8
## 5 <NA>           2       11
## 6 <NA>           3       10

We can see that the NA values here are simply empty spaces meant to represent the value above them. In order to fill these values so the dataset can be used properly, we can use the fill() function.

treatmentfill <- treatment %>%
  fill(person)
treatmentfill
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Isaac          1        7
## 2 Isaac          2       10
## 3 Isaac          3        9
## 4 VDB            1        8
## 5 VDB            2       11
## 6 VDB            3       10

Dplyr

Oftentimes, you will be working with multiple data tables. The dplyr package allows for the joining of two data tables based on common variables, which we’ve seen before in our Spreading and Gathering section.

Mutating joins add new variables to one data frame from the matching observations in another.

Filtering joins filter the observations of one data frame based on whether or not they are present in the other.

Set operations treat observations as if they were set elements.

To demonstrate some of the things that dplyr can do, let’s begin by loading the nycflights13 package.

library(nycflights13)

Keys

Keys are unique identifiers for observations. A primary key uniquely identifies an observation within its own table; in other words, a primary key is a variable that has a unique value in every row. We can check whether a variable is suitable as a primary key using the code below, where I test whether tailnum in the planes dataset of nycflights13 qualifies.

planes %>%
  count(tailnum) %>%
  filter(n>1)
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>

The lack of any rows here means that in the planes dataset, no two tailnum values are identical, making it perfectly suitable as a primary key. Below is an example of a variable that isn’t usable as a primary key:

planes %>%
  count(model) %>%
  filter(n>1)
## # A tibble: 79 × 2
##    model       n
##    <chr>   <int>
##  1 717-200    88
##  2 737-301     2
##  3 737-3G7     2
##  4 737-3H4   105
##  5 737-3K2     2
##  6 737-3L9     2
##  7 737-3Q8     5
##  8 737-3TO     2
##  9 737-401     4
## 10 737-4B7    18
## # … with 69 more rows

Mutating Joins

First, let’s select the relevant data for our purposes here:

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows

Next, we’re going to remove the origin and destination, then join the table with the airlines data by carrier to associate the carrier’s full name with the flight.

flight_carrier <- flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = 'carrier')
flight_carrier
## # A tibble: 336,776 × 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

Other Joins

- Inner joins (inner_join): match a pair of observations when their keys are equal; a sketch follows below
- Outer joins (left_join, right_join, full_join): keep observations that appear in at least one of the tables
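As a small sketch of an inner join using the tables already loaded, the following keeps only the flights whose tailnum also appears in planes (both tables contain a year column, so dplyr will disambiguate them as year.x and year.y):

flights2 %>%
  inner_join(planes, by = 'tailnum')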

Stringr

Stringr is important for many bioinformatics purposes, since sequences and gene identifiers are handled as strings. First, as always, we will load the necessary library before starting:

library(stringr)

String Basics

Strings in R can be created via single or double quotes:

string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite quotation type'

string1
## [1] "this is a string"
string2
## [1] "to put a \"quote\" in your string, use the opposite quotation type"

Multiple strings can be stored in collections as character vectors:

string3 <- c('one','two','three')
string3
## [1] "one"   "two"   "three"

String length can be measured using str_length():

str_length(string1)
## [1] 16
str_length(string3)
## [1] 3 3 5

To combine strings, you can use str_c() with a comma separating both strings. They are combined directly with no additional characters.

str_c('X','y')
## [1] "Xy"
str_c(string1, string2)
## [1] "this is a stringto put a \"quote\" in your string, use the opposite quotation type"

You can use the sep argument to control what is placed between combined strings:

str_c(string1, string2, sep = ". If you want ")
## [1] "this is a string. If you want to put a \"quote\" in your string, use the opposite quotation type"

Subsetting Strings

Strings can be subsetted using the str_sub() function. For the below variable, we’re going to pull out just the numbers from each string. To do this, we pass start and end positions to str_sub() to define the range of characters to keep.

HSP <- c('HSP123','HSP234','HSP456')
HSP
## [1] "HSP123" "HSP234" "HSP456"
HSPnum <- str_sub(HSP,4,6)
HSPnum
## [1] "123" "234" "456"

Alternatively, you can use negative numbers to count backwards from the rightmost part of the string:

HSPmun <- str_sub(HSP,-3,-1)
HSPmun
## [1] "123" "234" "456"

Strings are also often case-sensitive, so there are functions to change case: str_to_lower() converts all characters to lowercase, and str_to_upper() converts them to uppercase. An example is shown below:

HSPLow <- str_to_lower(HSP)
HSPLow
## [1] "hsp123" "hsp234" "hsp456"
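Likewise, str_to_upper() converts the strings back to uppercase:

str_to_upper(HSPLow)
## [1] "HSP123" "HSP234" "HSP456"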

Regular Expressions

In this section, we’re going to look at techniques to improve your gene analysis and binding motif-finding capabilities using stringr. Below, we’re going to create a variable with a few short gene sequences and load a library to help with visualization:

library(htmlwidgets)
x <- c('ATTAGA','CGCCCCCGGAT','TATTA')

str_view(x, 'G')

With str_view, we can highlight the first match of a pattern in each string. A period matches any single character, acting as a placeholder that lets you widen the highlighted match:

str_view(x, '.G.')

You can also use ^ and $ as ‘anchors’, which specify that a match only counts if the string starts or ends with the desired characters:

str_view(x, '^TA')
str_view(x, 'AT$')

Character Classes/Alternatives

Other useful options exist for str_view(), including:

  • \d: matches any digit
  • \s: matches any whitespace character
  • [abc]: matches any one of a, b, or c
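As a quick sketch of the first two (the strings here are made up), remember that backslashes must be doubled inside R strings:

str_view(c('HSP123', 'HSP 456'), '\\d') # highlights the first digit in each string
str_view(c('HSP123', 'HSP 456'), '\\s') # highlights the space in 'HSP 456'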

In the below example, the search will essentially look for TAG or TAT since G and T are placed within the brackets and therefore are interchangeable.

str_view(x, 'TA[GT]')

Conversely, [^abc] will match for anything other than one of a, b, or c:

str_view(x, 'TA[^T]')

Detecting Matches

str_view() has shown us matches based on our pattern, but str_detect() can be used to detect matches, returning TRUE/FALSE instead of printing highlights. Let’s start by making a dummy set of strings, then see if we can detect the letter e.

y <- c('apple','banana','pear')
y
## [1] "apple"  "banana" "pear"
str_detect(y, 'e')
## [1]  TRUE FALSE  TRUE

As you can see, the function determines whether or not e is present in each string, returning TRUE if the string contains it and FALSE if it does not. Next, let’s see how many words contain the letter e in the built-in words dataset.

sum(str_detect(words, 'e'))
## [1] 536

Let’s get more complex; what proportion of words in the set start with a vowel, and what proportion end with a vowel?

mean(str_detect(words, '^[aeiou]'))
## [1] 0.1785714
mean(str_detect(words, '[aeiou]$'))
## [1] 0.2765306

17.9% of words start with a vowel, and 27.7% of words end with a vowel in this dataset. Next, let’s find all words that don’t contain ‘o’ or ‘u’ by specifying what we want from the words set:

wrds <- words[!str_detect(words, '[ou]')]
head(wrds)
## [1] "a"       "able"    "accept"  "achieve" "act"     "active"

In addition to str_detect(), str_count() can be used to count the number of matches found in each string. Let’s use the x dataset we made:

x
## [1] "ATTAGA"      "CGCCCCCGGAT" "TATTA"
str_count(x,'[GC]')
## [1] 1 9 0
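Coupling str_count() with str_length() gives a quick GC-content calculation; a small sketch using our x sequences:

str_count(x, '[GC]') / str_length(x) # GC fraction per sequence (about 0.17, 0.82, 0.00)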

Let’s couple string matching with mutate. First, let’s create a tibble for the words dataset:

df <- tibble(
  word = words,
  wordnum = seq_along(word)
)
head(df)
## # A tibble: 6 × 2
##   word     wordnum
##   <chr>      <int>
## 1 a              1
## 2 able           2
## 3 about          3
## 4 absolute       4
## 5 accept         5
## 6 account        6

Next, let’s mutate the tibble to add a vowels and consonants count for each word:

df %>%
  mutate(
    vowels = str_count(word, '[aeiou]'),
    consonants = str_count(word, '[^aeiou]')
  )
## # A tibble: 980 × 4
##    word     wordnum vowels consonants
##    <chr>      <int>  <int>      <int>
##  1 a              1      1          0
##  2 able           2      2          2
##  3 about          3      3          2
##  4 absolute       4      4          4
##  5 accept         5      2          4
##  6 account        6      3          4
##  7 achieve        7      4          3
##  8 across         8      2          4
##  9 act            9      1          2
## 10 active        10      3          3
## # … with 970 more rows
Note that mutate() returned a modified copy; since we never assigned the result, df itself still has only its original two columns:

head(df)
## # A tibble: 6 × 2
##   word     wordnum
##   <chr>      <int>
## 1 a              1
## 2 able           2
## 3 about          3
## 4 absolute       4
## 5 accept         5
## 6 account        6

Microarray Analysis

Microarrays, though now somewhat dated, are still widely analyzed, since many older experiments used this method of collecting genetic data. The method generally required knowing which genes to test for, hence the modern tendency toward full genome sequencing. Nonetheless, microarrays remain useful and important for broad analyses like meta-analyses.

Now, before we get into it proper, we have to set our working directory and load a number of libraries:

library(ath1121501.db)
library(ath1121501cdf)
library(annotate)
library(affy)
library(limma)
library(oligo)
library(dplyr)
library(stats)
library(reshape)
library(imager)
library(stringr)

## setwd("~Desktop/classroom/myfiles")

Naturally, your working directory may vary; set it to wherever you have Bric16_Targets.csv and the other necessary files saved. Because of the way our RStudio is set up, we cannot change the working directory, so I have commented out the line that sets it.

Reading the CEL File

Now that we have our directory set and libraries loaded, we can read our file and begin the analysis.

targets <- readTargets("Bric16_Targets.csv", sep=',', row.names = 'filename')

ab <- ReadAffy()

With the data read and stored in variables, we’ll plot a basic histogram:

hist(ab)

One important thing to note is that in microarray experiments, measurements are taken from the light each spot gives off. This light is directly proportional to the amount of bound RNA, and a given sample may simply have more RNA on average than another, so we must normalize our data.

eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
##                                          sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL      1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL      2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL      3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL       4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL       5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL       6
hist(eset)

As you can see, compared to the last graph we have a much tighter fit between experiments; the differences in overall signal have been normalized, giving us relative values that can be compared across microarray experiments. With that done, let’s characterize our data next.

Characterization of the Data

We will start by getting the probe names from our data set. featureNames() returns the probe names, and getSYMBOL() finds the gene symbol for each probe using one of the databases we loaded. Lastly, we will use read.delim() to create a data frame from a large Arabidopsis array-annotation table.

ID <- featureNames(eset)
Symbol <- getSYMBOL(ID, 'ath1121501.db')
affydata <- read.delim("affy_ATH1_array_elements.txt")

Differential Gene Expression Analysis using Microarray

First, we need a variable to represent ground or flight conditions for our data. We will set this up by storing the gravity column of our targets file as the condition variable, then use model.matrix() to build a design matrix from that factor. colnames() is used to set the design matrix’s column names.

condition <- targets$gravity
design <- model.matrix(~factor(condition))
colnames(design) <- c('flight','ground')

Calculating Analysis Statistics

With our flight and ground variable set up, we will first fit a linear model for each gene in our data given a series of arrays using lmFit and eBayes. lmFit establishes the linear model and eBayes will compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors. In other words, eBayes will show us if there is significant differential expression in our linear model. This code is shown below:

fit <- lmFit(eset, design)
fit <- eBayes(fit)

Next, we will make the data look a little nicer with options(digits=2) to limit the number of digits rendered, as well as topTable to have a list of the most differentially expressed genes. We will also use an FDR (false discovery rate) adjustment as our multiple-test correction.

options(digits = 2)
top <- topTable(fit, coef = 2, n = Inf, adjust='fdr')

Combining our Annotations

Next, let’s create a data frame to put everything together. Each line of the data frame construction below creates a column from the annotation database’s table of the same name. This frame will include gene names, KEGG data (a type of pathway annotation), gene ontology data, symbols, and accession numbers.

After making this data frame, we are going to merge it with our topTable of significantly differentially expressed genes, matching on row names for both Annot (by.x = 0) and top (by.y = 0). Then, we will merge that data frame with our affydata, associating the Row.names column of our all data frame with array_element_name in affydata. Lastly, we’ll set ACCNUM to be character class.

Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ', '),
                    KEGG = sapply(contents(ath1121501PATH), paste, collapse = ', '),
                    GO = sapply(contents(ath1121501GO), paste, collapse = ','),
                    SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ', '),
                    ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ', '))


all <- merge(Annot, top, by.x=0, by.y=0, all = TRUE)
all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")

all2$ACCNUM <- as.character(all2$ACCNUM)

Finally, with our table compiled, we will write the resulting data to a file. Note the sep argument; this places a tab as the separator.

write.table(all2, file="BRIC_16_FINAL.csv", row.names = TRUE, col.names = TRUE, sep = '\t')
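Since the file is tab-separated despite its .csv extension, reading it back in would also use tabs; a hedged sketch:

# A sketch of reading the table back in; read.delim uses tab as its default separator
reloaded <- read.delim("BRIC_16_FINAL.csv")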

Next, we’re going to use the org.At.tair.db database, which includes much more diverse information on the differentially expressed genes we were looking at. Below is a list of the columns of data in the database:

columns(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"

With this new database, we will add a column called entrez to our all2 dataframe using mapIds(). mapIds() looks in the Arabidopsis database we’re using and matches each ACCNUM to its entrezID. The keytype argument denotes that ACCNUM holds TAIR-style keys, and multiVals tells the function to use the first match if multiple are present.

We’ll also store the fold change as its own variable for our next section on Pathview, which will give biological context to our data.

all2$entrez = mapIds(org.At.tair.db,
                     keys = all2$ACCNUM,
                     column = "ENTREZID",
                     keytype = "TAIR",
                     multiVals = 'first')
                     
foldchanges <- as.data.frame(all2$logFC)

Pathview for Microarray Analysis

First, let’s load our libraries and set up some annotation data. Pathview builds excellent visual pathway diagrams, and the gage package’s name stands for Generally Applicable Gene-set Enrichment.

library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)

Next, we will assign genes with entrezIDs to their appropriate fold change values:

foldchanges = all2$logFC
names(foldchanges) = all2$entrez

head(foldchanges)
##   <NA>   <NA>   <NA>   <NA>   <NA>   <NA> 
## -0.742 -0.826 -1.214 -0.568  0.120  0.055

In this next step, we are going to map genes to our KEGG data’s annotations. We’ll start by setting kegg.ath (our Arabidopsis KEGG data) equal to kegg.gsets() for Arabidopsis thaliana’s (‘ath’) entrezIDs, then subset to the signaling and metabolic (sigmet) gene sets in their own object. See this step below:

kegg.ath = kegg.gsets('ath', id.type = 'entrez')
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]

Results via Gage

To get the results of our analysis, we will use the gage function with foldchanges and our gene set kegg.ath.sigmet. Note the same.dir argument, which tells gage to test separately for gene sets changing in a single direction (up or down), rather than for changes in both directions at once:

keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                      p.geomean stat.mean   p.val   q.val
## ath03010 Ribosome                      2.4e-27      11.5 2.4e-27 2.7e-25
## ath00195 Photosynthesis                7.6e-06       4.6 7.6e-06 4.3e-04
## ath04145 Phagosome                     3.2e-05       4.1 3.2e-05 1.2e-03
## ath01230 Biosynthesis of amino acids   5.5e-05       3.9 5.5e-05 1.5e-03
## ath00190 Oxidative phosphorylation     6.0e-04       3.3 6.0e-04 1.3e-02
## ath00020 Citrate cycle (TCA cycle)     1.4e-03       3.1 1.4e-03 2.6e-02
##                                      set.size    exp1
## ath03010 Ribosome                         261 2.4e-27
## ath00195 Photosynthesis                    44 7.6e-06
## ath04145 Phagosome                         71 3.2e-05
## ath01230 Biosynthesis of amino acids      201 5.5e-05
## ath00190 Oxidative phosphorylation         97 6.0e-04
## ath00020 Citrate cycle (TCA cycle)         57 1.4e-03
## 
## $less
##                                                      p.geomean stat.mean  p.val
## ath04016 MAPK signaling pathway - plant                 0.0035      -2.7 0.0035
## ath00906 Carotenoid biosynthesis                        0.0198      -2.1 0.0198
## ath04141 Protein processing in endoplasmic reticulum    0.0283      -1.9 0.0283
## ath00592 alpha-Linolenic acid metabolism                0.0552      -1.6 0.0552
## ath00071 Fatty acid degradation                         0.1160      -1.2 0.1160
## ath04122 Sulfur relay system                            0.1415      -1.1 0.1415
##                                                      q.val set.size   exp1
## ath04016 MAPK signaling pathway - plant                0.4      129 0.0035
## ath00906 Carotenoid biosynthesis                       1.0       28 0.0198
## ath04141 Protein processing in endoplasmic reticulum   1.0      189 0.0283
## ath00592 alpha-Linolenic acid metabolism               1.0       36 0.0552
## ath00071 Fatty acid degradation                        1.0       45 0.1160
## ath04122 Sulfur relay system                           1.0       11 0.1415
## 
## $stats
##                                      stat.mean exp1
## ath03010 Ribosome                         11.5 11.5
## ath00195 Photosynthesis                    4.6  4.6
## ath04145 Phagosome                         4.1  4.1
## ath01230 Biosynthesis of amino acids       3.9  3.9
## ath00190 Oxidative phosphorylation         3.3  3.3
## ath00020 Citrate cycle (TCA cycle)         3.1  3.1

We use lapply() here to print the head of each element of the keggres results list in table form.

Separating Upregulated/Downregulated

Thanks to gage, our keggres data is separated into downregulated (less) and upregulated (greater) sections. With this in mind, we can separate these into their own variables, then write each into a csv file for storage or later analysis:

greater <- keggres$greater
lesser <- keggres$less

write.csv(greater, file = 'BRIC_16_pathview_Greater_Pathways.csv')
write.csv(lesser, file = 'BRIC_16_pathview_Lesser_Pathways.csv')

Mapping Genes to Pathways

First, to begin the mapping process, we will take our upregulated pathways, keep only the first five, and convert their ids to characters. In other words, we’re taking the first 5 rows of keggres$greater, formatting them as a tibble, then extracting the id column as a character vector.

keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
  tibble::as_tibble() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()

Next, we’ll cut out the full names of each pathway, leaving only the pathway identifiers. This cleans up our pathview diagram nicely. We’ll also go ahead and set character-class logFC from all2 as our genedata variable for later use.

keggresids_greater = substr(keggrespathways, start=1, stop=8)


genedata <- as.character(all2$logFC)

Defining the Pathview Function

Finally, we will define the function that produces our pathview diagram. We will call it plot_pathway; it takes a pathway id (pid) and calls pathview to visually show fold changes along the genetic pathways of Arabidopsis thaliana. See this definition below:

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = 'ath', new.signature = FALSE)
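Although the next chunk passes an anonymous function to sapply(), plot_pathway() could also be called directly on a single id; a sketch (pathview writes the image, e.g. ath03010.pathview.png, into the working directory):

plot_pathway(keggresids_greater[1]) # renders one pathway diagram to disk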

Plotting Pathways

When we plot the actual pathways, we will use a temporary variable called tmp. sapply() applies the function to each of the five pathway ids in keggresids_greater. This will generate diagrams of upregulation/downregulation on the relevant pathways.

tmp = sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = 'ath'))

These files will be saved to disk, with the tmp variable being a throwaway object list. With that, we will have five different pathview diagrams that elegantly show the fold changes of the five pathways we selected.

I will also use file.rename() to distinguish these pathview images in the working directory, since we will produce more pathview images later.

To render these files in our markdown document, we can run the following code to tell R what files to use, then in the next chunk, tell knitr to include the images stored in the filenames object.

file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"),
            str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"), pattern='pathview.png', 'pathview.microarray.png'))

micro_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.microarray.png")

allmicro_images <- lapply(micro_filenames, load.image)

dev.off()
knitr::include_graphics(micro_filenames)

RNASeq Analysis with EdgeR

As always, we’ll start this section by loading our required packages and the database org.Mm.eg.db, which is a database of mouse genomic information.

library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")

Basic Analysis/Summary

Loading the Raw Data

To begin the analysis, we’re going to load our raw data, create a group variable to define ground or flight categories, then utilize the DGEList() function to create a differential gene expression list via edgeR.

rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_id")

group <- factor(c('1','1','1','1','1','1','2','2','2','2','2','2'))

dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 2
##   .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
##   .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
##   .. ..$ :'data.frame':  12 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
##   .. .. ..$ lib.size    : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
##   .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ names: chr [1:2] "counts" "samples"

This code chunk prints a good deal of information about our list, including the object’s class and some of the data stored in it.

Cleaning the Data

Next, we want to filter out lowly expressed genes. The following code keeps only genes whose counts per million (cpm) exceed 2 in at least four samples.

keep <- rowSums(cpm(dgeGlm)>2) >=4
dgeGlm <- dgeGlm[keep,]
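To see how aggressive the filter was, you could compare the kept and discarded genes; a quick sketch (the exact counts depend on the data):

table(keep) # FALSE = genes filtered out, TRUE = genes kept
dim(dgeGlm) # remaining genes x samples after filtering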

With this narrowed dataset, we’ll only be looking at genes expressed at a meaningful level across the experiments. The below section of code shows the names of our differential expression list’s components, then prints the sample table as a snapshot of the state of our data.

names(dgeGlm)
## [1] "counts"  "samples"
dgeGlm[["samples"]]
##                              group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23     1  4.0e+07            1
## Mmus_C57.6J_KDN_FLT_Rep2_M24     1  4.1e+07            1
## Mmus_C57.6J_KDN_FLT_Rep3_M25     1  3.8e+07            1
## Mmus_C57.6J_KDN_FLT_Rep4_M26     1  3.9e+07            1
## Mmus_C57.6J_KDN_FLT_Rep5_M27     1  3.7e+07            1
## Mmus_C57.6J_KDN_FLT_Rep6_M28     1  3.6e+07            1
## Mmus_C57.6J_KDN_GC_Rep1_M33      2  3.7e+07            1
## Mmus_C57.6J_KDN_GC_Rep2_M34      2  3.7e+07            1
## Mmus_C57.6J_KDN_GC_Rep3_M35      2  4.0e+07            1
## Mmus_C57.6J_KDN_GC_Rep4_M36      2  3.6e+07            1
## Mmus_C57.6J_KDN_GC_Rep5_M37      2  3.8e+07            1
## Mmus_C57.6J_KDN_GC_Rep6_M38      2  3.5e+07            1

Calculating using EdgeR

Now, to begin using edgeR proper, we need to set our data up in a form the package can interpret. To achieve this, we’ll create a design matrix using model.matrix() based on our group variable.

Then, we will… - calculate the common negative binomial dispersion parameter - estimate the abundance-dispersion trend - compute an empirical Bayes estimate of the negative binomial dispersion parameter for each tag, with expression levels specified via a log-linear model

This is shown in the below code chunk:

design <- model.matrix(~group)

dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)

With these values calculated, let’s plot dgeGlmTagDisp to give some context to all this statistical jargon:

plotBCV(dgeGlmTagDisp)

This plot shows the distribution of the variation between samples. Good data should show a fairly uniform curve, and we see that our data indeed has a relatively uniform appearance. However, there is more analysis to do.

Fitting Linear Models to our Data

For the next stage of our analysis, we’ll fit a negative binomial generalized linear model to each gene using glmFit(), then test for differential expression with a likelihood ratio test via glmLRT().

Then, we will use topTags to pull out the most differentially expressed genes from our object in a similar fashion to topTable from our microarray analysis.

fit <- glmFit(dgeGlmTagDisp, design)

colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef=2)

ttGlm <- topTags(lrt, n = Inf)

Summarizing the Data

Using the summary() function with decideTestsDGE(), we can apply a desired p-value cutoff to narrow our lrt data to only the genes that meet our significance threshold, correcting with the false-discovery-rate adjustment.

With the summary, we can then store only the significant genes by using the output of deGlm as a logical index; anything flagged up- or downregulated (non-zero) will be selected from the dgeGlmTagDisp rownames.

summary(deGlm <- decideTestsDGE(lrt, p=0.1, adjust = 'fdr'))
##        group2
## Down       64
## NotSig  13390
## Up        159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]

Next, we’ll create a smear plot (a type of MA plot), which shows the logFC value for every gene, with red points marking the differentially expressed genes. Good data should show an even distribution of these red points; a heavy concentration of red points would indicate a problem with the experiment or dataset.

plotSmear(lrt, de.tags=tagsGlm)

Next, we’re going to pull out the data whose false-discovery rate is less than 0.1, then write a csv with that data in it.

hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")

Prepping the Data for Pathview

To show our data off in Pathview, we need to do some legwork to prep it and associate the genes with their annotations and pathways.

To start, we’ll set up our database, set up a dataframe version of ttGlm’s table, then map out symbols, entrezIDs, and gene names in a similar fashion to the microarray analysis:

columns(org.Mm.eg.db)

ttGlm2 <- as.data.frame(ttGlm$table)

ttGlm2$symbol = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column="SYMBOL",
                       keytype="ENSEMBL",
                       multiVals="first")

ttGlm2$entrez = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column="ENTREZID",
                       keytype="ENSEMBL",
                       multiVals="first")

ttGlm2$name = mapIds(org.Mm.eg.db,
                      keys=row.names(ttGlm2),
                      column="GENENAME",
                      keytype="ENSEMBL",
                      multiVals="first")
head(ttGlm2)

Pathview for EdgeR RNASeq Analysis

Setting it Up

After loading our packages and datasets required for the analysis, we’ll create a separate variable for our fold changes and associate each logFC datapoint with ttGlm2’s entrezIDs.

library(pathview)
library(gage)
library(gageData)
library(imager)
data(kegg.sets.mm)
data(go.subs.mm)

foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez

With that done, we’ll next associate genes with their annotation data via sigmet, following the same method as in the microarray analysis. With our sigmet data and foldchanges, we can then get our results (keggres).

kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]

keggres = gage(foldchanges, gsets=kegg.mm.sigmet, same.dir=TRUE)
lapply(keggres, head)
## $greater
##                                                                   p.geomean
## mmu03010 Ribosome                                                   9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells   2.0e-03
## mmu04350 TGF-beta signaling pathway                                 7.7e-03
## mmu04330 Notch signaling pathway                                    1.0e-02
## mmu04390 Hippo signaling pathway                                    2.0e-02
## mmu00830 Retinol metabolism                                         2.1e-02
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04350 TGF-beta signaling pathway                                     2.5
## mmu04330 Notch signaling pathway                                        2.4
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                     p.val q.val
## mmu03010 Ribosome                                                 9.5e-05 0.022
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.230
## mmu04350 TGF-beta signaling pathway                               7.7e-03 0.589
## mmu04330 Notch signaling pathway                                  1.0e-02 0.589
## mmu04390 Hippo signaling pathway                                  2.0e-02 0.809
## mmu00830 Retinol metabolism                                       2.1e-02 0.809
##                                                                   set.size
## mmu03010 Ribosome                                                      127
## mmu04550 Signaling pathways regulating pluripotency of stem cells      100
## mmu04350 TGF-beta signaling pathway                                     72
## mmu04330 Notch signaling pathway                                        51
## mmu04390 Hippo signaling pathway                                       125
## mmu00830 Retinol metabolism                                             37
##                                                                      exp1
## mmu03010 Ribosome                                                 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04350 TGF-beta signaling pathway                               7.7e-03
## mmu04330 Notch signaling pathway                                  1.0e-02
## mmu04390 Hippo signaling pathway                                  2.0e-02
## mmu00830 Retinol metabolism                                       2.1e-02
## 
## $less
##                                    p.geomean stat.mean  p.val q.val set.size
## mmu04145 Phagosome                    0.0019      -2.9 0.0019  0.33      121
## mmu04714 Thermogenesis                0.0048      -2.6 0.0048  0.33      207
## mmu04110 Cell cycle                   0.0051      -2.6 0.0051  0.33      110
## mmu04217 Necroptosis                  0.0056      -2.6 0.0056  0.33      113
## mmu00910 Nitrogen metabolism          0.0087      -2.6 0.0087  0.41       13
## mmu00190 Oxidative phosphorylation    0.0118      -2.3 0.0118  0.41      125
##                                      exp1
## mmu04145 Phagosome                 0.0019
## mmu04714 Thermogenesis             0.0048
## mmu04110 Cell cycle                0.0051
## mmu04217 Necroptosis               0.0056
## mmu00910 Nitrogen metabolism       0.0087
## mmu00190 Oxidative phosphorylation 0.0118
## 
## $stats
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04350 TGF-beta signaling pathway                                     2.5
## mmu04330 Notch signaling pathway                                        2.4
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                   exp1
## mmu03010 Ribosome                                                  3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells  2.9
## mmu04350 TGF-beta signaling pathway                                2.5
## mmu04330 Notch signaling pathway                                   2.4
## mmu04390 Hippo signaling pathway                                   2.1
## mmu00830 Retinol metabolism                                        2.1

As before, we’ll separate the data into upregulated (greaters) and downregulated (lessers) genes:

greaters <- keggres$greater
lessers <- keggres$less

Generating the Diagrams

In a fashion similar to the microarray analysis, we’ll create two keggrespathways objects by making a dataframe with the upregulated and downregulated genes, respectively.

First, upregulated genes:

keggrespathways1 = data.frame(id = rownames(keggres$greater), keggres$greater)%>%
  tibble::as_tibble() %>%
  filter(row_number()<=5)%>%
  .$id %>%
  as.character()

keggrespathways1
## [1] "mmu03010 Ribosome"                                                
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04350 TGF-beta signaling pathway"                              
## [4] "mmu04330 Notch signaling pathway"                                 
## [5] "mmu04390 Hippo signaling pathway"
keggresidsgreater = substr(keggrespathways1, start = 1, stop = 8)
keggresidsgreater
## [1] "mmu03010" "mmu04550" "mmu04350" "mmu04330" "mmu04390"

Then, downregulated genes:

keggrespathways2 = data.frame(id = rownames(keggres$less), keggres$less)%>%
  tibble::as_tibble() %>%
  filter(row_number()<=5)%>%
  .$id %>%
  as.character()

keggrespathways2
## [1] "mmu04145 Phagosome"           "mmu04714 Thermogenesis"      
## [3] "mmu04110 Cell cycle"          "mmu04217 Necroptosis"        
## [5] "mmu00910 Nitrogen metabolism"
keggresidslesser = substr(keggrespathways2, start = 1, stop = 8)
keggresidslesser
## [1] "mmu04145" "mmu04714" "mmu04110" "mmu04217" "mmu00910"

Finally, we’ll define our plot_pathway() function and plot our data using pathview and sapply.

First, the upregulated pathways:

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE)

tmp = sapply(keggresidsgreater, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE))

Then, the downregulated pathways:

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE)

tmp = sapply(keggresidslesser, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = 'mouse', new.signature = FALSE))

Finally, we’ll print the diagrams into our markdown file like so:

file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"),
            str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"), pattern='pathview.png', 'pathview.edger.png'))

edger_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.edger.png")

alledger_images <- lapply(edger_filenames, load.image)

dev.off()
knitr::include_graphics(edger_filenames)

RNASeq Analysis with DESeq

For this analysis, we’ll need the following libraries loaded:

library("DESeq2")
library("dplyr")
library('apeglm')

Data Setup and Summary

Loading the Data

Now we need to load our raw data, which we have in two files: counts and phenotype data.

countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("PHENO_DATA_Mouse.csv", sep= ",", row.names=1)

Formatting the Data for DESeq

Our phenotype data needs explicit factor levels to be analyzed by DESeq. In other words, if the phenotype is a yes/no observation, such as spots versus no spots, DESeq needs each phenotype coded as a separate level. For this, we will use 0 and 1:

colData$group <- factor(colData$group, levels = c("0","1"))

Below is the head of our countData. Note that the counts are stored as numeric (potentially fractional) values, and our first column holds the gene names.

head(countData)
##                    X Mmus_C57_6J_KDN_FLT_Rep1_M23 Mmus_C57_6J_KDN_FLT_Rep2_M24
## 1 ENSMUSG00000000001                         2811                         2742
## 2 ENSMUSG00000000003                            0                            0
## 3 ENSMUSG00000000028                           59                           49
## 4 ENSMUSG00000000031                           26                           19
## 5 ENSMUSG00000000037                           20                           26
## 6 ENSMUSG00000000049                            0                            1
##   Mmus_C57_6J_KDN_FLT_Rep3_M25 Mmus_C57_6J_KDN_FLT_Rep4_M26
## 1                         3578                         3118
## 2                            0                            0
## 3                          102                           72
## 4                           17                           23
## 5                           37                           30
## 6                            5                           10
##   Mmus_C57_6J_KDN_FLT_Rep5_M27 Mmus_C57_6J_KDN_FLT_Rep6_M28
## 1                         3610                         3102
## 2                            0                            0
## 3                           80                           84
## 4                           26                           18
## 5                           22                           20
## 6                            0                            4
##   Mmus_C57_6J_KDN_GC_Rep1_M33 Mmus_C57_6J_KDN_GC_Rep2_M34
## 1                        3094                        3112
## 2                           0                           0
## 3                          56                          75
## 4                          15                          21
## 5                          20                          27
## 6                           1                           4
##   Mmus_C57_6J_KDN_GC_Rep3_M35 Mmus_C57_6J_KDN_GC_Rep4_M36
## 1                        3195                        3158
## 2                           0                           0
## 3                          60                          64
## 4                          21                          25
## 5                           8                          20
## 6                          10                           2
##   Mmus_C57_6J_KDN_GC_Rep5_M37 Mmus_C57_6J_KDN_GC_Rep6_M38
## 1                        3339                        2958
## 2                           0                           0
## 3                          56                          79
## 4                          24                          18
## 5                          30                          36
## 6                           1                           1

DESeq only allows integers, not fractional numbers, so we need to round our data for use with DESeq:

countData <- countData %>%
  mutate_if(is.numeric, ceiling)

countData[, 2:13] <- sapply(countData[, 2:13], as.integer)

In addition, we want our row names to be the data stored in the first column, so we’ll run the following code to set the row names, then remove the first column as it is redundant:

row.names(countData) <- countData[, 1]

countData <- countData[2:13]

Finally, let’s check that the samples in colData match the columns of countData, in the same order:

rownames(colData) == colnames(countData)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
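Scanning twelve TRUEs by eye works here, but wrapping the comparison in all() scales better; a one-line sketch:

all(rownames(colData) == colnames(countData)) # TRUE only if every sample matches, in order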

Performing the Analysis with DESeq

With that, we can perform the analysis proper. To start, we will create a DESeq data set using countData, colData, and group as our design. Then, the second line keeps only genes with counts above 2 in at least four samples. Finally, the DESeq() function is called to perform the differential gene expression analysis, and results() is called to extract the results. This is all accomplished in the below code chunk:

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)

dds <- dds[rowSums(counts(dds)>2)>=4]

dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)

Next, we’ll use our differential gene expression data and lfcShrink to adjust our log fold change values, then reorder our results by the p-value significance:

resLFC <- lfcShrink(dds, coef=2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resOrdered <- res[order(res$padj), ]

summary(res)
## 
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 325, 1.5%
## LFC < 0 (down)     : 327, 1.5%
## outliers [1]       : 15, 0.068%
## low counts [2]     : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Presenting Our Data

Let’s pull out specifically the significant fold change data by sourcing it from our res object:

FLT_Vs_GC <- as.data.frame(res$log2FoldChange)

head(FLT_Vs_GC)
##   res$log2FoldChange
## 1             0.0421
## 2            -0.1334
## 3            -0.0185
## 4            -0.0882
## 5            -0.0079
## 6             0.1136

Next, let’s create an MA plot, similar to the smear plot from the edgeR analysis. We’ll use the plotMA function to plot our resLFC object to see if our distribution is relatively uniform, then save the plot to a pdf file.

plotMA(resLFC, ylim = c(-2, 2))

pdf(file = "MA_Plot_FLT_vs_GC.pdf")

dev.off()
## png 
##   2

Let’s also save our results to its own csv file:

write.csv(as.data.frame(resOrdered), file = "Mouse_DESeq.csv")

Performing a Principal Component Analysis

To plot our data in PCA format, we’ll need a custom transformation. We will start by using estimateSizeFactors() to compute library-size normalization factors for our data, then use SummarizedExperiment() to build an object holding the log2-transformed normalized counts.

dds <- estimateSizeFactors(dds)

se <- SummarizedExperiment(log2(counts(dds, normalized = TRUE) + 1), colData = colData(dds))

With all of that set up, we can plot the data as a PCA:

plotPCA(DESeqTransform(se), intgroup='group')

We would want to see the flight (0) and ground (1) samples each grouped together, showing how consistently the biological replicates capture the differential expression. This is unfortunately not a great example of a PCA, which is a frequent problem with spaceflight data due to low numbers of replicates.

Setting things up for Pathview

We’ll need to load a couple libraries to start with for this section:

library(AnnotationDbi)
library(org.Mm.eg.db)

Next, we’ll perform a series of steps that should now be very familiar. We are going to create a dataframe using our fold change data from the res object, then associate those fold changes with gene symbol, entrezID, and gene name. This is carried out below:

columns(org.Mm.eg.db)

foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))

res$symbol = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "SYMBOL",
                    keytype = "ENSEMBL",
                    multiVals = 'first')

res$entrez = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "ENTREZID",
                    keytype = "ENSEMBL",
                    multiVals = 'first')

res$name = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "GENENAME",
                    keytype = "ENSEMBL",
                    multiVals = 'first')

Pathview for DESeq RNASeq Analysis

Setting it Up

Let’s start by loading our packages, then create a separate variable for our fold changes and associate each log2FoldChange datapoint with the appropriate entrezIDs:

library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)

foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez

Next, we’ll prepare our sigmet data using the kegg.gsets mouse data using entrezIDs:

kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]

With the sigmet data and foldchanges, we can use gage as before to create our keggres object:

keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                           p.geomean stat.mean  p.val q.val
## mmu03010 Ribosome                            0.0056       2.6 0.0056   0.9
## mmu04022 cGMP-PKG signaling pathway          0.0309       1.9 0.0309   0.9
## mmu04360 Axon guidance                       0.0382       1.8 0.0382   0.9
## mmu04330 Notch signaling pathway             0.0518       1.6 0.0518   0.9
## mmu04658 Th1 and Th2 cell differentiation    0.0538       1.6 0.0538   0.9
## mmu04350 TGF-beta signaling pathway          0.0544       1.6 0.0544   0.9
##                                           set.size   exp1
## mmu03010 Ribosome                              133 0.0056
## mmu04022 cGMP-PKG signaling pathway            153 0.0309
## mmu04360 Axon guidance                         176 0.0382
## mmu04330 Notch signaling pathway                55 0.0518
## mmu04658 Th1 and Th2 cell differentiation       78 0.0538
## mmu04350 TGF-beta signaling pathway             87 0.0544
## 
## $less
##                                                   p.geomean stat.mean p.val
## mmu04657 IL-17 signaling pathway                      0.011      -2.3 0.011
## mmu04110 Cell cycle                                   0.020      -2.1 0.020
## mmu04145 Phagosome                                    0.027      -1.9 0.027
## mmu04621 NOD-like receptor signaling pathway          0.042      -1.7 0.042
## mmu04625 C-type lectin receptor signaling pathway     0.044      -1.7 0.044
## mmu04115 p53 signaling pathway                        0.048      -1.7 0.048
##                                                   q.val set.size  exp1
## mmu04657 IL-17 signaling pathway                    0.9       75 0.011
## mmu04110 Cell cycle                                 0.9      121 0.020
## mmu04145 Phagosome                                  0.9      145 0.027
## mmu04621 NOD-like receptor signaling pathway        0.9      151 0.042
## mmu04625 C-type lectin receptor signaling pathway   0.9       99 0.044
## mmu04115 p53 signaling pathway                      0.9       71 0.048
## 
## $stats
##                                           stat.mean exp1
## mmu03010 Ribosome                               2.6  2.6
## mmu04022 cGMP-PKG signaling pathway             1.9  1.9
## mmu04360 Axon guidance                          1.8  1.8
## mmu04330 Notch signaling pathway                1.6  1.6
## mmu04658 Th1 and Th2 cell differentiation       1.6  1.6
## mmu04350 TGF-beta signaling pathway             1.6  1.6

Next, we’ll combine a bunch of previous steps into one large chunk as you should be very familiar with this process by now. Below, we’ll separate our keggres object into upregulated (greaters) and downregulated (lessers) objects and create new objects for their particular pathways:

greaters <- keggres$greater
lessers <- keggres$less

keggresgreaterpathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tibble::as_tibble() %>%
  filter(row_number()<=3) %>%
  .$id %>%
  as.character()

keggresidsg <- substr(keggresgreaterpathways, start=1, stop=8)

keggreslesserpathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
  tibble::as_tibble() %>%
  filter(row_number()<=3) %>%
  .$id %>%
  as.character()

keggresidsl <- substr(keggreslesserpathways, start=1, stop=8)

Plotting DESeq Pathview

Finally, we can plot our data with pathview as we have in the previous two analyses:

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse', new.signature = FALSE)

tmp = sapply(keggresidsg, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse'))

tmp = sapply(keggresidsl, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse'))

And here, we can use this chunk to insert our pathview images:

file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"),
            str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern ="pathview.png"), pattern='pathview.png', 'pathview.deseq.png'))

deseq_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.deseq.png")

alldeseq_images <- lapply(deseq_filenames, load.image)

dev.off()
knitr::include_graphics(deseq_filenames)

EdgeR vs. DESeq

EdgeR and DESeq are two very different pipelines, so it is worthwhile to compare both methodologies on one dataset for a broader perspective on the biological interpretation. As always, we’ll begin by loading the libraries needed for this exercise and importing our relevant datasets:

library(tidyverse)
library(GOplot)
EdgeR <- read.csv('Mouse_EdgeR_Results_Zero_vs_1.csv')
DESeq <- read.csv('Mouse_DESeq.csv')

Trimming the Fat and Formatting

We never originally trimmed the Mouse_DESeq.csv file down to just the significant genes, so we’ll have to do that before we can compare the two datasets. This can be done by creating a new object that is DESeq filtered to an adjusted p-value (padj) of < 0.1. This is done below:

DESeq2 <- DESeq %>%
  filter(padj < 0.1)

Next, we want to trim out only the gene names and log fold change data. To do so, we’ll create new objects that are only the relevant columns:

DESeqTrim <- DESeq2[,c(1,3)]
EdgeRTrim <- EdgeR[, 1:2]

Though we now have the appropriate data in our objects, the column names need to be changed for GOVenn (part of GOplot) to use them. The specific names must be ID and logFC, which we address below:

colnames(DESeqTrim) <- c("ID", "logFC")
colnames(EdgeRTrim) <- c("ID", "logFC")

Plotting EdgeR vs. DESeq using GOVenn

With our data formatted and trimmed, we can now use the GOVenn function to compare which genes were considered differentially expressed by each package. Note that in the below code chunk we set plot to FALSE. GOVenn has this set to TRUE by default, which draws the plot directly without returning the underlying table. By setting it to FALSE, the function returns both the plot and the table.

comp <- GOVenn(DESeqTrim, EdgeRTrim, label = c("DESeq", "EdgeR"), title = "Comparison of DESeq and EdgeR for Differentially Expressed Genes", plot = FALSE)

comp$plot

As you can see in the diagram, 122 upregulated and 61 downregulated genes are shared between both methods. In addition, EdgeR is much more conservative about what it considers significantly differentially expressed, as it has only 37 uniquely upregulated and 3 uniquely downregulated genes.
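The returned list also carries the overlap table alongside the plot; its exact layout can vary between GOplot versions, so a cautious sketch for inspecting it before writing it out:

str(comp$table, max.level = 1) # see what the table component contains before saving it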

In a biological context, the shared genes are the most likely candidates for being biologically significant with respect to differential expression in spaceflight, but that doesn’t completely invalidate the genes unique to either DESeq or EdgeR. These genes are still flagged by one of the methods, so it may be worth reanalyzing them with other approaches, depending on their genetic context.

Multiple Protein Alignment

In this section, I’ll show how to properly perform a multiple protein alignment using the msa (multiple sequence alignment) package. We’ll be using the hglobin.fa data for this exercise.

library(msa)
seq <- readAAStringSet("hglobin.fa")

Aligning the Amino Acid Sequences

Next, we’ll call msa to use the ClustalW algorithm to generate alignments for our amino acid sequences, then print the sequences with their alignments as a pdf via msaPrettyPrint(). Note the following arguments and their purpose:

  • showNames: shows the sequence names on the specified side of the alignment
  • showLogo: shows a sequence (consensus) logo at the specified position, or ‘none’ to omit it
  • askForOverwrite: sets whether R asks permission before overwriting an existing file; FALSE always overwrites without asking
alignments <- msa(seq, method='ClustalW')

msaPrettyPrint(alignments, output='pdf', showNames='left',
               showLogo='none', askForOverwrite=FALSE,
               verbose=TRUE, file='whole_align.pdf')

Adjusting our Alignment Diagram

We can also adjust the diagram to include only a specific range of positions. To do this, we pass a vector with the first and last positions we want included as the second argument. Below, we’ll generate another diagram covering only amino acids 10 through 30:

msaPrettyPrint(alignments, c(10,30), output='pdf', showNames='left',
               file='Zoomed_align.pdf', showLogo='top', askForOverwrite=FALSE,
               verbose=TRUE)

Making a Tree from our Alignments

If we want to make a tree using our alignments, we’ll have to add on a few more libraries, which we’ll load below:

library(ape)
library(seqinr)

Next, we need to convert our alignments to a seqinr data type via msaConvert(). This will allow seqinr to properly interface with our data.

alignment_seqinr <- msaConvert(alignments, type='seqinr::alignment')

With the data converted, we need to calculate the distances on our tree using the alignment_seqinr data, then use ape to perform a neighbor-joining tree-estimation process. With our distances and neighbor-joining set up, we will finally plot the tree. This whole process is outlined below:

distances1 <- seqinr::dist.alignment(alignment_seqinr, 'identity')

tree <- ape::nj(distances1)
plot(tree, main='Phylogenetic Tree of HBA Sequences')

Notably, our results imply some peculiar conclusions because the dataset is quite limited. Since we are looking at a single hemoglobin gene across these animals, humans and gorillas appear more closely related to porpoises than to capuchins. In a broad sense porpoises aren’t closer to humans and gorillas than capuchins are, but on strictly this single gene they actually are.

Synteny

Synteny is the conservation of genome structure through evolution: gene order and blocks of genes inherited from a common ancestor tend to stay together. In this module, we’re going to look at genome synteny and run a script to compare regions of synteny between genomes.

As always, we’ll begin by loading the necessary library and our data, plastid_genomes.fa:

library(DECIPHER) 
long_seq <- readDNAStringSet('plastid_genomes.fa')

Note that long_seq is a DNAStringSet object.
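A DNAStringSet can be inspected with standard Biostrings accessors; a quick sketch:

length(long_seq) # how many genome sequences were loaded
names(long_seq)  # the name of each genome
width(long_seq)  # the length of each sequence in bases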

Next, we’re going to build a temporary SQLite database to use the data with this package. By doing so, we’ll be able to accomplish a number of analyses using this data.

Seqs2DB(long_seq, 'XStringSet', 'long_db', names(long_seq))
## Adding 5 sequences to the database.
## 
## Added 5 new sequences to table Seqs.
## 45 total sequences in table Seqs.
## Time difference of 0.21 secs

With this database established, we can…

  • Find the syntenic blocks:
synteny <- FindSynteny('long_db')
  • View blocks with plotting:
pairs(synteny)

plot(synteny)

  • Make an alignment file:
alignment <- AlignSynteny(synteny, 'long_db')
  • Create a structure holding all aligned syntenic blocks for a pair of sequences:
block <- unlist(alignment[[1]])
  • And write to file one alignment at a time:
writeXStringSet(block, 'genome_blocks_out.fa')

In the output file, dashes show sections with no alignment, while characters show conserved bases.

Unannotated Gene Regions

Oftentimes in newer or less commonly used genomes, annotations may not be available. In this section, we’ll look at how to map transcripts to unannotated regions. We’ll start by loading our libraries:

library(locfit)
library(Rsamtools)

Identifying Unannotated Regions

Next, we’ll need to create a function that will load the gene region’s information in a GFF file, then convert it into a bioconductor GRanges object. This is outlined below:

get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff,'GRanges')
}

With our function set, we can move on to getting counts in 500 bp windows across the genome. To do so, we’ll use csaw::windowCounts(), which counts reads into the desired windows.

csaw::readParam() will then be used to set which read qualities are required for counting. Of the arguments included…

  • minq = 20 sets the minimum mapping quality for a passing read
  • dedup = TRUE removes any PCR duplicates
  • pe = ‘both’ requires that both reads of a pair are aligned
whole_genome <- csaw::windowCounts(
  file.path('windows.bam'),
  bin=TRUE,
  filter=0,
  width=500,
  param=csaw::readParam(
    minq=20, 
    dedup=TRUE, 
    pe='both' 
    )
)

Since the above data is a single column, we’ll rename it:

colnames(whole_genome) <- c('small_data')

Before going any further, let’s also establish a separate object for our annotated regions using the function we defined earlier:

annotated_regions <- get_annotated_regions_from_gff(file.path('genes.gff'))

Now that we have our windows of high expression, we need to check if any of those windows overlap with the annotated regions we defined. To accomplish this, we’ll import another two libraries we’ll need:

library(IRanges)
library(SummarizedExperiment)

Now, to actually find those overlaps, we’ll run IRanges::overlapsAny() on SummarizedExperiment::rowRanges() of our whole_genome data. This returns a logical vector storing TRUE/FALSE for whether each window overlaps an annotated region.

windows_in_genes <- IRanges::overlapsAny(
  SummarizedExperiment::rowRanges(whole_genome), 
  annotated_regions
)

We can then subset the whole_genome object into annotated and unannotated regions using our windows_in_genes object:

annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]

assay() can then be used to extract the actual counts from the non_annotated_window_counts object:

assay(non_annotated_window_counts)
##      small_data
## [1,]          0
## [2,]          0
## [3,]         24
## [4,]         25
## [5,]          0
## [6,]          0

Mapping the Unannotated Regions

Now, we can use Rsamtools’ pileup() function to get a per-base coverage dataframe, in which each row represents a single nucleotide position in the reference and the count column gives the depth of coverage at that position. We’ll also load a new library called bumphunter for a later step in this process:

library(bumphunter)
pile_df <- Rsamtools::pileup(file.path('windows.bam'))

In this step, we’ll group the reads within a certain distance of each other into a cluster. We’ll give the sequence names, position, and distance of the reads:

clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap=100)
table(clusters)
## clusters
##    1    2    3 
## 1486 1552 1520

Finally, we’ll map the reads to the regions we found for the genome:

bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff=1)
## getSegments: segmenting
## getSegments: splitting
##    chr start  end value  area cluster indexStart indexEnd    L clusterL
## 3 Chr1  4503 5500  10.4 15811       3       3039     4558 1520     1520
## 1 Chr1   502 1500  10.0 14839       1          1     1486 1486     1486
## 2 Chr1  2501 3500   8.7 13436       2       1487     3038 1552     1552

As you can see in the table, we’re given the chromosomal location, start and end of the region, and other useful data. With that, we have fully completed our mapping of the unannotated region of this data.
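
If you want to keep these discovered regions for later work, one option (a sketch, assuming you first save the regionFinder() output to an object) is to convert them into a GRanges object and export them as a GFF file with rtracklayer:

# Save the regions, convert them to GRanges, and export as GFF3
regions <- bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff=1)
found_ranges <- GenomicRanges::GRanges(seqnames = regions$chr,
                                       ranges = IRanges::IRanges(start = regions$start, end = regions$end))
rtracklayer::export.gff3(found_ranges, 'unannotated_regions.gff')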

Phylogenetic Trees with Treeio

In this section, we’ll take a look at creating nice-looking phylogenetic trees with treeio and ggtree. First, we’ll load our data and required packages:

library(ggplot2)
library(ggtree)
library(treeio)
library(BAMMtools)

itol <- ape::read.tree('itol.nwk')

Note that our data is in Newick format, so we had to use ape::read.tree.

With our raw data, we can print a very basic phylogenetic tree with it:

ggtree(itol)

While this tree certainly exists, options and parameters can be specified to make it look much better.

ggtree Parameters and Options

First, let’s look at setting the tree up as a circular diagram:

ggtree(itol, layout='circular')

We can also change the orientation of the tree. Below, we’ll use coord_flip() and scale_x_reverse() to turn the original tree on its side and reverse its direction:

ggtree(itol) + coord_flip() + scale_x_reverse()

While changing the appearance and organization of the tree is handy, tip labels would help give our tree context. We can add them by appending geom_tiplab() to ggtree(), which is shown in the below code chunk:

ggtree(itol) + geom_tiplab(color = 'aquamarine4', size = 2)

As you can see, the labels are overlapping and hard to read. To help remedy this, in the next step we will reduce the size of our labels and make the diagram circular:

ggtree(itol, layout='circular') + geom_tiplab(color = 'blueviolet', size = 1)

Having the names looks nice, but we can also annotate specific clades or families in our diagram using geom_strip(). The taxa numbers were chosen arbitrarily for the below example:

ggtree(itol) + geom_strip(13, 14, color = 'chartreuse', barsize = 1)

Similarly, we can use geom_hilight() to highlight specific groups in unrooted trees. For this, we supply a node instead of specific taxa; the node was chosen arbitrarily for the below example:

ggtree(itol, layout='unrooted') + geom_hilight(node = 11, type = 'encircle', fill = 'steelblue')

Note the type argument, which determines the shape that the highlight will take.
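
For instance (a quick sketch reusing the same arbitrary node on the default layout), swapping type to ‘rect’ draws a rectangular highlight instead:

# A rectangular highlight over the same arbitrarily chosen node
ggtree(itol) + geom_hilight(node = 11, type = 'rect', fill = 'steelblue')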

Determining Usable Nodes

Now, with the knowledge of how to annotate and highlight using nodes and taxa, it’s important to consider how to find specific nodes or taxa without needing to scour the dataset itself. One way to do this is the getMRCA() function, which returns the node number of the most recent common ancestor of a set of species names. An example is given below, in which we use getMRCA() to find a node, then highlight a section of our tree using it:

nodeMRCA <- getMRCA(itol, tip = c('Photorhabdus_luminescens', 'Blochmannia_floridanus'))
ggtree(itol, layout='unrooted') + geom_hilight(node = nodeMRCA, type = 'encircle', fill = 'indianred')

Treespace

In this section, we’re going to quantify differences between separate phylogenetic trees using treespace. Analyses like this give scientists the opportunity to understand how different genes progress through an evolutionary line.

Let’s begin with our packages and loading our data. Note that we’ll first list our tree files, then read each of them into the tree_list object, then convert that list into a multiPhylo object. multiPhylo objects are required for treespace to work with our data.

library(ape)
library(adegraphics)
library(treespace)

treefiles <- list.files(file.path(getwd(),'gene_trees'), full.names=TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list) <- 'multiPhylo'

Computing Tree Distances

Next, we’ll compute the Kendall-Colijn distances between the trees. To do so, we’ll use the treespace function with our tree_list data and the number of principal components retained set to 3. The Kendall-Colijn metric used here is particularly suitable for rooted trees, which is what we have here.

When running this function, treespace will run a pairwise comparison of all trees in the input, carry out a principal coordinates analysis, then return the results as a list of objects where $D contains the pairwise distances between the trees and $pco contains the principal coordinates analysis.

comparisons <- treespace(tree_list, nf=3)

With this stage of the analysis complete, we can now plot the pairwise distances between trees with the table.image() function. Note nclass in the below code chunk where we plot the data; this defines how many shades are used in the heatmap of our pairwise comparisons. A greater number for nclass will improve resolution, but the shades can become much harder to distinguish by eye.

table.image(comparisons$D, nclass = 25)
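
To see the trade-off in action (a small sketch), we can redraw the same heatmap with far fewer shades:

# A coarser heatmap of the same distances, using only 5 shades
table.image(comparisons$D, nclass = 5)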

Plotting Tree Clusters

Finally, we’ll plot the principal components and clusters using plotGroves(). This will show us how the group of trees cluster. First, we’ll use plotGroves() on our comparisons object’s $pco element to plot the points alone by their distance from one another.

plotGroves(comparisons$pco, lab.show=TRUE, lab.cex=1.5)

Note the lab.show and lab.cex options. Setting lab.show to TRUE prints the labels for each point, and lab.cex adjusts the size of those labels.

Now that we have seen the general distribution of points, let’s create “groves”, or clusters of our trees given a certain distance. To do so, we’ll create a new object using findGroves() on our comparisons object with a nclust value of 5:

groves <- findGroves(comparisons, nclust=5)
plotGroves(groves)

nclust values define how many clusters the function should try to fit our data into. With some datasets, it’s worth playing around with the number of clusters to minimize odd shapes, unintended relationships, and grove overlap. Below is an example where I’ve reduced the nclust value to 2:

groves <- findGroves(comparisons, nclust=2)
plotGroves(groves)

As you can see, with fewer groves total, the function incorporates a number of previously separate trees into larger groves. Below is what I feel to be the ‘sweet spot’ for this data set, with nclust set to 3:

groves <- findGroves(comparisons, nclust=3)
plotGroves(groves)

Subtrees with Ape

In some settings, you may not need to use an entire phylogenetic tree like we did in our Treeio section. By creating subtrees from larger trees, you can minimize the quantity of data you are using, making the processing of your file faster and your diagrams significantly more readable. All of this can be done using one package called ape, so we’ll load that along with our Newick tree data:

library(ape)

newick <- read.tree('mammal_tree.nwk')
l <- subtrees(newick)

Note the usage of the subtrees() function, which does exactly what its name suggests: it separates our data into all of its subtrees, returned as a list. Let’s plot the raw data to see how it looks:

plot(newick)

To plot one of the subtrees (stored in the list l), we can provide the plot() function additional information. We’ll start by specifying that we want to plot the fourth subtree, l[[4]], with the subtitle “Node 4”:

plot(l[[4]], sub='Node 4')

Recall that for general usage, getMRCA can be very useful to locate specific nodes that you might want to focus in on.

We can also extract subtrees manually with the extract.clade() function, specifying the dataset to use and the node we want to ‘cut’ the tree at.

small_tree <- extract.clade(newick, 9)
plot(small_tree)
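
extract.clade() also pairs nicely with getMRCA() from earlier. Below is a brief sketch; the tip names are hypothetical, so substitute two labels that actually appear in mammal_tree.nwk:

# Find the MRCA of two tips (hypothetical names), then plot just that clade
mrca_node <- getMRCA(newick, tip = c('tip_A', 'tip_B'))
plot(extract.clade(newick, mrca_node))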

We can even bind two separate trees together. For the sake of an example, we’re going to extract another arbitrarily selected section of our first tree, then bind it with the tree we extracted above using bind.tree(). Note the final argument of bind.tree(), which specifies the node at which the two trees are joined:

small_tree_2 <- extract.clade(newick, 10)
new_tree <- bind.tree(small_tree_2, small_tree, 3)
plot(new_tree)

Of course, these example trees are a scientific mess, but they are hopefully a good proof of concept for how to subset and bind trees to better manipulate tree data.

Creating Trees from Sequencing Data

In some cases, you may be provided only with sequences and no pre-made trees. If you need to construct your own, this section will cover that. Let’s begin as usual by loading our necessary packages and sequence data:

library(Biostrings)
library(msa)
library(phangorn)

seqs <- readAAStringSet('abc.fa')

Constructing the Alignments

Next, we’ll construct the alignment data using msa and ClustalOmega, which we have seen and used before in our Multiple Protein Alignment section. In addition, we’ll need to convert our alignment data into a phyDat object. The whole process is outlined in the below chunk:

aln <- msa::msa(seqs, method=c('ClustalOmega'))
aln <- as.phyDat(aln, type='AA')

Note that type is set to ‘AA’. This stands for amino acid, and it tells as.phyDat the format of the data so the conversion works properly.

Making the Trees

With all of our data now prepped, we’ll get started on the tree-making process. Trees are made from a distance matrix, which we’ll compute for our data with dist.ml(). Note that the ml in the function name stands for maximum likelihood.

dist_mat <- dist.ml(aln)

Next, we pass the distance matrix to the upgma() function to make a UPGMA (unweighted pair group method with arithmetic mean) tree:

upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main='UPGMA')

Alternatively, we can pass the distance matrix to the neighbor-joining function NJ():

nj_tree <- NJ(dist_mat)
plot(nj_tree, main='NJ')

As you can see, the two methods produce fairly different trees. Depending on the type of data you’re using and the goals of the tree you’re producing, one of these methods will fit better. Typically, longer sequences and more data reduce the differences between the methods, since the underlying data carries a stronger signal.
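
To compare the two topologies directly, one option (a small sketch using base graphics) is to draw the trees side by side:

# Draw the UPGMA and NJ trees next to each other for comparison
par(mfrow = c(1, 2))
plot(upgma_tree, main = 'UPGMA')
plot(nj_tree, main = 'NJ')
par(mfrow = c(1, 1))  # reset the plotting layout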

Bootstraps

Finally, we will use bootstrap.phyDat() to compute bootstrap support for the branches of the tree. Bootstrap values indicate how consistently a given clade appears when the data is resampled, functioning much like a confidence measure for each placement on the tree.

To create and plot these, we’ll call bootstrap.phyDat with our aln data as the first argument, a custom function that builds a neighbor-joining tree from dist.ml distances as the second, and a bootstrap number of 100. This can be seen in the below code chunk:

bootstraps <- bootstrap.phyDat(aln, FUN = function(x){NJ(dist.ml(x))}, bs=100)
plotBS(nj_tree, bootstraps, p = 10)

The numbers you can see on this tree are the confidence that a given join is located in the specified place on your tree. Since we used 100 replicates, the values read much like percentages: 100 means the join appeared in every replicate, 83 means it appeared in 83% of them, and so on.
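
If the tree is cluttered, the display threshold can be raised; below is a sketch where only joins with at least 80% support are labeled:

# Show only bootstrap values of 80 or higher
plotBS(nj_tree, bootstraps, p = 80)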

Additional Graduate Analysis with DESeq

As this tutorial was created as part of a course at Louisiana Tech University and I am a graduate student, I am tasked with performing an additional RNASeq analysis. For this analysis, I will be using DESeq to evaluate differential expression in data from the NASA GeneLab, specifically GLDS-38. In this experiment, plants from five plates were pooled and used as genetic material samples, with spaceflight plants preserved using RNAlater.

To begin this analysis, we’ll begin as always with our relevant packages to ensure all we’ll need is ready to go:

library("DESeq2")
library("dplyr")
library('apeglm')

Data Setup and Summary

Loading the Data

Much like our previous DESeq exercise, we’ll load the data from two files: the counts and the flight/ground grouping.

countData <- read.csv("GLDS-38_rna_seq_Unnormalized_Counts_short.csv")

colData <- read.csv("Flight_DATA.csv", sep= ",", row.names=1)

Formatting the Data

Just as in our DESeq example before, we need to establish levels for our Flight_DATA.csv file to define flight and ground groups. We will be using 0 and 1 as the levels for this factor:

colData$group <- factor(colData$group, levels = c("0","1"))
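
A quick one-line check (a sketch) confirms how many samples fall into each group:

# Count the samples in each flight/ground group
table(colData$group)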

As we know, DESeq only allows integers, not fractional numbers, so we need to round our data and store the result:

countData <- countData %>%
  mutate_if(is.numeric, ceiling)

countData[, 2:7] <- sapply(countData[, 2:7], as.integer)

Next, we’ll use the below code to set the row names to be equal to our first column, then remove the redundant first column:

row.names(countData) <- countData[, 1]

countData <- countData[2:7]

Finally, let’s check that the sample names in our grouping file match the sample columns in our count data:

rownames(colData) == colnames(countData)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
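
For larger datasets, wrapping this comparison in all() condenses it to a single TRUE/FALSE (a one-line sketch):

# A single TRUE means every sample name matches
all(rownames(colData) == colnames(countData))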

Analyzing with DESeq

With that, we can perform the analysis proper. To start, we will create a DESeq data set using countData, colData, and group as our design. Then, in the second line, we will filter out weakly expressed genes, keeping only those with counts above 2 in at least 4 samples. Finally, the DESeq function is called to perform the differential gene expression analysis, and the results function is called to extract the results. This is all accomplished in the below code chunk:

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)

dds <- dds[rowSums(counts(dds)>2)>=4]

dds <- DESeq(dds)

res <- results(dds)

Next, we’ll use our differential gene expression data and lfcShrink to adjust our log fold change values, then reorder our results by adjusted p-value:

resLFC <- lfcShrink(dds, coef=2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resOrdered <- res[order(res$padj), ]

summary(res)
## 
## out of 25687 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2293, 8.9%
## LFC < 0 (down)     : 2754, 11%
## outliers [1]       : 19, 0.074%
## low counts [2]     : 3486, 14%
## (mean count < 15)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
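
To put a single number on this (a sketch using a 0.05 cutoff rather than summary()’s default of 0.1), we can count the genes with a significant adjusted p-value:

# Count genes with an adjusted p-value below 0.05
sum(res$padj < 0.05, na.rm = TRUE)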

Presenting Our Data

Next, we’ll pull out the significant fold change data and store it as its own object:

FLT_Vs_GC <- as.data.frame(res$log2FoldChange)

head(FLT_Vs_GC)
##   res$log2FoldChange
## 1             0.5557
## 2            -0.3808
## 3            -0.3554
## 4             0.1410
## 5             0.0056
## 6            -0.4582

Now, we’ll create an MA plot to check our distribution. As we can see, the plot shows a relatively uniform distribution with fairly high fold changes observed. Note that to save the plot to a PDF, the pdf() device has to be opened before plotting and closed with dev.off() afterwards:

plotMA(resLFC, ylim = c(-4, 4))

pdf(file = "MA_Plot_FLT_vs_GC_Grad.pdf")
plotMA(resLFC, ylim = c(-4, 4))
dev.off()

Let’s also save our ordered results to their own CSV file:

write.csv(as.data.frame(resOrdered), file = "ArabidThali_DESeq.csv")

Performing the Principal Component Analysis

Let’s also perform a Principal Component Analysis (PCA) for this data. Let’s begin by using estimateSizeFactors(), then SummarizedExperiment() to build a log-transformed, normalized copy of our data:

dds <- estimateSizeFactors(dds)

se <- SummarizedExperiment(log2(counts(dds, normalized = TRUE) + 1), colData = colData(dds))

With all of that set up, we can plot the data as a PCA:

plotPCA(DESeqTransform(se), intgroup='group')

We would want to see flight (0) and ground (1) each grouped together, showing how tightly the differential expression characterizes the biological replicates. Due to the low number of replicates, it’s difficult to make a call on how grouped our data is here, but it at least appears that points from different groups do not fall near one another, as they did in the mouse analysis. The general trend of blue points towards the upper area of the graph and red points towards the lower area indicates this data may be decently characterized.

Setting things up for Pathview

In setting up for Pathview, we’ll need to load the AnnotationDbi package and our Arabidopsis thaliana annotation database:

library(AnnotationDbi)
library(org.At.tair.db)

Next, we’ll perform the same series of steps we have previously with the other RNASeq and Microarray analyses, except using TAIR IDs as keys instead of ENTREZ IDs. We are going to create a dataframe using our fold change data from the res object, then associate those fold changes with each gene’s symbol, ENTREZ ID, and gene name. This is carried out below:

columns(org.At.tair.db)

foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))

res$symbol = mapIds(org.At.tair.db,
                    keys = row.names(res),
                    column = "SYMBOL",
                    keytype = "TAIR",
                    multiVals = 'first')

res$entrez = mapIds(org.At.tair.db,
                    keys = row.names(res),
                    column = "ENTREZID",
                    keytype = "TAIR",
                    multiVals = 'first')

res$name = mapIds(org.At.tair.db,
                    keys = row.names(res),
                    column = "GENENAME",
                    keytype = "TAIR",
                    multiVals = 'first')

Pathview Plotting

Setting it Up

Let’s start by loading our packages, creating a separate variable for our fold changes, and associating each log2FoldChange datapoint with the appropriate ENTREZ ID:

library(pathview)
library(gage)
library(gageData)

foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez

Next, we’ll take a number of steps to match our foldchanges with their appropriate pathways via the KEGG database. In the first step, we set up our sigmet (signaling and metabolic) gene sets using kegg.gsets for Arabidopsis thaliana’s ENTREZ IDs. Next, we use gage to map the KEGG pathways to our foldchanges data. Lastly, we separate the upregulated ($greater) and downregulated ($less) pathways into their own separate objects and write CSV files for them for other usage.

kegg.ath = kegg.gsets('ath', id.type = 'entrez')
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]

keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                                        p.geomean stat.mean
## ath00940 Phenylpropanoid biosynthesis                    5.7e-06       4.5
## ath02010 ABC transporters                                9.2e-03       2.4
## ath00040 Pentose and glucuronate interconversions        1.6e-02       2.2
## ath00966 Glucosinolate biosynthesis                      3.6e-02       1.8
## ath00908 Zeatin biosynthesis                             4.7e-02       1.7
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis   1.1e-01       1.2
##                                                          p.val   q.val set.size
## ath00940 Phenylpropanoid biosynthesis                  5.7e-06 0.00067      126
## ath02010 ABC transporters                              9.2e-03 0.53682       32
## ath00040 Pentose and glucuronate interconversions      1.6e-02 0.61979      101
## ath00966 Glucosinolate biosynthesis                    3.6e-02 1.00000       26
## ath00908 Zeatin biosynthesis                           4.7e-02 1.00000       32
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis 1.1e-01 1.00000       23
##                                                           exp1
## ath00940 Phenylpropanoid biosynthesis                  5.7e-06
## ath02010 ABC transporters                              9.2e-03
## ath00040 Pentose and glucuronate interconversions      1.6e-02
## ath00966 Glucosinolate biosynthesis                    3.6e-02
## ath00908 Zeatin biosynthesis                           4.7e-02
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis 1.1e-01
## 
## $less
##                                               p.geomean stat.mean   p.val
## ath03010 Ribosome                               6.0e-25     -10.9 6.0e-25
## ath00195 Photosynthesis                         3.0e-10      -7.1 3.0e-10
## ath01240 Biosynthesis of cofactors              4.4e-09      -5.9 4.4e-09
## ath01230 Biosynthesis of amino acids            6.0e-07      -4.9 6.0e-07
## ath03030 DNA replication                        6.1e-07      -5.3 6.1e-07
## ath00860 Porphyrin and chlorophyll metabolism   1.1e-05      -4.5 1.1e-05
##                                                 q.val set.size    exp1
## ath03010 Ribosome                             7.0e-23      308 6.0e-25
## ath00195 Photosynthesis                       1.7e-08       46 3.0e-10
## ath01240 Biosynthesis of cofactors            1.7e-07      234 4.4e-09
## ath01230 Biosynthesis of amino acids          1.4e-05      240 6.0e-07
## ath03030 DNA replication                      1.4e-05       46 6.1e-07
## ath00860 Porphyrin and chlorophyll metabolism 2.2e-04       51 1.1e-05
## 
## $stats
##                                                        stat.mean exp1
## ath00940 Phenylpropanoid biosynthesis                        4.5  4.5
## ath02010 ABC transporters                                    2.4  2.4
## ath00040 Pentose and glucuronate interconversions            2.2  2.2
## ath00966 Glucosinolate biosynthesis                          1.8  1.8
## ath00908 Zeatin biosynthesis                                 1.7  1.7
## ath00909 Sesquiterpenoid and triterpenoid biosynthesis       1.2  1.2
greaters <- keggres$greater
lessers <- keggres$less

write.csv(greaters, file = 'ArabidThali_pathview_Greater_Pathways.csv')
write.csv(lessers, file = 'ArabidThali_pathview_Lesser_Pathways.csv')

Lastly, we’ll combine a number of previous steps into one large chunk as before. Below, we’ll create new objects holding the top pathways of keggres for both the upregulated and downregulated parts of the dataset, then substring them down to just the pathway IDs.

keggresgreaterpathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  as_tibble() %>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()

keggresidsg <- substr(keggresgreaterpathways, start=1, stop=8)

keggreslesserpathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
  as_tibble() %>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()

keggresidsl <- substr(keggreslesserpathways, start=1, stop=8)

Plotting DESeq Pathview

Finally, we can plot our data with pathview as we have in the previous two analyses. We define a small helper function, then apply it to both the upregulated and downregulated pathway IDs:

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = 'ath', new.signature = FALSE)

tmp = sapply(keggresidsg, plot_pathway)

tmp = sapply(keggresidsl, plot_pathway)

And here, we can use this chunk to rename the generated pathview images and insert them into the report. Note that str_replace() comes from stringr and load.image() from imager, so we load both first:

library(stringr)
library(imager)

file.rename(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.png"),
            str_replace(list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.png"), pattern = 'pathview.png', 'pathview.grad.png'))

grad_filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = "pathview.grad.png")

allgrad_images <- lapply(grad_filenames, load.image)

knitr::include_graphics(grad_filenames)

Conclusion

R is an incredibly useful, diverse, and approachable programming language that can perform the much-needed statistical analyses required by many biological studies. With genetics and personalized medicine on the rise and data generation accelerating rapidly, the need for these analyses is greater than ever. With R and the right people, this data can be wrangled, contained, analyzed, and turned into usable products for research, informatics, and more.