R Markdown

This is a tutorial on how to use R markdown for reproducible research.

Here we can type long passages or descriptions of our data without the need of “hashing” out our comments with the # symbol. In our first example, we will be using the ToothGrowth dataset. In this experiment, Guinea Pigs (literal) were given different ammounts of vitamin C to see the effects on the animal’s tooth growth.

To run R code in markdown file, we need to denote the section that is considered R code. We call these “code chunks”.

Below is a code chunk:

Toothdata <- ToothGrowth

head(Toothdata)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

As you can see from running the “play” button on the code chunk, the results are printed inline of the r markdown file.

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(lm(len ~ dose, data = Toothdata))
Figure 1: The tooth growth of Guinea Pigs when given variable amounts of vitamin C

Figure 1: The tooth growth of Guinea Pigs when given variable amounts of vitamin C

The slope of the regression line is 9.7635714.

Section Headers

We can also put sections and subsections in our R Markdown file, similar to numbers of bullet points in a word document. This is done with the “#” that we previously used to denote text in an R script.

First level header

Second level header

Third level header

Make sure that you put a space after the hashtag otherwise it will not work.

We can also add bullet point-type marks in our r markdown file.

  • one item
  • one item
  • one item
    • one more item
    • one more item
    • one more item
      • one last item

It’s important to note here that in R Markdown indentation matters!

  1. First item
  2. Second item
  3. Third item
  1. subitem one
  2. subitem two
  3. subitem three

Block Quotes

We can put really nice quotes into the markdown document. We do this by using the “>” symbol

“Genes are like the story, and DNA is the language that the story is written in.”

— Sam Kean

Formulas

We can also put nice formatted formulas into markdown using two dollar signs.

Hard-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\]

And you can get really complex as well!

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

Code chuck options

There are also options for your R Markdown file on how knitr interprets the code chunk. There are the following options:

Eval (T ot F): Whether or not to evaluate the code chuck

Echo (T or F): Whether or not to show the code for the chuck, but results will still print

Cache: If you enable, the same code chuck will not be evaluated the next time that knitr is run. This is great for code that has LONG run times.

fignwidth or fig.height: the (graphical device) size of the R plot in inches. The figures are first written to the knitr document then to files that are saved separately.

out.width or out.height: the output size of the R plots IN THE R DOCUMENT

fig.cap: the words for the figure caption

Design Elements

Table of contents

We can also add a table of contents to our HTML Documents. We do this by altering the YAML code (The weird code chunk at the top of the document.) We can add this:

title: “HTML_Tutorial” author: “Ilea Watson” date: “2024-07-15” output: html_document: toc: true toc_float: true

This will give us a very nice floating table of contents on the right hand side of the document.

Tabs

You can also add TABS in a report. To do this you need to specify each section that you want to become a tab by placing “{.tabset}” after the line. Every subsequent header will be in a new tab.

Themes

You can also add themes to you HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this, change your theme in the YAML to one of the following:

  • default
  • bootstrap
  • cerulean
  • cosmo
  • darkly
  • flatly
  • journal
  • lumen
  • paper
  • readable
  • sandstone
  • simplex
  • spacelab
  • united
  • yeti

You can also change the color by specifying highlight:

  • default
  • tango
  • payments
  • kate
  • monochrome
  • espresso
  • zenburn
  • haddock
  • textmate

Code Folding

You can also use the code folding option to allow the reader to toggle between displaying the code and hiding the code.This is done with:

code_folding: hide

Data wranling with R

Introduction

Introduction

First, we load the library and look at the top of the data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
my_data <- nycflights13::flights

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Then, we will just look at the data on October 14th.

filter(my_data, month == 10, day ==14)
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we want to subset this into a new variable, we do the following:

oct_14_flight_2 <- filter(my_data, month == 10, day ==14)

What if you want to do both print and save the variable?

(oct_14_flight_2 <- filter(my_data, month == 10, day ==14))
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If you want to filter based on different operators, you can use the following:

  • Equals ==
  • Not equal to !=
  • greater than >
  • less than <
  • greater than or equal to >=
  • less than or equal to <=
(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we dont use the == to mean equals, we get this

(oct_14_flight_2 <- filter(my_data, month = 10, day = 14))

You can also use these logical operators to be more selective:

  • and &
  • or |
  • not !

Let’s use the “or” function to pick flights in March and April.

March_April_flights <- filter(my_data, month == 3 | month == 4)

March_4th_Flights <- filter(my_data, month ==3 & day == 4)

Non_jan_flights <- filter(my_data, month != 1)

print(March_April_flights)
## # A tibble: 57,164 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     3     1        4           2159       125      318             56
##  2  2013     3     1       50           2358        52      526            438
##  3  2013     3     1      117           2245       152      223           2354
##  4  2013     3     1      454            500        -6      633            648
##  5  2013     3     1      505            515       -10      746            810
##  6  2013     3     1      521            530        -9      813            827
##  7  2013     3     1      537            540        -3      856            850
##  8  2013     3     1      541            545        -4     1014           1023
##  9  2013     3     1      549            600       -11      639            703
## 10  2013     3     1      550            600       -10      747            801
## # ℹ 57,154 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
print(March_4th_Flights)
## # A tibble: 977 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     3     4       11           2358        13      440            438
##  2  2013     3     4      455            500        -5      629            648
##  3  2013     3     4      509            515        -6      732            814
##  4  2013     3     4      533            530         3      815            827
##  5  2013     3     4      535            540        -5      814            850
##  6  2013     3     4      539            545        -6     1018           1023
##  7  2013     3     4      550            600       -10      931            925
##  8  2013     3     4      552            600        -8      704            715
##  9  2013     3     4      553            600        -7      847            910
## 10  2013     3     4      553            600        -7      831            856
## # ℹ 967 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
print(Non_jan_flights)
## # A tibble: 309,772 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10     1      447            500       -13      614            648
##  2  2013    10     1      522            517         5      735            757
##  3  2013    10     1      536            545        -9      809            855
##  4  2013    10     1      539            545        -6      801            827
##  5  2013    10     1      539            545        -6      917            933
##  6  2013    10     1      544            550        -6      912            932
##  7  2013    10     1      549            600       -11      653            716
##  8  2013    10     1      550            600       -10      648            700
##  9  2013    10     1      550            600       -10      649            659
## 10  2013    10     1      551            600        -9      727            730
## # ℹ 309,762 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Arrange

Arrange allows us to arrange the data set based on the variables we desire.

arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can also do this in descending fashion.

descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the dataframe regardless of ascending or descending.

Select

We can also select specific columns that we want to look at.

calendar <- select(my_data, year, month, day)
print(calendar)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

We can also look at a range of columns.

calendar_2 <- select(my_data, year:day)
print(calendar_2)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

Let’s look at all columns year through carrier.

calendar_3 <- select(my_data, year:carrier)
print(calendar_3)
## # A tibble: 336,776 × 10
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 2 more variables: arr_delay <dbl>, carrier <chr>

We can also choose which columns NOT to include.

everything_else <- select(my_data, -(year:day))
print(everything_else)
## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ℹ 336,766 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

In this instance we can also use the “not” operator “!”.

everything_else2 <- select(my_data, !(year:day))
print(everything_else2)
## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ℹ 336,766 more rows
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

There are also some other helper functions that can help you select the columns or data you’re looking for:

  • starts_with(“xyz”) – will select the values that start with xyz
  • ends_ with(“xyz”) — will select the values that end with xyz
  • contains(“xyz”) — will select the values that contain xyz
  • matches(“xyz”) —- will select the identical value xyz

Renaming

We can rename columns of our data using the rename() function.

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
rename(my_data, departure_time = dep_time)
## # A tibble: 336,776 × 19
##     year month   day departure_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>          <int>          <int>     <dbl>    <int>
##  1  2013     1     1            517            515         2      830
##  2  2013     1     1            533            529         4      850
##  3  2013     1     1            542            540         2      923
##  4  2013     1     1            544            545        -1     1004
##  5  2013     1     1            554            600        -6      812
##  6  2013     1     1            554            558        -4      740
##  7  2013     1     1            555            600        -5      913
##  8  2013     1     1            557            600        -3      709
##  9  2013     1     1            557            600        -3      838
## 10  2013     1     1            558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
my_data <- rename(my_data, departure_time = dep_time)

Mutate

What if you want to add new columns to your data frame? We have the mutate() function for that.

First, let’s make smaller data frame so we can see what we are doing.

my_data_small <- select(my_data, year:day, distance, air_time)

Let’s calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)
## # A tibble: 336,776 × 6
##     year month   day distance air_time speed
##    <int> <int> <int>    <dbl>    <dbl> <dbl>
##  1  2013     1     1     1400      227  370.
##  2  2013     1     1     1416      227  374.
##  3  2013     1     1     1089      160  408.
##  4  2013     1     1     1576      183  517.
##  5  2013     1     1      762      116  394.
##  6  2013     1     1      719      150  288.
##  7  2013     1     1     1065      158  404.
##  8  2013     1     1      229       53  259.
##  9  2013     1     1      944      140  405.
## 10  2013     1     1      733      138  319.
## # ℹ 336,766 more rows
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

What if we wanted to create a new dataframe with ONLY your calculations? We can use the transmute() function.

airspeed <- transmute(my_data_small, speed = distance / air_time * 60)

Summarize and by_group()

We can use summarize to run a function on a data column to get a single return.

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

So we can see here that the average delay is about 12 minutes.

We gain additional value in summarize by pairing it with by_group().

by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ℹ 355 more rows

As you can see, we now have the delay by the days of the year.

Missing Data

What happens if we don’t tell R what to do with the missing data?

summarize(by_day, delay = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ℹ 355 more rows

We can also filter our data based on NA (which in this dataset was cancelled flights).

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

Let’s run summarize again on this data.

summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

Counts

We can count the number of variables that are NA.

sum(is.na(my_data$dep_delay))
## [1] 8255

We can also count the numbers that are NOT NA.

sum(!is.na(my_data$dep_delay))
## [1] 328521

Piping

With tibble datasets(more on them soon), we can pipe results to get rid of the need to use the dollar signs. We can then summarize the number of flights by minutes delayed.

my_data %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(departure_time, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 1385.
##  2  2013     1     2 1354.
##  3  2013     1     3 1357.
##  4  2013     1     4 1348.
##  5  2013     1     5 1326.
##  6  2013     1     6 1399.
##  7  2013     1     7 1341.
##  8  2013     1     8 1335.
##  9  2013     1     9 1326.
## 10  2013     1    10 1333.
## # ℹ 355 more rows

Tibbles!

Tibbles

First, let’s load the required library.

library(tibble)

Now we will take the time to explore tibbles. Tibbles are modified data frames which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.

as_tibble(iris)
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

As we can see, we have the same data frame, but we have different features.

You can also create a tibble from scratch with tibble().

tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

You can also use tribble() for basic data table creation.

tribble(
  ~genea, ~ geneb, ~ genec,
  ########################
  110,      112,     114,
  6,        5,       4
)
## # A tibble: 2 × 3
##   genea geneb genec
##   <dbl> <dbl> <dbl>
## 1   110   112   114
## 2     6     5     4

Tibbles are built to not overwhelm your console when printing data, and only show the first few lines.

This is how data frames print.

as.data.frame(by_day)
head(by_day)

nycflights13::flights %>%
  print(n=10, width = Inf)

Subsetting

Subsetting tibbles is easy, similar to data.frames.

df_tibble <- tibble(nycflights13::flights)

df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can subset by column name using “$”.

df_tibble$carrier

We can subset by position using [[]].

df_tibble[[2]]

If you want to use this in a pipe, you need to use the “.” placeholder.

df_tibble %>%
  .$carrier

Some older functions so not like tibbles, thus you might have to convert them back to dataframe.

class(df_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"
df_tibble_2 <- as.data.frame(df_tibble)

class(df_tibble_2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble_2)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##             time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00

Tidyr

Tidyr

Let’s load our required library.

library(tidyverse)

How do we make a tidy data set? well the tidyverse follows three rules.

  1. Each variable must have its own column
  2. Each observation has its own row
  3. Each value has its own cell.

It is impossible to satisfy two of the three rules.

This leads to the following instructions for the tidy data:

  1. Put each dataset into a tibble
  2. Put each variable into a column
  3. Profit

Picking one consistent method of data storage makes for easier understanding of your code and what is happening “under the hood” or behind the scenes.

Let’s now look at working with tibbles

bmi <- tibble(women)

bmi %>%
  mutate(bmi = (703 * weight)/(height)^2)
## # A tibble: 15 × 3
##    height weight   bmi
##     <dbl>  <dbl> <dbl>
##  1     58    115  24.0
##  2     59    117  23.6
##  3     60    120  23.4
##  4     61    123  23.2
##  5     62    126  23.0
##  6     63    129  22.8
##  7     64    132  22.7
##  8     65    135  22.5
##  9     66    139  22.4
## 10     67    142  22.2
## 11     68    146  22.2
## 12     69    150  22.1
## 13     70    154  22.1
## 14     71    159  22.2
## 15     72    164  22.2

Spreading and Gathering

Sometimes you’ll find data sets that don’t fit well into a tibble.

We’ll use the built-in data from tidyverse for this part.

table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

As you can see from this data, we have one variable in column A (country), but column B and C are two of the same. Thus, there are two observations in each row.

table4a %>%
  gather('1999', '2000', key = 'year', value = 'cases')
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Let’s look at another example.

table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

As you can see we have the same problem in table 4b.

table4b %>%
  gather("1999", "2000", key = "year", value = "population")
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <dbl>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

Now what if we want to join these two tables? We can use dplyr.

table4a <- table4a %>%
  gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>%
  gather("1999", "2000", key = "year", value = "population")


left_join(table4a, table4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <dbl>      <dbl>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583

Spreading is the opposite of gathering. Let’s look at table 2.

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

You can see that we have redundant info in columns 1 and 2. We can fix that by combining rows 1&2, 3&4 ect.

spread(table2, key = type, value = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Type is the key of what we are turning into columns, the value is what becomes rows/observations.

In summary:

  • spread makes long tables shorter and wider
  • gather makes wide tables, narrower and longer

Separating and pull

Now what happens when we have two observations stuck in one column?

table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

As you can see, the rate is just the population and cases combined. We can use separate to fix this.

table3 %>%
  separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

However, if you notice, the columm type is not correct.

table3 %>%
  separate(rate, into = c("cases", "population"), conver = TRUE)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can specify what you want to separate.

table3 %>%
  separate(rate, into =c("cases", "population"), sep = "/", conver = TRUE)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Let’s make this look more tidy.

table3 %>%
  separate(
    year,
    into = c("century", "year"),
    convert = TRUE,
    sep = 2
  )
## # A tibble: 6 × 4
##   country     century  year rate             
##   <chr>         <int> <int> <chr>            
## 1 Afghanistan      19    99 745/19987071     
## 2 Afghanistan      20     0 2666/20595360    
## 3 Brazil           19    99 37737/172006362  
## 4 Brazil           20     0 80488/174504898  
## 5 China            19    99 212258/1272915272
## 6 China            20     0 213766/1280428583

Unite

What heppends if we want to do the inverse of separate?

table5
## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5 %>%
  unite(data, century, year)
## # A tibble: 6 × 3
##   country     data  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583
table5 %>%
  unite(data, century, year, sep = "")
## # A tibble: 6 × 3
##   country     data  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

The unite() function will combine the observations.

Missing values

There can be two types of mising values. NA (explicit) or just no entry (implicit).

gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1,2,3,4,2,3,4)
)

gene_data
## # A tibble: 7 × 3
##   gene    nuc   run
##   <chr> <dbl> <dbl>
## 1 a        20     1
## 2 a        22     2
## 3 a        24     3
## 4 a        25     4
## 5 b        NA     2
## 6 b        42     3
## 7 b        67     4

The nucleotide count for gene b and run2 is explicit missing, and the nucleotide count for gene b run 1 is implicitly missing.

One way we can make implicit missing values explicit is by putting runs in columns

gene_data %>%
  spread(gene, nuc)
## # A tibble: 4 × 3
##     run     a     b
##   <dbl> <dbl> <dbl>
## 1     1    20    NA
## 2     2    22    NA
## 3     3    24    42
## 4     4    25    67

If we want to remove the missing values, we can spread and gather, and na.rm = TRUE.

gene_data %>%
  spread(gene, nuc) %>%
  gather(gene, nuc, 'a':'b', na.rm = TRUE)
## # A tibble: 6 × 3
##     run gene    nuc
##   <dbl> <chr> <dbl>
## 1     1 a        20
## 2     2 a        22
## 3     3 a        24
## 4     4 a        25
## 5     3 b        42
## 6     4 b        67

Another way that we can make implicit values explicit, is complete().

gene_data %>%
  complete(gene, run)
## # A tibble: 8 × 3
##   gene    run   nuc
##   <chr> <dbl> <dbl>
## 1 a         1    20
## 2 a         2    22
## 3 a         3    24
## 4 a         4    25
## 5 b         1    NA
## 6 b         2    NA
## 7 b         3    42
## 8 b         4    67

Sometimes NA is present to represent a value being carried forward.

treatment <- tribble(
  ~person,      ~treatment,    ~response,
  ######################################
  "Issac",       1,             7,
  NA,            2,             10,
  NA,            3,             9,
  "VDB",         1,             8,
  NA,            2,             11,
  NA,            3,             10,
)

treatment
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Issac          1        7
## 2 <NA>           2       10
## 3 <NA>           3        9
## 4 VDB            1        8
## 5 <NA>           2       11
## 6 <NA>           3       10

What we can do here is use the fill() option.

treatment %>%
  fill(person)
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Issac          1        7
## 2 Issac          2       10
## 3 Issac          3        9
## 4 VDB            1        8
## 5 VDB            2       11
## 6 VDB            3       10

DPLYR

DPLYR

It is rare that you will be working with a simple data table. The DPLYR package allows you to join two data tables based on common values.

  • Mutate joins: add new variables to one data frame from the matching observations in another
  • Filtering joins: filters observation from one data frame based on whether or not they are present in another
  • Set operations: treats observations as they are set elements.

Let’s load our libraries.

library(tidyverse)
library(nycflights13)

Then, pull full carrier names on letter codes.

airlines
## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

Let’s get info about airports and info about each plane.

airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,448 more rows
planes
## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows

Then, let’s get some info on the weather at the airports.

weather
## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

Finally, let’s get info on singular flights.

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Now let’s look at how these tables connect.

  • Flights -> planes based on tailnumber
  • Flights -> airlines through carrier
  • Flights -> airports origin AND dest
  • Flights -> weather via origin, year/month/day/hour

Keys

Keys are unique identifiers per observation. Primary key uniquely identifies an observation in its own table.

One way to identify a primary key is as follows:

planes %>%
  count(tailnum) %>%
  filter(n>1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>

This indicates that the tailnumber is unique.

planes %>%
  count(model) %>%
  filter(n>1)
## # A tibble: 79 × 2
##    model       n
##    <chr>   <int>
##  1 717-200    88
##  2 737-301     2
##  3 737-3G7     2
##  4 737-3H4   105
##  5 737-3K2     2
##  6 737-3L9     2
##  7 737-3Q8     5
##  8 737-3TO     2
##  9 737-401     4
## 10 737-4B7    18
## # ℹ 69 more rows

Mutate join

Mutating joins add columns from y to x, matching observations based on the key.

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2
## # A tibble: 336,776 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ℹ 336,766 more rows
flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = 'carrier')
## # A tibble: 336,776 × 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ℹ 336,766 more rows

We’ve now added the airline name to our data frame from the airline data frame.

Other types of joins

  • Inner joins (inner_join): matches a pair of observations that appear in a least one table.
  • Outer joins (outer_join): keeps observations that appear in at least one table.

Stringr

Stringr

Let’s load our required packages.

library(tidyverse)
library(stringr)

You can create strings using single or double quotes.

string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite'

string1
## [1] "this is a string"
string2
## [1] "to put a \"quote\" in your string, use the opposite"

If you forget to close your string, you’ll get this:

string3 <- "where is this string going?

string3

Just hit escape and try again.

Multiple strings are stored in character vectors.

string4 <- c("one", "two", "three")
string4
## [1] "one"   "two"   "three"

Let’s measure string length.

str_length(string4)
## [1] 3 3 5

Let’s combine two strings.

str_c("X", "Y")
## [1] "XY"
str_c(string1, string2)
## [1] "this is a stringto put a \"quote\" in your string, use the opposite"

You can use sep to control how they are separated.

str_c(string1, string2, sep = " ")
## [1] "this is a string to put a \"quote\" in your string, use the opposite"
str_c("x", "y", "z", sep = "_")
## [1] "x_y_z"

Subsetting strings

You can subset a string using str_sub().

HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4,6)
## [1] "123" "234" "456"

This just drops the first four letters from the strings.

Or you can use negatives to count back from the end.

str_sub(HSP, -3, -1)
## [1] "123" "234" "456"

You can convert the cases of strings like follows:

HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_lower(HSP)
## [1] "hsp123" "hsp234" "hsp456"

Regular Expression

Let’s install our package.

install.packages("htmlwidgets")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

We can use str_view to find specific values in our data.

x <- c('ATTAGA', 'CGCCCCCGGAT', "TATTA")

str_view(x, "G")
## [1] │ ATTA<G>A
## [2] │ C<G>CCCCC<G><G>AT
str_view(x, "TA")
## [1] │ AT<TA>GA
## [3] │ <TA>T<TA>

The next step is, “.” where the “.” matches an entry.

str_view(x, ".G.")
## [1] │ ATT<AGA>
## [2] │ <CGC>CCC<CGG>AT

Anchors allow you to match at the start or the ending.

str_view(x, "^TA")
## [3] │ <TA>TTA
str_view(x, "TA$")
## [3] │ TAT<TA>

Character classes/alternatices:

  • “ matches any digit
  • “” matches any space
  • “[abc]” matches a, b, or, c
str_view(x, "TA[GT]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA

[^anc] matches anything BUT a, b, or c

str_view(x, "TA[^T]")
## [1] │ AT<TAG>A

You can also use | to pick between two alternatives.

str_view(x, "TA[G|T]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA

Detect Matches

str_detect() returns a logical vector the same length of input.

y <- c("apple", "bannana", "pear")
y
## [1] "apple"   "bannana" "pear"
str_detect(y, "e")
## [1]  TRUE FALSE  TRUE

How many common words contain the letter e?

words
sum(str_detect(words, "e"))
## [1] 536

Let’s get more complex, what proportion of words end in a vowel?

mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306
mean(str_detect(words, "^aeiou]"))
## [1] 0

Let’s find all the words that don’t contain “o” or “u”.

no_o <- !str_detect(words, "[ou]")

Now let’s extract.

non_ou_words <- words[!str_detect(words, "[ou]")]
head(non_ou_words)
## [1] "a"       "able"    "accept"  "achieve" "act"     "active"

You can also use str_count() to say how many matches there are in a string.

x
## [1] "ATTAGA"      "CGCCCCCGGAT" "TATTA"
str_count(x, "[GC]")
## [1] 1 9 0

Let’s couple this with mutate.

df <- tibble(
  word = words,
  count = seq_along(word)
)

df
## # A tibble: 980 × 2
##    word     count
##    <chr>    <int>
##  1 a            1
##  2 able         2
##  3 about        3
##  4 absolute     4
##  5 accept       5
##  6 account      6
##  7 achieve      7
##  8 across       8
##  9 act          9
## 10 active      10
## # ℹ 970 more rows
df %>%
  mutate(
    vowels = str_count(words, "[aeiou]"),
    constonants = str_count(words, "[^aeiou]")
  )
## # A tibble: 980 × 4
##    word     count vowels constonants
##    <chr>    <int>  <int>       <int>
##  1 a            1      1           0
##  2 able         2      2           2
##  3 about        3      3           2
##  4 absolute     4      4           4
##  5 accept       5      2           4
##  6 account      6      3           4
##  7 achieve      7      4           3
##  8 across       8      2           4
##  9 act          9      1           2
## 10 active      10      3           3
## # ℹ 970 more rows

Microarray Analysis

Characterizing the Data

Let’s load required packages.

library(ath1121501cdf)
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
## loading 'ath1121501cdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
## loading 'ath1121501cdf'
## 
library(ath1121501.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:lubridate':
## 
##     second, second<-
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:tidyr':
## 
##     expand
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:lubridate':
## 
##     %within%
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: org.At.tair.db
## 
## 
library(annotate)
## Loading required package: XML
library(affy)
## 
## Attaching package: 'affy'
## The following object is masked from 'package:lubridate':
## 
##     pm
library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
library(oligo)
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.64.0
## 
## Attaching package: 'oligoClasses'
## The following object is masked from 'package:affy':
## 
##     list.celfiles
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'XVector'
## The following object is masked from 'package:purrr':
## 
##     compact
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## ================================================================================
## Welcome to oligo version 1.66.0
## ================================================================================
## 
## Attaching package: 'oligo'
## The following object is masked from 'package:limma':
## 
##     backgroundCorrect
## The following objects are masked from 'package:affy':
## 
##     intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
##     probeNames, rma
## The following object is masked from 'package:lubridate':
## 
##     pm
## The following object is masked from 'package:dplyr':
## 
##     summarize
library(dplyr)
library(stats)
library(reshape)
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:S4Vectors':
## 
##     expand, rename
## The following object is masked from 'package:lubridate':
## 
##     stamp
## The following object is masked from 'package:dplyr':
## 
##     rename
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

Then, read all the cell files into the directory.

targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")

ab <- ReadAffy()
hist(ab)
Figure 1: Average intensity of expression during experiment.

Figure 1: Average intensity of expression during experiment.

Normalizing the microarray experiments.

eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
##                                          sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL      1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL      2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL      3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL       4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL       5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL       6

Let’s visualize the normalized data.

hist(eset)
Figure 2: Normalized average intensity of expression during experiment.

Figure 2: Normalized average intensity of expression during experiment.

Let’s characterize the data a bit.

ID <- featureNames(eset)
Symbol <- getSYMBOL(ID, "ath1121501.db")

affydata <- read.delim("AFFY_ATH1_array_elements.txt")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : EOF within quoted string

Differential Gene Expression Analysis

Flight vs Ground

condition <- targets$gravity

design <- model.matrix(~factor(condition))
colnames(design) <- c("flight", "ground")

fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef =2, n=Inf, adjust= 'fdr')

Let’s combine annotations.

Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
                    KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
                    GO = sapply(contents(ath1121501GO), paste, collapse = ", "), 
                    SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "), 
                    ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))
## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

Let’s merge all the data into one data frame.

all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)

all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")

Let’s convert the ACCNUM to a character value.

all2$ACCNUM <- as.character(all2$ACCNUM)

write.table(all2, file="BRIC_16_Final.csv", row.names = TRUE, col.names = TRUE, sep = "\t")


columns(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"
foldchanges <- as.data.frame(all2$logFC)

all2$entrez = mapIds(org.At.tair.db,
                     keys = all2$ACCNUM, 
                     column = "ENTREZID", 
                     keytype = "TAIR", 
                     multiVals = "first")
## 'select()' returned 1:1 mapping between keys and columns
head(all2, 10)
##    Row.names                 GENE KEGG
## 1  244903_at hypothetical protein   NA
## 2  244904_at hypothetical protein   NA
## 3  244906_at hypothetical protein   NA
## 4  244907_at hypothetical protein   NA
## 5  244908_at hypothetical protein   NA
## 6  244911_at hypothetical protein   NA
## 7  244913_at hypothetical protein   NA
## 8  244914_at hypothetical protein   NA
## 9  244916_at hypothetical protein   NA
## 10 244917_at hypothetical protein   NA
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     GO
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005575", Evidence = "ND", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 8  list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005634", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
##     SYMBOL    ACCNUM  logFC AveExpr     t P.Value adj.P.Val    B
## 1   ORF149 ATMG00660 -1.214     7.8 -6.20  0.0004     0.018  0.5
## 2   ORF275 ATMG00670 -0.568     3.2 -2.07  0.0758     0.300 -4.9
## 3  ORF240A ATMG00690  0.055     9.1  0.18  0.8586     0.947 -6.8
## 4   ORF120 ATMG00710 -0.883     4.3 -2.91  0.0220     0.154 -3.7
## 5  ORF107D ATMG00720 -0.463     2.2 -1.73  0.1262     0.396 -5.4
## 6   ORF170 ATMG00820 -0.193     2.7 -0.72  0.4948     0.749 -6.5
## 7  ORF121B ATMG00840 -0.339     1.8 -1.18  0.2764     0.574 -6.1
## 8  ORF107E ATMG00850 -0.305     5.5 -1.30  0.2338     0.534 -6.0
## 9   ORF187 ATMG00880 -0.888     2.8 -2.32  0.0527     0.245 -4.6
## 10  ORF184 ATMG00870 -0.379     3.6 -1.31  0.2307     0.531 -5.9
##    array_element_type             organism is_control     locus
## 1     oligonucleotide Arabidopsis thaliana         no ATMG00660
## 2     oligonucleotide Arabidopsis thaliana         no ATMG00670
## 3     oligonucleotide Arabidopsis thaliana         no ATMG00690
## 4     oligonucleotide Arabidopsis thaliana         no ATMG00710
## 5     oligonucleotide Arabidopsis thaliana         no ATMG00720
## 6     oligonucleotide Arabidopsis thaliana         no ATMG00820
## 7     oligonucleotide Arabidopsis thaliana         no AT2G07626
## 8     oligonucleotide Arabidopsis thaliana         no ATMG00850
## 9     oligonucleotide Arabidopsis thaliana         no ATMG00880
## 10    oligonucleotide Arabidopsis thaliana         no ATMG00870
##                                                                                description
## 1                                                  hypothetical protein;(source:Araport11)
## 2                                                 transmembrane protein;(source:Araport11)
## 3                                                     FO-ATPase subunit;(source:Araport11)
## 4  Polynucleotidyl transferase, ribonuclease H-like superfamily protein;(source:Araport11)
## 5                                                  hypothetical protein;(source:Araport11)
## 6                  Reverse transcriptase (RNA-dependent DNA polymerase);(source:Araport11)
## 7                                                  hypothetical protein;(source:Araport11)
## 8                               DNA/RNA polymerases superfamily protein;(source:Araport11)
## 9                                                  hypothetical protein;(source:Araport11)
## 10                                                 hypothetical protein;(source:Araport11)
##    chromosome entrez
## 1           M   <NA>
## 2           M   <NA>
## 3           M   <NA>
## 4           M   <NA>
## 5           M   <NA>
## 6           M   <NA>
## 7           2   <NA>
## 8           M   <NA>
## 9           M   <NA>
## 10          M   <NA>

Pathview

Let’s load the required packages.

library(pathview)
## 
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
library(gage)
## 
library(gageData)
data(kegg.sets.hs)
foldchanges = all2$logFC
names(foldchanges) = all2$entrez

head(foldchanges)
##   <NA>   <NA>   <NA>   <NA>   <NA>   <NA> 
## -1.214 -0.568  0.055 -0.883 -0.463 -0.193
kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]

Let’s get the results.

keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                                   p.geomean stat.mean   p.val
## ath03010 Ribosome                                   1.5e-14       8.2 1.5e-14
## ath01230 Biosynthesis of amino acids                2.9e-04       3.5 2.9e-04
## ath00040 Pentose and glucuronate interconversions   2.0e-03       3.0 2.0e-03
## ath00195 Photosynthesis                             2.6e-03       3.0 2.6e-03
## ath00966 Glucosinolate biosynthesis                 7.0e-03       2.7 7.0e-03
## ath01232 Nucleotide metabolism                      9.2e-03       2.4 9.2e-03
##                                                     q.val set.size    exp1
## ath03010 Ribosome                                 1.6e-12      129 1.5e-14
## ath01230 Biosynthesis of amino acids              1.5e-02       87 2.9e-04
## ath00040 Pentose and glucuronate interconversions 6.6e-02       49 2.0e-03
## ath00195 Photosynthesis                           6.6e-02       19 2.6e-03
## ath00966 Glucosinolate biosynthesis               1.3e-01       12 7.0e-03
## ath01232 Nucleotide metabolism                    1.3e-01       42 9.2e-03
## 
## $less
##                                          p.geomean stat.mean p.val q.val
## ath04120 Ubiquitin mediated proteolysis      0.043      -1.7 0.043     1
## ath04016 MAPK signaling pathway - plant      0.044      -1.7 0.044     1
## ath00592 alpha-Linolenic acid metabolism     0.045      -1.7 0.045     1
## ath03040 Spliceosome                         0.075      -1.4 0.075     1
## ath00350 Tyrosine metabolism                 0.093      -1.4 0.093     1
## ath00906 Carotenoid biosynthesis             0.116      -1.2 0.116     1
##                                          set.size  exp1
## ath04120 Ubiquitin mediated proteolysis        63 0.043
## ath04016 MAPK signaling pathway - plant        73 0.044
## ath00592 alpha-Linolenic acid metabolism       19 0.045
## ath03040 Spliceosome                           77 0.075
## ath00350 Tyrosine metabolism                   19 0.093
## ath00906 Carotenoid biosynthesis               18 0.116
## 
## $stats
##                                                   stat.mean exp1
## ath03010 Ribosome                                       8.2  8.2
## ath01230 Biosynthesis of amino acids                    3.5  3.5
## ath00040 Pentose and glucuronate interconversions       3.0  3.0
## ath00195 Photosynthesis                                 3.0  3.0
## ath00966 Glucosinolate biosynthesis                     2.7  2.7
## ath01232 Nucleotide metabolism                          2.4  2.4
greater <- keggres$greater
lessers <- keggres$less

write.csv(greater, file = "BRIC_16_pathview_Greater_Pathways.csv")
write.csv(lessers, file = "BRIC_16_pathview_Lesser_Pathways.csv")

Let’s map to pathways (greater).

keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath03010 Ribosome"                                
## [2] "ath01230 Biosynthesis of amino acids"             
## [3] "ath00040 Pentose and glucuronate interconversions"
## [4] "ath00195 Photosynthesis"                          
## [5] "ath00966 Glucosinolate biosynthesis"
keggresids_greater = substr(keggrespathways, start=1, stop=8)
keggresids_greater
## [1] "ath03010" "ath01230" "ath00040" "ath00195" "ath00966"
genedata <- as.character(all2$logFC)

Define a plotting function to apply next.

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)

Plot multiple pathways (plots are saved to disk and returns a throwaway object list).

tmp = sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", kegg.native = FALSE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath03010 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath01230.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Warning in .local(from, to, graph): edges replaced: '217|104', '217|216'
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00040.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath00195 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00966.pathview.pdf

Let’s map to pathways (lesser).

keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath04120 Ubiquitin mediated proteolysis" 
## [2] "ath04016 MAPK signaling pathway - plant" 
## [3] "ath00592 alpha-Linolenic acid metabolism"
## [4] "ath03040 Spliceosome"                    
## [5] "ath00350 Tyrosine metabolism"
keggresids_less = substr(keggrespathways, start=1, stop=8)
keggresids_less
## [1] "ath04120" "ath04016" "ath00592" "ath03040" "ath00350"
genedata <- as.character(all2$logFC)

Define a plotting function to apply next.

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)

Plot multiple pathways (plots are saved to disk and returns a throwaway object list).

tmp = sapply(keggresids_less, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", kegg.native = FALSE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath04120 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Warning: reconcile groups sharing member nodes!
##      [,1] [,2] 
## [1,] "7"  "228"
## [2,] "8"  "228"
## [3,] "7"  "229"
## [4,] "8"  "229"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04016.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00592.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath03040 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00350.pathview.pdf

Let’s load our pathways.

library(imager)
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
## 
##     add
## The following object is masked from 'package:Biostrings':
## 
##     width
## The following object is masked from 'package:XVector':
## 
##     width
## The following objects are masked from 'package:oligoClasses':
## 
##     B, B<-
## The following objects are masked from 'package:IRanges':
## 
##     resize, width
## The following object is masked from 'package:S4Vectors':
## 
##     width
## The following object is masked from 'package:Biobase':
## 
##     channel
## The following object is masked from 'package:BiocGenerics':
## 
##     width
## The following object is masked from 'package:stringr':
## 
##     boundary
## The following object is masked from 'package:dplyr':
## 
##     where
## The following object is masked from 'package:tidyr':
## 
##     fill
## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum
## The following object is masked from 'package:graphics':
## 
##     frame
## The following object is masked from 'package:base':
## 
##     save.image
filenames <- list.files(path = "~/Desktop/classroom/myfiles", pattern = ".*pathview.pdf")

Pathway Graphs

knitr::include_graphics(filenames)

DNASeq

Characterizing the Data

Let’s load our packages.

library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
## 

Then, read all the cell files into the directory.

rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_id")

Let’s characterize the data.

group <- factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))

dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 2
##   .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
##   .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
##   .. ..$ :'data.frame':  12 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
##   .. .. ..$ lib.size    : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
##   .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ names: chr [1:2] "counts" "samples"
keep <- rowSums(cpm(dgeGlm)>2) >=4

dgeGlm <- dgeGlm[keep,]

names(dgeGlm)
## [1] "counts"  "samples"
dgeGlm[["samples"]]
##                              group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23     1  4.0e+07            1
## Mmus_C57.6J_KDN_FLT_Rep2_M24     1  4.1e+07            1
## Mmus_C57.6J_KDN_FLT_Rep3_M25     1  3.8e+07            1
## Mmus_C57.6J_KDN_FLT_Rep4_M26     1  3.9e+07            1
## Mmus_C57.6J_KDN_FLT_Rep5_M27     1  3.7e+07            1
## Mmus_C57.6J_KDN_FLT_Rep6_M28     1  3.6e+07            1
## Mmus_C57.6J_KDN_GC_Rep1_M33      2  3.7e+07            1
## Mmus_C57.6J_KDN_GC_Rep2_M34      2  3.7e+07            1
## Mmus_C57.6J_KDN_GC_Rep3_M35      2  4.0e+07            1
## Mmus_C57.6J_KDN_GC_Rep4_M36      2  3.6e+07            1
## Mmus_C57.6J_KDN_GC_Rep5_M37      2  3.8e+07            1
## Mmus_C57.6J_KDN_GC_Rep6_M38      2  3.5e+07            1
design <- model.matrix(~group)
design
##    (Intercept) group2
## 1            1      0
## 2            1      0
## 3            1      0
## 4            1      0
## 5            1      0
## 6            1      0
## 7            1      1
## 8            1      1
## 9            1      1
## 10           1      1
## 11           1      1
## 12           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)

plotBCV(dgeGlmTagDisp)
Figure 1: Distribution of variance between samples.

Figure 1: Distribution of variance between samples.

fit <- glmFit(dgeGlmTagDisp, design)

colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef =2 )

ttGlm <- topTags(lrt, n = Inf)

class(ttGlm)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))
##        group2
## Down       64
## NotSig  13390
## Up        159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]

plotSmear(lrt, de.tags = tagsGlm)
Figure 2: Log fold changes for every gene. Differentially expressed genes are seen in red.

Figure 2: Log fold changes for every gene. Differentially expressed genes are seen in red.

hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]

write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")

Let’s convert to a data frame.

columns(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"
ttGlm2 <- as.data.frame(ttGlm$table)

ttGlm2$symbol = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column = "SYMBOL",
                       keytype = "ENSEMBL",
                       multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$entrez = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column = "ENTREZID",
                       keytype = "ENSEMBL",
                       multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$name = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column = "GENENAME",
                       keytype = "ENSEMBL",
                       multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(ttGlm2)
##                    logFC logCPM LR  PValue     FDR symbol entrez
## ENSMUSG00000026077 -1.36    3.6 80 4.3e-19 5.9e-15  Npas2  18143
## ENSMUSG00000007872  0.89    5.5 77 1.9e-18 1.3e-14    Id3  15903
## ENSMUSG00000021775  0.95    6.2 63 2.0e-15 9.1e-12  Nr1d2 353187
## ENSMUSG00000002250 -0.83    5.3 62 2.7e-15 9.2e-12  Ppard  19015
## ENSMUSG00000059824  2.26    4.6 58 2.6e-14 7.2e-11    Dbp  13170
## ENSMUSG00000074715 -1.99    3.8 47 7.0e-12 1.6e-08  Ccl28  56838
##                                                                name
## ENSMUSG00000026077                    neuronal PAS domain protein 2
## ENSMUSG00000007872                       inhibitor of DNA binding 3
## ENSMUSG00000021775  nuclear receptor subfamily 1, group D, member 2
## ENSMUSG00000002250 peroxisome proliferator activator receptor delta
## ENSMUSG00000059824          D site albumin promoter binding protein
## ENSMUSG00000074715                    C-C motif chemokine ligand 28

Pathview

Let’s load the required packages.

library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)

Let’s create the pathways.

foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]

Let’s get the results.

keggres = gage(foldchanges, gset=kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                                                   p.geomean
## mmu03010 Ribosome                                                   9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells   2.0e-03
## mmu04330 Notch signaling pathway                                    6.1e-03
## mmu04350 TGF-beta signaling pathway                                 1.3e-02
## mmu04390 Hippo signaling pathway                                    2.0e-02
## mmu00830 Retinol metabolism                                         2.1e-02
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04330 Notch signaling pathway                                        2.6
## mmu04350 TGF-beta signaling pathway                                     2.2
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                     p.val q.val
## mmu03010 Ribosome                                                 9.5e-05 0.023
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.235
## mmu04330 Notch signaling pathway                                  6.1e-03 0.488
## mmu04350 TGF-beta signaling pathway                               1.3e-02 0.783
## mmu04390 Hippo signaling pathway                                  2.0e-02 0.826
## mmu00830 Retinol metabolism                                       2.1e-02 0.826
##                                                                   set.size
## mmu03010 Ribosome                                                      127
## mmu04550 Signaling pathways regulating pluripotency of stem cells      100
## mmu04330 Notch signaling pathway                                        54
## mmu04350 TGF-beta signaling pathway                                     84
## mmu04390 Hippo signaling pathway                                       125
## mmu00830 Retinol metabolism                                             37
##                                                                      exp1
## mmu03010 Ribosome                                                 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway                                  6.1e-03
## mmu04350 TGF-beta signaling pathway                               1.3e-02
## mmu04390 Hippo signaling pathway                                  2.0e-02
## mmu00830 Retinol metabolism                                       2.1e-02
## 
## $less
##                                                  p.geomean stat.mean   p.val
## mmu04613 Neutrophil extracellular trap formation   0.00012      -3.7 0.00012
## mmu04145 Phagosome                                 0.00192      -2.9 0.00192
## mmu04110 Cell cycle                                0.00276      -2.8 0.00276
## mmu04714 Thermogenesis                             0.00472      -2.6 0.00472
## mmu04217 Necroptosis                               0.00614      -2.5 0.00614
## mmu00910 Nitrogen metabolism                       0.00867      -2.6 0.00867
##                                                  q.val set.size    exp1
## mmu04613 Neutrophil extracellular trap formation 0.029      137 0.00012
## mmu04145 Phagosome                               0.221      121 0.00192
## mmu04110 Cell cycle                              0.221      134 0.00276
## mmu04714 Thermogenesis                           0.283      208 0.00472
## mmu04217 Necroptosis                             0.295      113 0.00614
## mmu00910 Nitrogen metabolism                     0.347       13 0.00867
## 
## $stats
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04330 Notch signaling pathway                                        2.6
## mmu04350 TGF-beta signaling pathway                                     2.2
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                   exp1
## mmu03010 Ribosome                                                  3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells  2.9
## mmu04330 Notch signaling pathway                                   2.6
## mmu04350 TGF-beta signaling pathway                                2.2
## mmu04390 Hippo signaling pathway                                   2.1
## mmu00830 Retinol metabolism                                        2.1
greaters <- keggres$greater
leassers <- keggres$less

Now, we’ll create the greater path.

keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tibble::as_tibble() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()
keggrespathways
## [1] "mmu03010 Ribosome"                                                
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04330 Notch signaling pathway"                                 
## [4] "mmu04350 TGF-beta signaling pathway"                              
## [5] "mmu04390 Hippo signaling pathway"
keggresidsgreater = substr(keggrespathways, start=1, stop=8)
keggresidsgreater
## [1] "mmu03010" "mmu04550" "mmu04330" "mmu04350" "mmu04390"
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

Plot multiple pathways.

tmp = sapply(keggresidsgreater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "mouse", kegg.native = FALSE))
## 'select()' returned 1:1 mapping between keys and columns
## Note: mmu03010 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04550.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
##      [,1] [,2]
## [1,] "13" "74"
## [2,] "13" "75"
## [3,] "13" "76"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04330.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
##      [,1]  [,2] 
## [1,] "213" "234"
## [2,] "214" "234"
## [3,] "215" "234"
## [4,] "216" "234"
## [5,] "213" "235"
## [6,] "214" "235"
## [7,] "215" "235"
## [8,] "216" "235"
## Warning: Nodes are not the same type, hence unable to combine!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04350.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04390.pathview.pdf

Let’s plot the lesser path.

keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
   tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04145 Phagosome"                              
## [3] "mmu04110 Cell cycle"                             
## [4] "mmu04714 Thermogenesis"                          
## [5] "mmu04217 Necroptosis"
keggresidslesser = substr(keggrespathways, start=1, stop=8)
keggresidslesser
## [1] "mmu04613" "mmu04145" "mmu04110" "mmu04714" "mmu04217"
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

Plot multiple pathways.

tmp = sapply(keggresidslesser, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "mouse", kegg.native = FALSE))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04613.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04145.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
##      [,1] [,2] 
## [1,] "9"  "300"
## [2,] "9"  "306"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04110.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04714.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
##      [,1]  [,2] 
## [1,] "226" "232"
## [2,] "228" "232"
## [3,] "226" "234"
## [4,] "228" "234"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04217.pathview.pdf

Pathview Graphs

library(imager)

filenames <- list.files(path = "~/Desktop/classroom/myfiles", pattern = ".*pathview.pdf")
knitr::include_graphics(filenames)

Other Tools

Multiple Protein Alignment

First, let’s load our library.

library(msa)

Now let’s load our data.

seq <- readAAStringSet("hglobin-copy.fa")

seq
## AAStringSet object of length 3:
##     width seq                                               names               
## [1]   142 MVLSPADKTNVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2]   142 MVLSGEDKSNIKAAWGKIGGHGA...PAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3]   142 MSLTRTERTIILSLWSKISTQAD...ADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI

Let’s align the different amino acid sequences.

alignments <- msa(seq, method = "ClustalW")
## use default substitution matrix
alignments
## CLUSTAL 2.1  
## 
## Call:
##    msa(seq, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 3 rows and 142 columns
##     aln                                                    names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] MSLTRTERTIILSLWSKISTQADVIG...FTADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
## Con MVLS??DKTNIKAAWGKIG?HA?EYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus
msaPrettyPrint(alignments, output = "pdf", showNames = "left", 
               showLogo = "none", askForOverwrite = FALSE, 
               verbose = TRUE, file = "whole_align.pdf")
## Multiple alignment written to temporary file /tmp/Rtmpsjqmbh/seq14fa1ce6da7c.fasta
## File whole_align.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file whole_align.pdf created
msaPrettyPrint(alignments, c(10,30), output = "pdf", showNames = "left",
               file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
               verbose = TRUE)
## Multiple alignment written to temporary file /tmp/Rtmpsjqmbh/seq14fa74d3890d.fasta
## File Zoomed_align.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file Zoomed_align.pdf created

Let’s make a tree from our alignment.

First, we need to load the required libraries.

library(ape)
## 
## Attaching package: 'ape'
## The following object is masked from 'package:imager':
## 
##     where
## The following object is masked from 'package:Biostrings':
## 
##     complement
## The following object is masked from 'package:dplyr':
## 
##     where
library(seqinr)
## 
## Attaching package: 'seqinr'
## The following objects are masked from 'package:ape':
## 
##     as.alignment, consensus
## The following object is masked from 'package:Biostrings':
## 
##     translate
## The following object is masked from 'package:limma':
## 
##     zscore
## The following object is masked from 'package:dplyr':
## 
##     count

Convert to a seqinr alignment -> get the distance and make a tree.

alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")

distancesl <- seqinr::dist.alignment(alignment_seqinr, "identity")

tree <- ape::nj(distancesl)
plot(tree, main = "Phylogenetic Tree of HBA Sequences")

Synteny

In the first step we load the libraries and the sequence into long.seq. This is a DNAstringset object.

library(DECIPHER)
## Loading required package: RSQLite
## Loading required package: parallel
long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq
## DNAStringSet object of length 5:
##      width seq                                              names               
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT NC_018523.1 Sacch...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA NC_022431.1 Ascle...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC NC_022259.1 Nanno...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC NC_022417.1 Cocos...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT NC_022459.1 Camel...

Now let’s build a temporary SQLite database.

Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
## Adding 5 sequences to the database.
## 
## Added 5 new sequences to table Seqs.
## 65 total sequences in table Seqs.
## Time difference of 0.22 secs

Now that we have built the database, we can do the following:

Find the syntenic blocks.

synteny <- FindSynteny("long_db")
## ================================================================================
## 
## Time difference of 10 secs

View blocks with plotting.

pairs(synteny)
Figure 1: Comparison of chromosome alignment between organisms.

Figure 1: Comparison of chromosome alignment between organisms.

plot(synteny)
Figure 2: Conservered regions in other organisms compared to reference genome.

Figure 2: Conservered regions in other organisms compared to reference genome.

Make an actual alignment file.

alignmnet <- AlignSynteny(synteny, "long_db")
## ================================================================================
## 
## Time difference of 87 secs

Let’s create a structure holding all aligned sytenic blocks for a pair of sequences.

block <- unlist(alignmnet[[1]])

We can write to file one alignment at a time.

writeXStringSet(block, "genome_blocks_out.fa")

Unannotated Gene Regions

First, let’s load our required libraries.

library(locfit)
## locfit 1.5-9.8    2023-06-11
## 
## Attaching package: 'locfit'
## The following object is masked from 'package:purrr':
## 
##     none
library(Rsamtools)
## Loading required package: GenomicRanges
## 
## Attaching package: 'GenomicRanges'
## The following object is masked from 'package:magrittr':
## 
##     subtract

Now, let’s create a function that will load the gene region in a GFF file and convert it to a bioconductor GRanges object.

get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}

Let’s get counts in windows across the genome in 500bp segments.

whole_genome <- csaw::windowCounts(
  file.path(getwd(), "windows.bam"),
  bin = TRUE,
  filter = 0,
  width = 500,
  param = csaw::readParam(
    minq = 20, #determines what is a passing read
    dedup = TRUE, # removes pcr duplicates
    pe = "both" # requires that both pairs of reads are aligned
  )
)

Since this is a single column of data that is generated, let’s rename it.

colnames(whole_genome) <- c("small_data")

annotated_regions <- get_annotated_regions_from_gff("genes.gff")

Now that we have the windows of high expression, we want to see if any of them overlap with annotated regions.

Let’s load the required packages.

library(IRanges)
library(SummarizedExperiment) 
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following object is masked from 'package:seqinr':
## 
##     count
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
## 
##     rowMedians

Then we find the overlap between the windows we made.

windows_in_genes <- IRanges::overlapsAny(
  SummarizedExperiment::rowRanges(whole_genome), # creates a Granges object
  annotated_regions
)

windows_in_genes
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

Here we subset the whole_genome object into annotated and non annotated regions.

annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]

Then we use assay() to extrat the actua counts from the Granges objects.

assay(non_annotated_window_counts)
##       small_data
##  [1,]          0
##  [2,]         31
##  [3,]         25
##  [4,]          0
##  [5,]          0
##  [6,]         24
##  [7,]         25
##  [8,]          0
##  [9,]          0

In this step, we use Rsamtools Pileup() function to get a per-base coverage dataframe, each row represents a single nucleotide in the refence count and the count column gives the depth of coverage at that point.

Let’s load our next library and data.

library(bumphunter)
## Loading required package: foreach
## Parallel computing support for 'oligo/crlmm': Disabled
##      - Load 'ff'
##      - Load and register a 'foreach' adaptor
##         Example - Using 'multicore' for 2 cores:
##              library(doMC)
##              registerDoMC(2)
## ================================================================================
## 
## Attaching package: 'foreach'
## 
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## 
## Loading required package: iterators
pile_df <- Rsamtools::pileup("windows.bam")

This step groups the reads with certain distance of each other into a cluster, we give the sequence names, position, and distance.

clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap = 100)

table(clusters)
## clusters
##    1    2    3 
## 1486 1552 1520

In this step we will map the reads to the regions we found for the genome.

bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff = 1)
## getSegments: segmenting
## getSegments: splitting
##    chr start  end value  area cluster indexStart indexEnd    L clusterL
## 3 Chr1  4503 5500  10.4 15811       3       3039     4558 1520     1520
## 1 Chr1   502 1500  10.0 14839       1          1     1486 1486     1486
## 2 Chr1  2501 3500   8.7 13436       2       1487     3038 1552     1552

Phylogenetic Analysis

Treeio

Let’s load the required packages.

library(ggplot2)
library(ggtree)
## ggtree v3.10.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## 
## G Yu. Data Integration, Manipulation and Visualization of Phylogenetic
## Trees (1st ed.). Chapman and Hall/CRC. 2022. ISBN: 9781032233574
## 
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
## 
## Attaching package: 'ggtree'
## The following object is masked from 'package:ape':
## 
##     rotate
## The following object is masked from 'package:magrittr':
## 
##     inset
## The following object is masked from 'package:reshape':
## 
##     expand
## The following object is masked from 'package:Biostrings':
## 
##     collapse
## The following object is masked from 'package:IRanges':
## 
##     collapse
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## The following object is masked from 'package:tidyr':
## 
##     expand
library(treeio)
## treeio v1.26.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
## 
## Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods
## for mapping and visualizing associated data on phylogeny using ggtree.
## Molecular Biology and Evolution. 2018, 35(12):3041-3043.
## doi:10.1093/molbev/msy194
## 
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
## 
## Attaching package: 'treeio'
## The following object is masked from 'package:seqinr':
## 
##     read.fasta
## The following object is masked from 'package:Biostrings':
## 
##     mask

First we need to load our raw tree data. It’s a Newick format so we use:

itol <- ape::read.tree("itol.nwk")

Now we will print out a very basic phylogenetic tree.

ggtree(itol)
Figure 1: Basic phylogenetic tree.

Figure 1: Basic phylogenetic tree.

We can also change the format to make it a circular tree.

ggtree(itol, layout = "circular")
Figure 2: Circular phylogenetic tree.

Figure 2: Circular phylogenetic tree.

We can also change the left-right/ up-down direction.

ggtree(itol) + coord_flip() + scale_x_reverse()
Figure 3: Flipped basic phylogenetic tree.

Figure 3: Flipped basic phylogenetic tree.

By using geom_tiplab() we can add names to the end of tips.

ggtree(itol) + geom_tiplab(color = "indianred", size = 1.5)
Figure 4: Labeled basic phylogenetic tree.

Figure 4: Labeled basic phylogenetic tree.

By adding a geom_strip() layer we can annotate clades in the tree with a block of color.

ggtree(itol, layour = "unrooted") + geom_strip(13,14, color = "red", barsize = 1)
## Warning in stat_tree(data = data, mapping = mapping, geom = "segment", position = position, : Ignoring unknown parameters: `layour`
## Ignoring unknown parameters: `layour`
Figure 5: Basic phylogenetic tree with selected clades labeled in red.

Figure 5: Basic phylogenetic tree with selected clades labeled in red.

We can highlight clades in unrooted trees with blocks of color using geom_hilight.

ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "steelblue")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
Figure 6: Unrooted phylogenetic tree with clades highlighted in blue.

Figure 6: Unrooted phylogenetic tree with clades highlighted in blue.

Next, we need to install our required packages.

install.packages('BAMMtools')
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(BAMMtools)

We can use the MRCA (most recent common ancestor) function to find the node we want.

getMRCA(itol, tip = c("Photorhabdus_luminescens", "Blochmannia_floridanus"))
## [1] 206

Now if we want to highlight the section of the most recent common ancestor between the two above.

ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "steelblue")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
Figure 7: Unrooted phylogenetic tree with the most recent common ancestor (MRCA) between the previous two clades highlighted in blue.

Figure 7: Unrooted phylogenetic tree with the most recent common ancestor (MRCA) between the previous two clades highlighted in blue.

Quantifying Trees

First let’s load the required packages.

library(ape)
library(adegraphics)
## Registered S3 methods overwritten by 'adegraphics':
##   method         from
##   biplot.dudi    ade4
##   kplot.foucart  ade4
##   kplot.mcoa     ade4
##   kplot.mfa      ade4
##   kplot.pta      ade4
##   kplot.sepan    ade4
##   kplot.statis   ade4
##   scatter.coa    ade4
##   scatter.dudi   ade4
##   scatter.nipals ade4
##   scatter.pco    ade4
##   score.acm      ade4
##   score.mix      ade4
##   score.pca      ade4
##   screeplot.dudi ade4
## 
## Attaching package: 'adegraphics'
## The following object is masked from 'package:GenomicRanges':
## 
##     score
## The following object is masked from 'package:ape':
## 
##     zoom
## The following object is masked from 'package:Biostrings':
## 
##     score
## The following object is masked from 'package:BiocGenerics':
## 
##     score
library(treespace)
## Loading required package: ade4
## 
## Attaching package: 'ade4'
## The following objects are masked from 'package:adegraphics':
## 
##     kplotsepan.coa, s.arrow, s.class, s.corcircle, s.distri, s.image,
##     s.label, s.logo, s.match, s.traject, s.value, table.value,
##     triangle.class
## The following object is masked from 'package:GenomicRanges':
## 
##     score
## The following object is masked from 'package:Biostrings':
## 
##     score
## The following object is masked from 'package:BiocGenerics':
## 
##     score

Now we need to load all the treefiles into a multiPhylo object.

treefiles <- list.files(file.path(getwd(),"genetrees"), full.names = TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)
## [1] "list"
class(tree_list) <- "multiPhylo"

class(tree_list)
## [1] "multiPhylo"

Now we can compute the kendall-coljin distance between trees. This function does a LOT of anaylses.

First it runs a pairwise comparison of all trees in the input. Second it carries out clustering using PCA These results are returned in a list of objects, where “\(D" contains the pairwise metric of the tree, and "\)pco” contains the PCA. the method we use (Kendal-Coljin) is particularly suitable for rooted trees as we have here. The option NF tells us how many principal components to retain.

comparisons <- treespace(tree_list, nf = 3)

We can plot the pairwise distances between trees with table.image.

table.image(comparisons$D, nclass = 25)
Figure 1: Heat map comparison of trees 1-20.

Figure 1: Heat map comparison of trees 1-20.

Now let’s print the PCA and clusters, this shows us how the group of trees cluster.

plotGroves(comparisons$pco, lab.show = TRUE, lab.cex = 1.5)
Figure 2: Principle component analysis.

Figure 2: Principle component analysis.

groves <- findGroves(comparisons, nclust = 4)
plotGroves(groves)
Figure 3: Estimated clusters using principle component analysis.

Figure 3: Estimated clusters using principle component analysis.

Binding Trees

Extracting and working with subtrees using APE

Let’s load our required packages.

library(ape)

Now let’s load the tree data we will be working with.

newick <- read.tree("mammal_tree.nwk")

l <- subtrees(newick)

Let’s plot the tree to see what it looks like.

plot(newick)
Figure 1: Mammal phylogenetic tree.

Figure 1: Mammal phylogenetic tree.

We can subset this plot using the “node” function.

plot(l[[4]], sub = "Node 4")
Figure 2: Node 4 of the mammal phylogenetic tree.

Figure 2: Node 4 of the mammal phylogenetic tree.

Extract the tree manually.

small_tree <- extract.clade(newick, 9)

plot(small_tree)
Figure 3: Extracted clade from mammal phylogenetic tree.

Figure 3: Extracted clade from mammal phylogenetic tree.

Now what if we want to bind two trees together?

new_tree <- bind.tree(newick, small_tree, 3)
plot(new_tree)
Figure 4: Mammal phylogenetic tree binded with extracted clade phylogenetic tree.

Figure 4: Mammal phylogenetic tree binded with extracted clade phylogenetic tree.

Trees from Alignment

Reconstructing trees from alignments.

Let’s load the packages.

library(Biostrings)
library(msa)
library(phangorn)

First we’ll load the sequences into a seq variable.

seqs <- readAAStringSet("abc.fa")

Now let’s construct an alignmnet with the msa package and ClustalOmega.

aln <- msa::msa(seqs, method = c("ClustalOmega"))
## using Gonnet

To create a tree, we need to convert the alignment to a phyData objects.

aln <- as.phyDat(aln, type = "AA")

class(aln)
## [1] "phyDat"

In this step, we’ll make the trees. Trees are made from a distance matrix, which can be computed with dist.ml() - ML stands for maximum likelihood.

dist_mat <- dist.ml(aln)

Now we pass the distance matrix to upgma to make a UPGMA tree.

upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main = "UPGMA")
Figure 1: UPGMA phylogenetic tree.

Figure 1: UPGMA phylogenetic tree.

We can conversley pass distance matrix to a neightbor joining function.

nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")
Figure 2: NJ phylogenetic tree.

Figure 2: NJ phylogenetic tree.

Lastly, we are going to use the bootstraps.phyDat() function to comput bootstrap support for the branches of the tree. The first argument is the object (aln), while the second argument in the function nj.

Bootstraps are essentially confidence intervals for how the clade is placed in the correct position.

fit <- pml(nj_tree, aln)

bootstraps <- bootstrap.phyDat(aln, FUN = function(x) (NJ(dist.ml(x))), bs = 100)

plotBS(nj_tree, bootstraps, p = 10)
Figure 3: NJ phylogenetic tree with bootstraps.

Figure 3: NJ phylogenetic tree with bootstraps.

GWAS

Identifying Genetic Variants

First let’s load the required libraries.

library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
## 
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:stringr':
## 
##     fixed
## The following object is masked from 'package:base':
## 
##     tabulate
library(VariantTools)

Now we want to load our data sets. We need .bam and .fa files for this pipeline to work.

bam_file <- file.path(getwd(), "hg17_snps.bam")

fasta_file <- file.path(getwd(), "chr17_83k.fa")

Now we need to set up the genome object and the parameters object.

fa <- rtracklayer::FastaFile(fasta_file)

Now we create a GMapGenome object, which discribes genome to the later variant calling function.

genome <- gmapR::GmapGenome(fa, create = TRUE)
## NOTE: genome 'chr17_83k' already exists, not overwriting

This next step sets our parameter for what is considered a variant. There can be a lot of fine-tuning to this function. We are just going to use standard settings.

qual_params <- TallyVariantsParam(
  genome = genome,
  minimum_mapq = 20)

var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)

Now we use callVariants function in accordance with our parameters we defined above.

called_variants <- callVariants(bam_file,
                                qual_params,
                                calling.filter = var_params)

head(called_variants)
## VRanges object with 6 ranges and 17 metadata columns:
##           seqnames    ranges strand         ref              alt     totalDepth
##              <Rle> <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle>
##   [1] NC_000017.10        64      *           G                T            759
##   [2] NC_000017.10        69      *           G                T            812
##   [3] NC_000017.10        70      *           G                T            818
##   [4] NC_000017.10        73      *           T                A            814
##   [5] NC_000017.10        77      *           T                A            802
##   [6] NC_000017.10        78      *           G                T            798
##             refDepth       altDepth   sampleNames softFilterMatrix | n.read.pos
##       <integerOrRle> <integerOrRle> <factorOrRle>         <matrix> |  <integer>
##   [1]            739             20          <NA>                  |         17
##   [2]            790             22          <NA>                  |         19
##   [3]            796             22          <NA>                  |         20
##   [4]            795             19          <NA>                  |         13
##   [5]            780             22          <NA>                  |         19
##   [6]            777             21          <NA>                  |         17
##       n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
##            <integer>       <integer>  <integer>      <integer>   <integer>
##   [1]             64             759         20            739           0
##   [2]             69             812         22            790           0
##   [3]             70             818         22            796           0
##   [4]             70             814         19            795           0
##   [5]             70             802         22            780           0
##   [6]             70             798         21            777           0
##       count.minus.ref count.del.plus count.del.minus read.pos.mean
##             <integer>      <integer>       <integer>     <numeric>
##   [1]               0              0               0       30.9000
##   [2]               0              0               0       40.7273
##   [3]               0              0               0       34.7727
##   [4]               0              0               0       36.1579
##   [5]               0              0               0       38.3636
##   [6]               0              0               0       39.7143
##       read.pos.mean.ref read.pos.var read.pos.var.ref     mdfne mdfne.ref
##               <numeric>    <numeric>        <numeric> <numeric> <numeric>
##   [1]           32.8755      318.558          347.804        NA        NA
##   [2]           35.4190      377.004          398.876        NA        NA
##   [3]           36.3442      497.762          402.360        NA        NA
##   [4]           36.2176      519.551          402.843        NA        NA
##   [5]           36.0064      472.327          397.070        NA        NA
##   [6]           35.9241      609.076          390.463        NA        NA
##       count.high.nm count.high.nm.ref
##           <integer>         <integer>
##   [1]            20               738
##   [2]            22               789
##   [3]            22               796
##   [4]            19               769
##   [5]            22               780
##   [6]            21               777
##   -------
##   seqinfo: 1 sequence from chr17_83k genome
##   hardFilters(4): nonRef nonNRef readCount likelihoodRatio

We have identified 6 variants.

Now, we move on to annotation and load the feature position information from gff.

get_annotated_regions_from_gff <- function(file_name){
  gff <- rtracklayer::import(file_name)
  as(gff, "GRanges")
}

Note you can also load this data from a bed file.

genes <- get_annotated_regions_from_gff("chr17.83k.gff3")

Now we can calculate which variants overlaps which genes on our chromosome.

overlaps <- GenomicRanges::findOverlaps(called_variants, genes)

overlaps
## Hits object with 12684 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]     35176           1
##       [2]     35176           2
##       [3]     35176           3
##       [4]     35177           1
##       [5]     35177           2
##       ...       ...         ...
##   [12680]     40944           2
##   [12681]     40944           7
##   [12682]     40945           1
##   [12683]     40945           2
##   [12684]     40945           7
##   -------
##   queryLength: 44949 / subjectLength: 8

Lastly subset genes with the list of overlaps.

identified <- genes[subjectHits(overlaps)]

Predicting ORF

First thing, let’s load the required packages.

library(Biostrings)
library(systemPipeR)
## Loading required package: ShortRead
## Loading required package: BiocParallel
## Loading required package: GenomicAlignments
## 
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:dplyr':
## 
##     last
## 
## Attaching package: 'ShortRead'
## The following object is masked from 'package:adegraphics':
## 
##     zoom
## The following objects are masked from 'package:locfit':
## 
##     left, right
## The following object is masked from 'package:ape':
## 
##     zoom
## The following object is masked from 'package:imager':
## 
##     clean
## The following object is masked from 'package:magrittr':
## 
##     functions
## The following object is masked from 'package:oligo':
## 
##     intensity
## The following objects are masked from 'package:oligoClasses':
## 
##     chromosome, position
## The following object is masked from 'package:affy':
## 
##     intensity
## The following object is masked from 'package:dplyr':
## 
##     id
## The following object is masked from 'package:purrr':
## 
##     compose
## The following object is masked from 'package:tibble':
## 
##     view
## 
## Attaching package: 'systemPipeR'
## The following object is masked from 'package:VariantAnnotation':
## 
##     reference

Let’s load the data into a DNAStrings object, in this case, Arabidopsis chloroplast genome.

dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa")

Now let’s predict the open rading frames with the predORF(), we’ll predict all ORF on both strands.

predict_orfs <- predORF(dna_object, n = 'all', type = 'gr', mode = 'ORF', strand = 'both',
                        longest_disjoint = TRUE)
head(predict_orfs)
## GRanges object with 6 ranges and 2 metadata columns:
##           seqnames      ranges strand | subject_id inframe2end
##              <Rle>   <IRanges>  <Rle> |  <integer>   <numeric>
##      1 chloroplast 86762-93358      + |          1           2
##   1162 chloroplast   2056-2532      - |          1           3
##      2 chloroplast 72371-73897      + |          2           2
##   1163 chloroplast 77901-78362      - |          2           1
##      3 chloroplast 54937-56397      + |          3           3
##   1164 chloroplast 20251-20691      - |          3           3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

This printed out a GRanges object in return, with 2,501 open reading frames. This is FAR too many reading frames.

To filter out erronous ORFs, we do a simulation. We first estimate the length an ORF can reach by chance. We will create a string of random nucleotides that is the length of our chroloplast genome and determine the longer ORF that can arise by chance.

bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")

Now we need to ensure that our random nucleotides match the proportion of nucleotides match in our chloroplast genome so we have no bias.

seq_length <- width(dna_object[1])
counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))})
probs <- (lapply(counts, function(base_count){signif(base_count/seq_length, 2)}))
probs
## [[1]]
## [1] 6.5e-06
## 
## [[2]]
## [1] 6.5e-06
## 
## [[3]]
## [1] 6.5e-06
## 
## [[4]]
## [1] 6.5e-06

Now we can build our function to stimulate a genome.

get_longest_orf_in_random_genome <- function(x,
                                             length = 1000,
                                             probs = c(0.25, 0.25, 0.25, 0.25),
                                             bases = c("A", "T", "G", "C")){
  # Here we crate our random genome and allow replacement for the next iteration
  random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
  random_dna_object <- DNAStringSet(random_genome)
  names(random_dna_object) <- c("random_dna_string")
  
  orfs <- predORF(random_dna_object, n = 1, type = 'gr', mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
  return(max(width(orfs)))
}

Now we use the function we just created to predict the ORF’s in 10 random genomes.

random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length = seq_length, probs = probs, bases = bases))

Let’s pull out the longest length from out 10 simulations.

longest_random_orf <- max(random_lengths)

Let’s only keep the frames that are longer in our chloroplast genome.

keep <- width(predict_orfs) > longest_random_orf

orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
## GRanges object with 9 ranges and 2 metadata columns:
##        seqnames        ranges strand | subject_id inframe2end
##           <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##   1 chloroplast   86762-93358      + |          1           2
##   2 chloroplast   72371-73897      + |          2           2
##   3 chloroplast   54937-56397      + |          3           3
##   4 chloroplast   57147-58541      + |          4           1
##   5 chloroplast   33918-35141      + |          5           1
##   6 chloroplast   32693-33772      + |          6           2
##   7 chloroplast 109408-110436      + |          7           3
##   8 chloroplast 114461-115447      + |          8           2
##   9 chloroplast 141539-142276      + |          9           2
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Write this data to file.

extacted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extacted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extacted_orfs, "saved_orfs.fa")

KaryoplotR

First let’s load the required packages.

library(karyoploteR)
## Loading required package: regioneR
library(GenomicRanges)

Now we need to set up the genome object for our karyotype.

genome_df <- data.frame(
  #First we dictate the number of chromosomes
  chr = paste0("chr", 1:5),
  start = rep(1, 5),
  # and then we will dictate the length of each chromosome
  end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)

Now we can convert the data frame to a granges object.

genome_gr <- makeGRangesFromDataFrame(genome_df)

Now let’s create some snp positions to map out. we do this by using the sample() function.

snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
  chr = paste0("chr", sample(1:5,25, replace = TRUE)),
  start = snp_pos,
  end = snp_pos
)

Again we convert the data frame to granges.

snps_gr <- makeGRangesFromDataFrame(snps)

Now let’s create some snps labels.

snp_labels <- paste0("snp_", 1:25)

Here we will set the margin for our plot.

plot.params <- getDefaultPlotParams(plot.type = 1)

Here we will set the margins of our plot.

plot.params$data1outmargin <- 600

Now let’s plot our snps.

kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.
## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
Figure 1: Chromosomes with snps.

Figure 1: Chromosomes with snps.

We can also add some numeric data to our plots, we will generate 100 random number that plot to 100 windows on chromosome 4.

numeric_data <- data.frame(
  y = rnorm(100, mean = 1, sd = 0.5),
  chr = rep("chr4", 100),
  start = seq(1,20862711, 20862711/100),
  end = seq(1,20862711, 20862711/100)
)

Now let’s make the data a granges object.

numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)

Again let’s set our plot parameters.

plot.params <- getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800

Let’s plot the data.

kp <- plotKaryotype(genome = genome_gr, plot.type = 2, plot.params = plot.params)
## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.
## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)
Figure 2: Chromosomes with snps and line graph.

Figure 2: Chromosomes with snps and line graph.