This document is the final project for the course Bioinformatics with R on Praxis AI.
All subjects and associated R projects that were worked on throughout this course have been compiled in this file.

1 Introduction to Bioinformatics with R

1.1 Simple operations

This is how R does addition:

12 + 6
## [1] 18

This is how R does subtraction:

12 - 6
## [1] 6

This is how we store data as avariable. We are going store the days of the week for this current example.

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

Let’s find the 5th entry.

days[5]
## [1] "Friday"

Let’s pull out a range of entries.

days[1:3]
## [1] "Monday"    "Tuesday"   "Wednesday"

Let’s pull out the end of the week.

days[5:7]
## [1] "Friday"   "Saturday" "Sunday"

Let’s pull out specific days.

days[c(1,3,5,7)]
## [1] "Monday"    "Wednesday" "Friday"    "Sunday"

Let’s subset our days variable and create weekdays.

weekdays <- days[1:5]

print(weekdays)
## [1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"

Let’s subset for the weekend.

weekend <- days[6:7]

print(weekend)
## [1] "Saturday" "Sunday"


1.2 Functions

A generic function in R may look like this: f(argument1, argument2, …)
Where f is the name of the function, and argument1, argument2,… are the different conditions that we are asking the function to evaluate.

exampleFunction <- function(x, y)
  {c(x + 1, y + 10)}

exampleFunction(2, 4)
## [1]  3 14

Examples of built-in functions of R:

Exponents:

exp(2)
## [1] 7.389056

Tangents:

tan(2)
## [1] -2.18504

Logs:

log(12)
## [1] 2.484907
log(x = 12, base = 4)
## [1] 1.792481

Shorter version og logs:

log(12, 4)
## [1] 1.792481

If you want the documentation of functions and packages in R, you can type ?functionname or ?packagename.

1.3 Data Structures

Let’s make an array and compare it to a regular list.

months <- array(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), dim = c(3, 4))

months
##      [,1]  [,2]  [,3]  [,4] 
## [1,] "Jan" "Apr" "Jul" "Oct"
## [2,] "Feb" "May" "Aug" "Nov"
## [3,] "Mar" "Jun" "Sep" "Dec"

The array function fills out a matrix with the values given and following the specified dimensions. Now let’s make a list for comparison. Then let’s see how to pull a value from the array.

months1 <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

months1
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
months[2, 3]
## [1] "Aug"

Let’s look at a matrix.

months2 <- matrix(data = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), nrow = 3, ncol = 4)

months2
##      [,1]  [,2]  [,3]  [,4] 
## [1,] "Jan" "Apr" "Jul" "Oct"
## [2,] "Feb" "May" "Aug" "Nov"
## [3,] "Mar" "Jun" "Sep" "Dec"

Remember to always look at the different arguments a function takes.

Let’s make a 3D array.

array3D <- array(c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36), dim = c(3, 3, 2))

array3D
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   20   26   32
## [2,]   22   28   34
## [3,]   24   30   36
array3D[1, 3, 2]
## [1] 32

The argument for array3D[] goes: First it looks at row, then find column, finally, it says look at the following layer.

If you want to pull an entire row or column from 3D:

array3D[2, , 1]
## [1]  4 10 16

First we say: row 2, all columns, layer 1.

1.4 List and Data Frames

HSPA1a <- list(gene = "HSPA1a", amino.acids = 641, nucleotides = 2400)

HSPA1a
## $gene
## [1] "HSPA1a"
## 
## $amino.acids
## [1] 641
## 
## $nucleotides
## [1] 2400

Let’s pull out just the amino acids.

HSPA1a$amino.acids
## [1] 641

Let’s pull out the nucleotides.

HSPA1a$nucleotides
## [1] 2400

Now let’s make 3 lists and combine them into a data frame.

gene <- c("HSPA4", "HSPA5", "HSPA8", "HSPA9", "HSPA1A", "HSPA1B")

nucleotides <- c(54537, 6491, 4648, 21646, 2400, 2517)

aminoAcids <- c(840, 654, 493, 679, 641, 641)

HSPs <- data.frame(gene, nucleotides, aminoAcids)

HSPs
##     gene nucleotides aminoAcids
## 1  HSPA4       54537        840
## 2  HSPA5        6491        654
## 3  HSPA8        4648        493
## 4  HSPA9       21646        679
## 5 HSPA1A        2400        641
## 6 HSPA1B        2517        641

Let’s pull out the genes.

HSPs$gene
## [1] "HSPA4"  "HSPA5"  "HSPA8"  "HSPA9"  "HSPA1A" "HSPA1B"

Let’s pull out the nucleotides.

HSPs$nucleotides
## [1] 54537  6491  4648 21646  2400  2517

Let’s seach for a specific amino acid count.

HSPs$aminoAcids[HSPs$gene == "HSPA8"]
## [1] 493

What we did was: Look at the amino acid column in HSPs dataframe, and pull out the value for gene HSP8.

1.5 Object Classes

Let’s quickly look at our gene vector and HSP dataframe.

gene
## [1] "HSPA4"  "HSPA5"  "HSPA8"  "HSPA9"  "HSPA1A" "HSPA1B"
HSPs
##     gene nucleotides aminoAcids
## 1  HSPA4       54537        840
## 2  HSPA5        6491        654
## 3  HSPA8        4648        493
## 4  HSPA9       21646        679
## 5 HSPA1A        2400        641
## 6 HSPA1B        2517        641

The function class() allows you to check the class of a value or set of values in a dataframe.

class(HSPs$gene)
## [1] "character"
HSPs$gene
## [1] "HSPA4"  "HSPA5"  "HSPA8"  "HSPA9"  "HSPA1A" "HSPA1B"
class(HSPs$nucleotides)
## [1] "numeric"
HSPs$nucleotides
## [1] 54537  6491  4648 21646  2400  2517
class(aminoAcids)
## [1] "numeric"
HSPs$aminoAcids
## [1] 840 654 493 679 641 641
x <- 15 + 38

x
## [1] 53
class(x)
## [1] "numeric"

With the function as.character, you can force R to store values as a character.

z <- as.character(c(1, 2, 3, 4, 5, 6))

z
## [1] "1" "2" "3" "4" "5" "6"
class(z)
## [1] "character"
y <- as.character(c(9, 8, 7, 6, 5, 4))

z + y
## Error in z + y: non-numeric argument to binary operator

z + y will give an error message, as math cannot be applied to characters. However, you can apply math to integers in a list.

z2 <- c(1, 2, 3, 4, 5, 6)

y2 <- c(9, 8, 7, 6, 5, 4)

z2 + y2
## [1] 10 10 10 10 10 10

Now let’s take a look at the different classes we have worked with:

class(z2)
## [1] "numeric"
class(HSPs)
## [1] "data.frame"
class(y)
## [1] "character"

A character can be converted to a numeric value.

y <- as.numeric(y)

class(y)
## [1] "numeric"


1.6 Models and Formulas

An example of a formula could be: \[y ~ x1 + x2\]
First, we load our example data and save it.

dataset <- iris

With large datasets, when you want to check if your data is there, it may be easier to call the head and tail (first 5 and last 5 rows).

head(dataset)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(dataset)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
dataset
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

As seen, a long table was printed with “dataset”. This can become cumbersome and unnecessary for a document. Henceforth, long tables will only be shown with the head and tail function. The original code for printing the table will still be presented.
Let’s look at the total number of rows in the dataset.

nrow(dataset)
## [1] 150

Let’s look at the number of columns in our dataset.

ncol(dataset)
## [1] 5

Let’s start with a simple linear model.

petals.lm <- lm(formula = Petal.Length ~ Petal.Width, data = dataset)

petals.lm
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = dataset)
## 
## Coefficients:
## (Intercept)  Petal.Width  
##       1.084        2.230
summary(petals.lm)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Summary gives a better overview of the statistics for the linear regression.

1.7 Charts and Graphs

First, let’s call the list of names in the iris dataset.

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Let’s make a histogram. This shows the frequency distribution.

hist(iris$Sepal.Length)


Let’s increase the number of bins in our histogram.

hist(iris$Sepal.Length, breaks = 25)


Let’s add some labels.

hist(iris$Sepal.Length, breaks = 25, xlab = "Sepal Length", main = "Sepal Length Frequency")


Let’s make a scatter plot of the regression analysis of length and width.

plot(iris$Sepal.Length ~ iris$Sepal.Width, xlab = "Sepal Length", ylab = "Sepal Width")


Let’s make a lattice dotplot.
First, let’s load the required package:

library(lattice)

Here we can separate by species.

dotplot(Sepal.Length ~ Sepal.Width|Species, data = iris)


Let’s do the same for petals.

dotplot(Petal.Length ~ Petal.Width|Species, data = dataset)



2 Data Wrangling with R

With the basics down, we need to work on wrangling data to make it neat and tidy.

2.1 Basic Wrangling


First, let’s load the required libraries and look at the top of the data.

library(tidyverse)
??flights

my_data <- nycflights13::flights

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

First, we will just look at the data on October 14th.

filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we want to subset this into a new variable, we do the following:

oct_14_flight <- filter(my_data, month == 10, day == 14)

What if you want to do both; print and save the variable?

(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If you want to filter based on different operators, you can use the following:

Equals: ==
Not equal: to !=
Greater than: >
Less than: <
Greater than or equal to: >=
Less than or equal to: <=

Let’s try it:

(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we don’t use the == to mean equals, we get this:

(oct_14_flight_2 <- filter(my_data, month = 10, day = 14))
## Error in `filter()`:
## ! We detected a named input.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `month == 10`?

In addition, the error message in the console will start with:
Quitting from lines x-y [unnamed-chunk-z] (Filename)

And end with:

Backtrace:
1. dplyr::filter(my_data, month = 10, day = 14)
2. dplyr:::filter.data.frame(my_data, month = 10, day = 14)
Execution halted

You can also use logical operators to be more selective:

and: &
or: |
not: !

Let’s use the “or” function to pick flights in march and april.
Then let’s use the “and” function to pick a day in a month.
And finally let’s use the “not” function.

March_April_Flights <- filter(my_data, month == 3 | month == 4)

March_4th_Flights <- filter(my_data, month == 3 & day == 4)

Non_jan_Flights <- filter(my_data, month != 1)


2.1.1 Arrange

Arrange allows us to arrange the dataset based on the variables we desire.

arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can also do this in descending order.

descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the dataframe regardless of ascending or descending.

2.1.2 Select

We can also select specific columns that we want to look at.

calendar <- select(my_data, year, month, day)
print(calendar)
## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

We can also look at a range of columns.

calendar2 <- select(my_data, year:day)

Lets look at all columns months through carrier.

calendar3 <- select(my_data, year:carrier)

We can also choose which columns NOT to include.

everything_else <- select(my_data, -(year:day))

In this instance, we can also use the NOT operator.

everything_else2 <- select(my_data, !(year:day))

There are also some other helper function that can help you select the columns or data you’re looking for.

starts_with(“xyz”) - will select the values that start with xyz.
ends_with(“xyz) - will select the values that end with xyz.
contains(”xyz”) - will select the values that contain xyz.
matches(“xyz”) - will select the values that are identical to xyz.

2.1.3 Renaming

Let’s look at the data and rename a column.

head(my_data)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
rename(my_data, departure_time = dep_time)
## # A tibble: 336,776 × 19
##     year month   day departure_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>          <int>          <int>     <dbl>    <int>
##  1  2013     1     1            517            515         2      830
##  2  2013     1     1            533            529         4      850
##  3  2013     1     1            542            540         2      923
##  4  2013     1     1            544            545        -1     1004
##  5  2013     1     1            554            600        -6      812
##  6  2013     1     1            554            558        -4      740
##  7  2013     1     1            555            600        -5      913
##  8  2013     1     1            557            600        -3      709
##  9  2013     1     1            557            600        -3      838
## 10  2013     1     1            558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
my_data2 <- rename(my_data, departure_time = dep_time)


2.1.4 Mutate

What if you want to add new columns to your data frame? We have the mutate() function for that.

First, we make a smaller data frame so we can see what we are doing.

my_data_small <- select(my_data, year:day, distance, air_time)

Let’s calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)
## # A tibble: 336,776 × 6
##     year month   day distance air_time speed
##    <int> <int> <int>    <dbl>    <dbl> <dbl>
##  1  2013     1     1     1400      227  370.
##  2  2013     1     1     1416      227  374.
##  3  2013     1     1     1089      160  408.
##  4  2013     1     1     1576      183  517.
##  5  2013     1     1      762      116  394.
##  6  2013     1     1      719      150  288.
##  7  2013     1     1     1065      158  404.
##  8  2013     1     1      229       53  259.
##  9  2013     1     1      944      140  405.
## 10  2013     1     1      733      138  319.
## # ℹ 336,766 more rows
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

What if we wanted to create a new dataframe with ONLY your calculations (transmute).

airspeed <- transmute(my_data_small, speed = distance / air_time * 60, speed2 = distance / air_time)


2.1.5 Summarize and by_group()

We can use summarize to run a function on a data column to get a single return.

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

So we can see here that the average delay is about 12 minutes.

We gain additional value in summarize by pairing it with by_group().

by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ℹ 355 more rows

As you can see, we now have the delay by the days of the year.

2.1.6 Missing Data

What happens if we don’t tell R what to do with the missing data?

summarize(by_day, delay = mean(dep_delay))
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ℹ 355 more rows

We can also filter our data based on NA (which in this dataset was cancelled flights).

not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

is.na tests the statement of if data is NA by TRUE or FALSE.

is.na(airspeed)
head(is.na(airspeed))
##      speed speed2
## [1,] FALSE  FALSE
## [2,] FALSE  FALSE
## [3,] FALSE  FALSE
## [4,] FALSE  FALSE
## [5,] FALSE  FALSE
## [6,] FALSE  FALSE
tail(is.na(airspeed))
##           speed speed2
## [336771,]  TRUE   TRUE
## [336772,]  TRUE   TRUE
## [336773,]  TRUE   TRUE
## [336774,]  TRUE   TRUE
## [336775,]  TRUE   TRUE
## [336776,]  TRUE   TRUE

Let’s run summarize again on this data:

summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6


2.1.7 Counts

We can also count the amount of variables that are NA.

sum(is.na(my_data$dep_delay))
## [1] 8255

We can also count the numbers that are NOT NA.

sum(!is.na(my_data$dep_delay))
## [1] 328521


2.1.8 Piping

With tibble datasets, we can pipe results to get rid of the need to use the dollarsigns.
We can then summarize the number of flights by minutes delayed.

my_data %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_time, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 1385.
##  2  2013     1     2 1354.
##  3  2013     1     3 1357.
##  4  2013     1     4 1348.
##  5  2013     1     5 1326.
##  6  2013     1     6 1399.
##  7  2013     1     7 1341.
##  8  2013     1     8 1335.
##  9  2013     1     9 1326.
## 10  2013     1    10 1333.
## # ℹ 355 more rows


2.2 Tibbles

Let’s load the required package.

library(tibble)

Now we will take the time to explore tibbles. Tibbles are modified data frames which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.

as_tibble(iris)
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

As we can see, we have the same data frame, but we have different features.

You can also create a tibble from scratch with tibble():

tibble(
  x = 1:5,
  y = 1,
  z = x ^2 + y
)
## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

You can also use tribble() for basic data table creation.

tribble(
  ~genea, ~geneb, ~genec,
  #######################
  110,      112,     114,
  6,        5,       4
)
## # A tibble: 2 × 3
##   genea geneb genec
##   <dbl> <dbl> <dbl>
## 1   110   112   114
## 2     6     5     4

Tibbles are built to not overwhelm your console when printing data, only showing the first few lines.

Let’s compare how tibbles and data frames print:

print(by_day)
## # A tibble: 336,776 × 19
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
as.data.frame(by_day)
head(by_day)
## # A tibble: 6 × 19
## # Groups:   year, month, day [1]
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

As you can see when you run the code “as.data.frame(by_day)”, the entire data frame is printed, which is overwhelming in the console and in a document.
head() is kind of what tibble() does by default.

We can specify with tibbles how many rows we want to print:

nycflights13::flights %>%
  print(n = 10, width = Inf)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
##    arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
##        <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
##  2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
##  3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
##  4       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
##  5       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
##  6        12 UA        1696 N39463  EWR    ORD        150      719     5     58
##  7        19 B6         507 N516JB  EWR    FLL        158     1065     6      0
##  8       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0
##  9        -8 B6          79 N593JB  JFK    MCO        140      944     6      0
## 10         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0
##    time_hour          
##    <dttm>             
##  1 2013-01-01 05:00:00
##  2 2013-01-01 05:00:00
##  3 2013-01-01 05:00:00
##  4 2013-01-01 05:00:00
##  5 2013-01-01 06:00:00
##  6 2013-01-01 05:00:00
##  7 2013-01-01 06:00:00
##  8 2013-01-01 06:00:00
##  9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # ℹ 336,766 more rows


2.2.1 Subsetting

Subsetting tibbles is easy, similar to data frames.

df_tibble <- tibble(nycflights13::flights)

df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can subset by column name using the $:

df_tibble$carrier
head(df_tibble$carrier)
## [1] "UA" "UA" "AA" "B6" "DL" "UA"

We can subset by position using []

df_tibble[[2]]
head(df_tibble[[2]])
## [1] 1 1 1 1 1 1

If you want to use this in a pipe, you need to use the “.” placeholder.

df_tibble %>%
  .$carrier
df_tibble %>%
  .$carrier %>%
  head()
## [1] "UA" "UA" "AA" "B6" "DL" "UA"

Some older functions do not like tibbles, thus you might have to convert them back to a data frame.

class(df_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"
df_tibble2 <- as.data.frame(df_tibble)

class(df_tibble2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble2)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##             time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00


2.3 TidyR

Let’s load the required package.

library(tidyverse)

How do we make a tidy dataset? Well the tidyverse follows three rules:

1 - Each variable must have its own column.
2 - Each observation has its own row.
3 - Each value has its own cell.

It is impossible to satisfy two of the three rules.

This leads to the following instructions for tidy data:

1 - Put each dataset into a tibble.
2 - Put each variable into a column.
3 - Profit

Picking one consistent method of data storage makes for easier understanding of your code and what is happening ” under the hood” or behind the scenes.

Let’s now look at working with tibbles, calculating and adding bmi to the tibble as an example:

bmi <- tibble(women)

bmi %>%
  mutate(bmi = (703 * weight)/(height)^2)
## # A tibble: 15 × 3
##    height weight   bmi
##     <dbl>  <dbl> <dbl>
##  1     58    115  24.0
##  2     59    117  23.6
##  3     60    120  23.4
##  4     61    123  23.2
##  5     62    126  23.0
##  6     63    129  22.8
##  7     64    132  22.7
##  8     65    135  22.5
##  9     66    139  22.4
## 10     67    142  22.2
## 11     68    146  22.2
## 12     69    150  22.1
## 13     70    154  22.1
## 14     71    159  22.2
## 15     72    164  22.2


2.3.1 Gathering

Sometimes you will find datasets that don’t fit well into a tibble.

We’ll use the built-in data from tidyverse for this part.

table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

As you can see from this data, we have one variable in column A (country), but columns b and c are two of the same. Thus, there are two observations in each row.

To fix this we can use the gather function:

table4a %>%
  gather('1999', '2000', key = 'year', value = 'cases')
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Let’s look at another example.

table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

As you can see, we have the same problem in table 4b.

table4b %>%
  gather('1999', '2000', key = 'year', value = 'population')
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <dbl>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

Now what if we want to join these two tables? We can use dplyr.

table4a <- table4a %>%
  gather('1999', '2000', key = 'year', value = 'cases')

table4b <- table4b %>%
  gather('1999', '2000', key = 'year', value = 'population')

left_join(table4a, table4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <dbl>      <dbl>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583


2.3.2 Spreading

Spreading is the opposite of gathering.
Let’s look at table 2

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

You can see that we have redundant info in columns 1 and 2.
We can fix that by combining rows 1&2, 3&4, etc.

spread(table2, key = type, value = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Type is the key of what we are returning into columns, the value is what becomes rows/observations.

In summary, spread makes long tables shorter and wider, and gather makes wide tables narrower and longer.

2.3.3 Separating and Pull

Now what happens if we have two observations in one column?

table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

As you can see, the rate is just the population and cases combined.
We can use separate to fix this.

table3 %>%
  separate(rate, into = c('cases', 'population'))
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

However, if you notice, the columns’ type is not correct.
We can easily fix this with the convert command:

table3 %>%
  separate(rate, into = c('cases', 'population'), convert = TRUE)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can actually specify what you want to separate based on.

table3 %>%
  separate(rate, into = c('cases', 'population'), sep = '/', convert = TRUE)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Let’s make this code look more tidy:

table3 %>%
  separate(
    rate,
    into = c('cases', 'population'),
    convert = TRUE,
    sep = '/'
  )

We can also separate based on number of digits:

table3 %>%
  separate(
    year,
    into = c('century', 'year'),
    convert = TRUE,
    sep = 2
  )
## # A tibble: 6 × 4
##   country     century  year rate             
##   <chr>         <int> <int> <chr>            
## 1 Afghanistan      19    99 745/19987071     
## 2 Afghanistan      20     0 2666/20595360    
## 3 Brazil           19    99 37737/172006362  
## 4 Brazil           20     0 80488/174504898  
## 5 China            19    99 212258/1272915272
## 6 China            20     0 213766/1280428583


2.3.4 Unite

What happens if we want to do the inverse of separate?
We can use the unite() function for that:

table5
## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
table5 %>%
  unite(data, century, year)
## # A tibble: 6 × 3
##   country     data  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583
table5 %>%
  unite(data, century, year, sep = "")
## # A tibble: 6 × 3
##   country     data  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583


2.4 Missing values

There can be two types of missing values: NA (explicit) or just entry (implicit).

gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1, 2, 3, 4, 2, 3, 4)
)

gene_data
## # A tibble: 7 × 3
##   gene    nuc   run
##   <chr> <dbl> <dbl>
## 1 a        20     1
## 2 a        22     2
## 3 a        24     3
## 4 a        25     4
## 5 b        NA     2
## 6 b        42     3
## 7 b        67     4

The nucleotide count for gene b, run 2, is explicitly missing.
The nucleotide count for gene b, run 1, is implicitly missing.

One way we can make implicit values explicit, is by putting runs into columns.

gene_data %>%
  spread(gene, nuc)
## # A tibble: 4 × 3
##     run     a     b
##   <dbl> <dbl> <dbl>
## 1     1    20    NA
## 2     2    22    NA
## 3     3    24    42
## 4     4    25    67

If we want to remove the missing values, we can use spread and gather, and na.rm = TRUE

gene_data %>%
  spread(gene, nuc) %>%
  gather(gene, nuc, 'a':'b', na.rm = TRUE)
## # A tibble: 6 × 3
##     run gene    nuc
##   <dbl> <chr> <dbl>
## 1     1 a        20
## 2     2 a        22
## 3     3 a        24
## 4     4 a        25
## 5     3 b        42
## 6     4 b        67

This is a way to deal with NAs.

Another way that we can make implicit values explicit, is with complete() from the TidyR package.

gene_data %>%
  complete(gene, run)
## # A tibble: 8 × 3
##   gene    run   nuc
##   <chr> <dbl> <dbl>
## 1 a         1    20
## 2 a         2    22
## 3 a         3    24
## 4 a         4    25
## 5 b         1    NA
## 6 b         2    NA
## 7 b         3    42
## 8 b         4    67

Sometimes an NA is present to represent a value being carried forward:

treatment <- tribble(
  ~ person,        ~ treatment,      ~ response,
  ##############################################
  "Isaac",              1,                7,
  NA,                   2,                10,
  NA,                   3,                9,
  "VDB",                1,                8,
  NA,                   2,                11,
  NA,                   3,                10,
)

treatment
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Isaac          1        7
## 2 <NA>           2       10
## 3 <NA>           3        9
## 4 VDB            1        8
## 5 <NA>           2       11
## 6 <NA>           3       10

What we can do here is use the fill() option.

treatment %>%
  fill(person)
## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 Isaac          1        7
## 2 Isaac          2       10
## 3 Isaac          3        9
## 4 VDB            1        8
## 5 VDB            2       11
## 6 VDB            3       10


2.5 DPLYR

It is rare that you will be working with a single data table. The dplyr package allows you to join two data tables based on common values.

Mutate joins - add new variables to one data frame from the matching observations in another.
Filtering joins - filters observation from one data frame based on whether or not they are present in another.
Set operations - treats observations as they are set elements.

Let’s load the required packages.

library(tidyverse)
library(nycflights13)

Let’s pull full carrier names based on letter codes.

airlines
## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

Let’s get info about airports.

airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,448 more rows

Let’s get some info on the weather at the airports.

weather
## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

Let’s get info on singular flights

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Let’s look at how these tables connect:

Flights -> planes based on tailnumber.
Flights -> airlines through carrier.
Flights -> airports origin AND destination.
Flights -> weather via origin, year/month/day/hour.

2.5.1 Keys

Keys are unique identifiers per observation.
A primary key uniquely identifies an observation in its own table.

One way to identify a primary key is as follows:

planes %>%
  count(tailnum) %>%
  filter(n>1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>

This indicates that the tailnumber is unique.

Sometimes there are no unique identifiers.

planes
## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows
planes %>%
  count(model) %>%
  filter(n>1)
## # A tibble: 79 × 2
##    model       n
##    <chr>   <int>
##  1 717-200    88
##  2 737-301     2
##  3 737-3G7     2
##  4 737-3H4   105
##  5 737-3K2     2
##  6 737-3L9     2
##  7 737-3Q8     5
##  8 737-3TO     2
##  9 737-401     4
## 10 737-4B7    18
## # ℹ 69 more rows


2.5.2 Mutate join

We can mutate a data frame and join two data frames together.
In addition, we can remove columns with the select() function:

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)

flights2
## # A tibble: 336,776 × 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ℹ 336,766 more rows
flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = 'carrier')
## # A tibble: 336,776 × 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ℹ 336,766 more rows

We have now added the airline name to our data frame from the airline data frame.

Other types of joins:

Inner joins (inner_join) matches a pair of observations when their key is equal.
Outer joins (outer_join) keeps observations that appear in at least one table.


2.6 StringR

Let’s load the required packages.

library(tidyverse)
library(stringr)

You can create strings using single or double quotes.

string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite'

string1
## [1] "this is a string"
string2
## [1] "to put a \"quote\" in your string, use the opposite"

If you forget to close your string, you’ll get this:

string3 <- “where is this string going?

Error: :1:12: unexpected INCOMPLETE_STRING
1: string3 <- “where is this string going?
^ Just hit escape to stop the console in R from running and try again.

Multiple strings can be stored in character vectors.

string4 <- c("one", "two", "three")

string4
## [1] "one"   "two"   "three"

Measuring string length can be done for a single string, as well as for vectors with multiple strings:

str_length(string2)
## [1] 49
str_length(string4)
## [1] 3 3 5

Let’s combine two strings.

str_c("X", "Y")
## [1] "XY"
str_c(string1, string2)
## [1] "this is a stringto put a \"quote\" in your string, use the opposite"

You can use sep to control how they are separated.

str_c(string1, string2, sep = " ")
## [1] "this is a string to put a \"quote\" in your string, use the opposite"
str_c("x", "y", "z", sep = "_")
## [1] "x_y_z"


2.6.1 Subsetting strings

You can subset a string using str_sub():

HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4, 6)
## [1] "123" "234" "456"

This just drops the first four letters from the strings.

Or you can use negatives to count back from the end.

str_sub(HSP, -3, -1)
## [1] "123" "234" "456"

You can convert the cases of strings as follows:

HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_lower(HSP)
## [1] "hsp123" "hsp234" "hsp456"

str_to_upper() does the opposite.

2.6.2 Regular Expression

Let’s install a package that helps us visualize what we are doing:

install.packages("htmlwidgets")
x <- c('ATTAGA', 'CGCCCCGGAT', 'TATTA')

str_view(x, 'G')
## [1] │ ATTA<G>A
## [2] │ C<G>CCCC<G><G>AT
str_view(x, 'TA')
## [1] │ AT<TA>GA
## [3] │ <TA>T<TA>

The next step is, “.” where the “.” matches an entry.

str_view(x, ".G.")
## [1] │ ATT<AGA>
## [2] │ <CGC>CC<CGG>AT

This is great for TF binding positions, which can be quite flexible.

Anchors allow you to match entries at the start or the ending.

str_view(x, "^TA")
## [3] │ <TA>TTA
str_view(x, "TA$")
## [3] │ TAT<TA>


2.6.3 Character classes/alternatives

\d matches any digit.
\s matches any space.
[abc] matches a, b, or c. 

str_view(x, "TA[GT]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA

[^anc] matches anything BUT a, b , or c

str_view(x, "TA[^T]")
## [1] │ AT<TAG>A

You can also use | to pick between two alternatives

str_view(x, "TA[G|T]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA


2.6.4 Detect matches

str_detect() returns a logical vector the same length of input.

y <- c("apple,", "banana", "pear")
y
## [1] "apple," "banana" "pear"
str_detect(y, "e")
## [1]  TRUE FALSE  TRUE

How many common words contain the letter “e”.

words
head(words)
## [1] "a"        "able"     "about"    "absolute" "accept"   "account"
sum(str_detect(words, "e"))
## [1] 536

Let’s get more complex, how many words start with a vowel, and how many end in a vowel?

sum(str_detect(words, "[aeiou]$"))
## [1] 271
sum(str_detect(words, "^[aeiou]"))
## [1] 175

Let’s find all the words that don’t contain “o” or “u”.

no_o <- !str_detect(words, "[ou]")

no_o
head(no_o)
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

Now let’s extract.

words[!str_detect(words, "[ou]")]
##   [1] "a"          "able"       "accept"     "achieve"    "act"       
##   [6] "active"     "add"        "address"    "admit"      "advertise" 
##  [11] "affect"     "after"      "again"      "against"    "age"       
##  [16] "agent"      "agree"      "air"        "all"        "already"   
##  [21] "alright"    "always"     "america"    "and"        "answer"    
##  [26] "any"        "apart"      "apparent"   "appear"     "apply"     
##  [31] "area"       "arm"        "arrange"    "art"        "as"        
##  [36] "ask"        "at"         "attend"     "available"  "aware"     
##  [41] "away"       "baby"       "back"       "bad"        "bag"       
##  [46] "balance"    "ball"       "bank"       "bar"        "base"      
##  [51] "basis"      "be"         "bear"       "beat"       "bed"       
##  [56] "begin"      "behind"     "believe"    "benefit"    "best"      
##  [61] "bet"        "between"    "big"        "bill"       "birth"     
##  [66] "bit"        "black"      "break"      "brief"      "brilliant" 
##  [71] "bring"      "britain"    "by"         "cake"       "call"      
##  [76] "can"        "car"        "card"       "care"       "carry"     
##  [81] "case"       "cat"        "catch"      "cent"       "centre"    
##  [86] "certain"    "chair"      "chairman"   "chance"     "change"    
##  [91] "chap"       "character"  "charge"     "cheap"      "check"     
##  [96] "child"      "Christ"     "Christmas"  "city"       "claim"     
## [101] "class"      "clean"      "clear"      "client"     "create"    
## [106] "dad"        "danger"     "date"       "day"        "dead"      
## [111] "deal"       "dear"       "debate"     "decide"     "deep"      
## [116] "definite"   "degree"     "department" "depend"     "describe"  
## [121] "design"     "detail"     "die"        "difference" "dinner"    
## [126] "direct"     "district"   "divide"     "draw"       "dress"     
## [131] "drink"      "drive"      "dry"        "each"       "early"     
## [136] "east"       "easy"       "eat"        "effect"     "egg"       
## [141] "eight"      "either"     "elect"      "electric"   "eleven"    
## [146] "else"       "end"        "engine"     "english"    "enter"     
## [151] "especial"   "even"       "evening"    "ever"       "every"     
## [156] "evidence"   "exact"      "example"    "except"     "exercise"  
## [161] "exist"      "expect"     "expense"    "experience" "explain"   
## [166] "express"    "extra"      "eye"        "face"       "fact"      
## [171] "fair"       "fall"       "family"     "far"        "farm"      
## [176] "fast"       "father"     "feed"       "feel"       "few"       
## [181] "field"      "fight"      "file"       "fill"       "film"      
## [186] "final"      "finance"    "find"       "fine"       "finish"    
## [191] "fire"       "first"      "fish"       "fit"        "five"      
## [196] "flat"       "fly"        "france"     "free"       "friday"    
## [201] "friend"     "game"       "garden"     "gas"        "general"   
## [206] "germany"    "get"        "girl"       "give"       "glass"     
## [211] "grand"      "grant"      "great"      "green"      "hair"      
## [216] "half"       "hall"       "hand"       "hang"       "happen"    
## [221] "happy"      "hard"       "hate"       "have"       "he"        
## [226] "head"       "health"     "hear"       "heart"      "heat"      
## [231] "heavy"      "hell"       "help"       "here"       "high"      
## [236] "hit"        "idea"       "identify"   "if"         "imagine"   
## [241] "in"         "increase"   "indeed"     "inside"     "instead"   
## [246] "interest"   "invest"     "it"         "item"       "keep"      
## [251] "key"        "kid"        "kill"       "kind"       "king"      
## [256] "kitchen"    "lad"        "lady"       "land"       "large"     
## [261] "last"       "late"       "law"        "lay"        "lead"      
## [266] "learn"      "leave"      "left"       "leg"        "less"      
## [271] "let"        "letter"     "level"      "lie"        "life"      
## [276] "light"      "like"       "likely"     "limit"      "line"      
## [281] "link"       "list"       "listen"     "little"     "live"      
## [286] "machine"    "main"       "make"       "man"        "manage"    
## [291] "many"       "mark"       "market"     "marry"      "match"     
## [296] "matter"     "may"        "maybe"      "mean"       "meaning"   
## [301] "meet"       "member"     "middle"     "might"      "mile"      
## [306] "milk"       "mind"       "minister"   "miss"       "mister"    
## [311] "mrs"        "name"       "near"       "necessary"  "need"      
## [316] "never"      "new"        "news"       "next"       "nice"      
## [321] "night"      "nine"       "pack"       "page"       "paint"     
## [326] "pair"       "paper"      "paragraph"  "parent"     "park"      
## [331] "part"       "party"      "pass"       "past"       "pay"       
## [336] "pence"      "per"        "percent"    "perfect"    "perhaps"   
## [341] "pick"       "piece"      "place"      "plan"       "play"      
## [346] "please"     "practise"   "prepare"    "present"    "press"     
## [351] "pretty"     "price"      "print"      "private"    "rail"      
## [356] "raise"      "range"      "rate"       "rather"     "read"      
## [361] "ready"      "real"       "realise"    "really"     "receive"   
## [366] "recent"     "red"        "refer"      "regard"     "remember"  
## [371] "represent"  "research"   "respect"    "rest"       "rid"       
## [376] "right"      "ring"       "rise"       "safe"       "sale"      
## [381] "same"       "save"       "say"        "scheme"     "science"   
## [386] "seat"       "secretary"  "see"        "seem"       "self"      
## [391] "sell"       "send"       "sense"      "separate"   "serve"     
## [396] "service"    "set"        "settle"     "seven"      "sex"       
## [401] "shall"      "share"      "she"        "sheet"      "sick"      
## [406] "side"       "sign"       "similar"    "simple"     "since"     
## [411] "sing"       "single"     "sir"        "sister"     "sit"       
## [416] "site"       "six"        "size"       "sleep"      "slight"    
## [421] "small"      "space"      "speak"      "special"    "specific"  
## [426] "speed"      "spell"      "spend"      "staff"      "stage"     
## [431] "stairs"     "stand"      "standard"   "start"      "state"     
## [436] "stay"       "step"       "stick"      "still"      "straight"  
## [441] "strategy"   "street"     "strike"     "switch"     "system"    
## [446] "table"      "take"       "talk"       "tape"       "tax"       
## [451] "tea"        "teach"      "team"       "tell"       "ten"       
## [456] "tend"       "term"       "terrible"   "test"       "than"      
## [461] "thank"      "the"        "then"       "there"      "they"      
## [466] "thing"      "think"      "thirteen"   "thirty"     "this"      
## [471] "three"      "tie"        "time"       "trade"      "traffic"   
## [476] "train"      "travel"     "treat"      "tree"       "try"       
## [481] "twelve"     "twenty"     "type"       "very"       "view"      
## [486] "village"    "visit"      "wage"       "wait"       "walk"      
## [491] "wall"       "want"       "war"        "warm"       "wash"      
## [496] "waste"      "watch"      "water"      "way"        "we"        
## [501] "wear"       "wednesday"  "wee"        "week"       "weigh"     
## [506] "well"       "west"       "what"       "when"       "where"     
## [511] "whether"    "which"      "while"      "white"      "why"       
## [516] "wide"       "wife"       "will"       "win"        "wind"      
## [521] "wish"       "with"       "within"     "write"      "year"      
## [526] "yes"        "yesterday"  "yet"

This is useful for getting rid of all string for HSP that do not have TF binding site etc.

You can also use str_count() to say how many matches there are in string.

x
## [1] "ATTAGA"     "CGCCCCGGAT" "TATTA"
str_count(x, "[GC]")
## [1] 1 8 0

Let’s couple this with mutate:

df <- tibble(
  word = words, 
  i = seq_along(words)
)

df
## # A tibble: 980 × 2
##    word         i
##    <chr>    <int>
##  1 a            1
##  2 able         2
##  3 about        3
##  4 absolute     4
##  5 accept       5
##  6 account      6
##  7 achieve      7
##  8 across       8
##  9 act          9
## 10 active      10
## # ℹ 970 more rows
df %>%
  mutate(
    vowels = str_count(words, "[aeiou]"),
    consonants = str_count(words, "[^aeiou]")
  )
## # A tibble: 980 × 4
##    word         i vowels consonants
##    <chr>    <int>  <int>      <int>
##  1 a            1      1          0
##  2 able         2      2          2
##  3 about        3      3          2
##  4 absolute     4      4          4
##  5 accept       5      2          4
##  6 account      6      3          4
##  7 achieve      7      4          3
##  8 across       8      2          4
##  9 act          9      1          2
## 10 active      10      3          3
## # ℹ 970 more rows



3 Microarrays

Let’s work with microarrays!

First, let’s set the working directory:

# getwd("Home/Desktop/classroom/myfiles/Microarrays")

Not necessary on the virtual machine.

This code was updated to produce the KEGG pathview files. The old code was no longer able to produce the files. The old code has been hashed out to show as an example.

Let’s install additional packages:

BiocManager::install("KEGGREST", update = FALSE, force = TRUE)

BiocManager::install("affy")

Now let’s load the required libraries.

library(ath1121501.db)
library(ath1121501cdf)
library(annotate)
library(affy)
library(limma)
library(oligo)
library(dplyr)
library(stats)
library(reshape)

# Load the extra library:
library(KEGGREST)

Now read the cel files into the directory:

targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")

ab <- ReadAffy()

ReadAffy() reads the cel files from a folder and calculates the expression.

Let’s take a quick look at the data:

hist(ab)


3.1 Normalizing the microarray experiments.

eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
##                                          sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL      1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL      2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL      3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL       4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL       5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL       6

Let’s visualize the normalized data.

hist(eset)


Let’s characterize the data a bit.

ID <- featureNames(eset)

Here, we take the following symbols for the IDs from the Arabidopsis database.

Symbol <- getSYMBOL(ID, "ath1121501.db")

affydata <- read.delim("AFFY_ATH1_array_elements.txt")


3.2 Differential Gene Expression Analysis

Flight vs Ground

condition <- targets$gravity

design <- model.matrix(~factor(condition))
colnames(design) <- c("Flight", "Ground")

fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef = 2, n = Inf, adjust = "fdr")

head(top)
##           logFC AveExpr   t P.Value adj.P.Val   B
## 264005_at  -3.5     4.3 -20 1.4e-07    0.0022 7.6
## 266790_at   3.5     3.9  19 2.1e-07    0.0022 7.4
## 266865_at   3.3     5.1  17 4.6e-07    0.0022 6.8
## 248337_at  -7.0     6.1 -16 7.7e-07    0.0022 6.4
## 252543_at   2.1     4.2  16 8.3e-07    0.0022 6.3
## 248179_at   2.1     4.5  15 1.1e-06    0.0022 6.1

Let’s combine annotations.

Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
                    KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
                    GO = sapply(contents(ath1121501GO), paste, collapse = ", "),
                    SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "),
                    ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))

Let’s merge all the data into one dataframe.

all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)

all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")

Let’s turn the ACCNUM into a character value.

all2$ACCNUM <- as.character(all2$ACCNUM)

Now, let’s write it into a csv file.

write.table(all2, file = "BRIC16_Final.csv", row.names = TRUE, col.names = TRUE, sep = "\t")

Let’s look at the columns in the org.tair database.

columns(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"

Let’s add a column to the all2 data frame from the org.tair database.

all2$entrez <- mapIds(org.At.tair.db,
                      keys = all2$ACCNUM,
                      column = "ENTREZID",
                      keytype = "TAIR",
                      multiVals = "first")

Let’s look at the top of the all2 data frame.

head(all2, 10)
##    Row.names                 GENE KEGG
## 1  244903_at hypothetical protein   NA
## 2  244904_at hypothetical protein   NA
## 3  244906_at hypothetical protein   NA
## 4  244907_at hypothetical protein   NA
## 5  244908_at hypothetical protein   NA
## 6  244911_at hypothetical protein   NA
## 7  244913_at hypothetical protein   NA
## 8  244914_at hypothetical protein   NA
## 9  244916_at hypothetical protein   NA
## 10 244917_at hypothetical protein   NA
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     GO
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005575", Evidence = "ND", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 8  list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005634", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
##     SYMBOL    ACCNUM  logFC AveExpr     t P.Value adj.P.Val    B
## 1   ORF149 ATMG00660 -1.214     7.8 -6.20  0.0004     0.018  0.5
## 2   ORF275 ATMG00670 -0.568     3.2 -2.07  0.0758     0.300 -4.9
## 3  ORF240A ATMG00690  0.055     9.1  0.18  0.8586     0.947 -6.8
## 4   ORF120 ATMG00710 -0.883     4.3 -2.91  0.0220     0.154 -3.7
## 5  ORF107D ATMG00720 -0.463     2.2 -1.73  0.1262     0.396 -5.4
## 6   ORF170 ATMG00820 -0.193     2.7 -0.72  0.4948     0.749 -6.5
## 7  ORF121B ATMG00840 -0.339     1.8 -1.18  0.2764     0.574 -6.1
## 8  ORF107E ATMG00850 -0.305     5.5 -1.30  0.2338     0.534 -6.0
## 9   ORF187 ATMG00880 -0.888     2.8 -2.32  0.0527     0.245 -4.6
## 10  ORF184 ATMG00870 -0.379     3.6 -1.31  0.2307     0.531 -5.9
##    array_element_type             organism is_control     locus
## 1     oligonucleotide Arabidopsis thaliana         no ATMG00660
## 2     oligonucleotide Arabidopsis thaliana         no ATMG00670
## 3     oligonucleotide Arabidopsis thaliana         no ATMG00690
## 4     oligonucleotide Arabidopsis thaliana         no ATMG00710
## 5     oligonucleotide Arabidopsis thaliana         no ATMG00720
## 6     oligonucleotide Arabidopsis thaliana         no ATMG00820
## 7     oligonucleotide Arabidopsis thaliana         no AT2G07626
## 8     oligonucleotide Arabidopsis thaliana         no ATMG00850
## 9     oligonucleotide Arabidopsis thaliana         no ATMG00880
## 10    oligonucleotide Arabidopsis thaliana         no ATMG00870
##                                                                                description
## 1                                                  hypothetical protein;(source:Araport11)
## 2                                                 transmembrane protein;(source:Araport11)
## 3                                                     FO-ATPase subunit;(source:Araport11)
## 4  Polynucleotidyl transferase, ribonuclease H-like superfamily protein;(source:Araport11)
## 5                                                  hypothetical protein;(source:Araport11)
## 6                  Reverse transcriptase (RNA-dependent DNA polymerase);(source:Araport11)
## 7                                                  hypothetical protein;(source:Araport11)
## 8                               DNA/RNA polymerases superfamily protein;(source:Araport11)
## 9                                                  hypothetical protein;(source:Araport11)
## 10                                                 hypothetical protein;(source:Araport11)
##    chromosome entrez
## 1           M   <NA>
## 2           M   <NA>
## 3           M   <NA>
## 4           M   <NA>
## 5           M   <NA>
## 6           M   <NA>
## 7           2   <NA>
## 8           M   <NA>
## 9           M   <NA>
## 10          M   <NA>

As we can see, some of the entries are hypothetical proteins.

3.3 Pathview

Let’s load the required packages.

library(pathview)
library(gage)
library(gageData)

Let’s call upon kegg data from the gageData library.

data(kegg.sets.hs)

Let’s create a data frame with fold changes.

foldchanges = all2$logFC
names(foldchanges) = all2$entrez

head(foldchanges)
##   <NA>   <NA>   <NA>   <NA>   <NA>   <NA> 
## -1.214 -0.568  0.055 -0.883 -0.463 -0.193

Now, let’s load the pathways that our data will be mapped onto.

kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]

Let’s get the results.

keggres = gage(foldchanges, gsets = kegg.ath.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                                   p.geomean stat.mean   p.val
## ath03010 Ribosome                                   1.5e-14       8.2 1.5e-14
## ath01230 Biosynthesis of amino acids                3.1e-04       3.5 3.1e-04
## ath00040 Pentose and glucuronate interconversions   2.0e-03       3.0 2.0e-03
## ath00195 Photosynthesis                             2.6e-03       3.0 2.6e-03
## ath00966 Glucosinolate biosynthesis                 7.0e-03       2.7 7.0e-03
## ath00541 O-Antigen nucleotide sugar biosynthesis    7.3e-03       2.6 7.3e-03
##                                                     q.val set.size    exp1
## ath03010 Ribosome                                 1.6e-12      129 1.5e-14
## ath01230 Biosynthesis of amino acids              1.7e-02       85 3.1e-04
## ath00040 Pentose and glucuronate interconversions 6.9e-02       49 2.0e-03
## ath00195 Photosynthesis                           6.9e-02       19 2.6e-03
## ath00966 Glucosinolate biosynthesis               1.3e-01       12 7.0e-03
## ath00541 O-Antigen nucleotide sugar biosynthesis  1.3e-01       13 7.3e-03
## 
## $less
##                                          p.geomean stat.mean p.val q.val
## ath04120 Ubiquitin mediated proteolysis      0.043      -1.7 0.043     1
## ath04016 MAPK signaling pathway - plant      0.043      -1.7 0.043     1
## ath00592 alpha-Linolenic acid metabolism     0.045      -1.7 0.045     1
## ath03040 Spliceosome                         0.075      -1.4 0.075     1
## ath00350 Tyrosine metabolism                 0.093      -1.4 0.093     1
## ath04146 Peroxisome                          0.093      -1.3 0.093     1
##                                          set.size  exp1
## ath04120 Ubiquitin mediated proteolysis        63 0.043
## ath04016 MAPK signaling pathway - plant        74 0.043
## ath00592 alpha-Linolenic acid metabolism       19 0.045
## ath03040 Spliceosome                           77 0.075
## ath00350 Tyrosine metabolism                   19 0.093
## ath04146 Peroxisome                            40 0.093
## 
## $stats
##                                                   stat.mean exp1
## ath03010 Ribosome                                       8.2  8.2
## ath01230 Biosynthesis of amino acids                    3.5  3.5
## ath00040 Pentose and glucuronate interconversions       3.0  3.0
## ath00195 Photosynthesis                                 3.0  3.0
## ath00966 Glucosinolate biosynthesis                     2.7  2.7
## ath00541 O-Antigen nucleotide sugar biosynthesis        2.6  2.6

Let’s pull out the upregulated and downregulated genes separately. Then we will write them into their own files.

greater <- keggres$greater
lessers <- keggres$less

write.csv(greater, file = "BRIC16_pathview_Greater_Pathways.csv")
write.csv(lessers, file = "BRIC16_pathview_Lesser_Pathways.csv")


3.3.1 Mapping pathways

Let’s map to pathways greater.

keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <= 5) %>%
  .$id %>%
  as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath03010 Ribosome"                                
## [2] "ath01230 Biosynthesis of amino acids"             
## [3] "ath00040 Pentose and glucuronate interconversions"
## [4] "ath00195 Photosynthesis"                          
## [5] "ath00966 Glucosinolate biosynthesis"
keggresids_greater = substr(keggrespathways, start = 1, stop = 8)
keggresids_greater
## [1] "ath03010" "ath01230" "ath00040" "ath00195" "ath00966"
genedata <- as.character(all2$logFC)

Next, we will define a plotting function to apply.

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)

Now we plot multiple pathways (plots are saved to disk and returns a throwaway object list).

tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath"))

Since the KEGG data and pathways are updated from time to time, the above code may not necessarily work. As we here experience trouble with mapping all the pathways, some variables in the code can be adjusted such as:

same.layer = T/F
kegg.native = T/F
map.symbol = T/F

This new code should work:

tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", same.layer = FALSE, map.symbol = TRUE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath03010.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath01230.pathview.png
## Info: some node width is different from others, and hence adjusted!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00040.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00195.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00966.pathview.png
## Info: some node width is different from others, and hence adjusted!

Here, we successfully mapped all of the greaters.
Depending on personal preferences, the previously mentioned variables can be played around with and adjusted.

Let’s map to pathways lessers.

keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <= 5) %>%
  .$id %>%
  as.character()
keggrespathways
## [1] "ath04120 Ubiquitin mediated proteolysis" 
## [2] "ath04016 MAPK signaling pathway - plant" 
## [3] "ath00592 alpha-Linolenic acid metabolism"
## [4] "ath03040 Spliceosome"                    
## [5] "ath00350 Tyrosine metabolism"
keggresids_less = substr(keggrespathways, start = 1, stop = 8)
keggresids_less
## [1] "ath04120" "ath04016" "ath00592" "ath03040" "ath00350"

Now we define a plotting function to apply as before.

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)

Now we map the top downregulated genes:

tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", same.layer = FALSE, map.symbol = TRUE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04120.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04016.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00592.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath03040.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00350.pathview.png



4 RNASeq

Here, we will work with different approaches to analyzing results from RNA sequencing in R.

4.1 EdgeR

First, let’s load the required packages.

library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")

Let’s load the raw data and group it based on flight and ground (here denoted as 1 and 2 respectively).

rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "X")

group <- factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))

Now let’s do a differential gene expression.

dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)  
## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 2
##   .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
##   .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
##   .. ..$ :'data.frame':  12 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
##   .. .. ..$ lib.size    : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
##   .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ names: chr [1:2] "counts" "samples"

Let’s sort the data based on cpm, and take a look at the data afterwards.

keep <- rowSums(cpm(dgeGlm) > 2) >= 4

dgeGlm <- dgeGlm[keep,]

names(dgeGlm)
## [1] "counts"  "samples"
head(dgeGlm)
## An object of class "DGEList"
## $counts
##                    Mmus_C57.6J_KDN_FLT_Rep1_M23 Mmus_C57.6J_KDN_FLT_Rep2_M24
## ENSMUSG00000000001                         2977                         3058
## ENSMUSG00000000028                           60                           56
## ENSMUSG00000000056                         1254                         1282
## ENSMUSG00000000058                         1228                         1665
## ENSMUSG00000000078                         3223                         2477
## ENSMUSG00000000085                          884                          841
##                    Mmus_C57.6J_KDN_FLT_Rep3_M25 Mmus_C57.6J_KDN_FLT_Rep4_M26
## ENSMUSG00000000001                         2948                         3209
## ENSMUSG00000000028                           79                           77
## ENSMUSG00000000056                         1661                         1203
## ENSMUSG00000000058                         1213                         1399
## ENSMUSG00000000078                         2084                         2044
## ENSMUSG00000000085                          875                          724
##                    Mmus_C57.6J_KDN_FLT_Rep5_M27 Mmus_C57.6J_KDN_FLT_Rep6_M28
## ENSMUSG00000000001                         3256                         3102
## ENSMUSG00000000028                           76                           93
## ENSMUSG00000000056                         1454                         1483
## ENSMUSG00000000058                         1344                         1498
## ENSMUSG00000000078                         2030                         2302
## ENSMUSG00000000085                          831                          815
##                    Mmus_C57.6J_KDN_GC_Rep1_M33 Mmus_C57.6J_KDN_GC_Rep2_M34
## ENSMUSG00000000001                        3266                        3184
## ENSMUSG00000000028                          56                          75
## ENSMUSG00000000056                        1574                        1370
## ENSMUSG00000000058                        1527                        1546
## ENSMUSG00000000078                        2032                        2425
## ENSMUSG00000000085                         934                         723
##                    Mmus_C57.6J_KDN_GC_Rep3_M35 Mmus_C57.6J_KDN_GC_Rep4_M36
## ENSMUSG00000000001                        3370                        3078
## ENSMUSG00000000028                          75                          66
## ENSMUSG00000000056                        1497                        1510
## ENSMUSG00000000058                        1566                        1501
## ENSMUSG00000000078                        1834                        2102
## ENSMUSG00000000085                         817                         919
##                    Mmus_C57.6J_KDN_GC_Rep5_M37 Mmus_C57.6J_KDN_GC_Rep6_M38
## ENSMUSG00000000001                        3306                        2979
## ENSMUSG00000000028                          49                          68
## ENSMUSG00000000056                        1490                        1388
## ENSMUSG00000000058                        1248                        1390
## ENSMUSG00000000078                        2314                        2434
## ENSMUSG00000000085                         804                         760
## 
## $samples
##                              group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23     1  4.0e+07            1
## Mmus_C57.6J_KDN_FLT_Rep2_M24     1  4.1e+07            1
## Mmus_C57.6J_KDN_FLT_Rep3_M25     1  3.8e+07            1
## Mmus_C57.6J_KDN_FLT_Rep4_M26     1  3.9e+07            1
## Mmus_C57.6J_KDN_FLT_Rep5_M27     1  3.7e+07            1
## 7 more rows ...

Now let’s make a design that is readable for EdgeR:

design <- model.matrix(~group)
design
##    (Intercept) group2
## 1            1      0
## 2            1      0
## 3            1      0
## 4            1      0
## 5            1      0
## 6            1      0
## 7            1      1
## 8            1      1
## 9            1      1
## 10           1      1
## 11           1      1
## 12           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

The keep function is a large logical operator, which sorts the data depending on TRUE or FALSE statements.

Let’s look at the dispersion of the average log(cpm) by running it through a linear model function and plot it:

dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)

plotBCV(dgeGlmTagDisp)

Now, we will do a fit of our linear model. Then we save the top tags.

fit <- glmFit(dgeGlmTagDisp, design)

colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef = 2)

ttGlm <- topTags(lrt, n = Inf)

class(ttGlm)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"

We will now use a decide test to find the values are statistically significant.
decideTests will compare the two groups and determine which are upregulated and which are downregulated.

summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))
##        group2
## Down       64
## NotSig  13390
## Up        159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]

plotSmear(lrt, de.tags = tagsGlm)

You want to see a normal distribution of the differentially expressed genes (red dots) across the avg logCPM.

Let’s write the results to a file.

hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]

write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")

Let’s take a look at the columns in the org.Mm (Mouse) database:

columns(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

Now let’s make a new ttGlm dataframe and add columns from the org.Mm database.

ttGlm2 <- as.data.frame(ttGlm$table)

ttGlm2$symbol = mapIds(org.Mm.eg.db,
                       keys = row.names(ttGlm2),
                       column = "SYMBOL",
                       keytype = "ENSEMBL",
                       multiVals = "first")

ttGlm2$entrez = mapIds(org.Mm.eg.db,
                       keys = row.names(ttGlm2),
                       column = "ENTREZID",
                       keytype = "ENSEMBL",
                       multiVals = "first")

ttGlm2$name = mapIds(org.Mm.eg.db,
                       keys = row.names(ttGlm2),
                       column = "GENENAME",
                       keytype = "ENSEMBL",
                       multiVals = "first")

head(ttGlm2)
##                    logFC logCPM LR  PValue     FDR symbol entrez
## ENSMUSG00000026077 -1.36    3.6 80 4.3e-19 5.9e-15  Npas2  18143
## ENSMUSG00000007872  0.89    5.5 77 1.9e-18 1.3e-14    Id3  15903
## ENSMUSG00000021775  0.95    6.2 63 2.0e-15 9.1e-12  Nr1d2 353187
## ENSMUSG00000002250 -0.83    5.3 62 2.7e-15 9.2e-12  Ppard  19015
## ENSMUSG00000059824  2.26    4.6 58 2.6e-14 7.2e-11    Dbp  13170
## ENSMUSG00000074715 -1.99    3.8 47 7.0e-12 1.6e-08  Ccl28  56838
##                                                                name
## ENSMUSG00000026077                    neuronal PAS domain protein 2
## ENSMUSG00000007872                       inhibitor of DNA binding 3
## ENSMUSG00000021775  nuclear receptor subfamily 1, group D, member 2
## ENSMUSG00000002250 peroxisome proliferator activator receptor delta
## ENSMUSG00000059824          D site albumin promoter binding protein
## ENSMUSG00000074715                    C-C motif chemokine ligand 28

We now have our results saved into a dataframe with extra columns giving context.

4.1.1 Pathview results

First, let’s load the required packages.

library(pathview)
library(gage)
library(gageData)

Let’s now load the kegg- and go data.

data(kegg.sets.mm)
data(go.subs.mm)

Let’s take the differentially expressed genes and save in a value for themselves.

foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez

Now let’s make tables from the kegg data.

kegg.mm = kegg.gsets("Mus musculus", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]

Usually, calling “mouse” as the species in the kegg gsets is enough. However, the kegg site or the pathing to certain names may “break”. Therefore, “Mus musculus” was called as species, as this seemed to fix the issue.

Let’s get the results from the kegg data based on the foldchanges.

keggres = gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                                                   p.geomean
## mmu03010 Ribosome                                                   9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells   2.0e-03
## mmu04330 Notch signaling pathway                                    6.1e-03
## mmu04350 TGF-beta signaling pathway                                 1.3e-02
## mmu04390 Hippo signaling pathway                                    2.0e-02
## mmu00830 Retinol metabolism                                         2.1e-02
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04330 Notch signaling pathway                                        2.6
## mmu04350 TGF-beta signaling pathway                                     2.2
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                     p.val q.val
## mmu03010 Ribosome                                                 9.5e-05 0.023
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.236
## mmu04330 Notch signaling pathway                                  6.1e-03 0.490
## mmu04350 TGF-beta signaling pathway                               1.3e-02 0.786
## mmu04390 Hippo signaling pathway                                  2.0e-02 0.830
## mmu00830 Retinol metabolism                                       2.1e-02 0.830
##                                                                   set.size
## mmu03010 Ribosome                                                      127
## mmu04550 Signaling pathways regulating pluripotency of stem cells      100
## mmu04330 Notch signaling pathway                                        54
## mmu04350 TGF-beta signaling pathway                                     84
## mmu04390 Hippo signaling pathway                                       125
## mmu00830 Retinol metabolism                                             37
##                                                                      exp1
## mmu03010 Ribosome                                                 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway                                  6.1e-03
## mmu04350 TGF-beta signaling pathway                               1.3e-02
## mmu04390 Hippo signaling pathway                                  2.0e-02
## mmu00830 Retinol metabolism                                       2.1e-02
## 
## $less
##                                                  p.geomean stat.mean   p.val
## mmu04613 Neutrophil extracellular trap formation   0.00012      -3.7 0.00012
## mmu04145 Phagosome                                 0.00192      -2.9 0.00192
## mmu04110 Cell cycle                                0.00276      -2.8 0.00276
## mmu04714 Thermogenesis                             0.00472      -2.6 0.00472
## mmu04217 Necroptosis                               0.00614      -2.5 0.00614
## mmu00910 Nitrogen metabolism                       0.00867      -2.6 0.00867
##                                                  q.val set.size    exp1
## mmu04613 Neutrophil extracellular trap formation 0.029      137 0.00012
## mmu04145 Phagosome                               0.222      121 0.00192
## mmu04110 Cell cycle                              0.222      134 0.00276
## mmu04714 Thermogenesis                           0.284      208 0.00472
## mmu04217 Necroptosis                             0.296      113 0.00614
## mmu00910 Nitrogen metabolism                     0.348       13 0.00867
## 
## $stats
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04330 Notch signaling pathway                                        2.6
## mmu04350 TGF-beta signaling pathway                                     2.2
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                   exp1
## mmu03010 Ribosome                                                  3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells  2.9
## mmu04330 Notch signaling pathway                                   2.6
## mmu04350 TGF-beta signaling pathway                                2.2
## mmu04390 Hippo signaling pathway                                   2.1
## mmu00830 Retinol metabolism                                        2.1
greaters <- keggres$greater

lessers <- keggres$less

Let’s prepare the upregulated genes for plotting by extracting their IDs.

keggrespathways_greater = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <= 5) %>%
  .$id %>%
  as.character()
keggrespathways_greater
## [1] "mmu03010 Ribosome"                                                
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04330 Notch signaling pathway"                                 
## [4] "mmu04350 TGF-beta signaling pathway"                              
## [5] "mmu04390 Hippo signaling pathway"
keggresids_greater = substr(keggrespathways_greater, start = 1, stop = 8)
keggresids_greater
## [1] "mmu03010" "mmu04550" "mmu04330" "mmu04350" "mmu04390"

Now let’s plot the upregulated genes unto their pathways.

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)

# plot multiple pathways
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "edgeR"))

Let’s prepare the downregulated genes for plotting by extracting their IDs.

keggrespathways_less = data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <= 5) %>%
  .$id %>%
  as.character()
keggrespathways_less
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04145 Phagosome"                              
## [3] "mmu04110 Cell cycle"                             
## [4] "mmu04714 Thermogenesis"                          
## [5] "mmu04217 Necroptosis"
keggresids_less = substr(keggrespathways_less, start = 1, stop = 8)
keggresids_less
## [1] "mmu04613" "mmu04145" "mmu04110" "mmu04714" "mmu04217"

Now let’s plot the downregulated genes unto their pathways.

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)

# plot multiple pathways
tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "edgeR"))

Let’s load the images into the html file:

library(imager)

edgeR_files <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = ".*edgeR.png")

edgeR_images <- lapply(edgeR_files, load.image)
knitr::include_graphics(edgeR_files)


4.2 DESeq

Let’s load the required packages for this analysis.

library("DESeq2")
library("dplyr")
library("apeglm")

Let’s load in our data.

countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")

colData <- read.csv("PHENO_DATA_mouse.csv", sep = ",", row.names = 1)

DESeq requires you to give levels to your data/group.

Therefore, we need to add leveling factors to our coldata.

colData$group <- factor(colData$group, levels = c("0", "1"))

Now let’s make sure we have the same number of individuals as well as groups.

all(rownames(colData)) %in% colnames(countData)
## Warning in all(rownames(colData)): coercing argument of type 'character' to
## logical
## [1] FALSE

The FALSE statement given here is due to one character difference between the two dataframes, the characters being “-” and “_“, and the countData has another named column. We will fix this later.

We need to round up the counts, because DESeq2 only allows integers as an input, and not fractional numbers. This depends on the method of alignment that was used upstream.

countData %>%
  mutate_if(is.numeric, ceiling)
head(countData)
##              gene_ID Mmus_C57_6J_KDN_FLT_Rep1_M23 Mmus_C57_6J_KDN_FLT_Rep2_M24
## 1 ENSMUSG00000000001                         2811                         2742
## 2 ENSMUSG00000000003                            0                            0
## 3 ENSMUSG00000000028                           59                           49
## 4 ENSMUSG00000000031                           26                           19
## 5 ENSMUSG00000000037                           20                           26
## 6 ENSMUSG00000000049                            0                            1
##   Mmus_C57_6J_KDN_FLT_Rep3_M25 Mmus_C57_6J_KDN_FLT_Rep4_M26
## 1                         3578                         3118
## 2                            0                            0
## 3                          102                           72
## 4                           17                           23
## 5                           37                           30
## 6                            5                           10
##   Mmus_C57_6J_KDN_FLT_Rep5_M27 Mmus_C57_6J_KDN_FLT_Rep6_M28
## 1                         3610                         3102
## 2                            0                            0
## 3                           80                           84
## 4                           26                           18
## 5                           22                           20
## 6                            0                            4
##   Mmus_C57_6J_KDN_GC_Rep1_M33 Mmus_C57_6J_KDN_GC_Rep2_M34
## 1                        3094                        3112
## 2                           0                           0
## 3                          56                          75
## 4                          15                          21
## 5                          20                          27
## 6                           1                           4
##   Mmus_C57_6J_KDN_GC_Rep3_M35 Mmus_C57_6J_KDN_GC_Rep4_M36
## 1                        3195                        3158
## 2                           0                           0
## 3                          60                          64
## 4                          21                          25
## 5                           8                          20
## 6                          10                           2
##   Mmus_C57_6J_KDN_GC_Rep5_M37 Mmus_C57_6J_KDN_GC_Rep6_M38
## 1                        3339                        2958
## 2                           0                           0
## 3                          56                          79
## 4                          24                          18
## 5                          30                          36
## 6                           1                           1

Let’s modify the countData dataframe to not include the first column.

countData[, 2:13] <- sapply(countData[, 2:13], as.integer)

row.names(countData) <- countData[, 1]

countData <- countData[2:13]

Now let’s fix those characters in the colData dataframe:

library(stringr)

rownames(colData) <- str_replace_all(rownames(colData), "-", "_")

all(rownames(colData) %in% colnames(countData))
## [1] TRUE
rownames(colData) == colnames(countData)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Now, we will do the differential gene expression.
First, we convert our dataframes into a matrix, making it readable for DESeq.

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)

dds <- dds[rowSums(counts(dds) > 2) >= 4]

dds <- DESeq(dds)

res <- results(dds)

Now we will adjust our log foldchange.
We will use lfcShrink for this, as it is adjusted for DESeq objects.

resLFC <- lfcShrink(dds, coef = 2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

Then we adjust our results by p-value.

(resOrdered <- res[order(res$padj), ])
## log2 fold change (MLE): group 1 vs 0 
## Wald test p-value: group 1 vs 0 
## DataFrame with 22008 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE        stat      pvalue
##                    <numeric>      <numeric> <numeric>   <numeric>   <numeric>
## ENSMUSG00000002250  1459.223      -0.926292 0.1112628    -8.32526 8.41430e-17
## ENSMUSG00000007872  1719.375       0.818829 0.1122820     7.29261 3.04014e-13
## ENSMUSG00000026077   437.035      -1.191812 0.1655873    -7.19748 6.13338e-13
## ENSMUSG00000040998 14579.593      -0.506307 0.0703771    -7.19421 6.28252e-13
## ENSMUSG00000021775  2804.923       0.842511 0.1233312     6.83129 8.41546e-12
## ...                      ...            ...       ...         ...         ...
## ENSMUSG00000118345   4.22314    -0.12097478  0.599072 -0.20193699    0.839966
## ENSMUSG00000118353   6.60578     0.56456713  0.481195  1.17326031    0.240691
## ENSMUSG00000118358   3.30902    -0.00273584  0.763559 -0.00358301    0.997141
## ENSMUSG00000118369   2.91657    -1.11623145  0.790702 -1.41169702    0.158039
## ENSMUSG00000118384   7.43136     0.23830798  0.489273  0.48706567    0.626212
##                           padj
##                      <numeric>
## ENSMUSG00000002250 1.24077e-12
## ENSMUSG00000007872 2.24149e-09
## ENSMUSG00000026077 2.31605e-09
## ENSMUSG00000040998 2.31605e-09
## ENSMUSG00000021775 2.48189e-08
## ...                        ...
## ENSMUSG00000118345          NA
## ENSMUSG00000118353          NA
## ENSMUSG00000118358          NA
## ENSMUSG00000118369          NA
## ENSMUSG00000118384          NA
summary(res)
## 
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 325, 1.5%
## LFC < 0 (down)     : 327, 1.5%
## outliers [1]       : 15, 0.068%
## low counts [2]     : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Now we will pull out some of these data.

FLT_vs_GC <- as.data.frame(res$log2FoldChange)

head(FLT_vs_GC)
##   res$log2FoldChange
## 1             0.0421
## 2            -0.1334
## 3            -0.0185
## 4            -0.0882
## 5            -0.0079
## 6             0.1136

Now we plot the data and print it to a pdf.

plotMA(resLFC, ylim = c(-2, 2))

pdf(file = "MA_Plot_FLT_vs_GC.pdf")

dev.off()
## png 
##   2

First the plot was made, then we write a pdf plot, but we then turn “dev.off” to allow the machine to be able to run the pdf plot function again. Otherwise, the machine will be locked on the one iteration.

Let’s save our differential expression results to file.

write.csv(as.data.frame(resOrdered), file = "Mouse_DESeq.csv")

Let’s perform a custom transformation and plot a principal component analysis (PCA).

dds <- estimateSizeFactors(dds)

se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) + 1), colData = colData(dds))

pdf(file = "PCA_PLOT_FLT_vs_GC.pdf")

dev.off()
## png 
##   2
plotPCA(DESeqTransform(se), intgroup = "group")
## using ntop=500 top features by variance

Now, let’s prepare the data for pathview and save a variable with the foldchanges.

Let’s load our required packages

library("AnnotationDbi")
library("org.Mm.eg.db")

Here, we add columns from the mouse database to the DESeq results.

columns(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"
foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
head(foldchanges)
##                    res$log2FoldChange
## ENSMUSG00000000001             0.0421
## ENSMUSG00000000028            -0.1334
## ENSMUSG00000000031            -0.0185
## ENSMUSG00000000037            -0.0882
## ENSMUSG00000000049            -0.0079
## ENSMUSG00000000056             0.1136
res$symbol = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "SYMBOL",
                    keytype = "ENSEMBL",
                    mutliVals = "first")

res$entrez = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "ENTREZID",
                    keytype = "ENSEMBL",
                    mutliVals = "first")

res$anme = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "GENENAME",
                    keytype = "ENSEMBL",
                    mutliVals = "first")

head(res, 10)
## log2 fold change (MLE): group 1 vs 0 
## Wald test p-value: group 1 vs 0 
## DataFrame with 10 rows and 9 columns
##                      baseMean log2FoldChange     lfcSE       stat    pvalue
##                     <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSMUSG00000000001 3132.35128     0.04214340 0.0436714  0.9650117  0.334539
## ENSMUSG00000000028   68.75801    -0.13342706 0.1565936 -0.8520597  0.394181
## ENSMUSG00000000031   21.05397    -0.01853142 0.2486477 -0.0745288  0.940590
## ENSMUSG00000000037   24.42314    -0.08817270 0.2982220 -0.2956613  0.767489
## ENSMUSG00000000049    3.24919    -0.00790342 0.9613572 -0.0082211  0.993441
## ENSMUSG00000000056 1424.88216     0.11355979 0.0777635  1.4603234  0.144201
## ENSMUSG00000000058 1420.78992    -0.03893850 0.0850602 -0.4577759  0.647113
## ENSMUSG00000000078 2254.53129    -0.07540275 0.0874314 -0.8624214  0.388456
## ENSMUSG00000000085  822.68179     0.04586772 0.0584699  0.7844667  0.432766
## ENSMUSG00000000088 5946.81754    -0.05461549 0.0691178 -0.7901795  0.429423
##                         padj      symbol      entrez                   anme
##                    <numeric> <character> <character>            <character>
## ENSMUSG00000000001  0.739637       Gnai3       14679 G protein subunit al..
## ENSMUSG00000000028  0.777085       Cdc45       12544 cell division cycle 45
## ENSMUSG00000000031        NA         H19       14955 H19, imprinted mater..
## ENSMUSG00000000037        NA       Scml2      107815 Scm polycomb group p..
## ENSMUSG00000000049        NA        Apoh       11818       apolipoprotein H
## ENSMUSG00000000056  0.547527        Narf       67608 nuclear prelamin A r..
## ENSMUSG00000000058  0.901904        Cav2       12390             caveolin 2
## ENSMUSG00000000078  0.774294        Klf6       23849 Kruppel-like transcr..
## ENSMUSG00000000085  0.800018       Scmh1       29871 sex comb on midleg h..
## ENSMUSG00000000088  0.798420       Cox5a       12858 cytochrome c oxidase..


4.2.1 Pathview results

Let’s load our pathview packages.

library(pathview)
library(gage)
library(gageData)

Let’s load our kegg- and go data and set the names for the data in the foldchanges dataframe.

data(kegg.sets.mm)
data(go.subs.mm)

foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez

head(foldchanges)
##   14679   12544   14955  107815   11818   67608 
##  0.0421 -0.1334 -0.0185 -0.0882 -0.0079  0.1136
kegg.mm <- kegg.gsets("Mus musculus", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]

Let’s map the results

keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)
## $greater
##                                           p.geomean stat.mean p.val q.val
## mmu03010 Ribosome                             0.017       2.1 0.017  0.89
## mmu04022 cGMP-PKG signaling pathway           0.030       1.9 0.030  0.89
## mmu04360 Axon guidance                        0.038       1.8 0.038  0.89
## mmu04330 Notch signaling pathway              0.042       1.8 0.042  0.89
## mmu04658 Th1 and Th2 cell differentiation     0.054       1.6 0.054  0.89
## mmu02010 ABC transporters                     0.075       1.5 0.075  0.89
##                                           set.size  exp1
## mmu03010 Ribosome                              139 0.017
## mmu04022 cGMP-PKG signaling pathway            152 0.030
## mmu04360 Axon guidance                         176 0.038
## mmu04330 Notch signaling pathway                58 0.042
## mmu04658 Th1 and Th2 cell differentiation       78 0.054
## mmu02010 ABC transporters                       46 0.075
## 
## $less
##                                                   p.geomean stat.mean  p.val
## mmu04613 Neutrophil extracellular trap formation     0.0043      -2.6 0.0043
## mmu04110 Cell cycle                                  0.0062      -2.5 0.0062
## mmu04657 IL-17 signaling pathway                     0.0113      -2.3 0.0113
## mmu04145 Phagosome                                   0.0332      -1.8 0.0332
## mmu04621 NOD-like receptor signaling pathway         0.0388      -1.8 0.0388
## mmu04625 C-type lectin receptor signaling pathway    0.0441      -1.7 0.0441
##                                                   q.val set.size   exp1
## mmu04613 Neutrophil extracellular trap formation   0.75      160 0.0043
## mmu04110 Cell cycle                                0.75      151 0.0062
## mmu04657 IL-17 signaling pathway                   0.90       75 0.0113
## mmu04145 Phagosome                                 0.91      144 0.0332
## mmu04621 NOD-like receptor signaling pathway       0.91      154 0.0388
## mmu04625 C-type lectin receptor signaling pathway  0.91       99 0.0441
## 
## $stats
##                                           stat.mean exp1
## mmu03010 Ribosome                               2.1  2.1
## mmu04022 cGMP-PKG signaling pathway             1.9  1.9
## mmu04360 Axon guidance                          1.8  1.8
## mmu04330 Notch signaling pathway                1.8  1.8
## mmu04658 Th1 and Th2 cell differentiation       1.6  1.6
## mmu02010 ABC transporters                       1.5  1.5

Let’s save our greater and less than pathways.

greaters <- keggres$greater
lessers <- keggres$less


4.2.1.1 Mapping pathways

Let’s map the upregulated pathways.

keggrespathways_greater <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()
keggrespathways_greater
## [1] "mmu03010 Ribosome"                   "mmu04022 cGMP-PKG signaling pathway"
## [3] "mmu04360 Axon guidance"
keggresids_greater <- substr(keggrespathways_greater, start = 1, stop = 8)
keggresids_greater
## [1] "mmu03010" "mmu04022" "mmu04360"

Time to plot.

plot_pathway = function(pid) pathwiew(gene.data = foldchange, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)

tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "DESeq"))

sapply tells to run the function for every item in the list.

Let’s map the downregulated pathways.

keggrespathways_less <- data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()
keggrespathways_less
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04110 Cell cycle"                             
## [3] "mmu04657 IL-17 signaling pathway"
keggresids_less <- substr(keggrespathways_less, start = 1, stop = 8)
keggresids_less
## [1] "mmu04613" "mmu04110" "mmu04657"

Now let’s plot the lessers.

plot_pathway = function(pid) pathwiew(gene.data = foldchange, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)

tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "DESeq"))

Let’s load the images into our html file.

library(imager)

DESeq_files <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = ".*DESeq.png")

DESeq_images <- lapply(DESeq_files, load.image)
knitr::include_graphics(DESeq_files)


4.3 EdgeR vs DESeq

Let’s do a comparison of the two methods of analysis. First, let’s load the required package.

library(tidyverse)

Let’s load our data.

EdgeR <- read.csv("Mouse_EdgeR_Results_Zero_vs_1.csv")

DESeq <- read.csv("Mouse_DESeq.csv")

Let’s adjust the data of both to match for the comparison.

DESeq2 <- DESeq %>%
  filter(padj < 0.1)

DESeq2 <- DESeq2[, c(1, 3)]

EdgeR <- EdgeR[, 1:2]

Now, let’s change the names of the columns to match each other in the dataframes.

colnames(DESeq2) <- c("ID", "logFC")

colnames(EdgeR) <- c("ID", "logFC")

Let’s install and load the required package for plotting.

install.packages("GOplot")

library(GOplot)

Now, let’s make the comparison.

comp <- GOVenn(DESeq2, EdgeR, label = c("DESeq2", "EdgeR"), title = "Comparison of DESeq an EdgeR DE Genes", plot = FALSE)

Let’s plot the comparison in a Venn diagram.

comp$plot

Here, let’s look at the shared significant genes listed in a table.

comp$table
## $A_only
##                     logFC Trend
## ENSMUSG00000022580 -0.340  DOWN
## ENSMUSG00000006517 -0.361  DOWN
## ENSMUSG00000038375 -0.311  DOWN
## ENSMUSG00000032374 -0.331  DOWN
## ENSMUSG00000014158 -0.245  DOWN
## ENSMUSG00000026179 -0.277  DOWN
## ENSMUSG00000037455 -0.319  DOWN
## ENSMUSG00000020232 -0.214  DOWN
## ENSMUSG00000023367 -0.222  DOWN
## ENSMUSG00000079465 -0.348  DOWN
## ENSMUSG00000032911 -0.253  DOWN
## ENSMUSG00000028671 -0.529  DOWN
## ENSMUSG00000022763 -0.463  DOWN
## ENSMUSG00000069565 -0.233  DOWN
## ENSMUSG00000029763 -0.322  DOWN
## ENSMUSG00000040263 -0.267  DOWN
## ENSMUSG00000039047 -0.200  DOWN
## ENSMUSG00000023938 -0.298  DOWN
## ENSMUSG00000048707 -0.256  DOWN
## ENSMUSG00000024958 -0.216  DOWN
## ENSMUSG00000022351 -0.374  DOWN
## ENSMUSG00000092500 -0.678  DOWN
## ENSMUSG00000031594 -0.705  DOWN
## ENSMUSG00000053414 -0.423  DOWN
## ENSMUSG00000035910 -0.465  DOWN
## ENSMUSG00000046079 -0.230  DOWN
## ENSMUSG00000039834 -0.209  DOWN
## ENSMUSG00000024064 -0.322  DOWN
## ENSMUSG00000017428 -0.192  DOWN
## ENSMUSG00000040188 -0.182  DOWN
## ENSMUSG00000026399 -0.463  DOWN
## ENSMUSG00000000594 -0.353  DOWN
## ENSMUSG00000001473 -0.520  DOWN
## ENSMUSG00000029032 -0.173  DOWN
## ENSMUSG00000034457 -0.656  DOWN
## ENSMUSG00000038704 -0.442  DOWN
## ENSMUSG00000004565 -0.185  DOWN
## ENSMUSG00000020381 -0.355  DOWN
## ENSMUSG00000041889 -0.407  DOWN
## ENSMUSG00000023259 -0.549  DOWN
## ENSMUSG00000100131 -0.745  DOWN
## ENSMUSG00000062203 -0.169  DOWN
## ENSMUSG00000025735 -0.547  DOWN
## ENSMUSG00000028545 -0.349  DOWN
## ENSMUSG00000058454 -0.322  DOWN
## ENSMUSG00000066441 -0.206  DOWN
## ENSMUSG00000022540 -0.222  DOWN
## ENSMUSG00000041939 -0.369  DOWN
## ENSMUSG00000035561 -0.629  DOWN
## ENSMUSG00000038023 -0.162  DOWN
## ENSMUSG00000101249 -0.424  DOWN
## ENSMUSG00000032370 -0.277  DOWN
## ENSMUSG00000004266 -0.241  DOWN
## ENSMUSG00000001627 -0.368  DOWN
## ENSMUSG00000029810 -0.212  DOWN
## ENSMUSG00000024579 -0.339  DOWN
## ENSMUSG00000030739 -0.257  DOWN
## ENSMUSG00000011382 -0.289  DOWN
## ENSMUSG00000032306 -0.197  DOWN
## ENSMUSG00000031960 -0.232  DOWN
## ENSMUSG00000022512 -0.313  DOWN
## ENSMUSG00000037243 -0.384  DOWN
## ENSMUSG00000028041 -0.228  DOWN
## ENSMUSG00000026307 -0.180  DOWN
## ENSMUSG00000029580 -0.225  DOWN
## ENSMUSG00000033107 -0.465  DOWN
## ENSMUSG00000112324 -1.178  DOWN
## ENSMUSG00000020744 -0.239  DOWN
## ENSMUSG00000055745 -0.330  DOWN
## ENSMUSG00000041959 -0.302  DOWN
## ENSMUSG00000036957 -0.236  DOWN
## ENSMUSG00000021697 -0.244  DOWN
## ENSMUSG00000037894 -0.149  DOWN
## ENSMUSG00000021792 -0.271  DOWN
## ENSMUSG00000037347 -0.583  DOWN
## ENSMUSG00000020877 -0.305  DOWN
## ENSMUSG00000087574 -0.525  DOWN
## ENSMUSG00000030512 -0.297  DOWN
## ENSMUSG00000006800 -0.384  DOWN
## ENSMUSG00000027412 -0.238  DOWN
## ENSMUSG00000031161 -0.193  DOWN
## ENSMUSG00000027429 -0.197  DOWN
## ENSMUSG00000045594 -0.322  DOWN
## ENSMUSG00000018171 -0.210  DOWN
## ENSMUSG00000032349 -0.205  DOWN
## ENSMUSG00000015094 -0.274  DOWN
## ENSMUSG00000028538 -0.214  DOWN
## ENSMUSG00000029616 -0.170  DOWN
## ENSMUSG00000019797 -0.260  DOWN
## ENSMUSG00000095193 -0.465  DOWN
## ENSMUSG00000068876 -0.369  DOWN
## ENSMUSG00000026784 -0.337  DOWN
## ENSMUSG00000057363 -0.230  DOWN
## ENSMUSG00000064367 -0.397  DOWN
## ENSMUSG00000024319 -0.161  DOWN
## ENSMUSG00000030137 -0.733  DOWN
## ENSMUSG00000030538 -0.210  DOWN
## ENSMUSG00000032978 -0.412  DOWN
## ENSMUSG00000069633 -0.241  DOWN
## ENSMUSG00000067158 -0.167  DOWN
## ENSMUSG00000024292 -0.697  DOWN
## ENSMUSG00000031958 -0.326  DOWN
## ENSMUSG00000064341 -0.427  DOWN
## ENSMUSG00000024993 -0.240  DOWN
## ENSMUSG00000074170 -0.323  DOWN
## ENSMUSG00000030641 -0.708  DOWN
## ENSMUSG00000034858 -0.241  DOWN
## ENSMUSG00000036040 -0.449  DOWN
## ENSMUSG00000063229 -0.274  DOWN
## ENSMUSG00000027490 -0.501  DOWN
## ENSMUSG00000047735 -0.296  DOWN
## ENSMUSG00000011752 -0.216  DOWN
## ENSMUSG00000034714 -0.251  DOWN
## ENSMUSG00000034613 -0.198  DOWN
## ENSMUSG00000032452 -0.366  DOWN
## ENSMUSG00000038775 -0.428  DOWN
## ENSMUSG00000045294 -0.410  DOWN
## ENSMUSG00000084128 -0.205  DOWN
## ENSMUSG00000037686 -0.488  DOWN
## ENSMUSG00000014245 -0.343  DOWN
## ENSMUSG00000029093 -0.702  DOWN
## ENSMUSG00000024736 -0.295  DOWN
## ENSMUSG00000028937 -0.460  DOWN
## ENSMUSG00000096795 -0.429  DOWN
## ENSMUSG00000051518 -0.238  DOWN
## ENSMUSG00000000934 -0.210  DOWN
## ENSMUSG00000030880 -0.374  DOWN
## ENSMUSG00000025726 -0.471  DOWN
## ENSMUSG00000052117 -0.383  DOWN
## ENSMUSG00000006342 -0.322  DOWN
## ENSMUSG00000062825 -0.214  DOWN
## ENSMUSG00000041733 -0.164  DOWN
## ENSMUSG00000028780 -0.390  DOWN
## ENSMUSG00000024665 -0.283  DOWN
## ENSMUSG00000025317 -0.452  DOWN
## ENSMUSG00000020142 -0.513  DOWN
## ENSMUSG00000082016 -0.500  DOWN
## ENSMUSG00000034371 -0.356  DOWN
## ENSMUSG00000032492 -0.239  DOWN
## ENSMUSG00000110755 -0.855  DOWN
## ENSMUSG00000064254 -0.353  DOWN
## ENSMUSG00000035845 -0.230  DOWN
## ENSMUSG00000017210 -0.184  DOWN
## ENSMUSG00000023832 -0.242  DOWN
## ENSMUSG00000037999 -0.136  DOWN
## ENSMUSG00000068220 -0.327  DOWN
## ENSMUSG00000064345 -0.470  DOWN
## ENSMUSG00000109532 -0.468  DOWN
## ENSMUSG00000024503 -0.299  DOWN
## ENSMUSG00000004843 -0.193  DOWN
## ENSMUSG00000030298 -0.187  DOWN
## ENSMUSG00000048578 -0.185  DOWN
## ENSMUSG00000085042 -0.537  DOWN
## ENSMUSG00000016942 -0.573  DOWN
## ENSMUSG00000020116 -0.221  DOWN
## ENSMUSG00000028989 -0.830  DOWN
## ENSMUSG00000045328 -0.675  DOWN
## ENSMUSG00000022012 -0.602  DOWN
## ENSMUSG00000030111 -0.899  DOWN
## ENSMUSG00000070283 -0.218  DOWN
## ENSMUSG00000015357 -0.250  DOWN
## ENSMUSG00000023004 -0.286  DOWN
## ENSMUSG00000026627 -0.314  DOWN
## ENSMUSG00000031604 -0.471  DOWN
## ENSMUSG00000032515 -0.308  DOWN
## ENSMUSG00000067924 -0.224  DOWN
## ENSMUSG00000037263 -0.843  DOWN
## ENSMUSG00000027230 -0.362  DOWN
## ENSMUSG00000110663 -0.548  DOWN
## ENSMUSG00000026042 -0.492  DOWN
## ENSMUSG00000024818 -0.340  DOWN
## ENSMUSG00000028063 -0.176  DOWN
## ENSMUSG00000002393 -0.151  DOWN
## ENSMUSG00000029304 -0.304  DOWN
## ENSMUSG00000063897 -0.232  DOWN
## ENSMUSG00000047281 -0.292  DOWN
## ENSMUSG00000020827 -0.092  DOWN
## ENSMUSG00000057103 -0.272  DOWN
## ENSMUSG00000035863 -0.136  DOWN
## ENSMUSG00000038217 -0.427  DOWN
## ENSMUSG00000021414 -0.467  DOWN
## ENSMUSG00000029394 -0.155  DOWN
## ENSMUSG00000025203 -0.440  DOWN
## ENSMUSG00000022021 -0.747  DOWN
## ENSMUSG00000021666 -0.172  DOWN
## ENSMUSG00000023153 -0.476  DOWN
## ENSMUSG00000032269 -0.487  DOWN
## ENSMUSG00000029701 -0.155  DOWN
## ENSMUSG00000044786 -0.441  DOWN
## ENSMUSG00000037470 -0.128  DOWN
## ENSMUSG00000003813 -0.177  DOWN
## ENSMUSG00000073405 -0.630  DOWN
## ENSMUSG00000022893 -0.270  DOWN
## ENSMUSG00000040694 -0.480  DOWN
## ENSMUSG00000078695 -0.226  DOWN
## ENSMUSG00000079261 -0.517  DOWN
## ENSMUSG00000022033 -0.717  DOWN
## ENSMUSG00000037169 -0.428  DOWN
## ENSMUSG00000031781 -0.220  DOWN
## ENSMUSG00000033538 -0.611  DOWN
## ENSMUSG00000058569 -0.131  DOWN
## ENSMUSG00000024683 -0.174  DOWN
## ENSMUSG00000027327 -0.296  DOWN
## ENSMUSG00000046999 -0.821  DOWN
## ENSMUSG00000036002 -0.179  DOWN
## ENSMUSG00000048277 -0.161  DOWN
## ENSMUSG00000039254 -0.171  DOWN
## ENSMUSG00000074890 -0.265  DOWN
## ENSMUSG00000020898 -0.176  DOWN
## ENSMUSG00000045725 -0.348  DOWN
## ENSMUSG00000069302 -0.826  DOWN
## ENSMUSG00000020917 -0.414  DOWN
## ENSMUSG00000025762 -0.150  DOWN
## ENSMUSG00000051397 -0.602  DOWN
## ENSMUSG00000071076 -0.329  DOWN
## ENSMUSG00000020062 -0.557  DOWN
## ENSMUSG00000027610 -0.208  DOWN
## ENSMUSG00000046603 -0.150  DOWN
## ENSMUSG00000026249 -0.398  DOWN
## ENSMUSG00000062248 -0.403  DOWN
## ENSMUSG00000067144 -0.584  DOWN
## ENSMUSG00000018427 -0.298  DOWN
## ENSMUSG00000029161 -0.316  DOWN
## ENSMUSG00000054951 -0.420  DOWN
## ENSMUSG00000026201 -0.168  DOWN
## ENSMUSG00000028001 -0.429  DOWN
## ENSMUSG00000055725 -0.339  DOWN
## ENSMUSG00000036452 -0.209  DOWN
## ENSMUSG00000038527 -0.278  DOWN
## ENSMUSG00000028042 -0.182  DOWN
## ENSMUSG00000056608 -0.362  DOWN
## ENSMUSG00000005667 -0.619  DOWN
## ENSMUSG00000037386 -0.407  DOWN
## ENSMUSG00000037725 -0.819  DOWN
## ENSMUSG00000047986 -0.327  DOWN
## ENSMUSG00000029771 -0.386  DOWN
## ENSMUSG00000032902 -0.548  DOWN
## ENSMUSG00000030805 -0.154  DOWN
## ENSMUSG00000102070 -0.342  DOWN
## ENSMUSG00000048911 -0.636  DOWN
## ENSMUSG00000020307 -0.208  DOWN
## ENSMUSG00000028327 -0.429  DOWN
## ENSMUSG00000061474 -0.216  DOWN
## ENSMUSG00000031776 -0.133  DOWN
## ENSMUSG00000023048 -0.185  DOWN
## ENSMUSG00000028834 -0.453  DOWN
## ENSMUSG00000032786 -0.342  DOWN
## ENSMUSG00000020571 -0.189  DOWN
## ENSMUSG00000026036 -0.183  DOWN
## ENSMUSG00000037103 -0.234  DOWN
## ENSMUSG00000032806 -0.215  DOWN
## ENSMUSG00000032115 -0.226  DOWN
## ENSMUSG00000026914 -0.188  DOWN
## ENSMUSG00000017221 -0.154  DOWN
## ENSMUSG00000093930 -0.256  DOWN
## ENSMUSG00000020775 -0.193  DOWN
## ENSMUSG00000015085 -0.277  DOWN
## ENSMUSG00000007783 -0.681  DOWN
## ENSMUSG00000040688 -0.163  DOWN
## ENSMUSG00000053291 -0.219  DOWN
## ENSMUSG00000027496 -0.549  DOWN
## ENSMUSG00000033031 -0.563  DOWN
## ENSMUSG00000030122 -0.203  DOWN
## ENSMUSG00000030340 -0.153  DOWN
## ENSMUSG00000045438 -0.171  DOWN
## ENSMUSG00000083287 -0.702  DOWN
## ENSMUSG00000017776  0.143    UP
## ENSMUSG00000050310  0.157    UP
## ENSMUSG00000039068  0.166    UP
## ENSMUSG00000021540  0.170    UP
## ENSMUSG00000031393  0.166    UP
## ENSMUSG00000036550  0.122    UP
## ENSMUSG00000020257  0.165    UP
## ENSMUSG00000022604  0.245    UP
## ENSMUSG00000028053  0.129    UP
## ENSMUSG00000030213  0.228    UP
## ENSMUSG00000060657  0.171    UP
## ENSMUSG00000002266  0.798    UP
## ENSMUSG00000027351  0.201    UP
## ENSMUSG00000041530  0.142    UP
## ENSMUSG00000031216  0.185    UP
## ENSMUSG00000025223  0.199    UP
## ENSMUSG00000032086  0.160    UP
## ENSMUSG00000050812  0.123    UP
## ENSMUSG00000038290  0.181    UP
## ENSMUSG00000038530  0.458    UP
## ENSMUSG00000038766  0.194    UP
## ENSMUSG00000050947  0.195    UP
## ENSMUSG00000045098  0.167    UP
## ENSMUSG00000026918  0.151    UP
## ENSMUSG00000037003  0.254    UP
## ENSMUSG00000074748  0.135    UP
## ENSMUSG00000097412  0.362    UP
## ENSMUSG00000031729  0.136    UP
## ENSMUSG00000060419  0.567    UP
## ENSMUSG00000027519  0.127    UP
## ENSMUSG00000021669  0.173    UP
## ENSMUSG00000051817  0.205    UP
## ENSMUSG00000043090  0.246    UP
## ENSMUSG00000037029  0.161    UP
## ENSMUSG00000027395  0.233    UP
## ENSMUSG00000021488  0.140    UP
## ENSMUSG00000073678  0.238    UP
## ENSMUSG00000041378  0.247    UP
## ENSMUSG00000046947  0.230    UP
## ENSMUSG00000058793  0.158    UP
## ENSMUSG00000037369  0.169    UP
## ENSMUSG00000035247  0.144    UP
## ENSMUSG00000026436  0.151    UP
## ENSMUSG00000046897  0.138    UP
## ENSMUSG00000057133  0.152    UP
## ENSMUSG00000027680  0.155    UP
## ENSMUSG00000043991  0.143    UP
## ENSMUSG00000034189  0.223    UP
## ENSMUSG00000018076  0.140    UP
## ENSMUSG00000052446  0.213    UP
## ENSMUSG00000000901  0.460    UP
## ENSMUSG00000040209  0.149    UP
## ENSMUSG00000021140  0.140    UP
## ENSMUSG00000015942  0.281    UP
## ENSMUSG00000043241  0.154    UP
## ENSMUSG00000017897  0.479    UP
## ENSMUSG00000022353  0.170    UP
## ENSMUSG00000005893  0.127    UP
## ENSMUSG00000044791  0.108    UP
## ENSMUSG00000034156  0.304    UP
## ENSMUSG00000037736  0.156    UP
## ENSMUSG00000059486  0.191    UP
## ENSMUSG00000040865  0.146    UP
## ENSMUSG00000035666  0.183    UP
## ENSMUSG00000029647  0.151    UP
## ENSMUSG00000038486  0.223    UP
## ENSMUSG00000025927  0.260    UP
## ENSMUSG00000022415  0.260    UP
## ENSMUSG00000092558  0.233    UP
## ENSMUSG00000014195  0.132    UP
## ENSMUSG00000019866  0.168    UP
## ENSMUSG00000078202  0.238    UP
## ENSMUSG00000035495  0.162    UP
## ENSMUSG00000044674  0.147    UP
## ENSMUSG00000102869  0.133    UP
## ENSMUSG00000026923  0.238    UP
## ENSMUSG00000020642  0.304    UP
## ENSMUSG00000037503  0.132    UP
## ENSMUSG00000043716  0.155    UP
## ENSMUSG00000037822  0.123    UP
## ENSMUSG00000035413  0.280    UP
## ENSMUSG00000021395  0.145    UP
## ENSMUSG00000036097  0.170    UP
## ENSMUSG00000034297  0.142    UP
## ENSMUSG00000037742  0.183    UP
## ENSMUSG00000031841  0.367    UP
## ENSMUSG00000021661  0.236    UP
## ENSMUSG00000021959  0.201    UP
## ENSMUSG00000020594  0.133    UP
## ENSMUSG00000027079  0.245    UP
## ENSMUSG00000087150  0.369    UP
## ENSMUSG00000091811  0.293    UP
## ENSMUSG00000066415  0.147    UP
## ENSMUSG00000027524  0.334    UP
## ENSMUSG00000008683  0.179    UP
## ENSMUSG00000066235  0.238    UP
## ENSMUSG00000099689  0.326    UP
## ENSMUSG00000074994  0.154    UP
## ENSMUSG00000005698  0.136    UP
## ENSMUSG00000020357  0.220    UP
## ENSMUSG00000051864  0.167    UP
## ENSMUSG00000033857  0.262    UP
## ENSMUSG00000027678  0.141    UP
## ENSMUSG00000028522  0.191    UP
## ENSMUSG00000038677  0.256    UP
## ENSMUSG00000031365  0.243    UP
## ENSMUSG00000079215  0.136    UP
## ENSMUSG00000008435  0.196    UP
## ENSMUSG00000022791  0.185    UP
## ENSMUSG00000035284  0.150    UP
## ENSMUSG00000071660  0.149    UP
## ENSMUSG00000031209  0.180    UP
## ENSMUSG00000043384  0.177    UP
## ENSMUSG00000062519  0.203    UP
## ENSMUSG00000046792  0.154    UP
## ENSMUSG00000060036  0.183    UP
## ENSMUSG00000027799  0.128    UP
## ENSMUSG00000070280  1.104    UP
## ENSMUSG00000039086  0.302    UP
## ENSMUSG00000010721  0.279    UP
## ENSMUSG00000026694  0.224    UP
## ENSMUSG00000090958  0.260    UP
## ENSMUSG00000041329  0.340    UP
## ENSMUSG00000110353  0.559    UP
## ENSMUSG00000024431  0.155    UP
## ENSMUSG00000028970  0.653    UP
## ENSMUSG00000044617  0.169    UP
## ENSMUSG00000053110  0.113    UP
## ENSMUSG00000033721  0.194    UP
## ENSMUSG00000033237  0.125    UP
## ENSMUSG00000026782  0.154    UP
## ENSMUSG00000020716  0.107    UP
## ENSMUSG00000028869  0.157    UP
## ENSMUSG00000040123  0.176    UP
## ENSMUSG00000023279  0.660    UP
## ENSMUSG00000031618  0.182    UP
## ENSMUSG00000046480  0.153    UP
## ENSMUSG00000053604  0.222    UP
## ENSMUSG00000019817  0.216    UP
## ENSMUSG00000063317  0.192    UP
## ENSMUSG00000027782  0.169    UP
## ENSMUSG00000003970  0.175    UP
## ENSMUSG00000020448  0.156    UP
## ENSMUSG00000075592  0.209    UP
## ENSMUSG00000005886  0.114    UP
## ENSMUSG00000097119  0.186    UP
## ENSMUSG00000049739  0.148    UP
## ENSMUSG00000033671  0.132    UP
## ENSMUSG00000036197  0.155    UP
## ENSMUSG00000048379  0.183    UP
## ENSMUSG00000039315  0.549    UP
## ENSMUSG00000033943  0.124    UP
## ENSMUSG00000017418  0.178    UP
## ENSMUSG00000041037  0.153    UP
## ENSMUSG00000048696  0.292    UP
## ENSMUSG00000039801  0.137    UP
## ENSMUSG00000060938  0.154    UP
## ENSMUSG00000049800  0.175    UP
## ENSMUSG00000013921  0.359    UP
## ENSMUSG00000031169  0.321    UP
## ENSMUSG00000106062  0.642    UP
## ENSMUSG00000017405  0.225    UP
## ENSMUSG00000034647  0.187    UP
## ENSMUSG00000095597  0.398    UP
## ENSMUSG00000048170  0.119    UP
## ENSMUSG00000038437  0.144    UP
## ENSMUSG00000043929  0.200    UP
## ENSMUSG00000052751  0.127    UP
## ENSMUSG00000038481  0.163    UP
## ENSMUSG00000090083  0.190    UP
## ENSMUSG00000034893  0.133    UP
## ENSMUSG00000069682  0.295    UP
## ENSMUSG00000025571  0.170    UP
## ENSMUSG00000040481  0.137    UP
## ENSMUSG00000009470  0.137    UP
## ENSMUSG00000038056  0.107    UP
## ENSMUSG00000020491  0.286    UP
## ENSMUSG00000038544  0.226    UP
## ENSMUSG00000015290  0.124    UP
## ENSMUSG00000017667  0.232    UP
## ENSMUSG00000037712  0.152    UP
## ENSMUSG00000031754  0.152    UP
## ENSMUSG00000039377  0.409    UP
## ENSMUSG00000065954  0.130    UP
## ENSMUSG00000021180  0.252    UP
## ENSMUSG00000021144  0.123    UP
## ENSMUSG00000032892  0.266    UP
## ENSMUSG00000047632  0.361    UP
## ENSMUSG00000036990  0.138    UP
## ENSMUSG00000025997  0.165    UP
## ENSMUSG00000025195  0.133    UP
## ENSMUSG00000018736  0.156    UP
## ENSMUSG00000002748  0.160    UP
## ENSMUSG00000062866  0.139    UP
## ENSMUSG00000047888  0.157    UP
## ENSMUSG00000053754  0.086    UP
## ENSMUSG00000105692  0.424    UP
## ENSMUSG00000029154  0.224    UP
## ENSMUSG00000062328  0.135    UP
## ENSMUSG00000020460  0.134    UP
## ENSMUSG00000026455  0.140    UP
## ENSMUSG00000028936  0.188    UP
## ENSMUSG00000048285  0.324    UP
## 
## $B_only
##                    logFC Trend
## ENSMUSG00000095616 -2.18  DOWN
## ENSMUSG00000055254 -1.26  DOWN
## ENSMUSG00000036083 -0.42  DOWN
## ENSMUSG00000062901  0.37    UP
## ENSMUSG00000066113  0.54    UP
## ENSMUSG00000047861  0.48    UP
## ENSMUSG00000068270  0.37    UP
## ENSMUSG00000026315  0.48    UP
## ENSMUSG00000010651  0.37    UP
## ENSMUSG00000057914  0.65    UP
## ENSMUSG00000038567  0.79    UP
## ENSMUSG00000021379  0.43    UP
## ENSMUSG00000038528  0.28    UP
## ENSMUSG00000021876  1.02    UP
## ENSMUSG00000020566  0.38    UP
## ENSMUSG00000002289  0.96    UP
## ENSMUSG00000025880  0.33    UP
## ENSMUSG00000032898  0.30    UP
## ENSMUSG00000051344  0.40    UP
## ENSMUSG00000028234  0.28    UP
## ENSMUSG00000059173  0.31    UP
## ENSMUSG00000061410  0.26    UP
## ENSMUSG00000104445  0.29    UP
## ENSMUSG00000029004  0.26    UP
## ENSMUSG00000006599  0.30    UP
## ENSMUSG00000031530  0.46    UP
## ENSMUSG00000075520  0.32    UP
## ENSMUSG00000032604  0.27    UP
## ENSMUSG00000013089  0.48    UP
## ENSMUSG00000037058  0.25    UP
## ENSMUSG00000090165  0.72    UP
## ENSMUSG00000078429  0.25    UP
## ENSMUSG00000032554  0.52    UP
## ENSMUSG00000028266  0.28    UP
## ENSMUSG00000035504  0.44    UP
## ENSMUSG00000042379  0.66    UP
## ENSMUSG00000063415  0.58    UP
## ENSMUSG00000074179  0.55    UP
## ENSMUSG00000006494  0.29    UP
## ENSMUSG00000024472  0.31    UP
## 
## $AB
##                    logFC_A logFC_B Trend
## ENSMUSG00000000253   -0.36   -0.29  DOWN
## ENSMUSG00000002250   -0.93   -0.83  DOWN
## ENSMUSG00000002797   -0.53   -0.45  DOWN
## ENSMUSG00000010663   -0.38   -0.31  DOWN
## ENSMUSG00000019944   -0.43   -0.50  DOWN
## ENSMUSG00000020326   -0.35   -0.28  DOWN
## ENSMUSG00000020538   -0.34   -0.41  DOWN
## ENSMUSG00000021135   -1.24   -1.17  DOWN
## ENSMUSG00000021185   -0.37   -0.31  DOWN
## ENSMUSG00000021214   -1.29   -1.23  DOWN
## ENSMUSG00000021364   -0.35   -0.29  DOWN
## ENSMUSG00000021670   -0.52   -0.51  DOWN
## ENSMUSG00000022797   -0.76   -0.61  DOWN
## ENSMUSG00000023067   -0.92   -0.86  DOWN
## ENSMUSG00000023120   -0.66   -0.74  DOWN
## ENSMUSG00000024772   -0.21   -0.26  DOWN
## ENSMUSG00000024866   -0.46   -0.40  DOWN
## ENSMUSG00000025185   -0.78   -0.72  DOWN
## ENSMUSG00000026077   -1.19   -1.36  DOWN
## ENSMUSG00000026188   -0.97   -0.86  DOWN
## ENSMUSG00000026189   -0.49   -0.42  DOWN
## ENSMUSG00000026202   -0.48   -0.41  DOWN
## ENSMUSG00000026827   -0.58   -0.53  DOWN
## ENSMUSG00000027111   -0.47   -0.39  DOWN
## ENSMUSG00000028357   -0.37   -0.32  DOWN
## ENSMUSG00000028919   -0.42   -0.39  DOWN
## ENSMUSG00000029361   -0.40   -0.61  DOWN
## ENSMUSG00000029752   -0.83   -0.74  DOWN
## ENSMUSG00000031283   -0.82   -0.81  DOWN
## ENSMUSG00000031349   -0.39   -0.31  DOWN
## ENSMUSG00000031725   -0.63   -0.59  DOWN
## ENSMUSG00000031994   -1.33   -1.26  DOWN
## ENSMUSG00000032420   -0.39   -0.32  DOWN
## ENSMUSG00000032758   -1.47   -1.41  DOWN
## ENSMUSG00000036752   -0.49   -0.42  DOWN
## ENSMUSG00000038224   -0.77   -0.69  DOWN
## ENSMUSG00000039062   -0.37   -0.32  DOWN
## ENSMUSG00000040998   -0.51   -0.43  DOWN
## ENSMUSG00000041605   -0.51   -0.45  DOWN
## ENSMUSG00000041920   -0.70   -0.66  DOWN
## ENSMUSG00000042487   -0.49   -0.42  DOWN
## ENSMUSG00000043091   -0.43   -0.35  DOWN
## ENSMUSG00000043681   -0.81   -0.73  DOWN
## ENSMUSG00000045136   -0.53   -0.44  DOWN
## ENSMUSG00000046070   -0.50   -0.42  DOWN
## ENSMUSG00000050097   -0.69   -0.66  DOWN
## ENSMUSG00000052562   -1.20   -1.12  DOWN
## ENSMUSG00000053303   -1.45   -1.36  DOWN
## ENSMUSG00000054520   -0.53   -0.55  DOWN
## ENSMUSG00000054986   -1.13   -1.04  DOWN
## ENSMUSG00000055116   -1.04   -0.97  DOWN
## ENSMUSG00000056749   -1.03   -0.95  DOWN
## ENSMUSG00000058258   -0.50   -0.47  DOWN
## ENSMUSG00000058672   -0.49   -0.41  DOWN
## ENSMUSG00000059743   -0.42   -0.35  DOWN
## ENSMUSG00000063694   -0.35   -0.30  DOWN
## ENSMUSG00000070985   -0.49   -0.38  DOWN
## ENSMUSG00000074261   -0.37   -0.30  DOWN
## ENSMUSG00000074715   -2.05   -1.99  DOWN
## ENSMUSG00000090236   -0.36   -0.33  DOWN
## ENSMUSG00000090264   -0.85   -0.79  DOWN
## ENSMUSG00000001280    0.17    0.26    UP
## ENSMUSG00000002265    0.34    0.39    UP
## ENSMUSG00000002346    0.41    0.49    UP
## ENSMUSG00000003477    0.62    0.70    UP
## ENSMUSG00000003849    0.33    0.39    UP
## ENSMUSG00000004105    0.31    0.37    UP
## ENSMUSG00000005034    0.15    0.23    UP
## ENSMUSG00000005483    0.36    0.42    UP
## ENSMUSG00000006127    0.24    0.28    UP
## ENSMUSG00000006216    0.22    0.34    UP
## ENSMUSG00000006269    0.19    0.26    UP
## ENSMUSG00000006333    0.19    0.26    UP
## ENSMUSG00000007872    0.82    0.89    UP
## ENSMUSG00000008682    0.24    0.30    UP
## ENSMUSG00000009927    0.16    0.23    UP
## ENSMUSG00000012848    0.22    0.28    UP
## ENSMUSG00000015656    0.36    0.39    UP
## ENSMUSG00000015957    1.15    1.41    UP
## ENSMUSG00000018900    0.45    0.52    UP
## ENSMUSG00000020372    0.20    0.28    UP
## ENSMUSG00000020427    0.64    0.71    UP
## ENSMUSG00000020473    0.26    0.33    UP
## ENSMUSG00000020482    0.34    0.41    UP
## ENSMUSG00000020607    0.38    0.45    UP
## ENSMUSG00000020653    0.81    0.89    UP
## ENSMUSG00000020889    0.74    0.79    UP
## ENSMUSG00000021482    0.22    0.27    UP
## ENSMUSG00000021508    0.75    0.81    UP
## ENSMUSG00000021775    0.84    0.95    UP
## ENSMUSG00000022122    0.42    0.47    UP
## ENSMUSG00000022389    0.65    0.71    UP
## ENSMUSG00000022949    0.72    0.80    UP
## ENSMUSG00000023022    0.23    0.27    UP
## ENSMUSG00000024298    0.22    0.32    UP
## ENSMUSG00000024900    0.38    0.45    UP
## ENSMUSG00000025019    0.26    0.30    UP
## ENSMUSG00000025197    0.28    0.36    UP
## ENSMUSG00000025511    0.44    0.51    UP
## ENSMUSG00000025764    0.23    0.30    UP
## ENSMUSG00000025815    0.59    0.44    UP
## ENSMUSG00000026313    0.15    0.32    UP
## ENSMUSG00000027314    0.59    0.64    UP
## ENSMUSG00000027796    0.72    0.79    UP
## ENSMUSG00000027875    1.68    1.72    UP
## ENSMUSG00000028081    0.23    0.30    UP
## ENSMUSG00000028957    1.23    1.26    UP
## ENSMUSG00000029587    0.24    0.39    UP
## ENSMUSG00000029714    0.20    0.34    UP
## ENSMUSG00000030201    0.26    0.30    UP
## ENSMUSG00000030256    0.98    1.19    UP
## ENSMUSG00000031167    0.53    0.61    UP
## ENSMUSG00000031320    0.26    0.32    UP
## ENSMUSG00000032097    0.20    0.26    UP
## ENSMUSG00000032594    0.18    0.29    UP
## ENSMUSG00000032624    0.22    0.28    UP
## ENSMUSG00000033327    0.43    0.50    UP
## ENSMUSG00000033350    0.37    0.43    UP
## ENSMUSG00000033411    0.20    0.36    UP
## ENSMUSG00000034111    0.21    0.28    UP
## ENSMUSG00000034450    1.88    1.93    UP
## ENSMUSG00000034460    0.32    0.38    UP
## ENSMUSG00000035469    0.19    0.29    UP
## ENSMUSG00000035530    0.20    0.27    UP
## ENSMUSG00000035614    0.18    0.32    UP
## ENSMUSG00000037172    0.22    0.28    UP
## ENSMUSG00000037465    0.58    0.61    UP
## ENSMUSG00000037523    0.18    0.24    UP
## ENSMUSG00000037621    0.52    0.59    UP
## ENSMUSG00000038393    0.68    0.77    UP
## ENSMUSG00000039108    0.16    0.26    UP
## ENSMUSG00000039783    0.37    0.45    UP
## ENSMUSG00000039789    0.23    0.39    UP
## ENSMUSG00000039831    0.19    0.32    UP
## ENSMUSG00000040363    0.20    0.29    UP
## ENSMUSG00000040423    0.21    0.30    UP
## ENSMUSG00000040584    0.52    0.59    UP
## ENSMUSG00000040740    0.58    0.65    UP
## ENSMUSG00000041075    0.28    0.35    UP
## ENSMUSG00000041351    0.41    0.46    UP
## ENSMUSG00000041841    0.25    0.30    UP
## ENSMUSG00000042046    0.20    0.26    UP
## ENSMUSG00000042659    0.36    0.45    UP
## ENSMUSG00000042745    0.46    0.53    UP
## ENSMUSG00000043144    0.60    0.67    UP
## ENSMUSG00000044026    0.23    0.30    UP
## ENSMUSG00000045382    0.58    0.69    UP
## ENSMUSG00000045441    0.34    0.41    UP
## ENSMUSG00000045519    0.37    0.53    UP
## ENSMUSG00000047215    0.21    0.28    UP
## ENSMUSG00000047878    0.25    0.35    UP
## ENSMUSG00000048826    0.26    0.34    UP
## ENSMUSG00000049241    0.55    0.61    UP
## ENSMUSG00000050100    0.73    0.76    UP
## ENSMUSG00000053411    0.35    0.44    UP
## ENSMUSG00000053964    0.64    0.82    UP
## ENSMUSG00000054499    0.26    0.32    UP
## ENSMUSG00000054793    0.22    0.29    UP
## ENSMUSG00000055866    0.81    0.77    UP
## ENSMUSG00000055980    0.27    0.34    UP
## ENSMUSG00000056851    0.18    0.27    UP
## ENSMUSG00000058056    0.34    0.41    UP
## ENSMUSG00000058600    0.19    0.30    UP
## ENSMUSG00000058655    0.26    0.33    UP
## ENSMUSG00000059824    2.26    2.26    UP
## ENSMUSG00000061143    0.28    0.34    UP
## ENSMUSG00000061353    0.55    0.60    UP
## ENSMUSG00000061477    0.16    0.23    UP
## ENSMUSG00000062563    0.29    0.35    UP
## ENSMUSG00000063681    0.66    0.76    UP
## ENSMUSG00000064065    0.26    0.31    UP
## ENSMUSG00000067586    0.36    0.42    UP
## ENSMUSG00000068742    0.45    0.48    UP
## ENSMUSG00000069495    0.18    0.28    UP
## ENSMUSG00000070348    0.32    0.39    UP
## ENSMUSG00000071415    0.19    0.26    UP
## ENSMUSG00000074063    0.48    0.55    UP
## ENSMUSG00000074578    0.32    0.45    UP
## ENSMUSG00000086583    0.42    0.48    UP
## ENSMUSG00000098557    0.28    0.34    UP
## ENSMUSG00000106847    0.29    0.36    UP
## ENSMUSG00000110185    0.26    0.33    UP
## ENSMUSG00000110195    0.32    0.33    UP

The table is important to identify which pathways are represented in the Venn diagram. Often, it is a good idea to run both EdgeR and DESeq2.
You can choose whether to focus on the results of one, both, or the shared genes. It can be argued that the shared pathways have some higher confidence in their significance.


5 Other Tools

Now let’s work with a couple of different diverse tools in R.

5.1 Protein Alignment

We will perform a multiple sequence alignment (msa).
Let’s load the required library.

library(msa)

Let’s load our amino acid sequences from a fasta file.

seq <- readAAStringSet("hglobin.fa")

seq
## AAStringSet object of length 3:
##     width seq                                               names               
## [1]   142 MVLSPADKTNVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2]   142 MVLSGEDKSNIKAAWGKIGGHGA...PAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3]   142 MSLTRTERTIILSLWSKISTQAD...ADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI

Let’s do a msa of the amino acid sequences, using the ClustalW tool.

alignments <- msa(seq, method = "ClustalW")
## use default substitution matrix
alignments
## CLUSTAL 2.1  
## 
## Call:
##    msa(seq, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 3 rows and 142 columns
##     aln                                                    names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] MSLTRTERTIILSLWSKISTQADVIG...FTADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
## Con MVLS??DKTNIKAAWGKIG?HA?EYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus

Let’s print our alignments to a pdf file.

msaPrettyPrint(alignments, output = "pdf", showNames = "left", 
               showLogo = "none", askForOverwrite = FALSE, 
               verbose = TRUE, file = "whole_align.pdf")
## Multiple alignment written to temporary file /tmp/Rtmp2XBKuv/seq225f4d281c47.fasta
## File whole_align.tex created
## Output file whole_align.pdf created
msaPrettyPrint(alignments, c(10, 30), output = "pdf", showNames = "left",
               file = "zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
               verbose = TRUE)
## Multiple alignment written to temporary file /tmp/Rtmp2XBKuv/seq225f4e6328e2.fasta
## File zoomed_align.tex created
## Output file zoomed_align.pdf created

Let’s make a tree for our alignment.

First, let’s load the required packages.

library(ape)
library(seqinr)

Convert to seqinr alignment -> get the distances and make a tree.

alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")

distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")

tree <- ape::nj(distances1)
plot(tree, main = "Phylogenetic Tree of HBA Sequences")


Here, we used the neighbour-joining method for our alignment.

5.2 Synteny

Let’s load the required package.

library(DECIPHER)

In the first step, we will load the libraries and the sequence into long_seqs. This is a DNAStringSet object.

long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq
## DNAStringSet object of length 5:
##      width seq                                              names               
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT NC_018523.1 Sacch...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA NC_022431.1 Ascle...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC NC_022259.1 Nanno...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC NC_022417.1 Cocos...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT NC_022459.1 Camel...

Now we build a temporary SQLite database.

Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
## Adding 5 sequences to the database.
## 
## Added 5 new sequences to table Seqs.
## 55 total sequences in table Seqs.
## Time difference of 0.2 secs

Now that we’ve built the database, we can do the following:
Find the syntheny blocks.

synteny <- FindSynteny("long_db")
## ================================================================================
## 
## Time difference of 7.5 secs

Let’s plot the synteny blocks to view them.

pairs(synteny)

plot(synteny)

The colors across the line is the basepairs, the blocks are then conserved regions.
The pairing and comparison depends on the order of the sequences, as they are all compared to the first sequence.

Let’s make an actual alignment file.

alignment <- AlignSynteny(synteny, "long_db")
## ================================================================================
## 
## Time difference of 78 secs

This is a pairwise alignment of all blocks of synteny between sets of sequences.

Let’s create a structure holding all aligned syntenic blocks for a pair of sequences.

block <- unlist(alignment[[1]])

Let’s finish by writing to file one alignment at a time.

writeXStringSet(block, "genome_blocks_out.fa")


5.3 Unannotated Regions

Let’s load the required packages.

library(locfit)
library(Rsamtools)

Let’s create a function that will load the gene region information in a GFF file and convert it to a bioconductor GRanges object.

get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}

Get counts into windows across the genome in 500bp segments.

whole_genome <- csaw::windowCounts(
  file.path("windows.bam"),
  bin = TRUE,
  filter = 0,
  width = 500,
  param = csaw::readParam(
    minq = 20, #determines what is a passing read
    dedup = TRUE, #removes pcr duplicates
    pe = "both" #requires that both pairs of reads are aligned
  )
)

Since this is a single column of data, let’s rename it.

colnames(whole_genome) <- c("small_data")

Let’s extract our annotated regions.

annotated_regions <- get_annotated_regions_from_gff(file.path("genes.gff"))

Now that we have the windows of high expression, we want to see if any of them overlap with annotated regions.

Let’s load the required packages.

library(IRanges)
library(SummarizedExperiment)

Let’s find the overlaps between the windows we made.

windows_in_genes <- IRanges::overlapsAny(
  SummarizedExperiment::rowRanges(whole_genome), # creates a GRanges object
  annotated_regions
)

This goes over the entire annotated region and windows to see if overlap is TRUE.

windows_in_genes
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

Here let’s subset the whole_genome object into annotated and nonannotated regions.

annotated_window_counts <- whole_genome[windows_in_genes, ]
non_annotated_window_counts <- whole_genome[!windows_in_genes, ]

Now, let’s use assay() to extract the actual counts from the GRanges object.

assay(non_annotated_window_counts)
##       small_data
##  [1,]          0
##  [2,]         31
##  [3,]         25
##  [4,]          0
##  [5,]          0
##  [6,]         24
##  [7,]         25
##  [8,]          0
##  [9,]          0

In this next step, we use Rsamtools Pileup() function to get a per-base coverage dataframe.
Each row represents a single nucleotide in the reference count and the count column gives the depth of coverage at that point.

First, we load the required package.

library(bumphunter)
pile_df <- Rsamtools::pileup("windows.bam")

Next, let’s group reads with a certain distance from each other into a cluster, we then give the sequence names, position and distance.
A gene is very big, so you have reads that overlap, but do not completely overlap, thus, you want to group together those that overlap a lot.

clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap = 100)

table(clusters)
## clusters
##    1    2    3 
## 1486 1552 1520

Lastly, we will map the reads to the regions we found for the genome.

bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff = 1)
##    chr start  end value  area cluster indexStart indexEnd    L clusterL
## 3 Chr1  4503 5500  10.4 15811       3       3039     4558 1520     1520
## 1 Chr1   502 1500  10.0 14839       1          1     1486 1486     1486
## 2 Chr1  2501 3500   8.7 13436       2       1487     3038 1552     1552



6 Phylogenetic Trees

Let’s go on to working with Phylogenetic trees!

6.1 Treeio

Let’s load the required packages.

library(ggplot2)
library(ggtree)
library(treeio)

First, we need to load our raw tree data. It’s a Newick format so we use:

itol <- ape::read.tree("itol.nwk")

Now we will print out a very basic phylogenetic tree.

ggtree(itol)

We can also change the format to make it a circular tree

ggtree(itol, layout = "circular")

We can also change the left-right / up-down direction.

ggtree(itol) + coord_flip() + scale_x_reverse()

By using geom_tiplab() we can add names to the end of tips.

ggtree(itol) + geom_tiplab(color = "indianred", size = 1.5)

By adding a geom_strip() layer we can annotate clades in the tree with a block of color.

ggtree(itol) + geom_strip(13, 14, color = "red", barsize = 1)

We can highlight clades in unrooted trees with blobs of color using geom_hilight.

ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "steelblue")

We have to give a node number for it to know where to draw. But how do we know, when we cannot see node numbers on the tree?

Let’s load some more packages:

install.packages("BAMMtools")
library(BAMMtools)

We can use the MRCA (most recent common ancestor) function to find the node we want.

getmrca(itol, "Photorhabdus_luminescens", "Blochmannia_floridanus")
## [1] 206

Now if we want to highlight the section of the most recent common ancestor between the two above, then we can do the following:

ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "steelblue")


6.2 Quantifying Trees - Treespace

Quantifying difference between trees can be done with treespace.

First, let’s load the required packages

library(ape)
library(adegraphics)
library(treespace)

Now we need to load all the treefiles into a multiPhylo object.

treefiles <- list.files(file.path(getwd(), "genetrees"), full.names = TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)
## [1] "list"

We want the class to be a mulitPhylo, thus we do the following:

class(tree_list) <- "multiPhylo"
class(tree_list)
## [1] "multiPhylo"

Now we can compute the kendall-colijn distances between trees. This function does a LOT of analysis:

  • First, it runs a pairwise comparison of all trees in the input.
  • Second, it carries our clustering using PCA.

These results are returned in a list of objects, where $D contains the pairwise metric of the trees, and $pco contains the PCA. The method we use (Kendal-Colijn) is particularly suitable for rooted trees as we have here. The option NF tells us how many principal components to retain.

comparisons <- treespace(tree_list, nf = 3)

We can plot the pairwise distances between trees trees with table.image

table.image(comparisons$D, nclass = 25)


The number of classes determines the resolution of the image.

Now let’s print the PCA and clusters, this shows us how the group of trees cluster.

plotGroves(comparisons$pco, lab.show = TRUE, lab.cex = 1.5)


The more clustered together, the more similar the sequences are. In this case, the PCA is all over the place.

To work with this, we will take our PCA, and show what we think are different clusters. Thus we have grouped the clusters. The number of clusters is given by nclust. Forcing too many clusters will give conflicting reports.

groves <- findGroves(comparisons, nclust = 4) 
plotGroves(groves)


Doing fewer clusters, like 2, may also look fine. This is where it becomes subjective how many clusters there should be.

6.3 Binding Trees - Subtrees

Extracting and working with subtrees can be done using APE.

Let’s load our required packages.

library(ape)

Now let’s load the tree data we will be working with.

newick <- read.tree("mammal_tree.nwk")
l <- subtrees(newick)

Let’s plot the tree to see what it looks like.

plot(newick)

We can subset this plot using the “node” function.

plot(l[[4]], sub = "Node 4")

Now we extract the tree manually.

small_tree <- extract.clade(newick, 9)
plot(small_tree)

Now what if we want to bind two trees together? We can do the following:

new_tree <- bind.tree(newick, small_tree, 3)
plot(new_tree)

This binds the subsetted tree to the regular tree. This would make sense if we had different subsets of trees, which we would like to bind together.

6.4 Trees from Alignments - Phangorn

Reconstructing trees from alignments can be done with phangorn.

Let’s load the required packages.

library(Biostrings)
library(msa)
library(phangorn)

First, we’ll load the sequences into a seqs variable.

seqs <- readAAStringSet("abc.fa")

Now, let’s construct an alignment with the msa package and ClustalOmega.

aln <- msa::msa(seqs, method = c("ClustalOmega"))
## using Gonnet

To create a tree, we need to convert the alignment to a phyData objects.

aln <- as.phyDat(aln, type = "AA")
class(aln)
## [1] "phyDat"

In this step, we’ll actually make the trees. Trees are made from a distance matrix, which can be computed with dist.ml() - ML stands for maximum likelihood.

dist_mat <- dist.ml(aln)

Now we pass the distance matrix to UPGMA to make a UPGMA tree.

upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main = "UPGMA")

Conversely, we can pass the distance matrix to a neighbour-joining function.

nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")

Lastly, we are going to use the bootstraps.phyDat() function to compute bootstrap support for the branches of the tree. The first argument is the object (aln), while the second argument in the function is nj.

Bootstraps are essentially confidence intervals for how the clade is placed in the correct position.

fit <- pml(nj_tree, aln)
bootstraps <- bootstrap.phyDat(aln, FUN = function(x) {NJ(dist.ml(x))}, bs = 100)
plotBS(nj_tree, bootstraps, p = 10)



7 GWAS

Finally, let’s finish by working with genome wide association studies in R.

7.1 Variant Tools

Here, we will work with tools to find and map genetic variants.

First, let’s load the required libraries.

library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
library(VariantTools)

Now we want to load our datasets. We need .Bam and .fa files for this pipeline to work.

bam_file <- file.path(getwd(), "hg17_snps.bam")
fasta_file <- file.path(getwd(), "chr17_83k.fa")

Now we need to set up the genome object and the parameters object:

fa <- rtracklayer::FastaFile(fasta_file)

Now we create a GMapGenome object, which describes the genome to the later variant calling function.

genome <- gmapR::GmapGenome(fa, create = TRUE)

This next step sets our parameter for what is considered a variant.
There can be a lot of fine-tuning to this function. We are just going to use the standard settings.
If done for a research purpose, you could have a test set of bam files, making your own SNPs to see how good it is at identifying them.

qual_params <- TallyVariantsParam(
  genome = genome,
  minimum_mapq = 20)

var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)

Now we use the callVariants function in accordance with the parameters we defined above:

called_Variants <- callVariants(bam_file,
                                qual_params,
                                calling.filter = var_params)

head(called_Variants)
## VRanges object with 6 ranges and 17 metadata columns:
##           seqnames    ranges strand         ref              alt     totalDepth
##              <Rle> <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle>
##   [1] NC_000017.10        64      *           G                T            759
##   [2] NC_000017.10        69      *           G                T            812
##   [3] NC_000017.10        70      *           G                T            818
##   [4] NC_000017.10        73      *           T                A            814
##   [5] NC_000017.10        77      *           T                A            802
##   [6] NC_000017.10        78      *           G                T            798
##             refDepth       altDepth   sampleNames softFilterMatrix | n.read.pos
##       <integerOrRle> <integerOrRle> <factorOrRle>         <matrix> |  <integer>
##   [1]            739             20          <NA>                  |         17
##   [2]            790             22          <NA>                  |         19
##   [3]            796             22          <NA>                  |         20
##   [4]            795             19          <NA>                  |         13
##   [5]            780             22          <NA>                  |         19
##   [6]            777             21          <NA>                  |         17
##       n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
##            <integer>       <integer>  <integer>      <integer>   <integer>
##   [1]             64             759         20            739           0
##   [2]             69             812         22            790           0
##   [3]             70             818         22            796           0
##   [4]             70             814         19            795           0
##   [5]             70             802         22            780           0
##   [6]             70             798         21            777           0
##       count.minus.ref count.del.plus count.del.minus read.pos.mean
##             <integer>      <integer>       <integer>     <numeric>
##   [1]               0              0               0       30.9000
##   [2]               0              0               0       40.7273
##   [3]               0              0               0       34.7727
##   [4]               0              0               0       36.1579
##   [5]               0              0               0       38.3636
##   [6]               0              0               0       39.7143
##       read.pos.mean.ref read.pos.var read.pos.var.ref     mdfne mdfne.ref
##               <numeric>    <numeric>        <numeric> <numeric> <numeric>
##   [1]           32.8755      318.558          347.804        NA        NA
##   [2]           35.4190      377.004          398.876        NA        NA
##   [3]           36.3442      497.762          402.360        NA        NA
##   [4]           36.2176      519.551          402.843        NA        NA
##   [5]           36.0064      472.327          397.070        NA        NA
##   [6]           35.9241      609.076          390.463        NA        NA
##       count.high.nm count.high.nm.ref
##           <integer>         <integer>
##   [1]            20               738
##   [2]            22               789
##   [3]            22               796
##   [4]            19               769
##   [5]            22               780
##   [6]            21               777
##   -------
##   seqinfo: 1 sequence from chr17_83k genome
##   hardFilters(4): nonRef nonNRef readCount likelihoodRatio

We have identified 6 variants.

Now, we move on to annotation and load the feature position information from gff.

get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
}

Note that you can load this data from a BED file. A Browser Extensible Data (BED) file, is a text file format used to store genomic regions as coordinates and associated annotations. You can read more about the bed files and the package here.

Let’s save our annotated genes into a variable using the function above.

genes <- get_annotated_regions_from_gff("chr17.83k.gff3")

Now we can calculate which variants overlap which genes:

overlaps <- GenomicRanges::findOverlaps(called_Variants, genes)
overlaps
## Hits object with 12684 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]     35176           1
##       [2]     35176           2
##       [3]     35176           3
##       [4]     35177           1
##       [5]     35177           2
##       ...       ...         ...
##   [12680]     40944           2
##   [12681]     40944           7
##   [12682]     40945           1
##   [12683]     40945           2
##   [12684]     40945           7
##   -------
##   queryLength: 44949 / subjectLength: 8

And lastly, we subset the genes with the list of overlaps:

identified <- genes[subjectHits(overlaps)]
identified
## GRanges object with 12684 ranges and 20 metadata columns:
##               seqnames      ranges strand |   source       type     score
##                  <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric>
##       [1] NC_000017.10 64099-76866      - |   havana ncRNA_gene        NA
##       [2] NC_000017.10 64099-76866      - |   havana lnc_RNA           NA
##       [3] NC_000017.10 64099-65736      - |   havana exon              NA
##       [4] NC_000017.10 64099-76866      - |   havana ncRNA_gene        NA
##       [5] NC_000017.10 64099-76866      - |   havana lnc_RNA           NA
##       ...          ...         ...    ... .      ...        ...       ...
##   [12680] NC_000017.10 64099-76866      - |   havana lnc_RNA           NA
##   [12681] NC_000017.10 76723-76866      - |   havana exon              NA
##   [12682] NC_000017.10 64099-76866      - |   havana ncRNA_gene        NA
##   [12683] NC_000017.10 64099-76866      - |   havana lnc_RNA           NA
##   [12684] NC_000017.10 76723-76866      - |   havana exon              NA
##               phase                     ID            Name     biotype
##           <integer>            <character>     <character> <character>
##       [1]      <NA>   gene:ENSG00000280279      AC240565.2     lincRNA
##       [2]      <NA> transcript:ENST00000..  AC240565.2-201     lincRNA
##       [3]      <NA>                   <NA> ENSE00003759547        <NA>
##       [4]      <NA>   gene:ENSG00000280279      AC240565.2     lincRNA
##       [5]      <NA> transcript:ENST00000..  AC240565.2-201     lincRNA
##       ...       ...                    ...             ...         ...
##   [12680]      <NA> transcript:ENST00000..  AC240565.2-201     lincRNA
##   [12681]      <NA>                   <NA> ENSE00003756684        <NA>
##   [12682]      <NA>   gene:ENSG00000280279      AC240565.2     lincRNA
##   [12683]      <NA> transcript:ENST00000..  AC240565.2-201     lincRNA
##   [12684]      <NA>                   <NA> ENSE00003756684        <NA>
##                description         gene_id  logic_name     version
##                <character>     <character> <character> <character>
##       [1] novel transcript ENSG00000280279      havana           1
##       [2]             <NA>            <NA>        <NA>           1
##       [3]             <NA>            <NA>        <NA>           1
##       [4] novel transcript ENSG00000280279      havana           1
##       [5]             <NA>            <NA>        <NA>           1
##       ...              ...             ...         ...         ...
##   [12680]             <NA>            <NA>        <NA>           1
##   [12681]             <NA>            <NA>        <NA>           1
##   [12682] novel transcript ENSG00000280279      havana           1
##   [12683]             <NA>            <NA>        <NA>           1
##   [12684]             <NA>            <NA>        <NA>           1
##                           Parent         tag   transcript_id
##                  <CharacterList> <character>     <character>
##       [1]                               <NA>            <NA>
##       [2]   gene:ENSG00000280279       basic ENST00000623180
##       [3] transcript:ENST00000..        <NA>            <NA>
##       [4]                               <NA>            <NA>
##       [5]   gene:ENSG00000280279       basic ENST00000623180
##       ...                    ...         ...             ...
##   [12680]   gene:ENSG00000280279       basic ENST00000623180
##   [12681] transcript:ENST00000..        <NA>            <NA>
##   [12682]                               <NA>            <NA>
##   [12683]   gene:ENSG00000280279       basic ENST00000623180
##   [12684] transcript:ENST00000..        <NA>            <NA>
##           transcript_support_level constitutive ensembl_end_phase ensembl_phase
##                        <character>  <character>       <character>   <character>
##       [1]                     <NA>         <NA>              <NA>          <NA>
##       [2]                        5         <NA>              <NA>          <NA>
##       [3]                     <NA>            1                -1            -1
##       [4]                     <NA>         <NA>              <NA>          <NA>
##       [5]                        5         <NA>              <NA>          <NA>
##       ...                      ...          ...               ...           ...
##   [12680]                        5         <NA>              <NA>          <NA>
##   [12681]                     <NA>            1                -1            -1
##   [12682]                     <NA>         <NA>              <NA>          <NA>
##   [12683]                        5         <NA>              <NA>          <NA>
##   [12684]                     <NA>            1                -1            -1
##                   exon_id        rank
##               <character> <character>
##       [1]            <NA>        <NA>
##       [2]            <NA>        <NA>
##       [3] ENSE00003759547           5
##       [4]            <NA>        <NA>
##       [5]            <NA>        <NA>
##       ...             ...         ...
##   [12680]            <NA>        <NA>
##   [12681] ENSE00003756684           1
##   [12682]            <NA>        <NA>
##   [12683]            <NA>        <NA>
##   [12684] ENSE00003756684           1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths


7.2 Predicting ORFs

Let’s try predicting Open Reading Frames (ORFs) in R.

First, let’s load the required packages.

library(Biostrings)
library(systemPipeR)

Let’s load the data into a DNAStrings object, in this case, an Arabidopsis chloroplast genome.

dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa")

Now let’s predict the open reading frames with predORF(), we’ll predict all ORFs on both strands.

predict_orfs <- predORF(dna_object, n = "all", type = "gr", mode = "ORF", strand = "both",
                        longest_disjoint = TRUE)

predict_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
##           seqnames        ranges strand | subject_id inframe2end
##              <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##      1 chloroplast   86762-93358      + |          1           2
##   1162 chloroplast     2056-2532      - |          1           3
##      2 chloroplast   72371-73897      + |          2           2
##   1163 chloroplast   77901-78362      - |          2           1
##      3 chloroplast   54937-56397      + |          3           3
##    ...         ...           ...    ... .        ...         ...
##   2497 chloroplast 129757-129762      - |       1336           3
##   2498 chloroplast 139258-139263      - |       1337           3
##   2499 chloroplast 140026-140031      - |       1338           3
##   2500 chloroplast 143947-143952      - |       1339           3
##   2501 chloroplast 153619-153624      - |       1340           3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

This printed out a GRanges object in return, with 2501 open reading frames.
This is FAR too many open reading frames. It finds every TF binding site, Start codon, etc.

To filter out erronous ORFs, we do a simulation. We first estimate the length an ORF can reach by chance. We will create a string of random nucleotides that is the length of our chloroplast genome, and determine the longer ORFs that can arise by chance.

bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")

Now we ensure that our random nucleotides match the proportion of nucleotides in our chloroplast genome, so we have no bias.

seq_length <- width(dna_object[1])

Now we make a function to go over the column, every row, find the bases, and add them.

counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))})

Now, we find the probabilities of each nucleotide being at a position.

probs <- unlist(lapply(counts, function(base_count) {signif(base_count/seq_length, 2)}))
probs
## [1] 6.5e-06 6.5e-06 6.5e-06 6.5e-06

As can be seen, the probability for each nucleotide to be at each position is equal. Thus, it seems there is no bias.

Now we can build our function to simulate our genome.

get_longest_ORF_in_random_genome <- function(x,
                                             length = 1000,
                                             probs = c(0.25, 0.25, 0.25, 0.25),
                                             bases = c("A", "T", "G", "C")){
  # Here we create our random genome and allow replacement for the next iteration.
  random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
  random_dna_object <- DNAStringSet(random_genome)
  names(random_dna_object) <- c("random_dna_string")
  orfs <- predORF(random_dna_object, n = 1, type = "gr", mode = "ORF", strand = "both", longest_disjoint = TRUE)
  return(max(width(orfs)))
}

Now we use the function we just created to predict the ORFs in 10 random genomes.

random_lengths <- unlist(lapply(1:10, get_longest_ORF_in_random_genome, length = seq_length, probs = probs, bases = bases))

Let’s pull out the longest length from our 10 simulations.

longest_random_orf <- max(random_lengths)

We have now done the simulation, where each basepair was randomly selected.
By chance, how many ORFs do we predict, and what is the longest possible one?

Let’s only keep the frames that are longer in our chloroplast genome.

keep <- width(predict_orfs) > longest_random_orf

This creates TRUE / FALSE statements. We can use this to filter.

orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
## GRanges object with 12 ranges and 2 metadata columns:
##         seqnames        ranges strand | subject_id inframe2end
##            <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##    1 chloroplast   86762-93358      + |          1           2
##    2 chloroplast   72371-73897      + |          2           2
##    3 chloroplast   54937-56397      + |          3           3
##    4 chloroplast   57147-58541      + |          4           1
##    5 chloroplast   33918-35141      + |          5           1
##   ..         ...           ...    ... .        ...         ...
##    8 chloroplast 114461-115447      + |          8           2
##    9 chloroplast 141539-142276      + |          9           2
##   10 chloroplast   60741-61430      + |         10           1
##   11 chloroplast 143088-143708      + |         11           1
##   12 chloroplast   62038-62619      + |         12           3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Let’s write this data to a file.

extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")


7.3 KaryploteR

Let’s work with finding karyotypes.

First, let’s load the required packages.

library(karyoploteR)
library(GenomicRanges)

Now we need to set up the genome object for our karyotype.

genome_df <- data.frame(
  # first we dictate the number of chromosomes.
  chr = paste0("chr", 1:5),
  start = rep(1, 5),
  # and then we will dictate the length of each chromosome.
  end = c(34964571, 22037565, 25499034, 20862711, 31279811)
)

Now we convert the dataframe to a GRanges object.

genome_gr <- makeGRangesFromDataFrame(genome_df)

Now let’s create some snp positions to map out. We do this by using the sample() function.

snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
  chr = paste0("chr", sample(1:5, 25, replace = TRUE)),
  start = snp_pos,
  end = snp_pos
)

Again we convert the dataframe to a GRanges object.

snps_gr <- makeGRangesFromDataFrame(snps)

Now let’s create some snp labels.

snp_labels <- paste0("snp_", 1:25)

Here we will set the parameters for our plot.

plot.params <- getDefaultPlotParams(plot.type = 1)

Here we will set the margins of our plot.

plot.params$data1outmargin <- 600

Now let’s plot our snps.

kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)

We can also add some numeric data to our plots. We will generate 100 random numbers that plot to 100 windows on our chromosome 4.

numeric_data <- data.frame(
  y = rnorm(100, mean = 1, sd = 0.5),
  chr = rep("chr4", 100),
  start = seq(1, 20862771, 20862771 / 100),
  end = seq(1, 20862771, 20862771 / 100)
)

Now let’s make the data a GRanges object.

numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)

Let’s set our plot parameters again.

plot.params <- getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800

Let’s plot the data.

kp <- plotKaryotype(genome = genome_gr, plot.type = 2, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)


That was the end of the final project, going through the course Bioinformatics with R on Praxis AI.