This document is the final project for the course Bioinformatics with
R on Praxis AI.
All subjects and associated R projects that were
worked on throughout this course have been compiled in this file.
This is how R does addition:
12 + 6
## [1] 18
This is how R does subtraction:
12 - 6
## [1] 6
This is how we store data as avariable. We are going store the days of the week for this current example.
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
Let’s find the 5th entry.
days[5]
## [1] "Friday"
Let’s pull out a range of entries.
days[1:3]
## [1] "Monday" "Tuesday" "Wednesday"
Let’s pull out the end of the week.
days[5:7]
## [1] "Friday" "Saturday" "Sunday"
Let’s pull out specific days.
days[c(1,3,5,7)]
## [1] "Monday" "Wednesday" "Friday" "Sunday"
Let’s subset our days variable and create weekdays.
weekdays <- days[1:5]
print(weekdays)
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
Let’s subset for the weekend.
weekend <- days[6:7]
print(weekend)
## [1] "Saturday" "Sunday"
A generic function in R may look like this: f(argument1, argument2,
…)
Where f is the name of the function, and argument1, argument2,…
are the different conditions that we are asking the function to
evaluate.
exampleFunction <- function(x, y)
{c(x + 1, y + 10)}
exampleFunction(2, 4)
## [1] 3 14
Examples of built-in functions of R:
Exponents:
exp(2)
## [1] 7.389056
Tangents:
tan(2)
## [1] -2.18504
Logs:
log(12)
## [1] 2.484907
log(x = 12, base = 4)
## [1] 1.792481
Shorter version og logs:
log(12, 4)
## [1] 1.792481
If you want the documentation of functions and packages in R, you can
type ?functionname or ?packagename.
Let’s make an array and compare it to a regular list.
months <- array(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), dim = c(3, 4))
months
## [,1] [,2] [,3] [,4]
## [1,] "Jan" "Apr" "Jul" "Oct"
## [2,] "Feb" "May" "Aug" "Nov"
## [3,] "Mar" "Jun" "Sep" "Dec"
The array function fills out a matrix with the values given and following the specified dimensions. Now let’s make a list for comparison. Then let’s see how to pull a value from the array.
months1 <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
months1
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
months[2, 3]
## [1] "Aug"
Let’s look at a matrix.
months2 <- matrix(data = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), nrow = 3, ncol = 4)
months2
## [,1] [,2] [,3] [,4]
## [1,] "Jan" "Apr" "Jul" "Oct"
## [2,] "Feb" "May" "Aug" "Nov"
## [3,] "Mar" "Jun" "Sep" "Dec"
Remember to always look at the different arguments a function takes.
Let’s make a 3D array.
array3D <- array(c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36), dim = c(3, 3, 2))
array3D
## , , 1
##
## [,1] [,2] [,3]
## [1,] 2 8 14
## [2,] 4 10 16
## [3,] 6 12 18
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 20 26 32
## [2,] 22 28 34
## [3,] 24 30 36
array3D[1, 3, 2]
## [1] 32
The argument for array3D[] goes: First it looks at row, then find column, finally, it says look at the following layer.
If you want to pull an entire row or column from 3D:
array3D[2, , 1]
## [1] 4 10 16
First we say: row 2, all columns, layer 1.
HSPA1a <- list(gene = "HSPA1a", amino.acids = 641, nucleotides = 2400)
HSPA1a
## $gene
## [1] "HSPA1a"
##
## $amino.acids
## [1] 641
##
## $nucleotides
## [1] 2400
Let’s pull out just the amino acids.
HSPA1a$amino.acids
## [1] 641
Let’s pull out the nucleotides.
HSPA1a$nucleotides
## [1] 2400
Now let’s make 3 lists and combine them into a data frame.
gene <- c("HSPA4", "HSPA5", "HSPA8", "HSPA9", "HSPA1A", "HSPA1B")
nucleotides <- c(54537, 6491, 4648, 21646, 2400, 2517)
aminoAcids <- c(840, 654, 493, 679, 641, 641)
HSPs <- data.frame(gene, nucleotides, aminoAcids)
HSPs
## gene nucleotides aminoAcids
## 1 HSPA4 54537 840
## 2 HSPA5 6491 654
## 3 HSPA8 4648 493
## 4 HSPA9 21646 679
## 5 HSPA1A 2400 641
## 6 HSPA1B 2517 641
Let’s pull out the genes.
HSPs$gene
## [1] "HSPA4" "HSPA5" "HSPA8" "HSPA9" "HSPA1A" "HSPA1B"
Let’s pull out the nucleotides.
HSPs$nucleotides
## [1] 54537 6491 4648 21646 2400 2517
Let’s seach for a specific amino acid count.
HSPs$aminoAcids[HSPs$gene == "HSPA8"]
## [1] 493
What we did was: Look at the amino acid column in HSPs dataframe, and
pull out the value for gene HSP8.
Let’s quickly look at our gene vector and HSP dataframe.
gene
## [1] "HSPA4" "HSPA5" "HSPA8" "HSPA9" "HSPA1A" "HSPA1B"
HSPs
## gene nucleotides aminoAcids
## 1 HSPA4 54537 840
## 2 HSPA5 6491 654
## 3 HSPA8 4648 493
## 4 HSPA9 21646 679
## 5 HSPA1A 2400 641
## 6 HSPA1B 2517 641
The function class() allows you to check the class of a value or set of values in a dataframe.
class(HSPs$gene)
## [1] "character"
HSPs$gene
## [1] "HSPA4" "HSPA5" "HSPA8" "HSPA9" "HSPA1A" "HSPA1B"
class(HSPs$nucleotides)
## [1] "numeric"
HSPs$nucleotides
## [1] 54537 6491 4648 21646 2400 2517
class(aminoAcids)
## [1] "numeric"
HSPs$aminoAcids
## [1] 840 654 493 679 641 641
x <- 15 + 38
x
## [1] 53
class(x)
## [1] "numeric"
With the function as.character, you can force R to store values as a character.
z <- as.character(c(1, 2, 3, 4, 5, 6))
z
## [1] "1" "2" "3" "4" "5" "6"
class(z)
## [1] "character"
y <- as.character(c(9, 8, 7, 6, 5, 4))
z + y
## Error in z + y: non-numeric argument to binary operator
z + y will give an error message, as math cannot be applied to characters. However, you can apply math to integers in a list.
z2 <- c(1, 2, 3, 4, 5, 6)
y2 <- c(9, 8, 7, 6, 5, 4)
z2 + y2
## [1] 10 10 10 10 10 10
Now let’s take a look at the different classes we have worked with:
class(z2)
## [1] "numeric"
class(HSPs)
## [1] "data.frame"
class(y)
## [1] "character"
A character can be converted to a numeric value.
y <- as.numeric(y)
class(y)
## [1] "numeric"
An example of a formula could be: \[y ~ x1
+ x2\]
First, we load our example data and save it.
dataset <- iris
With large datasets, when you want to check if your data is there, it may be easier to call the head and tail (first 5 and last 5 rows).
head(dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
dataset
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
As seen, a long table was printed with “dataset”. This can become
cumbersome and unnecessary for a document. Henceforth, long tables will
only be shown with the head and tail function. The original code for
printing the table will still be presented.
Let’s look at the total
number of rows in the dataset.
nrow(dataset)
## [1] 150
Let’s look at the number of columns in our dataset.
ncol(dataset)
## [1] 5
Let’s start with a simple linear model.
petals.lm <- lm(formula = Petal.Length ~ Petal.Width, data = dataset)
petals.lm
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = dataset)
##
## Coefficients:
## (Intercept) Petal.Width
## 1.084 2.230
summary(petals.lm)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33542 -0.30347 -0.02955 0.25776 1.39453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08356 0.07297 14.85 <2e-16 ***
## Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
Summary gives a better overview of the statistics for the linear
regression.
First, let’s call the list of names in the iris dataset.
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Let’s make a histogram. This shows the frequency distribution.
hist(iris$Sepal.Length)
Let’s increase the number of bins in our histogram.
hist(iris$Sepal.Length, breaks = 25)
Let’s add some labels.
hist(iris$Sepal.Length, breaks = 25, xlab = "Sepal Length", main = "Sepal Length Frequency")
Let’s make a scatter plot of the regression analysis of length and
width.
plot(iris$Sepal.Length ~ iris$Sepal.Width, xlab = "Sepal Length", ylab = "Sepal Width")
Let’s make a lattice dotplot.
First, let’s load the required
package:
library(lattice)
Here we can separate by species.
dotplot(Sepal.Length ~ Sepal.Width|Species, data = iris)
Let’s do the same for petals.
dotplot(Petal.Length ~ Petal.Width|Species, data = dataset)
With the basics down, we need to work on wrangling data to make it
neat and tidy.
First, let’s load the required libraries and look at the top of
the data.
library(tidyverse)
??flights
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
First, we will just look at the data on October 14th.
filter(my_data, month == 10, day == 14)
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we want to subset this into a new variable, we do the following:
oct_14_flight <- filter(my_data, month == 10, day == 14)
What if you want to do both; print and save the variable?
(oct_14_flight_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If you want to filter based on different operators, you can use the following:
Equals: ==
Not equal: to !=
Greater than: >
Less
than: <
Greater than or equal to: >=
Less than or equal
to: <=
Let’s try it:
(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we don’t use the == to mean equals, we get this:
(oct_14_flight_2 <- filter(my_data, month = 10, day = 14))
## Error in `filter()`:
## ! We detected a named input.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `month == 10`?
In addition, the error message in the console will start with:
Quitting from lines x-y [unnamed-chunk-z] (Filename)
And end with:
Backtrace:
1. dplyr::filter(my_data, month = 10, day = 14)
2.
dplyr:::filter.data.frame(my_data, month = 10, day = 14)
Execution
halted
You can also use logical operators to be more selective:
and: &
or: |
not: !
Let’s use the “or” function to pick flights in march and april.
Then let’s use the “and” function to pick a day in a month.
And
finally let’s use the “not” function.
March_April_Flights <- filter(my_data, month == 3 | month == 4)
March_4th_Flights <- filter(my_data, month == 3 & day == 4)
Non_jan_Flights <- filter(my_data, month != 1)
Arrange allows us to arrange the dataset based on the variables we desire.
arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can also do this in descending order.
descending <- arrange(my_data, desc(year), desc(day), desc(month))
Missing values are always placed at the end of the dataframe
regardless of ascending or descending.
We can also select specific columns that we want to look at.
calendar <- select(my_data, year, month, day)
print(calendar)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
We can also look at a range of columns.
calendar2 <- select(my_data, year:day)
Lets look at all columns months through carrier.
calendar3 <- select(my_data, year:carrier)
We can also choose which columns NOT to include.
everything_else <- select(my_data, -(year:day))
In this instance, we can also use the NOT operator.
everything_else2 <- select(my_data, !(year:day))
There are also some other helper function that can help you select the columns or data you’re looking for.
starts_with(“xyz”) - will select the values that start with xyz.
ends_with(“xyz) - will select the values that end with xyz.
contains(”xyz”) - will select the values that contain xyz.
matches(“xyz”) - will select the values that are identical to xyz.
Let’s look at the data and rename a column.
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
rename(my_data, departure_time = dep_time)
## # A tibble: 336,776 × 19
## year month day departure_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
my_data2 <- rename(my_data, departure_time = dep_time)
What if you want to add new columns to your data frame? We have the mutate() function for that.
First, we make a smaller data frame so we can see what we are doing.
my_data_small <- select(my_data, year:day, distance, air_time)
Let’s calculate the speed of the flights.
mutate(my_data_small, speed = distance / air_time * 60)
## # A tibble: 336,776 × 6
## year month day distance air_time speed
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2013 1 1 1400 227 370.
## 2 2013 1 1 1416 227 374.
## 3 2013 1 1 1089 160 408.
## 4 2013 1 1 1576 183 517.
## 5 2013 1 1 762 116 394.
## 6 2013 1 1 719 150 288.
## 7 2013 1 1 1065 158 404.
## 8 2013 1 1 229 53 259.
## 9 2013 1 1 944 140 405.
## 10 2013 1 1 733 138 319.
## # ℹ 336,766 more rows
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)
What if we wanted to create a new dataframe with ONLY your calculations (transmute).
airspeed <- transmute(my_data_small, speed = distance / air_time * 60, speed2 = distance / air_time)
We can use summarize to run a function on a data column to get a single return.
summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
So we can see here that the average delay is about 12 minutes.
We gain additional value in summarize by pairing it with by_group().
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ℹ 355 more rows
As you can see, we now have the delay by the days of the year.
What happens if we don’t tell R what to do with the missing data?
summarize(by_day, delay = mean(dep_delay))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # ℹ 355 more rows
We can also filter our data based on NA (which in this dataset was cancelled flights).
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
is.na tests the statement of if data is NA by TRUE or FALSE.
is.na(airspeed)
head(is.na(airspeed))
## speed speed2
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE
## [6,] FALSE FALSE
tail(is.na(airspeed))
## speed speed2
## [336771,] TRUE TRUE
## [336772,] TRUE TRUE
## [336773,] TRUE TRUE
## [336774,] TRUE TRUE
## [336775,] TRUE TRUE
## [336776,] TRUE TRUE
Let’s run summarize again on this data:
summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
We can also count the amount of variables that are NA.
sum(is.na(my_data$dep_delay))
## [1] 8255
We can also count the numbers that are NOT NA.
sum(!is.na(my_data$dep_delay))
## [1] 328521
With tibble datasets, we can pipe results to get rid of the need to
use the dollarsigns.
We can then summarize the number of flights by
minutes delayed.
my_data %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_time, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 1385.
## 2 2013 1 2 1354.
## 3 2013 1 3 1357.
## 4 2013 1 4 1348.
## 5 2013 1 5 1326.
## 6 2013 1 6 1399.
## 7 2013 1 7 1341.
## 8 2013 1 8 1335.
## 9 2013 1 9 1326.
## 10 2013 1 10 1333.
## # ℹ 355 more rows
Let’s load the required package.
library(tibble)
Now we will take the time to explore tibbles. Tibbles are modified data frames which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
As we can see, we have the same data frame, but we have different features.
You can also create a tibble from scratch with tibble():
tibble(
x = 1:5,
y = 1,
z = x ^2 + y
)
## # A tibble: 5 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
You can also use tribble() for basic data table creation.
tribble(
~genea, ~geneb, ~genec,
#######################
110, 112, 114,
6, 5, 4
)
## # A tibble: 2 × 3
## genea geneb genec
## <dbl> <dbl> <dbl>
## 1 110 112 114
## 2 6 5 4
Tibbles are built to not overwhelm your console when printing data, only showing the first few lines.
Let’s compare how tibbles and data frames print:
print(by_day)
## # A tibble: 336,776 × 19
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
as.data.frame(by_day)
head(by_day)
## # A tibble: 6 × 19
## # Groups: year, month, day [1]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
As you can see when you run the code “as.data.frame(by_day)”, the
entire data frame is printed, which is overwhelming in the console and
in a document.
head() is kind of what tibble() does by default.
We can specify with tibbles how many rows we want to print:
nycflights13::flights %>%
print(n = 10, width = Inf)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## 7 19 B6 507 N516JB EWR FLL 158 1065 6 0
## 8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
## 9 -8 B6 79 N593JB JFK MCO 140 944 6 0
## 10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
## time_hour
## <dttm>
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
## 7 2013-01-01 06:00:00
## 8 2013-01-01 06:00:00
## 9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # ℹ 336,766 more rows
Subsetting tibbles is easy, similar to data frames.
df_tibble <- tibble(nycflights13::flights)
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can subset by column name using the $:
df_tibble$carrier
head(df_tibble$carrier)
## [1] "UA" "UA" "AA" "B6" "DL" "UA"
We can subset by position using []
df_tibble[[2]]
head(df_tibble[[2]])
## [1] 1 1 1 1 1 1
If you want to use this in a pipe, you need to use the “.” placeholder.
df_tibble %>%
.$carrier
df_tibble %>%
.$carrier %>%
head()
## [1] "UA" "UA" "AA" "B6" "DL" "UA"
Some older functions do not like tibbles, thus you might have to convert them back to a data frame.
class(df_tibble)
## [1] "tbl_df" "tbl" "data.frame"
df_tibble2 <- as.data.frame(df_tibble)
class(df_tibble2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
Let’s load the required package.
library(tidyverse)
How do we make a tidy dataset? Well the tidyverse follows three rules:
1 - Each variable must have its own column.
2 - Each observation
has its own row.
3 - Each value has its own cell.
It is impossible to satisfy two of the three rules.
This leads to the following instructions for tidy data:
1 - Put each dataset into a tibble.
2 - Put each variable into a
column.
3 - Profit
Picking one consistent method of data storage makes for easier understanding of your code and what is happening ” under the hood” or behind the scenes.
Let’s now look at working with tibbles, calculating and adding bmi to the tibble as an example:
bmi <- tibble(women)
bmi %>%
mutate(bmi = (703 * weight)/(height)^2)
## # A tibble: 15 × 3
## height weight bmi
## <dbl> <dbl> <dbl>
## 1 58 115 24.0
## 2 59 117 23.6
## 3 60 120 23.4
## 4 61 123 23.2
## 5 62 126 23.0
## 6 63 129 22.8
## 7 64 132 22.7
## 8 65 135 22.5
## 9 66 139 22.4
## 10 67 142 22.2
## 11 68 146 22.2
## 12 69 150 22.1
## 13 70 154 22.1
## 14 71 159 22.2
## 15 72 164 22.2
Sometimes you will find datasets that don’t fit well into a tibble.
We’ll use the built-in data from tidyverse for this part.
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
As you can see from this data, we have one variable in column A (country), but columns b and c are two of the same. Thus, there are two observations in each row.
To fix this we can use the gather function:
table4a %>%
gather('1999', '2000', key = 'year', value = 'cases')
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Let’s look at another example.
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
As you can see, we have the same problem in table 4b.
table4b %>%
gather('1999', '2000', key = 'year', value = 'population')
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
Now what if we want to join these two tables? We can use dplyr.
table4a <- table4a %>%
gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>%
gather('1999', '2000', key = 'year', value = 'population')
left_join(table4a, table4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
Spreading is the opposite of gathering.
Let’s look at table
2
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
You can see that we have redundant info in columns 1 and 2.
We
can fix that by combining rows 1&2, 3&4, etc.
spread(table2, key = type, value = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Type is the key of what we are returning into columns, the value is what becomes rows/observations.
In summary, spread makes long tables shorter and wider, and gather
makes wide tables narrower and longer.
Now what happens if we have two observations in one column?
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
As you can see, the rate is just the population and cases combined.
We can use separate to fix this.
table3 %>%
separate(rate, into = c('cases', 'population'))
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
However, if you notice, the columns’ type is not correct.
We can
easily fix this with the convert command:
table3 %>%
separate(rate, into = c('cases', 'population'), convert = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can actually specify what you want to separate based on.
table3 %>%
separate(rate, into = c('cases', 'population'), sep = '/', convert = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Let’s make this code look more tidy:
table3 %>%
separate(
rate,
into = c('cases', 'population'),
convert = TRUE,
sep = '/'
)
We can also separate based on number of digits:
table3 %>%
separate(
year,
into = c('century', 'year'),
convert = TRUE,
sep = 2
)
## # A tibble: 6 × 4
## country century year rate
## <chr> <int> <int> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 0 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 0 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 0 213766/1280428583
What happens if we want to do the inverse of separate?
We can
use the unite() function for that:
table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5 %>%
unite(data, century, year)
## # A tibble: 6 × 3
## country data rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
table5 %>%
unite(data, century, year, sep = "")
## # A tibble: 6 × 3
## country data rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
There can be two types of missing values: NA (explicit) or just entry (implicit).
gene_data <- tibble(
gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
nuc = c(20, 22, 24, 25, NA, 42, 67),
run = c(1, 2, 3, 4, 2, 3, 4)
)
gene_data
## # A tibble: 7 × 3
## gene nuc run
## <chr> <dbl> <dbl>
## 1 a 20 1
## 2 a 22 2
## 3 a 24 3
## 4 a 25 4
## 5 b NA 2
## 6 b 42 3
## 7 b 67 4
The nucleotide count for gene b, run 2, is explicitly missing.
The nucleotide count for gene b, run 1, is implicitly missing.
One way we can make implicit values explicit, is by putting runs into columns.
gene_data %>%
spread(gene, nuc)
## # A tibble: 4 × 3
## run a b
## <dbl> <dbl> <dbl>
## 1 1 20 NA
## 2 2 22 NA
## 3 3 24 42
## 4 4 25 67
If we want to remove the missing values, we can use spread and gather, and na.rm = TRUE
gene_data %>%
spread(gene, nuc) %>%
gather(gene, nuc, 'a':'b', na.rm = TRUE)
## # A tibble: 6 × 3
## run gene nuc
## <dbl> <chr> <dbl>
## 1 1 a 20
## 2 2 a 22
## 3 3 a 24
## 4 4 a 25
## 5 3 b 42
## 6 4 b 67
This is a way to deal with NAs.
Another way that we can make implicit values explicit, is with complete() from the TidyR package.
gene_data %>%
complete(gene, run)
## # A tibble: 8 × 3
## gene run nuc
## <chr> <dbl> <dbl>
## 1 a 1 20
## 2 a 2 22
## 3 a 3 24
## 4 a 4 25
## 5 b 1 NA
## 6 b 2 NA
## 7 b 3 42
## 8 b 4 67
Sometimes an NA is present to represent a value being carried forward:
treatment <- tribble(
~ person, ~ treatment, ~ response,
##############################################
"Isaac", 1, 7,
NA, 2, 10,
NA, 3, 9,
"VDB", 1, 8,
NA, 2, 11,
NA, 3, 10,
)
treatment
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 VDB 1 8
## 5 <NA> 2 11
## 6 <NA> 3 10
What we can do here is use the fill() option.
treatment %>%
fill(person)
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 Isaac 2 10
## 3 Isaac 3 9
## 4 VDB 1 8
## 5 VDB 2 11
## 6 VDB 3 10
It is rare that you will be working with a single data table. The dplyr package allows you to join two data tables based on common values.
Mutate joins - add new variables to one data frame from the matching
observations in another.
Filtering joins - filters observation from
one data frame based on whether or not they are present in another.
Set operations - treats observations as they are set elements.
Let’s load the required packages.
library(tidyverse)
library(nycflights13)
Let’s pull full carrier names based on letter codes.
airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
Let’s get info about airports.
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
Let’s get some info on the weather at the airports.
weather
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
Let’s get info on singular flights
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Let’s look at how these tables connect:
Flights -> planes based on tailnumber.
Flights -> airlines
through carrier.
Flights -> airports origin AND destination.
Flights -> weather via origin, year/month/day/hour.
Keys are unique identifiers per observation.
A primary key
uniquely identifies an observation in its own table.
One way to identify a primary key is as follows:
planes %>%
count(tailnum) %>%
filter(n>1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
This indicates that the tailnumber is unique.
Sometimes there are no unique identifiers.
planes
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## # ℹ 3,312 more rows
planes %>%
count(model) %>%
filter(n>1)
## # A tibble: 79 × 2
## model n
## <chr> <int>
## 1 717-200 88
## 2 737-301 2
## 3 737-3G7 2
## 4 737-3H4 105
## 5 737-3K2 2
## 6 737-3L9 2
## 7 737-3Q8 5
## 8 737-3TO 2
## 9 737-401 4
## 10 737-4B7 18
## # ℹ 69 more rows
We can mutate a data frame and join two data frames together.
In
addition, we can remove columns with the select() function:
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 × 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ℹ 336,766 more rows
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = 'carrier')
## # A tibble: 336,776 × 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ℹ 336,766 more rows
We have now added the airline name to our data frame from the airline data frame.
Other types of joins:
Inner joins (inner_join) matches a pair of observations when their
key is equal.
Outer joins (outer_join) keeps observations that
appear in at least one table.
Let’s load the required packages.
library(tidyverse)
library(stringr)
You can create strings using single or double quotes.
string1 <- "this is a string"
string2 <- 'to put a "quote" in your string, use the opposite'
string1
## [1] "this is a string"
string2
## [1] "to put a \"quote\" in your string, use the opposite"
If you forget to close your string, you’ll get this:
string3 <- “where is this string going?
Error:
1: string3
<- “where is this string going?
^ Just hit escape to stop the
console in R from running and try again.
Multiple strings can be stored in character vectors.
string4 <- c("one", "two", "three")
string4
## [1] "one" "two" "three"
Measuring string length can be done for a single string, as well as for vectors with multiple strings:
str_length(string2)
## [1] 49
str_length(string4)
## [1] 3 3 5
Let’s combine two strings.
str_c("X", "Y")
## [1] "XY"
str_c(string1, string2)
## [1] "this is a stringto put a \"quote\" in your string, use the opposite"
You can use sep to control how they are separated.
str_c(string1, string2, sep = " ")
## [1] "this is a string to put a \"quote\" in your string, use the opposite"
str_c("x", "y", "z", sep = "_")
## [1] "x_y_z"
You can subset a string using str_sub():
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4, 6)
## [1] "123" "234" "456"
This just drops the first four letters from the strings.
Or you can use negatives to count back from the end.
str_sub(HSP, -3, -1)
## [1] "123" "234" "456"
You can convert the cases of strings as follows:
HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_lower(HSP)
## [1] "hsp123" "hsp234" "hsp456"
str_to_upper() does the opposite.
Let’s install a package that helps us visualize what we are doing:
install.packages("htmlwidgets")
x <- c('ATTAGA', 'CGCCCCGGAT', 'TATTA')
str_view(x, 'G')
## [1] │ ATTA<G>A
## [2] │ C<G>CCCC<G><G>AT
str_view(x, 'TA')
## [1] │ AT<TA>GA
## [3] │ <TA>T<TA>
The next step is, “.” where the “.” matches an entry.
str_view(x, ".G.")
## [1] │ ATT<AGA>
## [2] │ <CGC>CC<CGG>AT
This is great for TF binding positions, which can be quite flexible.
Anchors allow you to match entries at the start or the ending.
str_view(x, "^TA")
## [3] │ <TA>TTA
str_view(x, "TA$")
## [3] │ TAT<TA>
\d matches any digit.
\s matches any space.
[abc] matches
a, b, or c.
str_view(x, "TA[GT]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA
[^anc] matches anything BUT a, b , or c
str_view(x, "TA[^T]")
## [1] │ AT<TAG>A
You can also use | to pick between two alternatives
str_view(x, "TA[G|T]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA
str_detect() returns a logical vector the same length of input.
y <- c("apple,", "banana", "pear")
y
## [1] "apple," "banana" "pear"
str_detect(y, "e")
## [1] TRUE FALSE TRUE
How many common words contain the letter “e”.
words
head(words)
## [1] "a" "able" "about" "absolute" "accept" "account"
sum(str_detect(words, "e"))
## [1] 536
Let’s get more complex, how many words start with a vowel, and how many end in a vowel?
sum(str_detect(words, "[aeiou]$"))
## [1] 271
sum(str_detect(words, "^[aeiou]"))
## [1] 175
Let’s find all the words that don’t contain “o” or “u”.
no_o <- !str_detect(words, "[ou]")
no_o
head(no_o)
## [1] TRUE TRUE FALSE FALSE TRUE FALSE
Now let’s extract.
words[!str_detect(words, "[ou]")]
## [1] "a" "able" "accept" "achieve" "act"
## [6] "active" "add" "address" "admit" "advertise"
## [11] "affect" "after" "again" "against" "age"
## [16] "agent" "agree" "air" "all" "already"
## [21] "alright" "always" "america" "and" "answer"
## [26] "any" "apart" "apparent" "appear" "apply"
## [31] "area" "arm" "arrange" "art" "as"
## [36] "ask" "at" "attend" "available" "aware"
## [41] "away" "baby" "back" "bad" "bag"
## [46] "balance" "ball" "bank" "bar" "base"
## [51] "basis" "be" "bear" "beat" "bed"
## [56] "begin" "behind" "believe" "benefit" "best"
## [61] "bet" "between" "big" "bill" "birth"
## [66] "bit" "black" "break" "brief" "brilliant"
## [71] "bring" "britain" "by" "cake" "call"
## [76] "can" "car" "card" "care" "carry"
## [81] "case" "cat" "catch" "cent" "centre"
## [86] "certain" "chair" "chairman" "chance" "change"
## [91] "chap" "character" "charge" "cheap" "check"
## [96] "child" "Christ" "Christmas" "city" "claim"
## [101] "class" "clean" "clear" "client" "create"
## [106] "dad" "danger" "date" "day" "dead"
## [111] "deal" "dear" "debate" "decide" "deep"
## [116] "definite" "degree" "department" "depend" "describe"
## [121] "design" "detail" "die" "difference" "dinner"
## [126] "direct" "district" "divide" "draw" "dress"
## [131] "drink" "drive" "dry" "each" "early"
## [136] "east" "easy" "eat" "effect" "egg"
## [141] "eight" "either" "elect" "electric" "eleven"
## [146] "else" "end" "engine" "english" "enter"
## [151] "especial" "even" "evening" "ever" "every"
## [156] "evidence" "exact" "example" "except" "exercise"
## [161] "exist" "expect" "expense" "experience" "explain"
## [166] "express" "extra" "eye" "face" "fact"
## [171] "fair" "fall" "family" "far" "farm"
## [176] "fast" "father" "feed" "feel" "few"
## [181] "field" "fight" "file" "fill" "film"
## [186] "final" "finance" "find" "fine" "finish"
## [191] "fire" "first" "fish" "fit" "five"
## [196] "flat" "fly" "france" "free" "friday"
## [201] "friend" "game" "garden" "gas" "general"
## [206] "germany" "get" "girl" "give" "glass"
## [211] "grand" "grant" "great" "green" "hair"
## [216] "half" "hall" "hand" "hang" "happen"
## [221] "happy" "hard" "hate" "have" "he"
## [226] "head" "health" "hear" "heart" "heat"
## [231] "heavy" "hell" "help" "here" "high"
## [236] "hit" "idea" "identify" "if" "imagine"
## [241] "in" "increase" "indeed" "inside" "instead"
## [246] "interest" "invest" "it" "item" "keep"
## [251] "key" "kid" "kill" "kind" "king"
## [256] "kitchen" "lad" "lady" "land" "large"
## [261] "last" "late" "law" "lay" "lead"
## [266] "learn" "leave" "left" "leg" "less"
## [271] "let" "letter" "level" "lie" "life"
## [276] "light" "like" "likely" "limit" "line"
## [281] "link" "list" "listen" "little" "live"
## [286] "machine" "main" "make" "man" "manage"
## [291] "many" "mark" "market" "marry" "match"
## [296] "matter" "may" "maybe" "mean" "meaning"
## [301] "meet" "member" "middle" "might" "mile"
## [306] "milk" "mind" "minister" "miss" "mister"
## [311] "mrs" "name" "near" "necessary" "need"
## [316] "never" "new" "news" "next" "nice"
## [321] "night" "nine" "pack" "page" "paint"
## [326] "pair" "paper" "paragraph" "parent" "park"
## [331] "part" "party" "pass" "past" "pay"
## [336] "pence" "per" "percent" "perfect" "perhaps"
## [341] "pick" "piece" "place" "plan" "play"
## [346] "please" "practise" "prepare" "present" "press"
## [351] "pretty" "price" "print" "private" "rail"
## [356] "raise" "range" "rate" "rather" "read"
## [361] "ready" "real" "realise" "really" "receive"
## [366] "recent" "red" "refer" "regard" "remember"
## [371] "represent" "research" "respect" "rest" "rid"
## [376] "right" "ring" "rise" "safe" "sale"
## [381] "same" "save" "say" "scheme" "science"
## [386] "seat" "secretary" "see" "seem" "self"
## [391] "sell" "send" "sense" "separate" "serve"
## [396] "service" "set" "settle" "seven" "sex"
## [401] "shall" "share" "she" "sheet" "sick"
## [406] "side" "sign" "similar" "simple" "since"
## [411] "sing" "single" "sir" "sister" "sit"
## [416] "site" "six" "size" "sleep" "slight"
## [421] "small" "space" "speak" "special" "specific"
## [426] "speed" "spell" "spend" "staff" "stage"
## [431] "stairs" "stand" "standard" "start" "state"
## [436] "stay" "step" "stick" "still" "straight"
## [441] "strategy" "street" "strike" "switch" "system"
## [446] "table" "take" "talk" "tape" "tax"
## [451] "tea" "teach" "team" "tell" "ten"
## [456] "tend" "term" "terrible" "test" "than"
## [461] "thank" "the" "then" "there" "they"
## [466] "thing" "think" "thirteen" "thirty" "this"
## [471] "three" "tie" "time" "trade" "traffic"
## [476] "train" "travel" "treat" "tree" "try"
## [481] "twelve" "twenty" "type" "very" "view"
## [486] "village" "visit" "wage" "wait" "walk"
## [491] "wall" "want" "war" "warm" "wash"
## [496] "waste" "watch" "water" "way" "we"
## [501] "wear" "wednesday" "wee" "week" "weigh"
## [506] "well" "west" "what" "when" "where"
## [511] "whether" "which" "while" "white" "why"
## [516] "wide" "wife" "will" "win" "wind"
## [521] "wish" "with" "within" "write" "year"
## [526] "yes" "yesterday" "yet"
This is useful for getting rid of all string for HSP that do not have TF binding site etc.
You can also use str_count() to say how many matches there are in string.
x
## [1] "ATTAGA" "CGCCCCGGAT" "TATTA"
str_count(x, "[GC]")
## [1] 1 8 0
Let’s couple this with mutate:
df <- tibble(
word = words,
i = seq_along(words)
)
df
## # A tibble: 980 × 2
## word i
## <chr> <int>
## 1 a 1
## 2 able 2
## 3 about 3
## 4 absolute 4
## 5 accept 5
## 6 account 6
## 7 achieve 7
## 8 across 8
## 9 act 9
## 10 active 10
## # ℹ 970 more rows
df %>%
mutate(
vowels = str_count(words, "[aeiou]"),
consonants = str_count(words, "[^aeiou]")
)
## # A tibble: 980 × 4
## word i vowels consonants
## <chr> <int> <int> <int>
## 1 a 1 1 0
## 2 able 2 2 2
## 3 about 3 3 2
## 4 absolute 4 4 4
## 5 accept 5 2 4
## 6 account 6 3 4
## 7 achieve 7 4 3
## 8 across 8 2 4
## 9 act 9 1 2
## 10 active 10 3 3
## # ℹ 970 more rows
Let’s work with microarrays!
First, let’s set the working directory:
# getwd("Home/Desktop/classroom/myfiles/Microarrays")
Not necessary on the virtual machine.
This code was updated to produce the KEGG pathview files. The old code was no longer able to produce the files. The old code has been hashed out to show as an example.
Let’s install additional packages:
BiocManager::install("KEGGREST", update = FALSE, force = TRUE)
BiocManager::install("affy")
Now let’s load the required libraries.
library(ath1121501.db)
library(ath1121501cdf)
library(annotate)
library(affy)
library(limma)
library(oligo)
library(dplyr)
library(stats)
library(reshape)
# Load the extra library:
library(KEGGREST)
Now read the cel files into the directory:
targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")
ab <- ReadAffy()
ReadAffy() reads the cel files from a folder and calculates the expression.
Let’s take a quick look at the data:
hist(ab)
eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
## sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL 1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL 2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL 3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL 4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL 5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL 6
Let’s visualize the normalized data.
hist(eset)
Let’s characterize the data a bit.
ID <- featureNames(eset)
Here, we take the following symbols for the IDs from the Arabidopsis database.
Symbol <- getSYMBOL(ID, "ath1121501.db")
affydata <- read.delim("AFFY_ATH1_array_elements.txt")
Flight vs Ground
condition <- targets$gravity
design <- model.matrix(~factor(condition))
colnames(design) <- c("Flight", "Ground")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef = 2, n = Inf, adjust = "fdr")
head(top)
## logFC AveExpr t P.Value adj.P.Val B
## 264005_at -3.5 4.3 -20 1.4e-07 0.0022 7.6
## 266790_at 3.5 3.9 19 2.1e-07 0.0022 7.4
## 266865_at 3.3 5.1 17 4.6e-07 0.0022 6.8
## 248337_at -7.0 6.1 -16 7.7e-07 0.0022 6.4
## 252543_at 2.1 4.2 16 8.3e-07 0.0022 6.3
## 248179_at 2.1 4.5 15 1.1e-06 0.0022 6.1
Let’s combine annotations.
Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
GO = sapply(contents(ath1121501GO), paste, collapse = ", "),
SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "),
ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))
Let’s merge all the data into one dataframe.
all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)
all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")
Let’s turn the ACCNUM into a character value.
all2$ACCNUM <- as.character(all2$ACCNUM)
Now, let’s write it into a csv file.
write.table(all2, file = "BRIC16_Final.csv", row.names = TRUE, col.names = TRUE, sep = "\t")
Let’s look at the columns in the org.tair database.
columns(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
Let’s add a column to the all2 data frame from the org.tair database.
all2$entrez <- mapIds(org.At.tair.db,
keys = all2$ACCNUM,
column = "ENTREZID",
keytype = "TAIR",
multiVals = "first")
Let’s look at the top of the all2 data frame.
head(all2, 10)
## Row.names GENE KEGG
## 1 244903_at hypothetical protein NA
## 2 244904_at hypothetical protein NA
## 3 244906_at hypothetical protein NA
## 4 244907_at hypothetical protein NA
## 5 244908_at hypothetical protein NA
## 6 244911_at hypothetical protein NA
## 7 244913_at hypothetical protein NA
## 8 244914_at hypothetical protein NA
## 9 244916_at hypothetical protein NA
## 10 244917_at hypothetical protein NA
## GO
## 1 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 2 list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 3 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005575", Evidence = "ND", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 7 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 8 list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005634", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## SYMBOL ACCNUM logFC AveExpr t P.Value adj.P.Val B
## 1 ORF149 ATMG00660 -1.214 7.8 -6.20 0.0004 0.018 0.5
## 2 ORF275 ATMG00670 -0.568 3.2 -2.07 0.0758 0.300 -4.9
## 3 ORF240A ATMG00690 0.055 9.1 0.18 0.8586 0.947 -6.8
## 4 ORF120 ATMG00710 -0.883 4.3 -2.91 0.0220 0.154 -3.7
## 5 ORF107D ATMG00720 -0.463 2.2 -1.73 0.1262 0.396 -5.4
## 6 ORF170 ATMG00820 -0.193 2.7 -0.72 0.4948 0.749 -6.5
## 7 ORF121B ATMG00840 -0.339 1.8 -1.18 0.2764 0.574 -6.1
## 8 ORF107E ATMG00850 -0.305 5.5 -1.30 0.2338 0.534 -6.0
## 9 ORF187 ATMG00880 -0.888 2.8 -2.32 0.0527 0.245 -4.6
## 10 ORF184 ATMG00870 -0.379 3.6 -1.31 0.2307 0.531 -5.9
## array_element_type organism is_control locus
## 1 oligonucleotide Arabidopsis thaliana no ATMG00660
## 2 oligonucleotide Arabidopsis thaliana no ATMG00670
## 3 oligonucleotide Arabidopsis thaliana no ATMG00690
## 4 oligonucleotide Arabidopsis thaliana no ATMG00710
## 5 oligonucleotide Arabidopsis thaliana no ATMG00720
## 6 oligonucleotide Arabidopsis thaliana no ATMG00820
## 7 oligonucleotide Arabidopsis thaliana no AT2G07626
## 8 oligonucleotide Arabidopsis thaliana no ATMG00850
## 9 oligonucleotide Arabidopsis thaliana no ATMG00880
## 10 oligonucleotide Arabidopsis thaliana no ATMG00870
## description
## 1 hypothetical protein;(source:Araport11)
## 2 transmembrane protein;(source:Araport11)
## 3 FO-ATPase subunit;(source:Araport11)
## 4 Polynucleotidyl transferase, ribonuclease H-like superfamily protein;(source:Araport11)
## 5 hypothetical protein;(source:Araport11)
## 6 Reverse transcriptase (RNA-dependent DNA polymerase);(source:Araport11)
## 7 hypothetical protein;(source:Araport11)
## 8 DNA/RNA polymerases superfamily protein;(source:Araport11)
## 9 hypothetical protein;(source:Araport11)
## 10 hypothetical protein;(source:Araport11)
## chromosome entrez
## 1 M <NA>
## 2 M <NA>
## 3 M <NA>
## 4 M <NA>
## 5 M <NA>
## 6 M <NA>
## 7 2 <NA>
## 8 M <NA>
## 9 M <NA>
## 10 M <NA>
As we can see, some of the entries are hypothetical proteins.
Let’s load the required packages.
library(pathview)
library(gage)
library(gageData)
Let’s call upon kegg data from the gageData library.
data(kegg.sets.hs)
Let’s create a data frame with fold changes.
foldchanges = all2$logFC
names(foldchanges) = all2$entrez
head(foldchanges)
## <NA> <NA> <NA> <NA> <NA> <NA>
## -1.214 -0.568 0.055 -0.883 -0.463 -0.193
Now, let’s load the pathways that our data will be mapped onto.
kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
Let’s get the results.
keggres = gage(foldchanges, gsets = kegg.ath.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val
## ath03010 Ribosome 1.5e-14 8.2 1.5e-14
## ath01230 Biosynthesis of amino acids 3.1e-04 3.5 3.1e-04
## ath00040 Pentose and glucuronate interconversions 2.0e-03 3.0 2.0e-03
## ath00195 Photosynthesis 2.6e-03 3.0 2.6e-03
## ath00966 Glucosinolate biosynthesis 7.0e-03 2.7 7.0e-03
## ath00541 O-Antigen nucleotide sugar biosynthesis 7.3e-03 2.6 7.3e-03
## q.val set.size exp1
## ath03010 Ribosome 1.6e-12 129 1.5e-14
## ath01230 Biosynthesis of amino acids 1.7e-02 85 3.1e-04
## ath00040 Pentose and glucuronate interconversions 6.9e-02 49 2.0e-03
## ath00195 Photosynthesis 6.9e-02 19 2.6e-03
## ath00966 Glucosinolate biosynthesis 1.3e-01 12 7.0e-03
## ath00541 O-Antigen nucleotide sugar biosynthesis 1.3e-01 13 7.3e-03
##
## $less
## p.geomean stat.mean p.val q.val
## ath04120 Ubiquitin mediated proteolysis 0.043 -1.7 0.043 1
## ath04016 MAPK signaling pathway - plant 0.043 -1.7 0.043 1
## ath00592 alpha-Linolenic acid metabolism 0.045 -1.7 0.045 1
## ath03040 Spliceosome 0.075 -1.4 0.075 1
## ath00350 Tyrosine metabolism 0.093 -1.4 0.093 1
## ath04146 Peroxisome 0.093 -1.3 0.093 1
## set.size exp1
## ath04120 Ubiquitin mediated proteolysis 63 0.043
## ath04016 MAPK signaling pathway - plant 74 0.043
## ath00592 alpha-Linolenic acid metabolism 19 0.045
## ath03040 Spliceosome 77 0.075
## ath00350 Tyrosine metabolism 19 0.093
## ath04146 Peroxisome 40 0.093
##
## $stats
## stat.mean exp1
## ath03010 Ribosome 8.2 8.2
## ath01230 Biosynthesis of amino acids 3.5 3.5
## ath00040 Pentose and glucuronate interconversions 3.0 3.0
## ath00195 Photosynthesis 3.0 3.0
## ath00966 Glucosinolate biosynthesis 2.7 2.7
## ath00541 O-Antigen nucleotide sugar biosynthesis 2.6 2.6
Let’s pull out the upregulated and downregulated genes separately. Then we will write them into their own files.
greater <- keggres$greater
lessers <- keggres$less
write.csv(greater, file = "BRIC16_pathview_Greater_Pathways.csv")
write.csv(lessers, file = "BRIC16_pathview_Lesser_Pathways.csv")
Let’s map to pathways greater.
keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <= 5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath03010 Ribosome"
## [2] "ath01230 Biosynthesis of amino acids"
## [3] "ath00040 Pentose and glucuronate interconversions"
## [4] "ath00195 Photosynthesis"
## [5] "ath00966 Glucosinolate biosynthesis"
keggresids_greater = substr(keggrespathways, start = 1, stop = 8)
keggresids_greater
## [1] "ath03010" "ath01230" "ath00040" "ath00195" "ath00966"
genedata <- as.character(all2$logFC)
Next, we will define a plotting function to apply.
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)
Now we plot multiple pathways (plots are saved to disk and returns a throwaway object list).
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath"))
Since the KEGG data and pathways are updated from time to time, the
above code may not necessarily work. As we here experience trouble with
mapping all the pathways, some variables in the code can be adjusted
such as:
same.layer = T/F
kegg.native = T/F
map.symbol = T/F
This new code should work:
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", same.layer = FALSE, map.symbol = TRUE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath03010.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath01230.pathview.png
## Info: some node width is different from others, and hence adjusted!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00040.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00195.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00966.pathview.png
## Info: some node width is different from others, and hence adjusted!
Here, we successfully mapped all of the greaters.
Depending on
personal preferences, the previously mentioned variables can be played
around with and adjusted.
Let’s map to pathways lessers.
keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <= 5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "ath04120 Ubiquitin mediated proteolysis"
## [2] "ath04016 MAPK signaling pathway - plant"
## [3] "ath00592 alpha-Linolenic acid metabolism"
## [4] "ath03040 Spliceosome"
## [5] "ath00350 Tyrosine metabolism"
keggresids_less = substr(keggrespathways, start = 1, stop = 8)
keggresids_less
## [1] "ath04120" "ath04016" "ath00592" "ath03040" "ath00350"
Now we define a plotting function to apply as before.
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE)
Now we map the top downregulated genes:
tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", same.layer = FALSE, map.symbol = TRUE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04120.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04016.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00592.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath03040.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00350.pathview.png
Here, we will work with different approaches to analyzing results
from RNA sequencing in R.
First, let’s load the required packages.
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
Let’s load the raw data and group it based on flight and ground (here denoted as 1 and 2 respectively).
rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "X")
group <- factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))
Now let’s do a differential gene expression.
dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
## ..@ .Data:List of 2
## .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
## .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
## .. ..$ :'data.frame': 12 obs. of 3 variables:
## .. .. ..$ group : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
## .. .. ..$ lib.size : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
## .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ names: chr [1:2] "counts" "samples"
Let’s sort the data based on cpm, and take a look at the data afterwards.
keep <- rowSums(cpm(dgeGlm) > 2) >= 4
dgeGlm <- dgeGlm[keep,]
names(dgeGlm)
## [1] "counts" "samples"
head(dgeGlm)
## An object of class "DGEList"
## $counts
## Mmus_C57.6J_KDN_FLT_Rep1_M23 Mmus_C57.6J_KDN_FLT_Rep2_M24
## ENSMUSG00000000001 2977 3058
## ENSMUSG00000000028 60 56
## ENSMUSG00000000056 1254 1282
## ENSMUSG00000000058 1228 1665
## ENSMUSG00000000078 3223 2477
## ENSMUSG00000000085 884 841
## Mmus_C57.6J_KDN_FLT_Rep3_M25 Mmus_C57.6J_KDN_FLT_Rep4_M26
## ENSMUSG00000000001 2948 3209
## ENSMUSG00000000028 79 77
## ENSMUSG00000000056 1661 1203
## ENSMUSG00000000058 1213 1399
## ENSMUSG00000000078 2084 2044
## ENSMUSG00000000085 875 724
## Mmus_C57.6J_KDN_FLT_Rep5_M27 Mmus_C57.6J_KDN_FLT_Rep6_M28
## ENSMUSG00000000001 3256 3102
## ENSMUSG00000000028 76 93
## ENSMUSG00000000056 1454 1483
## ENSMUSG00000000058 1344 1498
## ENSMUSG00000000078 2030 2302
## ENSMUSG00000000085 831 815
## Mmus_C57.6J_KDN_GC_Rep1_M33 Mmus_C57.6J_KDN_GC_Rep2_M34
## ENSMUSG00000000001 3266 3184
## ENSMUSG00000000028 56 75
## ENSMUSG00000000056 1574 1370
## ENSMUSG00000000058 1527 1546
## ENSMUSG00000000078 2032 2425
## ENSMUSG00000000085 934 723
## Mmus_C57.6J_KDN_GC_Rep3_M35 Mmus_C57.6J_KDN_GC_Rep4_M36
## ENSMUSG00000000001 3370 3078
## ENSMUSG00000000028 75 66
## ENSMUSG00000000056 1497 1510
## ENSMUSG00000000058 1566 1501
## ENSMUSG00000000078 1834 2102
## ENSMUSG00000000085 817 919
## Mmus_C57.6J_KDN_GC_Rep5_M37 Mmus_C57.6J_KDN_GC_Rep6_M38
## ENSMUSG00000000001 3306 2979
## ENSMUSG00000000028 49 68
## ENSMUSG00000000056 1490 1388
## ENSMUSG00000000058 1248 1390
## ENSMUSG00000000078 2314 2434
## ENSMUSG00000000085 804 760
##
## $samples
## group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23 1 4.0e+07 1
## Mmus_C57.6J_KDN_FLT_Rep2_M24 1 4.1e+07 1
## Mmus_C57.6J_KDN_FLT_Rep3_M25 1 3.8e+07 1
## Mmus_C57.6J_KDN_FLT_Rep4_M26 1 3.9e+07 1
## Mmus_C57.6J_KDN_FLT_Rep5_M27 1 3.7e+07 1
## 7 more rows ...
Now let’s make a design that is readable for EdgeR:
design <- model.matrix(~group)
design
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
The keep function is a large logical operator, which sorts the data
depending on TRUE or FALSE statements.
Let’s look at the dispersion of the average log(cpm) by running it through a linear model function and plot it:
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)
Now, we will do a fit of our linear model. Then we save the top tags.
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef = 2)
ttGlm <- topTags(lrt, n = Inf)
class(ttGlm)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
We will now use a decide test to find the values are statistically
significant.
decideTests will compare the two groups and determine
which are upregulated and which are downregulated.
summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))
## group2
## Down 64
## NotSig 13390
## Up 159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
You want to see a normal distribution of the differentially expressed
genes (red dots) across the avg logCPM.
Let’s write the results to a file.
hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")
Let’s take a look at the columns in the org.Mm (Mouse) database:
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
Now let’s make a new ttGlm dataframe and add columns from the org.Mm database.
ttGlm2 <- as.data.frame(ttGlm$table)
ttGlm2$symbol = mapIds(org.Mm.eg.db,
keys = row.names(ttGlm2),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
ttGlm2$entrez = mapIds(org.Mm.eg.db,
keys = row.names(ttGlm2),
column = "ENTREZID",
keytype = "ENSEMBL",
multiVals = "first")
ttGlm2$name = mapIds(org.Mm.eg.db,
keys = row.names(ttGlm2),
column = "GENENAME",
keytype = "ENSEMBL",
multiVals = "first")
head(ttGlm2)
## logFC logCPM LR PValue FDR symbol entrez
## ENSMUSG00000026077 -1.36 3.6 80 4.3e-19 5.9e-15 Npas2 18143
## ENSMUSG00000007872 0.89 5.5 77 1.9e-18 1.3e-14 Id3 15903
## ENSMUSG00000021775 0.95 6.2 63 2.0e-15 9.1e-12 Nr1d2 353187
## ENSMUSG00000002250 -0.83 5.3 62 2.7e-15 9.2e-12 Ppard 19015
## ENSMUSG00000059824 2.26 4.6 58 2.6e-14 7.2e-11 Dbp 13170
## ENSMUSG00000074715 -1.99 3.8 47 7.0e-12 1.6e-08 Ccl28 56838
## name
## ENSMUSG00000026077 neuronal PAS domain protein 2
## ENSMUSG00000007872 inhibitor of DNA binding 3
## ENSMUSG00000021775 nuclear receptor subfamily 1, group D, member 2
## ENSMUSG00000002250 peroxisome proliferator activator receptor delta
## ENSMUSG00000059824 D site albumin promoter binding protein
## ENSMUSG00000074715 C-C motif chemokine ligand 28
We now have our results saved into a dataframe with extra columns
giving context.
First, let’s load the required packages.
library(pathview)
library(gage)
library(gageData)
Let’s now load the kegg- and go data.
data(kegg.sets.mm)
data(go.subs.mm)
Let’s take the differentially expressed genes and save in a value for themselves.
foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
Now let’s make tables from the kegg data.
kegg.mm = kegg.gsets("Mus musculus", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]
Usually, calling “mouse” as the species in the kegg gsets is enough. However, the kegg site or the pathing to certain names may “break”. Therefore, “Mus musculus” was called as species, as this seemed to fix the issue.
Let’s get the results from the kegg data based on the foldchanges.
keggres = gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway 6.1e-03
## mmu04350 TGF-beta signaling pathway 1.3e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## p.val q.val
## mmu03010 Ribosome 9.5e-05 0.023
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.236
## mmu04330 Notch signaling pathway 6.1e-03 0.490
## mmu04350 TGF-beta signaling pathway 1.3e-02 0.786
## mmu04390 Hippo signaling pathway 2.0e-02 0.830
## mmu00830 Retinol metabolism 2.1e-02 0.830
## set.size
## mmu03010 Ribosome 127
## mmu04550 Signaling pathways regulating pluripotency of stem cells 100
## mmu04330 Notch signaling pathway 54
## mmu04350 TGF-beta signaling pathway 84
## mmu04390 Hippo signaling pathway 125
## mmu00830 Retinol metabolism 37
## exp1
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway 6.1e-03
## mmu04350 TGF-beta signaling pathway 1.3e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
##
## $less
## p.geomean stat.mean p.val
## mmu04613 Neutrophil extracellular trap formation 0.00012 -3.7 0.00012
## mmu04145 Phagosome 0.00192 -2.9 0.00192
## mmu04110 Cell cycle 0.00276 -2.8 0.00276
## mmu04714 Thermogenesis 0.00472 -2.6 0.00472
## mmu04217 Necroptosis 0.00614 -2.5 0.00614
## mmu00910 Nitrogen metabolism 0.00867 -2.6 0.00867
## q.val set.size exp1
## mmu04613 Neutrophil extracellular trap formation 0.029 137 0.00012
## mmu04145 Phagosome 0.222 121 0.00192
## mmu04110 Cell cycle 0.222 134 0.00276
## mmu04714 Thermogenesis 0.284 208 0.00472
## mmu04217 Necroptosis 0.296 113 0.00614
## mmu00910 Nitrogen metabolism 0.348 13 0.00867
##
## $stats
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## exp1
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
greaters <- keggres$greater
lessers <- keggres$less
Let’s prepare the upregulated genes for plotting by extracting their IDs.
keggrespathways_greater = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <= 5) %>%
.$id %>%
as.character()
keggrespathways_greater
## [1] "mmu03010 Ribosome"
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04330 Notch signaling pathway"
## [4] "mmu04350 TGF-beta signaling pathway"
## [5] "mmu04390 Hippo signaling pathway"
keggresids_greater = substr(keggrespathways_greater, start = 1, stop = 8)
keggresids_greater
## [1] "mmu03010" "mmu04550" "mmu04330" "mmu04350" "mmu04390"
Now let’s plot the upregulated genes unto their pathways.
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)
# plot multiple pathways
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "edgeR"))
Let’s prepare the downregulated genes for plotting by extracting their IDs.
keggrespathways_less = data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <= 5) %>%
.$id %>%
as.character()
keggrespathways_less
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04145 Phagosome"
## [3] "mmu04110 Cell cycle"
## [4] "mmu04714 Thermogenesis"
## [5] "mmu04217 Necroptosis"
keggresids_less = substr(keggrespathways_less, start = 1, stop = 8)
keggresids_less
## [1] "mmu04613" "mmu04145" "mmu04110" "mmu04714" "mmu04217"
Now let’s plot the downregulated genes unto their pathways.
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)
# plot multiple pathways
tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "edgeR"))
Let’s load the images into the html file:
library(imager)
edgeR_files <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = ".*edgeR.png")
edgeR_images <- lapply(edgeR_files, load.image)
knitr::include_graphics(edgeR_files)
Let’s load the required packages for this analysis.
library("DESeq2")
library("dplyr")
library("apeglm")
Let’s load in our data.
countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("PHENO_DATA_mouse.csv", sep = ",", row.names = 1)
DESeq requires you to give levels to your data/group.
Therefore, we need to add leveling factors to our coldata.
colData$group <- factor(colData$group, levels = c("0", "1"))
Now let’s make sure we have the same number of individuals as well as groups.
all(rownames(colData)) %in% colnames(countData)
## Warning in all(rownames(colData)): coercing argument of type 'character' to
## logical
## [1] FALSE
The FALSE statement given here is due to one character difference between the two dataframes, the characters being “-” and “_“, and the countData has another named column. We will fix this later.
We need to round up the counts, because DESeq2 only allows integers as an input, and not fractional numbers. This depends on the method of alignment that was used upstream.
countData %>%
mutate_if(is.numeric, ceiling)
head(countData)
## gene_ID Mmus_C57_6J_KDN_FLT_Rep1_M23 Mmus_C57_6J_KDN_FLT_Rep2_M24
## 1 ENSMUSG00000000001 2811 2742
## 2 ENSMUSG00000000003 0 0
## 3 ENSMUSG00000000028 59 49
## 4 ENSMUSG00000000031 26 19
## 5 ENSMUSG00000000037 20 26
## 6 ENSMUSG00000000049 0 1
## Mmus_C57_6J_KDN_FLT_Rep3_M25 Mmus_C57_6J_KDN_FLT_Rep4_M26
## 1 3578 3118
## 2 0 0
## 3 102 72
## 4 17 23
## 5 37 30
## 6 5 10
## Mmus_C57_6J_KDN_FLT_Rep5_M27 Mmus_C57_6J_KDN_FLT_Rep6_M28
## 1 3610 3102
## 2 0 0
## 3 80 84
## 4 26 18
## 5 22 20
## 6 0 4
## Mmus_C57_6J_KDN_GC_Rep1_M33 Mmus_C57_6J_KDN_GC_Rep2_M34
## 1 3094 3112
## 2 0 0
## 3 56 75
## 4 15 21
## 5 20 27
## 6 1 4
## Mmus_C57_6J_KDN_GC_Rep3_M35 Mmus_C57_6J_KDN_GC_Rep4_M36
## 1 3195 3158
## 2 0 0
## 3 60 64
## 4 21 25
## 5 8 20
## 6 10 2
## Mmus_C57_6J_KDN_GC_Rep5_M37 Mmus_C57_6J_KDN_GC_Rep6_M38
## 1 3339 2958
## 2 0 0
## 3 56 79
## 4 24 18
## 5 30 36
## 6 1 1
Let’s modify the countData dataframe to not include the first column.
countData[, 2:13] <- sapply(countData[, 2:13], as.integer)
row.names(countData) <- countData[, 1]
countData <- countData[2:13]
Now let’s fix those characters in the colData dataframe:
library(stringr)
rownames(colData) <- str_replace_all(rownames(colData), "-", "_")
all(rownames(colData) %in% colnames(countData))
## [1] TRUE
rownames(colData) == colnames(countData)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now, we will do the differential gene expression.
First, we
convert our dataframes into a matrix, making it readable for DESeq.
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- dds[rowSums(counts(dds) > 2) >= 4]
dds <- DESeq(dds)
res <- results(dds)
Now we will adjust our log foldchange.
We will use lfcShrink for
this, as it is adjusted for DESeq objects.
resLFC <- lfcShrink(dds, coef = 2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
Then we adjust our results by p-value.
(resOrdered <- res[order(res$padj), ])
## log2 fold change (MLE): group 1 vs 0
## Wald test p-value: group 1 vs 0
## DataFrame with 22008 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000002250 1459.223 -0.926292 0.1112628 -8.32526 8.41430e-17
## ENSMUSG00000007872 1719.375 0.818829 0.1122820 7.29261 3.04014e-13
## ENSMUSG00000026077 437.035 -1.191812 0.1655873 -7.19748 6.13338e-13
## ENSMUSG00000040998 14579.593 -0.506307 0.0703771 -7.19421 6.28252e-13
## ENSMUSG00000021775 2804.923 0.842511 0.1233312 6.83129 8.41546e-12
## ... ... ... ... ... ...
## ENSMUSG00000118345 4.22314 -0.12097478 0.599072 -0.20193699 0.839966
## ENSMUSG00000118353 6.60578 0.56456713 0.481195 1.17326031 0.240691
## ENSMUSG00000118358 3.30902 -0.00273584 0.763559 -0.00358301 0.997141
## ENSMUSG00000118369 2.91657 -1.11623145 0.790702 -1.41169702 0.158039
## ENSMUSG00000118384 7.43136 0.23830798 0.489273 0.48706567 0.626212
## padj
## <numeric>
## ENSMUSG00000002250 1.24077e-12
## ENSMUSG00000007872 2.24149e-09
## ENSMUSG00000026077 2.31605e-09
## ENSMUSG00000040998 2.31605e-09
## ENSMUSG00000021775 2.48189e-08
## ... ...
## ENSMUSG00000118345 NA
## ENSMUSG00000118353 NA
## ENSMUSG00000118358 NA
## ENSMUSG00000118369 NA
## ENSMUSG00000118384 NA
summary(res)
##
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 325, 1.5%
## LFC < 0 (down) : 327, 1.5%
## outliers [1] : 15, 0.068%
## low counts [2] : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Now we will pull out some of these data.
FLT_vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_vs_GC)
## res$log2FoldChange
## 1 0.0421
## 2 -0.1334
## 3 -0.0185
## 4 -0.0882
## 5 -0.0079
## 6 0.1136
Now we plot the data and print it to a pdf.
plotMA(resLFC, ylim = c(-2, 2))
pdf(file = "MA_Plot_FLT_vs_GC.pdf")
dev.off()
## png
## 2
First the plot was made, then we write a pdf plot, but we then turn “dev.off” to allow the machine to be able to run the pdf plot function again. Otherwise, the machine will be locked on the one iteration.
Let’s save our differential expression results to file.
write.csv(as.data.frame(resOrdered), file = "Mouse_DESeq.csv")
Let’s perform a custom transformation and plot a principal component analysis (PCA).
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) + 1), colData = colData(dds))
pdf(file = "PCA_PLOT_FLT_vs_GC.pdf")
dev.off()
## png
## 2
plotPCA(DESeqTransform(se), intgroup = "group")
## using ntop=500 top features by variance
Now, let’s prepare the data for pathview and save a variable with the
foldchanges.
Let’s load our required packages
library("AnnotationDbi")
library("org.Mm.eg.db")
Here, we add columns from the mouse database to the DESeq results.
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
head(foldchanges)
## res$log2FoldChange
## ENSMUSG00000000001 0.0421
## ENSMUSG00000000028 -0.1334
## ENSMUSG00000000031 -0.0185
## ENSMUSG00000000037 -0.0882
## ENSMUSG00000000049 -0.0079
## ENSMUSG00000000056 0.1136
res$symbol = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "SYMBOL",
keytype = "ENSEMBL",
mutliVals = "first")
res$entrez = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "ENTREZID",
keytype = "ENSEMBL",
mutliVals = "first")
res$anme = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "GENENAME",
keytype = "ENSEMBL",
mutliVals = "first")
head(res, 10)
## log2 fold change (MLE): group 1 vs 0
## Wald test p-value: group 1 vs 0
## DataFrame with 10 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 3132.35128 0.04214340 0.0436714 0.9650117 0.334539
## ENSMUSG00000000028 68.75801 -0.13342706 0.1565936 -0.8520597 0.394181
## ENSMUSG00000000031 21.05397 -0.01853142 0.2486477 -0.0745288 0.940590
## ENSMUSG00000000037 24.42314 -0.08817270 0.2982220 -0.2956613 0.767489
## ENSMUSG00000000049 3.24919 -0.00790342 0.9613572 -0.0082211 0.993441
## ENSMUSG00000000056 1424.88216 0.11355979 0.0777635 1.4603234 0.144201
## ENSMUSG00000000058 1420.78992 -0.03893850 0.0850602 -0.4577759 0.647113
## ENSMUSG00000000078 2254.53129 -0.07540275 0.0874314 -0.8624214 0.388456
## ENSMUSG00000000085 822.68179 0.04586772 0.0584699 0.7844667 0.432766
## ENSMUSG00000000088 5946.81754 -0.05461549 0.0691178 -0.7901795 0.429423
## padj symbol entrez anme
## <numeric> <character> <character> <character>
## ENSMUSG00000000001 0.739637 Gnai3 14679 G protein subunit al..
## ENSMUSG00000000028 0.777085 Cdc45 12544 cell division cycle 45
## ENSMUSG00000000031 NA H19 14955 H19, imprinted mater..
## ENSMUSG00000000037 NA Scml2 107815 Scm polycomb group p..
## ENSMUSG00000000049 NA Apoh 11818 apolipoprotein H
## ENSMUSG00000000056 0.547527 Narf 67608 nuclear prelamin A r..
## ENSMUSG00000000058 0.901904 Cav2 12390 caveolin 2
## ENSMUSG00000000078 0.774294 Klf6 23849 Kruppel-like transcr..
## ENSMUSG00000000085 0.800018 Scmh1 29871 sex comb on midleg h..
## ENSMUSG00000000088 0.798420 Cox5a 12858 cytochrome c oxidase..
Let’s load our pathview packages.
library(pathview)
library(gage)
library(gageData)
Let’s load our kegg- and go data and set the names for the data in the foldchanges dataframe.
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
## 14679 12544 14955 107815 11818 67608
## 0.0421 -0.1334 -0.0185 -0.0882 -0.0079 0.1136
kegg.mm <- kegg.gsets("Mus musculus", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]
Let’s map the results
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## mmu03010 Ribosome 0.017 2.1 0.017 0.89
## mmu04022 cGMP-PKG signaling pathway 0.030 1.9 0.030 0.89
## mmu04360 Axon guidance 0.038 1.8 0.038 0.89
## mmu04330 Notch signaling pathway 0.042 1.8 0.042 0.89
## mmu04658 Th1 and Th2 cell differentiation 0.054 1.6 0.054 0.89
## mmu02010 ABC transporters 0.075 1.5 0.075 0.89
## set.size exp1
## mmu03010 Ribosome 139 0.017
## mmu04022 cGMP-PKG signaling pathway 152 0.030
## mmu04360 Axon guidance 176 0.038
## mmu04330 Notch signaling pathway 58 0.042
## mmu04658 Th1 and Th2 cell differentiation 78 0.054
## mmu02010 ABC transporters 46 0.075
##
## $less
## p.geomean stat.mean p.val
## mmu04613 Neutrophil extracellular trap formation 0.0043 -2.6 0.0043
## mmu04110 Cell cycle 0.0062 -2.5 0.0062
## mmu04657 IL-17 signaling pathway 0.0113 -2.3 0.0113
## mmu04145 Phagosome 0.0332 -1.8 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.0388 -1.8 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.0441 -1.7 0.0441
## q.val set.size exp1
## mmu04613 Neutrophil extracellular trap formation 0.75 160 0.0043
## mmu04110 Cell cycle 0.75 151 0.0062
## mmu04657 IL-17 signaling pathway 0.90 75 0.0113
## mmu04145 Phagosome 0.91 144 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.91 154 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.91 99 0.0441
##
## $stats
## stat.mean exp1
## mmu03010 Ribosome 2.1 2.1
## mmu04022 cGMP-PKG signaling pathway 1.9 1.9
## mmu04360 Axon guidance 1.8 1.8
## mmu04330 Notch signaling pathway 1.8 1.8
## mmu04658 Th1 and Th2 cell differentiation 1.6 1.6
## mmu02010 ABC transporters 1.5 1.5
Let’s save our greater and less than pathways.
greaters <- keggres$greater
lessers <- keggres$less
Let’s map the upregulated pathways.
keggrespathways_greater <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
keggrespathways_greater
## [1] "mmu03010 Ribosome" "mmu04022 cGMP-PKG signaling pathway"
## [3] "mmu04360 Axon guidance"
keggresids_greater <- substr(keggrespathways_greater, start = 1, stop = 8)
keggresids_greater
## [1] "mmu03010" "mmu04022" "mmu04360"
Time to plot.
plot_pathway = function(pid) pathwiew(gene.data = foldchange, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "DESeq"))
sapply tells to run the function for every item in the list.
Let’s map the downregulated pathways.
keggrespathways_less <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
keggrespathways_less
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04110 Cell cycle"
## [3] "mmu04657 IL-17 signaling pathway"
keggresids_less <- substr(keggrespathways_less, start = 1, stop = 8)
keggresids_less
## [1] "mmu04613" "mmu04110" "mmu04657"
Now let’s plot the lessers.
plot_pathway = function(pid) pathwiew(gene.data = foldchange, pathway.id = pid, species = "Mus musculus", new.signature = FALSE)
tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "Mus musculus", new.signature = FALSE, out.suffix = "DESeq"))
Let’s load the images into our html file.
library(imager)
DESeq_files <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = ".*DESeq.png")
DESeq_images <- lapply(DESeq_files, load.image)
knitr::include_graphics(DESeq_files)
Let’s do a comparison of the two methods of analysis. First, let’s load the required package.
library(tidyverse)
Let’s load our data.
EdgeR <- read.csv("Mouse_EdgeR_Results_Zero_vs_1.csv")
DESeq <- read.csv("Mouse_DESeq.csv")
Let’s adjust the data of both to match for the comparison.
DESeq2 <- DESeq %>%
filter(padj < 0.1)
DESeq2 <- DESeq2[, c(1, 3)]
EdgeR <- EdgeR[, 1:2]
Now, let’s change the names of the columns to match each other in the dataframes.
colnames(DESeq2) <- c("ID", "logFC")
colnames(EdgeR) <- c("ID", "logFC")
Let’s install and load the required package for plotting.
install.packages("GOplot")
library(GOplot)
Now, let’s make the comparison.
comp <- GOVenn(DESeq2, EdgeR, label = c("DESeq2", "EdgeR"), title = "Comparison of DESeq an EdgeR DE Genes", plot = FALSE)
Let’s plot the comparison in a Venn diagram.
comp$plot
Here, let’s look at the shared significant genes listed in a table.
comp$table
## $A_only
## logFC Trend
## ENSMUSG00000022580 -0.340 DOWN
## ENSMUSG00000006517 -0.361 DOWN
## ENSMUSG00000038375 -0.311 DOWN
## ENSMUSG00000032374 -0.331 DOWN
## ENSMUSG00000014158 -0.245 DOWN
## ENSMUSG00000026179 -0.277 DOWN
## ENSMUSG00000037455 -0.319 DOWN
## ENSMUSG00000020232 -0.214 DOWN
## ENSMUSG00000023367 -0.222 DOWN
## ENSMUSG00000079465 -0.348 DOWN
## ENSMUSG00000032911 -0.253 DOWN
## ENSMUSG00000028671 -0.529 DOWN
## ENSMUSG00000022763 -0.463 DOWN
## ENSMUSG00000069565 -0.233 DOWN
## ENSMUSG00000029763 -0.322 DOWN
## ENSMUSG00000040263 -0.267 DOWN
## ENSMUSG00000039047 -0.200 DOWN
## ENSMUSG00000023938 -0.298 DOWN
## ENSMUSG00000048707 -0.256 DOWN
## ENSMUSG00000024958 -0.216 DOWN
## ENSMUSG00000022351 -0.374 DOWN
## ENSMUSG00000092500 -0.678 DOWN
## ENSMUSG00000031594 -0.705 DOWN
## ENSMUSG00000053414 -0.423 DOWN
## ENSMUSG00000035910 -0.465 DOWN
## ENSMUSG00000046079 -0.230 DOWN
## ENSMUSG00000039834 -0.209 DOWN
## ENSMUSG00000024064 -0.322 DOWN
## ENSMUSG00000017428 -0.192 DOWN
## ENSMUSG00000040188 -0.182 DOWN
## ENSMUSG00000026399 -0.463 DOWN
## ENSMUSG00000000594 -0.353 DOWN
## ENSMUSG00000001473 -0.520 DOWN
## ENSMUSG00000029032 -0.173 DOWN
## ENSMUSG00000034457 -0.656 DOWN
## ENSMUSG00000038704 -0.442 DOWN
## ENSMUSG00000004565 -0.185 DOWN
## ENSMUSG00000020381 -0.355 DOWN
## ENSMUSG00000041889 -0.407 DOWN
## ENSMUSG00000023259 -0.549 DOWN
## ENSMUSG00000100131 -0.745 DOWN
## ENSMUSG00000062203 -0.169 DOWN
## ENSMUSG00000025735 -0.547 DOWN
## ENSMUSG00000028545 -0.349 DOWN
## ENSMUSG00000058454 -0.322 DOWN
## ENSMUSG00000066441 -0.206 DOWN
## ENSMUSG00000022540 -0.222 DOWN
## ENSMUSG00000041939 -0.369 DOWN
## ENSMUSG00000035561 -0.629 DOWN
## ENSMUSG00000038023 -0.162 DOWN
## ENSMUSG00000101249 -0.424 DOWN
## ENSMUSG00000032370 -0.277 DOWN
## ENSMUSG00000004266 -0.241 DOWN
## ENSMUSG00000001627 -0.368 DOWN
## ENSMUSG00000029810 -0.212 DOWN
## ENSMUSG00000024579 -0.339 DOWN
## ENSMUSG00000030739 -0.257 DOWN
## ENSMUSG00000011382 -0.289 DOWN
## ENSMUSG00000032306 -0.197 DOWN
## ENSMUSG00000031960 -0.232 DOWN
## ENSMUSG00000022512 -0.313 DOWN
## ENSMUSG00000037243 -0.384 DOWN
## ENSMUSG00000028041 -0.228 DOWN
## ENSMUSG00000026307 -0.180 DOWN
## ENSMUSG00000029580 -0.225 DOWN
## ENSMUSG00000033107 -0.465 DOWN
## ENSMUSG00000112324 -1.178 DOWN
## ENSMUSG00000020744 -0.239 DOWN
## ENSMUSG00000055745 -0.330 DOWN
## ENSMUSG00000041959 -0.302 DOWN
## ENSMUSG00000036957 -0.236 DOWN
## ENSMUSG00000021697 -0.244 DOWN
## ENSMUSG00000037894 -0.149 DOWN
## ENSMUSG00000021792 -0.271 DOWN
## ENSMUSG00000037347 -0.583 DOWN
## ENSMUSG00000020877 -0.305 DOWN
## ENSMUSG00000087574 -0.525 DOWN
## ENSMUSG00000030512 -0.297 DOWN
## ENSMUSG00000006800 -0.384 DOWN
## ENSMUSG00000027412 -0.238 DOWN
## ENSMUSG00000031161 -0.193 DOWN
## ENSMUSG00000027429 -0.197 DOWN
## ENSMUSG00000045594 -0.322 DOWN
## ENSMUSG00000018171 -0.210 DOWN
## ENSMUSG00000032349 -0.205 DOWN
## ENSMUSG00000015094 -0.274 DOWN
## ENSMUSG00000028538 -0.214 DOWN
## ENSMUSG00000029616 -0.170 DOWN
## ENSMUSG00000019797 -0.260 DOWN
## ENSMUSG00000095193 -0.465 DOWN
## ENSMUSG00000068876 -0.369 DOWN
## ENSMUSG00000026784 -0.337 DOWN
## ENSMUSG00000057363 -0.230 DOWN
## ENSMUSG00000064367 -0.397 DOWN
## ENSMUSG00000024319 -0.161 DOWN
## ENSMUSG00000030137 -0.733 DOWN
## ENSMUSG00000030538 -0.210 DOWN
## ENSMUSG00000032978 -0.412 DOWN
## ENSMUSG00000069633 -0.241 DOWN
## ENSMUSG00000067158 -0.167 DOWN
## ENSMUSG00000024292 -0.697 DOWN
## ENSMUSG00000031958 -0.326 DOWN
## ENSMUSG00000064341 -0.427 DOWN
## ENSMUSG00000024993 -0.240 DOWN
## ENSMUSG00000074170 -0.323 DOWN
## ENSMUSG00000030641 -0.708 DOWN
## ENSMUSG00000034858 -0.241 DOWN
## ENSMUSG00000036040 -0.449 DOWN
## ENSMUSG00000063229 -0.274 DOWN
## ENSMUSG00000027490 -0.501 DOWN
## ENSMUSG00000047735 -0.296 DOWN
## ENSMUSG00000011752 -0.216 DOWN
## ENSMUSG00000034714 -0.251 DOWN
## ENSMUSG00000034613 -0.198 DOWN
## ENSMUSG00000032452 -0.366 DOWN
## ENSMUSG00000038775 -0.428 DOWN
## ENSMUSG00000045294 -0.410 DOWN
## ENSMUSG00000084128 -0.205 DOWN
## ENSMUSG00000037686 -0.488 DOWN
## ENSMUSG00000014245 -0.343 DOWN
## ENSMUSG00000029093 -0.702 DOWN
## ENSMUSG00000024736 -0.295 DOWN
## ENSMUSG00000028937 -0.460 DOWN
## ENSMUSG00000096795 -0.429 DOWN
## ENSMUSG00000051518 -0.238 DOWN
## ENSMUSG00000000934 -0.210 DOWN
## ENSMUSG00000030880 -0.374 DOWN
## ENSMUSG00000025726 -0.471 DOWN
## ENSMUSG00000052117 -0.383 DOWN
## ENSMUSG00000006342 -0.322 DOWN
## ENSMUSG00000062825 -0.214 DOWN
## ENSMUSG00000041733 -0.164 DOWN
## ENSMUSG00000028780 -0.390 DOWN
## ENSMUSG00000024665 -0.283 DOWN
## ENSMUSG00000025317 -0.452 DOWN
## ENSMUSG00000020142 -0.513 DOWN
## ENSMUSG00000082016 -0.500 DOWN
## ENSMUSG00000034371 -0.356 DOWN
## ENSMUSG00000032492 -0.239 DOWN
## ENSMUSG00000110755 -0.855 DOWN
## ENSMUSG00000064254 -0.353 DOWN
## ENSMUSG00000035845 -0.230 DOWN
## ENSMUSG00000017210 -0.184 DOWN
## ENSMUSG00000023832 -0.242 DOWN
## ENSMUSG00000037999 -0.136 DOWN
## ENSMUSG00000068220 -0.327 DOWN
## ENSMUSG00000064345 -0.470 DOWN
## ENSMUSG00000109532 -0.468 DOWN
## ENSMUSG00000024503 -0.299 DOWN
## ENSMUSG00000004843 -0.193 DOWN
## ENSMUSG00000030298 -0.187 DOWN
## ENSMUSG00000048578 -0.185 DOWN
## ENSMUSG00000085042 -0.537 DOWN
## ENSMUSG00000016942 -0.573 DOWN
## ENSMUSG00000020116 -0.221 DOWN
## ENSMUSG00000028989 -0.830 DOWN
## ENSMUSG00000045328 -0.675 DOWN
## ENSMUSG00000022012 -0.602 DOWN
## ENSMUSG00000030111 -0.899 DOWN
## ENSMUSG00000070283 -0.218 DOWN
## ENSMUSG00000015357 -0.250 DOWN
## ENSMUSG00000023004 -0.286 DOWN
## ENSMUSG00000026627 -0.314 DOWN
## ENSMUSG00000031604 -0.471 DOWN
## ENSMUSG00000032515 -0.308 DOWN
## ENSMUSG00000067924 -0.224 DOWN
## ENSMUSG00000037263 -0.843 DOWN
## ENSMUSG00000027230 -0.362 DOWN
## ENSMUSG00000110663 -0.548 DOWN
## ENSMUSG00000026042 -0.492 DOWN
## ENSMUSG00000024818 -0.340 DOWN
## ENSMUSG00000028063 -0.176 DOWN
## ENSMUSG00000002393 -0.151 DOWN
## ENSMUSG00000029304 -0.304 DOWN
## ENSMUSG00000063897 -0.232 DOWN
## ENSMUSG00000047281 -0.292 DOWN
## ENSMUSG00000020827 -0.092 DOWN
## ENSMUSG00000057103 -0.272 DOWN
## ENSMUSG00000035863 -0.136 DOWN
## ENSMUSG00000038217 -0.427 DOWN
## ENSMUSG00000021414 -0.467 DOWN
## ENSMUSG00000029394 -0.155 DOWN
## ENSMUSG00000025203 -0.440 DOWN
## ENSMUSG00000022021 -0.747 DOWN
## ENSMUSG00000021666 -0.172 DOWN
## ENSMUSG00000023153 -0.476 DOWN
## ENSMUSG00000032269 -0.487 DOWN
## ENSMUSG00000029701 -0.155 DOWN
## ENSMUSG00000044786 -0.441 DOWN
## ENSMUSG00000037470 -0.128 DOWN
## ENSMUSG00000003813 -0.177 DOWN
## ENSMUSG00000073405 -0.630 DOWN
## ENSMUSG00000022893 -0.270 DOWN
## ENSMUSG00000040694 -0.480 DOWN
## ENSMUSG00000078695 -0.226 DOWN
## ENSMUSG00000079261 -0.517 DOWN
## ENSMUSG00000022033 -0.717 DOWN
## ENSMUSG00000037169 -0.428 DOWN
## ENSMUSG00000031781 -0.220 DOWN
## ENSMUSG00000033538 -0.611 DOWN
## ENSMUSG00000058569 -0.131 DOWN
## ENSMUSG00000024683 -0.174 DOWN
## ENSMUSG00000027327 -0.296 DOWN
## ENSMUSG00000046999 -0.821 DOWN
## ENSMUSG00000036002 -0.179 DOWN
## ENSMUSG00000048277 -0.161 DOWN
## ENSMUSG00000039254 -0.171 DOWN
## ENSMUSG00000074890 -0.265 DOWN
## ENSMUSG00000020898 -0.176 DOWN
## ENSMUSG00000045725 -0.348 DOWN
## ENSMUSG00000069302 -0.826 DOWN
## ENSMUSG00000020917 -0.414 DOWN
## ENSMUSG00000025762 -0.150 DOWN
## ENSMUSG00000051397 -0.602 DOWN
## ENSMUSG00000071076 -0.329 DOWN
## ENSMUSG00000020062 -0.557 DOWN
## ENSMUSG00000027610 -0.208 DOWN
## ENSMUSG00000046603 -0.150 DOWN
## ENSMUSG00000026249 -0.398 DOWN
## ENSMUSG00000062248 -0.403 DOWN
## ENSMUSG00000067144 -0.584 DOWN
## ENSMUSG00000018427 -0.298 DOWN
## ENSMUSG00000029161 -0.316 DOWN
## ENSMUSG00000054951 -0.420 DOWN
## ENSMUSG00000026201 -0.168 DOWN
## ENSMUSG00000028001 -0.429 DOWN
## ENSMUSG00000055725 -0.339 DOWN
## ENSMUSG00000036452 -0.209 DOWN
## ENSMUSG00000038527 -0.278 DOWN
## ENSMUSG00000028042 -0.182 DOWN
## ENSMUSG00000056608 -0.362 DOWN
## ENSMUSG00000005667 -0.619 DOWN
## ENSMUSG00000037386 -0.407 DOWN
## ENSMUSG00000037725 -0.819 DOWN
## ENSMUSG00000047986 -0.327 DOWN
## ENSMUSG00000029771 -0.386 DOWN
## ENSMUSG00000032902 -0.548 DOWN
## ENSMUSG00000030805 -0.154 DOWN
## ENSMUSG00000102070 -0.342 DOWN
## ENSMUSG00000048911 -0.636 DOWN
## ENSMUSG00000020307 -0.208 DOWN
## ENSMUSG00000028327 -0.429 DOWN
## ENSMUSG00000061474 -0.216 DOWN
## ENSMUSG00000031776 -0.133 DOWN
## ENSMUSG00000023048 -0.185 DOWN
## ENSMUSG00000028834 -0.453 DOWN
## ENSMUSG00000032786 -0.342 DOWN
## ENSMUSG00000020571 -0.189 DOWN
## ENSMUSG00000026036 -0.183 DOWN
## ENSMUSG00000037103 -0.234 DOWN
## ENSMUSG00000032806 -0.215 DOWN
## ENSMUSG00000032115 -0.226 DOWN
## ENSMUSG00000026914 -0.188 DOWN
## ENSMUSG00000017221 -0.154 DOWN
## ENSMUSG00000093930 -0.256 DOWN
## ENSMUSG00000020775 -0.193 DOWN
## ENSMUSG00000015085 -0.277 DOWN
## ENSMUSG00000007783 -0.681 DOWN
## ENSMUSG00000040688 -0.163 DOWN
## ENSMUSG00000053291 -0.219 DOWN
## ENSMUSG00000027496 -0.549 DOWN
## ENSMUSG00000033031 -0.563 DOWN
## ENSMUSG00000030122 -0.203 DOWN
## ENSMUSG00000030340 -0.153 DOWN
## ENSMUSG00000045438 -0.171 DOWN
## ENSMUSG00000083287 -0.702 DOWN
## ENSMUSG00000017776 0.143 UP
## ENSMUSG00000050310 0.157 UP
## ENSMUSG00000039068 0.166 UP
## ENSMUSG00000021540 0.170 UP
## ENSMUSG00000031393 0.166 UP
## ENSMUSG00000036550 0.122 UP
## ENSMUSG00000020257 0.165 UP
## ENSMUSG00000022604 0.245 UP
## ENSMUSG00000028053 0.129 UP
## ENSMUSG00000030213 0.228 UP
## ENSMUSG00000060657 0.171 UP
## ENSMUSG00000002266 0.798 UP
## ENSMUSG00000027351 0.201 UP
## ENSMUSG00000041530 0.142 UP
## ENSMUSG00000031216 0.185 UP
## ENSMUSG00000025223 0.199 UP
## ENSMUSG00000032086 0.160 UP
## ENSMUSG00000050812 0.123 UP
## ENSMUSG00000038290 0.181 UP
## ENSMUSG00000038530 0.458 UP
## ENSMUSG00000038766 0.194 UP
## ENSMUSG00000050947 0.195 UP
## ENSMUSG00000045098 0.167 UP
## ENSMUSG00000026918 0.151 UP
## ENSMUSG00000037003 0.254 UP
## ENSMUSG00000074748 0.135 UP
## ENSMUSG00000097412 0.362 UP
## ENSMUSG00000031729 0.136 UP
## ENSMUSG00000060419 0.567 UP
## ENSMUSG00000027519 0.127 UP
## ENSMUSG00000021669 0.173 UP
## ENSMUSG00000051817 0.205 UP
## ENSMUSG00000043090 0.246 UP
## ENSMUSG00000037029 0.161 UP
## ENSMUSG00000027395 0.233 UP
## ENSMUSG00000021488 0.140 UP
## ENSMUSG00000073678 0.238 UP
## ENSMUSG00000041378 0.247 UP
## ENSMUSG00000046947 0.230 UP
## ENSMUSG00000058793 0.158 UP
## ENSMUSG00000037369 0.169 UP
## ENSMUSG00000035247 0.144 UP
## ENSMUSG00000026436 0.151 UP
## ENSMUSG00000046897 0.138 UP
## ENSMUSG00000057133 0.152 UP
## ENSMUSG00000027680 0.155 UP
## ENSMUSG00000043991 0.143 UP
## ENSMUSG00000034189 0.223 UP
## ENSMUSG00000018076 0.140 UP
## ENSMUSG00000052446 0.213 UP
## ENSMUSG00000000901 0.460 UP
## ENSMUSG00000040209 0.149 UP
## ENSMUSG00000021140 0.140 UP
## ENSMUSG00000015942 0.281 UP
## ENSMUSG00000043241 0.154 UP
## ENSMUSG00000017897 0.479 UP
## ENSMUSG00000022353 0.170 UP
## ENSMUSG00000005893 0.127 UP
## ENSMUSG00000044791 0.108 UP
## ENSMUSG00000034156 0.304 UP
## ENSMUSG00000037736 0.156 UP
## ENSMUSG00000059486 0.191 UP
## ENSMUSG00000040865 0.146 UP
## ENSMUSG00000035666 0.183 UP
## ENSMUSG00000029647 0.151 UP
## ENSMUSG00000038486 0.223 UP
## ENSMUSG00000025927 0.260 UP
## ENSMUSG00000022415 0.260 UP
## ENSMUSG00000092558 0.233 UP
## ENSMUSG00000014195 0.132 UP
## ENSMUSG00000019866 0.168 UP
## ENSMUSG00000078202 0.238 UP
## ENSMUSG00000035495 0.162 UP
## ENSMUSG00000044674 0.147 UP
## ENSMUSG00000102869 0.133 UP
## ENSMUSG00000026923 0.238 UP
## ENSMUSG00000020642 0.304 UP
## ENSMUSG00000037503 0.132 UP
## ENSMUSG00000043716 0.155 UP
## ENSMUSG00000037822 0.123 UP
## ENSMUSG00000035413 0.280 UP
## ENSMUSG00000021395 0.145 UP
## ENSMUSG00000036097 0.170 UP
## ENSMUSG00000034297 0.142 UP
## ENSMUSG00000037742 0.183 UP
## ENSMUSG00000031841 0.367 UP
## ENSMUSG00000021661 0.236 UP
## ENSMUSG00000021959 0.201 UP
## ENSMUSG00000020594 0.133 UP
## ENSMUSG00000027079 0.245 UP
## ENSMUSG00000087150 0.369 UP
## ENSMUSG00000091811 0.293 UP
## ENSMUSG00000066415 0.147 UP
## ENSMUSG00000027524 0.334 UP
## ENSMUSG00000008683 0.179 UP
## ENSMUSG00000066235 0.238 UP
## ENSMUSG00000099689 0.326 UP
## ENSMUSG00000074994 0.154 UP
## ENSMUSG00000005698 0.136 UP
## ENSMUSG00000020357 0.220 UP
## ENSMUSG00000051864 0.167 UP
## ENSMUSG00000033857 0.262 UP
## ENSMUSG00000027678 0.141 UP
## ENSMUSG00000028522 0.191 UP
## ENSMUSG00000038677 0.256 UP
## ENSMUSG00000031365 0.243 UP
## ENSMUSG00000079215 0.136 UP
## ENSMUSG00000008435 0.196 UP
## ENSMUSG00000022791 0.185 UP
## ENSMUSG00000035284 0.150 UP
## ENSMUSG00000071660 0.149 UP
## ENSMUSG00000031209 0.180 UP
## ENSMUSG00000043384 0.177 UP
## ENSMUSG00000062519 0.203 UP
## ENSMUSG00000046792 0.154 UP
## ENSMUSG00000060036 0.183 UP
## ENSMUSG00000027799 0.128 UP
## ENSMUSG00000070280 1.104 UP
## ENSMUSG00000039086 0.302 UP
## ENSMUSG00000010721 0.279 UP
## ENSMUSG00000026694 0.224 UP
## ENSMUSG00000090958 0.260 UP
## ENSMUSG00000041329 0.340 UP
## ENSMUSG00000110353 0.559 UP
## ENSMUSG00000024431 0.155 UP
## ENSMUSG00000028970 0.653 UP
## ENSMUSG00000044617 0.169 UP
## ENSMUSG00000053110 0.113 UP
## ENSMUSG00000033721 0.194 UP
## ENSMUSG00000033237 0.125 UP
## ENSMUSG00000026782 0.154 UP
## ENSMUSG00000020716 0.107 UP
## ENSMUSG00000028869 0.157 UP
## ENSMUSG00000040123 0.176 UP
## ENSMUSG00000023279 0.660 UP
## ENSMUSG00000031618 0.182 UP
## ENSMUSG00000046480 0.153 UP
## ENSMUSG00000053604 0.222 UP
## ENSMUSG00000019817 0.216 UP
## ENSMUSG00000063317 0.192 UP
## ENSMUSG00000027782 0.169 UP
## ENSMUSG00000003970 0.175 UP
## ENSMUSG00000020448 0.156 UP
## ENSMUSG00000075592 0.209 UP
## ENSMUSG00000005886 0.114 UP
## ENSMUSG00000097119 0.186 UP
## ENSMUSG00000049739 0.148 UP
## ENSMUSG00000033671 0.132 UP
## ENSMUSG00000036197 0.155 UP
## ENSMUSG00000048379 0.183 UP
## ENSMUSG00000039315 0.549 UP
## ENSMUSG00000033943 0.124 UP
## ENSMUSG00000017418 0.178 UP
## ENSMUSG00000041037 0.153 UP
## ENSMUSG00000048696 0.292 UP
## ENSMUSG00000039801 0.137 UP
## ENSMUSG00000060938 0.154 UP
## ENSMUSG00000049800 0.175 UP
## ENSMUSG00000013921 0.359 UP
## ENSMUSG00000031169 0.321 UP
## ENSMUSG00000106062 0.642 UP
## ENSMUSG00000017405 0.225 UP
## ENSMUSG00000034647 0.187 UP
## ENSMUSG00000095597 0.398 UP
## ENSMUSG00000048170 0.119 UP
## ENSMUSG00000038437 0.144 UP
## ENSMUSG00000043929 0.200 UP
## ENSMUSG00000052751 0.127 UP
## ENSMUSG00000038481 0.163 UP
## ENSMUSG00000090083 0.190 UP
## ENSMUSG00000034893 0.133 UP
## ENSMUSG00000069682 0.295 UP
## ENSMUSG00000025571 0.170 UP
## ENSMUSG00000040481 0.137 UP
## ENSMUSG00000009470 0.137 UP
## ENSMUSG00000038056 0.107 UP
## ENSMUSG00000020491 0.286 UP
## ENSMUSG00000038544 0.226 UP
## ENSMUSG00000015290 0.124 UP
## ENSMUSG00000017667 0.232 UP
## ENSMUSG00000037712 0.152 UP
## ENSMUSG00000031754 0.152 UP
## ENSMUSG00000039377 0.409 UP
## ENSMUSG00000065954 0.130 UP
## ENSMUSG00000021180 0.252 UP
## ENSMUSG00000021144 0.123 UP
## ENSMUSG00000032892 0.266 UP
## ENSMUSG00000047632 0.361 UP
## ENSMUSG00000036990 0.138 UP
## ENSMUSG00000025997 0.165 UP
## ENSMUSG00000025195 0.133 UP
## ENSMUSG00000018736 0.156 UP
## ENSMUSG00000002748 0.160 UP
## ENSMUSG00000062866 0.139 UP
## ENSMUSG00000047888 0.157 UP
## ENSMUSG00000053754 0.086 UP
## ENSMUSG00000105692 0.424 UP
## ENSMUSG00000029154 0.224 UP
## ENSMUSG00000062328 0.135 UP
## ENSMUSG00000020460 0.134 UP
## ENSMUSG00000026455 0.140 UP
## ENSMUSG00000028936 0.188 UP
## ENSMUSG00000048285 0.324 UP
##
## $B_only
## logFC Trend
## ENSMUSG00000095616 -2.18 DOWN
## ENSMUSG00000055254 -1.26 DOWN
## ENSMUSG00000036083 -0.42 DOWN
## ENSMUSG00000062901 0.37 UP
## ENSMUSG00000066113 0.54 UP
## ENSMUSG00000047861 0.48 UP
## ENSMUSG00000068270 0.37 UP
## ENSMUSG00000026315 0.48 UP
## ENSMUSG00000010651 0.37 UP
## ENSMUSG00000057914 0.65 UP
## ENSMUSG00000038567 0.79 UP
## ENSMUSG00000021379 0.43 UP
## ENSMUSG00000038528 0.28 UP
## ENSMUSG00000021876 1.02 UP
## ENSMUSG00000020566 0.38 UP
## ENSMUSG00000002289 0.96 UP
## ENSMUSG00000025880 0.33 UP
## ENSMUSG00000032898 0.30 UP
## ENSMUSG00000051344 0.40 UP
## ENSMUSG00000028234 0.28 UP
## ENSMUSG00000059173 0.31 UP
## ENSMUSG00000061410 0.26 UP
## ENSMUSG00000104445 0.29 UP
## ENSMUSG00000029004 0.26 UP
## ENSMUSG00000006599 0.30 UP
## ENSMUSG00000031530 0.46 UP
## ENSMUSG00000075520 0.32 UP
## ENSMUSG00000032604 0.27 UP
## ENSMUSG00000013089 0.48 UP
## ENSMUSG00000037058 0.25 UP
## ENSMUSG00000090165 0.72 UP
## ENSMUSG00000078429 0.25 UP
## ENSMUSG00000032554 0.52 UP
## ENSMUSG00000028266 0.28 UP
## ENSMUSG00000035504 0.44 UP
## ENSMUSG00000042379 0.66 UP
## ENSMUSG00000063415 0.58 UP
## ENSMUSG00000074179 0.55 UP
## ENSMUSG00000006494 0.29 UP
## ENSMUSG00000024472 0.31 UP
##
## $AB
## logFC_A logFC_B Trend
## ENSMUSG00000000253 -0.36 -0.29 DOWN
## ENSMUSG00000002250 -0.93 -0.83 DOWN
## ENSMUSG00000002797 -0.53 -0.45 DOWN
## ENSMUSG00000010663 -0.38 -0.31 DOWN
## ENSMUSG00000019944 -0.43 -0.50 DOWN
## ENSMUSG00000020326 -0.35 -0.28 DOWN
## ENSMUSG00000020538 -0.34 -0.41 DOWN
## ENSMUSG00000021135 -1.24 -1.17 DOWN
## ENSMUSG00000021185 -0.37 -0.31 DOWN
## ENSMUSG00000021214 -1.29 -1.23 DOWN
## ENSMUSG00000021364 -0.35 -0.29 DOWN
## ENSMUSG00000021670 -0.52 -0.51 DOWN
## ENSMUSG00000022797 -0.76 -0.61 DOWN
## ENSMUSG00000023067 -0.92 -0.86 DOWN
## ENSMUSG00000023120 -0.66 -0.74 DOWN
## ENSMUSG00000024772 -0.21 -0.26 DOWN
## ENSMUSG00000024866 -0.46 -0.40 DOWN
## ENSMUSG00000025185 -0.78 -0.72 DOWN
## ENSMUSG00000026077 -1.19 -1.36 DOWN
## ENSMUSG00000026188 -0.97 -0.86 DOWN
## ENSMUSG00000026189 -0.49 -0.42 DOWN
## ENSMUSG00000026202 -0.48 -0.41 DOWN
## ENSMUSG00000026827 -0.58 -0.53 DOWN
## ENSMUSG00000027111 -0.47 -0.39 DOWN
## ENSMUSG00000028357 -0.37 -0.32 DOWN
## ENSMUSG00000028919 -0.42 -0.39 DOWN
## ENSMUSG00000029361 -0.40 -0.61 DOWN
## ENSMUSG00000029752 -0.83 -0.74 DOWN
## ENSMUSG00000031283 -0.82 -0.81 DOWN
## ENSMUSG00000031349 -0.39 -0.31 DOWN
## ENSMUSG00000031725 -0.63 -0.59 DOWN
## ENSMUSG00000031994 -1.33 -1.26 DOWN
## ENSMUSG00000032420 -0.39 -0.32 DOWN
## ENSMUSG00000032758 -1.47 -1.41 DOWN
## ENSMUSG00000036752 -0.49 -0.42 DOWN
## ENSMUSG00000038224 -0.77 -0.69 DOWN
## ENSMUSG00000039062 -0.37 -0.32 DOWN
## ENSMUSG00000040998 -0.51 -0.43 DOWN
## ENSMUSG00000041605 -0.51 -0.45 DOWN
## ENSMUSG00000041920 -0.70 -0.66 DOWN
## ENSMUSG00000042487 -0.49 -0.42 DOWN
## ENSMUSG00000043091 -0.43 -0.35 DOWN
## ENSMUSG00000043681 -0.81 -0.73 DOWN
## ENSMUSG00000045136 -0.53 -0.44 DOWN
## ENSMUSG00000046070 -0.50 -0.42 DOWN
## ENSMUSG00000050097 -0.69 -0.66 DOWN
## ENSMUSG00000052562 -1.20 -1.12 DOWN
## ENSMUSG00000053303 -1.45 -1.36 DOWN
## ENSMUSG00000054520 -0.53 -0.55 DOWN
## ENSMUSG00000054986 -1.13 -1.04 DOWN
## ENSMUSG00000055116 -1.04 -0.97 DOWN
## ENSMUSG00000056749 -1.03 -0.95 DOWN
## ENSMUSG00000058258 -0.50 -0.47 DOWN
## ENSMUSG00000058672 -0.49 -0.41 DOWN
## ENSMUSG00000059743 -0.42 -0.35 DOWN
## ENSMUSG00000063694 -0.35 -0.30 DOWN
## ENSMUSG00000070985 -0.49 -0.38 DOWN
## ENSMUSG00000074261 -0.37 -0.30 DOWN
## ENSMUSG00000074715 -2.05 -1.99 DOWN
## ENSMUSG00000090236 -0.36 -0.33 DOWN
## ENSMUSG00000090264 -0.85 -0.79 DOWN
## ENSMUSG00000001280 0.17 0.26 UP
## ENSMUSG00000002265 0.34 0.39 UP
## ENSMUSG00000002346 0.41 0.49 UP
## ENSMUSG00000003477 0.62 0.70 UP
## ENSMUSG00000003849 0.33 0.39 UP
## ENSMUSG00000004105 0.31 0.37 UP
## ENSMUSG00000005034 0.15 0.23 UP
## ENSMUSG00000005483 0.36 0.42 UP
## ENSMUSG00000006127 0.24 0.28 UP
## ENSMUSG00000006216 0.22 0.34 UP
## ENSMUSG00000006269 0.19 0.26 UP
## ENSMUSG00000006333 0.19 0.26 UP
## ENSMUSG00000007872 0.82 0.89 UP
## ENSMUSG00000008682 0.24 0.30 UP
## ENSMUSG00000009927 0.16 0.23 UP
## ENSMUSG00000012848 0.22 0.28 UP
## ENSMUSG00000015656 0.36 0.39 UP
## ENSMUSG00000015957 1.15 1.41 UP
## ENSMUSG00000018900 0.45 0.52 UP
## ENSMUSG00000020372 0.20 0.28 UP
## ENSMUSG00000020427 0.64 0.71 UP
## ENSMUSG00000020473 0.26 0.33 UP
## ENSMUSG00000020482 0.34 0.41 UP
## ENSMUSG00000020607 0.38 0.45 UP
## ENSMUSG00000020653 0.81 0.89 UP
## ENSMUSG00000020889 0.74 0.79 UP
## ENSMUSG00000021482 0.22 0.27 UP
## ENSMUSG00000021508 0.75 0.81 UP
## ENSMUSG00000021775 0.84 0.95 UP
## ENSMUSG00000022122 0.42 0.47 UP
## ENSMUSG00000022389 0.65 0.71 UP
## ENSMUSG00000022949 0.72 0.80 UP
## ENSMUSG00000023022 0.23 0.27 UP
## ENSMUSG00000024298 0.22 0.32 UP
## ENSMUSG00000024900 0.38 0.45 UP
## ENSMUSG00000025019 0.26 0.30 UP
## ENSMUSG00000025197 0.28 0.36 UP
## ENSMUSG00000025511 0.44 0.51 UP
## ENSMUSG00000025764 0.23 0.30 UP
## ENSMUSG00000025815 0.59 0.44 UP
## ENSMUSG00000026313 0.15 0.32 UP
## ENSMUSG00000027314 0.59 0.64 UP
## ENSMUSG00000027796 0.72 0.79 UP
## ENSMUSG00000027875 1.68 1.72 UP
## ENSMUSG00000028081 0.23 0.30 UP
## ENSMUSG00000028957 1.23 1.26 UP
## ENSMUSG00000029587 0.24 0.39 UP
## ENSMUSG00000029714 0.20 0.34 UP
## ENSMUSG00000030201 0.26 0.30 UP
## ENSMUSG00000030256 0.98 1.19 UP
## ENSMUSG00000031167 0.53 0.61 UP
## ENSMUSG00000031320 0.26 0.32 UP
## ENSMUSG00000032097 0.20 0.26 UP
## ENSMUSG00000032594 0.18 0.29 UP
## ENSMUSG00000032624 0.22 0.28 UP
## ENSMUSG00000033327 0.43 0.50 UP
## ENSMUSG00000033350 0.37 0.43 UP
## ENSMUSG00000033411 0.20 0.36 UP
## ENSMUSG00000034111 0.21 0.28 UP
## ENSMUSG00000034450 1.88 1.93 UP
## ENSMUSG00000034460 0.32 0.38 UP
## ENSMUSG00000035469 0.19 0.29 UP
## ENSMUSG00000035530 0.20 0.27 UP
## ENSMUSG00000035614 0.18 0.32 UP
## ENSMUSG00000037172 0.22 0.28 UP
## ENSMUSG00000037465 0.58 0.61 UP
## ENSMUSG00000037523 0.18 0.24 UP
## ENSMUSG00000037621 0.52 0.59 UP
## ENSMUSG00000038393 0.68 0.77 UP
## ENSMUSG00000039108 0.16 0.26 UP
## ENSMUSG00000039783 0.37 0.45 UP
## ENSMUSG00000039789 0.23 0.39 UP
## ENSMUSG00000039831 0.19 0.32 UP
## ENSMUSG00000040363 0.20 0.29 UP
## ENSMUSG00000040423 0.21 0.30 UP
## ENSMUSG00000040584 0.52 0.59 UP
## ENSMUSG00000040740 0.58 0.65 UP
## ENSMUSG00000041075 0.28 0.35 UP
## ENSMUSG00000041351 0.41 0.46 UP
## ENSMUSG00000041841 0.25 0.30 UP
## ENSMUSG00000042046 0.20 0.26 UP
## ENSMUSG00000042659 0.36 0.45 UP
## ENSMUSG00000042745 0.46 0.53 UP
## ENSMUSG00000043144 0.60 0.67 UP
## ENSMUSG00000044026 0.23 0.30 UP
## ENSMUSG00000045382 0.58 0.69 UP
## ENSMUSG00000045441 0.34 0.41 UP
## ENSMUSG00000045519 0.37 0.53 UP
## ENSMUSG00000047215 0.21 0.28 UP
## ENSMUSG00000047878 0.25 0.35 UP
## ENSMUSG00000048826 0.26 0.34 UP
## ENSMUSG00000049241 0.55 0.61 UP
## ENSMUSG00000050100 0.73 0.76 UP
## ENSMUSG00000053411 0.35 0.44 UP
## ENSMUSG00000053964 0.64 0.82 UP
## ENSMUSG00000054499 0.26 0.32 UP
## ENSMUSG00000054793 0.22 0.29 UP
## ENSMUSG00000055866 0.81 0.77 UP
## ENSMUSG00000055980 0.27 0.34 UP
## ENSMUSG00000056851 0.18 0.27 UP
## ENSMUSG00000058056 0.34 0.41 UP
## ENSMUSG00000058600 0.19 0.30 UP
## ENSMUSG00000058655 0.26 0.33 UP
## ENSMUSG00000059824 2.26 2.26 UP
## ENSMUSG00000061143 0.28 0.34 UP
## ENSMUSG00000061353 0.55 0.60 UP
## ENSMUSG00000061477 0.16 0.23 UP
## ENSMUSG00000062563 0.29 0.35 UP
## ENSMUSG00000063681 0.66 0.76 UP
## ENSMUSG00000064065 0.26 0.31 UP
## ENSMUSG00000067586 0.36 0.42 UP
## ENSMUSG00000068742 0.45 0.48 UP
## ENSMUSG00000069495 0.18 0.28 UP
## ENSMUSG00000070348 0.32 0.39 UP
## ENSMUSG00000071415 0.19 0.26 UP
## ENSMUSG00000074063 0.48 0.55 UP
## ENSMUSG00000074578 0.32 0.45 UP
## ENSMUSG00000086583 0.42 0.48 UP
## ENSMUSG00000098557 0.28 0.34 UP
## ENSMUSG00000106847 0.29 0.36 UP
## ENSMUSG00000110185 0.26 0.33 UP
## ENSMUSG00000110195 0.32 0.33 UP
The table is important to identify which pathways are represented in
the Venn diagram. Often, it is a good idea to run both EdgeR and DESeq2.
You can choose whether to focus on the results of one, both, or the
shared genes. It can be argued that the shared pathways have some higher
confidence in their significance.
Now let’s work with a couple of different diverse tools in R.
We will perform a multiple sequence alignment (msa).
Let’s load
the required library.
library(msa)
Let’s load our amino acid sequences from a fasta file.
seq <- readAAStringSet("hglobin.fa")
seq
## AAStringSet object of length 3:
## width seq names
## [1] 142 MVLSPADKTNVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] 142 MVLSGEDKSNIKAAWGKIGGHGA...PAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] 142 MSLTRTERTIILSLWSKISTQAD...ADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
Let’s do a msa of the amino acid sequences, using the ClustalW tool.
alignments <- msa(seq, method = "ClustalW")
## use default substitution matrix
alignments
## CLUSTAL 2.1
##
## Call:
## msa(seq, method = "ClustalW")
##
## MsaAAMultipleAlignment with 3 rows and 142 columns
## aln names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] MSLTRTERTIILSLWSKISTQADVIG...FTADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
## Con MVLS??DKTNIKAAWGKIG?HA?EYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus
Let’s print our alignments to a pdf file.
msaPrettyPrint(alignments, output = "pdf", showNames = "left",
showLogo = "none", askForOverwrite = FALSE,
verbose = TRUE, file = "whole_align.pdf")
## Multiple alignment written to temporary file /tmp/Rtmp2XBKuv/seq225f4d281c47.fasta
## File whole_align.tex created
## Output file whole_align.pdf created
msaPrettyPrint(alignments, c(10, 30), output = "pdf", showNames = "left",
file = "zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
verbose = TRUE)
## Multiple alignment written to temporary file /tmp/Rtmp2XBKuv/seq225f4e6328e2.fasta
## File zoomed_align.tex created
## Output file zoomed_align.pdf created
Let’s make a tree for our alignment.
First, let’s load the required packages.
library(ape)
library(seqinr)
Convert to seqinr alignment -> get the distances and make a tree.
alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")
distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")
tree <- ape::nj(distances1)
plot(tree, main = "Phylogenetic Tree of HBA Sequences")
Here, we used the neighbour-joining method for our alignment.
Let’s load the required package.
library(DECIPHER)
In the first step, we will load the libraries and the sequence into long_seqs. This is a DNAStringSet object.
long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq
## DNAStringSet object of length 5:
## width seq names
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT NC_018523.1 Sacch...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA NC_022431.1 Ascle...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC NC_022259.1 Nanno...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC NC_022417.1 Cocos...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT NC_022459.1 Camel...
Now we build a temporary SQLite database.
Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
## Adding 5 sequences to the database.
##
## Added 5 new sequences to table Seqs.
## 55 total sequences in table Seqs.
## Time difference of 0.2 secs
Now that we’ve built the database, we can do the following:
Find
the syntheny blocks.
synteny <- FindSynteny("long_db")
## ================================================================================
##
## Time difference of 7.5 secs
Let’s plot the synteny blocks to view them.
pairs(synteny)
plot(synteny)
The colors across the line is the basepairs, the blocks are then
conserved regions.
The pairing and comparison depends on the order
of the sequences, as they are all compared to the first sequence.
Let’s make an actual alignment file.
alignment <- AlignSynteny(synteny, "long_db")
## ================================================================================
##
## Time difference of 78 secs
This is a pairwise alignment of all blocks of synteny between sets of sequences.
Let’s create a structure holding all aligned syntenic blocks for a pair of sequences.
block <- unlist(alignment[[1]])
Let’s finish by writing to file one alignment at a time.
writeXStringSet(block, "genome_blocks_out.fa")
Let’s load the required packages.
library(locfit)
library(Rsamtools)
Let’s create a function that will load the gene region information in a GFF file and convert it to a bioconductor GRanges object.
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Get counts into windows across the genome in 500bp segments.
whole_genome <- csaw::windowCounts(
file.path("windows.bam"),
bin = TRUE,
filter = 0,
width = 500,
param = csaw::readParam(
minq = 20, #determines what is a passing read
dedup = TRUE, #removes pcr duplicates
pe = "both" #requires that both pairs of reads are aligned
)
)
Since this is a single column of data, let’s rename it.
colnames(whole_genome) <- c("small_data")
Let’s extract our annotated regions.
annotated_regions <- get_annotated_regions_from_gff(file.path("genes.gff"))
Now that we have the windows of high expression, we want to see if
any of them overlap with annotated regions.
Let’s load the required packages.
library(IRanges)
library(SummarizedExperiment)
Let’s find the overlaps between the windows we made.
windows_in_genes <- IRanges::overlapsAny(
SummarizedExperiment::rowRanges(whole_genome), # creates a GRanges object
annotated_regions
)
This goes over the entire annotated region and windows to see if overlap is TRUE.
windows_in_genes
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
Here let’s subset the whole_genome object into annotated and nonannotated regions.
annotated_window_counts <- whole_genome[windows_in_genes, ]
non_annotated_window_counts <- whole_genome[!windows_in_genes, ]
Now, let’s use assay() to extract the actual counts from the GRanges object.
assay(non_annotated_window_counts)
## small_data
## [1,] 0
## [2,] 31
## [3,] 25
## [4,] 0
## [5,] 0
## [6,] 24
## [7,] 25
## [8,] 0
## [9,] 0
In this next step, we use Rsamtools Pileup() function to get a
per-base coverage dataframe.
Each row represents a single
nucleotide in the reference count and the count column gives the depth
of coverage at that point.
First, we load the required package.
library(bumphunter)
pile_df <- Rsamtools::pileup("windows.bam")
Next, let’s group reads with a certain distance from each other into
a cluster, we then give the sequence names, position and distance.
A gene is very big, so you have reads that overlap, but do not
completely overlap, thus, you want to group together those that overlap
a lot.
clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap = 100)
table(clusters)
## clusters
## 1 2 3
## 1486 1552 1520
Lastly, we will map the reads to the regions we found for the genome.
bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff = 1)
## chr start end value area cluster indexStart indexEnd L clusterL
## 3 Chr1 4503 5500 10.4 15811 3 3039 4558 1520 1520
## 1 Chr1 502 1500 10.0 14839 1 1 1486 1486 1486
## 2 Chr1 2501 3500 8.7 13436 2 1487 3038 1552 1552
Let’s go on to working with Phylogenetic trees!
Let’s load the required packages.
library(ggplot2)
library(ggtree)
library(treeio)
First, we need to load our raw tree data. It’s a Newick format so we use:
itol <- ape::read.tree("itol.nwk")
Now we will print out a very basic phylogenetic tree.
ggtree(itol)
We can also change the format to make it a circular tree
ggtree(itol, layout = "circular")
We can also change the left-right / up-down direction.
ggtree(itol) + coord_flip() + scale_x_reverse()
By using geom_tiplab() we can add names to the end of tips.
ggtree(itol) + geom_tiplab(color = "indianred", size = 1.5)
By adding a geom_strip() layer we can annotate clades in the tree with a block of color.
ggtree(itol) + geom_strip(13, 14, color = "red", barsize = 1)
We can highlight clades in unrooted trees with blobs of color using geom_hilight.
ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "steelblue")
We have to give a node number for it to know where to draw. But how
do we know, when we cannot see node numbers on the tree?
Let’s load some more packages:
install.packages("BAMMtools")
library(BAMMtools)
We can use the MRCA (most recent common ancestor) function to find the node we want.
getmrca(itol, "Photorhabdus_luminescens", "Blochmannia_floridanus")
## [1] 206
Now if we want to highlight the section of the most recent common ancestor between the two above, then we can do the following:
ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "steelblue")
Quantifying difference between trees can be done with treespace.
First, let’s load the required packages
library(ape)
library(adegraphics)
library(treespace)
Now we need to load all the treefiles into a multiPhylo object.
treefiles <- list.files(file.path(getwd(), "genetrees"), full.names = TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)
## [1] "list"
We want the class to be a mulitPhylo, thus we do the following:
class(tree_list) <- "multiPhylo"
class(tree_list)
## [1] "multiPhylo"
Now we can compute the kendall-colijn distances between trees. This
function does a LOT of analysis:
These results are returned in a list of objects, where $D contains
the pairwise metric of the trees, and $pco contains the PCA. The method
we use (Kendal-Colijn) is particularly suitable for rooted trees as we
have here. The option NF tells us how many principal components to
retain.
comparisons <- treespace(tree_list, nf = 3)
We can plot the pairwise distances between trees trees with table.image
table.image(comparisons$D, nclass = 25)
The number of classes determines the resolution of the image.
Now let’s print the PCA and clusters, this shows us how the group of trees cluster.
plotGroves(comparisons$pco, lab.show = TRUE, lab.cex = 1.5)
The more clustered together, the more similar the sequences are. In
this case, the PCA is all over the place.
To work with this, we will take our PCA, and show what we think are different clusters. Thus we have grouped the clusters. The number of clusters is given by nclust. Forcing too many clusters will give conflicting reports.
groves <- findGroves(comparisons, nclust = 4)
plotGroves(groves)
Doing fewer clusters, like 2, may also look fine. This is where it
becomes subjective how many clusters there should be.
Extracting and working with subtrees can be done using APE.
Let’s load our required packages.
library(ape)
Now let’s load the tree data we will be working with.
newick <- read.tree("mammal_tree.nwk")
l <- subtrees(newick)
Let’s plot the tree to see what it looks like.
plot(newick)
We can subset this plot using the “node” function.
plot(l[[4]], sub = "Node 4")
Now we extract the tree manually.
small_tree <- extract.clade(newick, 9)
plot(small_tree)
Now what if we want to bind two trees together? We can do the following:
new_tree <- bind.tree(newick, small_tree, 3)
plot(new_tree)
This binds the subsetted tree to the regular tree. This would make
sense if we had different subsets of trees, which we would like to bind
together.
Reconstructing trees from alignments can be done with phangorn.
Let’s load the required packages.
library(Biostrings)
library(msa)
library(phangorn)
First, we’ll load the sequences into a seqs variable.
seqs <- readAAStringSet("abc.fa")
Now, let’s construct an alignment with the msa package and ClustalOmega.
aln <- msa::msa(seqs, method = c("ClustalOmega"))
## using Gonnet
To create a tree, we need to convert the alignment to a phyData objects.
aln <- as.phyDat(aln, type = "AA")
class(aln)
## [1] "phyDat"
In this step, we’ll actually make the trees. Trees are made from a distance matrix, which can be computed with dist.ml() - ML stands for maximum likelihood.
dist_mat <- dist.ml(aln)
Now we pass the distance matrix to UPGMA to make a UPGMA tree.
upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main = "UPGMA")
Conversely, we can pass the distance matrix to a neighbour-joining function.
nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")
Lastly, we are going to use the bootstraps.phyDat() function to
compute bootstrap support for the branches of the tree. The first
argument is the object (aln), while the second argument in the function
is nj.
Bootstraps are essentially confidence intervals for how the clade is placed in the correct position.
fit <- pml(nj_tree, aln)
bootstraps <- bootstrap.phyDat(aln, FUN = function(x) {NJ(dist.ml(x))}, bs = 100)
plotBS(nj_tree, bootstraps, p = 10)
Finally, let’s finish by working with genome wide association studies
in R.
Here, we will work with tools to find and map genetic variants.
First, let’s load the required libraries.
library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
library(VariantTools)
Now we want to load our datasets. We need .Bam and .fa files for this pipeline to work.
bam_file <- file.path(getwd(), "hg17_snps.bam")
fasta_file <- file.path(getwd(), "chr17_83k.fa")
Now we need to set up the genome object and the parameters object:
fa <- rtracklayer::FastaFile(fasta_file)
Now we create a GMapGenome object, which describes the genome to the later variant calling function.
genome <- gmapR::GmapGenome(fa, create = TRUE)
This next step sets our parameter for what is considered a variant.
There can be a lot of fine-tuning to this function. We are just
going to use the standard settings.
If done for a research purpose,
you could have a test set of bam files, making your own SNPs to see how
good it is at identifying them.
qual_params <- TallyVariantsParam(
genome = genome,
minimum_mapq = 20)
var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)
Now we use the callVariants function in accordance with the parameters we defined above:
called_Variants <- callVariants(bam_file,
qual_params,
calling.filter = var_params)
head(called_Variants)
## VRanges object with 6 ranges and 17 metadata columns:
## seqnames ranges strand ref alt totalDepth
## <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
## [1] NC_000017.10 64 * G T 759
## [2] NC_000017.10 69 * G T 812
## [3] NC_000017.10 70 * G T 818
## [4] NC_000017.10 73 * T A 814
## [5] NC_000017.10 77 * T A 802
## [6] NC_000017.10 78 * G T 798
## refDepth altDepth sampleNames softFilterMatrix | n.read.pos
## <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <integer>
## [1] 739 20 <NA> | 17
## [2] 790 22 <NA> | 19
## [3] 796 22 <NA> | 20
## [4] 795 19 <NA> | 13
## [5] 780 22 <NA> | 19
## [6] 777 21 <NA> | 17
## n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
## <integer> <integer> <integer> <integer> <integer>
## [1] 64 759 20 739 0
## [2] 69 812 22 790 0
## [3] 70 818 22 796 0
## [4] 70 814 19 795 0
## [5] 70 802 22 780 0
## [6] 70 798 21 777 0
## count.minus.ref count.del.plus count.del.minus read.pos.mean
## <integer> <integer> <integer> <numeric>
## [1] 0 0 0 30.9000
## [2] 0 0 0 40.7273
## [3] 0 0 0 34.7727
## [4] 0 0 0 36.1579
## [5] 0 0 0 38.3636
## [6] 0 0 0 39.7143
## read.pos.mean.ref read.pos.var read.pos.var.ref mdfne mdfne.ref
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 32.8755 318.558 347.804 NA NA
## [2] 35.4190 377.004 398.876 NA NA
## [3] 36.3442 497.762 402.360 NA NA
## [4] 36.2176 519.551 402.843 NA NA
## [5] 36.0064 472.327 397.070 NA NA
## [6] 35.9241 609.076 390.463 NA NA
## count.high.nm count.high.nm.ref
## <integer> <integer>
## [1] 20 738
## [2] 22 789
## [3] 22 796
## [4] 19 769
## [5] 22 780
## [6] 21 777
## -------
## seqinfo: 1 sequence from chr17_83k genome
## hardFilters(4): nonRef nonNRef readCount likelihoodRatio
We have identified 6 variants.
Now, we move on to annotation and load the feature position information from gff.
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Note that you can load this data from a BED file. A Browser
Extensible Data (BED) file, is a text file format used to store genomic
regions as coordinates and associated annotations. You can read more
about the bed files and the package here.
Let’s save our annotated genes into a variable using the function above.
genes <- get_annotated_regions_from_gff("chr17.83k.gff3")
Now we can calculate which variants overlap which genes:
overlaps <- GenomicRanges::findOverlaps(called_Variants, genes)
overlaps
## Hits object with 12684 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 35176 1
## [2] 35176 2
## [3] 35176 3
## [4] 35177 1
## [5] 35177 2
## ... ... ...
## [12680] 40944 2
## [12681] 40944 7
## [12682] 40945 1
## [12683] 40945 2
## [12684] 40945 7
## -------
## queryLength: 44949 / subjectLength: 8
And lastly, we subset the genes with the list of overlaps:
identified <- genes[subjectHits(overlaps)]
identified
## GRanges object with 12684 ranges and 20 metadata columns:
## seqnames ranges strand | source type score
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
## [1] NC_000017.10 64099-76866 - | havana ncRNA_gene NA
## [2] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## [3] NC_000017.10 64099-65736 - | havana exon NA
## [4] NC_000017.10 64099-76866 - | havana ncRNA_gene NA
## [5] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## ... ... ... ... . ... ... ...
## [12680] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## [12681] NC_000017.10 76723-76866 - | havana exon NA
## [12682] NC_000017.10 64099-76866 - | havana ncRNA_gene NA
## [12683] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## [12684] NC_000017.10 76723-76866 - | havana exon NA
## phase ID Name biotype
## <integer> <character> <character> <character>
## [1] <NA> gene:ENSG00000280279 AC240565.2 lincRNA
## [2] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## [3] <NA> <NA> ENSE00003759547 <NA>
## [4] <NA> gene:ENSG00000280279 AC240565.2 lincRNA
## [5] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## ... ... ... ... ...
## [12680] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## [12681] <NA> <NA> ENSE00003756684 <NA>
## [12682] <NA> gene:ENSG00000280279 AC240565.2 lincRNA
## [12683] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## [12684] <NA> <NA> ENSE00003756684 <NA>
## description gene_id logic_name version
## <character> <character> <character> <character>
## [1] novel transcript ENSG00000280279 havana 1
## [2] <NA> <NA> <NA> 1
## [3] <NA> <NA> <NA> 1
## [4] novel transcript ENSG00000280279 havana 1
## [5] <NA> <NA> <NA> 1
## ... ... ... ... ...
## [12680] <NA> <NA> <NA> 1
## [12681] <NA> <NA> <NA> 1
## [12682] novel transcript ENSG00000280279 havana 1
## [12683] <NA> <NA> <NA> 1
## [12684] <NA> <NA> <NA> 1
## Parent tag transcript_id
## <CharacterList> <character> <character>
## [1] <NA> <NA>
## [2] gene:ENSG00000280279 basic ENST00000623180
## [3] transcript:ENST00000.. <NA> <NA>
## [4] <NA> <NA>
## [5] gene:ENSG00000280279 basic ENST00000623180
## ... ... ... ...
## [12680] gene:ENSG00000280279 basic ENST00000623180
## [12681] transcript:ENST00000.. <NA> <NA>
## [12682] <NA> <NA>
## [12683] gene:ENSG00000280279 basic ENST00000623180
## [12684] transcript:ENST00000.. <NA> <NA>
## transcript_support_level constitutive ensembl_end_phase ensembl_phase
## <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA>
## [2] 5 <NA> <NA> <NA>
## [3] <NA> 1 -1 -1
## [4] <NA> <NA> <NA> <NA>
## [5] 5 <NA> <NA> <NA>
## ... ... ... ... ...
## [12680] 5 <NA> <NA> <NA>
## [12681] <NA> 1 -1 -1
## [12682] <NA> <NA> <NA> <NA>
## [12683] 5 <NA> <NA> <NA>
## [12684] <NA> 1 -1 -1
## exon_id rank
## <character> <character>
## [1] <NA> <NA>
## [2] <NA> <NA>
## [3] ENSE00003759547 5
## [4] <NA> <NA>
## [5] <NA> <NA>
## ... ... ...
## [12680] <NA> <NA>
## [12681] ENSE00003756684 1
## [12682] <NA> <NA>
## [12683] <NA> <NA>
## [12684] ENSE00003756684 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Let’s try predicting Open Reading Frames (ORFs) in R.
First, let’s load the required packages.
library(Biostrings)
library(systemPipeR)
Let’s load the data into a DNAStrings object, in this case, an Arabidopsis chloroplast genome.
dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa")
Now let’s predict the open reading frames with predORF(), we’ll predict all ORFs on both strands.
predict_orfs <- predORF(dna_object, n = "all", type = "gr", mode = "ORF", strand = "both",
longest_disjoint = TRUE)
predict_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 1162 chloroplast 2056-2532 - | 1 3
## 2 chloroplast 72371-73897 + | 2 2
## 1163 chloroplast 77901-78362 - | 2 1
## 3 chloroplast 54937-56397 + | 3 3
## ... ... ... ... . ... ...
## 2497 chloroplast 129757-129762 - | 1336 3
## 2498 chloroplast 139258-139263 - | 1337 3
## 2499 chloroplast 140026-140031 - | 1338 3
## 2500 chloroplast 143947-143952 - | 1339 3
## 2501 chloroplast 153619-153624 - | 1340 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
This printed out a GRanges object in return, with 2501 open reading
frames.
This is FAR too many open reading frames. It finds every TF
binding site, Start codon, etc.
To filter out erronous ORFs, we do a simulation. We first estimate the length an ORF can reach by chance. We will create a string of random nucleotides that is the length of our chloroplast genome, and determine the longer ORFs that can arise by chance.
bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")
Now we ensure that our random nucleotides match the proportion of nucleotides in our chloroplast genome, so we have no bias.
seq_length <- width(dna_object[1])
Now we make a function to go over the column, every row, find the bases, and add them.
counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))})
Now, we find the probabilities of each nucleotide being at a position.
probs <- unlist(lapply(counts, function(base_count) {signif(base_count/seq_length, 2)}))
probs
## [1] 6.5e-06 6.5e-06 6.5e-06 6.5e-06
As can be seen, the probability for each nucleotide to be at each
position is equal. Thus, it seems there is no bias.
Now we can build our function to simulate our genome.
get_longest_ORF_in_random_genome <- function(x,
length = 1000,
probs = c(0.25, 0.25, 0.25, 0.25),
bases = c("A", "T", "G", "C")){
# Here we create our random genome and allow replacement for the next iteration.
random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
random_dna_object <- DNAStringSet(random_genome)
names(random_dna_object) <- c("random_dna_string")
orfs <- predORF(random_dna_object, n = 1, type = "gr", mode = "ORF", strand = "both", longest_disjoint = TRUE)
return(max(width(orfs)))
}
Now we use the function we just created to predict the ORFs in 10 random genomes.
random_lengths <- unlist(lapply(1:10, get_longest_ORF_in_random_genome, length = seq_length, probs = probs, bases = bases))
Let’s pull out the longest length from our 10 simulations.
longest_random_orf <- max(random_lengths)
We have now done the simulation, where each basepair was randomly
selected.
By chance, how many ORFs do we predict, and what is the
longest possible one?
Let’s only keep the frames that are longer in our chloroplast genome.
keep <- width(predict_orfs) > longest_random_orf
This creates TRUE / FALSE statements. We can use this to filter.
orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
## GRanges object with 12 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 2 chloroplast 72371-73897 + | 2 2
## 3 chloroplast 54937-56397 + | 3 3
## 4 chloroplast 57147-58541 + | 4 1
## 5 chloroplast 33918-35141 + | 5 1
## .. ... ... ... . ... ...
## 8 chloroplast 114461-115447 + | 8 2
## 9 chloroplast 141539-142276 + | 9 2
## 10 chloroplast 60741-61430 + | 10 1
## 11 chloroplast 143088-143708 + | 11 1
## 12 chloroplast 62038-62619 + | 12 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Let’s write this data to a file.
extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")
Let’s work with finding karyotypes.
First, let’s load the required packages.
library(karyoploteR)
library(GenomicRanges)
Now we need to set up the genome object for our karyotype.
genome_df <- data.frame(
# first we dictate the number of chromosomes.
chr = paste0("chr", 1:5),
start = rep(1, 5),
# and then we will dictate the length of each chromosome.
end = c(34964571, 22037565, 25499034, 20862711, 31279811)
)
Now we convert the dataframe to a GRanges object.
genome_gr <- makeGRangesFromDataFrame(genome_df)
Now let’s create some snp positions to map out. We do this by using the sample() function.
snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
chr = paste0("chr", sample(1:5, 25, replace = TRUE)),
start = snp_pos,
end = snp_pos
)
Again we convert the dataframe to a GRanges object.
snps_gr <- makeGRangesFromDataFrame(snps)
Now let’s create some snp labels.
snp_labels <- paste0("snp_", 1:25)
Here we will set the parameters for our plot.
plot.params <- getDefaultPlotParams(plot.type = 1)
Here we will set the margins of our plot.
plot.params$data1outmargin <- 600
Now let’s plot our snps.
kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
We can also add some numeric data to our plots. We will generate 100 random numbers that plot to 100 windows on our chromosome 4.
numeric_data <- data.frame(
y = rnorm(100, mean = 1, sd = 0.5),
chr = rep("chr4", 100),
start = seq(1, 20862771, 20862771 / 100),
end = seq(1, 20862771, 20862771 / 100)
)
Now let’s make the data a GRanges object.
numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)
Let’s set our plot parameters again.
plot.params <- getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800
Let’s plot the data.
kp <- plotKaryotype(genome = genome_gr, plot.type = 2, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)
That was the end of the final project, going through the course Bioinformatics with R on Praxis AI.