This is a tutorial on how to use R Markdown for Reproducible research
Here we can type long messages or long descripitions of our data without hashtags before the text. In our first example we will be using ToothGrowth dataset. In this experiment, the rodent, Guinea Pigs (literal) were given varying amounts of vitamin c. This was done to see the effects on the animal’s tooth growth.
To run code in a Markdown File, we need to denote the section that is considered actual R code. We call these sections, “code chunks.”
Below is a code chunk:
Toothdata <- ToothGrowth
head (Toothdata)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
As we can see, from running the “play” functionality in the code chunk, the results have been printed in the inline of the R Markdown file.
fit <- lm(len ~ dose, data = Toothdata)
b <- fit$coefficients
plot(len ~ dose, data = Toothdata)
abline(lm(len ~ dose, data = Toothdata))
Figure 1: The tooth growth of Guinea Pigs when given variable amounts of vitamin C
The slope of the regression line is 9.7635714.
We can also put sections and subsections in our R Markdown files, similar to numbers or bullet points in a word document. This is done with the “#” that we previously used to denote text in an R Script.
ENSURE you have have a space after the “#”. If not present, it will fail to understand what you are trying to accomplish.
We can also add bullet point-type marks in our R Markdown file
It’s important to note here that in R Markdown indentation matters
We can put really nice quotes into the markdown document. We do this by using the “>” symbol.
“Now cracks a noble heart. Good night, sweet prince, and flights of angels sing thee to thy rest.”
— William Shakespeare
Hyperlinks can also be incorporated into these files. This is quite useful in HTML formatted files, since being in a web browser, will redirect the reader to the material that you wish to show them. Here we will demonstrate the link to R Markdown’s home page for this example. RMarkdown
We can also insert really nice formulas into R Markdown using two
dollar sigs
Hard-Weinberg Formula
\[p^2 + 2qp + q^2 = 1\]
And you can get really complex ones as well
\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
There are also options for your R Markdown file on how knitr interprets the code chunk. There are the following options
Eval (T or F): whether or not to evaluate the code chunks
Echo (T or F): Whatever or not to show the code for the chunks, but results will still print.
Cache: If enable, the same code chunk will not be evaluated the next time that the knitr is run. Great for code that has LONG run times.
fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document then to files that are saved separately.
out.width or out.height: The output size of the R Plots IN THE R DOCUMENT.
fig.cap: the words for the figure caption
We can also add a table of contents to our HTML Document. We do this altering the YAML code (the weird code chunk at the VERY top of the documents.) We can add this:
title: “HTML Tutorial 1,2,3” author: “Karsten Condron” date: “2024-07-16” output: html_document: toc: true toc_float: true
This will give us a very nice floating table of contents on the right hand side of the document.
You can also add TABS in our report. To do this you need to specify each section that you want to become a tab by placing “{.tabset}” after the line. Every subsequent header will be a new tab.
You can also add themes to your HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this, you can change your theme in the YAML to one of the following
cerulean journal flatly spacelab united cosmo lumen paper sandstone simplex yeti null
You can also change the color by specifying highlight:
default tango payments kate monochrome espresso zenburn haddock textmate
You can also use the code_folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:
code_folding: hide
There are a ton of different aspects and avenues to mess around with in R Markdown “HTML” when compared to the traditional R Script. Making a sort of portfolio within this is very easily done.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
??flights
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
First we will just look at the data on October 14th.
filter(my_data, month == 10, day==14)
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we want to subset this into a new variable, we do the following
Oct_14_flights <- filter(my_data, month == 10, day==14)
What if you want to do both print and save the variable?
(Oct_14_flights_2 <- filter(my_data, month == 10, day==14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If you want to filter based on different operators, you can use the following:
Equals == Not equal to != greater than > less than < Greater than or equal to >= Less than or equal to <=
(flights_through_September <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we dont want to use the == to mean equals, we get this:
(Oct_14_flights_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
You can also use logical components to be more selective
Lets use the “or” function to pick flights in March and April
March_April_Flights <- filter(my_data, month == 3 | month == 4)
March_4th_Flights <- filter(my_data, month == 3 & day == 4)
Non_January_flights <- filter(my_data, month != 1)
Arrange allows us to arrange the data set based on the variables we desire.
arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can also do this in this in descending fashion
descending <- arrange(my_data, desc(year), desc(day), desc (month))
Missing values are always placed at the end of the dataframe, regardless of ascending or descending
We can also select specific columns that we want to look at.
calender <- select(my_data, year, month, day)
print(calender)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
We can also look at a range of columns
calender2 <- select(my_data, year:day)
Lets look at all the columns months through carrier
calender3 <- select(my_data, year:carrier)
We can also choose which columns NOT to include
everything_else <- select(my_data, -(year:day))
In this instance we can also use the exclamation point
everything_else <- select(my_data, !(year:day))
There are also some other helper functions that can help you select the columns or data you are looking for
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
rename(my_data, departure_time = dep_time)
## # A tibble: 336,776 × 19
## year month day departure_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
my_data <- rename(my_data, departure_time = dep_time)
What if you want to add new columns to your data frame? We have the Mutate function for that
First, lets make a smaller data frame so we can see what we are doing.
my_data_small <- select(my_data, year:day, distance, air_time)
Lets calculate the speed of the flights.
mutate(my_data_small, speed = distance / air_time * 60)
## # A tibble: 336,776 × 6
## year month day distance air_time speed
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2013 1 1 1400 227 370.
## 2 2013 1 1 1416 227 374.
## 3 2013 1 1 1089 160 408.
## 4 2013 1 1 1576 183 517.
## 5 2013 1 1 762 116 394.
## 6 2013 1 1 719 150 288.
## 7 2013 1 1 1065 158 404.
## 8 2013 1 1 229 53 259.
## 9 2013 1 1 944 140 405.
## 10 2013 1 1 733 138 319.
## # ℹ 336,766 more rows
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)
What if we wanted to create a new data frame with only your calculations (transmute)
airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed = distance / air_time)
We can use summerize to run a function on a data column to get a single return
summarize(my_data, delay = mean(dep_delay, na.rm = TRUE ))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
So we can use see here that the average delay is about 12 minutes
We gain additional value in summarize by pairing it with the by_group()
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ℹ 355 more rows
As you can see, we now have the delay by the days of the year
What happens if we don’t tell R what to do with the missing data?
summarize(by_day, delay = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # ℹ 355 more rows
We can also filter our data based on NA (which in this dataset was cancelled flights)
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
Lets run summarize again on this days
summarize(not_cancelled, delay = mean(dep_delay))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
We can also count the number of variables that are NA
sum(is.na(my_data$dep_delay))
## [1] 8255
With tibble datasets (more on them soon), we can pip results to get rid of the need to use the dollar sign
We can then summarize the number of flights by minutes delayed
my_data %>%
group_by(year, month, day) %>%
summarize(mean = mean(departure_time,na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 1385.
## 2 2013 1 2 1354.
## 3 2013 1 3 1357.
## 4 2013 1 4 1348.
## 5 2013 1 5 1326.
## 6 2013 1 6 1399.
## 7 2013 1 7 1341.
## 8 2013 1 8 1335.
## 9 2013 1 9 1326.
## 10 2013 1 10 1333.
## # ℹ 355 more rows
library(tibble)
Now we will take the time to explore tibbles. Tibbles are modified dataframes which tweak some of the older features from data frames. R is a tad old, and useful things from 20 years ago aren’t useful anymore
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
As we can see, we have the same data frame, but we have different features
We can also create a tibble from scratch with tibble()
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
## # A tibble: 5 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
You can also use tribble() for basic data table creation
tribble(
~ genea, ~ geneb, ~ genec,
##########################
110, 112, 114,
6, 5, 4
)
## # A tibble: 2 × 3
## genea geneb genec
## <dbl> <dbl> <dbl>
## 1 110 112 114
## 2 6 5 4
Tibbles are built not to overwhelm your console when printing the data. Only showing the first few lines.
This is how a data frame prints
print(by_day)
as.data.frame(by_day)
head(by_day)
nycflights13::flights %>%
print(n=10, wdth = 100)
Subsetting tibbles is easy, similar to data.frames
df_tibble <- tibble(nycflights13::flights)
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can subset by column name using the $
head(df_tibble$carrier, 30)
## [1] "UA" "UA" "AA" "B6" "DL" "UA" "B6" "EV" "B6" "AA" "B6" "B6" "UA" "UA" "AA"
## [16] "B6" "UA" "B6" "MQ" "B6" "DL" "MQ" "AA" "DL" "UA" "MQ" "UA" "B6" "B6" "DL"
We can subset by position using [[]]
head(df_tibble[[2]], 30)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If you want to use this in a pipe, you need to use the “.” placeholder
head(df_tibble %>%
.$carrier, 30)
## [1] "UA" "UA" "AA" "B6" "DL" "UA" "B6" "EV" "B6" "AA" "B6" "B6" "UA" "UA" "AA"
## [16] "B6" "UA" "B6" "MQ" "B6" "DL" "MQ" "AA" "DL" "UA" "MQ" "UA" "B6" "B6" "DL"
Some older functions do not like tibbles. Thus you might have to convert them back to dataframe
class(df_tibble)
## [1] "tbl_df" "tbl" "data.frame"
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble_2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
library(tidyverse)
How do we make a tidy dataset? The Tidyverse. It follows three rules
It is impossible to satisfy two of the three rules.
This leads to the following instructions for tidy data
Picking one consistent method of data storage makes for easier understanding of your code and what is happening behind the scene
Lets now look at what is working with tibbles
bmi <- tibble(women)
bmi %>%
mutate(bmi = (703 * weight)/(height)^2)
## # A tibble: 15 × 3
## height weight bmi
## <dbl> <dbl> <dbl>
## 1 58 115 24.0
## 2 59 117 23.6
## 3 60 120 23.4
## 4 61 123 23.2
## 5 62 126 23.0
## 6 63 129 22.8
## 7 64 132 22.7
## 8 65 135 22.5
## 9 66 139 22.4
## 10 67 142 22.2
## 11 68 146 22.2
## 12 69 150 22.1
## 13 70 154 22.1
## 14 71 159 22.2
## 15 72 164 22.2
Sometimes you’ll find datasets that don’t fit well
We’ll use the built-in data from tidyverse for this part
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
As you can see from this data, we have one variable in column A (country).
But columns B and C are both the same. Thus, there are two observations in each row
To fix this, we can use the gather function
table4a %>%
gather('1999', '2000', key = 'year', value = 'cases')
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Lets look at anotehr example
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
As you can see we have a similar problem in table 4b
table4b %>%
gather("1999", "2000", key = "year", value = "population")
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
Now what if we want to join these two tables? We can use dplyr
table4a <- table4b %>%
gather('1999', '2000', key = 'year', value = 'cases')
table4b <- table4b %>%
gather("1999", "2000", key = "year", value = "population")
print
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x55865e4fde10>
## <environment: namespace:base>
left_join(table4a, table4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 19987071 19987071
## 2 Brazil 1999 172006362 172006362
## 3 China 1999 1272915272 1272915272
## 4 Afghanistan 2000 20595360 20595360
## 5 Brazil 2000 174504898 174504898
## 6 China 2000 1280428583 1280428583
Spreading is the opposite of gathering. Lets look at table 2
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
As you can see that we have redundant info in columns 1 and 2
We can fix that by combing rows 1&2, 3&4, etc
spread(table2, key = type, value = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
This is the key of what we are turning into columns, the value is what becomes rows/observations
In summary, spread makes long tables shorter
While Gather makes wide tables, narrower and longer
Now what happens when we have two observations stuck in one column?
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
As you can see, the rate is just the population and cases combined
We can use separate
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
However, if you notice, the column type is not correct.
table3 %>%
separate(rate, into =c("cases", "population"), conver = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can specify what you want to separate based on.
table3 %>%
separate(rate, into =c("cases", "population"), sep = "/", conver = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Lets make this look more tidy
table3 %>%
separate(
year,
into = c("century", "year"),
convert = TRUE,
sep = 2
)
## # A tibble: 6 × 4
## country century year rate
## <chr> <int> <int> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 0 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 0 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 0 213766/1280428583
What happens if we want to do the inverse of separate
table5 %>%
unite(date, century, year)
## # A tibble: 6 × 3
## country date rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
table5 %>%
unite(date, century, year, sep = "")
## # A tibble: 6 × 3
## country date rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
There can be two types of missing values. NA (Explicit) or just no entry (incomplete)
gene_data <- tibble(
gene = c('a', 'a', 'a','a', 'b', 'b', 'b'),
nuc = c(20, 22, 24, 25, NA, 42, 67),
run = c(1,2,3,4,2,3,4)
)
gene_data
## # A tibble: 7 × 3
## gene nuc run
## <chr> <dbl> <dbl>
## 1 a 20 1
## 2 a 22 2
## 3 a 24 3
## 4 a 25 4
## 5 b NA 2
## 6 b 42 3
## 7 b 67 4
The nucleotide count for Gene B Run 2 is explicitly missing
The nucleotide count for gene run 1 is implicitly missing
One way we can make implicit missing values explicit is by putting runs in columns
gene_data %>%
spread(gene, nuc)
## # A tibble: 4 × 3
## run a b
## <dbl> <dbl> <dbl>
## 1 1 20 NA
## 2 2 22 NA
## 3 3 24 42
## 4 4 25 67
If we want to remove the missing values, we can use spread and gather, and na.rm = TRUE
gene_data %>%
spread(gene,nuc) %>%
gather(gene, nuc, 'a':'b', na.rm = TRUE)
## # A tibble: 6 × 3
## run gene nuc
## <dbl> <chr> <dbl>
## 1 1 a 20
## 2 2 a 22
## 3 3 a 24
## 4 4 a 25
## 5 3 b 42
## 6 4 b 67
Another way that we can make implicit values explicit, is complete()
gene_data %>%
complete(gene, run)
## # A tibble: 8 × 3
## gene run nuc
## <chr> <dbl> <dbl>
## 1 a 1 20
## 2 a 2 22
## 3 a 3 24
## 4 a 4 25
## 5 b 1 NA
## 6 b 2 NA
## 7 b 3 42
## 8 b 4 67
Sometimes an NA is present to represent a value being carried forward
treatment <- tribble(
~ person, ~treatment, ~response,
################################################
"Isaac", 1, 7,
NA, 2, 10,
NA, 3, 9,
"VDB", 1, 8,
NA, 2, 11,
NA, 3, 10,
)
treatment
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 VDB 1 8
## 5 <NA> 2 11
## 6 <NA> 3 10
What we can do here is use the fill() option
treatment %>%
fill(person)
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Isaac 1 7
## 2 Isaac 2 10
## 3 Isaac 3 9
## 4 VDB 1 8
## 5 VDB 2 11
## 6 VDB 3 10
It is rare that you will be working with a single data table, The DPLYR package allows you to join the data tables based on common values.
library(tidyverse)
library(nycflights13)
Gives you full carrier names based on letter codes
airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
Lets get info about airports
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
Lets get info about each plane
planes
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## # ℹ 3,312 more rows
Lets get some info on the weather at the airports
weather
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
Lets get some info on singular flights
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Lets look at how these tables connect
Keys are unique identifiers per observation
Primary key uniquely identifies an observation in its own table
One way to identify a primary key is as follows:
planes %>%
count(tailnum) %>%
filter(n>1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
This indicates that the tailnumber is unique
planes %>%
count(model) %>%
filter(n>1)
## # A tibble: 79 × 2
## model n
## <chr> <int>
## 1 717-200 88
## 2 737-301 2
## 3 737-3G7 2
## 4 737-3H4 105
## 5 737-3K2 2
## 6 737-3L9 2
## 7 737-3Q8 5
## 8 737-3TO 2
## 9 737-401 4
## 10 737-4B7 18
## # ℹ 69 more rows
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 × 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ℹ 336,766 more rows
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = 'carrier')
## # A tibble: 336,776 × 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ℹ 336,766 more rows
We’ve now added the airlines name to our dataframe from the airline dataframe
Other types of joins
library(tidyverse)
library(stringr)
You can create strings using single or double quotas
string1 <- "This is a string"
string2 <- 'to put a "quote" in you string, use the opposite'
string1
## [1] "This is a string"
string2
## [1] "to put a \"quote\" in you string, use the opposite"
If you forgot to close your string, you’ll get this:
string3 <- "where is this string going?"
string3
## [1] "where is this string going?"
Just hit escape and try again
Multiple strings are stored in character vectors
string4 <- c("one", "two", "three")
string4
## [1] "one" "two" "three"
Measuring string length
str_length(string3)
## [1] 27
str_length(string4)
## [1] 3 3 5
Lets combine two strings
str_c("X", "Y")
## [1] "XY"
str_c(string1, string2)
## [1] "This is a stringto put a \"quote\" in you string, use the opposite"
You can use sep to control how they are separated
str_c(string1, string2, sep = " ")
## [1] "This is a string to put a \"quote\" in you string, use the opposite"
str_c("X", "Y", "Z", sep = " ")
## [1] "X Y Z"
You can subset a string using str_sub()
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4,6)
## [1] "123" "234" "456"
This just drops the first four letters from the strings
Or you can use negatives to count back from the end
str_sub(HSP, -3, -1)
## [1] "123" "234" "456"
You can convert the cases of strings like follows:
HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_lower(HSP)
## [1] "hsp123" "hsp234" "hsp456"
str_to_upper()
install.packages("htmlwidgets")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')
str_view(x, "G")
## [1] │ ATTA<G>A
## [2] │ C<G>CCCCC<G><G>AT
str_view(x, "TA")
## [1] │ AT<TA>GA
## [3] │ <TA>T<TA>
The next step is, “.” where the “.” matches an entry
str_view(x, ".G.")
## [1] │ ATT<AGA>
## [2] │ <CGC>CCC<CGG>AT
Anchors allow you to match at the start or the ending
str_view(x, "^TA")
## [3] │ <TA>TTA
str_view(x, "TA$")
## [3] │ TAT<TA>
Character classes/alternatives
str_view (x, "TA[GT]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA
[^anc] matches anything but a, b or c
str_view( x, "TA[^T]")
## [1] │ AT<TAG>A
You can also use | to pick two alternatives
str_view(x, "TA|G|T")
## [1] │ A<T><TA><G>A
## [2] │ C<G>CCCCC<G><G>A<T>
## [3] │ <TA><T><TA>
str_detect() returns a logical error the same length of the input
y <- c("apple", "banana", "pear")
y
## [1] "apple" "banana" "pear"
str_detect(y, "e")
## [1] TRUE FALSE TRUE
How many common words contain letter e?
head(words, 30)
## [1] "a" "able" "about" "absolute" "accept" "account"
## [7] "achieve" "across" "act" "active" "actual" "add"
## [13] "address" "admit" "advertise" "affect" "afford" "after"
## [19] "afternoon" "again" "against" "age" "agent" "ago"
## [25] "agree" "air" "all" "allow" "almost" "along"
sum(str_detect(words, "e"))
## [1] 536
Lets get more complex, how many words end in a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306
mean(str_detect(words, "^[aeiou]"))
## [1] 0.1785714
Lets find all the words that don’t contain “o” or “u”
no_o <- !str_detect(words, "[ou]")
head(no_o, 30)
## [1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
## [13] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [25] TRUE TRUE TRUE FALSE FALSE FALSE
Now lets extract those words
words[!str_detect(words, "[ou]")]
## [1] "a" "able" "accept" "achieve" "act"
## [6] "active" "add" "address" "admit" "advertise"
## [11] "affect" "after" "again" "against" "age"
## [16] "agent" "agree" "air" "all" "already"
## [21] "alright" "always" "america" "and" "answer"
## [26] "any" "apart" "apparent" "appear" "apply"
## [31] "area" "arm" "arrange" "art" "as"
## [36] "ask" "at" "attend" "available" "aware"
## [41] "away" "baby" "back" "bad" "bag"
## [46] "balance" "ball" "bank" "bar" "base"
## [51] "basis" "be" "bear" "beat" "bed"
## [56] "begin" "behind" "believe" "benefit" "best"
## [61] "bet" "between" "big" "bill" "birth"
## [66] "bit" "black" "break" "brief" "brilliant"
## [71] "bring" "britain" "by" "cake" "call"
## [76] "can" "car" "card" "care" "carry"
## [81] "case" "cat" "catch" "cent" "centre"
## [86] "certain" "chair" "chairman" "chance" "change"
## [91] "chap" "character" "charge" "cheap" "check"
## [96] "child" "Christ" "Christmas" "city" "claim"
## [101] "class" "clean" "clear" "client" "create"
## [106] "dad" "danger" "date" "day" "dead"
## [111] "deal" "dear" "debate" "decide" "deep"
## [116] "definite" "degree" "department" "depend" "describe"
## [121] "design" "detail" "die" "difference" "dinner"
## [126] "direct" "district" "divide" "draw" "dress"
## [131] "drink" "drive" "dry" "each" "early"
## [136] "east" "easy" "eat" "effect" "egg"
## [141] "eight" "either" "elect" "electric" "eleven"
## [146] "else" "end" "engine" "english" "enter"
## [151] "especial" "even" "evening" "ever" "every"
## [156] "evidence" "exact" "example" "except" "exercise"
## [161] "exist" "expect" "expense" "experience" "explain"
## [166] "express" "extra" "eye" "face" "fact"
## [171] "fair" "fall" "family" "far" "farm"
## [176] "fast" "father" "feed" "feel" "few"
## [181] "field" "fight" "file" "fill" "film"
## [186] "final" "finance" "find" "fine" "finish"
## [191] "fire" "first" "fish" "fit" "five"
## [196] "flat" "fly" "france" "free" "friday"
## [201] "friend" "game" "garden" "gas" "general"
## [206] "germany" "get" "girl" "give" "glass"
## [211] "grand" "grant" "great" "green" "hair"
## [216] "half" "hall" "hand" "hang" "happen"
## [221] "happy" "hard" "hate" "have" "he"
## [226] "head" "health" "hear" "heart" "heat"
## [231] "heavy" "hell" "help" "here" "high"
## [236] "hit" "idea" "identify" "if" "imagine"
## [241] "in" "increase" "indeed" "inside" "instead"
## [246] "interest" "invest" "it" "item" "keep"
## [251] "key" "kid" "kill" "kind" "king"
## [256] "kitchen" "lad" "lady" "land" "large"
## [261] "last" "late" "law" "lay" "lead"
## [266] "learn" "leave" "left" "leg" "less"
## [271] "let" "letter" "level" "lie" "life"
## [276] "light" "like" "likely" "limit" "line"
## [281] "link" "list" "listen" "little" "live"
## [286] "machine" "main" "make" "man" "manage"
## [291] "many" "mark" "market" "marry" "match"
## [296] "matter" "may" "maybe" "mean" "meaning"
## [301] "meet" "member" "middle" "might" "mile"
## [306] "milk" "mind" "minister" "miss" "mister"
## [311] "mrs" "name" "near" "necessary" "need"
## [316] "never" "new" "news" "next" "nice"
## [321] "night" "nine" "pack" "page" "paint"
## [326] "pair" "paper" "paragraph" "parent" "park"
## [331] "part" "party" "pass" "past" "pay"
## [336] "pence" "per" "percent" "perfect" "perhaps"
## [341] "pick" "piece" "place" "plan" "play"
## [346] "please" "practise" "prepare" "present" "press"
## [351] "pretty" "price" "print" "private" "rail"
## [356] "raise" "range" "rate" "rather" "read"
## [361] "ready" "real" "realise" "really" "receive"
## [366] "recent" "red" "refer" "regard" "remember"
## [371] "represent" "research" "respect" "rest" "rid"
## [376] "right" "ring" "rise" "safe" "sale"
## [381] "same" "save" "say" "scheme" "science"
## [386] "seat" "secretary" "see" "seem" "self"
## [391] "sell" "send" "sense" "separate" "serve"
## [396] "service" "set" "settle" "seven" "sex"
## [401] "shall" "share" "she" "sheet" "sick"
## [406] "side" "sign" "similar" "simple" "since"
## [411] "sing" "single" "sir" "sister" "sit"
## [416] "site" "six" "size" "sleep" "slight"
## [421] "small" "space" "speak" "special" "specific"
## [426] "speed" "spell" "spend" "staff" "stage"
## [431] "stairs" "stand" "standard" "start" "state"
## [436] "stay" "step" "stick" "still" "straight"
## [441] "strategy" "street" "strike" "switch" "system"
## [446] "table" "take" "talk" "tape" "tax"
## [451] "tea" "teach" "team" "tell" "ten"
## [456] "tend" "term" "terrible" "test" "than"
## [461] "thank" "the" "then" "there" "they"
## [466] "thing" "think" "thirteen" "thirty" "this"
## [471] "three" "tie" "time" "trade" "traffic"
## [476] "train" "travel" "treat" "tree" "try"
## [481] "twelve" "twenty" "type" "very" "view"
## [486] "village" "visit" "wage" "wait" "walk"
## [491] "wall" "want" "war" "warm" "wash"
## [496] "waste" "watch" "water" "way" "we"
## [501] "wear" "wednesday" "wee" "week" "weigh"
## [506] "well" "west" "what" "when" "where"
## [511] "whether" "which" "while" "white" "why"
## [516] "wide" "wife" "will" "win" "wind"
## [521] "wish" "with" "within" "write" "year"
## [526] "yes" "yesterday" "yet"
You can also use str_count() to say how many matches there are in string
x
## [1] "ATTAGA" "CGCCCCCGGAT" "TATTA"
str_count(x, "[GC]")
## [1] 1 9 0
Lets couple this with mutate
df <- tibble(
word = words,
i = seq_along(word)
)
df
## # A tibble: 980 × 2
## word i
## <chr> <int>
## 1 a 1
## 2 able 2
## 3 about 3
## 4 absolute 4
## 5 accept 5
## 6 account 6
## 7 achieve 7
## 8 across 8
## 9 act 9
## 10 active 10
## # ℹ 970 more rows
df %>%
mutate(
vowels = str_count(words, "[aeiou]"),
constonants = str_count(words, "[^aeiou]")
)
## # A tibble: 980 × 4
## word i vowels constonants
## <chr> <int> <int> <int>
## 1 a 1 1 0
## 2 able 2 2 2
## 3 about 3 3 2
## 4 absolute 4 4 4
## 5 accept 5 2 4
## 6 account 6 3 4
## 7 achieve 7 4 3
## 8 across 8 2 4
## 9 act 9 1 2
## 10 active 10 3 3
## # ℹ 970 more rows
library(ath1121501cdf)
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
## loading 'ath1121501cdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
## loading 'ath1121501cdf'
##
library(ath1121501.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:lubridate':
##
## second, second<-
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:tidyr':
##
## expand
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:lubridate':
##
## %within%
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:purrr':
##
## reduce
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: org.At.tair.db
##
##
library(affy)
##
## Attaching package: 'affy'
## The following object is masked from 'package:lubridate':
##
## pm
library(affyio)
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
library(oligo)
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.64.0
##
## Attaching package: 'oligoClasses'
## The following object is masked from 'package:affy':
##
## list.celfiles
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'XVector'
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## ================================================================================
## Welcome to oligo version 1.66.0
## ================================================================================
##
## Attaching package: 'oligo'
## The following object is masked from 'package:limma':
##
## backgroundCorrect
## The following objects are masked from 'package:affy':
##
## intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
## probeNames, rma
## The following object is masked from 'package:lubridate':
##
## pm
## The following object is masked from 'package:dplyr':
##
## summarize
library(dplyr)
library(stats)
library(reshape)
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:S4Vectors':
##
## expand, rename
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
setwd(“~/Desktop/classroom/myfiles”)
Read the cel files into the directory
targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")
ab <- ReadAffy()
hist(ab)
eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
## sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL 1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL 2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL 3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL 4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL 5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL 6
Lets visualize the normalized data
hist(eset)
Lets characterize the data a bit
ID <- featureNames(eset)
Symbol <- getSymbol(ID,"ath1121501.db")
affydata <- read.delim("AFFY_ATH1_array_elements.txt")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : EOF within quoted string
Flight vs Ground
condition <- targets$gravity
design <- model.matrix(~factor(condition))
colnames(design) <- c("flight", "ground")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef =2, n=Inf, adjust='fdr')
Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
GO = sapply(contents(ath1121501GO), paste, collapse = ", "),
SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "),
ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
Lets merge all the data into one data frame
all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)
all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")
Lets convert the ACCNUM to a character Value
all2$ACCNUM <- as.character(all2$ACCNUM)
write.table(all2, file="BRIC_16_Finals.csv", row.names = TRUE, col.names = TRUE, sep ="\t")
columns(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
foldchanges <- as.data.frame(all2$logFC)
all2$entrez = mapIds(org.At.tair.db,
keys = all2$ACCNUM,
column = "ENTREZID",
keytype = "TAIR",
multivals = "first")
## 'select()' returned 1:1 mapping between keys and columns
head(all2, 10)
## Row.names GENE KEGG
## 1 244903_at hypothetical protein NA
## 2 244904_at hypothetical protein NA
## 3 244906_at hypothetical protein NA
## 4 244907_at hypothetical protein NA
## 5 244908_at hypothetical protein NA
## 6 244911_at hypothetical protein NA
## 7 244913_at hypothetical protein NA
## 8 244914_at hypothetical protein NA
## 9 244916_at hypothetical protein NA
## 10 244917_at hypothetical protein NA
## GO
## 1 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 2 list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 3 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005575", Evidence = "ND", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 7 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 8 list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005634", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"), list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"), list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## SYMBOL ACCNUM logFC AveExpr t P.Value adj.P.Val B
## 1 ORF149 ATMG00660 -1.214 7.8 -6.20 0.0004 0.018 0.5
## 2 ORF275 ATMG00670 -0.568 3.2 -2.07 0.0758 0.300 -4.9
## 3 ORF240A ATMG00690 0.055 9.1 0.18 0.8586 0.947 -6.8
## 4 ORF120 ATMG00710 -0.883 4.3 -2.91 0.0220 0.154 -3.7
## 5 ORF107D ATMG00720 -0.463 2.2 -1.73 0.1262 0.396 -5.4
## 6 ORF170 ATMG00820 -0.193 2.7 -0.72 0.4948 0.749 -6.5
## 7 ORF121B ATMG00840 -0.339 1.8 -1.18 0.2764 0.574 -6.1
## 8 ORF107E ATMG00850 -0.305 5.5 -1.30 0.2338 0.534 -6.0
## 9 ORF187 ATMG00880 -0.888 2.8 -2.32 0.0527 0.245 -4.6
## 10 ORF184 ATMG00870 -0.379 3.6 -1.31 0.2307 0.531 -5.9
## array_element_type organism is_control locus
## 1 oligonucleotide Arabidopsis thaliana no ATMG00660
## 2 oligonucleotide Arabidopsis thaliana no ATMG00670
## 3 oligonucleotide Arabidopsis thaliana no ATMG00690
## 4 oligonucleotide Arabidopsis thaliana no ATMG00710
## 5 oligonucleotide Arabidopsis thaliana no ATMG00720
## 6 oligonucleotide Arabidopsis thaliana no ATMG00820
## 7 oligonucleotide Arabidopsis thaliana no AT2G07626
## 8 oligonucleotide Arabidopsis thaliana no ATMG00850
## 9 oligonucleotide Arabidopsis thaliana no ATMG00880
## 10 oligonucleotide Arabidopsis thaliana no ATMG00870
## description
## 1 hypothetical protein;(source:Araport11)
## 2 transmembrane protein;(source:Araport11)
## 3 FO-ATPase subunit;(source:Araport11)
## 4 Polynucleotidyl transferase, ribonuclease H-like superfamily protein;(source:Araport11)
## 5 hypothetical protein;(source:Araport11)
## 6 Reverse transcriptase (RNA-dependent DNA polymerase);(source:Araport11)
## 7 hypothetical protein;(source:Araport11)
## 8 DNA/RNA polymerases superfamily protein;(source:Araport11)
## 9 hypothetical protein;(source:Araport11)
## 10 hypothetical protein;(source:Araport11)
## chromosome entrez
## 1 M <NA>
## 2 M <NA>
## 3 M <NA>
## 4 M <NA>
## 5 M <NA>
## 6 M <NA>
## 7 2 <NA>
## 8 M <NA>
## 9 M <NA>
## 10 M <NA>
library(pathview)
##
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
##
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
library(gage)
##
library(gageData)
data(kegg.sets.hs)
foldchanges = all2$logFC
names(foldchanges) = all2$entrez
head(foldchanges, 30)
## <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## -1.214 -0.568 0.055 -0.883 -0.463 -0.193 -0.339 -0.305 -0.888 -0.379 -0.207
## <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## -1.041 -1.044 -1.235 -0.195 -0.312 -0.787 -0.906 -1.054 -1.072 0.438 0.205
## <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## -0.119 0.512 0.091 -0.040 0.312 0.190 -0.569 -1.260
kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
Lets get the results
keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val
## ath03010 Ribosome 1.5e-14 8.2 1.5e-14
## ath01230 Biosynthesis of amino acids 2.9e-04 3.5 2.9e-04
## ath00040 Pentose and glucuronate interconversions 2.0e-03 3.0 2.0e-03
## ath00195 Photosynthesis 2.6e-03 3.0 2.6e-03
## ath00966 Glucosinolate biosynthesis 7.0e-03 2.7 7.0e-03
## ath01232 Nucleotide metabolism 9.2e-03 2.4 9.2e-03
## q.val set.size exp1
## ath03010 Ribosome 1.6e-12 129 1.5e-14
## ath01230 Biosynthesis of amino acids 1.5e-02 87 2.9e-04
## ath00040 Pentose and glucuronate interconversions 6.6e-02 49 2.0e-03
## ath00195 Photosynthesis 6.6e-02 19 2.6e-03
## ath00966 Glucosinolate biosynthesis 1.3e-01 12 7.0e-03
## ath01232 Nucleotide metabolism 1.3e-01 42 9.2e-03
##
## $less
## p.geomean stat.mean p.val q.val
## ath04120 Ubiquitin mediated proteolysis 0.043 -1.7 0.043 1
## ath04016 MAPK signaling pathway - plant 0.044 -1.7 0.044 1
## ath00592 alpha-Linolenic acid metabolism 0.045 -1.7 0.045 1
## ath03040 Spliceosome 0.075 -1.4 0.075 1
## ath00350 Tyrosine metabolism 0.093 -1.4 0.093 1
## ath00906 Carotenoid biosynthesis 0.116 -1.2 0.116 1
## set.size exp1
## ath04120 Ubiquitin mediated proteolysis 63 0.043
## ath04016 MAPK signaling pathway - plant 73 0.044
## ath00592 alpha-Linolenic acid metabolism 19 0.045
## ath03040 Spliceosome 77 0.075
## ath00350 Tyrosine metabolism 19 0.093
## ath00906 Carotenoid biosynthesis 18 0.116
##
## $stats
## stat.mean exp1
## ath03010 Ribosome 8.2 8.2
## ath01230 Biosynthesis of amino acids 3.5 3.5
## ath00040 Pentose and glucuronate interconversions 3.0 3.0
## ath00195 Photosynthesis 3.0 3.0
## ath00966 Glucosinolate biosynthesis 2.7 2.7
## ath01232 Nucleotide metabolism 2.4 2.4
greater <- keggres$greater
lessers <- keggres$less
write.csv(greater, file = "BRIC_16_pathview_Greater_Pathways.csv")
write.csv(lessers, file = "BRIC_16_pathview_Lesser_Pathways.csv")
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath03010 Ribosome"
## [2] "ath01230 Biosynthesis of amino acids"
## [3] "ath00040 Pentose and glucuronate interconversions"
## [4] "ath00195 Photosynthesis"
## [5] "ath00966 Glucosinolate biosynthesis"
keggresids_greater = substr(keggrespathways, start=1, stop=8)
keggresids_greater
## [1] "ath03010" "ath01230" "ath00040" "ath00195" "ath00966"
genedata <- as.character(all2$logFC)
Define a plotting function to apply next
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE )
How to plot multiple pathways (plots are saved to disk and returns a throwaway object list)
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath" ))
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath04120 Ubiquitin mediated proteolysis"
## [2] "ath04016 MAPK signaling pathway - plant"
## [3] "ath00592 alpha-Linolenic acid metabolism"
## [4] "ath03040 Spliceosome"
## [5] "ath00350 Tyrosine metabolism"
keggresids_less = substr(keggrespathways, start=1, stop=8)
keggresids_less
## [1] "ath04120" "ath04016" "ath00592" "ath03040" "ath00350"
genedata <- as.character(all2$logFC)
Define a plotting function to apply next
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath", new.signature = FALSE )
Plot multiple pathways (plots are saved to disk and returns a throwaway object list)
tmp = sapply(keggresids_less, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath" ))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04120.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04016.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00592.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath03040.pathview.png
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00350.pathview.png
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
##
rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_id")
group <- factor(c('1','1','1','1','1','1','2','2','2','2','2','2'))
dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
## ..@ .Data:List of 2
## .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
## .. .. .. ..$ : chr [1:12] "Mmus_C57.6J_KDN_FLT_Rep1_M23" "Mmus_C57.6J_KDN_FLT_Rep2_M24" "Mmus_C57.6J_KDN_FLT_Rep3_M25" "Mmus_C57.6J_KDN_FLT_Rep4_M26" ...
## .. ..$ :'data.frame': 12 obs. of 3 variables:
## .. .. ..$ group : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
## .. .. ..$ lib.size : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
## .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ names: chr [1:2] "counts" "samples"
str(group)
## Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
keep <-rowSums(cpm(dgeGlm)>2) >=4
dgeGlm <- dgeGlm[keep,]
names(dgeGlm)
## [1] "counts" "samples"
dgeGlm[["samples"]]
## group lib.size norm.factors
## Mmus_C57.6J_KDN_FLT_Rep1_M23 1 4.0e+07 1
## Mmus_C57.6J_KDN_FLT_Rep2_M24 1 4.1e+07 1
## Mmus_C57.6J_KDN_FLT_Rep3_M25 1 3.8e+07 1
## Mmus_C57.6J_KDN_FLT_Rep4_M26 1 3.9e+07 1
## Mmus_C57.6J_KDN_FLT_Rep5_M27 1 3.7e+07 1
## Mmus_C57.6J_KDN_FLT_Rep6_M28 1 3.6e+07 1
## Mmus_C57.6J_KDN_GC_Rep1_M33 2 3.7e+07 1
## Mmus_C57.6J_KDN_GC_Rep2_M34 2 3.7e+07 1
## Mmus_C57.6J_KDN_GC_Rep3_M35 2 4.0e+07 1
## Mmus_C57.6J_KDN_GC_Rep4_M36 2 3.6e+07 1
## Mmus_C57.6J_KDN_GC_Rep5_M37 2 3.8e+07 1
## Mmus_C57.6J_KDN_GC_Rep6_M38 2 3.5e+07 1
names(dgeGlm)
## [1] "counts" "samples"
design <- model.matrix(~group)
design
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmComDisp)
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef =2)
ttGlm <- topTags(lrt, n = Inf)
class(ttGlm)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))
## group2
## Down 64
## NotSig 13390
## Up 159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
ttGlm2 <- as.data.frame(ttGlm$table)
ttGlm2$symbol = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "SYMBOL",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$entrez = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "ENTREZID",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$name = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "GENENAME",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(ttGlm2)
## logFC logCPM LR PValue FDR symbol entrez
## ENSMUSG00000026077 -1.36 3.6 80 4.3e-19 5.9e-15 Npas2 18143
## ENSMUSG00000007872 0.89 5.5 77 1.9e-18 1.3e-14 Id3 15903
## ENSMUSG00000021775 0.95 6.2 63 2.0e-15 9.1e-12 Nr1d2 353187
## ENSMUSG00000002250 -0.83 5.3 62 2.7e-15 9.2e-12 Ppard 19015
## ENSMUSG00000059824 2.26 4.6 58 2.6e-14 7.2e-11 Dbp 13170
## ENSMUSG00000074715 -1.99 3.8 47 7.0e-12 1.6e-08 Ccl28 56838
## name
## ENSMUSG00000026077 neuronal PAS domain protein 2
## ENSMUSG00000007872 inhibitor of DNA binding 3
## ENSMUSG00000021775 nuclear receptor subfamily 1, group D, member 2
## ENSMUSG00000002250 peroxisome proliferator activator receptor delta
## ENSMUSG00000059824 D site albumin promoter binding protein
## ENSMUSG00000074715 C-C motif chemokine ligand 28
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]
# Lets get the results
keggres = gage(foldchanges, gsets=kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway 6.1e-03
## mmu04350 TGF-beta signaling pathway 1.3e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## p.val q.val
## mmu03010 Ribosome 9.5e-05 0.023
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.235
## mmu04330 Notch signaling pathway 6.1e-03 0.488
## mmu04350 TGF-beta signaling pathway 1.3e-02 0.783
## mmu04390 Hippo signaling pathway 2.0e-02 0.826
## mmu00830 Retinol metabolism 2.1e-02 0.826
## set.size
## mmu03010 Ribosome 127
## mmu04550 Signaling pathways regulating pluripotency of stem cells 100
## mmu04330 Notch signaling pathway 54
## mmu04350 TGF-beta signaling pathway 84
## mmu04390 Hippo signaling pathway 125
## mmu00830 Retinol metabolism 37
## exp1
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway 6.1e-03
## mmu04350 TGF-beta signaling pathway 1.3e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
##
## $less
## p.geomean stat.mean p.val
## mmu04613 Neutrophil extracellular trap formation 0.00012 -3.7 0.00012
## mmu04145 Phagosome 0.00192 -2.9 0.00192
## mmu04110 Cell cycle 0.00276 -2.8 0.00276
## mmu04714 Thermogenesis 0.00472 -2.6 0.00472
## mmu04217 Necroptosis 0.00614 -2.5 0.00614
## mmu00910 Nitrogen metabolism 0.00867 -2.6 0.00867
## q.val set.size exp1
## mmu04613 Neutrophil extracellular trap formation 0.029 137 0.00012
## mmu04145 Phagosome 0.221 121 0.00192
## mmu04110 Cell cycle 0.221 134 0.00276
## mmu04714 Thermogenesis 0.283 208 0.00472
## mmu04217 Necroptosis 0.295 113 0.00614
## mmu00910 Nitrogen metabolism 0.347 13 0.00867
##
## $stats
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## exp1
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
greaters <- keggres$greater
lessers <- keggres$less
keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number()<=5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu03010 Ribosome"
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04330 Notch signaling pathway"
## [4] "mmu04350 TGF-beta signaling pathway"
## [5] "mmu04390 Hippo signaling pathway"
keggresids = substr(keggrespathways, start=1, stop = 8)
keggresids
## [1] "mmu03010" "mmu04550" "mmu04330" "mmu04350" "mmu04390"
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id =pid, speices = "mouse",
new.signature = FALSE)
# Plot Multiple Pathways
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "mouse"))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03010.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04550.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04330.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04350.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04390.pathview.png
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following object is masked from 'package:Biostrings':
##
## width
## The following object is masked from 'package:XVector':
##
## width
## The following objects are masked from 'package:oligoClasses':
##
## B, B<-
## The following objects are masked from 'package:IRanges':
##
## resize, width
## The following object is masked from 'package:S4Vectors':
##
## width
## The following object is masked from 'package:Biobase':
##
## channel
## The following object is masked from 'package:BiocGenerics':
##
## width
## The following object is masked from 'package:stringr':
##
## boundary
## The following object is masked from 'package:dplyr':
##
## where
## The following object is masked from 'package:tidyr':
##
## fill
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image
filenames <- list.files(path = "/home/student/Desktop/classroom/myfiles", pattern = ".*pathview.png")
all_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames)
Lets Load the required libraries for this analysis
library("DESeq2")
## Loading required package: GenomicRanges
##
## Attaching package: 'GenomicRanges'
## The following object is masked from 'package:magrittr':
##
## subtract
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
##
## count
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
library("dplyr")
library("apeglm")
Lets load in our data
countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("PHENO_DATA_mouse.csv")
Now we need to addd leveling factors to the colData
colData$group <- factor(colData$group, levels =c("0", "1"))
Now lets make sure we have the same number of individuals as well as groups
all(rownames(colData)) %in% colnames(countData)
## Warning in all(rownames(colData)): coercing argument of type 'character' to
## logical
## [1] FALSE
We need to round up the counts, because DESeq2 only allows integers as an input, and not fractional numbers. This depends on the method of alignment that was used
countData %>%
mutate_if(is.numeric, ceiling)
countData[, 2:13] <- sapply(countData[, 2:13], as.integer)
row.names(countData) <- countData[,1]
countData <- countData[2:13]
row.names(colData) == colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- dds[rowSums(counts(dds)>2) >=4]
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
resLFC <- lfcShrink(dds, coef= 2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
(resOrdered <- res[order(res$padj), ])
## log2 fold change (MLE): group 1 vs 0
## Wald test p-value: group 1 vs 0
## DataFrame with 22008 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000002250 1459.223 -0.926292 0.1112628 -8.32526 8.41430e-17
## ENSMUSG00000007872 1719.375 0.818829 0.1122820 7.29261 3.04014e-13
## ENSMUSG00000026077 437.035 -1.191812 0.1655873 -7.19748 6.13338e-13
## ENSMUSG00000040998 14579.593 -0.506307 0.0703771 -7.19421 6.28252e-13
## ENSMUSG00000021775 2804.923 0.842511 0.1233312 6.83129 8.41546e-12
## ... ... ... ... ... ...
## ENSMUSG00000118345 4.22314 -0.12097478 0.599072 -0.20193699 0.839966
## ENSMUSG00000118353 6.60578 0.56456713 0.481195 1.17326031 0.240691
## ENSMUSG00000118358 3.30902 -0.00273584 0.763559 -0.00358301 0.997141
## ENSMUSG00000118369 2.91657 -1.11623145 0.790702 -1.41169702 0.158039
## ENSMUSG00000118384 7.43136 0.23830798 0.489273 0.48706567 0.626212
## padj
## <numeric>
## ENSMUSG00000002250 1.24077e-12
## ENSMUSG00000007872 2.24149e-09
## ENSMUSG00000026077 2.31605e-09
## ENSMUSG00000040998 2.31605e-09
## ENSMUSG00000021775 2.48189e-08
## ... ...
## ENSMUSG00000118345 NA
## ENSMUSG00000118353 NA
## ENSMUSG00000118358 NA
## ENSMUSG00000118369 NA
## ENSMUSG00000118384 NA
summary(res)
##
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 325, 1.5%
## LFC < 0 (down) : 327, 1.5%
## outliers [1] : 15, 0.068%
## low counts [2] : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_Vs_GC)
## res$log2FoldChange
## 1 0.0421
## 2 -0.1334
## 3 -0.0185
## 4 -0.0882
## 5 -0.0079
## 6 0.1136
plotMA(resLFC, ylim = c(-2,2))
pdf(file = "MA_Plot_FLT_vs_GC.pdf")
dev.off()
## png
## 2
Lets save our differential expression results to file.
write.csv(as.data.frame(resOrdered), file = "Mouse_DESeq.csv")
Lets perform a custom transformation
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalize= TRUE) +1), colData = colData(dds))
plotPCA(DESeqTransform(se), intgroup = "group")
## using ntop=500 top features by variance
Lets load our required packages
library(AnnotationDbi)
library(org.Mm.eg.db)
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
foldchanges <- as.data.frame(res$log2FoldChanges, row.names = row.names(res))
head(foldchanges)
## data frame with 0 columns and 0 rows
res$symbol = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVALS = "first")
## 'select()' returned 1:many mapping between keys and columns
res$entrez = mapIds = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "ENTREZID",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
res$name = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "GENENAME",
keytype = "ENSEMBL",
multivals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(res, 10)
## log2 fold change (MLE): group 1 vs 0
## Wald test p-value: group 1 vs 0
## DataFrame with 10 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 3132.35128 0.04214340 0.0436714 0.9650117 0.334539
## ENSMUSG00000000028 68.75801 -0.13342706 0.1565936 -0.8520597 0.394181
## ENSMUSG00000000031 21.05397 -0.01853142 0.2486477 -0.0745288 0.940590
## ENSMUSG00000000037 24.42314 -0.08817270 0.2982220 -0.2956613 0.767489
## ENSMUSG00000000049 3.24919 -0.00790342 0.9613572 -0.0082211 0.993441
## ENSMUSG00000000056 1424.88216 0.11355979 0.0777635 1.4603234 0.144201
## ENSMUSG00000000058 1420.78992 -0.03893850 0.0850602 -0.4577759 0.647113
## ENSMUSG00000000078 2254.53129 -0.07540275 0.0874314 -0.8624214 0.388456
## ENSMUSG00000000085 822.68179 0.04586772 0.0584699 0.7844667 0.432766
## ENSMUSG00000000088 5946.81754 -0.05461549 0.0691178 -0.7901795 0.429423
## padj symbol entrez name
## <numeric> <character> <character> <character>
## ENSMUSG00000000001 0.739637 Gnai3 14679 G protein subunit al..
## ENSMUSG00000000028 0.777085 Cdc45 12544 cell division cycle 45
## ENSMUSG00000000031 NA H19 14955 H19, imprinted mater..
## ENSMUSG00000000037 NA Scml2 107815 Scm polycomb group p..
## ENSMUSG00000000049 NA Apoh 11818 apolipoprotein H
## ENSMUSG00000000056 0.547527 Narf 67608 nuclear prelamin A r..
## ENSMUSG00000000058 0.901904 Cav2 12390 caveolin 2
## ENSMUSG00000000078 0.774294 Klf6 23849 Kruppel-like transcr..
## ENSMUSG00000000085 0.800018 Scmh1 29871 sex comb on midleg h..
## ENSMUSG00000000088 0.798420 Cox5a 12858 cytochrome c oxidase..
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
## 14679 12544 14955 107815 11818 67608
## 0.0421 -0.1334 -0.0185 -0.0882 -0.0079 0.1136
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]
Lets map the results
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## mmu03010 Ribosome 0.017 2.1 0.017 0.9
## mmu04022 cGMP-PKG signaling pathway 0.030 1.9 0.030 0.9
## mmu04360 Axon guidance 0.038 1.8 0.038 0.9
## mmu04330 Notch signaling pathway 0.042 1.8 0.042 0.9
## mmu04658 Th1 and Th2 cell differentiation 0.054 1.6 0.054 0.9
## mmu02010 ABC transporters 0.075 1.5 0.075 0.9
## set.size exp1
## mmu03010 Ribosome 139 0.017
## mmu04022 cGMP-PKG signaling pathway 152 0.030
## mmu04360 Axon guidance 176 0.038
## mmu04330 Notch signaling pathway 58 0.042
## mmu04658 Th1 and Th2 cell differentiation 78 0.054
## mmu02010 ABC transporters 46 0.075
##
## $less
## p.geomean stat.mean p.val
## mmu04613 Neutrophil extracellular trap formation 0.0043 -2.6 0.0043
## mmu04110 Cell cycle 0.0062 -2.5 0.0062
## mmu04657 IL-17 signaling pathway 0.0113 -2.3 0.0113
## mmu04145 Phagosome 0.0332 -1.8 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.0388 -1.8 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.0441 -1.7 0.0441
## q.val set.size exp1
## mmu04613 Neutrophil extracellular trap formation 0.75 160 0.0043
## mmu04110 Cell cycle 0.75 151 0.0062
## mmu04657 IL-17 signaling pathway 0.90 75 0.0113
## mmu04145 Phagosome 0.91 144 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.91 154 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.91 99 0.0441
##
## $stats
## stat.mean exp1
## mmu03010 Ribosome 2.1 2.1
## mmu04022 cGMP-PKG signaling pathway 1.9 1.9
## mmu04360 Axon guidance 1.8 1.8
## mmu04330 Notch signaling pathway 1.8 1.8
## mmu04658 Th1 and Th2 cell differentiation 1.6 1.6
## mmu02010 ABC transporters 1.5 1.5
Lets save our greater and less than pathways
greaters <- keggres$greater
lessers <- keggres$less
keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu03010 Ribosome" "mmu04022 cGMP-PKG signaling pathway"
## [3] "mmu04360 Axon guidance"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu03010" "mmu04022" "mmu04360"
PLOT!
plot_pathway = function(pid) pathview(gene.data = foldchange, pathway.id = pid, species = "mouse", new.signature_= FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu03010.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04022.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04360.pathview.png
keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04110 Cell cycle"
## [3] "mmu04657 IL-17 signaling pathway"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu04613" "mmu04110" "mmu04657"
plot_pathway = function(pid) pathview(gene.data = foldchange, pathway.id = pid, species = "mouse", new.signature_= FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04613.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04110.pathview.png
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04657.pathview.png
library(imager)
filenames <- list.files(path = "E/:Desktop/classroom/myfiles", pattern = ".pathview.png")
all_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames)
library(tidyverse)
EdgeR <- read.csv("Mouse_EdgeR_Results_Zero_vs_1.csv")
DESeq <- read.csv("Mouse_DESeq.csv")
DESeq2 <- DESeq %>%
filter(padj < 0.1)
DESeq2 <- DESeq2[,c(1,3)]
EdgeR <- EdgeR[,1:2]
colnames(DESeq2) <- c("ID", "logFC")
colnames(EdgeR) <- c("ID", "logFC")
install.packages("GOplot")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(GOplot)
## Loading required package: ggdendro
##
## Attaching package: 'ggdendro'
## The following object is masked from 'package:imager':
##
## label
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:BiocGenerics':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## Loading required package: RColorBrewer
comp <- GOVenn(DESeq2, EdgeR, label = c("DESeq2", "EdgeR"), title = "Comparison of DESeq and EdgeR DE Genes", plot = FALSE)
comp$plot
head(comp$table, 30)
## $A_only
## logFC Trend
## ENSMUSG00000022580 -0.340 DOWN
## ENSMUSG00000006517 -0.361 DOWN
## ENSMUSG00000038375 -0.311 DOWN
## ENSMUSG00000032374 -0.331 DOWN
## ENSMUSG00000014158 -0.245 DOWN
## ENSMUSG00000026179 -0.277 DOWN
## ENSMUSG00000037455 -0.319 DOWN
## ENSMUSG00000020232 -0.214 DOWN
## ENSMUSG00000023367 -0.222 DOWN
## ENSMUSG00000079465 -0.348 DOWN
## ENSMUSG00000032911 -0.253 DOWN
## ENSMUSG00000028671 -0.529 DOWN
## ENSMUSG00000022763 -0.463 DOWN
## ENSMUSG00000069565 -0.233 DOWN
## ENSMUSG00000029763 -0.322 DOWN
## ENSMUSG00000040263 -0.267 DOWN
## ENSMUSG00000039047 -0.200 DOWN
## ENSMUSG00000023938 -0.298 DOWN
## ENSMUSG00000048707 -0.256 DOWN
## ENSMUSG00000024958 -0.216 DOWN
## ENSMUSG00000022351 -0.374 DOWN
## ENSMUSG00000092500 -0.678 DOWN
## ENSMUSG00000031594 -0.705 DOWN
## ENSMUSG00000053414 -0.423 DOWN
## ENSMUSG00000035910 -0.465 DOWN
## ENSMUSG00000046079 -0.230 DOWN
## ENSMUSG00000039834 -0.209 DOWN
## ENSMUSG00000024064 -0.322 DOWN
## ENSMUSG00000017428 -0.192 DOWN
## ENSMUSG00000040188 -0.182 DOWN
## ENSMUSG00000026399 -0.463 DOWN
## ENSMUSG00000000594 -0.353 DOWN
## ENSMUSG00000001473 -0.520 DOWN
## ENSMUSG00000029032 -0.173 DOWN
## ENSMUSG00000034457 -0.656 DOWN
## ENSMUSG00000038704 -0.442 DOWN
## ENSMUSG00000004565 -0.185 DOWN
## ENSMUSG00000020381 -0.355 DOWN
## ENSMUSG00000041889 -0.407 DOWN
## ENSMUSG00000023259 -0.549 DOWN
## ENSMUSG00000100131 -0.745 DOWN
## ENSMUSG00000062203 -0.169 DOWN
## ENSMUSG00000025735 -0.547 DOWN
## ENSMUSG00000028545 -0.349 DOWN
## ENSMUSG00000058454 -0.322 DOWN
## ENSMUSG00000066441 -0.206 DOWN
## ENSMUSG00000022540 -0.222 DOWN
## ENSMUSG00000041939 -0.369 DOWN
## ENSMUSG00000035561 -0.629 DOWN
## ENSMUSG00000038023 -0.162 DOWN
## ENSMUSG00000101249 -0.424 DOWN
## ENSMUSG00000032370 -0.277 DOWN
## ENSMUSG00000004266 -0.241 DOWN
## ENSMUSG00000001627 -0.368 DOWN
## ENSMUSG00000029810 -0.212 DOWN
## ENSMUSG00000024579 -0.339 DOWN
## ENSMUSG00000030739 -0.257 DOWN
## ENSMUSG00000011382 -0.289 DOWN
## ENSMUSG00000032306 -0.197 DOWN
## ENSMUSG00000031960 -0.232 DOWN
## ENSMUSG00000022512 -0.313 DOWN
## ENSMUSG00000037243 -0.384 DOWN
## ENSMUSG00000028041 -0.228 DOWN
## ENSMUSG00000026307 -0.180 DOWN
## ENSMUSG00000029580 -0.225 DOWN
## ENSMUSG00000033107 -0.465 DOWN
## ENSMUSG00000112324 -1.178 DOWN
## ENSMUSG00000020744 -0.239 DOWN
## ENSMUSG00000055745 -0.330 DOWN
## ENSMUSG00000041959 -0.302 DOWN
## ENSMUSG00000036957 -0.236 DOWN
## ENSMUSG00000021697 -0.244 DOWN
## ENSMUSG00000037894 -0.149 DOWN
## ENSMUSG00000021792 -0.271 DOWN
## ENSMUSG00000037347 -0.583 DOWN
## ENSMUSG00000020877 -0.305 DOWN
## ENSMUSG00000087574 -0.525 DOWN
## ENSMUSG00000030512 -0.297 DOWN
## ENSMUSG00000006800 -0.384 DOWN
## ENSMUSG00000027412 -0.238 DOWN
## ENSMUSG00000031161 -0.193 DOWN
## ENSMUSG00000027429 -0.197 DOWN
## ENSMUSG00000045594 -0.322 DOWN
## ENSMUSG00000018171 -0.210 DOWN
## ENSMUSG00000032349 -0.205 DOWN
## ENSMUSG00000015094 -0.274 DOWN
## ENSMUSG00000028538 -0.214 DOWN
## ENSMUSG00000029616 -0.170 DOWN
## ENSMUSG00000019797 -0.260 DOWN
## ENSMUSG00000095193 -0.465 DOWN
## ENSMUSG00000068876 -0.369 DOWN
## ENSMUSG00000026784 -0.337 DOWN
## ENSMUSG00000057363 -0.230 DOWN
## ENSMUSG00000064367 -0.397 DOWN
## ENSMUSG00000024319 -0.161 DOWN
## ENSMUSG00000030137 -0.733 DOWN
## ENSMUSG00000030538 -0.210 DOWN
## ENSMUSG00000032978 -0.412 DOWN
## ENSMUSG00000069633 -0.241 DOWN
## ENSMUSG00000067158 -0.167 DOWN
## ENSMUSG00000024292 -0.697 DOWN
## ENSMUSG00000031958 -0.326 DOWN
## ENSMUSG00000064341 -0.427 DOWN
## ENSMUSG00000024993 -0.240 DOWN
## ENSMUSG00000074170 -0.323 DOWN
## ENSMUSG00000030641 -0.708 DOWN
## ENSMUSG00000034858 -0.241 DOWN
## ENSMUSG00000036040 -0.449 DOWN
## ENSMUSG00000063229 -0.274 DOWN
## ENSMUSG00000027490 -0.501 DOWN
## ENSMUSG00000047735 -0.296 DOWN
## ENSMUSG00000011752 -0.216 DOWN
## ENSMUSG00000034714 -0.251 DOWN
## ENSMUSG00000034613 -0.198 DOWN
## ENSMUSG00000032452 -0.366 DOWN
## ENSMUSG00000038775 -0.428 DOWN
## ENSMUSG00000045294 -0.410 DOWN
## ENSMUSG00000084128 -0.205 DOWN
## ENSMUSG00000037686 -0.488 DOWN
## ENSMUSG00000014245 -0.343 DOWN
## ENSMUSG00000029093 -0.702 DOWN
## ENSMUSG00000024736 -0.295 DOWN
## ENSMUSG00000028937 -0.460 DOWN
## ENSMUSG00000096795 -0.429 DOWN
## ENSMUSG00000051518 -0.238 DOWN
## ENSMUSG00000000934 -0.210 DOWN
## ENSMUSG00000030880 -0.374 DOWN
## ENSMUSG00000025726 -0.471 DOWN
## ENSMUSG00000052117 -0.383 DOWN
## ENSMUSG00000006342 -0.322 DOWN
## ENSMUSG00000062825 -0.214 DOWN
## ENSMUSG00000041733 -0.164 DOWN
## ENSMUSG00000028780 -0.390 DOWN
## ENSMUSG00000024665 -0.283 DOWN
## ENSMUSG00000025317 -0.452 DOWN
## ENSMUSG00000020142 -0.513 DOWN
## ENSMUSG00000082016 -0.500 DOWN
## ENSMUSG00000034371 -0.356 DOWN
## ENSMUSG00000032492 -0.239 DOWN
## ENSMUSG00000110755 -0.855 DOWN
## ENSMUSG00000064254 -0.353 DOWN
## ENSMUSG00000035845 -0.230 DOWN
## ENSMUSG00000017210 -0.184 DOWN
## ENSMUSG00000023832 -0.242 DOWN
## ENSMUSG00000037999 -0.136 DOWN
## ENSMUSG00000068220 -0.327 DOWN
## ENSMUSG00000064345 -0.470 DOWN
## ENSMUSG00000109532 -0.468 DOWN
## ENSMUSG00000024503 -0.299 DOWN
## ENSMUSG00000004843 -0.193 DOWN
## ENSMUSG00000030298 -0.187 DOWN
## ENSMUSG00000048578 -0.185 DOWN
## ENSMUSG00000085042 -0.537 DOWN
## ENSMUSG00000016942 -0.573 DOWN
## ENSMUSG00000020116 -0.221 DOWN
## ENSMUSG00000028989 -0.830 DOWN
## ENSMUSG00000045328 -0.675 DOWN
## ENSMUSG00000022012 -0.602 DOWN
## ENSMUSG00000030111 -0.899 DOWN
## ENSMUSG00000070283 -0.218 DOWN
## ENSMUSG00000015357 -0.250 DOWN
## ENSMUSG00000023004 -0.286 DOWN
## ENSMUSG00000026627 -0.314 DOWN
## ENSMUSG00000031604 -0.471 DOWN
## ENSMUSG00000032515 -0.308 DOWN
## ENSMUSG00000067924 -0.224 DOWN
## ENSMUSG00000037263 -0.843 DOWN
## ENSMUSG00000027230 -0.362 DOWN
## ENSMUSG00000110663 -0.548 DOWN
## ENSMUSG00000026042 -0.492 DOWN
## ENSMUSG00000024818 -0.340 DOWN
## ENSMUSG00000028063 -0.176 DOWN
## ENSMUSG00000002393 -0.151 DOWN
## ENSMUSG00000029304 -0.304 DOWN
## ENSMUSG00000063897 -0.232 DOWN
## ENSMUSG00000047281 -0.292 DOWN
## ENSMUSG00000020827 -0.092 DOWN
## ENSMUSG00000057103 -0.272 DOWN
## ENSMUSG00000035863 -0.136 DOWN
## ENSMUSG00000038217 -0.427 DOWN
## ENSMUSG00000021414 -0.467 DOWN
## ENSMUSG00000029394 -0.155 DOWN
## ENSMUSG00000025203 -0.440 DOWN
## ENSMUSG00000022021 -0.747 DOWN
## ENSMUSG00000021666 -0.172 DOWN
## ENSMUSG00000023153 -0.476 DOWN
## ENSMUSG00000032269 -0.487 DOWN
## ENSMUSG00000029701 -0.155 DOWN
## ENSMUSG00000044786 -0.441 DOWN
## ENSMUSG00000037470 -0.128 DOWN
## ENSMUSG00000003813 -0.177 DOWN
## ENSMUSG00000073405 -0.630 DOWN
## ENSMUSG00000022893 -0.270 DOWN
## ENSMUSG00000040694 -0.480 DOWN
## ENSMUSG00000078695 -0.226 DOWN
## ENSMUSG00000079261 -0.517 DOWN
## ENSMUSG00000022033 -0.717 DOWN
## ENSMUSG00000037169 -0.428 DOWN
## ENSMUSG00000031781 -0.220 DOWN
## ENSMUSG00000033538 -0.611 DOWN
## ENSMUSG00000058569 -0.131 DOWN
## ENSMUSG00000024683 -0.174 DOWN
## ENSMUSG00000027327 -0.296 DOWN
## ENSMUSG00000046999 -0.821 DOWN
## ENSMUSG00000036002 -0.179 DOWN
## ENSMUSG00000048277 -0.161 DOWN
## ENSMUSG00000039254 -0.171 DOWN
## ENSMUSG00000074890 -0.265 DOWN
## ENSMUSG00000020898 -0.176 DOWN
## ENSMUSG00000045725 -0.348 DOWN
## ENSMUSG00000069302 -0.826 DOWN
## ENSMUSG00000020917 -0.414 DOWN
## ENSMUSG00000025762 -0.150 DOWN
## ENSMUSG00000051397 -0.602 DOWN
## ENSMUSG00000071076 -0.329 DOWN
## ENSMUSG00000020062 -0.557 DOWN
## ENSMUSG00000027610 -0.208 DOWN
## ENSMUSG00000046603 -0.150 DOWN
## ENSMUSG00000026249 -0.398 DOWN
## ENSMUSG00000062248 -0.403 DOWN
## ENSMUSG00000067144 -0.584 DOWN
## ENSMUSG00000018427 -0.298 DOWN
## ENSMUSG00000029161 -0.316 DOWN
## ENSMUSG00000054951 -0.420 DOWN
## ENSMUSG00000026201 -0.168 DOWN
## ENSMUSG00000028001 -0.429 DOWN
## ENSMUSG00000055725 -0.339 DOWN
## ENSMUSG00000036452 -0.209 DOWN
## ENSMUSG00000038527 -0.278 DOWN
## ENSMUSG00000028042 -0.182 DOWN
## ENSMUSG00000056608 -0.362 DOWN
## ENSMUSG00000005667 -0.619 DOWN
## ENSMUSG00000037386 -0.407 DOWN
## ENSMUSG00000037725 -0.819 DOWN
## ENSMUSG00000047986 -0.327 DOWN
## ENSMUSG00000029771 -0.386 DOWN
## ENSMUSG00000032902 -0.548 DOWN
## ENSMUSG00000030805 -0.154 DOWN
## ENSMUSG00000102070 -0.342 DOWN
## ENSMUSG00000048911 -0.636 DOWN
## ENSMUSG00000020307 -0.208 DOWN
## ENSMUSG00000028327 -0.429 DOWN
## ENSMUSG00000061474 -0.216 DOWN
## ENSMUSG00000031776 -0.133 DOWN
## ENSMUSG00000023048 -0.185 DOWN
## ENSMUSG00000028834 -0.453 DOWN
## ENSMUSG00000032786 -0.342 DOWN
## ENSMUSG00000020571 -0.189 DOWN
## ENSMUSG00000026036 -0.183 DOWN
## ENSMUSG00000037103 -0.234 DOWN
## ENSMUSG00000032806 -0.215 DOWN
## ENSMUSG00000032115 -0.226 DOWN
## ENSMUSG00000026914 -0.188 DOWN
## ENSMUSG00000017221 -0.154 DOWN
## ENSMUSG00000093930 -0.256 DOWN
## ENSMUSG00000020775 -0.193 DOWN
## ENSMUSG00000015085 -0.277 DOWN
## ENSMUSG00000007783 -0.681 DOWN
## ENSMUSG00000040688 -0.163 DOWN
## ENSMUSG00000053291 -0.219 DOWN
## ENSMUSG00000027496 -0.549 DOWN
## ENSMUSG00000033031 -0.563 DOWN
## ENSMUSG00000030122 -0.203 DOWN
## ENSMUSG00000030340 -0.153 DOWN
## ENSMUSG00000045438 -0.171 DOWN
## ENSMUSG00000083287 -0.702 DOWN
## ENSMUSG00000017776 0.143 UP
## ENSMUSG00000050310 0.157 UP
## ENSMUSG00000039068 0.166 UP
## ENSMUSG00000021540 0.170 UP
## ENSMUSG00000031393 0.166 UP
## ENSMUSG00000036550 0.122 UP
## ENSMUSG00000020257 0.165 UP
## ENSMUSG00000022604 0.245 UP
## ENSMUSG00000028053 0.129 UP
## ENSMUSG00000030213 0.228 UP
## ENSMUSG00000060657 0.171 UP
## ENSMUSG00000002266 0.798 UP
## ENSMUSG00000027351 0.201 UP
## ENSMUSG00000041530 0.142 UP
## ENSMUSG00000031216 0.185 UP
## ENSMUSG00000025223 0.199 UP
## ENSMUSG00000032086 0.160 UP
## ENSMUSG00000050812 0.123 UP
## ENSMUSG00000038290 0.181 UP
## ENSMUSG00000038530 0.458 UP
## ENSMUSG00000038766 0.194 UP
## ENSMUSG00000050947 0.195 UP
## ENSMUSG00000045098 0.167 UP
## ENSMUSG00000026918 0.151 UP
## ENSMUSG00000037003 0.254 UP
## ENSMUSG00000074748 0.135 UP
## ENSMUSG00000097412 0.362 UP
## ENSMUSG00000031729 0.136 UP
## ENSMUSG00000060419 0.567 UP
## ENSMUSG00000027519 0.127 UP
## ENSMUSG00000021669 0.173 UP
## ENSMUSG00000051817 0.205 UP
## ENSMUSG00000043090 0.246 UP
## ENSMUSG00000037029 0.161 UP
## ENSMUSG00000027395 0.233 UP
## ENSMUSG00000021488 0.140 UP
## ENSMUSG00000073678 0.238 UP
## ENSMUSG00000041378 0.247 UP
## ENSMUSG00000046947 0.230 UP
## ENSMUSG00000058793 0.158 UP
## ENSMUSG00000037369 0.169 UP
## ENSMUSG00000035247 0.144 UP
## ENSMUSG00000026436 0.151 UP
## ENSMUSG00000046897 0.138 UP
## ENSMUSG00000057133 0.152 UP
## ENSMUSG00000027680 0.155 UP
## ENSMUSG00000043991 0.143 UP
## ENSMUSG00000034189 0.223 UP
## ENSMUSG00000018076 0.140 UP
## ENSMUSG00000052446 0.213 UP
## ENSMUSG00000000901 0.460 UP
## ENSMUSG00000040209 0.149 UP
## ENSMUSG00000021140 0.140 UP
## ENSMUSG00000015942 0.281 UP
## ENSMUSG00000043241 0.154 UP
## ENSMUSG00000017897 0.479 UP
## ENSMUSG00000022353 0.170 UP
## ENSMUSG00000005893 0.127 UP
## ENSMUSG00000044791 0.108 UP
## ENSMUSG00000034156 0.304 UP
## ENSMUSG00000037736 0.156 UP
## ENSMUSG00000059486 0.191 UP
## ENSMUSG00000040865 0.146 UP
## ENSMUSG00000035666 0.183 UP
## ENSMUSG00000029647 0.151 UP
## ENSMUSG00000038486 0.223 UP
## ENSMUSG00000025927 0.260 UP
## ENSMUSG00000022415 0.260 UP
## ENSMUSG00000092558 0.233 UP
## ENSMUSG00000014195 0.132 UP
## ENSMUSG00000019866 0.168 UP
## ENSMUSG00000078202 0.238 UP
## ENSMUSG00000035495 0.162 UP
## ENSMUSG00000044674 0.147 UP
## ENSMUSG00000102869 0.133 UP
## ENSMUSG00000026923 0.238 UP
## ENSMUSG00000020642 0.304 UP
## ENSMUSG00000037503 0.132 UP
## ENSMUSG00000043716 0.155 UP
## ENSMUSG00000037822 0.123 UP
## ENSMUSG00000035413 0.280 UP
## ENSMUSG00000021395 0.145 UP
## ENSMUSG00000036097 0.170 UP
## ENSMUSG00000034297 0.142 UP
## ENSMUSG00000037742 0.183 UP
## ENSMUSG00000031841 0.367 UP
## ENSMUSG00000021661 0.236 UP
## ENSMUSG00000021959 0.201 UP
## ENSMUSG00000020594 0.133 UP
## ENSMUSG00000027079 0.245 UP
## ENSMUSG00000087150 0.369 UP
## ENSMUSG00000091811 0.293 UP
## ENSMUSG00000066415 0.147 UP
## ENSMUSG00000027524 0.334 UP
## ENSMUSG00000008683 0.179 UP
## ENSMUSG00000066235 0.238 UP
## ENSMUSG00000099689 0.326 UP
## ENSMUSG00000074994 0.154 UP
## ENSMUSG00000005698 0.136 UP
## ENSMUSG00000020357 0.220 UP
## ENSMUSG00000051864 0.167 UP
## ENSMUSG00000033857 0.262 UP
## ENSMUSG00000027678 0.141 UP
## ENSMUSG00000028522 0.191 UP
## ENSMUSG00000038677 0.256 UP
## ENSMUSG00000031365 0.243 UP
## ENSMUSG00000079215 0.136 UP
## ENSMUSG00000008435 0.196 UP
## ENSMUSG00000022791 0.185 UP
## ENSMUSG00000035284 0.150 UP
## ENSMUSG00000071660 0.149 UP
## ENSMUSG00000031209 0.180 UP
## ENSMUSG00000043384 0.177 UP
## ENSMUSG00000062519 0.203 UP
## ENSMUSG00000046792 0.154 UP
## ENSMUSG00000060036 0.183 UP
## ENSMUSG00000027799 0.128 UP
## ENSMUSG00000070280 1.104 UP
## ENSMUSG00000039086 0.302 UP
## ENSMUSG00000010721 0.279 UP
## ENSMUSG00000026694 0.224 UP
## ENSMUSG00000090958 0.260 UP
## ENSMUSG00000041329 0.340 UP
## ENSMUSG00000110353 0.559 UP
## ENSMUSG00000024431 0.155 UP
## ENSMUSG00000028970 0.653 UP
## ENSMUSG00000044617 0.169 UP
## ENSMUSG00000053110 0.113 UP
## ENSMUSG00000033721 0.194 UP
## ENSMUSG00000033237 0.125 UP
## ENSMUSG00000026782 0.154 UP
## ENSMUSG00000020716 0.107 UP
## ENSMUSG00000028869 0.157 UP
## ENSMUSG00000040123 0.176 UP
## ENSMUSG00000023279 0.660 UP
## ENSMUSG00000031618 0.182 UP
## ENSMUSG00000046480 0.153 UP
## ENSMUSG00000053604 0.222 UP
## ENSMUSG00000019817 0.216 UP
## ENSMUSG00000063317 0.192 UP
## ENSMUSG00000027782 0.169 UP
## ENSMUSG00000003970 0.175 UP
## ENSMUSG00000020448 0.156 UP
## ENSMUSG00000075592 0.209 UP
## ENSMUSG00000005886 0.114 UP
## ENSMUSG00000097119 0.186 UP
## ENSMUSG00000049739 0.148 UP
## ENSMUSG00000033671 0.132 UP
## ENSMUSG00000036197 0.155 UP
## ENSMUSG00000048379 0.183 UP
## ENSMUSG00000039315 0.549 UP
## ENSMUSG00000033943 0.124 UP
## ENSMUSG00000017418 0.178 UP
## ENSMUSG00000041037 0.153 UP
## ENSMUSG00000048696 0.292 UP
## ENSMUSG00000039801 0.137 UP
## ENSMUSG00000060938 0.154 UP
## ENSMUSG00000049800 0.175 UP
## ENSMUSG00000013921 0.359 UP
## ENSMUSG00000031169 0.321 UP
## ENSMUSG00000106062 0.642 UP
## ENSMUSG00000017405 0.225 UP
## ENSMUSG00000034647 0.187 UP
## ENSMUSG00000095597 0.398 UP
## ENSMUSG00000048170 0.119 UP
## ENSMUSG00000038437 0.144 UP
## ENSMUSG00000043929 0.200 UP
## ENSMUSG00000052751 0.127 UP
## ENSMUSG00000038481 0.163 UP
## ENSMUSG00000090083 0.190 UP
## ENSMUSG00000034893 0.133 UP
## ENSMUSG00000069682 0.295 UP
## ENSMUSG00000025571 0.170 UP
## ENSMUSG00000040481 0.137 UP
## ENSMUSG00000009470 0.137 UP
## ENSMUSG00000038056 0.107 UP
## ENSMUSG00000020491 0.286 UP
## ENSMUSG00000038544 0.226 UP
## ENSMUSG00000015290 0.124 UP
## ENSMUSG00000017667 0.232 UP
## ENSMUSG00000037712 0.152 UP
## ENSMUSG00000031754 0.152 UP
## ENSMUSG00000039377 0.409 UP
## ENSMUSG00000065954 0.130 UP
## ENSMUSG00000021180 0.252 UP
## ENSMUSG00000021144 0.123 UP
## ENSMUSG00000032892 0.266 UP
## ENSMUSG00000047632 0.361 UP
## ENSMUSG00000036990 0.138 UP
## ENSMUSG00000025997 0.165 UP
## ENSMUSG00000025195 0.133 UP
## ENSMUSG00000018736 0.156 UP
## ENSMUSG00000002748 0.160 UP
## ENSMUSG00000062866 0.139 UP
## ENSMUSG00000047888 0.157 UP
## ENSMUSG00000053754 0.086 UP
## ENSMUSG00000105692 0.424 UP
## ENSMUSG00000029154 0.224 UP
## ENSMUSG00000062328 0.135 UP
## ENSMUSG00000020460 0.134 UP
## ENSMUSG00000026455 0.140 UP
## ENSMUSG00000028936 0.188 UP
## ENSMUSG00000048285 0.324 UP
##
## $B_only
## logFC Trend
## ENSMUSG00000095616 -2.18 DOWN
## ENSMUSG00000055254 -1.26 DOWN
## ENSMUSG00000036083 -0.42 DOWN
## ENSMUSG00000062901 0.37 UP
## ENSMUSG00000066113 0.54 UP
## ENSMUSG00000047861 0.48 UP
## ENSMUSG00000068270 0.37 UP
## ENSMUSG00000026315 0.48 UP
## ENSMUSG00000010651 0.37 UP
## ENSMUSG00000057914 0.65 UP
## ENSMUSG00000038567 0.79 UP
## ENSMUSG00000021379 0.43 UP
## ENSMUSG00000038528 0.28 UP
## ENSMUSG00000021876 1.02 UP
## ENSMUSG00000020566 0.38 UP
## ENSMUSG00000002289 0.96 UP
## ENSMUSG00000025880 0.33 UP
## ENSMUSG00000032898 0.30 UP
## ENSMUSG00000051344 0.40 UP
## ENSMUSG00000028234 0.28 UP
## ENSMUSG00000059173 0.31 UP
## ENSMUSG00000061410 0.26 UP
## ENSMUSG00000104445 0.29 UP
## ENSMUSG00000029004 0.26 UP
## ENSMUSG00000006599 0.30 UP
## ENSMUSG00000031530 0.46 UP
## ENSMUSG00000075520 0.32 UP
## ENSMUSG00000032604 0.27 UP
## ENSMUSG00000013089 0.48 UP
## ENSMUSG00000037058 0.25 UP
## ENSMUSG00000090165 0.72 UP
## ENSMUSG00000078429 0.25 UP
## ENSMUSG00000032554 0.52 UP
## ENSMUSG00000028266 0.28 UP
## ENSMUSG00000035504 0.44 UP
## ENSMUSG00000042379 0.66 UP
## ENSMUSG00000063415 0.58 UP
## ENSMUSG00000074179 0.55 UP
## ENSMUSG00000006494 0.29 UP
## ENSMUSG00000024472 0.31 UP
##
## $AB
## logFC_A logFC_B Trend
## ENSMUSG00000000253 -0.36 -0.29 DOWN
## ENSMUSG00000002250 -0.93 -0.83 DOWN
## ENSMUSG00000002797 -0.53 -0.45 DOWN
## ENSMUSG00000010663 -0.38 -0.31 DOWN
## ENSMUSG00000019944 -0.43 -0.50 DOWN
## ENSMUSG00000020326 -0.35 -0.28 DOWN
## ENSMUSG00000020538 -0.34 -0.41 DOWN
## ENSMUSG00000021135 -1.24 -1.17 DOWN
## ENSMUSG00000021185 -0.37 -0.31 DOWN
## ENSMUSG00000021214 -1.29 -1.23 DOWN
## ENSMUSG00000021364 -0.35 -0.29 DOWN
## ENSMUSG00000021670 -0.52 -0.51 DOWN
## ENSMUSG00000022797 -0.76 -0.61 DOWN
## ENSMUSG00000023067 -0.92 -0.86 DOWN
## ENSMUSG00000023120 -0.66 -0.74 DOWN
## ENSMUSG00000024772 -0.21 -0.26 DOWN
## ENSMUSG00000024866 -0.46 -0.40 DOWN
## ENSMUSG00000025185 -0.78 -0.72 DOWN
## ENSMUSG00000026077 -1.19 -1.36 DOWN
## ENSMUSG00000026188 -0.97 -0.86 DOWN
## ENSMUSG00000026189 -0.49 -0.42 DOWN
## ENSMUSG00000026202 -0.48 -0.41 DOWN
## ENSMUSG00000026827 -0.58 -0.53 DOWN
## ENSMUSG00000027111 -0.47 -0.39 DOWN
## ENSMUSG00000028357 -0.37 -0.32 DOWN
## ENSMUSG00000028919 -0.42 -0.39 DOWN
## ENSMUSG00000029361 -0.40 -0.61 DOWN
## ENSMUSG00000029752 -0.83 -0.74 DOWN
## ENSMUSG00000031283 -0.82 -0.81 DOWN
## ENSMUSG00000031349 -0.39 -0.31 DOWN
## ENSMUSG00000031725 -0.63 -0.59 DOWN
## ENSMUSG00000031994 -1.33 -1.26 DOWN
## ENSMUSG00000032420 -0.39 -0.32 DOWN
## ENSMUSG00000032758 -1.47 -1.41 DOWN
## ENSMUSG00000036752 -0.49 -0.42 DOWN
## ENSMUSG00000038224 -0.77 -0.69 DOWN
## ENSMUSG00000039062 -0.37 -0.32 DOWN
## ENSMUSG00000040998 -0.51 -0.43 DOWN
## ENSMUSG00000041605 -0.51 -0.45 DOWN
## ENSMUSG00000041920 -0.70 -0.66 DOWN
## ENSMUSG00000042487 -0.49 -0.42 DOWN
## ENSMUSG00000043091 -0.43 -0.35 DOWN
## ENSMUSG00000043681 -0.81 -0.73 DOWN
## ENSMUSG00000045136 -0.53 -0.44 DOWN
## ENSMUSG00000046070 -0.50 -0.42 DOWN
## ENSMUSG00000050097 -0.69 -0.66 DOWN
## ENSMUSG00000052562 -1.20 -1.12 DOWN
## ENSMUSG00000053303 -1.45 -1.36 DOWN
## ENSMUSG00000054520 -0.53 -0.55 DOWN
## ENSMUSG00000054986 -1.13 -1.04 DOWN
## ENSMUSG00000055116 -1.04 -0.97 DOWN
## ENSMUSG00000056749 -1.03 -0.95 DOWN
## ENSMUSG00000058258 -0.50 -0.47 DOWN
## ENSMUSG00000058672 -0.49 -0.41 DOWN
## ENSMUSG00000059743 -0.42 -0.35 DOWN
## ENSMUSG00000063694 -0.35 -0.30 DOWN
## ENSMUSG00000070985 -0.49 -0.38 DOWN
## ENSMUSG00000074261 -0.37 -0.30 DOWN
## ENSMUSG00000074715 -2.05 -1.99 DOWN
## ENSMUSG00000090236 -0.36 -0.33 DOWN
## ENSMUSG00000090264 -0.85 -0.79 DOWN
## ENSMUSG00000001280 0.17 0.26 UP
## ENSMUSG00000002265 0.34 0.39 UP
## ENSMUSG00000002346 0.41 0.49 UP
## ENSMUSG00000003477 0.62 0.70 UP
## ENSMUSG00000003849 0.33 0.39 UP
## ENSMUSG00000004105 0.31 0.37 UP
## ENSMUSG00000005034 0.15 0.23 UP
## ENSMUSG00000005483 0.36 0.42 UP
## ENSMUSG00000006127 0.24 0.28 UP
## ENSMUSG00000006216 0.22 0.34 UP
## ENSMUSG00000006269 0.19 0.26 UP
## ENSMUSG00000006333 0.19 0.26 UP
## ENSMUSG00000007872 0.82 0.89 UP
## ENSMUSG00000008682 0.24 0.30 UP
## ENSMUSG00000009927 0.16 0.23 UP
## ENSMUSG00000012848 0.22 0.28 UP
## ENSMUSG00000015656 0.36 0.39 UP
## ENSMUSG00000015957 1.15 1.41 UP
## ENSMUSG00000018900 0.45 0.52 UP
## ENSMUSG00000020372 0.20 0.28 UP
## ENSMUSG00000020427 0.64 0.71 UP
## ENSMUSG00000020473 0.26 0.33 UP
## ENSMUSG00000020482 0.34 0.41 UP
## ENSMUSG00000020607 0.38 0.45 UP
## ENSMUSG00000020653 0.81 0.89 UP
## ENSMUSG00000020889 0.74 0.79 UP
## ENSMUSG00000021482 0.22 0.27 UP
## ENSMUSG00000021508 0.75 0.81 UP
## ENSMUSG00000021775 0.84 0.95 UP
## ENSMUSG00000022122 0.42 0.47 UP
## ENSMUSG00000022389 0.65 0.71 UP
## ENSMUSG00000022949 0.72 0.80 UP
## ENSMUSG00000023022 0.23 0.27 UP
## ENSMUSG00000024298 0.22 0.32 UP
## ENSMUSG00000024900 0.38 0.45 UP
## ENSMUSG00000025019 0.26 0.30 UP
## ENSMUSG00000025197 0.28 0.36 UP
## ENSMUSG00000025511 0.44 0.51 UP
## ENSMUSG00000025764 0.23 0.30 UP
## ENSMUSG00000025815 0.59 0.44 UP
## ENSMUSG00000026313 0.15 0.32 UP
## ENSMUSG00000027314 0.59 0.64 UP
## ENSMUSG00000027796 0.72 0.79 UP
## ENSMUSG00000027875 1.68 1.72 UP
## ENSMUSG00000028081 0.23 0.30 UP
## ENSMUSG00000028957 1.23 1.26 UP
## ENSMUSG00000029587 0.24 0.39 UP
## ENSMUSG00000029714 0.20 0.34 UP
## ENSMUSG00000030201 0.26 0.30 UP
## ENSMUSG00000030256 0.98 1.19 UP
## ENSMUSG00000031167 0.53 0.61 UP
## ENSMUSG00000031320 0.26 0.32 UP
## ENSMUSG00000032097 0.20 0.26 UP
## ENSMUSG00000032594 0.18 0.29 UP
## ENSMUSG00000032624 0.22 0.28 UP
## ENSMUSG00000033327 0.43 0.50 UP
## ENSMUSG00000033350 0.37 0.43 UP
## ENSMUSG00000033411 0.20 0.36 UP
## ENSMUSG00000034111 0.21 0.28 UP
## ENSMUSG00000034450 1.88 1.93 UP
## ENSMUSG00000034460 0.32 0.38 UP
## ENSMUSG00000035469 0.19 0.29 UP
## ENSMUSG00000035530 0.20 0.27 UP
## ENSMUSG00000035614 0.18 0.32 UP
## ENSMUSG00000037172 0.22 0.28 UP
## ENSMUSG00000037465 0.58 0.61 UP
## ENSMUSG00000037523 0.18 0.24 UP
## ENSMUSG00000037621 0.52 0.59 UP
## ENSMUSG00000038393 0.68 0.77 UP
## ENSMUSG00000039108 0.16 0.26 UP
## ENSMUSG00000039783 0.37 0.45 UP
## ENSMUSG00000039789 0.23 0.39 UP
## ENSMUSG00000039831 0.19 0.32 UP
## ENSMUSG00000040363 0.20 0.29 UP
## ENSMUSG00000040423 0.21 0.30 UP
## ENSMUSG00000040584 0.52 0.59 UP
## ENSMUSG00000040740 0.58 0.65 UP
## ENSMUSG00000041075 0.28 0.35 UP
## ENSMUSG00000041351 0.41 0.46 UP
## ENSMUSG00000041841 0.25 0.30 UP
## ENSMUSG00000042046 0.20 0.26 UP
## ENSMUSG00000042659 0.36 0.45 UP
## ENSMUSG00000042745 0.46 0.53 UP
## ENSMUSG00000043144 0.60 0.67 UP
## ENSMUSG00000044026 0.23 0.30 UP
## ENSMUSG00000045382 0.58 0.69 UP
## ENSMUSG00000045441 0.34 0.41 UP
## ENSMUSG00000045519 0.37 0.53 UP
## ENSMUSG00000047215 0.21 0.28 UP
## ENSMUSG00000047878 0.25 0.35 UP
## ENSMUSG00000048826 0.26 0.34 UP
## ENSMUSG00000049241 0.55 0.61 UP
## ENSMUSG00000050100 0.73 0.76 UP
## ENSMUSG00000053411 0.35 0.44 UP
## ENSMUSG00000053964 0.64 0.82 UP
## ENSMUSG00000054499 0.26 0.32 UP
## ENSMUSG00000054793 0.22 0.29 UP
## ENSMUSG00000055866 0.81 0.77 UP
## ENSMUSG00000055980 0.27 0.34 UP
## ENSMUSG00000056851 0.18 0.27 UP
## ENSMUSG00000058056 0.34 0.41 UP
## ENSMUSG00000058600 0.19 0.30 UP
## ENSMUSG00000058655 0.26 0.33 UP
## ENSMUSG00000059824 2.26 2.26 UP
## ENSMUSG00000061143 0.28 0.34 UP
## ENSMUSG00000061353 0.55 0.60 UP
## ENSMUSG00000061477 0.16 0.23 UP
## ENSMUSG00000062563 0.29 0.35 UP
## ENSMUSG00000063681 0.66 0.76 UP
## ENSMUSG00000064065 0.26 0.31 UP
## ENSMUSG00000067586 0.36 0.42 UP
## ENSMUSG00000068742 0.45 0.48 UP
## ENSMUSG00000069495 0.18 0.28 UP
## ENSMUSG00000070348 0.32 0.39 UP
## ENSMUSG00000071415 0.19 0.26 UP
## ENSMUSG00000074063 0.48 0.55 UP
## ENSMUSG00000074578 0.32 0.45 UP
## ENSMUSG00000086583 0.42 0.48 UP
## ENSMUSG00000098557 0.28 0.34 UP
## ENSMUSG00000106847 0.29 0.36 UP
## ENSMUSG00000110185 0.26 0.33 UP
## ENSMUSG00000110195 0.32 0.33 UP
library(msa)
seq <- readAAStringSet("hglobin.fa")
seq
## AAStringSet object of length 3:
## width seq names
## [1] 142 MVLSPADKTNVKAAWGKVGAHAG...PAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] 142 MVLSGEDKSNIKAAWGKIGGHGA...PAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] 142 MSLTRTERTIILSLWSKISTQAD...ADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
Lets align the 8 different amino acids sequences
alignments <- msa(seq, method ="ClustalW")
## use default substitution matrix
alignments
## CLUSTAL 2.1
##
## Call:
## msa(seq, method = "ClustalW")
##
## MsaAAMultipleAlignment with 3 rows and 142 columns
## aln names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] MSLTRTERTIILSLWSKISTQADVIG...FTADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
## Con MVLS??DKTNIKAAWGKIG?HA?EYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus
msaPrettyPrint(alignments, output = "pdf", showNames = "left",
showLogo = "none", askForOverwrite = FALSE, verbose = TRUE, file = "whole_align.pdf")
## Multiple alignment written to temporary file /tmp/RtmpUuEIIc/seq1f5c3da25.fasta
## File whole_align.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file whole_align.pdf created
msaPrettyPrint(alignments, c(10,30), output = "pdf", showNames = "left",
file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
verbose = TRUE)
## Multiple alignment written to temporary file /tmp/RtmpUuEIIc/seq1f5c1b9f2ca0.fasta
## File Zoomed_align.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file Zoomed_align.pdf created
Lets make a tree from our alignment
library(ape)
##
## Attaching package: 'ape'
## The following object is masked from 'package:imager':
##
## where
## The following object is masked from 'package:Biostrings':
##
## complement
## The following object is masked from 'package:dplyr':
##
## where
library(seqinr)
##
## Attaching package: 'seqinr'
## The following objects are masked from 'package:ape':
##
## as.alignment, consensus
## The following object is masked from 'package:matrixStats':
##
## count
## The following object is masked from 'package:Biostrings':
##
## translate
## The following object is masked from 'package:limma':
##
## zscore
## The following object is masked from 'package:dplyr':
##
## count
Convert to seqnir alignment -> get the distances and make a tree
alignment_seqinr <- msaConvert(alignment, type = "seqinr::alignment")
distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")
tree <- ape::nj(distances1)
plot(tree, main = "Phylogenetic Tree of HBA Sequence")
library(DECIPHER)
## Loading required package: RSQLite
## Loading required package: parallel
In the first step, we load the libraries and the sequence into long_seqs. This is a DNAStringSet
long_seq <- readDNAStringSet(file.path(getwd(), "plastid_genomes.fa"))
long_seq
## DNAStringSet object of length 5:
## width seq names
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT NC_018523.1 Sacch...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA NC_022431.1 Ascle...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC NC_022259.1 Nanno...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC NC_022417.1 Cocos...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT NC_022459.1 Camel...
Now lets build a temporary SOLite Database
Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
## Adding 5 sequences to the database.
##
## Added 5 new sequences to table Seqs.
## 100 total sequences in table Seqs.
## Time difference of 0.23 secs
Now that we have built the database, we can do the following
Find the syntenic blocks
synteny <- FindSynteny("long_db")
## ================================================================================
##
## Time difference of 14 secs
View blocks with plotting
pairs(synteny)
plot(synteny)
Lets make an actual alignment file
alignment <- AlignSynteny(synteny, "long_db")
## ================================================================================
##
## Time difference of 126 secs
Lets create a structure all aligned sytenic blocks for a pair of sequences
block <- unlist(alignment[[1]])
We can write to file one alignment at a time
writeXStringSet(block, "genome_block_out.fa")
library(locfit)
## locfit 1.5-9.8 2023-06-11
##
## Attaching package: 'locfit'
## The following object is masked from 'package:purrr':
##
## none
library(Rsamtools)
Lets create a function that will load the gene region information in a GFF file and convert it to a bioconductor GRanges object
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Get count in windows across the genome in 500bp segments
whole_genome <- csaw::windowCounts(
file.path(getwd(), "windows.bam"),
bin = TRUE,
filter = 0,
width = 500,
param = csaw::readParam(
minq = 20,
dedup = TRUE,
pe = "both"
)
)
Since this is a single column of data, lets rename it
colnames(whole_genome) <-c("small_data")
annotated_regions <- get_annotated_regions_from_gff(file.path(getwd(), "genes.gff"))
Now that we have the windows of high expression, we want to see if any of them overlap with annotated regions
library(IRanges)
library(SummarizedExperiment)
windows_in_genes <- IRanges::overlapsAny(
SummarizedExperiment::rowRanges(whole_genome),
annotated_regions
)
windows_in_genes
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
Here we subset the whole_genome object into annotated and nonannotated regions
annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]
Use assay () to extract the actual counts from the Granges object
assay(non_annotated_window_counts)
## small_data
## [1,] 0
## [2,] 31
## [3,] 25
## [4,] 0
## [5,] 0
## [6,] 24
## [7,] 25
## [8,] 0
## [9,] 0
In this step, we use Rsamtools Pileup() function to get a per base coverage metadata, each row represents a single nucleotide in the reference count and the count column gives the depth of coverage at that point
library(bumphunter)
## Loading required package: foreach
## Parallel computing support for 'oligo/crlmm': Disabled
## - Load 'ff'
## - Load and register a 'foreach' adaptor
## Example - Using 'multicore' for 2 cores:
## library(doMC)
## registerDoMC(2)
## ================================================================================
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: iterators
pile_df <- Rsamtools::pileup(file.path(getwd(), "windows.bam"))
This step groups the reads with certain distance of each other into a cluster, we give the sequence names, position, and distance
clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap = 100)
table(clusters)
## clusters
## 1 2 3
## 1486 1552 1520
In this step, we will mpa the reads to the regions we found for the genome
bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$post, clusters, cutoff = 1)
## getSegments: segmenting
## getSegments: splitting
## Warning in min(pos[ind[Index]]): no non-missing arguments to min; returning Inf
## Warning in min(pos[ind[Index]]): no non-missing arguments to min; returning Inf
## Warning in min(pos[ind[Index]]): no non-missing arguments to min; returning Inf
## Warning in max(pos[ind[Index]]): no non-missing arguments to max; returning
## -Inf
## Warning in max(pos[ind[Index]]): no non-missing arguments to max; returning
## -Inf
## Warning in max(pos[ind[Index]]): no non-missing arguments to max; returning
## -Inf
## chr start end value area cluster indexStart indexEnd L clusterL
## 3 Chr1 Inf -Inf 10.4 15811 3 3039 4558 1520 1520
## 1 Chr1 Inf -Inf 10.0 14839 1 1 1486 1486 1486
## 2 Chr1 Inf -Inf 8.7 13436 2 1487 3038 1552 1552
Lets get the required packages
library(ggplot2)
library(ggtree)
## ggtree v3.10.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
##
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
##
## Shuangbin Xu, Lin Li, Xiao Luo, Meijun Chen, Wenli Tang, Li Zhan, Zehan
## Dai, Tommy T. Lam, Yi Guan, Guangchuang Yu. Ggtree: A serialized data
## object for visualization of a phylogenetic tree and annotation data.
## iMeta 2022, 1(4):e56. doi:10.1002/imt2.56
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:ape':
##
## rotate
## The following object is masked from 'package:magrittr':
##
## inset
## The following object is masked from 'package:reshape':
##
## expand
## The following object is masked from 'package:Biostrings':
##
## collapse
## The following object is masked from 'package:IRanges':
##
## collapse
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:tidyr':
##
## expand
library(treeio)
## treeio v1.26.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
##
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
##
## Guangchuang Yu. Data Integration, Manipulation and Visualization of
## Phylogenetic Trees (1st edition). Chapman and Hall/CRC. 2022,
## doi:10.1201/9781003279242
##
## Attaching package: 'treeio'
## The following object is masked from 'package:seqinr':
##
## read.fasta
## The following object is masked from 'package:Biostrings':
##
## mask
library(BAMMtools)
First we need to load our raw tree data, It is a Newick Format so we use:
itol <- ape::read.tree("itol.nwk")
Now we will print out a very basic phylogenetic tree
ggtree(itol)
We can also change the format to make it a circular tree
ggtree(itol, layout = "circular")
We can also change the left-right/ up-down direction
ggtree(itol) + coord_flip() + scale_x_reverse()
By using geom_tiplab() we can add names to the end of the tips
ggtree(itol) + geom_tiplab(color = "blue", size = 2)
By adding a geom_strip layer we can annotate clades in the tree with a block of color
ggtree(itol, layout = "unrooted") + geom_strip(13,14, color = "blue", barsize = 1)
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
We can highlight clades in unrooted trees with blobs of color using geom_highlight
ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "cyan")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
We can use the MRCA (most recent common ancestor) function to find the node we want
getmrca(itol, "Photohabdus_luminescens", "Blochmannia_floridanus")
Now if we want to highlight the section of the most recent common ancestor between the two
ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encirlce", fill = "yellow")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
First lets load the required packages
library(ape)
library(adegraphics)
## Registered S3 methods overwritten by 'adegraphics':
## method from
## biplot.dudi ade4
## kplot.foucart ade4
## kplot.mcoa ade4
## kplot.mfa ade4
## kplot.pta ade4
## kplot.sepan ade4
## kplot.statis ade4
## scatter.coa ade4
## scatter.dudi ade4
## scatter.nipals ade4
## scatter.pco ade4
## score.acm ade4
## score.mix ade4
## score.pca ade4
## screeplot.dudi ade4
##
## Attaching package: 'adegraphics'
## The following object is masked from 'package:ape':
##
## zoom
## The following object is masked from 'package:GenomicRanges':
##
## score
## The following object is masked from 'package:Biostrings':
##
## score
## The following object is masked from 'package:BiocGenerics':
##
## score
library(treespace)
## Loading required package: ade4
##
## Attaching package: 'ade4'
## The following objects are masked from 'package:adegraphics':
##
## kplotsepan.coa, s.arrow, s.class, s.corcircle, s.distri, s.image,
## s.label, s.logo, s.match, s.traject, s.value, table.value,
## triangle.class
## The following object is masked from 'package:GenomicRanges':
##
## score
## The following object is masked from 'package:Biostrings':
##
## score
## The following object is masked from 'package:BiocGenerics':
##
## score
Now we need to load all the trendlines into a multiPhlyo object
tree_files <- list.files(file.path(getwd(),"genetrees"), full.names = TRUE)
tree_list <- lapply(tree_files, read.tree)
class(tree_list)
class(tree_list) <- "multiPhylo"
class(tree_list)
Now we can compute the Kendall-coljin disatnces between treees. This functiondoes a Lot of analysis
Particularly Suitable for rooted trees as we have here. The option NF tells us how comparisons <- treespace(tree_list, nf = 3)
We can plot the painwise distances between trees with table.image
table.image(comparisons$D, nclass = 25)
Now Lets Print the PCA and clusters, this shows us how the group of trees cluster
plotGroves(comparisons$pco, lab.show = TRUE, lab.cox = 1.5)
groves <- findGroves(comparisons, nclust = 4)
plotGroves(groves)
##Extracting and wokring with subtrees using APE
Load our required packages
library(ape)
Now lets load the tree data we will be working with
newick <- read.tree("mammal_tree.nwk")
l <- subtrees(newick)
Lets plot the tree to see what it looks like
plot(newick)
We can subset this plot using the “node” function
plot(l[[4]], sub = "Node 4")
Extract the tree manbually
small_tree <- extract.clade(newick, 9)
plot(small_tree)
Now what if we want to bind two trees together
new_tree <-bind.tree(newick, small_tree, 3)
plot(new_tree)
Lets load the packages
library(Biostrings)
library(msa)
library(phangorn)
First we’ll load the sequences into a seqs variable
seqs <- readAAStringSet("abc.fa")
Now lets construct an alignment with the msa package and ClustalOmega
aln <- msa::msa(seqs, method = c("ClustalOmega"))
## using Gonnet
To create a tree, we need to convert the alignment to a phyData objects
aln <- as.phyDat(aln, type = "AA")
class(aln)
## [1] "phyDat"
In this step, we’ll actually make the trees, Trees are made from a distance matrix, which can be computed with dsit.ml() - ML stands for maximum likehood
dist_mat <- dist.ml(aln)
Now we pass the distance matrix to upgma to make a UPGMA tree
upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main = "UPGMA")
We can conversely pass the distance matrix to a neighbor joining function
nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")
Lastly, we are going to use the bootstraps.phyDat() function to compute bootstrap swupport for the branches of the tree. The first argument is the object (aln), while the second argument in the ufnction nj
Bootstraps are essentially confidence intervals for how the clade is placed in the correct position
fit <- pml(nj_tree, aln)
bootstraps <- bootstrap.phyDat(aln, FUN = function(x) {NJ(dist.ml(x))}, bs = 100)
plotBS(nj_tree, bootstraps, p = 10 )
First lets load the required libraries
library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
##
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:stringr':
##
## fixed
## The following object is masked from 'package:base':
##
## tabulate
library(VariantTools)
Now we want to load our data sets. We need .Bam and .fa files for this pipeline to work
bam_file <- file.path(getwd(), "hg17_snps.bam")
fasta_file <- file.path(getwd(),"chr17_83k.fa")
Now we need to set up the genome object and the parameter objects:
fa <- rtracklayer::FastaFile(fasta_file)
Now we create a GMapGenome object, which describes the genome to the latervariant calling function
genome <- gmapR::GmapGenome(fa, create = TRUE)
## NOTE: genome 'chr17_83k' already exists, not overwriting
This next step sets our paramters for what is considered a variant. There can be a lot of fine-tuning to this function we are just going to use the standard settings
qual_params <- TallyVariantsParam(
genome = genome,
minimum_mapq = 20)
var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)
Now we can use callvariants function in accordance to our parameters we defined above
called_variants <- callVariants(bam_file,
qual_params,
calling.filter = var_params)
head(called_variants)
## VRanges object with 6 ranges and 17 metadata columns:
## seqnames ranges strand ref alt totalDepth
## <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
## [1] NC_000017.10 64 * G T 759
## [2] NC_000017.10 69 * G T 812
## [3] NC_000017.10 70 * G T 818
## [4] NC_000017.10 73 * T A 814
## [5] NC_000017.10 77 * T A 802
## [6] NC_000017.10 78 * G T 798
## refDepth altDepth sampleNames softFilterMatrix | n.read.pos
## <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <integer>
## [1] 739 20 <NA> | 17
## [2] 790 22 <NA> | 19
## [3] 796 22 <NA> | 20
## [4] 795 19 <NA> | 13
## [5] 780 22 <NA> | 19
## [6] 777 21 <NA> | 17
## n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
## <integer> <integer> <integer> <integer> <integer>
## [1] 64 759 20 739 0
## [2] 69 812 22 790 0
## [3] 70 818 22 796 0
## [4] 70 814 19 795 0
## [5] 70 802 22 780 0
## [6] 70 798 21 777 0
## count.minus.ref count.del.plus count.del.minus read.pos.mean
## <integer> <integer> <integer> <numeric>
## [1] 0 0 0 30.9000
## [2] 0 0 0 40.7273
## [3] 0 0 0 34.7727
## [4] 0 0 0 36.1579
## [5] 0 0 0 38.3636
## [6] 0 0 0 39.7143
## read.pos.mean.ref read.pos.var read.pos.var.ref mdfne mdfne.ref
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 32.8755 318.558 347.804 NA NA
## [2] 35.4190 377.004 398.876 NA NA
## [3] 36.3442 497.762 402.360 NA NA
## [4] 36.2176 519.551 402.843 NA NA
## [5] 36.0064 472.327 397.070 NA NA
## [6] 35.9241 609.076 390.463 NA NA
## count.high.nm count.high.nm.ref
## <integer> <integer>
## [1] 20 738
## [2] 22 789
## [3] 22 796
## [4] 19 769
## [5] 22 780
## [6] 21 777
## -------
## seqinfo: 1 sequence from chr17_83k genome
## hardFilters(4): nonRef nonNRef readCount likelihoodRatio
We have identified 6 variants
Now we move on to annotation and load the feature position information
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Note you can also load the data from a bed file
genes <- get_annotated_regions_from_gff("chr17.83k.gff3")
Now we can calculate which variants overlap which genes
overlaps <- GenomicRanges::findOverlaps(called_variants, genes)
overlaps
## Hits object with 12684 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 35176 1
## [2] 35176 2
## [3] 35176 3
## [4] 35177 1
## [5] 35177 2
## ... ... ...
## [12680] 40944 2
## [12681] 40944 7
## [12682] 40945 1
## [12683] 40945 2
## [12684] 40945 7
## -------
## queryLength: 44949 / subjectLength: 8
genes[subjectHits(overlaps)]
## GRanges object with 12684 ranges and 20 metadata columns:
## seqnames ranges strand | source type score
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
## [1] NC_000017.10 64099-76866 - | havana ncRNA_gene NA
## [2] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## [3] NC_000017.10 64099-65736 - | havana exon NA
## [4] NC_000017.10 64099-76866 - | havana ncRNA_gene NA
## [5] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## ... ... ... ... . ... ... ...
## [12680] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## [12681] NC_000017.10 76723-76866 - | havana exon NA
## [12682] NC_000017.10 64099-76866 - | havana ncRNA_gene NA
## [12683] NC_000017.10 64099-76866 - | havana lnc_RNA NA
## [12684] NC_000017.10 76723-76866 - | havana exon NA
## phase ID Name biotype
## <integer> <character> <character> <character>
## [1] <NA> gene:ENSG00000280279 AC240565.2 lincRNA
## [2] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## [3] <NA> <NA> ENSE00003759547 <NA>
## [4] <NA> gene:ENSG00000280279 AC240565.2 lincRNA
## [5] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## ... ... ... ... ...
## [12680] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## [12681] <NA> <NA> ENSE00003756684 <NA>
## [12682] <NA> gene:ENSG00000280279 AC240565.2 lincRNA
## [12683] <NA> transcript:ENST00000.. AC240565.2-201 lincRNA
## [12684] <NA> <NA> ENSE00003756684 <NA>
## description gene_id logic_name version
## <character> <character> <character> <character>
## [1] novel transcript ENSG00000280279 havana 1
## [2] <NA> <NA> <NA> 1
## [3] <NA> <NA> <NA> 1
## [4] novel transcript ENSG00000280279 havana 1
## [5] <NA> <NA> <NA> 1
## ... ... ... ... ...
## [12680] <NA> <NA> <NA> 1
## [12681] <NA> <NA> <NA> 1
## [12682] novel transcript ENSG00000280279 havana 1
## [12683] <NA> <NA> <NA> 1
## [12684] <NA> <NA> <NA> 1
## Parent tag transcript_id
## <CharacterList> <character> <character>
## [1] <NA> <NA>
## [2] gene:ENSG00000280279 basic ENST00000623180
## [3] transcript:ENST00000.. <NA> <NA>
## [4] <NA> <NA>
## [5] gene:ENSG00000280279 basic ENST00000623180
## ... ... ... ...
## [12680] gene:ENSG00000280279 basic ENST00000623180
## [12681] transcript:ENST00000.. <NA> <NA>
## [12682] <NA> <NA>
## [12683] gene:ENSG00000280279 basic ENST00000623180
## [12684] transcript:ENST00000.. <NA> <NA>
## transcript_support_level constitutive ensembl_end_phase ensembl_phase
## <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA>
## [2] 5 <NA> <NA> <NA>
## [3] <NA> 1 -1 -1
## [4] <NA> <NA> <NA> <NA>
## [5] 5 <NA> <NA> <NA>
## ... ... ... ... ...
## [12680] 5 <NA> <NA> <NA>
## [12681] <NA> 1 -1 -1
## [12682] <NA> <NA> <NA> <NA>
## [12683] 5 <NA> <NA> <NA>
## [12684] <NA> 1 -1 -1
## exon_id rank
## <character> <character>
## [1] <NA> <NA>
## [2] <NA> <NA>
## [3] ENSE00003759547 5
## [4] <NA> <NA>
## [5] <NA> <NA>
## ... ... ...
## [12680] <NA> <NA>
## [12681] ENSE00003756684 1
## [12682] <NA> <NA>
## [12683] <NA> <NA>
## [12684] ENSE00003756684 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
First thing, lets load the required packages
library(Biostrings)
library(systemPipeR)
## Loading required package: ShortRead
## Loading required package: BiocParallel
## Loading required package: GenomicAlignments
##
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:dplyr':
##
## last
##
## Attaching package: 'ShortRead'
## The following object is masked from 'package:adegraphics':
##
## zoom
## The following objects are masked from 'package:locfit':
##
## left, right
## The following object is masked from 'package:ape':
##
## zoom
## The following object is masked from 'package:imager':
##
## clean
## The following object is masked from 'package:magrittr':
##
## functions
## The following object is masked from 'package:oligo':
##
## intensity
## The following objects are masked from 'package:oligoClasses':
##
## chromosome, position
## The following object is masked from 'package:affy':
##
## intensity
## The following object is masked from 'package:dplyr':
##
## id
## The following object is masked from 'package:purrr':
##
## compose
## The following object is masked from 'package:tibble':
##
## view
##
## Attaching package: 'systemPipeR'
## The following object is masked from 'package:VariantAnnotation':
##
## reference
## The following object is masked from 'package:DESeq2':
##
## results
Lets load the data into a DNAStrings, in this case, an Arabidopsis chloroplast genome
dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa")
Now lets predict the open reading frames with predORF(), we’ll predict all ORF on both strands
predict_orfs <- predORF(dna_object, n = 'all', type = 'gr', mode = 'ORF', strand = 'both',
longest_disjoint = TRUE)
predict_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 1162 chloroplast 2056-2532 - | 1 3
## 2 chloroplast 72371-73897 + | 2 2
## 1163 chloroplast 77901-78362 - | 2 1
## 3 chloroplast 54937-56397 + | 3 3
## ... ... ... ... . ... ...
## 2497 chloroplast 129757-129762 - | 1336 3
## 2498 chloroplast 139258-139263 - | 1337 3
## 2499 chloroplast 140026-140031 - | 1338 3
## 2500 chloroplast 143947-143952 - | 1339 3
## 2501 chloroplast 153619-153624 - | 1340 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
This printed out a GRanges object in return, with 2, 501 open reading frames. This is FAR too many open reading frames
To filter out erroneous ORF’s, we do s dimulation. We first estimate the length an ORF can reach by chance. We will create a string of random mucleotides that is the length of our chloroplast genome and determine the longer ORF that can arise by chance
bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object),"")
Now we need tob ensure that our random nucleotides match the proportion of nucleotides in our chloroplast genome so we have no bias.
seq_length <- width(dna_object[1])
counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))})
probs <- unlist(lapply(counts, function(base_count){signif(base_count/seq_length, 2)}))
probs
## [1] 6.5e-06 6.5e-06 6.5e-06 6.5e-06
Now we can build our function to simulate a genome
{r]} get_longest_orf_in_random_genome <- function(x, length = 1000, probs = c(0.25, 0.25, 0.25, 0.25), bases = c("A","T","G","C")){}
Here we create our random genome and allow replacement for the next iteration
random_genome <- paste0(sample(bases, size = 1000, replace = TRUE, prob = probs), collapse = "")
random_dna_object <- DNAStringSet(random_genome)
names (random_dna_object) <- c("random_dna_string")
orfs <- predORF(random_dna_object, n =1, type = 'gr', mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
return(max(width(orfs)))
Now we can use the function we just created to predict the ORFs in 10 random genomes
random_length <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length = seq_length, probs, bases = bases))
Lets pull out the longest length from our 10 simulation
longest_random_orf <- max(random_length)
Lets only keep the frames that are longer than our chloroplast GENOME
keep <- width(predict_orfs) > longest_random_orf
orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
Write this data to file
extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")
First lets load the required packages
library(karyoploteR)
## Loading required package: regioneR
library(GenomicRanges)
Now we need to set up the genome object for our
genome_df <- data.frame(
chr = paste0("chr", 1:5),
start = rep(1, 5),
end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)
Now we convert the dataframe to a granges object
genome_gr <- makeGRangesFromDataFrame(genome_df)
Now lets create some snp positions to map out. We do this by using the sample() function
snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
chr = paste0("chr", sample(1:5,25, replace = TRUE)),
state = snp_pos,
end = snp_pos
)
Again we convert the dataframe to groups
snps_gr <- makeGRangesListFromDataFrame(snps)
Lets create some snp laBELS
snp_labels <- paste0("snp_", 1:25)
Now we will set the margins for our plot
plot.params <- getDefaultPlotParams(plot.type =1)
Now lets plot our snps
kp <- plotKaryotype(genome_gr, plot.type= 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
We can also add some numeric data to our plots. We will generate 100 radnom numbers that plot to 100 windoows on chromosome 1
numeric_data <- data.frame(
y = rnorm(100, mean = 1, sd = 0.5),
chr = rep("chr4", 100),
start = seq(1,20862711, 20862711/100),
end = seq(1,20862711, 20862711/100)
)
Now lets make the data granges object
numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)
Again lets set our plot paramters
plot.params <- getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800
Lets plot out our data
kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)